Benchmarking A 10-Core Tyan/IBM POWER Server For ~$300 USD


  • #71
    Originally posted by baryluk View Post
In the service manual I found that it operates better with 8 DIMMs. With 16 DIMMs, it will operate at a lower 1066MHz instead of 1333MHz on each Centaur channel, but the connection from the Centaurs to the CPU will also be slower, so the total available peak bandwidth will be lower as well. Similarly, 32GB DIMMs (which are basically two DIMMs on one stick and operate as quad rank) will only operate at 1066MHz, and possibly even lower if you have two of each on each channel.
    No - memory frequency and bandwidth only start to drop if you start populating memory channels with 32GB DIMMs or if you use all 32 slots, if I remember correctly. The IBM manual for the IBM S812LC server (which is the same system, just rebranded) is a bit clearer on this issue: For optimal performance IBM recommends 16 DIMMs.
    I have also verified this empirically; with 8 DIMMs I got up to 70 GB/s sustained stream bandwidth, while I got up to 100 GB/s with 16.

    Originally posted by baryluk View Post
    So, I was looking at this module: M393B1K70DH0-YH9 ( https://www.samsung.com/semiconducto...3B1K70DH0-YH9/ , https://www.samsung.com/semiconducto...mm_rev12-2.pdf ) , which is available both from piospartslap cheaply (but a low quantity), and my local ebay-like site at good prices and 50+ quantities.

However, according to those sites and Samsung, this module is 2Rx4, which is not on the list. And I guess it is a bit older than the manual and is probably based on 2Gbit chips?
    I'm using 16 M393B1K70CH0-YH9 (2Rx4) which work fine.

    Comment


    • #72
      Hi,

does anybody know what the latest version of the BMC firmware is and where to get it for this machine? Currently there seems to be AMI BMC firmware installed by default. Is that also used with the IBM S812LC server? Is an open-source BMC implementation perhaps available?

      regards,
      Matevz

      Comment


      • #73
There's no open-source BMC for this machine. If you google for it, you'll find a post on OpenBMC's GitHub saying they do not support Habanero.

        Comment


        • #74
I have ordered one, and it is now sitting in the Belgrade customs office; the tracking message on eBay says they lack import documentation and can't contact the receiver (me).

          Bummer.

          Comment


          • #75
            Originally posted by illuhad View Post
            I have also verified this empirically; with 8 DIMMs I got up to 70 GB/s sustained stream bandwidth, while I got up to 100 GB/s with 16.


            I'm using 16 M393B1K70CH0-YH9 (2Rx4) which work fine.
I'm using M393B1K70DH0-YH9, 8 sticks of them to utilize all Centaurs, but I'm still only able to get around 20-25GB/s for single-threaded stream and 41-46GB/s for multi-threaded. I've updated the FW to 1.0.1 -- the latest Tyan provides -- so if everything is as expected, this should run the RAM in interleave mode. Upgrading from 4 to 8 sticks resulted in a 2x performance increase. Updating the FW from 1.0 to 1.0.1 does not show any performance increase.
So I'm curious what your problem size in stream is? My array size is 80 million and the total memory required is 1831MB. I remember I needed to increase it to get past the CPU caches and measure real RAM bandwidth... So I'm curious whether you are using the default size and measuring CPU <-> Centaur cache bandwidth instead? Thanks! Karel

            Comment


            • #76
              Originally posted by kgardas View Post
              I'm using M393B1K70DH0-YH9. I'm using 8 sticks of those to utilize all Centaurs but still I'm able to just get around 20-25GB/s for single-threaded stream and 41-46GB/s for multi-threaded. I've updated FW to 1.0.1 -- latest Tyan provided so if everything is as expected, then this should run RAM in interleave mode. Upgrade from 4 to 8 sticks resulted in 2x the performance increase. Update from 1.0 to 1.0.1 FW does not show any performance increase.
              So I'm curious what's your problem size in stream? My array size is 80mil and total memory required is 1831MB. I remember I needed to increase that to go over CPU caches and measure real RAM bandwidth... So I'm curious if you are using default size and measure CPU <-> Centaurs cache bandwidth instead? Thanks! Karel
              Sorry, I can't really remember the problem size for my 8 DIMM result, as this was just a quick test I did while waiting for the other 8 DIMMs to arrive. I just wanted to put some load on the system and wasn't that interested in the results (since this was not my target configuration anyway), so it's possible that I didn't change the array size. Could also be that I remembered the result incorrectly. Perhaps if I find time I can redo the test with 8 DIMMs this evening.

              I'm quite confident about my 16 DIMM result though, which I have also tested with very large problem sizes. Since with 16 DIMMs, DRAM bandwidth is larger than Centaur<->CPU bandwidth, Centaur caches don't play a big role here anymore. You still need to worry about L3 though.

My 16 DIMM results are also roughly in line with the AnandTech results (91 GB/s).


              All my results are multi-threaded, I've never done single-threaded runs.

Are you using PTS or manual runs? My runs were all manual; if you use PTS, it could also be that PTS doesn't compile it in an optimal way, as was the case for 7-zip.

              A couple of additional points that were important for me to get peak performance:
              • use recent gcc (8.x)
              • set OMP_NUM_THREADS to the number of cores
              • set OMP_PROC_BIND=spread

BTW, I think you can reduce your array size a bit. The official STREAM rules say the array size should be at least 4 times the size of the largest cache, i.e. 4*64MB=256MB for the 8-core and 4*80MB=320MB for the 10-core.

              Comment


              • #77
                Originally posted by illuhad View Post
                [...]
My 16 DIMM results are also roughly in line with the AnandTech results (91 GB/s).


                All my results are multi-threaded, I've never done single-threaded runs.

Are you using PTS or manual runs? My runs were all manual; if you use PTS, it could also be that PTS doesn't compile it in an optimal way, as was the case for 7-zip.

                A couple of additional points that were important for me to get peak performance:
                • use recent gcc (8.x)
                • set OMP_NUM_THREADS to the number of cores
                • set OMP_PROC_BIND=spread

BTW, I think you can reduce your array size a bit. The official STREAM rules say the array size should be at least 4 times the size of the largest cache, i.e. 4*64MB=256MB for the 8-core and 4*80MB=320MB for the 10-core.
I'm always using manual runs since PTS is buggy and is not able to run the stream bench on ppc64le. This is due to a bug where it passes the -march=native parameter, which the compiler rejects as unknown; -mcpu=native is the correct flag here.

Anyway, thanks a lot for all the information provided. I've retested and my findings are:
• gcc 8.2.0 in comparison with 7.3.0 (Ubuntu 18.04 LTS) does not bring that much
• changing -O to -Ofast (as on AnandTech) brings some 2-6 GB/s (on triad); on add, IIRC, it even leads to a worse result
• changing OMP_NUM_THREADS from 64 down to 8 changes everything. With this I'm able to run triad at up to 60GB/s.


                Thanks! Karel

                Comment


                • #78
                  Originally posted by kgardas View Post

I'm always using manual runs since PTS is buggy and is not able to run the stream bench on ppc64le. This is due to a bug where it passes the -march=native parameter, which the compiler rejects as unknown; -mcpu=native is the correct flag here.

Anyway, thanks a lot for all the information provided. I've retested and my findings are:
• gcc 8.2.0 in comparison with 7.3.0 (Ubuntu 18.04 LTS) does not bring that much
• changing -O to -Ofast (as on AnandTech) brings some 2-6 GB/s (on triad); on add, IIRC, it even leads to a worse result
• changing OMP_NUM_THREADS from 64 down to 8 changes everything. With this I'm able to run triad at up to 60GB/s.


                  Thanks! Karel
                  Great to hear that! Have you also tried OMP_PROC_BIND=spread? This also gave me some significant speedups, as it forces the threads to be pinned and distributed over the cores.

                  Comment


                  • #79
My server just arrived. I have only 2 RDIMM modules; can someone tell me if I need 4, or can I try it with 2? I plan to put in 32 RDIMMs once I make sure everything works.

                    Comment


                    • #80
                      Originally posted by gnufreex View Post
My server just arrived. I have only 2 RDIMM modules; can someone tell me if I need 4, or can I try it with 2? I plan to put in 32 RDIMMs once I make sure everything works.
                      Hi,

My understanding of the user guide (p39 and p88) is that you need at least 4 sticks to start with. However, it should not harm to try with only two.

I myself have an issue trying to set up a discrete graphics card in it. I tried a Quadro K600 and a Radeon HD 6450, and in both cases the card is visible with lspci and the kernel module is loaded, but no screen is detected by Xorg. I tried get-edid, which detected only the screen plugged into the VGA output of the motherboard, but none plugged into the discrete cards (tried DVI and DisplayPort).

I noticed that you guys mentioned using a Radeon. Would you have any clue for me, please?

                      Cheers,

                      Comment
