Red Hat Evaluating x86-64-v3 Requirement For RHEL 10

  • #51
    Originally posted by torsionbar28 View Post
    RHEL targets infrastructure servers. Not desktops. I'd wager that something north of 90% of RHEL installations are running in a rack in a datacenter, on a machine with ECC memory. Nobody installs RHEL on grandma's old Packard Bell peecee.
    Still doesn't rule out Atom-branded, embedded server SoCs.

    If Red Hat goes ahead with this, it'll be interesting to see if they have to walk it back because of those machines. Intel doesn't make a whole product line like that without sales volumes in the millions, and they're probably all running either embedded or enterprise Linux.

    Comment


    • #52
      Originally posted by torsionbar28 View Post
      We have racks and racks of Opteron 6300 servers running RHEL 9. These are 24/7 production systems. Our client doesn't want to spend the money on new hardware, CPU utilization on these machines is generally 50% or less, and the client doesn't care about power/cooling costs. Ergo, we continue to run them. FYI Opteron 6300 is x86-64 v2.

      I imagine if RHEL 10 requires v3, it may force them to buy new servers. Although RHEL 9 will be supported for many more years, so I'm sure they will just delay the upgrade to RHEL 10 if that's the case.
      Seriously? 2032? What kind of business cannot save enough money to buy new servers in 10 years?

      Comment


      • #53
        Originally posted by coder View Post
        Still doesn't rule out Atom-branded, embedded server SoCs.

        If Red Hat goes ahead with this, it'll be interesting to see if they have to walk it back because of those machines. Intel doesn't make a whole product line like that without sales volumes in the millions, and they're probably all running either embedded or enterprise Linux.
        If it's really that large of an opportunity for Red Hat, they'll just have a special flavor of RHEL for them. I'd assume Red Hat knows better than any of us how big a market that is for them and whether it's worth the effort to target or not.

        Comment


        • #54
          Originally posted by SophTherapy View Post

          Seriously? 2032? What kind of business cannot save enough money to buy new servers in 10 years?
          I think the point is more that they don't see a need to do so, which becomes less of an argument as time goes on given how inefficient Opteron is.
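
          For anyone wondering whether a given box actually clears the x86-64-v3 bar being discussed, a quick sketch using the GCC/Clang feature builtins; this only probes the headline v3 additions, the full level definition also requires F16C, LZCNT, MOVBE and OSXSAVE, which you'd check via cpuid or a newer compiler that accepts the level names directly:

          /* Rough x86-64-v3 sanity check (sketch, not the official glibc
           * hwcaps probe). Build with: gcc -O2 v3check.c */
          #include <stdio.h>

          int main(void)
          {
              __builtin_cpu_init();

              int looks_like_v3 = __builtin_cpu_supports("avx")
                               && __builtin_cpu_supports("avx2")
                               && __builtin_cpu_supports("bmi")
                               && __builtin_cpu_supports("bmi2")
                               && __builtin_cpu_supports("fma");

              printf("x86-64-v3 core features: %s\n",
                     looks_like_v3 ? "present" : "missing");
              return 0;
          }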

          Comment


          • #55
            Originally posted by qarium View Post
            ...
            Right now the market as I see it is like this: gamers buy the 5800X3D or 7800X3D, prosumers buy the 7950X3D, small servers get the 7950X, workstation people buy Threadripper 7000, server people buy Epyc... in my opinion, no sane person buys Intel.
            ...
            Normal people buy laptops (well, normal people buy phones and begrudgingly buy a cheap laptop when the cat pisses on it). AMD chips are ideal for laptops, but for whatever reason they don't make enough of them; there are orders of magnitude more Intel laptops available. AMD has done well leveraging their console presence to cater to the handheld market and mini PCs, otherwise the insanely efficient 7840U parts might not have sold much at all despite being the sane way to go.

            I'd argue workstation people should buy the 7950X if they can, and consider Epyc if they need the PCIe lanes etc. The Threadripper series is priced for a dozen tech nerds and YouTubers to get hyped over; at that point you might as well go the whole hog and start the server dream.

            Originally posted by qarium View Post
            ...
            My prediction is this: whatever Intel does with AVX10 or 10.1 or 10.2, AMD will just ignore it. And why shouldn't AMD ignore it? They already have the better implementation and also better software support.

            Just get the point: Intel already enforces 3-4 different ISAs across AVX-512 CPUs, AVX10 (256-bit) CPUs, E-core-only CPUs and so on and so on.

            AMD instead will have only one single ISA for all their CPU cores (Zen 5 + Zen 5c); this will give Intel great pain.

            And just keep in mind that Intel already lost the asymmetric CPU scheduler war.

            Why not just ignore Intel? They deserve to be ignored, to be honest.
            I don't know if that's likely. Half the point of Intel trailblazing new ISAs is that they win the benchmarks that focus on them (if no benchmarks exist, they'll create them). A couple of benchmarks that magically don't support AVX-512 but do support 512-bit AVX10, and hey presto, Intel is winning, guys, look! If AVX10.1.2.3+++ can be rolled into the AMD design easily, they probably will do so in a pretty timely manner.
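
            To make the "multiple ISA tiers" pain a bit more concrete, a toy sketch using GCC/Clang function multi-versioning; the clone list is just an illustration, not anyone's actual build setup:

            /* The compiler emits one clone of the loop per listed target and a
             * resolver picks the best one at load time. Every additional vector
             * ISA tier (AVX2, AVX-512, a 256-bit-only AVX10 tier, ...) is one
             * more clone to build, validate and benchmark. */
            #include <stddef.h>

            __attribute__((target_clones("avx512f", "avx2", "default")))
            void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
            {
                for (size_t i = 0; i < n; i++)
                    y[i] += a * x[i];
            }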

            Comment


            • #56
              Originally posted by geerge View Post
              I'd argue workstation people should buy the 7950X if they can, and consider Epyc if they need the PCIe lanes etc. The Threadripper series is priced for a dozen tech nerds and YouTubers to get hyped over; at that point you might as well go the whole hog and start the server dream.
              Well, yes and no: Threadripper still has an advantage over Epyc in that its single-core clocks are higher. Epyc is definitely better for servers, but if you need insanely fast cores while still having a decent number of them, that's what Threadripper is for.

              It is a very weird middle ground: very high (but not the highest) clock speeds plus a decent number of cores and PCIe lanes is typically overkill. However, I can imagine cases where you do need that high core speed, probably with tasks that aren't easily parallelizable while still needing to do a lot of them.
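
              As a back-of-the-envelope illustration (the 70% parallel figure is arbitrary, just to make the arithmetic concrete), Amdahl's law shows why clocks can beat cores on such workloads:

              speedup(N) = 1 / (s + (1 - s)/N), with serial fraction s = 0.3
              16 cores: 1 / (0.3 + 0.7/16) ≈ 2.9x
              64 cores: 1 / (0.3 + 0.7/64) ≈ 3.2x  (4x the cores, only ~10% faster)
              whereas a ~20% higher clock lifts both the serial and parallel parts by ~20%.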

              Comment


              • #57
                Originally posted by debrouxl View Post
                GCC 13 series, -O3, same build flags other than -march=ivybridge -mtune=ivybridge vs. -march=broadwell -mtune=broadwell.
                For "science", it'd be interesting to know their relative performance when both compiled & tuned for ivybridge. It'd be nice to separate out the impacts of implementation differences from the ISA differences.

                Originally posted by debrouxl View Post
                The BDW build uses some FMA3 instructions while the IVB build obviously doesn't; however, a sprinkling of FMA alone isn't supposed to make that huge of a speed difference, is it really?
                I found this:

                "The other major addition to the execution engine is support for Intel's AVX2 instructions, including FMA (Fused Multiply-Add). Ports 0 & 1 now include newly designed 256-bit FMA units. As each FMA operation is effectively two floating point operations, these two units double the peak floating point throughput of Haswell compared to Sandy/Ivy Bridge. A side effect of the FMA units is that you now get two ports worth of FP multiply units, which can be a big boon to legacy FP code.



                Source: https://www.anandtech.com/show/6355/...architecture/8
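
                To put numbers on that "double the peak floating point throughput" claim, per core and per cycle with 256-bit double precision:

                Sandy/Ivy Bridge: 1 x 256-bit ADD + 1 x 256-bit MUL = 4 + 4 = 8 DP FLOPs/cycle
                Haswell/Broadwell: 2 x 256-bit FMA = 2 x 4 x 2 = 16 DP FLOPs/cycle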


                In Broadwell, the few microarchitecture improvements they made include:
                • Faster divider: lower latency & higher throughput
                • AVX multiply latency has decreased from 5 to 3
                • Bigger TLB (1.5k vs 1k entries)
                • Slightly improved branch prediction (as always)
                • Larger scheduler (64 vs 60)
                Source: https://www.anandtech.com/show/10158...e5-v4-review/3

                Also, they reduced & fine-tuned clock-throttling of AVX workloads. I wonder whether/how much these optimizations benefited your FMA case.
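
                For reference, a minimal kernel makes the IVB vs. BDW codegen difference easy to eyeball; the instruction names in the comment are from memory, so double-check your own compiler's output:

                /* gcc -O3 -march=ivybridge : vmulpd + vaddpd (Ivy Bridge has no FMA)
                 * gcc -O3 -march=broadwell : vfmadd-style fused multiply-add
                 * (GCC contracts a*b+c into FMA by default via -ffp-contract=fast) */
                void madd(double *restrict acc, const double *a, const double *b, int n)
                {
                    for (int i = 0; i < n; i++)
                        acc[i] += a[i] * b[i];
                }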

                Originally posted by debrouxl View Post
                There was indeed a bit of false sharing at some point, but no longer. I noticed the poor scaling it caused when testing on a 2S machine for the first time, and fixed it before ever testing on a 4S machine, mainly by using per-thread duplicates and reducing at the end.
                I once wrote a benchmark to try and quantify the effect of false sharing, but it seemed harder to measure than I expected. To really see it, I couldn't just have 2 threads hammering on a single cache line; I actually needed to cycle between multiple cache lines. Ping-ponging between two gave the most dramatic effect, but it still wasn't as bad on Sandy Bridge as I'd have expected.
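
                In case it's useful, a minimal sketch of that kind of probe, assuming a 64-byte line size; this is a toy along the lines described above, not the actual benchmark:

                /* Two threads increment counters that either share one cache line or
                 * sit on separate lines. Build with: gcc -O2 -pthread false_sharing.c */
                #include <pthread.h>
                #include <stdint.h>
                #include <stdio.h>
                #include <time.h>

                #define ITERS 100000000ULL

                static _Alignas(64) struct { volatile uint64_t a, b; } shared_line;
                static _Alignas(64) struct { volatile uint64_t a; char pad[64]; volatile uint64_t b; } padded;

                static void *hammer(void *p)
                {
                    volatile uint64_t *c = p;
                    for (uint64_t i = 0; i < ITERS; i++)
                        (*c)++;
                    return NULL;
                }

                static double run_pair(volatile uint64_t *x, volatile uint64_t *y)
                {
                    struct timespec t0, t1;
                    pthread_t ta, tb;

                    clock_gettime(CLOCK_MONOTONIC, &t0);
                    pthread_create(&ta, NULL, hammer, (void *)x);
                    pthread_create(&tb, NULL, hammer, (void *)y);
                    pthread_join(ta, NULL);
                    pthread_join(tb, NULL);
                    clock_gettime(CLOCK_MONOTONIC, &t1);

                    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
                }

                int main(void)
                {
                    printf("same cache line     : %.2f s\n", run_pair(&shared_line.a, &shared_line.b));
                    printf("separate cache lines: %.2f s\n", run_pair(&padded.a, &padded.b));
                    return 0;
                }
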
                Last edited by coder; 21 January 2024, 04:14 PM.

                Comment


                • #58
                  Originally posted by coder View Post
                  For "science", it'd be interesting to know their relative performance when both compiled & tuned for ivybridge. It'd be nice to separate out the impacts of implementation differences from the ISA differences.
                  I've now been able to test the IVB-tuned binary on the BDW: according to the logs, after 15 minutes of run time, it was a bit less than 7% slower than the BDW-tuned binary built by the same build-system invocation a month ago. However, since it's nearly a month later, with environment changes such as a reboot and a kernel version change, I benchmarked the BDW-tuned binary again. Well, today, it has exactly the same speed as the IVB-tuned binary!

                  I double-checked the results. The binaries are different; they have different sizes to begin with, and anyway, one has FMA instructions and the other doesn't. They're statically linked, which partially shields them from changes in the environment.

                  Comment


                  • #59
                    Originally posted by debrouxl View Post
                    I've now been able to test the IVB-tuned binary on the BDW: according to the logs, after 15 minutes of run time, it was a bit less than 7% slower than the BDW-tuned binary built by the same build-system invocation a month ago. However, since it's nearly a month later, with environment changes such as a reboot and a kernel version change, I benchmarked the BDW-tuned binary again. Well, today, it has exactly the same speed as the IVB-tuned binary!
                    That's very interesting. Thanks for following up!

                    Comment


                    • #60
                      Originally posted by coder View Post
                      That's very interesting. Thanks for following up!
                      I'd say that this means (confirms) that the bottleneck isn't the FP computation, right?

                      Making proper benchmarks on a non-shared, dedicated computer with 6 or more memory channels per socket is obviously a long-standing to-do / wish-list item. However, although their price tags have decreased significantly lately with the advent of SPR platforms, 2S SKX servers remain far more expensive than 2S BDX servers, which makes the purchase much harder to justify.
                      The step above that would be comparing 2S x 32C EPYC Rome @ 1 DPC (e.g. EPYC 7542) and 4S x 18C SKX @ 1 DPC (e.g. Xeon Gold 6150 or the even higher-end 6154) for that workload, but that's even more expensive, obviously.
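
                      For context on why the channel counts matter, rough theoretical per-socket peaks (channels x transfer rate x 8 bytes, using the usual platform maximum DIMM speeds):

                      BDX (4 x DDR4-2400): 4 x 19.2 ≈ 77 GB/s per socket
                      SKX (6 x DDR4-2666): 6 x 21.3 ≈ 128 GB/s per socket
                      EPYC Rome (8 x DDR4-3200): 8 x 25.6 ≈ 205 GB/s per socket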

                      Comment
