Work Revived On Parallel CPU Bring-Up To Boot Linux Faster On Large Systems/Servers


  • #11
    Originally posted by pipe13 View Post

    Possibly what's failing on the AMD side is the failure of any AMD user to step up with a use case. From TFA:


    6 tenths of a second. About a slow blink of an eye. What is the total boot time of your workstation?

    Now, someone at Amazon -- David Woodhouse -- thought for whatever reason that it was worth the effort for their Intel use case. Whatever reason that might be. But if no one is clamoring for this feature for Zen 3 or Zen 4, perhaps it will hurt no one for AMD to put it off until Zen 5 or Zen 6. There will be a substantial architecture update with Zen 5.
    So on re-reading, it seems to affect Zen and Zen+. Zen 2 and onward seem OK with the feature enabled. BUT... since it cannot be enabled across the whole Zen line, off it goes. And whatever issue 1st Gen Zen has with the feature, it must be BIOS/UEFI-related if I had to take a guess at its source. No one can troubleshoot it because, with the feature enabled, it fails before any meaningful debugging can determine the cause. Does make it kinda tricky.

    For us mere mortals, this feature is probably not for any real day-to-day use. For the Amazons and Googles who've got millions of cores, I'm sure every millisecond counts when you're testing some feature and it takes 10 minutes for your giant server farm to reboot and do its thing. For that use case, yeah, I can see how shaving off MANY seconds might help in getting you home in time for a dinner that's still hot. Someone out there will get annoyed enough to make a valiant attempt at figuring it out, but how many massive datacenters are still running anything major off 1st Gen Zen stuff? Probably not too many. That's why AMD doesn't see much value in trying to dig into the cause. And even if they determine it's a BIOS / UEFI / CPU firmware cause, I doubt any MoBo maker's gonna issue a BIOS or CPU firmware update for it.

    If they can work around it through CPU detection a la "if CPU=Zen || Zen+ {enableParallelBringUp=false}" then yay for AMD I guess...so long as it doesn't shit the bed using that kind of kludge. The devs seem to be agreeing that it's an all or none use case for AMD. Probably a safe bet.
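
The CPU-detection workaround floated above could look roughly like this. It is a minimal sketch in Python, not kernel code: the function name and constants are hypothetical, and the assumption that Zen and Zen+ parts are AMD family 17h with model numbers below 0x30 is illustrative and should be checked against AMD's documentation.

```python
# Hypothetical sketch: names and the model cutoff are assumptions for
# illustration, not taken from the actual kernel patches.
AMD_VENDOR = "AuthenticAMD"
AMD_FAMILY_17H = 0x17
FIRST_ZEN2_MODEL = 0x30  # assumption: family 17h models below this are Zen / Zen+


def allow_parallel_bringup(vendor: str, family: int, model: int) -> bool:
    """Return False for parts assumed to be Zen or Zen+, True for everything else."""
    is_zen_or_zen_plus = (
        vendor == AMD_VENDOR
        and family == AMD_FAMILY_17H
        and model < FIRST_ZEN2_MODEL
    )
    return not is_zen_or_zen_plus
```

A denylist keyed on family and model like this is exactly the kind of kludge described above: it only helps if the failure really tracks the CPU generation rather than the firmware on a given board.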

    Comment


    • #12
      Originally posted by NobodyXu View Post

      IMO I don't think 384/512 physical cores or even 192/256 physical cores are coming soon, since the current silicon technology has almost reached its limit and continuing to scale up is hard.

      AMD's best server CPU, the AMD EPYC™ 9654P, has 96 physical cores, and Ampere Altra has 80 physical cores.
      I'm not aware of anything with more physical cores than them.

      In order to have 192 physical cores, which is a 2x core count improvement, either the silicon technology must be able to shrink the node by 2x, which is impossible as we have almost reached the limit.

      Or AMD can scale down single-core performance to fit more cores on one CPU and also potentially stack silicon on top, which also causes heat issues and will downclock single-core performance.

      That's a valid solution, maybe the right solution for some highly parallel programs, but I'm not entirely sure this is the right solution for cloud.

      While the cloud mostly cares about core counts, as it has a lot of programs/VMs running on it, single-core perf is still important since not everything can be run in parallel.

      Even if that is the right decision, doubling the physical core count is still going to be quite hard.
      Yeah, it's getting tougher and tougher to squeeze more cores into such a tiny space even at 5nm. Have you seen the size of Zen 4 and 13th Gen Intel CPUs these days? Compared to ones from a decade ago, they are MASSIVE. Seems that if they cannot fit more in less space, we're just going to see physically larger CPUs that occupy more surface area.

      AM5 is already a 1718 pin-count socket. You don't think by the time they figure out a sub-3nm process that we'll start seeing socket pin-counts in the 3000 / 4000 range? If I had to make a bet, I'd say we're at least 3-4 years away from seeing 192 core counts and probably 5 years away from seeing 256 core counts. We'll probably see CPUs that will be twice the physical size of what we have now by that time.
      Last edited by kozman; 02 February 2023, 10:36 PM.

      Comment


      • #13
        Originally posted by kozman View Post

        ... For us mere mortals, this feature is probably not for any real day-to-day use. For the Amazons and Googles who've got millions of cores, I'm sure every millisecond counts when you're testing some feature and it takes 10 minutes for your giant server farm to reboot and do its thing. For that use case, yeah, I can see how shaving off MANY seconds might help in getting you home in time for a dinner that's still hot. ...
        If Amazon and Google aren't rebooting their server farm nodes in parallel, they should fix that immediately. If they are, this seems like an improvement of less than one second in something that might take many minutes. Optimizing their kernel builds to not probe for hardware that isn't there is likely to save WAY more time.

        Comment


        • #14
          Originally posted by NobodyXu View Post

          IMO I don't think 384/512 physical cores or even 192/256 physical cores are coming soon, since the current silicon technology has almost reached its limit and continuing to scale up is hard.

          AMD's best server CPU, the AMD EPYC™ 9654P, has 96 physical cores, and Ampere Altra has 80 physical cores.
          I'm not aware of anything with more physical cores than them.

          In order to have 192 physical cores, which is a 2x core count improvement, either the silicon technology must be able to shrink the node by 2x, which is impossible as we have almost reached the limit.
          8 socket Sapphire Rapids supports up to 480 cores and 960 threads using 60 core CPUs.

          Comment


          • #15
            Gentlemen, the game is now milliseconds.

            We've come a long way.

            Comment


            • #16
              Originally posted by billbo View Post

              If Amazon and Google aren't rebooting their server farm nodes in parallel, they should fix that immediately. If they are, this seems like an improvement of less than one second in something that might take many minutes. Optimizing their kernel builds to not probe for hardware that isn't there is likely to save WAY more time.
              Amazon and Google do not "reboot the server farm" all at once, and they don't care about this. Hardware restarts happen gradually, giving the distributed systems running on top time to gracefully drain and migrate elsewhere. No one gives two squats about boot time, even if the whole DC loses power. When that happens, traffic is drained to another nearby zone anyway, giving hardware ops time to bring things back up slowly and safely.

              Comment


              • #17
                Originally posted by kozman View Post
                If I had to make a bet, I'd say we're at least 3-4 years away from seeing 192 core counts and probably 5 years away from seeing 256 core counts. We'll probably see CPUs that will be twice the physical size of what we have now by that time.
                That sounds like a reasonable guess.

                Comment


                • #18
                  Originally posted by Paradigm Shifter View Post
                  Ampere have a 128 core model.
                  Thanks, I didn't know that.
                  Looking at the spec, it trades off single-thread perf for multi-thread perf as I thought, but it still seems to be beneficial for cloud computing.

                  While that's great, I hope in the future GHA at least gives us more virtual cores if they are going to trade single-core perf for multi-core perf.
                  GHA is already quite slow with only two cores for the free version; with this trade-off I think it will be even slower.

                  Originally posted by Paradigm Shifter View Post
                  And Intel are talking up their new Xeons which are efficiency-cores-only and have an absolutely colossal socket for 2024 (I'll believe it when I see it). Videocardz has come out with some pretty wild extrapolations regarding Intel of late, which are, given the issues Intel have had with even getting Sapphire Rapids out the door, a little hard to believe until solid evidence is presented (i.e.: I can place an order for said CPU and have it arrive the same month).
                  Yeah, Intel is struggling against AMD in servers, where single-core perf does not matter as much as multi-core perf and power efficiency is very important. Their new hardware accelerators aren't there yet and can only be used in limited workflows (maths/AI).

                  Originally posted by Paradigm Shifter View Post
                  So it seems like the solution might be "make the CPUs physically larger". Which runs into potential yield issues... unless a chiplet design is utilised, of course...
                  IMO it's just way easier to have multiple CPU sockets at this point; that will give better yield, lower price and better perf.

                  Comment


                  • #19
                    Originally posted by Mark Rose View Post

                    8 socket Sapphire Rapids supports up to 480 cores and 960 threads using 60 core CPUs.
                    I was talking about physical cores on a single CPU, but I do agree that having multiple CPUs on the same motherboard scales much better than adding more physical cores to one CPU.

                    Comment
