Google Has A Problem With Linux Server Reboots Too Slow Due To Too Many NVMe Drives


  • #11
    Originally posted by caligula View Post
    Wow, so many comments and nobody is wondering why a few seconds matter. After all, you'll reboot a desktop only once per day, and servers have uptimes of months. So it shouldn't matter even if it takes a few hours to boot.
    Sixty seconds is not "a few seconds" for important services that must stay online.

    With fast reboot times we get services back online sooner. The more uptime, the better. Simple, no? Even in clustered scenarios, resynchronization finishes faster as well.

    The question I'd put to you is: why would we want a slower restart if we can make it faster?
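For a sense of where the savings come from, here is a toy sketch in Python (not the kernel change under discussion; the drive count and per-drive latency are invented numbers). It shows how a fixed per-device shutdown wait multiplies when drives are handled one after another, and how overlapping those waits brings the total back to roughly a single device's latency:

```python
# Toy illustration only: serial vs. overlapped per-device shutdown waits.
import concurrent.futures
import time

NUM_DRIVES = 16            # hypothetical drive count
SHUTDOWN_SECONDS = 0.25    # hypothetical per-drive shutdown latency


def shutdown_drive(index: int) -> int:
    """Pretend to notify one drive and wait for it to finish shutting down."""
    time.sleep(SHUTDOWN_SECONDS)
    return index


# Serial: each drive is handled only after the previous one has finished.
start = time.monotonic()
for i in range(NUM_DRIVES):
    shutdown_drive(i)
serial = time.monotonic() - start

# Overlapped: issue all the shutdowns, then wait for them together.
start = time.monotonic()
with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_DRIVES) as pool:
    list(pool.map(shutdown_drive, range(NUM_DRIVES)))
overlapped = time.monotonic() - start

print(f"serial:     {serial:.2f}s")      # roughly NUM_DRIVES * SHUTDOWN_SECONDS
print(f"overlapped: {overlapped:.2f}s")  # roughly SHUTDOWN_SECONDS
```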



    • #12
      Originally posted by caligula View Post
      servers have uptimes of months. So it shouldn't matter even if it takes a few hours to boot.
      Depends. If there is some urgent kernel update that runs into the limitations of kernel live patching, then you have to reboot all of your servers in short order. And while servers are rebooting, your server capacity is either much reduced, or you have to draw out your update process over a long time.

      Originally posted by paulocoghi View Post
      Sixty seconds is not "a few seconds" for important services that must stay online.
      Eh, for important services you usually avoid a single point of failure (SPOF) by running multiple servers anyway, and can reboot them one after another.
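A minimal sketch of that rolling pattern, assuming hypothetical host names and hypothetical drain-traffic/enable-traffic commands; at most one server is out of the pool at any time:

```python
# Hypothetical rolling-reboot loop: one server at a time is drained,
# rebooted, and only rejoins the pool once it answers again.
import subprocess
import time

HOSTS = ["app01", "app02", "app03"]  # hypothetical fleet


def run(host: str, command: str) -> None:
    """Run a command on a remote host over ssh (illustrative only)."""
    subprocess.run(["ssh", host, command], check=True)


def wait_until_back(host: str, timeout: float = 600.0) -> None:
    """Poll over ssh until the host answers again after its reboot."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        probe = subprocess.run(["ssh", host, "true"], capture_output=True)
        if probe.returncode == 0:
            return
        time.sleep(10)
    raise TimeoutError(f"{host} did not come back within {timeout}s")


for host in HOSTS:
    run(host, "drain-traffic")   # hypothetical: stop taking new work
    # The ssh session may be cut off as the host goes down, so a non-zero
    # exit status is not treated as a failure here.
    subprocess.run(["ssh", host, "sudo systemctl reboot"])
    wait_until_back(host)
    run(host, "enable-traffic")  # hypothetical: rejoin the pool
```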



      • #13
        Originally posted by caligula View Post
        Wow, so many comments and nobody is wondering why a few seconds matter. After all, you'll reboot a desktop only once per day, and servers have uptimes of months. So it shouldn't matter even if it takes a few hours to boot.
        That's YOUR use case.

        If someone is investing resources to solve what they believe is an issue, then maybe for their use case it IS an issue.

        Stop thinking that your PC and your servers are the whole universe and that there is nothing else.



        • #14
          Originally posted by caligula View Post
          Wow, so many comments and nobody is wondering why a few seconds matter. After all, you'll reboot a desktop only once per day, and servers have uptimes of months. So it shouldn't matter even if it takes a few hours to boot.
          My guess is it wasn't a deal breaker for Google, but rather someone scratching their head and wondering why some servers reboot more slowly than others. If there's a better way to do it, why not?



          • #15
            Originally posted by paulocoghi View Post

            Sixty seconds is not "a few seconds" for important services that must stay online.

            With fast reboot times we get services back online sooner. The more uptime, the better. Simple, no? Even in clustered scenarios, resynchronization finishes faster as well.

            The question I'd put to you is: why would we want a slower restart if we can make it faster?
            Those services are not tied to a physical machine; they just move to another running instance.



            • #16
              Ah, remember the old days, IBM p43 servers booting in just under 7 minutes? Those fond memories.



              • #17
                Originally posted by caligula View Post
                Wow, so many comments and nobody is wondering why a few seconds matter. After all, you'll reboot a desktop only once per day, and servers have uptimes of months. So it shouldn't matter even if it takes a few hours to boot.
                I don't know what Google does for their servers, but my servers reboot daily for updates and I'd gladly take any free speed improvements.



                • #18
                  Originally posted by abufrejoval View Post
                  That reminds me of a question I asked myself only some days ago, when I was considering what to do with a Linux box that showed the symptoms of "locked-in syndrome": while it had lost network connectivity and wasn't reacting to any keyboard input, some parts of the kernel must have still been alive, because I could see the regular blips from the disk activity indicator that hinted at sync() being called every couple of seconds.

                  The question was: is it better to force a power-off or to use the reset button, since it clearly wasn't responding to a short press of the power button?

                  For SATA HDDs and SSDs the choice used to be clear: AFAIK there is no reset line on the SATA bus, so a reset would ensure that there wasn't going to be any device-internal data corruption from in-flight buffers being partially written or similar.

                  But with these NVMe devices I was thinking that there is a reset line on the PCIe bus, which could wreak havoc if the SSD didn't "manage" that intelligently. In that case a power-down might be better, if there was at least some kind of power-fail protection on the device (not that I noticed any caps on the PCB).

                  I'd be happy to be enlightened and get some background on how reset is dealt with on NVMe and hardware RAID controllers on a PC bus (yeah, I still have some to manage the spinning rust).
                  Yeah, my general impression of Linux is that, as an "OS" (and I'm using that term loosely), it never really embraced the async paradigm; a lot of code appears to be written to just block until it gets some response.

                  I notice this somewhat frequently. One example off the top of my head: when NetworkManager is connecting to or disconnecting from different networks, it appears to block/freeze the UI. Being predominantly written in C doesn't help in this regard, because doing this kind of programming in C is really hard (languages like C++/Rust let libraries provide higher-level abstractions that simplify it a lot).
                  Last edited by mdedetrich; 29 March 2022, 05:30 AM.
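As a general illustration of that blocking-vs-async point (this is not NetworkManager's actual code; all names here are made up): a slow synchronous call stalls everything sharing its thread, while handing the same work to a worker thread and awaiting it keeps other tasks, such as a UI heartbeat, responsive.

```python
# Toy contrast between blocking and non-blocking handling of a slow call.
import asyncio
import time


def connect_to_network_blocking() -> str:
    """Stand-in for a slow, synchronous connect call that blocks its caller."""
    time.sleep(2)
    return "connected"


async def heartbeat() -> None:
    """Something that should keep running meanwhile, e.g. repainting a UI."""
    for _ in range(4):
        print("UI still responsive")
        await asyncio.sleep(0.5)


async def main() -> None:
    # Blocking variant: calling connect_to_network_blocking() directly here
    # would stall the event loop, so heartbeat() could not run until it returned.
    #
    # Non-blocking variant: hand the slow work to a worker thread and await it,
    # letting the heartbeat keep ticking in the meantime.
    result, _ = await asyncio.gather(
        asyncio.to_thread(connect_to_network_blocking),
        heartbeat(),
    )
    print(result)


asyncio.run(main())
```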



                  • #19
                    Originally posted by DRanged View Post
                    Ah, remember the old days, IBM p43 servers booting in just under 7 minutes? Those fond memories.
                    Ah yes. It's not specific to IBM, though; just throw in a bunch of add-in cards. Especially on pre-UEFI BIOSes, where the BIOS executes the code from all the option ROMs in sequence, that's where most of the time is spent. All the more reason for things like Coreboot.



                    • #20
                      Even today, servers perform a lot of time-consuming checks during a cold or warm reboot. I'm familiar with Dell and Fujitsu servers in particular, and they can easily take a few minutes to reach the bootloader.

                      Add a minute for the shutdown part, and you truly have time to brew and pour your coffee during a reboot...

