Google Has A Problem With Linux Server Reboots Too Slow Due To Too Many NVMe Drives


  • paulocoghi
    replied
    Originally posted by chithanh
    Eh, for important services you usually avoid SPOF by running multiple servers anyway, and can reboot them one after another.
    I agree, and this is why I mentioned clustered scenarios as well: it's still better for the health of any cluster to have a member back online earlier, even when all services are clustered and highly available.

    If we can reboot faster without compromising safety, why on earth would someone choose the slow option?

    Originally posted by bug77

    Those services are not tied to a physical machine, they just move to another running instance.
    I am aware of this, and I am still in favor of faster reboot times. Again, why not?


  • dekernel
    replied
    Originally posted by Espionage724
    I don't know what Google does for their servers, but my servers reboot daily for updates and I'd gladly take any free speed improvements.
    I would be interested to know which OS you are using that requires daily updates, because even Windows 2003 could go a full week before Windows Update required a reboot.


  • Serafean
    replied
    Rebooting my home server takes about 5 minutes. Anything to speed that up is welcome...

    Originally posted by mdedetrich
    I notice this somewhat frequently. One example off the top of my head is NetworkManager: when it's connecting to or disconnecting from different networks, it appears to block/freeze the UI. Being predominantly written in C also didn't help in this regard, because doing this kind of programming in C is really hard (languages like C++/Rust let you provide higher-level abstractions in libraries that simplify this a lot).
    That would be a problem with the UI... NetworkManager is a daemon and communicates asynchronously over D-Bus. They are different programs.


  • direc85
    replied
    Even today, servers perform a lot of time-consuming checks during a cold/warm reboot. I'm familiar with Dell and Fujitsu servers in particular, and they can easily take a few minutes to reach the bootloader.

    Add a minute for the shutdown part, and you truly have time to brew and pour your coffee during a reboot...


  • uxmkt
    replied
    Originally posted by DRanged
    Ah remember the old days IBM p43 servers booting just within 7 minutes. Those fond memories.
    Ah yes. It's not specific to IBM, though. Just throw in a bunch of add-in cards. Especially on pre-UEFI BIOSes, as the BIOS executes the code from all the option ROMs in sequence, that's where most of the time is spent. All the more reason for things like Coreboot.


  • mdedetrich
    replied
    Originally posted by abufrejoval
    That reminds me of a question I asked myself only some days ago, when I was considering what to do with a Linux box that showed the symptoms of "locked-in syndrome": while it had lost network connectivity and wasn't reacting to any keyboard input, some parts of the kernel must have still been alive, because I could see the regular blips from the disk activity indicator that hinted at sync() being called every couple of seconds.

    The question was: is it better to force a power-off or use the reset button as it clearly wasn't responding to the short power-off?

    For SATA HDDs and SSDs the choice used to be clear: AFAIK there is no reset line on the SATA bus, so a reset would ensure that there wasn't going to be any device internal data corruption with in-flight buffers being partially written or similar.

    But with these NVMe devices I was thinking that there is a reset line on the PCIe bus, which could wreak havoc if the SSD didn't "manage" that intelligently. In that case a power-down might be better, if there was at least some kind of power-fail protection on the device (not that I noticed any caps on the PCB).

    I'd be happy to be enlightened and get some background on how reset is being dealt with on NVMe and hardware RAID controllers on a PC bus (yeah, I still have some to manage the spinning rust).
    Yeah, my general impression is that Linux as an "OS" (and I am using that term loosely) didn't really embrace the async paradigm; a lot of code appears to be written to just block until it gets some response.

    I notice this somewhat frequently. One example off the top of my head is NetworkManager: when it's connecting to or disconnecting from different networks, it appears to block/freeze the UI. Being predominantly written in C also didn't help in this regard, because doing this kind of programming in C is really hard (languages like C++/Rust let you provide higher-level abstractions in libraries that simplify this a lot).
    Last edited by mdedetrich; 29 March 2022, 05:30 AM.
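
To make the blocking-versus-async point concrete, here is a minimal Python sketch. It is only an illustration and is not how the kernel or NetworkManager is implemented: the device names, the 50 ms delay, and every function name below are made up for the example. It shows why waiting on each device in turn costs roughly the sum of the delays, while issuing all the waits concurrently costs roughly the longest single delay.

```python
import asyncio
import time

# Hypothetical values for illustration only; no real hardware is involved.
DEVICE_DELAY = 0.05                        # pretend each device needs 50 ms to shut down
DEVICES = [f"dev{i}" for i in range(8)]    # made-up device names


async def shutdown_device(name: str) -> str:
    """Stand-in for a shutdown request that takes DEVICE_DELAY to complete."""
    await asyncio.sleep(DEVICE_DELAY)
    return name


async def shutdown_serial(devices):
    """Wait for each device before starting the next (blocking style)."""
    done = []
    for name in devices:
        done.append(await shutdown_device(name))
    return done


async def shutdown_concurrent(devices):
    """Start all shutdowns at once, then wait for all of them together."""
    return list(await asyncio.gather(*(shutdown_device(n) for n in devices)))


def timed(coro_fn, devices):
    """Run one strategy and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(devices))
    return result, time.perf_counter() - start


serial_result, serial_time = timed(shutdown_serial, DEVICES)
concurrent_result, concurrent_time = timed(shutdown_concurrent, DEVICES)

print(f"serial:     {serial_time:.3f} s")
print(f"concurrent: {concurrent_time:.3f} s")
```

With eight pretend devices, the serial variant should take about eight times as long as the concurrent one, which is the same shape of problem the thread title describes for machines with many drives.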


  • Espionage724
    replied
    Originally posted by caligula
    Wow, so many comments and nobody is wondering why a few seconds matter. After all, you'll reboot a desktop only once per day, and servers have uptimes of months. So it shouldn't matter even if it takes a few hours to boot.
    I don't know what Google does for their servers, but my servers reboot daily for updates and I'd gladly take any free speed improvements.


  • DRanged
    replied
    Ah remember the old days IBM p43 servers booting just within 7 minutes. Those fond memories.


  • bug77
    replied
    Originally posted by paulocoghi

    60 seconds are not a "few seconds" for important services that must be online.

    With fast reboot times we get services online faster. The more uptime, the better. Simple, no? Even in clustered scenarios, we make the synchronization faster as well.

    My question to you is: why would we want a slower restart time if we can make it faster?
    Those services are not tied to a physical machine, they just move to another running instance.


  • bug77
    replied
    Originally posted by caligula
    Wow, so many comments and nobody is wondering why a few seconds matter. After all, you'll reboot a desktop only once per day, and servers have uptimes of months. So it shouldn't matter even if it takes a few hours to boot.
    My guess is that it wasn't a deal breaker for Google, but rather someone scratching their head and wondering why some servers reboot slower than others. If there's a better way to do it, why not?
