Google Has A Problem With Linux Server Reboots Too Slow Due To Too Many NVMe Drives


  • cjcox
    replied
    Originally posted by milkylainen View Post
    Umm. But WHY is it taking 4.5 seconds?
    There is nothing that should take 4.5 seconds on an NVMe drive.
    Even if it has a power hold-up cache, it would be flushed rather immediately.
    Buggy drive or bugged specification?
    My guess is that initialization is done sequentially and you won't see the problem unless you have tons of NVMe drives on a single OS instance. But, I'm just guessing.

    There's a lot of that sort of thing in the kernel, though maybe they've worked around or avoided it in other places.
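The guess above can be put into numbers. This is a minimal sketch, not a measurement: only the 4.5 s per-drive figure comes from the thread, while the drive count of 16 and both function names are assumptions made for illustration.

```python
SHUTDOWN_SECONDS = 4.5   # per-drive figure quoted in the thread
DRIVE_COUNT = 16         # assumed drive count, for illustration only


def sequential_total(delays):
    # One drive at a time: every wait is added to the total.
    return sum(delays)


def concurrent_total(delays):
    # All drives at once: the slowest drive sets the total.
    return max(delays, default=0.0)


delays = [SHUTDOWN_SECONDS] * DRIVE_COUNT
seq = sequential_total(delays)
con = concurrent_total(delays)
print(f"sequential: {seq:.1f} s, concurrent: {con:.1f} s")
# prints "sequential: 72.0 s, concurrent: 4.5 s"
```

With a single drive both totals are the same, which is why the delay would only stand out on machines with many drives.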





  • DRanged
    replied
    Originally posted by logical View Post

    How about an IBM RS/6000 J40 (Micro Channel, PowerPC)? A full boot with 8 processors took about 45 minutes. A quick boot was 15 minutes. :-)
    I still have an RS/6000 320 and an RS/6000 380. I haven't booted them for a while now. Might give it a try at Easter to see if they still work. They are about 30 years old now and will sound like a starfighter taking off.



  • sinepgib
    replied
    Originally posted by juarezr View Post

    It looks to me like an issue that is very complex and hard to get right.

    I'm also unsure that async alone is the right approach in the long term. Some questions about priority and concurrency still remain:
    • What about a driver/task that is using data on an NVMe drive that is being shut down asynchronously? Will it fail? Will it force the shutdown process back to a serial process?
    • What is the correct order of shutdown on every Linux box?
      • Apps, users, networks, filesystems, then keyboard/user-input and for last display?
      • Apps, users, keyboard/user-input, filesystems, networks, display, and finally filesystems?
      • Or maybe it will depend if:
        • the system is a smaller Raspberry PI or a larger weather forecast cluster node?
        • the architecture is ARM, x86, MIPS, or RISC-V?
        • the system is optimized to sleep/hibernate rather than shutdown/restart?
    Maybe at some time in the future, people will have a passionate discussion about async shutdown vs dependency-based shutdown, just as was the case with "systemd" vs "upstart".

    Commonly the order of deactivation is the inverse of the order of activation. But in a modern OS like Linux, that has become way too complex because many drivers, hardware devices, and dependencies can be plugged in at any time by users and processes.

    Also, the myriad of architectures and use cases that Linux supports, ranging from the smallest IoT and embedded devices to enormous servers and clusters, requires that order and concurrency be tackled in different ways, depending on the setup.

    I won't be surprised if people further propose a "micro-systemd" or a "mini-upstart" inside the kernel for handling concurrent boot and shutdown before the kernel delegates control to PID 1.
    Async and dependency-based are complementary, AFAICT. You do have dependency chains in systemd, for example. But because things are async, while one service waits for another to go offline first, you can keep working on the services you can shut down right now.
    Regarding "micro-systemd", I don't really see that happening. The kernel is more or less self-contained, which means it can handle this internally in simpler, less flexible ways, and can still rely on PID 1 to shut down the services themselves.
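The point about async and dependency-based shutdown being complementary can be sketched as follows. This is only an illustration under stated assumptions: the unit names, the `MUST_STOP_FIRST` table, and `stop_all` are hypothetical, and neither systemd nor the kernel is claimed to work this way internally.

```python
import asyncio

# Hypothetical units: each maps to the units that must stop before it may stop.
MUST_STOP_FIRST = {
    "app": [],
    "database": ["app"],
    "network": ["app"],
    "filesystem": ["database"],
}


async def stop_all(must_stop_first, stop_time=0.01):
    order = []  # records the order in which units finished stopping
    done = {name: asyncio.Event() for name in must_stop_first}

    async def stop(name):
        # Wait only for this unit's own prerequisites, not for everything.
        for dep in must_stop_first[name]:
            await done[dep].wait()
        await asyncio.sleep(stop_time)  # stand-in for real shutdown work
        order.append(name)
        done[name].set()

    # Every unit gets its own task, so independent units stop concurrently.
    await asyncio.gather(*(stop(name) for name in must_stop_first))
    return order


order = asyncio.run(stop_all(MUST_STOP_FIRST))
print(order)
```

Here "database" and "network" both wait for "app", but neither waits for the other, so the dependency chain is respected without serializing the whole shutdown.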



  • sinepgib
    replied
    Originally posted by CommunityMember View Post

    In the hyperscaler cloud space, servers (including bare metal servers) get rebooted all the time, and every minute lost can be (substantial) lost revenue at hyperscale (along with annoyed customers who want their systems available now, now, now).
    And even if it's not for us, what is wrong with them scratching their own itches? As long as it's not to the detriment of others, isn't that part of the point of using open source?



  • Developer12
    replied
    Originally posted by cl333r View Post

    Uhm, why?
    An NVMe device can't add any delay if it doesn't exist. This patch just reduces the delay.



  • cl333r
    replied
    Originally posted by bug77 View Post

    More like, with this patch, your PC will not shut down 4.5s slower.
    Uhm, why?



  • bug77
    replied
    Originally posted by cl333r View Post
    I just realized that with this patch, if I buy a second NVMe drive, my PC will shut down 4.5 seconds faster. So it's not just about cloud servers.
    More like, with this patch, your PC will not shut down 4.5s slower.



  • juarezr
    replied
    Originally posted by mdedetrich View Post

    Yeah, the general impression I get with Linux is that as an "OS" (and I am using that term loosely) it didn't really embrace the async paradigm; a lot of code appears to be written to just block until it gets some response.

    I notice this somewhat frequently. One example off the top of my head is that NetworkManager appears to block/freeze the UI when it's connecting to or disconnecting from different networks. Being predominantly written in C also didn't help in this regard, because doing this kind of programming in C is really hard (languages like C++/Rust allow you to provide higher-level abstractions in libraries that simplify this a lot).
    It looks to me like an issue that is very complex and hard to get right.

    I'm also unsure that async alone is the right approach in the long term. Some questions about priority and concurrency still remain:
    • What about a driver/task that is using data on an NVMe drive that is being shut down asynchronously? Will it fail? Will it force the shutdown process back to a serial process?
    • What is the correct order of shutdown on every Linux box?
      • Apps, users, networks, filesystems, then keyboard/user-input and for last display?
      • Apps, users, keyboard/user-input, filesystems, networks, display, and finally filesystems?
      • Or maybe it will depend if:
        • the system is a smaller Raspberry PI or a larger weather forecast cluster node?
        • the architecture is ARM, x86, MIPS, or RISC-V?
        • the system is optimized to sleep/hibernate rather than shutdown/restart?
    Maybe at some time in the future, people will have a passionate discussion about async shutdown vs dependency-based shutdown, just as was the case with "systemd" vs "upstart".

    Commonly the order of deactivation is the inverse of the order of activation. But in a modern OS like Linux, that has become way too complex because many drivers, hardware devices, and dependencies can be plugged in at any time by users and processes.

    Also, the myriad of architectures and use cases that Linux supports, ranging from the smallest IoT and embedded devices to enormous servers and clusters, requires that order and concurrency be tackled in different ways, depending on the setup.

    I won't be surprised if people further propose a "micro-systemd" or a "mini-upstart" inside the kernel for handling concurrent boot and shutdown before the kernel delegates control to PID 1.
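The "inverse order of activation" rule from this post can be sketched as a simple log that is read backwards at shutdown. The class, its methods, and the device names are all hypothetical, and the sketch deliberately ignores the concurrency questions raised above.

```python
class ActivationLog:
    """Hypothetical registry that remembers the order in which things came up."""

    def __init__(self):
        self._active = []

    def activate(self, name):
        # Hot-plugged items simply land at the end, whenever they appear.
        self._active.append(name)

    def deactivate(self, name):
        # Items removed before shutdown drop out of the plan.
        self._active.remove(name)

    def shutdown_order(self):
        # Last activated, first deactivated.
        return list(reversed(self._active))


log = ActivationLog()
for name in ["pci-bus", "nvme0", "filesystem", "network"]:
    log.activate(name)
log.activate("usb-disk")    # plugged in later by a user
log.deactivate("network")   # removed before shutdown
plan = log.shutdown_order()
print(plan)
# prints ['usb-disk', 'filesystem', 'nvme0', 'pci-bus']
```

A flat list like this stays correct under hot-plugging, but it also forces a fully serial shutdown, which is the trade-off the thread is about.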



  • M@GOid
    replied
    "Google Has A Problem With Linux Server Reboots Too Slow Due To Too Many NVMe Drives"


    Me, looking at my machine with a single SATA SSD and without any M.2 slots... Those guys are showing off...



  • scottishduck
    replied
    Originally posted by caligula View Post
    Wow, so many comments and nobody is wondering why a few seconds matter. After all, you reboot a desktop only once per day, and servers have uptimes of months. So it shouldn't matter even if it takes a few hours to boot.
    If you have, let’s say, 500k servers, then shaving 30 seconds off one shutdown of each saves about half a year of time overall.
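The estimate above checks out as rough arithmetic. A minimal sketch, assuming a 365-day year; the function name is made up for illustration, and the 500k and 30 s figures are the ones from the post.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def fleet_time_saved_years(servers, seconds_saved_each):
    # Total machine time saved across the fleet for one shutdown per server.
    return servers * seconds_saved_each / SECONDS_PER_YEAR


saved = fleet_time_saved_years(500_000, 30)
print(f"{saved:.2f} years")
# prints "0.48 years"
```

That is 15,000,000 seconds, or a little under 174 days of cumulative machine time for a single shutdown of every server.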

