Linux 5.15 Addressing Scalability Issue That Caused Huge IBM Servers 30+ Minutes To Boot


  • mdedetrich
    replied
    Originally posted by ThoreauHD
    What a fun conversation. I haven't seen mainframe dick slapping in years.
    Well, apparently nowadays it's distributed dick slapping with high service level agreements.


  • sinepgib
    replied
    Originally posted by MadeUpName

    Software of all flavours is full of kludges put in place to get around vendor F'ups. Look at everything from firmware to BIOSes to X.
    Yes, but Linux does have standards, those fixes are usually there to work around hardware issues rather than software misconfiguration, and they usually go through a lot of back and forth to justify the addition.
    Lastly, reducing unnecessary contention is not a kludge from any reasonable POV.
    Looking at the patches, they seem sensible. The only thing that may be questionable is whether negative dentry caching is needed when they could just use filesystem event notifications to check whether an entry was added.
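    For context, a negative dentry is a directory-cache entry recording that a path does not exist, so repeated failed lookups (think loaders or scripts probing long search paths during boot) can be answered from memory instead of hitting the filesystem every time. A minimal userspace sketch of that access pattern, with a made-up path, in Go; this only illustrates the lookup pattern being discussed, not the kernel change itself:

        // negdentry_demo.go: hammer a path that does not exist.
        // After the first miss, the kernel can answer later misses from a
        // cached negative dentry instead of asking the filesystem again.
        package main

        import (
            "fmt"
            "os"
            "time"
        )

        func main() {
            const missing = "/usr/lib/this-library-does-not-exist.so" // hypothetical path
            start := time.Now()
            for i := 0; i < 1_000_000; i++ {
                if _, err := os.Stat(missing); err == nil {
                    fmt.Println("unexpectedly found", missing)
                }
            }
            fmt.Printf("1e6 failed lookups took %v\n", time.Since(start))
        }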


  • ThoreauHD
    replied
    What a fun conversation. I haven't seen mainframe dick slapping in years.


  • MadeUpName
    replied
    Originally posted by sinepgib

    A fix got accepted into the kernel, which means the maintainers considered there actually was something wrong with it, so I guess they already accepted that Linux is not perfect, just as no other OS is. So you're wrong.
    Software of all flavours is full of kludges put in place to get around vendor F'ups. Look at everything from firmware to BIOSes to X.


  • sinepgib
    replied
    Originally posted by MadeUpName
    IBM isn't the only company in the world that makes big-ass servers. Until you show me another vendor with similar problems, I blame IBM.
    A fix got accepted into the kernel, which means the maintainers considered there actually was something wrong with it, so I guess they already accepted that Linux is not perfect, just as no other OS is. So you're wrong.


  • Vorpal
    replied
    Originally posted by coder111

    Depends on your problem. Sometimes you need low-latency random access on 64 TB of data...

    And IMO distributed systems are often overrated/overused. A cluster is hard to develop for and hard to manage. And when things go wrong, it's hard to find bottlenecks, hard to fix them, and hard to recover from failures. Also, when network latency kicks in or the cluster software isn't done 100% right, some problems might need a LOT of machines in a cluster to achieve satisfactory performance. And then, depending on the cluster size, it might end up more expensive than a single machine. Besides, if you factor in the development/maintenance cost of a cluster vs. one machine (say, with a hot spare), you might end up with a total cost that is significantly higher.

    My advice: if you don't need massive horizontal scalability and you don't need massive reliability, don't develop distributed systems.
    A couple of counterpoints:
    1. There are other types of distributed systems, not just those in data centres. I have worked on development for distributed systems in embedded applications. One example is the CAN bus in a modern car with many attached microcontrollers. Another case I worked on was multiple communicating robots. Both of these have very different requirements from the data centre case, but are considered distributed systems. I suspect that you were considering a much narrower scope, but correct definitions and terminology are important.
    2. Even horizontal scaling is relatively easy if done well, with an environment built for it. Consider Erlang, which I have used in a horizontal scaling situation. Thanks to the functional message-passing programming paradigm and the OTP libraries, horizontal scaling comes almost for free (see the sketch after this comment). Reliability, while not quite as easy, is also greatly simplified.
    That said, I never had to deal with purchasing the hardware in question, so I will have to take your word for it when it comes to the cost. And in the embedded setting there are really good reasons to work on a distributed system (e.g. you need multiple physical vehicles/devices, and you want to save on wiring within a vehicle).
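    The Erlang/OTP style mentioned above boils down to isolated workers that share no memory and only exchange messages, so adding workers (or, in Erlang's case, moving them to other nodes) does not change the programming model. A rough sketch of that shape, using Go channels as in-process mailboxes; all names here are made up, and unlike distributed Erlang this stays within one process:

        // Share-nothing, message-passing worker pool: each worker owns its
        // own state and communicates only through channels.
        package main

        import (
            "fmt"
            "sync"
        )

        type job struct {
            id      int
            payload string
        }

        func worker(id int, jobs <-chan job, results chan<- string, wg *sync.WaitGroup) {
            defer wg.Done()
            for j := range jobs { // receive messages until the channel is closed
                results <- fmt.Sprintf("worker %d handled job %d (%s)", id, j.id, j.payload)
            }
        }

        func main() {
            jobs := make(chan job)
            results := make(chan string, 16)
            var wg sync.WaitGroup

            // "Scaling out" here is just starting more workers; the sender
            // never needs to know how many there are.
            for w := 1; w <= 4; w++ {
                wg.Add(1)
                go worker(w, jobs, results, &wg)
            }

            go func() {
                for i := 1; i <= 8; i++ {
                    jobs <- job{id: i, payload: "some work"}
                }
                close(jobs)
                wg.Wait()
                close(results)
            }()

            for r := range results {
                fmt.Println(r)
            }
        }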


  • uxmkt
    replied
    Originally posted by indepe
    I'd guess being able to boot a mainframe with "several hundred CPUs and 64TB of RAM" in under 5 minutes is quite an achievement, though. (Without knowing how long other OSes would take...)
    I am pretty sure DOS would boot in an instant. Wait, you didn't say you wanted to use more than 640K... ;-)
    Originally posted by pipe13
    First show me another vendor with 5 nines availability.
    Not too hard; these days the bigger problem is the software, which does not get close to that.


  • coder111
    replied
    Originally posted by Vorpal
    I'm curious: What is the point of systems like this? Why not go for a cluster of cheaper servers?
    Depends on your problem. Sometimes you need low-latency random access on 64 TB of data...

    And IMO distributed systems are often overrated/overused. A cluster is hard to develop for and hard to manage. And when things go wrong, it's hard to find bottlenecks, hard to fix them, and hard to recover from failures. Also, when network latency kicks in or the cluster software isn't done 100% right, some problems might need a LOT of machines in a cluster to achieve satisfactory performance. And then, depending on the cluster size, it might end up more expensive than a single machine. Besides, if you factor in the development/maintenance cost of a cluster vs. one machine (say, with a hot spare), you might end up with a total cost that is significantly higher.

    My advice: if you don't need massive horizontal scalability and you don't need massive reliability, don't develop distributed systems.


  • Vorpal
    replied
    I'm curious: What is the point of systems like this? Why not go for a cluster of cheaper servers? My (limited) understanding is that this is what companies like Google, Facebook, Amazon, etc. do.

    Unless you really need 64 TB of RAM in a single machine, why not split it across multiple machines and design around that? After all, the companies I mentioned above also manage extremely impressive uptimes.


  • zboszor
    replied
    Originally posted by partcyborg

    Lol, 5 nines is ~5 min of downtime per YEAR. Good luck getting that with a 30 min boot time.
    ~5 min of unplanned downtime per year. Planned outages are not counted against availability.
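    For reference, the arithmetic behind the "five nines" figure: 99.999% availability leaves a downtime budget of 0.001% of a year, which is about 5.26 minutes. A quick check in Go:

        // Downtime budget for "five nines" (99.999%) availability.
        package main

        import "fmt"

        func main() {
            const minutesPerYear = 365.25 * 24 * 60 // ~525,960 minutes
            const availability = 0.99999
            fmt.Printf("allowed downtime: %.2f minutes/year\n", minutesPerYear*(1-availability))
            // prints roughly: allowed downtime: 5.26 minutes/year
        }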
