This Linux SGI server with 4096 cores is not a true SMP server. The biggest SMP servers today have ~64 cpus. Anything bigger than 64 cpus is basically a cluster with horrible latency, making it unusable for SMP workloads.
IBM, Oracle, and HP are fighting fiercely to get the best world-record benchmarks, for instance in TPC-C. The biggest mature Unix enterprise servers these vendors have are something like 32-64 cpus. The biggest z10 IBM mainframe has 64 cpus. The biggest z196 IBM mainframe has 24 cpus.
Why don't they just insert 128 cpus? Or 256 cpus? Or even 16,384 cpus? Answer: because these mainframe/Unix servers are SMP servers, and the biggest SMP servers have ~64 cpus today. An SMP server is basically a single huge server, weighing tons and costing many millions of USD. This is vertical scaling, "scale up".
If we talk about HPC servers, horizontal scaling, "scale out", then we talk about clusters. Anything bigger than 64 cpus is basically a cluster: a bunch of PCs sitting on a fast switch. Look at the benchmarks for the SGI Altix server; very impressive results indeed. But all of these workloads are cluster benchmarks, not SMP benchmarks. If you build a cluster, why stop at 4096 cores? You can just change the number 4096 to 16,384 and recompile, and then brag that "Linux scales to 16,384 cores, which is far more than the biggest IBM servers with 64 cpus".
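The "change the number and recompile" step is literally a compile-time kernel constant: the Linux build option CONFIG_NR_CPUS sets the maximum number of cpus the kernel will support. A minimal sketch, run from a kernel source tree (the specific values shown, and whether your architecture accepts them, are assumptions):

```shell
# Hedged sketch: raising the kernel's compile-time CPU ceiling.
# CONFIG_NR_CPUS is a real Kconfig option; the values here are illustrative,
# and each architecture enforces its own upper bound.
grep NR_CPUS .config                      # e.g. CONFIG_NR_CPUS=4096
scripts/config --set-val NR_CPUS 16384    # bump the limit in .config
make olddefconfig                         # resolve dependent options
make                                      # recompile the kernel
```

Note this only raises the scheduler's bookkeeping limit; it says nothing about how well a single system image actually performs at that core count.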
Linux does very well on HPC servers and scales to many thousands of cpus. But on SMP servers, Linux scales poorly, maybe to ~8 cpus.
The biggest Linux SMP server should be a normal 8-socket x86 server; I don't know of any 16-socket x86 server. You can populate it with 8-core Nehalem cpus, getting 64 cores. Thus, the biggest Linux SMP server should be the normal 8-socket x86 servers you can purchase today, for instance the Oracle X4800 8-socket server, which supports 64 cores and 1TB RAM. If you want a 16-socket x86 server, you need specially made chipsets, etc., which is very expensive. And if you want 64 sockets, it will be VERY expensive, costing tens of millions of USD. Scaling well on SMP servers is very difficult to do. Very difficult. For instance, IBM has tried to do this for many decades, and now scales to 64 cpus.
Here is a new big Linux server; it scales up to 8,192 cores using a single Linux image. The solution is called vSMP.
Thus it is a cluster. As someone explained:
"I tried running a nicely parallel shared memory workload (75% efficiency on 24 cores in a 4 socket opteron box) on a 64 core ScaleMP box with 8 2-socket boards linked by infiniband. Result: horrible. It might look like a shared memory, but access to off-board bits has huge latency."
Regarding ZFS: it is built for server usage, with many disks, and built to scale well to many disks. BTRFS lags behind when you start to use many disks; it does not scale as well as ZFS. Benchmarks prove this.
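The "many disks" design shows in how ZFS pools are built: disks are grouped into vdevs and the pool stripes across all of them, so capacity and throughput grow by adding vdevs. A hedged sketch with the standard zpool commands; the pool name "tank" and the device names are assumptions, not from the post:

```shell
# Hedged sketch: scaling a ZFS pool across many disks.
zpool create tank raidz2 sda sdb sdc sdd sde sdf   # one 6-disk raidz2 vdev
zpool add tank raidz2 sdg sdh sdi sdj sdk sdl      # grow by adding another vdev
zpool status tank                                  # pool stripes over both vdevs
```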
From the article on vSMP:
Depending on the cores per chip and the generation you use, you can have from 2,048 to 8,192 cores in a single image.
...
The vSMP hypervisor that glues systems together is not for every workload, but it suits workloads where there is a lot of message passing between server nodes – financial modeling, supercomputing, data analytics, and similar parallel workloads. Shai Fultheim, the company's founder and chief executive officer, says ScaleMP has over 300 customers now. "We focused on HPC as the low-hanging fruit," Fultheim tells El Reg.
...
vSMP Foundation for Cluster is used to take multiple server images and plunk them on a single server image running one copy of a Linux operating system; you use vSMP and that operating system instead of a cluster manager to run workloads.
...
"I tried running a nicely parallel shared memory workload (75% efficiency on 24 cores in a 4 socket opteron box) on a 64 core ScaleMP box with 8 2-socket boards linked by infiniband. Result: horrible. It might look like a shared memory, but access to off-board bits has huge latency."
Regarding ZFS, it is built for server usage. With many disks. And built to scale well up to many disks. BTRFS lags behind when you start to use many disks, it does not scale as well as ZFS. Benchmarks proves this.
Comment