The Linux Kernel Is Preparing To Enable 5-Level Paging By Default


  • JustRob
    replied
    Originally posted by AsuMagic View Post
    [insert xkcd joke]
    The "Oracle StorageTek SL8500 Modular Library System" supports 2.1EBs (and you can chain them together), CleverSafe claims 10EBs, then there is this old quote that's easy to find: "The Large Hadron Collider generates around 15 petabytes of data every year. AT&T transfers approximately 20 petabytes of data through its network every day.".


    Some of the comments here are the kind of thinking that perpetuated the myth that humans only use 10% of their brains; those same people also don't use even 0.01% of the storage currently in use.
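
    Just to put those numbers next to the paging limits being discussed (my own back-of-the-envelope arithmetic): x86-64 4-level paging covers a 48-bit virtual address space, and 5-level paging extends that to 57 bits.

    #include <stdio.h>

    /* Back-of-the-envelope: x86-64 virtual address space limits vs.
     * exabyte-scale storage.  Illustrative numbers only. */
    int main(void)
    {
        unsigned long long four_level = 1ULL << 48;  /* 4-level paging: 256 TiB */
        unsigned long long five_level = 1ULL << 57;  /* 5-level paging: 128 PiB */

        printf("4-level limit: %llu bytes (~%.0f TiB)\n",
               four_level, four_level / (1024.0 * 1024 * 1024 * 1024));
        printf("5-level limit: %llu bytes (~%.0f PiB)\n",
               five_level, five_level / (1024.0 * 1024 * 1024 * 1024 * 1024));
        printf("a 2.1 EB library is ~%.1fx the 5-level limit\n",
               2.1e18 / (double)five_level);
        return 0;
    }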



  • coder
    replied
    Originally posted by caligula View Post
    You can decrease the speed to conserve (dynamic) power.
    The "one-time power-efficiency dividend" I mentioned is a result of lowering the interface speed. As for the DRAM, itself, it requires dynamic refresh - and that takes power. Just by having the dies, you need to power them. The more dies or the larger they are, the more idle power they'll require - stacking only saves interface power.

    Originally posted by caligula View Post
    Of course you do. Having more memory channels implies you have more DRAM sockets, too.
    I thought you were talking about more channels as a side-effect of HBM2. I agree with that, because HBM2 channels (in GPUs, at least) are narrow, and the interfaces are wider due to being in-package. So, you somewhat naturally get more channels as a side-effect.

    However, if you're talking about adding more channels for out-of-package memory, then I reject that scenario. Doing so has the inevitable consequences of:
    • increasing power consumption, by having to drive more memory
    • increasing system memory costs, by requiring more DIMMs
    • increasing board cost, by requiring more traces, layers, and possibly DIMM slots
    • increasing CPU/package cost, by requiring more memory controllers (needs more silicon) and requiring more pins.
    It's an expensive and power-intensive way to add bandwidth, it doesn't scale well, and it doesn't apply to your laptop use case.

    Originally posted by caligula View Post
    it's technically 100% feasible to have 3 kg laptops instead of 1.2 kg ultrabooks.
    These exist, but they're expensive, not very popular, and provide poor performance (or battery life) when running on battery. Over the years, I'm pretty sure I've seen mobile workstations with E5 Xeons, but I'm currently not finding any that are based on Xeon W or Threadripper - probably because both companies now offer so many cores in their mainstream desktop socket. And, BTW, they're almost certainly more than 3 kg.



  • caligula
    replied
    Originally posted by coder View Post
    Node shrinks make transistors cheaper and more power-efficient - stacking does not.
    Well, the original discussion revolved around capacity, not price. Sure, larger capacity results in more expensive designs.

    Originally posted by coder View Post
    You do get a one-time power-efficiency dividend with stacking, but as DRAM dies still burn power, even if stacking would somehow let you have more of them (just for the sake of argument), those capacity increases would not be applicable to power-constrained use cases, like laptops.
    You can decrease the speed to conserve (dynamic) power. It's a tradeoff, but at some point a larger memory capacity might be the more desirable property. After all, more memory means you need to swap stuff out less often. Larger memory capacities also enable building systems that store some data offline, e.g. a 3D XPoint bcache/swap.

    Originally posted by coder View Post
    Only as a side-effect of HBM2, but you don't get any more capacity from doing that.
    Of course you do. Having more memory channels implies you have more DRAM sockets, too. 4 sockets can provide twice as much memory as 2. For example, many desktop systems support up to 16 or 32 GB of RAM with 2 DDR4 sockets, up to 32 or 64 GB with 4 sockets, and up to 64-128 GB with 8 sockets. There's plenty of space inside the chassis. I don't know why laptops keep getting smaller every year, but it's technically 100% feasible to have 3 kg laptops instead of 1.2 kg ultrabooks. My first laptops were even larger than that. Just enlarge the chassis and put more memory sockets inside. It doesn't automatically lead to designs where you need heavyweight external batteries.



  • chithanh
    replied
    Originally posted by HyperDrive View Post
    And Power ISA 3.0 (POWER9) introduced radix tree page tables because, guess what, hashed page tables suck for cache locality.
    Whether, and how much, this is better depends squarely on the type of workload. AI/HPC in particular benefits from radix tree page tables, and POWER is now heavily marketed towards it, so the decision is understandable. And it's not like 5-level paging was without performance issues (as mentioned in the article); how much the latest implementation is able to solve remains to be seen.



  • aaronw
    replied
    With RDMA and InfiniBand this becomes less of an issue. Lustre also supports mmap file I/O. According to this page:

    The full POSIX test suite passes in an identical manner to a local EXT4 file system, with limited exceptions on Lustre clients. In a cluster, most operations are atomic so that clients never see stale data or metadata. The Lustre software supports mmap() file I/O.
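
    For reference, the mmap() usage in question is just the plain POSIX interface; here's a minimal sketch (the path is made up, and nothing in it is Lustre-specific):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical file on a large (possibly distributed) filesystem. */
        int fd = open("/mnt/bigfs/dataset.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file; pages are only faulted in on first access. */
        unsigned char *data = mmap(NULL, st.st_size, PROT_READ,
                                   MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touch one byte per page; each first touch is a page fault
         * that the underlying filesystem has to service. */
        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)
            sum += data[i];
        printf("sum of first byte per page: %lu\n", sum);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }

    Whether that performs well (or behaves identically to local ext4) on a given distributed filesystem is exactly the question, but the interface itself is nothing exotic.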



  • coder
    replied
    Originally posted by aaronw View Post
    Many high-performance computing systems use the Lustre filesystem (used in over 60 of the top 100 fastest supercomputers). There are also a number of other high-performance distributed filesystems available for Linux.
    I have doubts about support for mmap(), among distributed filesystems. That's one reason I asked.



  • aaronw
    replied
    I don't know which filesystem is used the most, but there are a number of filesystems that can scale. I also know XFS can scale into the petabyte range. EXT4 is not used since it doesn't do well over 16 TB. Many high-performance computing systems use the Lustre filesystem (used in over 60 of the top 100 fastest supercomputers). There are also a number of other high-performance distributed filesystems available for Linux.



  • coder
    replied
    Originally posted by aaronw View Post
    While addressable DRAM is nowhere close to the 256 TiB limit, this becomes important for memory-mapping large data sets. Paging is also used for memory-mapped files, for example.
    Out of curiosity, what filesystems are typically used for this?

    Originally posted by aaronw View Post
    There doesn't have to be physical memory to back every page entry.
    Yes, that's largely the distinction between physical and virtual addresses.
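
    A minimal sketch of that distinction, assuming Linux (MAP_NORESERVE is Linux-specific and a 64-bit build is assumed): you can reserve far more virtual address space than the machine has RAM, and physical pages only get allocated for what you actually touch.

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Reserve 1 TiB of virtual address space -- far more than most
         * machines have as RAM.  Nothing physical is committed yet. */
        size_t len = 1ULL << 40;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        /* Only the pages we write to get backed by physical memory. */
        memset(p,           'a', 4096);   /* first page           */
        memset(p + len / 2, 'b', 4096);   /* a page in the middle */

        printf("mapped %zu bytes of virtual space, touched 2 pages\n", len);
        munmap(p, len);
        return 0;
    }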



  • coder
    replied
    Originally posted by caligula View Post
    Not necessarily. Just more modules. Currently I think 2-3 generations of process shrinking are technically feasible. Each process shrink may double the capacity.
    Okay, so I'll agree that 4-8x might be plausible.

    Originally posted by caligula View Post
    On top of that you have 3d stacking
    Node shrinks make transistors cheaper and more power-efficient - stacking does not. You do get a one-time power-efficiency dividend with stacking, but as DRAM dies still burn power, even if stacking would somehow let you have more of them (just for the sake of argument), those capacity increases would not be applicable to power-constrained use cases, like laptops.

    Originally posted by caligula View Post
    high end laptops could use 10 instead of 2 memory channels in the future.
    Only as a side-effect of HBM2, but you don't get any more capacity from doing that.

    Originally posted by caligula View Post
    It's also possible that they'll come up with something other than DRAM, some QLC NAND / DRAM hybrid perhaps.
    QLC NAND is ridiculously slow compared to DRAM - something like 4 orders of magnitude or more. QLC writes can be almost as slow as hard disks.

    AFAIK, the only tech with higher density than DRAM and performance that's anywhere close is actually the 3D XPoint that Intel is branding as Optane. So, I could see a world where we have some amount of HBM2 (or similar) stacked in the CPU package - probably anywhere from 4 to 32 GB - and then your external memory is 3D XPoint. You could use the HBM as an exclusive cache, by page-faulting the same way that we do with virtual memory. Performance-wise, perhaps it makes sense to have about 4-8 times as much of this as HBM. So, that gets you to 64-512 GB (though, at an order of magnitude slower than DRAM). Maybe a couple TB, at a stretch, but probably not for laptops.
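
    As a toy illustration of that kind of page-fault-driven tiering (purely a sketch - a real implementation would live in the kernel or use something like userfaultfd), you can fake it from user space with mprotect() and a SIGSEGV handler: protect the "fast tier" pages, and on each fault pretend to pull the page in from the slow tier before unprotecting it.

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long page_size;

    /* Toy fault handler: on first touch, "migrate" the page from the
     * slow tier (here: do nothing) and make it accessible.  Note that
     * mprotect() isn't formally async-signal-safe; fine for a demo. */
    static void fault_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        char *page = (char *)((unsigned long)info->si_addr
                              & ~(unsigned long)(page_size - 1));
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
    }

    int main(void)
    {
        page_size = sysconf(_SC_PAGESIZE);

        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* 16 pages of "fast tier" memory, initially inaccessible so
         * that every first touch traps into the handler above. */
        size_t len = 16 * page_size;
        char *region = mmap(NULL, len, PROT_NONE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED) { perror("mmap"); return 1; }

        region[0] = 1;               /* faults, handler unprotects page 0 */
        region[5 * page_size] = 2;   /* faults, handler unprotects page 5 */
        printf("touched two pages via the fault handler\n");

        munmap(region, len);
        return 0;
    }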

    So, in 20 years, maybe there's a path to a couple TB of what looks and feels something like RAM in your laptop. Whether there will be use cases that would justify the cost is another matter. In workstations, this might be more like 8 TB.
    Last edited by coder; 15 September 2019, 08:22 PM.



  • coder
    replied
    Originally posted by ThoreauHD View Post
    I hope not, because if that's what it's about, then they're well and truly fucked. This is a race to the cpu core, not some failed SSD crap dangling off side.
    I wouldn't call it crap. First gen wasn't as durable as they claimed and it's taking them longer to scale up densities, but I wouldn't count it out. It is much faster than NAND and still more economical & denser than DRAM. I think it definitely has a place in the storage hierarchy and both Intel & Micron are (independently) moving forward with the tech.

    Originally posted by ThoreauHD View Post
    We are at the end of the GHz race. 5 GHz is the cap, and it only gets worse from here with die shrinks. The new GHz is stacking everything as close as possible, as small as possible, with the least amount of heat as possible to the cpu core.
    I'm all for HBM2 or whatever, but it's pretty insane to talk about Terabytes of it stacked next to the CPU. That's not going to happen.

    And HBM2 isn't a simple substitute for frequency-scaling. It will barely help some workloads. For others, you'll get a one-time boost from more bandwidth or lower latency.

    But, again, it can't scale to server-level capacities. So, it's really not relevant for big memory use cases.
