Linux Developers Discuss Improvements To Memory Tiering


  • Linux Developers Discuss Improvements To Memory Tiering

    Phoronix: Linux Developers Discuss Improvements To Memory Tiering

    The Linux kernel already has initial support for tiered-memory servers, covering platforms like those with Intel Optane DC Persistent Memory, where pages can be promoted/demoted to slower classes of memory when the speedy system RAM is under pressure. But with more tiered-memory servers coming about, especially with HBM classes of memory, Google and other vendors are discussing better handling of Linux's tiered memory interface...


  • #2
    Fun times ahead.
    I would have liked to have CXL.mem-like capabilities ten years ago already ... SAP HANA-class machines are not all that cost effective



    • #3
      This is long overdue, and at the same time they need to exercise the same consideration for swap-backed memory performance tiers... i.e. ZRAM, PCIe SSD, SATA SSD, eMMC, HDD, SMR HDD...

      It makes a lot of sense to swap on the fastest devices first, then push cold pages to slower tiers as the faster tiers fill up. That's not what the kernel does. It blindly writes according to swap device priority, which results in priority inversion (hot pages to the slow device, cold pages to the fast device) when the slower device comes into use.

      How does that happen? Say you have 4GB RAM with 2GB ZRAM and 8GB disk swap. Your workload requires 4GB of anon memory and a minimum of 2GB of shared page cache for decent performance. Ideally you would have 6GB of RAM, but you have 4. Initially ZRAM fills with a 2:1 mix of cold and hot pages. With a 2:1 compression ratio, you end up with 1GB ZRAM, 1GB anon, and 2GB page cache in memory at any given point in time. Once ZRAM is full, pages start to spill to disk. The hot pages from ZRAM are loaded back into RAM and replaced with a 2:1 mix of cold and hot. Each time that happens, 2/3 of the hot pages migrate from ZRAM to disk.
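
      A rough back-of-the-envelope sketch of those numbers in Python, assuming the 2:1 compression ratio and 2:1 cold:hot mix above (the figures are illustrative, not measured):

      Code:
      # Toy arithmetic for the 4GB RAM / 2GB ZRAM / 8GB disk-swap example.
      # Assumptions from the post: 2:1 ZRAM compression, swapped-out pages
      # are a 2:1 mix of cold and hot.
      ram_gb      = 4.0   # physical RAM
      zram_gb     = 2.0   # uncompressed ZRAM swap capacity
      cache_gb    = 2.0   # page cache needed for decent performance
      compression = 2.0   # assumed ZRAM compression ratio

      zram_backing = zram_gb / compression              # RAM consumed by a full ZRAM: 1GB
      anon_in_ram  = ram_gb - zram_backing - cache_gb   # anon pages left in RAM: 1GB
      print(f"{zram_backing:.0f}GB ZRAM backing, {anon_in_ram:.0f}GB anon, "
            f"{cache_gb:.0f}GB page cache in RAM")

      # Once ZRAM is full, its hot pages fault back in and get re-swapped,
      # and roughly 2/3 of them land on disk each cycle, so the hot set
      # steadily drifts to the slowest tier.
      hot_in_zram = zram_gb / 3                         # 2:1 cold:hot mix
      print(f"~{hot_in_zram * 2 / 3:.2f}GB of hot pages move to disk per cycle")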

      In short order you're swapping entirely on disk, ZRAM is holding nothing but cold pages, and you'd have better performance without it, since 2GB of page cache + 2GB of anon would result in half as much swapping, or less.

      That's the kernel's swap management behaviour as designed. It's not a bug. It's just not very optimal.

      The workaround, as designed, is to periodically swapoff the ZRAM so that all swap is written to disk, then re-enable it, which then migrates the hot pages to ZRAM. This is expensive and bursty for IO, with related bursts of poor system performance. By having 8 small ZRAM volumes and performing these swapoff/swapon cycles one volume at a time, only when the least-full ZRAM volume is >96% full, that bursty behaviour can be kept reasonable.
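
      A minimal sketch of that rotation, assuming eight zram swap devices (/dev/zram0 .. /dev/zram7) are already set up as swap; the device names, the priority of 100 and the 96% threshold are all illustrative, and depending on the setup a real script might also reset each zram device before re-enabling it:

      Code:
      # Sketch of the rotating swapoff/swapon workaround described above.
      # Assumes /dev/zram0 .. /dev/zram7 are already configured as swap;
      # names, priority and the 96% threshold are illustrative.
      import subprocess

      def zram_swaps():
          """Return (device, used_kb, size_kb) for each zram entry in /proc/swaps."""
          entries = []
          with open("/proc/swaps") as f:
              next(f)                                   # skip the header line
              for line in f:
                  dev, _type, size, used, _prio = line.split()
                  if "zram" in dev:
                      entries.append((dev, int(used), int(size)))
          return entries

      def rotate_one():
          """If even the least-full zram device is >96% full, cycle the fullest one."""
          swaps = zram_swaps()
          if not swaps:
              return
          if min(used / size for _, used, size in swaps) > 0.96:
              victim = max(swaps, key=lambda s: s[1] / s[2])[0]
              # swapoff pushes this device's pages out to the remaining swap
              # (mostly disk, since the other zram devices are nearly full),
              # then swapon puts the now-empty device back in front.
              subprocess.run(["swapoff", victim], check=True)
              subprocess.run(["swapon", "-p", "100", victim], check=True)

      if __name__ == "__main__":
          rotate_one()    # run periodically, e.g. from a systemd timer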

      ZSWAP is not a reasonable alternative. On paper ZSWAP sounds good: it avoids the priority inversion ZRAM suffers from when used for tiered swapping. It performs well when there are only a small number of pages per second, but its CPU overhead rises quadratically when it's stressed, and unlike ZRAM, that overhead applies even when it spills to a second tier, making a bad situation far worse.

      Ideally when the kernel sees the high-priority swap is full, it should migrate the coldest block from the high-priority swap to a lower-priority one, then store the requested page in the high-priority swap. That's basically what they're going to do with tiered memory... they need to do the same with tiered swap. At the same time, they need to add support for compressed pages in disk swap so that they can perform that migration without first decompressing the cold pages!
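
      Nothing like this exists in the kernel today; a toy model of the policy being described (fastest tier first, the coldest page cascading down a level whenever a tier fills, new pages always landing in the fast tier) might look like this, with made-up Tier/Page types:

      Code:
      # Toy model of the tiered-swap policy described above -- NOT kernel code.
      # Tiers are ordered fastest-first; each holds (compressed) pages up to a cap.
      from dataclasses import dataclass, field

      @dataclass
      class Page:
          id: int
          last_access: int                  # lower = colder

      @dataclass
      class Tier:
          name: str
          capacity: int
          pages: list = field(default_factory=list)

      def swap_out(page, tiers):
          """Store the new page in the fastest tier, demoting the coldest page
          one level down whenever a tier is full, so hot pages never land on
          the slow device just because it has the next-lowest priority."""
          for tier in tiers:
              if len(tier.pages) < tier.capacity:
                  tier.pages.append(page)
                  return
              victim = min(tier.pages, key=lambda p: p.last_access)
              tier.pages.remove(victim)
              tier.pages.append(page)       # new (hot) page stays in the fast tier
              page = victim                 # demoted page cascades to the next tier
          raise MemoryError("all swap tiers are full")

      tiers = [Tier("zram", 2), Tier("nvme", 4), Tier("hdd", 8)]
      for t, pid in enumerate([1, 2, 3, 4, 5]):
          swap_out(Page(pid, last_access=t), tiers)
      print([(t.name, [p.id for p in t.pages]) for t in tiers])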
      Last edited by linuxgeex; 03 May 2022, 03:07 PM.



      • #4
        Originally posted by linuxgeex View Post
        This is long overdue, and at the same time they need to exercise the same consideration for swap-backed memory performance tiers... i.e. ZRAM, PCIe SSD, SATA SSD, eMMC, HDD, SMR HDD...
        Sounds good and is really needed. The problem with swap is that it always has to be copied to RAM to be used; it's impossible to use it directly.

        So this is a completely different problem from tiered memory. It just sounds similar.



        • #5
          Originally posted by flower View Post

          Sounds good and is really needed. The problem with swap is that it always has to be copied to RAM to be used; it's impossible to use it directly.

          So this is a completely different problem from tiered memory. It just sounds similar.
          You're right that swap != memory, even when it is backed by a memory device like NVDIMM or NVMe.

          However you're wrong about copying. With DMA the device can write it to RAM without it going through the processor, so then it behaves more like paged memory, with a higher activation latency. This has been the case since the 1980s with SCSI devices supporting DMA, but it wasn't until NVMe that device latency got so low that the CPU overhead from copying became more than a 0.01% performance hit.

          Anyhow, my point is that if they're on a mission to optimise memory latency, there are large gains to be had from optimising tiered swap's impact on memory latency.
          Last edited by linuxgeex; 09 May 2022, 03:03 AM.



          • #6
            Originally posted by linuxgeex View Post
            However you're wrong about copying. With DMA the device can write it to RAM without it going through the processor, so then it behaves more like paged memory, with a higher activation latency.
            That's still a copy? Just because the CPU isn't involved doesn't mean it's not a copy.

            Anyway, we do agree that some kind of tiered swap support would be really nice to have.



            • #7
              Originally posted by flower View Post
              That's still a copy? Just because the CPU isn't involved doesn't mean it's not a copy.
              Not involving the CPU makes it orders of magnitude faster if the accesses are sparse, because it doesn't deplete the CPU caches. So from a performance perspective it's very unlike the sort of copy that people would insist makes it different from paged memory. That is what I mean. I didn't say that it creates more RAM, lol; I said that it is more like paged memory than "copied" memory.

              Of course that goes in the toilet if the memory is stored on a LUKS volume or with degraded software RAID.

              In many senses it literally is paged memory when the backing device is an MTD (memory technology device) like an SSD, and even more so when the device can expose that memory over the PCIe bus as directly addressable memory. Windows, Xbox, and PlayStation effectively have DRM drivers using the IOMMU to stream assets from PCIe to the GPU across the PCI bus.

              I can't wait to see how that gets used in the Linux kernel, e.g. Redis on a PCIe-backed mmap() - over a certain size the kernel streams the memory region - fulfilled by the PCIe device - to the network device via the network driver's DMA engine. I suspect we'll see it implemented in the kernel before the end of next year.

              I don't know whether mobo vendors will add more PCIe hosts to accelerate this, or if they'll defer to CXL, but we will definitely see this as a way to get DPU-like performance on traditional mobos, especially with Nvidia having already put out gRAID hardware.
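
              To make the paged-memory point concrete: a file-backed mmap() is already demand-paged from the backing device rather than copied up front. A minimal Python sketch, with a hypothetical path on an NVMe-backed filesystem:

              Code:
              # File-backed mmap(): pages fault in from the backing device
              # (e.g. an NVMe SSD) on first touch rather than being copied
              # into RAM eagerly. The path below is hypothetical.
              import mmap

              f = open("/mnt/nvme/dataset.bin", "rb")
              m = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)

              # Nothing beyond metadata has been read yet; indexing a byte
              # page-faults and the kernel pulls in that page (plus some
              # readahead) from the device via DMA.
              print(m[0], m[len(m) // 2])

              m.close()
              f.close()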
              Last edited by linuxgeex; 13 May 2022, 01:03 PM.

