Linux Kernel Prepares For Intel Xeon CPUs With On-Package HBM Memory
CXL is a huge enabling technology, which makes me wonder why AMD won't bring it to AM5 at launch with PCIe 5, as they will stay at PCIe 4.
AnandTech had an article on it. AMD is apparently concentrating on their Infinity Architecture for HPC systems. CXL doesn't solve all-ways direct GPU-to-GPU connections with symmetric coherency; some discussion there mentioned that requiring 6 links per GPU. They are apparently boosting PCIe 4 up to 25 GHz as part of their solution. Some of the discussion was about thermal issues, but perhaps they are just focusing on that symmetric coherency solution for now.
CXL will enable mapping a GPU's HBM into the CPU's memory space; the CPU would access it through the L3 cache. We'll probably get some explanation of this at the Hot Chips presentations on Sapphire Rapids and Ponte Vecchio in August.
CXL would be a huge bottleneck for HBM. A 32 or 64 GB/s link would be nothing in the context of 2 TB/s.
The HBM would be directly attached to the XPU for full bandwidth, but yes, if the CPU must access it, it will be limited by the PCIe bus.
However, if CXL moves on-chip, the bus widths can be large.
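To put that mismatch in rough numbers, here is a quick back-of-envelope sketch (the PCIe 5.0 x16 rate and the 4-stack HBM2e figure are my own assumptions, not anything from the thread):

```python
# Back-of-envelope comparison of a CXL/PCIe link vs. direct-attached HBM.
# All figures are rough assumptions, not vendor specs.

PCIE5_GTS = 32          # PCIe 5.0 transfer rate per lane, GT/s
LANES = 16              # a full x16 CXL link
# 128b/130b encoding overhead (~1.5%) is ignored for this rough estimate
pcie5_x16_gbs = PCIE5_GTS * LANES / 8          # GB/s per direction

HBM_STACKS = 4
HBM_STACK_GBS = 460     # roughly an HBM2e stack on a 1024-bit bus
hbm_gbs = HBM_STACKS * HBM_STACK_GBS

print(f"PCIe 5.0 x16: {pcie5_x16_gbs:.0f} GB/s")   # -> 64 GB/s
print(f"4x HBM2e:     {hbm_gbs:.0f} GB/s")         # -> 1840 GB/s
print(f"ratio:        ~{hbm_gbs / pcie5_x16_gbs:.0f}x")
```

Even before protocol overhead, the link is more than an order of magnitude slower than the memory behind it.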
> AnandTech had an article on it. AMD is apparently concentrating on their Infinity Architecture for HPC systems. CXL doesn't solve all-ways direct GPU-to-GPU connections with symmetric coherency. [...]
Thanks for the link! Hm, I got the impression that CXL also solves GPU-to-GPU coherency problems; here is a link discussing just that (albeit speculatively, as Intel wasn't talking explicitly about it at the time, but the technology should allow it): https://wccftech.com/intel-xe-coherent-multi-gpu-cxl/
That article also talks about the implications of Intel CPU + Xe GPU over CXL, which would give Intel a huge advantage. I'd like to see benchmarks of such a system in action.
> Thanks for the link! Hm, I got the impression that CXL also solves GPU-to-GPU coherency problems [...]
CXL uses a host (usually a CPU) to maintain coherency for some number of XPU (slave) devices. It doesn't define any peer-to-peer protocol between XPUs, as CCIX does. However, an XPU can access another XPU's memory over CXL if the access goes through the host CPU's L3.
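As a toy sketch of that topology (class and method names are hypothetical, not anything from the CXL spec): the host is the only routing point between devices, so an XPU-to-XPU access always takes a host hop:

```python
# Toy model of CXL-style host-managed coherency (hypothetical names,
# not the actual protocol): XPUs expose their HBM through the host, and
# there is no peer-to-peer path -- a cross-device access always
# traverses the host (and its L3/home agent).

class XPU:
    def __init__(self, name, size):
        self.name = name
        self.hbm = [0] * size        # device-local HBM, exposed via CXL.mem

class Host:
    """Home agent: the only routing point between attached devices."""
    def __init__(self):
        self.xpus = []               # list of (xpu, base address)
        self.host_hops = 0           # count traversals through the host

    def attach(self, xpu, base):
        self.xpus.append((xpu, base))

    def load(self, requester, addr):
        # Every access, even XPU-to-XPU, is resolved by the host.
        for xpu, base in self.xpus:
            if base <= addr < base + len(xpu.hbm):
                self.host_hops += 1
                return xpu.hbm[addr - base]
        raise ValueError("unmapped address")

host = Host()
a, b = XPU("A", 16), XPU("B", 16)
host.attach(a, 0x000)
host.attach(b, 0x100)
b.hbm[3] = 42
# XPU A reads XPU B's memory: routed through the host, not peer-to-peer.
print(host.load(a, 0x103))   # -> 42
print(host.host_hops)        # -> 1
```

The hop counter is the point: in this model no load ever completes without going through the host, which is exactly why symmetric GPU-to-GPU coherency needs something else.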
The best presentation I've seen for CXL 1.1 is here:
> The CXL interface adds both a memory and a caching protocol between a host CPU and a device. The Memory Protocol enables a device to expose a memory region to ...
The presentation shows caches on the host associated with CXL and connected to the L3 and the host's Home Agent.
Been there, done that. Knights Landing Xeon Phi had 16 GB of MCDRAM in-package, which could be directly addressed or used as L3 cache. That launched in 2016. You could drop it in the same socket as Skylake-SP Xeons and supposedly boot a mainline Linux kernel on it.
I wonder who will make an iGPU-equipped CPU with HBM first... The colossal bandwidth afforded would mean that every thread of an 8c/16t CPU could receive 64 GB/s and still leave 1 TB/s available to a GPU (assuming 4x 2048-bit @ 2 GHz).
Stick the whole package on a motherboard, and you could have a full workstation in less space than a six-pack.
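The arithmetic behind that split checks out; the stack count, bus width, and transfer rate below are just the assumed 4x 2048-bit @ 2 GHz configuration, not any real part:

```python
# Sanity check of the bandwidth split: 4 HBM stacks x 2048-bit @ 2 GT/s
# feeding an 8-core/16-thread CPU plus an iGPU (all figures assumed).

STACKS = 4
BUS_BITS = 2048            # per-stack interface width (assumed)
RATE_GTS = 2               # effective transfers per second, GT/s

total_gbs = STACKS * (BUS_BITS // 8) * RATE_GTS   # overall GB/s

threads = 16               # 8 cores with SMT
per_thread_gbs = 64
cpu_share = threads * per_thread_gbs              # CPU side of the split
gpu_share = total_gbs - cpu_share                 # what remains for the iGPU

print(total_gbs)    # -> 2048 (about 2 TB/s)
print(gpu_share)    # -> 1024 (about 1 TB/s left for the GPU)
```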
I've been thinking about this for many years. Maybe since around when AMD launched Fury.
I think it probably makes sense for some small, high-end laptops, but that's about it. For anything bigger, you're just better off with a dGPU. The added cost and capacity constraints of HBM just aren't worth it.
The real kicker comes when you pair it with Optane DIMMs. You could configure those as swap and then 16 GB really wouldn't feel restricting. Plus, you'd get instant hibernate/wakeup.
Intel's Kaby Lake-G had their cpu, AMD's GPU and some HBM in one package.
Yeah but no. It's a wholly different animal. That was basically just a dGPU + its HBM mounted on the same substrate as the CPU die, but they were otherwise separate. The CPU still had its own normal DDR4.
The only way they were more closely linked than if the CPU and GPU had been in separate packages was the load-balancing of power utilization.
> CXL will enable mapping of GPU's HBM into CPU memory space. CPU would access it through L3 cache.
It's going to be slower and more energy-intensive than accessing direct-attached RAM.
I think CXL-based memory modules will be used for special use-cases, such as when you specifically want a memory pool shared between accelerators, or maybe in a storage hierarchy above flash. I don't foresee it replacing direct-attached DRAM.