Apple M1 Ultra With 20 CPU Cores, 64 Core GPU, 32 Core Neural Engine, Up To 128GB Memory
Originally posted by coder View Post
WTF?
PCIe is generally not a bottleneck. Analyses of GPUs in PCIe 4.0 x16 slots vs. x8 or PCIe 3.0 x16 show very little performance difference in the vast majority of games. This is one reason I think Intel's move to PCIe 5.0 (on the desktop) was silly.
I agree that today's GPUs are not bandwidth limited, though partly that's because the bandwidth is hard to use. Games generally load 100% of their textures and objects into the GPU ahead of time, because the interface is crude (not cache coherent). With a smarter interface you could use the local VRAM as a cache for main memory: instead of crashing, your game would just run gradually slower as the cache hit rate drops. Being cache aware would enable useful interactions that churn through much more bandwidth. Most of today's games upload textures and some 3D (immutable) objects once, then each frame upload the eye position, field of view, eye direction, and whatever triangles are unique to that frame. Thus they are relatively insensitive to bandwidth.

However, with increasingly interactive worlds, mutable objects, complex effects, etc., it becomes attractive to drop texture limits, or even use the GPU for game physics, or AI/ML for smarter opponents. Much of this isn't possible (or is at least harder) with today's GPUs, but it's coming. CXL is a standard that allows accelerators (including GPUs) to be memory coherent. Logic and data could float between CPU and GPU by passing a pointer, taking full advantage of whatever bandwidth and caching there is. Today the GPU might as well be on the other end of a fast network connection for how separate it is from the CPU/RAM system. Zen 4 and near-future Intel cores (forget the name) will support CXL.
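A rough way to see the "VRAM as a cache" point, i.e. graceful degradation instead of crashing: model average throughput as the hit-weighted mix of on-card bandwidth and link bandwidth. The figures below are illustrative assumptions, not measurements of any real card.

```python
# Illustrative model: effective bandwidth if VRAM acted as a cache in
# front of host memory. Both bandwidth figures are assumptions.
VRAM_BW = 400.0   # GB/s, on-card memory (assumed)
LINK_BW = 32.0    # GB/s, PCIe 4.0 x16-class link (assumed)

def effective_bw(hit_rate):
    """Average bandwidth when a fraction hit_rate of accesses are served
    from VRAM and the rest cross the link (latency ignored)."""
    # Time per byte is the hit-weighted average of the two per-byte costs.
    t = hit_rate / VRAM_BW + (1 - hit_rate) / LINK_BW
    return 1 / t

for h in (1.0, 0.95, 0.8, 0.5):
    print(f"hit rate {h:.2f}: {effective_bw(h):6.1f} GB/s")
```

Even a modest miss rate pulls the average down sharply, but the game keeps running, which is exactly the behavior a coherent interface would buy over today's hard VRAM limit.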
Originally posted by coder View Post
As for sending stuff "manually", I'm pretty sure that's nonsense. The way I think it's usually done is to send the device a command to fetch data at a given DRAM address, and it does the fetch. Furthermore, the commands are queued. So, you just queue it up and move on. You obviously don't want to waste a CPU thread doing "PIO" copies. This was a huge no-no in the early days of PCIe, when systems had only 1 or 2 cores. So, it can't have been the case that most copies were the equivalent of a CPU thread doing a memcpy() into GPU memory.
Heck, even old-school PCI had bus-master DMA transfers, and that was the only way to get anywhere close to the theoretical throughput of the bus. Back in the days of the ISA bus, PC chipsets had DMA engines that you could program to do the copying for you. However, I think they were pretty slow and therefore relegated mostly to things like sound.
Sure, do people write games that run great with today's GPUs? Definitely. But there's great potential in GPUs that goes underutilized because they're a limited environment: no Linux kernel (or any full OS), no virtual memory, no cache coherency, etc. Future CXL-attached GPU cards will fix this, and the Apple iGPU does it today.
Originally posted by coder View Post
Not any more. These days, to get the highest clock speed on lightly-threaded workloads, you have to buy a CPU with more cores than you might actually want.
What's lower is the "base clock" speed, which is essentially the minimum clock speed when all of the cores are running. More cores -> more heat & power, so they run at a lower clock speed to lessen the impact.
Originally posted by BillBroadley View Post
Well my point is that x86s are power/heat/cooling limited, thus lower performance than you'd expect when using all cores, which is generally going to be the base CPU speed. So sure, an expensive (well over $1000) Threadripper Pro 5995WX with 64 cores looks great on a single thread at 4.5 GHz (which is BTW less than a cheap 5600X @ 4.6 GHz) but degrades to 2.7 GHz when using all cores.
It's a bit sticky, too, as it will depend on which CPU you pick. Some have increased base and boost clocks on the higher-core part, e.g. 5600X (3.7GHz base, 4.6GHz boost) vs. 5800X (3.8GHz base, 4.7GHz boost), while others have a slight drop in base but a larger jump in boost, e.g. 5950X (3.4GHz base, 4.9GHz boost).
Another example comes with the P/E-core options, e.g. 12600K (3.7GHz base, 4.9GHz boost) vs. 12900K (3.2GHz base, 5.2GHz boost)... which should still translate to higher single-thread performance on the higher core count CPU.
Originally posted by coder View Post
No, that's not true. First, modern CPU cores have hardware prefetchers that detect data-access patterns and try to have the data in cache before the CPU even asks for it. Second, memory reads are prioritized by the out-of-order scheduler and a core can have quite a few of these outstanding at any given time.
Third, memory channels are traditionally interleaved. I don't honestly know if this is still the case, but I think up through the DDR3 era, most CPUs would map 64 bits to one memory channel and the next 64 bits to the other. Maybe they actually interleaved on cache line boundaries, though; that would let the channels run more independently. Interestingly, I have an old AMD Phenom II board with a BIOS option to interleave them at page boundaries.
I do know that DDR5 has more channels. Each DDR5 DIMM has 2x 32-bit memory channels. So, if they're interleaved, then it must be at intervals no less than a cacheline.
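The interleaving-granularity question above can be made concrete with a toy address-to-channel mapping. The line size, channel count, and granularities here are assumptions for illustration, not a description of any particular memory controller.

```python
# Toy model: which channel serves a given physical address under
# interleaving at a chosen granularity (values assumed for illustration).
CACHELINE = 64      # bytes per cache line (assumed)
CHANNELS  = 2       # memory channels (assumed)

def channel_for(addr, granularity=CACHELINE):
    """Map a physical address to a channel index."""
    return (addr // granularity) % CHANNELS

# Cacheline-granularity interleaving: adjacent lines alternate channels.
print([channel_for(a) for a in (0, 64, 128, 192)])   # -> [0, 1, 0, 1]
# Page-granularity interleaving: a whole 4 KiB page stays on one channel.
print([channel_for(a, 4096) for a in (0, 64, 4096)]) # -> [0, 0, 1]
```

With 64-bit (8-byte) granularity, both channels would be touched by every single cache line fill; with cacheline or page granularity, each fill is satisfied by one channel and the channels can serve independent requests.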
Similarly, out-of-order schedulers can have multiple outstanding requests, and if those are answered by cache, that's great. But nothing about the scheduler can hide the impact of having only a few channels. Sure, the CPU might well have 100 memory requests pending, but if you have 2 channels then exactly two can be making progress at a time; all the other requests have to wait behind those 2.
Your 3rd point is a bit dated. Yes, memory used to be striped across channels, generally to increase bandwidth, so a single cache line (usually on the order of 64 bytes, give or take) would hit all channels. However, as the CPU vs. memory latency divide grew ever larger (it's been steadily worsening since the first CPUs), more complexity arrived: multiple levels of cache, prefetching, more sophisticated branch prediction, out-of-order execution, etc.

More recently, operating systems have become aware of multiple channels, multiple sockets, and even different latencies within a single chip; on some Epycs, for instance, some cores are closer to some memory channels than others. Programming languages and libraries are also aware, so you can use a topology/locality-aware memory allocator instead of plain malloc. I first noticed these changes around the time of the Intel Pentium 4, when BIOSes would offer a memory striping option for older OSs that were not NUMA aware. These days OSs, BIOSes, and related software generally default to NUMA aware. However, with or without NUMA or striping, nothing hides how many memory transactions you can have in flight, which is 1 per memory channel: you send a row and column, and get a cache line back. Generally, with a modern OS and most workloads, the NUMA-aware arrangement (one cache line request per channel) is fastest.
DDR5 DIMMs are still 64 bits wide, but offer 2x 32-bit channels, which is a nice improvement over DDR4's single 64-bit channel. Interleaving isn't on by default, and I don't think most modern motherboards support striping, though it's possible. Generally it's a big loss: instead of having 2 requests per DIMM in flight you have one, so you get half the throughput in most cases and the same performance only for a purely sequential workload.
Originally posted by BillBroadley View Post
I think the PCIe v5 on Intel was partially to upstage AMD for bragging rights, but does help with functionality like faster M.2 slots,
The whole argument gets even more ridiculous when you look at the amount of thermal throttling going on in PCIe 4.0 M.2 drives. With PCIe 5.0 doubling the clock rate yet again, it's only going to get hotter. At that point, thermal throttling could easily wash out any performance gains from PCIe 5.0, other than maybe in tiny, isolated bursts.
IMO, the whole thing is pretty stupid. It took over a year for PCIe 4.0 M.2 drives to hit the market that were meaningfully faster than the PCIe 3.0 top performers. Should we expect the PCIe 5.0 situation to be any different? I don't.
Originally posted by BillBroadley View Post
easier handling of faster network connections (10, 25, 100, and 200Gbit are increasingly common),
Ethernet Speed   PCIe 3.0 Lanes   PCIe 4.0 Lanes   PCIe 5.0 Lanes
10 Gbps          x2               x1               x1
25 Gbps          x4               x2               x1
40 Gbps          x8               x4               x2
100 Gbps         x16              x8               x4
200 Gbps         x32              x16              x8
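Those lane counts follow directly from per-lane throughput rounded up to the next power-of-two slot width. The per-lane rates below are approximate (PCIe 3.0 runs 8 GT/s with 128b/130b encoding, ~7.88 Gbps usable per lane; 4.0 and 5.0 each double it).

```python
import math

# Approximate usable throughput per lane, in Gbit/s.
LANE_GBPS = {"3.0": 7.88, "4.0": 15.75, "5.0": 31.5}

def lanes_needed(eth_gbps, gen):
    """Smallest power-of-two lane width that can carry the link rate."""
    raw = eth_gbps / LANE_GBPS[gen]
    return max(1, 2 ** math.ceil(math.log2(raw)))

for speed in (10, 25, 40, 100, 200):
    widths = [f"x{lanes_needed(speed, g)}" for g in ("3.0", "4.0", "5.0")]
    print(f"{speed:>3} Gbps: {widths}")
```

Running it reproduces the table above; e.g. 100 Gbps needs x16 on PCIe 3.0 but only x4 on PCIe 5.0.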
The problem is that Intel provided support only for bifurcating the x16 link to dual x8. And at x8, you could already do 100 Gbps. In fact, Alder Lake has a PCIe 4.0 x8 link to its chipset. So, you could already put a 100 Gbps adapter in a slot hanging off that link, easily hit 100 Gbps (bidir) with room to spare, and not touch any of the PCIe 5.0 lanes.
Now, depending on what you're doing with it, that might not make sense. If you want to transfer files to/from the machine, then you're going to hit a bottleneck on storage; you'd have to RAID-0 at least 2 of the fastest SSDs to get that kind of throughput. However, maybe you're just editing videos off an NFS mount.
Still, what's undeniable is that only the 200 Gbps corner case is really served by what Intel did here. Everything below that could've been accommodated by Rocket Lake and its PCIe 4.0.
Originally posted by BillBroadley View Post
and of course future CXL capable GPUs.
Originally posted by BillBroadley View Post
Keep in mind the Alder Lake cores in today's desktops will also make it into Xeons for server-type duty.
Originally posted by BillBroadley View Post
PCIe v4 can be limiting. I've got an AMD motherboard for a home file server; the south bridge is pretty limited (even though it's the X570 chipset, with twice the bandwidth). If I put in 2x M.2 drives, a 2x10G card, and 6x SATA, it ends up significantly oversubscribed.
Now, when talking about how much bandwidth those 6 SATA ports are really using, we have to consider what's connected to them. If it's HDDs, then it's probably generous to put the max at 300 MB/sec each. That's still 1.8 GB/sec aggregate, which is about 0.9 of a PCIe 4.0 lane.
So, you can contrive some scenario where the HDDs, NVMe drives, and NICs are all reading at full speed, and maybe you're oversubscribed by 25% or so. However, you said it's a server, which means the traffic flowing over the dual 10 Gbps NICs is going in the opposite direction from the storage. So, no, I'm pretty sure you're not oversubscribed in actual practice.
And the point doesn't apply to Intel platforms since Rocket Lake, because their chipset link is PCIe 4.0 x8, whereas AMD's is PCIe 4.0 x4.
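The back-of-envelope arithmetic behind that exchange can be sketched as below. All of the device figures are assumptions for illustration (drive and NIC models weren't specified), so the exact oversubscription percentage will vary; the structure of the argument is what matters.

```python
# Rough check of the chipset-uplink oversubscription claim, in GB/s.
# Per-device figures are assumptions, not measurements.
PCIE4_LANE_GBS = 1.97              # usable throughput per PCIe 4.0 lane
chipset_link = 4 * PCIE4_LANE_GBS  # X570-style x4 uplink, ~7.9 GB/s

worst_case_demand = {
    "2x NVMe reading flat out": 2 * 3.5,   # assumed ~3.5 GB/s drives
    "2x 10GbE transmitting":    2 * 1.25,  # 10 Gbps = 1.25 GB/s each
    "6x SATA HDD reading":      6 * 0.3,   # generous 300 MB/s per HDD
}
total = sum(worst_case_demand.values())
print(f"uplink {chipset_link:.1f} GB/s vs. worst-case demand {total:.1f} GB/s")
```

On paper the demand exceeds the uplink, but only if every device moves data in the same direction at once; on a real file server the NICs are sending what the disks are reading, so the flows largely don't stack.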
Originally posted by BillBroadley View Post
So games generally have 100% of textures and objects inside the GPU ahead of time, because of the crude (not cache coherent) interface. With a smarter interface you could use the local VRAM as a cache for main memory and instead of crashing your game would run gradually slower as the cache hit rate drops.
Now, where you can most readily see this impact is in some of the benchmarks for AMD's Radeon RX 6500XT. When people stressed its 4 GB of onboard VRAM by cranking up resolution and detail level, it started having to page in lots of assets from host memory. And if you'd put this x4 card in a PCIe 3.0 slot, performance suffered badly. However, that's PCIe 3.0 x4, not PCIe 5.0 x8 or x16.
Originally posted by BillBroadley View Post
Most of today's games upload textures, some 3d
Originally posted by BillBroadley View Post
with increasingly interactive worlds, mutable objects, complex effects, etc it's attractive to not have texture limits
Originally posted by BillBroadley View Post
Generally GPU memory is mapped into the main memory address space, but cache is disabled for that range. DMA is possible, but is somewhat complex, and setting up the DMA adds latency, but is more CPU friendly, and typically lower bandwidth. Depending on your needs and the size of the transfer you may also do a direct memcpy, which is lower latency, higher bandwidth, and more CPU intensive.
Originally posted by BillBroadley View Post
But this is way more painful than just using a pointer. If you have multiple CPU cores it's pretty straightforward to use multiple threads and/or multiple processes to get things done.
Originally posted by BillBroadley View Post
GPU cards will fix this in the future and the Apple iGPU does this today.
I don't believe the world is going to fundamentally change the way GPUs are used. To scale performance, you're always going to have GPUs that aren't directly connected to the same RAM as the host CPU(s), and embracing APIs or programming styles that fundamentally assume otherwise will torpedo that scalability. So, most GPU programming of the future is likely to continue down the path it has been on. Work queues are a fundamental construct of scalable, parallel performance, whether CPU, GPU, or hybrid. What you absolutely don't want is for cores to waste precious time blocking on stuff.
Also, please fact-check yourself, before you post, so that others don't have to do it for you. You'll learn more that way, and others won't have to waste time doing it.
Originally posted by BillBroadley View Post
So sure an expensive (well over $1000) Threadripper pro 5995WX with 64 cores looks great on a single thread at 4.5 GHz (which is BTW less than a cheap 5600x @ 4.6 GHz) but degrades to 2.7 GHz when using all cores.
Originally posted by BillBroadley View Post
So if your DIMM latency is 60ns and you have 2 channels, you can handle exactly 2 cache misses at a time, each waiting 60ns for its result. So if 8 cores all miss at once, the last ones wait 240ns (4 rounds x 60ns). I've done significant exploration in this space and have written several micro-benchmarks to explore it.
This is also where SMT helps out. If one thread is blocked on a L3 cache miss and the speculative execution engine can't find any more work to do, then at least the other thread(s) sharing the core will tend not to be blocked. CPUs like Ampere's Altra take the other approach of just having lots of smaller cores, so that some can make progress while others are blocked. Apple's approach of a small number of big cores basically means they need to minimize memory latency, because they have neither SMT nor tons of siblings to cover their cache misses.
BTW, it's possible to find heavily memory-bound workloads, but I've not generally seen them in my career. Even things I expected to benefit from more memory bandwidth/channels were hardly impacted. On desktop platforms, the main beneficiary of better memory performance tends to be iGPUs. It's really not until desktop CPUs broke the 8-core barrier that we started to see significant memory bottlenecks on CPU-bound workloads.
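The serialization in Bill's 60ns/240ns example is simple to write down: misses queue up in rounds of one per channel. The latency and channel count below are the same illustrative figures from his post, not measured values.

```python
import math

# Worst case: N cores all miss every cache at the same instant, and only
# one request per channel can be in flight (simplifying assumption).
def worst_case_wait_ns(misses, channels=2, latency_ns=60):
    """Time until the last miss is served, in rounds of `channels` misses."""
    rounds = math.ceil(misses / channels)
    return rounds * latency_ns

print(worst_case_wait_ns(2))   # -> 60
print(worst_case_wait_ns(8))   # -> 240
```

This is also a compact way to see why SMT and many-small-core designs help: the core stalled for those rounds can hand execution to a sibling thread instead of idling.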
Originally posted by coder View Post
Helped by the fact that it's LPDDR5. Anyway, lots of the bandwidth is really there for the GPU. If you're running a graphics-heavy workload, the CPU cores won't get as much.
That's not generally supported by benchmarks. Sure, you never get linear scaling, but AMD and Intel typically add enough memory bandwidth (and cache) so the cores aren't terribly starved.
The more apt comparison would be with a laptop GPU.
If that's all that software is designed to use, then the extra memory doesn't help you.
Its CPU cores can't use all of that bandwidth! It's mainly for the GPU. It would therefore make more sense to compare it with the memory specs of comparable GPUs.
Even ignoring the Apple iGPU, the M1 Max and its memory system are impressive. Anandtech mentioned the CPUs on the M1 Max were able to hit a maximum of 240GB/sec (out of a 400GB/sec peak), and over 100GB/sec from a single core.
For comparison, the AMD Epyc 7763 (the latest and greatest from AMD), with 8 DDR4-3200 channels, only managed 111GB/sec (as measured by Anandtech) using all cores, less than half of the M1 Max, and the Epyc 7763 alone (not including RAM, motherboard, etc.) has a TDP of 280 watts.
So the latest and greatest (and generally considered ahead of Intel) AMD requires 64 cores, 280 watts, and 8 channels to hit 111GB/sec observed, while an Apple thin-and-light laptop with 10 cores and much less power gets over twice that memory bandwidth (for CPUs only, ignoring GPUs). As impressive as that is, it's likely half of what the M1 Ultra can do, though that's somewhat speculative until the detailed reviews come out.
Originally posted by drakonas777 View Post
The perceived "horrible power efficiency" of x86 is partially caused by idiotic factory settings for the CPU/platform. I put my 3700X into its 45W ECO mode, and it lost only ~15% performance while consuming about half the power. You should remember that Intel/AMD strive for maximum performance per square millimetre of silicon, especially in the consumer market, for silicon economy and THE BENCHMARKS, of course.