Apple M1 Ultra With 20 CPU Cores, 64 Core GPU, 32 Core Neural Engine, Up To 128GB Memory
BillBroadley, here's more on OpenCL Shared Virtual Memory. I omitted it from the post above, because inserting too many links will typically get your post withheld for moderation.
"One of the remarkable features of OpenCL™ 2.0 is shared virtual memory (SVM). This feature enables OpenCL developers to write code with extensive use of pointer-linked data structures like linked lists or trees that are shared between the host and a device side of an OpenCL application. In OpenCL 1.2, the specification doesn't provide any guarantees that a pointer assigned on the host side can be used to access data in the kernel on the device side or vice versa. Thus, data with pointers in OpenCL 1.2 cannot be shared between the sides, and the application should be designed accordingly, for example, with indices used instead of pointers. This is an artifact of a separation of address spaces of the host and the device that is addressed by OpenCL 2.0 SVM."
https://www.intel.com/content/www/us...-overview.html
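The index-based workaround that quote alludes to can be illustrated with a short sketch (plain Python, not OpenCL API code; the function names are invented for illustration): instead of sharing pointer-linked nodes, the host flattens the structure into arrays of values and next-indices, which mean the same thing in any address space.

```python
# Sketch: why OpenCL 1.2 code shares indices instead of pointers.
# A pointer-linked list is only meaningful in the host's address space,
# but an array index means the same thing on host and device.

def flatten_linked_list(head):
    """Convert a pointer-linked list into (values, next_index) arrays."""
    values, next_index = [], []
    node = head
    while node is not None:
        values.append(node["value"])
        # -1 plays the role of a NULL "pointer" both sides understand.
        next_index.append(len(values) if node["next"] is not None else -1)
        node = node["next"]
    return values, next_index

def walk(values, next_index, start=0):
    """Traverse the flattened list the way a kernel would, via indices."""
    out, i = [], start
    while i != -1:
        out.append(values[i])
        i = next_index[i]
    return out

# Build a small pointer-linked list: 3 -> 1 -> 4
head = {"value": 3, "next": {"value": 1, "next": {"value": 4, "next": None}}}
vals, nxt = flatten_linked_list(head)
print(walk(vals, nxt))  # [3, 1, 4]
```

With OpenCL 2.0 SVM, the flattening step disappears: the same pointers are valid on both sides.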
Originally posted by BillBroadley: DMA sounds great, but after talking to engineers inside PathScale, QLogic, Intel, and Mellanox, it ends up being much more complicated than you might think; performance limitations, bugs, special cases, etc. are common when you have to support a wide variety of platforms.
Originally posted by BillBroadley: I read a piece by (pretty sure) John Romero making an impassioned plea for AMD to build a cache-coherent GPU that connects to their cache-coherent interconnect (HyperTransport), and it resonated with me.
That's not to say GPUs don't need tons of stuff from the host, but the more you learn about modern GPU APIs, the more you see how oriented they are towards generating as much locally on the GPU as possible.
Originally posted by BillBroadley: I can also assure you that HPC folks are continuously begging for interconnects to be closer to the CPU instead of on the other end of PCIe.
Originally posted by BillBroadley: What? You are saying that a GPU can have a cache miss, page fault, and ask the system for that page? That's not at all my understanding of this. It's been wished for, but isn't how it worked. Maybe you don't mean the same thing when you say page?
Originally posted by BillBroadley: The VRAM is indeed a cache for system memory, but that system memory isn't shared with the CPU,
Originally posted by BillBroadley: So you can't pass pointers between CPU and GPU.
Originally posted by BillBroadley: Carefully managing limited resources like video memory is a consequence of being on a PCIe bus instead of a fast, low-latency connection like the one that exists between CPU sockets.
Given that, I think work queues will always be the preferred structure for CPUs to communicate with GPUs. Maybe not vice versa, since GPUs are so much better at latency-hiding.
Originally posted by BillBroadley: Today's GPUs are not particularly bandwidth-limited,
Originally posted by BillBroadley: Want to take a guess on the latency to get one cache line from system RAM for a CPU vs a GPU today?
Originally posted by BillBroadley: I admit I've not been tracking PCIe 5.0 support, but like DDR5 it's a chicken-and-egg problem. Intel needs to ship something to prod the ecosystem into scaling up products to take advantage of it.
PCIe 5.0 was already happening in the server market, where it's actually needed. Several months ago, Amazon announced that their next-gen Graviton CPUs have PCIe 5.0 and had already been in service for months at the time. Other server processors have announced it, including Intel's Sapphire Rapids. So, that was already a fait accompli. And server CPUs with PCIe 5.0 would've been enough for ecosystem partners to do interop testing. They didn't need desktop CPUs for that.
Originally posted by BillBroadley: I'm hoping Intel's shipping of DDR5 helps decrease DDR5 prices as volumes increase
Second, I think they got out ahead of the market, though it'd have been difficult to time perfectly, especially given the supply chain madness of late. At least they kept DDR4 as an option. Full marks for that!
Originally posted by BillBroadley: I've seen various NVMe drives mentioning PCIe 5,
Originally posted by BillBroadley: I wanted 2x FireCuda 530s; they peak around 7 GB/sec sequential, and only one of the two is directly connected to the CPU; the other is on the south bridge. 10G Ethernet manages 1 GB/sec if you are lucky, and only gets near that when reading from NVMe (which acts as a cache for the disks). I'm only using 5 of the 8 available SATA ports, each at around 267 MB/sec head rate, so 1335 MB/sec. Adding that up: 7 GB/sec for the NVMe connected to the south bridge, 1 GB/sec for Ethernet, and 1.3 GB/sec for the 5 disks, so just above 9 GB/sec.
Originally posted by BillBroadley: My point of all that is I'd be much happier having a south bridge connected with PCIe 5.0 x4 than 4.0 x4, so I don't have to juggle things around to ensure PCIe is not the bottleneck.
And if you look at it from a cost or energy standpoint, I think you'd find that PCIe 4.0 x8 is a better option than PCIe 5.0 x4, for the chipset link. However, I actually expected Intel to deploy PCIe 5.0 only for that chipset link, because even that is cheaper and easier (and makes more sense) than what they actually did!
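The raw numbers behind that comparison, using approximate per-lane throughput after encoding overhead (nominal figures, not measurements):

```python
# Approximate usable bandwidth per PCIe lane, GB/s, after 128b/130b encoding.
GBPS_PER_LANE = {"3.0": 0.985, "4.0": 1.969, "5.0": 3.938}

def link_bw(gen, lanes):
    """Nominal one-direction bandwidth of a PCIe link, in GB/s."""
    return GBPS_PER_LANE[gen] * lanes

# The chipset-link options discussed above:
print(f"PCIe 4.0 x8: {link_bw('4.0', 8):.1f} GB/s per direction")  # ~15.8
print(f"PCIe 5.0 x4: {link_bw('5.0', 4):.1f} GB/s per direction")  # ~15.8
print(f"PCIe 4.0 x4: {link_bw('4.0', 4):.1f} GB/s per direction")  # ~7.9
```

Both 4.0 x8 and 5.0 x4 deliver the same raw bandwidth; 4.0 x8 gets there with cheaper, better-understood signaling, which is the cost and energy argument above.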
Originally posted by BillBroadley: I can tell you that 2x NVMe (FireCuda 530s) was a quite noticeable upgrade over the SATA-connected SSD they replaced.
: )
Originally posted by BillBroadley: It does worry me that some motherboards put M.2 connectors where there's basically zero airflow, like under the motherboard.
M.2 is an annoying form factor, for desktop use. Priority was clearly given to laptops, during its design. Still, it beats the heck out of the "ruler" form factor!
Originally posted by BillBroadley: I wouldn't be surprised to see PCIe cards with 4x NVMe slots and airflow like a GPU.
Originally posted by BillBroadley: First-gen drives for a new technology are typically compatible with it, but not optimized for it.
Originally posted by BillBroadley: I keep my systems for a long time. I'm typing this on a Xeon E3-1230 v5 bought in 2015 or so, and I did buy an M.2 NVMe (Samsung 950 Pro).
Originally posted by BillBroadley: True, PCIe is duplex. My use case is mostly reads, so uncached it's 5x SATA -> CPU (checksums) -> network, mixed with cached NVMe -> CPU (checksums etc.) -> 10G. But even 7.8 GB/sec per direction to the south bridge doesn't mean much if real-world use runs into bottlenecks and contention at 50-60% of that. In any case, it would be nice to have some extra bandwidth to the south bridge so I could just ignore it.
Originally posted by BillBroadley: I was planning on 4 or more 5MP cameras streaming to the NAS (part of the justification for the storage) on a second interface (directly connected to a PoE switch). Not any significant bandwidth, just another thing keeping various bits of the server busy.
Anyway, a desktop platform can easily handle an order of magnitude more cameras than that. Do you have any specific software in mind? I haven't really looked into open source solutions for this, so I'm just wondering.
You'd be well-advised to keep the cameras on a separate subnet, where they're not internet-accessible.
Originally posted by BillBroadley: Ah, good to know. I'll keep an eye out for the future Xeon similar to today's Alder Lake.
If you want to spend the big $$$, then you can be assured of a Sapphire Rapids Xeon W, I think towards the latter part of the year. It'll feature the same Golden Cove cores, most likely DDR5, and certainly AVX-512. Plus AMX, although that's likely to be of interest only for deep learning and image processing, due to its support for only low-precision data types (int8 and BFloat16).
I don't have particularly deep knowledge specific to GPUs, drivers, and related areas. I do have a fair bit of experience with PCIe-connected high-performance interconnects (Quadrics, Myrinet, IB in SDR, DDR, QDR, FDR, EDR, and HDR). I've even designed and built a cluster using a very rare variant that connected IB directly to HyperTransport, including some that were the first public clusters of their kind. Sadly, there are two flavors of HyperTransport, one cache-coherent and one not, and the interconnect used the non-coherent version. That's where my experience with the tradeoffs between DMA, memcpy, and similar came from. DMA sounds great, but after talking to engineers inside PathScale, QLogic, Intel, and Mellanox, it ends up being much more complicated than you might think; performance limitations, bugs, special cases, etc. are common when you have to support a wide variety of platforms. Because of the DMA limitations for interconnects, there's a latency-sensitive path (not DMA) with low latency but high CPU utilization, and a (typically) tunable threshold above which you switch to a high-bandwidth, high-latency, but low-CPU-utilization path.
I read a piece by (pretty sure) John Romero making an impassioned plea for AMD to build a cache-coherent GPU that connects to their cache-coherent interconnect (HyperTransport), and it resonated with me. Instead, we have a device that's practically write-only and, because of the PCIe bus, only particularly efficient with large transfers. So sure, you can throw 1/60th of a second of triangles/textures at a GPU and it works great, but it makes it much harder to use the substantial computation resources for other things. I can also assure you that HPC folks are continuously begging for interconnects to be closer to the CPU instead of on the other end of PCIe.
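The two-path design described above, a low-latency CPU-heavy path for small messages and a DMA path for large ones, is the classic eager/rendezvous split in interconnect stacks. A minimal sketch (the threshold value and function names are invented for illustration):

```python
# Sketch of the eager/rendezvous protocol split common in HPC interconnects.
# Small messages: CPU copies through a pre-registered bounce buffer
# (low latency, burns CPU cycles). Large messages: set up DMA
# (higher setup latency, but frees the CPU during the transfer).
EAGER_THRESHOLD = 16 * 1024  # bytes; typically tunable per fabric/platform

def send(message: bytes) -> str:
    """Return which path a message of this size would take."""
    if len(message) <= EAGER_THRESHOLD:
        # Eager path: memcpy into the NIC's bounce buffer, send inline.
        return "eager"
    # Rendezvous path: pin/register memory, hand the NIC a DMA descriptor.
    return "rendezvous"

print(send(b"x" * 512))        # eager
print(send(b"x" * 1_000_000))  # rendezvous
```

Real stacks expose the threshold as a tunable (e.g. via environment variables) precisely because the crossover point varies with platform, as the post describes.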
Originally posted by coder: Again, you're way off the mark. First, games do stream textures and other assets into GPU memory, continually. Second, GPUs can page in from host RAM, as needed.
Originally posted by coder: AMD made a big fuss about the HBCC (High-Bandwidth Cache Controller) that they introduced way back in the Vega generation, which essentially treats GPU memory as a cache. And they can even achieve cache coherency over PCIe, using PCIe atomics.
Originally posted by coder: Page-faulting non-local memory is going to be super painful. I don't care if it's connected by PCIe or CXL. GPU programming APIs, whether for graphics or compute, all use queues to decouple the host thread from data access, so you don't waste ridiculous amounts of CPU time doing PIO transfers.
Originally posted by coder: You're just reiterating the same case for APUs that AMD tried to make more than 10 years ago. But it didn't pan out. When the PS4 and Xbox One launched, there were even some warnings that games might perform much worse on PCs, because they could actually take advantage of those systems' APU architectures. Still, it didn't pan out.
It seems plausible that the absence of PCIe does allow the console to do more with less and leverage GPUs in ways that normal gaming machines do not. Things like passing a pointer to a complex data structure to let the GPU offload some physics calculations. Unfortunately, I've no specific knowledge about that, though.
Originally posted by coder: I don't believe the world is going to fundamentally change the way GPUs are used. To scale performance, you're always going to have GPUs not directly connected to the same RAM as the host CPU(s), and embracing any APIs or programming styles that fundamentally assume otherwise is going to torpedo that scalability. So, most GPU programming of the future is likely to continue down the path GPU programming has been on. Work queues are a fundamental construct of scalable, parallel performance, whether CPU, GPU, or hybrid. What you absolutely don't want is for cores to waste precious time blocking on stuff.
Originally posted by coder: Also, please fact-check yourself before you post, so that others don't have to do it for you. You'll learn more that way, and others won't have to waste time doing it.
Originally posted by coder: Oops. No, it doesn't. Alder Lake only has a x16 PCIe 5.0 link that can be bifurcated to dual x8. So, the only way it helps with M.2 is if you get a plug-in PCIe adapter card to hold M.2 drives. And that only matters when they actually exist. And then, it only matters if you're actually doing something that is limited by current PCIe 4.0 latency or bandwidth, which is pretty unlikely (at least, not on a desktop platform).
My point of all that is I'd be much happier having a south bridge connected with PCIe 5.0 x4 than 4.0 x4, so I don't have to juggle things around to ensure PCIe is not the bottleneck. Doubly so since, without special tuning or artificial workloads, getting only 50% of the available bandwidth of any link is not unusual. I don't think a home NAS with 5 disks is all that odd; I'm not trying to dream up some fake corner case.
Originally posted by coder: The whole argument gets even more ridiculous when you look at the amount of thermal throttling going on in PCIe 4.0 M.2 drives. With PCIe 5.0 doubling the clock rate yet again, it's only going to get hotter. At that point, thermal throttling could easily wash out any performance gains from PCIe 5.0, other than maybe in tiny, isolated bursts.
Originally posted by coder: IMO, the whole thing is pretty stupid. It took over a year for PCIe 4.0 M.2 drives to hit the market that were meaningfully faster than the PCIe 3.0 top performers. And we should expect the PCIe 5.0 situation to be any different? I don't.
Originally posted by coder: The cores have nothing to do with that! The PCIe controller is a separate block on the die. They could potentially pair any PCIe controller with any set of cores, in any of their products.
Originally posted by coder: That's a stretch. The first M.2 drive should be directly connected to the CPU. That puts only the second one on the chipset. I've not seen an M.2 SSD that can saturate a PCIe 4.0 x4 link. The dual 10 Gbps controller only uses about 1.6 lanes' worth. Now, where you get into trouble is that by putting the second NVMe in a x4 slot, you have to sacrifice chipset SATA ports. So, you have to add a PCIe controller card to get beyond 4 SATA ports. However, that's a chipset limitation, not a matter of PCIe bandwidth.
In our ASRock Rack X570D4U-2L2T review, we see the company's latest server motherboard with 10GbE for the AMD Ryzen platform
It already has the magic to allow 8x SATA and 2x M.2, at least I think so. There might be a footnote somewhere mentioning disabling some SATA ports. I only needed 5x SATA and moved the M.2 to a PCIe slot, so I didn't dig into it.
I can dig up the cheaper motherboard variant with half the bandwidth to the south bridge if you want. But even real-world bandwidth (i.e. the head rate of 5 SATA drives, not 600 MB/sec times 8 available SATA ports) can easily add up to more than PCIe 4.0 x4.
Originally posted by coder: Now, when talking about how much bandwidth those 6 SATA ports are really using, we have to consider what's connected to them. If it's HDDs, then it's probably generous to put the max at 300 MB/sec each. That's still 1.8 GB/sec aggregate, which is about 0.9 of a PCIe 4.0 lane.
Originally posted by coder: So, you can contrive some scenario where the HDDs, NVMe drives, and NICs are all reading at full speed, and maybe you're oversubscribed by 25% or so. However, you said it's a server, which means the traffic flow over the dual 10 Gbps NICs is in the opposite direction as the storage. So, no. I'm pretty sure you're not oversubscribed in actual practice.
I was planning on 4 or more 5MP cameras streaming to the NAS (part of the justification for the storage) on a second interface (direct connected to a PoE switch). Not any significant bandwidth, but just another thing keeping various bits of the server busy.
Originally posted by coder: And the point doesn't apply to Intel platforms since Rocket Lake, because their chipset link is PCIe 4.0 x8, whereas AMD's is PCIe 4.0 x4.
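Plugging the thread's own figures (5 HDDs at ~267 MB/s head rate, one chipset-attached NVMe at ~7 GB/s, ~1 GB/s of 10G Ethernet) against the two chipset links being compared (nominal per-lane rates, not measurements):

```python
# All figures in GB/s, taken from the numbers quoted in this thread.
hdds = 5 * 0.267          # 5 SATA HDDs at ~267 MB/s head rate each
nvme = 7.0                # chipset-attached FireCuda 530, peak sequential
ethernet = 1.0            # 10 GbE, best case
demand = hdds + nvme + ethernet

amd_x570_link = 4 * 1.969     # PCIe 4.0 x4 chipset link
intel_z690_link = 8 * 1.969   # PCIe 4.0 x8 chipset link (since Rocket Lake)

print(f"aggregate read demand: {demand:.2f} GB/s")        # ~9.34
print(f"PCIe 4.0 x4 link:      {amd_x570_link:.2f} GB/s")  # ~7.88
print(f"PCIe 4.0 x8 link:      {intel_z690_link:.2f} GB/s")  # ~15.75
```

So the x4 link is oversubscribed only if every device reads at peak simultaneously, and as noted above, traffic direction matters: reads from storage and writes to the NICs use opposite halves of the duplex link.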
Originally posted by BillBroadley: Sure, OoO CPUs try to find useful work to do; a quad-issue CPU running at 3.33 GHz needs to find 800 instructions to run during a 60ns cache miss.
Originally posted by BillBroadley: Sure, some codes are that friendly, but many real-world things are not.
It's a mistake, however, to think that hardware exists in isolation. Hardware and software have been co-evolving all along. Memory allocators are tuned to return what's likely to be hot in the cache hierarchy and to make consecutive allocations more coherent. Compilers are getting smarter about scheduling instructions so that CPU front-ends don't bottleneck. Languages are getting features like C99's restricted pointers. There are many library optimizations being done to deliver better cache performance, etc.
Hardware, for its part, has been tuned to optimize real application performance. That means cache parameters like size, associativity, and replacement policy aren't pulled out of nowhere, they're the result of careful performance analysis, tuning, and simulation. Same with branch-predictors, prefetchers, etc.
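The miss-penalty arithmetic in the quote above checks out, and it's easy to extend (a back-of-envelope model ignoring prefetching; the figures are the ones from the quote):

```python
# Back-of-envelope: issue slots an OoO core must fill to hide one miss.
clock_ghz = 3.33   # cycles per nanosecond
issue_width = 4    # instructions per cycle
miss_ns = 60       # cache-miss latency

cycles_lost = clock_ghz * miss_ns        # ~200 cycles
slots_lost = cycles_lost * issue_width   # ~800 instruction slots
print(f"~{slots_lost:.0f} issue slots per {miss_ns} ns miss")

# With several misses outstanding (memory-level parallelism), the
# effective number of slots to cover per miss drops proportionally:
for mlp in (1, 2, 4, 8):
    print(f"MLP={mlp}: ~{slots_lost / mlp:.0f} slots to cover per miss")
```

This is why the surrounding discussion of SMT, prefetchers, and memory-level parallelism matters: the 800-slot figure assumes one isolated, unprefetched miss.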
Originally posted by BillBroadley: One of the hazards of reading too much into benchmarks is that they are often more cache-friendly than real-world applications.
Originally posted by BillBroadley: Intel and AMD obviously do use SMT, but it seems like the advantages decrease by the day as new security problems and performance-decreasing mitigations are added. In fact, I think there was a security problem/mitigation that hit in the last week or two.
Originally posted by BillBroadley: Generally SPEC results are posted for the best-performance configurations.
Originally posted by BillBroadley: Even ignoring the Apple iGPU, the M1 Max and its memory system are impressive. Anandtech mentioned the CPUs on the M1 Max were able to hit a maximum of 240 GB/sec (out of a 400 GB/sec peak) and over 100 GB/sec from a single core.
For comparison, the AMD Epyc 7763 (the latest and greatest from AMD) with 8 DDR4-3200 channels only managed 111 GB/sec (as measured by Anandtech) using all cores, less than half of the M1 Max, and the Epyc 7763 alone (not including RAM, motherboard, etc.) has a TDP of 280 watts.
This gets at the whole tenor of your posts about the M1. It's like you're looking at each aspect, putting it in the best possible light, and claiming the absence of it is a huge liability for everything else. You don't know that. I don't know that. It would take some carefully-conducted experiments that we might or might not have the means to conduct, for us to actually ascertain how much benefit the particular feature or spec confers, and in what sorts of different cases.
The M1 CPUs are cool. No doubt about that. I wouldn't even take issue with someone calling them a technical tour de force. However, I'd advise sticking to the facts of what's known about them, and not trying to compare them to products that are wholly different in kind. For instance, they're APUs and have memory subsystems clearly designed to meet the extreme demands of the GPU portion. Yet you compare their memory subsystem only to CPUs, rather than comparable GPUs or even console APUs. I think that's misleading.
One could make the same sorts of memory bandwidth comparisons between videogame console APUs and server CPUs. The clear implication would be that consoles have comparable horsepower to these huge server processors. It's utterly untrue, however.
Originally posted by coder: The problem with micro-benchmarks is that they often stress corner cases. While this can be enlightening, what they actually tell you about real-world performance tends to be limited. For instance, in this case, you're assuming the CPU's out-of-order engine can find nothing else to do, and that the loads weren't issued ahead of when the data was needed. Both tend to be incorrect, for typical program code.
Originally posted by coder: This is also where SMT helps out. If one thread is blocked on an L3 cache miss and the speculative execution engine can't find any more work to do, then at least the other thread(s) sharing the core will tend not to be blocked. CPUs like Ampere's Altra take the other approach of just having lots of smaller cores, so that some can make progress while others are blocked. Apple's approach of a small number of big cores basically means they need to minimize memory latency, because they have neither SMT nor tons of siblings to cover their cache misses.
I'm not sure I buy Apple needing low latency. Sure, SMT helps, but overall I wouldn't expect it to make a big difference.
Originally posted by coder: BTW, it's possible to find heavily memory-bound workloads, but I've not generally seen them in my career. Even things I expected to benefit from more memory bandwidth/channels were hardly impacted. On desktop platforms, the main area you really see benefiting from better memory performance tends to be iGPUs. It's really not until desktop CPUs broke the 8-core barrier that we started to see significant memory bottlenecks on CPU-bound workloads.
Generally SPEC results are posted for the best-performance configurations. So the 7313P (16 cores with high bandwidth per core) published results using 32 copies, which obviously uses SMT. But the 7763 (64 cores with low bandwidth per core) published only a 64-copy result, presumably because SMT didn't help performance. I also noticed that the Epyc 7763 gets a base score of 311 and the 7313P a base score of 181. So AMD gets less than 2x the performance with 4x the cores.
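The scaling claim is easy to check from those base scores:

```python
# SPECrate base scores quoted above.
epyc_7763 = {"cores": 64, "base": 311}
epyc_7313p = {"cores": 16, "base": 181}

speedup = epyc_7763["base"] / epyc_7313p["base"]
core_ratio = epyc_7763["cores"] / epyc_7313p["cores"]
per_core = speedup / core_ratio

print(f"4x the cores -> {speedup:.2f}x the throughput")  # ~1.72x
print(f"per-core scaling efficiency: {per_core:.0%}")    # ~43%
```

The roughly 43% per-core efficiency is consistent with the bandwidth-per-core argument in this subthread: both chips share the same 8-channel memory subsystem, so quadrupling the cores quadruples contention for it.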
Originally posted by drakonas777: Perceived "horrible power efficiency" of x86 is partially caused by idiotic factory settings for the CPU/platform. I put my 3700X into 45W ECO mode, and it lost about 15% of its performance while consuming about half the power. You should remember that Intel/AMD strive for maximum performance per square millimetre of silicon, especially in the consumer market, for mere silicon economy and THE BENCHMARKS, of course.