Apple M1 Ultra With 20 CPU Cores, 64 Core GPU, 32 Core Neural Engine, Up To 128GB Memory
Originally posted by coder View Post
WTF?
PCIe is generally not a bottleneck. Analyses of GPUs in PCIe 4.0 x16 slots vs. x8 or PCIe 3.0 x16 show very little performance difference in the vast majority of games. This is one reason I think Intel's move to PCIe 5.0 (on the desktop) was silly.
I agree that today's GPUs are not bandwidth limited, though partly that's because the bandwidth is hard to use. Games generally load 100% of their textures and objects into the GPU ahead of time, because the interface is crude (not cache coherent). With a smarter interface you could use the local VRAM as a cache for main memory: instead of crashing, your game would just run gradually slower as the cache hit rate drops. Being cache aware would enable useful interactions that churn through much more bandwidth. Most of today's games upload textures and some 3D (immutable) objects once, then each frame upload the eye position, field of view, eye direction, and whatever triangles are unique to that frame. Thus they are relatively insensitive to bandwidth.

However, with increasingly interactive worlds, mutable objects, complex effects, etc., it becomes attractive to drop texture limits, or even use the GPU for game physics, or AI/ML for smarter opponents. Much of this isn't possible (or is at least harder) with today's GPUs, but it's coming. CXL is a standard that allows accelerators (including GPUs) to be memory coherent. Logic and data could float between CPU and GPU by passing a pointer, taking full advantage of whatever bandwidth and caching there is. Today the GPU might as well be on the other end of a fast network connection for how separate it is from the CPU/RAM system. Zen 4 and near-future Intel cores (forget the name) will support CXL.
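A rough way to see the "VRAM as a cache" point, i.e. graceful degradation instead of crashing: model average throughput as the hit-weighted mix of on-card bandwidth and link bandwidth. The figures below are illustrative assumptions, not measurements of any real card.

```python
# Illustrative model: effective bandwidth if VRAM acted as a cache in
# front of host memory. Both bandwidth figures are assumptions.
VRAM_BW = 400.0   # GB/s, on-card memory (assumed)
LINK_BW = 32.0    # GB/s, PCIe 4.0 x16-class link (assumed)

def effective_bw(hit_rate):
    """Average bandwidth when a fraction hit_rate of accesses are served
    from VRAM and the rest cross the link (latency ignored)."""
    # Time per byte is the hit-weighted average of the two per-byte costs.
    t = hit_rate / VRAM_BW + (1 - hit_rate) / LINK_BW
    return 1 / t

for h in (1.0, 0.95, 0.8, 0.5):
    print(f"hit rate {h:.2f}: {effective_bw(h):6.1f} GB/s")
```

Even a modest miss rate pulls the average down sharply, but the game keeps running, which is exactly the behavior a coherent interface would buy over today's hard VRAM limit.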
Originally posted by coder View Post
As for sending stuff "manually", I'm pretty sure that's nonsense. The way I think it's usually done is to send the device a command to fetch data at a given DRAM address, and it does the fetch. Furthermore, the commands are queued. So, you just queue it up and move on. You obviously don't want to waste a CPU thread doing "PIO" copies. This was a huge no-no in the early days of PCIe, when systems had only 1 or 2 cores. So, it can't have been the case that most copies were the equivalent of a CPU thread doing a memcpy() into GPU memory.
Heck, even old-school PCI had bus-master DMA transfers, and that was the only way to get anywhere close to the theoretical throughput of the bus. Back in the days of the ISA bus, PC chipsets had DMA engines that you could program to do the copying for you. However, I think they were pretty slow and therefore relegated mostly to things like sound.
Sure, do people write games that run great with today's GPUs? Definitely. But there's great potential in GPUs that goes underutilized because they're a limited environment: no Linux kernel (or any full OS), no virtual memory, no cache coherency, etc. Future CXL-attached GPU cards will fix this, and the Apple iGPU does it today.
Originally posted by coder View Post
Not any more. These days, to get the highest clock speed on lightly-threaded workloads, you have to buy a CPU with more cores than you might actually want.
What's lower is the "base clock" speed, which is essentially the minimum clock speed when all of the cores are running. More cores -> more heat & power, so they run at a lower clock speed to lessen the impact.
Originally posted by BillBroadley View Post
Well my point is that x86s are power/heat/cooling limited, thus lower performance than you'd expect when using all cores, which is generally going to be the base CPU speed. So sure, an expensive (well over $1000) Threadripper Pro 5995WX with 64 cores looks great on a single thread at 4.5 GHz (which is BTW less than a cheap 5600X @ 4.6 GHz) but degrades to 2.7 GHz when using all cores.
It's a bit sticky, too, as it will depend on which CPU you pick. Some have increased base and boost clocks on the higher-core part, e.g. 5600X (3.7GHz base, 4.6GHz boost) vs. 5800X (3.8GHz base, 4.7GHz boost), while others have a slight drop in base but a larger jump in boost, e.g. 5950X (3.4GHz base, 4.9GHz boost).
Another example comes with the P/E-core options, e.g. 12600K (3.7GHz base, 4.9GHz boost) vs. 12900K (3.2GHz base, 5.2GHz boost)... which should still translate to higher single-thread performance on the higher core count CPU.
Originally posted by coder View Post
No, that's not true. First, modern CPU cores have hardware prefetchers that detect data-access patterns and try to have the data in cache before the CPU even asks for it. Second, memory reads are prioritized by the out-of-order scheduler and a core can have quite a few of these outstanding at any given time.
Third, memory channels are traditionally interleaved. I don't honestly know if this is still the case, but I think up through the DDR3 era, most CPUs would map 64 bits to one memory channel and the next 64 bits to the other. Maybe they actually interleaved on cache line boundaries, though; that would let the channels run more independently. Interestingly, I have an old AMD Phenom II board with a BIOS option to interleave them at page boundaries.
I do know that DDR5 has more channels. Each DDR5 DIMM has 2x 32-bit memory channels. So, if they're interleaved, then it must be at intervals no less than a cacheline.
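The interleaving-granularity question above can be made concrete with a toy address-to-channel mapping. The line size, channel count, and granularities here are assumptions for illustration, not a description of any particular memory controller.

```python
# Toy model: which channel serves a given physical address under
# interleaving at a chosen granularity (values assumed for illustration).
CACHELINE = 64      # bytes per cache line (assumed)
CHANNELS  = 2       # memory channels (assumed)

def channel_for(addr, granularity=CACHELINE):
    """Map a physical address to a channel index."""
    return (addr // granularity) % CHANNELS

# Cacheline-granularity interleaving: adjacent lines alternate channels.
print([channel_for(a) for a in (0, 64, 128, 192)])   # -> [0, 1, 0, 1]
# Page-granularity interleaving: a whole 4 KiB page stays on one channel.
print([channel_for(a, 4096) for a in (0, 64, 4096)]) # -> [0, 0, 1]
```

With 64-bit (8-byte) granularity, both channels would be touched by every single cache line fill; with cacheline or page granularity, each fill is satisfied by one channel and the channels can serve independent requests.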
Similarly, out-of-order schedulers can have multiple outstanding requests, and if those are answered by cache, that's great. But nothing about the scheduler can hide the impact of having only a few channels. Sure, the CPU might well have 100 memory requests pending, but if you have 2 channels then exactly two can be making progress at a time; all the other requests have to wait behind those 2.
Your 3rd point is a bit dated. Yes, memory used to be striped across channels, generally to increase bandwidth, so a single cache line (usually on the order of 64 bytes, give or take) would hit all channels. However, as the CPU vs. memory latency divide grew ever larger (it's been steadily worsening since the first CPUs), more complexity arrived: multiple levels of cache, prefetching, more sophisticated branch prediction, out-of-order execution, etc.

More recently, operating systems have become aware of multiple channels, multiple sockets, and even different latencies within a single chip; on some Epycs, for instance, some cores are closer to some memory channels than others. Programming languages and libraries are also aware, so you can use a topology/locality-aware memory allocator instead of plain malloc. I first noticed these changes around the time of the Intel Pentium 4, when BIOSes would offer a memory striping option for older OSs that were not NUMA aware. These days OSs, BIOSes, and related software generally default to NUMA aware. However, with or without NUMA or striping, nothing hides how many memory transactions you can have in flight, which is 1 per memory channel: you send a row and column, and get a cache line back. Generally, with a modern OS and most workloads, the NUMA-aware arrangement (one cache line request per channel) is fastest.
DDR5 DIMMs are still 64 bits wide, but offer 2x 32-bit channels, which is a nice improvement over DDR4's single 64-bit channel. Interleaving isn't on by default, and I don't think most modern motherboards support striping, though it's possible. Generally it's a big loss: instead of having 2 requests per DIMM in flight you have one, so you get half the throughput in most cases and the same performance only for a purely sequential workload.
Originally posted by BillBroadley View Post
I think the PCIe v5 on Intel was partially to upstage AMD for bragging rights, but does help with functionality like faster M.2 slots,
The whole argument gets even more ridiculous when you look at the amount of thermal throttling going on in PCIe 4.0 M.2 drives. With PCIe 5.0 doubling the clock rate yet again, it's only going to get hotter. At that point, thermal throttling could easily wash out any performance gains from PCIe 5.0, other than maybe in tiny, isolated bursts.
IMO, the whole thing is pretty stupid. It took over a year for PCIe 4.0 M.2 drives to hit the market that were meaningfully faster than the PCIe 3.0 top performers. Should we expect the PCIe 5.0 situation to be any different? I don't.
Originally posted by BillBroadley View Post
easier handling of faster network connections (10, 25, 100, and 200Gbit are increasingly common),
Ethernet Speed   PCIe 3.0 Lanes   PCIe 4.0 Lanes   PCIe 5.0 Lanes
10 Gbps          x2               x1               x1
25 Gbps          x4               x2               x1
40 Gbps          x8               x4               x2
100 Gbps         x16              x8               x4
200 Gbps         x32              x16              x8
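Those lane counts follow directly from per-lane throughput rounded up to the next power-of-two slot width. The per-lane rates below are approximate (PCIe 3.0 runs 8 GT/s with 128b/130b encoding, ~7.88 Gbps usable per lane; 4.0 and 5.0 each double it).

```python
import math

# Approximate usable throughput per lane, in Gbit/s.
LANE_GBPS = {"3.0": 7.88, "4.0": 15.75, "5.0": 31.5}

def lanes_needed(eth_gbps, gen):
    """Smallest power-of-two lane width that can carry the link rate."""
    raw = eth_gbps / LANE_GBPS[gen]
    return max(1, 2 ** math.ceil(math.log2(raw)))

for speed in (10, 25, 40, 100, 200):
    widths = [f"x{lanes_needed(speed, g)}" for g in ("3.0", "4.0", "5.0")]
    print(f"{speed:>3} Gbps: {widths}")
```

Running it reproduces the table above; e.g. 100 Gbps needs x16 on PCIe 3.0 but only x4 on PCIe 5.0.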
The problem is that Intel provided support only for bifurcating the x16 link to dual x8. And at x8, you could already do 100 Gbps. In fact, Alder Lake has a PCIe 4.0 x8 link to its chipset. So, you could already put a 100 Gbps adapter in a slot hanging off that link, easily hit 100 Gbps (bidir) with room to spare, and not touch any of the PCIe 5.0 lanes.
Now, depending on what you're doing with it, that might not make sense. If you want to transfer files to/from the machine, then you're going to hit a bottleneck on storage; you'd have to RAID-0 at least 2 of the fastest SSDs to get that kind of throughput. However, maybe you're just editing videos off an NFS mount.
Still, what's undeniable is that only the 200 Gbps corner case is really served by what Intel did here. Everything below that could've been accommodated by Rocket Lake and its PCIe 4.0.
Originally posted by BillBroadley View Post
and of course future CXL capable GPUs.
Originally posted by BillBroadley View Post
Keep in mind the Alder Lake cores in today's desktops will also make it into Xeons for server-type duty.
Originally posted by BillBroadley View Post
PCIe v4 can be limiting. I've got an AMD motherboard for a home file server; the south bridge is pretty limited (even though it's the X570 chipset, with twice the bandwidth). If I put in 2x M.2 drives, a 2x10G card, and 6x SATA, it ends up significantly oversubscribed.
Now, when talking about how much bandwidth those 6 SATA ports are really using, we have to consider what's connected to them. If it's HDDs, then it's probably generous to put the max at 300 MB/sec each. That's still 1.8 GB/sec aggregate, which is about 0.9 of a PCIe 4.0 lane.
So, you can contrive some scenario where the HDDs, NVMe drives, and NICs are all reading at full speed, and maybe you're oversubscribed by 25% or so. However, you said it's a server, which means the traffic flowing over the dual 10 Gbps NICs is going in the opposite direction from the storage. So, no, I'm pretty sure you're not oversubscribed in actual practice.
And the point doesn't apply to Intel platforms since Rocket Lake, because their chipset link is PCIe 4.0 x8, whereas AMD's is PCIe 4.0 x4.
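The back-of-envelope arithmetic behind that exchange can be sketched as below. All of the device figures are assumptions for illustration (drive and NIC models weren't specified), so the exact oversubscription percentage will vary; the structure of the argument is what matters.

```python
# Rough check of the chipset-uplink oversubscription claim, in GB/s.
# Per-device figures are assumptions, not measurements.
PCIE4_LANE_GBS = 1.97              # usable throughput per PCIe 4.0 lane
chipset_link = 4 * PCIE4_LANE_GBS  # X570-style x4 uplink, ~7.9 GB/s

worst_case_demand = {
    "2x NVMe reading flat out": 2 * 3.5,   # assumed ~3.5 GB/s drives
    "2x 10GbE transmitting":    2 * 1.25,  # 10 Gbps = 1.25 GB/s each
    "6x SATA HDD reading":      6 * 0.3,   # generous 300 MB/s per HDD
}
total = sum(worst_case_demand.values())
print(f"uplink {chipset_link:.1f} GB/s vs. worst-case demand {total:.1f} GB/s")
```

On paper the demand exceeds the uplink, but only if every device moves data in the same direction at once; on a real file server the NICs are sending what the disks are reading, so the flows largely don't stack.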
Originally posted by BillBroadley View Post
So games generally have 100% of textures and objects inside the GPU ahead of time, because of the crude (not cache coherent) interface. With a smarter interface you could use the local VRAM as a cache for main memory and instead of crashing your game would run gradually slower as the cache hit rate drops.
Now, where you can most readily see this impact is in some of the benchmarks for AMD's Radeon RX 6500XT. When people stressed its 4 GB of onboard VRAM by cranking up resolution and detail level, it started having to page in lots of assets from host memory. And if you'd put this x4 card in a PCIe 3.0 slot, performance suffered badly. However, that's PCIe 3.0 x4, not PCIe 5.0 x8 or x16.
Originally posted by BillBroadley View Post
Most of today's games upload textures, some 3d
Originally posted by BillBroadley View Post
with increasingly interactive worlds, mutable objects, complex effects, etc it's attractive to not have texture limits
Originally posted by BillBroadley View Post
Generally GPU memory is mapped into the main memory address space, but cache is disabled for that range. DMA is possible, but is somewhat complex, and setting up the DMA adds latency, but is more CPU friendly, and typically lower bandwidth. Depending on your needs and the size of the transfer you may also do a direct memcpy, which is lower latency, higher bandwidth, and more CPU intensive.
Originally posted by BillBroadley View Post
But this is way more painful than just using a pointer. If you have multiple CPU cores it's pretty straightforward to use multiple threads and/or multiple processes to get things done.
Originally posted by BillBroadley View Post
GPU cards will fix this in the future and the Apple iGPU does this today.
I don't believe the world is going to fundamentally change the way GPUs are used. To scale performance, you're always going to have GPUs that aren't directly connected to the same RAM as the host CPU(s), and embracing APIs or programming styles that fundamentally assume otherwise will torpedo that scalability. So, most GPU programming of the future is likely to continue down the path it has been on. Work queues are a fundamental construct of scalable, parallel performance, whether CPU, GPU, or hybrid. What you absolutely don't want is for cores to waste precious time blocking on stuff.
Also, please fact-check yourself, before you post, so that others don't have to do it for you. You'll learn more that way, and others won't have to waste time doing it.
Originally posted by BillBroadley View Post
So sure an expensive (well over $1000) Threadripper pro 5995WX with 64 cores looks great on a single thread at 4.5 GHz (which is BTW less than a cheap 5600x @ 4.6 GHz) but degrades to 2.7 GHz when using all cores.
Originally posted by BillBroadley View Post
So if your DIMM latency is 60ns and you have 2 channels, you can handle exactly 2 cache misses at a time, each waiting 60ns for its result. So if 8 cores all miss at once, the last ones wait 240ns (4 rounds x 60ns). I've done significant exploration in this space and have written several micro-benchmarks to explore it.
This is also where SMT helps out. If one thread is blocked on a L3 cache miss and the speculative execution engine can't find any more work to do, then at least the other thread(s) sharing the core will tend not to be blocked. CPUs like Ampere's Altra take the other approach of just having lots of smaller cores, so that some can make progress while others are blocked. Apple's approach of a small number of big cores basically means they need to minimize memory latency, because they have neither SMT nor tons of siblings to cover their cache misses.
BTW, it's possible to find heavily memory-bound workloads, but I've not generally seen them in my career. Even things I expected to benefit from more memory bandwidth/channels were hardly impacted. On desktop platforms, the main beneficiary of better memory performance tends to be iGPUs. It's really not until desktop CPUs broke the 8-core barrier that we started to see significant memory bottlenecks on CPU-bound workloads.
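The serialization in Bill's 60ns/240ns example is simple to write down: misses queue up in rounds of one per channel. The latency and channel count below are the same illustrative figures from his post, not measured values.

```python
import math

# Worst case: N cores all miss every cache at the same instant, and only
# one request per channel can be in flight (simplifying assumption).
def worst_case_wait_ns(misses, channels=2, latency_ns=60):
    """Time until the last miss is served, in rounds of `channels` misses."""
    rounds = math.ceil(misses / channels)
    return rounds * latency_ns

print(worst_case_wait_ns(2))   # -> 60
print(worst_case_wait_ns(8))   # -> 240
```

This is also a compact way to see why SMT and many-small-core designs help: the core stalled for those rounds can hand execution to a sibling thread instead of idling.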
Originally posted by coder View Post
Helped by the fact that it's LPDDR5. Anyway, lots of the bandwidth is really there for the GPU. If you're running a graphics-heavy workload, the CPU cores won't get as much.
That's not generally supported by benchmarks. Sure, you never get linear scaling, but AMD and Intel typically add enough memory bandwidth (and cache) so the cores aren't terribly starved.
The more apt comparison would be with a laptop GPU.
If that's all that software is designed to use, then the extra memory doesn't help you.
Its CPU cores can't use all of that bandwidth! It's mainly for the GPU. It would therefore make more sense to compare it with the memory specs of comparable GPUs.
Even ignoring the Apple iGPU, the M1 Max and its memory system are impressive. Anandtech mentioned the CPUs on the M1 Max were able to hit a maximum of 240GB/sec (out of a 400GB/sec peak), and over 100GB/sec from a single core.
For comparison, the AMD Epyc 7763 (the latest and greatest from AMD), with 8 DDR4-3200 channels, only managed 111GB/sec (as measured by Anandtech) using all cores, less than half of the M1 Max, and the Epyc 7763 alone (not including RAM, motherboard, etc.) has a TDP of 280 watts.
So the latest and greatest (and generally considered ahead of Intel) AMD requires 64 cores, 280 watts, and 8 channels to hit 111GB/sec observed, while an Apple thin-and-light laptop with 10 cores and much less power gets over twice that memory bandwidth (for CPUs only, ignoring GPUs). As impressive as that is, it's likely half of what the M1 Ultra can do, though that's somewhat speculative until the detailed reviews come out.
Originally posted by drakonas777 View Post
The perceived "horrible power efficiency" of x86 is partially caused by idiotic factory settings for the CPU/platform. I put my 3700X into its 45W ECO mode, and it lost only ~15% performance while consuming about half the power. You should remember that Intel/AMD strive for maximum performance per square millimetre of silicon, especially in the consumer market, for silicon economy and THE BENCHMARKS, of course.