Apple M1 Ultra With 20 CPU Cores, 64 Core GPU, 32 Core Neural Engine, Up To 128GB Memory
BillBroadley, here's more on OpenCL Shared Virtual Memory. I omitted it from the post above, because inserting too many links will typically get your post withheld for moderation.
"One of the remarkable features of OpenCL™ 2.0 is shared virtual memory (SVM). This feature enables OpenCL developers to write code with extensive use of pointer-linked data structures like linked lists or trees that are shared between the host and a device side of an OpenCL application. In OpenCL 1.2, the specification doesn't provide any guarantees that a pointer assigned on the host side can be used to access data in the kernel on the device side or vice versa. Thus, data with pointers in OpenCL 1.2 cannot be shared between the sides, and the application should be designed accordingly, for example, with indices used instead of pointers. This is an artifact of a separation of address spaces of the host and the device that is addressed by OpenCL 2.0 SVM."
https://www.intel.com/content/www/us...-overview.html
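The index-based workaround that quote alludes to can be illustrated with a short sketch (plain Python, not OpenCL API code; the function names are invented for illustration): instead of sharing pointer-linked nodes, the host flattens the structure into arrays of values and next-indices, which mean the same thing in any address space.

```python
# Sketch: why OpenCL 1.2 code shares indices instead of pointers.
# A pointer-linked list is only meaningful in the host's address space,
# but an array index means the same thing on host and device.

def flatten_linked_list(head):
    """Convert a pointer-linked list into (values, next_index) arrays."""
    values, next_index = [], []
    node = head
    while node is not None:
        values.append(node["value"])
        # -1 plays the role of a NULL "pointer" both sides understand.
        next_index.append(len(values) if node["next"] is not None else -1)
        node = node["next"]
    return values, next_index

def walk(values, next_index, start=0):
    """Traverse the flattened list the way a kernel would, via indices."""
    out, i = [], start
    while i != -1:
        out.append(values[i])
        i = next_index[i]
    return out

# Build a small pointer-linked list: 3 -> 1 -> 4
head = {"value": 3, "next": {"value": 1, "next": {"value": 4, "next": None}}}
vals, nxt = flatten_linked_list(head)
print(walk(vals, nxt))  # [3, 1, 4]
```

With OpenCL 2.0 SVM, the flattening step disappears: the same pointers are valid on both sides.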
Originally posted by BillBroadley: DMA sounds great, but after talking to engineers inside PathScale, QLogic, Intel, and Mellanox, it ends up being much more complicated than you might think; performance limitations, bugs, special cases, etc. are common when you have to support a wide variety of platforms.
Originally posted by BillBroadley: I read a piece by (pretty sure) John Romero making an impassioned plea for AMD to build a cache-coherent GPU that connects to their cache-coherent interconnect (HyperTransport), and it resonated with me.
That's not to say GPUs don't need tons of stuff from the host, but the more you learn about modern GPU APIs, the more you see how oriented they are towards generating as much locally on the GPU as possible.
Originally posted by BillBroadley: I can also assure you that HPC folks are continuously begging for interconnects to be closer to the CPU instead of on the other end of PCIe.
Originally posted by BillBroadley: What? You are saying that a GPU can have a cache miss, page fault, and ask the system for that page? That's not at all my understanding of this. It's been wished for, but isn't how it worked. Maybe you don't mean the same thing when you say page?
Originally posted by BillBroadley: The VRAM is indeed a cache for system memory, but that system memory isn't shared with the CPU,
Originally posted by BillBroadley: So you can't pass pointers between CPU and GPU.
Originally posted by BillBroadley: Carefully managing limited resources like video memory is a consequence of being on a PCIe bus instead of a fast, low-latency connection like the one that exists between CPU sockets.
Given that, I think work queues will always be the preferred structure for CPUs to communicate with GPUs. Maybe not vice versa, since GPUs are so much better at latency-hiding.
Originally posted by BillBroadley: Today's GPUs are not particularly bandwidth-limited,
Originally posted by BillBroadley: Want to take a guess on the latency to get one cache line from system RAM for a CPU vs a GPU today?
Originally posted by BillBroadley: I admit I've not been tracking PCIe 5.0 support, but like DDR5 it's a chicken-and-egg problem. Intel needs to ship something to prod the ecosystem into scaling up products to take advantage of it.
PCIe 5.0 was already happening in the server market, where it's actually needed. Several months ago, Amazon announced that their next-gen Graviton CPUs have PCIe 5.0 and had already been in service for months at the time. Other server processors have announced it, including Intel's Sapphire Rapids. So, that was already a fait accompli. And server CPUs with PCIe 5.0 would've been enough for ecosystem partners to do interop testing. They didn't need desktop CPUs for that.
Originally posted by BillBroadley: I'm hoping Intel's shipping of DDR5 helps decrease DDR5 prices as volumes increase
Second, I think they got out ahead of the market, though it'd have been difficult to time perfectly, especially given the supply chain madness of late. At least they kept DDR4 as an option. Full marks for that!
Originally posted by BillBroadley: I've seen various NVMe drives mentioning PCIe 5,
Originally posted by BillBroadley: I wanted 2x FireCuda 530s; they peak around 7 GB/sec sequential, and only one of the two is directly connected to the CPU; the other is on the south bridge. 10G Ethernet manages 1 GB/sec if you are lucky, and only gets near that when reading from NVMe (which acts as a cache for the disks). I'm only using 5 of the 8 available SATA ports, each at around 267 MB/sec head rate, so 1335 MB/sec. Adding that up: 7 GB/sec for the NVMe connected to the south bridge, 1 GB/sec for Ethernet, and 1.3 GB/sec for the 5 disks, so just above 9 GB/sec.
Originally posted by BillBroadley: My point of all that is I'd be much happier having a south bridge connected with PCIe 5.0 x4 than 4.0 x4, so I don't have to juggle things around to ensure PCIe is not the bottleneck.
And if you look at it from a cost or energy standpoint, I think you'd find that PCIe 4.0 x8 is a better option than PCIe 5.0 x4, for the chipset link. However, I actually expected Intel to deploy PCIe 5.0 only for that chipset link, because even that is cheaper and easier (and makes more sense) than what they actually did!
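The raw numbers behind that comparison, using approximate per-lane throughput after encoding overhead (nominal figures, not measurements):

```python
# Approximate usable bandwidth per PCIe lane, GB/s, after 128b/130b encoding.
GBPS_PER_LANE = {"3.0": 0.985, "4.0": 1.969, "5.0": 3.938}

def link_bw(gen, lanes):
    """Nominal one-direction bandwidth of a PCIe link, in GB/s."""
    return GBPS_PER_LANE[gen] * lanes

# The chipset-link options discussed above:
print(f"PCIe 4.0 x8: {link_bw('4.0', 8):.1f} GB/s per direction")  # ~15.8
print(f"PCIe 5.0 x4: {link_bw('5.0', 4):.1f} GB/s per direction")  # ~15.8
print(f"PCIe 4.0 x4: {link_bw('4.0', 4):.1f} GB/s per direction")  # ~7.9
```

Both 4.0 x8 and 5.0 x4 deliver the same raw bandwidth; 4.0 x8 gets there with cheaper, better-understood signaling, which is the cost and energy argument above.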
Originally posted by BillBroadley: I can tell you that 2x NVMe (FireCuda 530s) was a quite noticeable upgrade over the SATA-connected SSD they replaced.
: )
Originally posted by BillBroadley: It does worry me that some motherboards put M.2 connectors where there's basically zero airflow, like under the motherboard.
M.2 is an annoying form factor, for desktop use. Priority was clearly given to laptops, during its design. Still, it beats the heck out of the "ruler" form factor!
Originally posted by BillBroadley: I wouldn't be surprised to see PCIe cards with 4x NVMe slots and airflow like a GPU.
Originally posted by BillBroadley: First-gen drives for a new technology are typically compatible with it, but not optimized for it.
Originally posted by BillBroadley: I keep my systems for a long time. I'm typing this on a Xeon E3-1230 v5 bought in 2015 or so, and I did buy an M.2 NVMe (Samsung 950 Pro).
Originally posted by BillBroadley: True, PCIe is duplex. My use case is mostly reads, so uncached it's 5x SATA -> CPU (checksums) -> network, mixed with cached NVMe -> CPU (checksums etc.) -> 10G. But even 7.8 GB/sec per direction to the south bridge doesn't mean much if real-world use runs into bottlenecks and contention at 50-60% of that. In any case, it would be nice to have some extra bandwidth to the south bridge so I could just ignore it.
Originally posted by BillBroadley: I was planning on 4 or more 5MP cameras streaming to the NAS (part of the justification for the storage) on a second interface (directly connected to a PoE switch). Not any significant bandwidth, just another thing keeping various bits of the server busy.
Anyway, a desktop platform can easily handle an order of magnitude more cameras than that. Do you have any specific software in mind? I haven't really looked into open source solutions for this, so I'm just wondering.
You'd be well-advised to keep the cameras on a separate subnet, where they're not internet-accessible.
Originally posted by BillBroadley: Ah, good to know. I'll keep an eye out for the future Xeon similar to today's Alder Lake.
If you want to spend the big $$$, then you can be assured of a Sapphire Rapids Xeon W, I think towards the latter part of the year. It'll feature the same Golden Cove cores, most likely DDR5, and certainly AVX-512. Plus AMX, although that's likely to be of interest only for deep learning and image processing, due to its support for only low-precision data types (int8 and BFloat16).
I don't have particularly deep knowledge specific to GPUs, drivers, and related areas. I do have a fair bit of experience with PCIe-connected high-performance interconnects (Quadrics, Myrinet, IB in SDR, DDR, QDR, FDR, EDR, and HDR). I've even designed and built a cluster using a very rare variant that connected IB directly to HyperTransport, including some that were the first public clusters of their kind. Sadly, there are two flavors of HyperTransport, one cache-coherent and one not, and the interconnect used the non-coherent version. That's where my experience with the tradeoffs between DMA, memcpy, and similar came from. DMA sounds great, but after talking to engineers inside PathScale, QLogic, Intel, and Mellanox, it ends up being much more complicated than you might think; performance limitations, bugs, special cases, etc. are common when you have to support a wide variety of platforms. Because of the DMA limitations for interconnects, there's a latency-sensitive path (not DMA) with low latency but high CPU utilization, and a (typically) tunable threshold above which you switch to a high-bandwidth, high-latency, but low-CPU-utilization path.
I read a piece by (pretty sure) John Romero making an impassioned plea for AMD to build a cache-coherent GPU that connects to their cache-coherent interconnect (HyperTransport), and it resonated with me. Instead, we have a device that's practically write-only and, because of the PCIe bus, only particularly efficient with large transfers. So sure, you can throw 1/60th of a second of triangles/textures at a GPU and it works great, but it makes it much harder to use the substantial computation resources for other things. I can also assure you that HPC folks are continuously begging for interconnects to be closer to the CPU instead of on the other end of PCIe.
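The two-path design described above, a low-latency CPU-heavy path for small messages and a DMA path for large ones, is the classic eager/rendezvous split in interconnect stacks. A minimal sketch (the threshold value and function names are invented for illustration):

```python
# Sketch of the eager/rendezvous protocol split common in HPC interconnects.
# Small messages: CPU copies through a pre-registered bounce buffer
# (low latency, burns CPU cycles). Large messages: set up DMA
# (higher setup latency, but frees the CPU during the transfer).
EAGER_THRESHOLD = 16 * 1024  # bytes; typically tunable per fabric/platform

def send(message: bytes) -> str:
    """Return which path a message of this size would take."""
    if len(message) <= EAGER_THRESHOLD:
        # Eager path: memcpy into the NIC's bounce buffer, send inline.
        return "eager"
    # Rendezvous path: pin/register memory, hand the NIC a DMA descriptor.
    return "rendezvous"

print(send(b"x" * 512))        # eager
print(send(b"x" * 1_000_000))  # rendezvous
```

Real stacks expose the threshold as a tunable (e.g. via environment variables) precisely because the crossover point varies with platform, as the post describes.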
Originally posted by coder: Again, you're way off the mark. First, games do stream textures and other assets into GPU memory, continually. Second, GPUs can page in from host RAM, as needed.
Originally posted by coder: AMD made a big fuss about the HBCC (High-Bandwidth Cache Controller) that they introduced way back in the Vega generation, which essentially treats GPU memory as a cache. And they can even achieve cache coherency over PCIe, using PCIe atomics.
Originally posted by coder: Page-faulting non-local memory is going to be super painful. I don't care if it's connected by PCIe or CXL. GPU programming APIs, whether for graphics or compute, all use queues to decouple the host thread from data access, so you don't waste ridiculous amounts of CPU time doing PIO transfers.
Originally posted by coder: You're just reiterating the same case for APUs that AMD tried to make more than 10 years ago. But it didn't pan out. When the PS4 and Xbox One launched, there were even some warnings that games might perform much worse on PCs, because they could actually take advantage of those systems' APU architectures. Still, it didn't pan out.
It seems plausible that the absence of PCIe does allow the console to do more with less and leverage GPUs in ways that normal gaming machines do not. Things like passing a pointer to a complex data structure to let the GPU offload some physics calculations. Unfortunately, I've no specific knowledge about that, though.
Originally posted by coder: I don't believe the world is going to fundamentally change the way GPUs are used. To scale performance, you're always going to have GPUs not directly connected to the same RAM as the host CPU(s), and embracing any APIs or programming styles that fundamentally assume otherwise is going to torpedo that scalability. So, most GPU programming of the future is likely to continue down the path GPU programming has been on. Work queues are a fundamental construct of scalable, parallel performance, whether CPU, GPU, or hybrid. What you absolutely don't want is for cores to waste precious time blocking on stuff.
Originally posted by coder: Also, please fact-check yourself before you post, so that others don't have to do it for you. You'll learn more that way, and others won't have to waste time doing it.
Originally posted by coder: Oops. No, it doesn't. Alder Lake only has a x16 PCIe 5.0 link that can be bifurcated to dual x8. So, the only way it helps with M.2 is if you get a plug-in PCIe adapter card to hold M.2 drives. And that only matters when they actually exist. And then, it only matters if you're actually doing something that is limited by current PCIe 4.0 latency or bandwidth, which is pretty unlikely (at least, not on a desktop platform).
My point of all that is I'd be much happier having a south bridge connected with PCIe 5.0 x4 than 4.0 x4, so I don't have to juggle things around to ensure PCIe is not the bottleneck. Doubly so since, without special tuning or artificial workloads, getting only 50% of the available bandwidth of any link is not unusual. I don't think a home NAS with 5 disks is all that odd; I'm not trying to dream up some fake corner case.
Originally posted by coder: The whole argument gets even more ridiculous when you look at the amount of thermal throttling going on in PCIe 4.0 M.2 drives. With PCIe 5.0 doubling the clock rate yet again, it's only going to get hotter. At that point, thermal throttling could easily wash out any performance gains from PCIe 5.0, other than maybe in tiny, isolated bursts.
Originally posted by coder: IMO, the whole thing is pretty stupid. It took over a year for PCIe 4.0 M.2 drives to hit the market that were meaningfully faster than the PCIe 3.0 top performers. And we should expect the PCIe 5.0 situation to be any different? I don't.
Originally posted by coder: The cores have nothing to do with that! The PCIe controller is a separate block on the die. They could potentially pair any PCIe controller with any set of cores, in any of their products.
Originally posted by coder: That's a stretch. The first M.2 drive should be directly connected to the CPU. That puts only the second one on the chipset. I've not seen an M.2 SSD that can saturate a PCIe 4.0 x4 link. The dual 10 Gbps controller only uses about 1.6 lanes' worth. Now, where you get into trouble is that by putting the second NVMe in a x4 slot, you have to sacrifice chipset SATA ports. So, you have to add a PCIe controller card to get beyond 4 SATA ports. However, that's a chipset limitation, not a matter of PCIe bandwidth.
In our ASRock Rack X570D4U-2L2T review, we see the company's latest server motherboard with 10GbE for the AMD Ryzen platform
It already has the magic to allow 8x SATA and 2x M.2, at least I think so. There might be a footnote somewhere mentioning disabling some SATA ports. I only needed 5x SATA and moved the M.2 to a PCIe slot, so I didn't dig into it.
I can dig up the cheaper motherboard variant with half the bandwidth to the south bridge if you want. But even real-world bandwidth (i.e. the head rate of 5 SATA drives, not 600 MB/sec times 8 available SATA ports) can easily add up to more than PCIe 4.0 x4.
Originally posted by coder: Now, when talking about how much bandwidth those 6 SATA ports are really using, we have to consider what's connected to them. If it's HDDs, then it's probably generous to put the max at 300 MB/sec each. That's still 1.8 GB/sec aggregate, which is about 0.9 of a PCIe 4.0 lane.
Originally posted by coder: So, you can contrive some scenario where the HDDs, NVMe drives, and NICs are all reading at full speed, and maybe you're oversubscribed by 25% or so. However, you said it's a server, which means the traffic flow over the dual 10 Gbps NICs is in the opposite direction as the storage. So, no. I'm pretty sure you're not oversubscribed in actual practice.
I was planning on 4 or more 5MP cameras streaming to the NAS (part of the justification for the storage) on a second interface (direct connected to a PoE switch). Not any significant bandwidth, but just another thing keeping various bits of the server busy.
Originally posted by coder: And the point doesn't apply to Intel platforms since Rocket Lake, because their chipset link is PCIe 4.0 x8, whereas AMD's is PCIe 4.0 x4.
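Plugging the thread's own figures (5 HDDs at ~267 MB/s head rate, one chipset-attached NVMe at ~7 GB/s, ~1 GB/s of 10G Ethernet) against the two chipset links being compared (nominal per-lane rates, not measurements):

```python
# All figures in GB/s, taken from the numbers quoted in this thread.
hdds = 5 * 0.267          # 5 SATA HDDs at ~267 MB/s head rate each
nvme = 7.0                # chipset-attached FireCuda 530, peak sequential
ethernet = 1.0            # 10 GbE, best case
demand = hdds + nvme + ethernet

amd_x570_link = 4 * 1.969     # PCIe 4.0 x4 chipset link
intel_z690_link = 8 * 1.969   # PCIe 4.0 x8 chipset link (since Rocket Lake)

print(f"aggregate read demand: {demand:.2f} GB/s")        # ~9.34
print(f"PCIe 4.0 x4 link:      {amd_x570_link:.2f} GB/s")  # ~7.88
print(f"PCIe 4.0 x8 link:      {intel_z690_link:.2f} GB/s")  # ~15.75
```

So the x4 link is oversubscribed only if every device reads at peak simultaneously, and as noted above, traffic direction matters: reads from storage and writes to the NICs use opposite halves of the duplex link.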
Originally posted by BillBroadley: Sure, OoO CPUs try to find useful work to do; a quad-issue CPU running at 3.33 GHz needs to find 800 instructions to run during a 60ns cache miss.
Originally posted by BillBroadley: Sure, some codes are that friendly, but many real-world things are not.
It's a mistake, however, to think that hardware exists in isolation. Hardware and software have been co-evolving all along. Memory allocators are tuned to return what's likely to be hot in the cache hierarchy and to make consecutive allocations more coherent. Compilers are getting smarter about scheduling instructions so that CPU front-ends don't bottleneck. Languages are getting features like C99's restricted pointers. There are many library optimizations being done to deliver better cache performance, etc.
Hardware, for its part, has been tuned to optimize real application performance. That means cache parameters like size, associativity, and replacement policy aren't pulled out of nowhere, they're the result of careful performance analysis, tuning, and simulation. Same with branch-predictors, prefetchers, etc.
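The miss-penalty arithmetic in the quote above checks out, and it's easy to extend (a back-of-envelope model ignoring prefetching; the figures are the ones from the quote):

```python
# Back-of-envelope: issue slots an OoO core must fill to hide one miss.
clock_ghz = 3.33   # cycles per nanosecond
issue_width = 4    # instructions per cycle
miss_ns = 60       # cache-miss latency

cycles_lost = clock_ghz * miss_ns        # ~200 cycles
slots_lost = cycles_lost * issue_width   # ~800 instruction slots
print(f"~{slots_lost:.0f} issue slots per {miss_ns} ns miss")

# With several misses outstanding (memory-level parallelism), the
# effective number of slots to cover per miss drops proportionally:
for mlp in (1, 2, 4, 8):
    print(f"MLP={mlp}: ~{slots_lost / mlp:.0f} slots to cover per miss")
```

This is why the surrounding discussion of SMT, prefetchers, and memory-level parallelism matters: the 800-slot figure assumes one isolated, unprefetched miss.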
Originally posted by BillBroadley: One of the hazards of reading too much into benchmarks is that they are often more cache-friendly than real-world applications.
Originally posted by BillBroadley: Intel and AMD obviously do use SMT, but it seems like the advantages decrease by the day as new security problems and performance-decreasing mitigations are added. In fact, I think there was a security problem/mitigation that hit in the last week or two.
Originally posted by BillBroadley: Generally SPEC results are posted for the best-performance configurations.
Originally posted by BillBroadley: Even ignoring the Apple iGPU, the M1 Max and its memory system are impressive. Anandtech mentioned the CPUs on the M1 Max were able to hit a maximum of 240 GB/sec (out of a 400 GB/sec peak) and over 100 GB/sec from a single core.
For comparison, the AMD Epyc 7763 (the latest and greatest from AMD) with 8 DDR4-3200 channels only managed 111 GB/sec (as measured by Anandtech) using all cores, less than half of the M1 Max, and the Epyc 7763 alone (not including RAM, motherboard, etc.) has a TDP of 280 watts.
This gets at the whole tenor of your posts about the M1. It's like you're looking at each aspect, putting it in the best possible light, and claiming the absence of it is a huge liability for everything else. You don't know that. I don't know that. It would take some carefully-conducted experiments that we might or might not have the means to conduct, for us to actually ascertain how much benefit the particular feature or spec confers, and in what sorts of different cases.
The M1 CPUs are cool. No doubt about that. I wouldn't even take issue with someone calling them a technical tour de force. However, I'd advise sticking to the facts of what's known about them, and not trying to compare them to products that are wholly different in kind. For instance, they're APUs and have memory subsystems clearly designed to meet the extreme demands of the GPU portion. Yet you compare their memory subsystem only to CPUs, rather than comparable GPUs or even console APUs. I think that's misleading.
One could make the same sorts of memory bandwidth comparisons between videogame console APUs and server CPUs. The clear implication would be that consoles have comparable horsepower to these huge server processors. It's utterly untrue, however.
Originally posted by coder: The problem with micro-benchmarks is that they often stress corner cases. While this can be enlightening, what they actually tell you about real-world performance tends to be limited. For instance, in this case, you're assuming the CPU's out-of-order engine can find nothing else to do, and that the loads weren't issued ahead of when the data was needed. Both tend to be incorrect, for typical program code.
Originally posted by coder: This is also where SMT helps out. If one thread is blocked on an L3 cache miss and the speculative execution engine can't find any more work to do, then at least the other thread(s) sharing the core will tend not to be blocked. CPUs like Ampere's Altra take the other approach of just having lots of smaller cores, so that some can make progress while others are blocked. Apple's approach of a small number of big cores basically means they need to minimize memory latency, because they have neither SMT nor tons of siblings to cover their cache misses.
I'm not sure I buy Apple needing low latency. Sure, SMT helps, but overall I wouldn't expect it to make a big difference.
Originally posted by coder: BTW, it's possible to find heavily memory-bound workloads, but I've not generally seen them in my career. Even things I expected to benefit from more memory bandwidth/channels were hardly impacted. On desktop platforms, the main area you really see benefiting from better memory performance tends to be iGPUs. It's really not until desktop CPUs broke the 8-core barrier that we started to see significant memory bottlenecks on CPU-bound workloads.
Generally SPEC results are posted for the best-performance configurations. So the 7313P (16 cores with high bandwidth per core) published results using 32 copies, which obviously uses SMT. But the 7763 (64 cores with low bandwidth per core) published only a 64-copy result, presumably because SMT didn't help performance. I also noticed that the Epyc 7763 gets a base score of 311 and the 7313P a base score of 181. So AMD gets less than 2x the performance with 4x the cores.
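The scaling claim is easy to check from those base scores:

```python
# SPECrate base scores quoted above.
epyc_7763 = {"cores": 64, "base": 311}
epyc_7313p = {"cores": 16, "base": 181}

speedup = epyc_7763["base"] / epyc_7313p["base"]
core_ratio = epyc_7763["cores"] / epyc_7313p["cores"]
per_core = speedup / core_ratio

print(f"4x the cores -> {speedup:.2f}x the throughput")  # ~1.72x
print(f"per-core scaling efficiency: {per_core:.0%}")    # ~43%
```

The roughly 43% per-core efficiency is consistent with the bandwidth-per-core argument in this subthread: both chips share the same 8-channel memory subsystem, so quadrupling the cores quadruples contention for it.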
Originally posted by drakonas777: Perceived "horrible power efficiency" of x86 is partially caused by idiotic factory settings for the CPU/platform. I put my 3700X into 45W ECO mode, and it lost about 15% of its performance while consuming about half the power. You should remember that Intel/AMD strive for maximum performance per square millimetre of silicon, especially in the consumer market, for mere silicon economy and THE BENCHMARKS, of course.