JEDEC Publishes DDR5 Standard - Launching At 4.8 Gbps, Better Power Efficiency


  • #51
    Originally posted by sdack View Post
    No, you've got it backwards. CPUs need caches, because the memory cannot keep up. This is why you're seeing L1, L2 and L3 caches. The A64FX, which is just an Arm CPU, but one with 48 cores, no longer needs an L3 cache. Getting rid of the L3 is what allows the chip to have so many cores. The A64FX uses 10B transistors, which is about as many as the AMD Ryzen 9. Why bother with improving cache design when you can throw some of it out and be rid of it? So imagine a Ryzen 9 without needing an L3 cache and instead having more cores ...
    You do realize you're just re-phrasing what I said? CPUs use SRAM-based caches because SRAM is far superior to DRAM in terms of bandwidth and particularly latency, giving them a big boost in terms of performance. It's superior, but much more expensive so it's not used as a cure-all, but rather an on-die performance booster for use cases where it's beneficial.

    As for big L3 caches, they're not going anywhere in general purpose CPUs. AMD's most recent Threadripper and Epyc CPUs have as much as 256MB of L3 cache. While the A64FX is technically a general purpose CPU, it's actually meant for the kinds of compute jobs where you process large amounts of sequentially stored data and that can be done very well on GPUs, accelerator cards like Xeon Phi's and various kinds of MAC ASICs like currently in-vogue neural net accelerators. This hardware doesn't use big SRAM caches anyway so general purpose CPUs adapted to perform the same jobs aren't going to have them either.

    What's good for hardware that does heavy compute jobs for things like physical simulations, neural net learning and processing, etc. isn't always going to be any good for hardware that does everyday computing. Unlike in the past, when things like caches, instruction level parallelism, branch prediction, TLBs, vector instruction units and so on would eventually filter through to consumer hardware and improve it, that's just not the case when compute hardware is improved by tailoring it for workloads that aren't representative of what consumer hardware spends its days doing.

    I did my master's thesis in scientific compute and wrote my project in CUDA on GPUs, so I know how heavily tailored the code and the hardware are to achieve the kinds of results that people demand for these jobs today. A GPU is terrible for the kinds of things a general purpose CPU is used for, and the same, to an increasing extent, applies to CPUs adapted to perform the same jobs as GPUs.



    • #52
      Originally posted by L_A_G View Post
      AMD's most recent Threadripper and Epyc CPUs have as much as 256MB of L3 cache.
      You have ever-growing caches on CPU dies on one side, which could be used for more cores, and external storage such as NVMe SSDs approaching speeds of several GB/sec on the other side. Do give this a second thought. When you cannot see how that is a bad sign and that it is time for a new main memory design, then you've wasted your time on your master's.
      Last edited by sdack; 16 July 2020, 12:17 PM.



      • #53
        Originally posted by pal666 View Post
        that would make hbm latency lower than l3, which surely isn't the case
        Yes, I know. It is however the only data that I can find so far. I am aware that with its design and its sustained transfer rates it doesn't show any signs of latency. It simply delivers data in 1ns, and yes, this is faster than L3 caches. I've been trying to find exact latency specifications for HBM, but I haven't found any. All I keep finding is its bandwidth and that it uses a much simpler design, requiring simpler electronics and a much lower frequency. So I do believe they've kept it very simple, made it very fast and just did not put much electronics in between, which would cause it to have a high latency.

        Caches on the other hand need additional logic for their associativity and their validation process (hit-or-miss checks), which creates latency, and you don't have this with HBM, because it is your main memory. HBM also needs to sit very close to the CPU die on an interposer, and by that alone it becomes a very strong competitor to the on-die caches.

        I would find it odd if they went for just 500MHz, but then introduced latency. So while I wish I could give you exact latency numbers, an affirmative nod, or anything more specific, I currently do not believe that there is any latency to be found with HBM.
        Last edited by sdack; 16 July 2020, 01:08 PM.



        • #54
          Originally posted by sdack View Post
          You have ever-growing caches on CPU dies on one side, which could be used for more cores, and external storage such as NVMe SSDs approaching speeds of several GB/sec on the other side. Do give this a second thought. When you cannot see how that is a bad sign and that it is time for a new main memory design, then you've wasted your time on your master's.
          The reason CPU makers like Intel and AMD use ever bigger on-die SRAM caches is that it yields results in the use cases that general purpose CPUs are put to. Applications may have become more heavily multithreaded than before, but there's a limit to how far those use cases can be multi-threaded, and they suffer from heavily diminishing returns once you go beyond about 6 cores.

          When talking about latency for CPUs, the metric used is cycles, and when operating out of registers latency is effectively zero. When you move into the caches you start at a few cycles and go up to a couple of dozen cycles. Once you leave the caches, things get positively slow in a hurry: even with TLBs speeding up virtual memory translation, access to DRAM is going to take in excess of 100 cycles, and can take over 300 if you have a TLB miss. After that, once you move into non-volatile memory, things get very slow indeed, even with fast flash memory. Out-of-order execution exists to fill the scheduling "bubbles" this causes, but even that relies on a very high cache hit-rate to function effectively.

          To put it bluntly: what matters most to a general purpose CPU is how many cycles an access takes, and in that regard SRAM beats everything else hands down. DRAM is more or less an order of magnitude slower, while non-volatile memory like flash is, even at the best of times, again an order of magnitude slower. How many GB/s a type of memory can pull from consecutively stored data (that's what the advertising figures are, as it's when they're at their absolute best) doesn't really matter when a CPU will mostly be pulling 64-byte cache lines' worth of data semi-randomly from all across memory.

          Modern CPU design has heavily utilized performance modelling, with very accurate models and real-world representative workloads informing design decisions, ever since the research that led to the first RISC CPUs back in the 1980s. Cache setups, not just the sizes of the different levels but also level-specific designs (on Zen the L2 has a complex 8-way set associative setup while the L3 has a much simpler "victim cache" setup), are very heavily informed by modelling that allows the manufacturer to get the best performance for realistic workloads under the available silicon budget.

          That may have been a bit bewildering to read, but I will point out that I have a degree in computer engineering, (again) wrote my thesis in scientific compute and it's pretty much what I work with day-to-day.



          • #55
            Originally posted by L_A_G View Post
            The reason why ...
            No, you only refuse to believe it because it would blow your mind. You cannot imagine, and therefore cannot accept, how the industry could abandon an established design, replace it with a much simpler one, and end up breaking down barriers. You're just stuck in your beliefs.



            • #56
              Originally posted by sdack View Post
              No, you only refuse to believe it because it would blow your mind. You cannot imagine, and therefore cannot accept, how the industry could abandon an established design, replace it with a much simpler one, and end up breaking down barriers. You're just stuck in your beliefs.
              Umm... It's an established design based on proven computer modelling and real world performance. I gave you a detailed technical explanation and you didn't even try to counter it, instead choosing to try and counter with what can, at best, be described as wishful thinking.

              This is on the level of crowdfunded crap that defies basic physics, engineering and even thermodynamics, like Fontus*, Waterseer** and Solar Roadways***.

              *A "self-filling water bottle" which is, in effect, just a portable dehumidifier
              **Same thing, except as a bigger fixed installation; it's literally an off-the-shelf dehumidifier with a filter added, because the water those things collect is so filthy that manufacturers expressly recommend against drinking it
              ***Probably best explained by Dave Jones



              • #57
                Originally posted by L_A_G View Post
                I gave you a detailed technical explanation ...
                No, what you gave me is the view of somebody who doesn't want to leave his comfort zone, but would rather hold on to his beliefs.

                Don't worry, the change won't happen fast. You'll get enough time to adjust.

                "640KB of RAM is enough. RISC is too primitive. Flash memory is too slow." ... Every decade has people like you.
                Last edited by sdack; 16 July 2020, 03:29 PM.



                • #58
                  Originally posted by sdack View Post
                  No, what you gave me is the view of somebody who doesn't want to leave his comfort zone, but would rather hold on to his beliefs.
                  I explained the clear engineering reasons why things are done the way they are, and you act as if something other than engineering drives the design of these things. You don't design this kind of hardware based on what you think could be cool; you design it based on what actually works, and then on what works best. In this case that means SRAM-based caches, because they're what produces the best results.

                  Don't worry, the change won't happen fast. You'll get enough time to adjust.
                  Sure, looking at how the U.S. and much of the world have been going for the last decade, it's obvious we're heading down the road of the movie Idiocracy, but that's hardly a good thing unless you despise intelligence and intelligent decision-making.



                  • #59
                    Originally posted by L_A_G View Post
                    I explained the clear engineering reasons why things are done the way they are ...
                    That's exactly what you did. Your failure to understand why it needs to change, and how change happens, is the reason why you can only explain how things are.



                    • #60
                      Originally posted by sdack View Post
                      That's exactly what you did. Your failure to understand why it needs to change, and how change happens, is the reason why you can only explain how things are.
                      You still haven't made a technical argument as to why it would be better, while I've explained thoroughly why it would be worse. Your arguments, if you can even call them that, are just lame attempts at fearmongering, the same kind of idiocy that's used to sell scams and dumb ideas like Brexit and most of Trump's policies.

