AMD Ryzen 7 5800X3D On Linux: Not For Gaming, But Very Exciting For Other Workloads

  • #51
    Originally posted by willmore View Post

    Well, yes, that's exactly it. This particular program just doesn't need much cache as it's designed to work well on very small processors without any cache. Its working set probably fits in the L1 of most desktop processors these days. Certainly within the L2. If anything the 5800x3d will run this slower (because of clocks and increased L3 latency)--which is completely expected.
    Yeah, I wasn't familiar with the inner workings of LZ4 so I had to look it up. A dictionary of only 64 KB won't get much use out of those extra 64 MB of L3, and as you say, the lower frequency and higher L3 latency mean it will run slower.

    What OP doesn't seem to understand is that if LZ4 had been designed specifically for this CPU, both speed and compression ratio could have been much better than what we have now. However, it would then have performed abysmally on the most common systems out there, rendering it quite useless for the use case it was developed for.
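
    For reference, LZ4's block format encodes matches as (offset, length) pairs whose offsets reach at most 64 KB back into the stream, so the compressor's hot data is basically that recent window plus a small hash table (around 16 KB at the default memory setting, if I remember right). A minimal round-trip sketch with the public block API from lz4.h, assuming liblz4 is installed (the buffer size and toy input are just picks for illustration):

    /* LZ4 block API round trip: compress 1 MiB of repetitive data and verify it.
     * Build (assuming liblz4-dev is installed): cc lz4_demo.c -llz4 */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <lz4.h>

    int main(void)
    {
        const int src_size = 1 << 20;            /* 1 MiB of compressible input */
        char *src = malloc(src_size);
        for (int i = 0; i < src_size; i++)
            src[i] = "ABCDABCDABCD"[i % 12];     /* trivially repetitive data */

        int bound = LZ4_compressBound(src_size); /* worst-case compressed size */
        char *dst = malloc(bound);
        int csize = LZ4_compress_default(src, dst, src_size, bound);
        if (csize <= 0) { fprintf(stderr, "compression failed\n"); return 1; }

        char *back = malloc(src_size);
        int dsize = LZ4_decompress_safe(dst, back, csize, src_size);

        printf("in=%d compressed=%d round-trip ok=%d\n", src_size, csize,
               dsize == src_size && memcmp(src, back, src_size) == 0);
        free(src); free(dst); free(back);
        return 0;
    }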

    Comment


    • #52
      Originally posted by atomsymbol

      In my opinion, it is inevitable that stacking of "chiplets" on top of each other will become common practice sometime in the future. If this is the case then the 5800X3D is a precursor/test of things that are bound to happen. It is possible that a future "chiplet" won't contain any L3 cache and all of the L3 cache will be stacked on top of a "cores+L1+L2 chiplet". With GAA (gate-all-around) transistors, stacking [multiple transistors sharing a common gate] on top of each other is already happening. There are multiple options for where 3D stacking can be utilized in the future, both at the microscopic level and at the macroscopic level, and it is hard to predict which option will be the most cost-effective.
      Like the IBM Telum, which builds virtual L3 and L4 caches out of the cores' shared L2s.

      Comment


      • #53
        Originally posted by atomsymbol

        I was unable to find data about the actual/effective size of L2/L3/L4 Telum caches in benchmarks (1) utilizing a single Telum CPU core and (2) utilizing all 8 Telum CPU cores. Do you happen to have a link to such an article or a video?
        I don't think you will be able to find a single benchmark for a piece of hardware like this; no IT magazine is interested, and no independent reviewer has the money to purchase such a beast. The sizes are public though: each core has 256 KB of L1 and 32 MB of L2. Each CPU has 8 cores, and the L2 of each core can be merged into a 256 MB virtual L3 per CPU. And 8 CPUs can interact to create a 2 GB virtual L4 cache.

        A virtual L4 is most likely completely out of the question for commodity hardware like Intel or AMD, but interconnecting the L2 caches internally in a CPU to create a virtual L3 should be possible, unless IBM sits on some patents here, of course.
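
        For what it's worth, on ordinary Linux boxes you can at least see what cache topology the kernel reports (levels, sizes, and which CPUs share each cache) under sysfs, e.g. the 5800X3D's 96 MB L3 shows up as shared by all cores in the CCD. A rough sketch that dumps what cpu0 sees; the paths are standard sysfs, but the exact fields can vary by kernel and architecture:

        /* Print the cache hierarchy Linux reports for cpu0 via sysfs.
         * Build with: cc caches.c -o caches */
        #include <stdio.h>
        #include <string.h>

        static int read_field(int idx, const char *field, char *out, size_t cap)
        {
            char path[256];
            snprintf(path, sizeof(path),
                     "/sys/devices/system/cpu/cpu0/cache/index%d/%s", idx, field);
            FILE *f = fopen(path, "r");
            if (!f || !fgets(out, (int)cap, f)) { if (f) fclose(f); return -1; }
            out[strcspn(out, "\n")] = '\0';      /* strip the trailing newline */
            fclose(f);
            return 0;
        }

        int main(void)
        {
            char level[16], type[32], size[32], shared[128];
            for (int idx = 0; ; idx++) {
                if (read_field(idx, "level", level, sizeof(level)) != 0)
                    break;                       /* no more cache indices */
                read_field(idx, "type", type, sizeof(type));
                read_field(idx, "size", size, sizeof(size));
                read_field(idx, "shared_cpu_list", shared, sizeof(shared));
                printf("L%s %-12s %-8s shared by CPUs %s\n", level, type, size, shared);
            }
            return 0;
        }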

        Comment


        • #54
          Originally posted by willmore View Post
          This particular program just doesn't need much cache as it's designed to work well on very small processors without any cache.
          Kinda hate to nitpick this, but that's exactly wrong.

          In a cacheless system all memory accesses are equally performant, so a dictionary of anything up to the entirety of free RAM is just as fast as a tiny one. (Other than the cost of searching it for the longest match).
          lz4 is fast *because* (among other things) the dictionary isn't large, but "designed for cacheless" is actually the opposite of that.

          Comment


          • #55
            Originally posted by arQon View Post

            Kinda hate to nitpick this, but that's exactly wrong.

            In a cacheless system all memory accesses are equally performant, so a dictionary of anything up to the entirety of free RAM is just as fast as a tiny one. (Other than the cost of searching it for the longest match).
            lz4 is fast *because* (among other things) the dictionary isn't large, but "designed for cacheless" is actually the opposite of that.
            LZ4 doesn't use a dictionary--it only makes references to the input stream and the output stream. And yes, of course, those references benefit from some caching, as almost everything does, but it hits the point where more cache stops speeding it up much sooner than a program with a larger working set would. For one, only a limited window of the input and output buffers gets referenced in typical encoded data, and each spot is touched only a small number of times--so even if the second reference were free, there's far less speedup than for data that gets accessed many times.

            Which is why I say it's designed to work well on small cacheless processors--the cache doesn't speed things up much. Now an I-cache would, but I was thinking in terms of the D-cache. Even your tiny 32-bit microprocessor probably has some form of I-cache, simply because flash isn't fast enough in cycle time to keep up with even an 80 MHz core; they always bank the flash or make reads wider than one instruction. So finding a truly 'cacheless' core these days is really difficult.

            I'm an engineer, not a straight-up 'scientist', so I'm used to dealing with 'nearly 0' and 'almost infinite' cases; this didn't bother me. Sorry to upset the pedants.
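
            That working-set cliff is easy to see with a simple pointer-chase microbenchmark: the average latency per dependent load jumps each time the chain outgrows a cache level, and below that size extra cache buys nothing. A rough sketch (buffer sizes and iteration count are arbitrary picks, not tuned for any particular CPU):

            /* Pointer-chase latency sketch: walks a random cycle through buffers of
             * increasing size and prints the average nanoseconds per dependent load.
             * Build with: cc -O2 chase.c -o chase */
            #define _POSIX_C_SOURCE 199309L
            #include <stdio.h>
            #include <stdlib.h>
            #include <time.h>

            struct node { size_t next; char pad[56]; };  /* one node per 64-byte line */

            static double chase_ns(size_t bytes, size_t iters)
            {
                size_t n = bytes / sizeof(struct node);
                struct node *a = malloc(n * sizeof(struct node));
                if (!a) return 0.0;

                /* Sattolo's algorithm: a random permutation that is a single cycle,
                 * so every step is a dependent load the prefetcher can't predict. */
                for (size_t i = 0; i < n; i++)
                    a[i].next = i;
                for (size_t i = n - 1; i > 0; i--) {
                    size_t j = (size_t)rand() % i;
                    size_t t = a[i].next; a[i].next = a[j].next; a[j].next = t;
                }

                struct timespec t0, t1;
                size_t idx = 0;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                for (size_t i = 0; i < iters; i++)
                    idx = a[idx].next;           /* serialized: each load needs the last */
                clock_gettime(CLOCK_MONOTONIC, &t1);

                volatile size_t sink = idx;      /* keep the loop from being optimized out */
                (void)sink;
                free(a);

                double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
                return ns / (double)iters;
            }

            int main(void)
            {
                /* Working-set sizes straddling typical L1/L2/L3 capacities. */
                size_t sizes_kb[] = { 16, 256, 2048, 16384, 65536, 131072 };
                for (size_t i = 0; i < sizeof(sizes_kb) / sizeof(sizes_kb[0]); i++) {
                    size_t bytes = sizes_kb[i] * 1024;
                    printf("%8zu KB : %6.2f ns/access\n",
                           sizes_kb[i], chase_ns(bytes, 20 * 1000 * 1000));
                }
                return 0;
            }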

            Comment


            • #56
              Originally posted by ddriver View Post
              And now do a 16-core 192 MB cache 5999X - AM4 can handle it! Even with only dual memory channels, this thing will be a beast for HPC - $1000 seems a fair price.
              Not happening. Wait for Zen 4 on AM5, which should have a 16-core model with some amount of V-Cache... eventually. And it will be much more of a beast for HPC, with PCIe 5.0, a few more lanes, DDR5, etc.

              Comment


              • #57
                Originally posted by ddriver View Post
                And now do a 16-core 192 MB cache 5999X - AM4 can handle it! Even with only dual memory channels, this thing will be a beast for HPC - $1000 seems a fair price.
                Originally posted by jaxa View Post
                Not happening.
                Even if AMD could have wedged 2x96 MB into a 5950X3D AM4 CPU within the 105 W power/heat budget while maintaining high clocks ... it would step on 12/16-core Threadripper territory, and even fight the lowest-end EPYCs, so that's a no-go in the product planning, I'd guess. I'm also guessing the 5800X3D is going to be quite a limited production run, but I could be wrong.

                Both companies' top half-dozen CPUs are plenty for gaming, imho. To me, the main story for the 5800X3D is the better minimums/latency that a larger cache can afford; that's why I bought one.

                Comment
