Announcement

**herman** · 07 August 2020, 03:37 PM

Originally posted by vladpetric

No general disagreements with you - but I for one have been really disappointed with rpi4's low performance (including for my own benchmarks).

Given that Apple's a closed shop, I think that for an ARM revolution to take place, we need a great chip design that is not Apple's. I'm hoping that ARM themselves do that, but with so many ownership changes ... I'm not too hopeful. Of course, this is just me speculating.

Apple's most likely competitors will be Microsoft, Samsung, Qualcomm, Nvidia (if they buy ARM). Maybe even Intel will grow desperate and finally get back into the ARM business, haha. If Apple's laptops have the performance they are promising, then everyone will want to get in on the new ARM-based designs.

**vladpetric** · 07 August 2020, 03:57 PM

Originally posted by schmidtbag

Your comment is a bit of a joke:
ARM isn't built to be performant. It's built to be efficient. Compare performance-per-watt (for both idle and load) and suddenly, those Intel chips don't look so great. The G6400 is marketed as 58W, and we all know Intel under-estimates their TDP. Worst-case scenario, the entire RPi4 uses 15W. There's a reason Intel gave up in the mobile market - they weren't able to compete with ARM's power draw without being slower.
Also, clock speed is a big part of it. That's 1.5GHz against 4GHz (when looking at the G6400). That's more than twice the clock speed at 4x the power consumption.

What you're doing is like mocking an economy car because it isn't fast like a sports car or powerful like a truck, failing to understand that it wasn't built to do either.

Yet there is more than a 2.66x performance difference between the G6400 and the RPi4.

Also, the biggest part of power consumption is the dynamic part (resulting from switching). In a simplified form, that power is proportional to C * V^2 * f, where C is the dynamic power dissipation capacitance, V is supply voltage, and f is frequency. See page 4 of https://www.ti.com/lit/an/scaa035b/scaa035b.pdf

Increase the frequency and voltage of the RPi4 and things might not look so good. Though that's a hypothetical - there are hard limits as to how much you can overclock the RPi4.
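To make the C * V^2 * f relation concrete, here is a minimal Python sketch of how dynamic power would scale under that simplified model. The overclock figures (2.0 GHz and a 10% voltage increase) are hypothetical values picked for illustration, not measurements, and C is assumed to stay constant.

```python
def dynamic_power_ratio(v_scale: float, f_scale: float) -> float:
    """Relative change in dynamic power under P ~ C * V^2 * f.

    C (the switched capacitance) is assumed unchanged, so it cancels out.
    """
    return (v_scale ** 2) * f_scale


# Hypothetical overclock: 1.5 GHz -> 2.0 GHz with a 10% voltage increase.
ratio = dynamic_power_ratio(v_scale=1.10, f_scale=2.0 / 1.5)
print(f"dynamic power grows by roughly {ratio:.2f}x")
```

Because voltage enters squared, even a small voltage bump needed to hold a higher clock raises power faster than the clock gain alone would suggest.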

A good processor design can scale down quite easily. This is where your analogy breaks - a sports car won't be economical, and an economy car won't get you that acceleration and top speed. But with good processors you can have both. You can have a superfast processor that runs really fast at peak demand, but slows down when idle and doesn't kill your battery. Best example, IMO? Apple's ARM procs.

Finally, for performance per watt you need to have a third party measure both the performance and the Watts, and then publish the numbers. It's not something that one can do a back-of-the-envelope calculation for, with the TDP (a max value). Feel free to quote such numbers though, I'm actually quite curious.

**vladpetric** · 07 August 2020, 04:14 PM

Originally posted by atomsymbol

I believe more in the potential usefulness of JIT compilation (in a CPU) than in value prediction.

Predicting multiple branches per cycle isn't fundamentally different from predicting a single branch per cycle. A branch predictor can output 2+ bits per cycle instead of just 1 bit per cycle. Current CPUs (Zen, Skylake-derived) can predict two branches per cycle, although there are some weird irregularities to the implementation so it isn't actually a generic/universal two-branches-per-cycle predictor (at least on Zen, I didn't test Skylake).

Another reason why we don't have larger instruction windows, in addition to the reason you mentioned, is that a larger instruction window increases the probability of there being multiple memory access instructions in the window ("the memory wall"). Intel Sunny Cove has 4 AGUs and might be able (?) to sustain 2 memory reads and 2 memory writes per cycle. I am probably going to wait until Intel releases SunnyCove-derived desktop CPUs (maybe 2020-sep-02) and AMD releases Zen 3 before buying a new CPU.

I think it is probable that if memory renaming (MR) gets implemented in x86 CPUs somewhere in the future then a CPU will have to have just a couple of MR-capable very-large cores plus many smaller cores that are not MR-capable - and/or CPU cores will have to be stacked on top of each other.

Multiple writes to the same register R per cycle rename the register R - multiple writes to the same memory location L per cycle rename the memory location L (if L is in L1D cache).

I don't know an article about memory renaming.

Memory renaming does not make sense today, for example because a single instruction fetch window is too small to contain multiple iterations of the same loop.

About JITs - I attended a talk in 1997 in which Java people claimed that they were going to beat C++ speed by doing clever things with their JITs. Well, that didn't work at all ...

In order for JITs to take over the world, I think we'd need a paradigm shift of sorts (an Einstein-like person to figure out how to make JIT compilation much much better than it is today).

Predicting multiple branches per cycle is not the issue here. The issue is that when you have an instruction window of ~256 micro ops, what's the likelihood that the tail is on the right path? Keep in mind that the average number of instructions per basic block, in SPEC CPU benchmarks, is in the single digits (so there are plenty of branches).
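A rough way to put numbers on that question is sketched below in Python. It rests on several simplifying assumptions that are not from the post: one branch per basic block, independent predictions, micro ops treated as instructions, and a hypothetical average block size of 6. The prediction accuracies are also illustrative.

```python
def tail_on_right_path(window_uops: int, block_size: float, accuracy: float) -> float:
    """Probability that every branch in the window was predicted correctly.

    Assumes one branch per basic block and independent predictions.
    """
    branches_in_window = window_uops / block_size
    return accuracy ** branches_in_window


# Hypothetical per-branch accuracies; block_size=6 is an assumed single-digit value.
for accuracy in (0.95, 0.99, 0.999):
    p = tail_on_right_path(window_uops=256, block_size=6, accuracy=accuracy)
    print(f"accuracy {accuracy:.3f}: tail on right path with probability {p:.2f}")
```

Under these assumptions, the chance that the tail of the window is useful work drops quickly as the window grows, which is the point about larger windows made above.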

The memory wall is also about the latency of memory versus the CPU core: roughly on the order of 100 ns these days, whereas a processor can easily run at 4 GHz, which means about 400 CPU cycles to get to RAM. And you can have multiple memory accesses in the window - the lower-level caches (and load/store queues) can handle those well.
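The cycle count follows directly from the two figures quoted; a small Python sketch of the conversion (the function name is made up for illustration):

```python
def latency_in_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Convert a memory latency in nanoseconds to core clock cycles.

    One GHz is one cycle per nanosecond, so the units cancel directly.
    """
    return latency_ns * clock_ghz


print(latency_in_cycles(latency_ns=100, clock_ghz=4.0))  # 400.0
```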

Finally, for renaming - we rename registers because there are few of them and they get reused (by necessity). As such, there are many WAW and WAR false dependencies (unlike the RAW real dependencies). I really don't think that there are enough WAWs and WARs in memory for memory renaming to be worth it.
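As an illustration of why register reuse creates false dependencies and how renaming removes them, here is a toy Python sketch. It is not how any real core implements renaming: it assumes an unlimited pool of physical registers, and the four-instruction program and the register names are invented for the example.

```python
from itertools import count


def rename(program):
    """Rename (dest, src1, src2) tuples so every write gets a fresh physical register."""
    fresh = count()
    mapping = {}  # architectural register -> current physical register

    def lookup(reg):
        if reg not in mapping:
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    renamed = []
    for dest, *srcs in program:
        phys_srcs = [lookup(s) for s in srcs]  # read the current mapping first: RAW is preserved
        mapping[dest] = f"p{next(fresh)}"      # each write gets a new physical register
        renamed.append((mapping[dest], *phys_srcs))
    return renamed


def false_deps(program):
    """Count (WAW, WAR) pairs between an instruction and any later one."""
    waw = war = 0
    for i, (dest_i, *srcs_i) in enumerate(program):
        for dest_j, *_ in program[i + 1:]:
            waw += dest_j == dest_i   # later write to the same register
            war += dest_j in srcs_i   # later write to a register read earlier
    return waw, war


# Invented example: r1 is reused as a destination, creating false dependencies.
program = [("r1", "r2", "r3"), ("r4", "r1", "r5"), ("r1", "r6", "r7"), ("r8", "r1", "r4")]
print(false_deps(program))          # (1, 1)
print(false_deps(rename(program)))  # (0, 0)
```

The false dependencies here exist only because r1 is written twice. Memory has far more locations than the register file has names, which is the intuition behind expecting fewer WAW and WAR hazards there.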

There is actually one situation where you might want to rename memory - the stack. Well, that can be done purely at register level though (and yes, I honestly hope it gets done, for reasons which should be obvious if you skim through the following):

https://iscaconf.org/isca2005/papers/02B-03.PDF

**CommunityMember** · 07 August 2020, 04:36 PM

Originally posted by herman

Maybe even Intel will grow desperate and finally get back into the ARM business, haha.

Intel is known to have an ARM (32-bit) architectural license. The rumor is that they do not have a 64-bit license, but we don't actually know that (all we really know for sure is that the last announced 64-bit architectural license was not Intel, which says nothing about some previous 64-bit license, or some un-announced license).

**NotMine999** · 07 August 2020, 04:43 PM

Originally posted by Baguy

How about pitting the Pi against a 6W Pentium N5000 (basically a slightly better Atom) or a Cherry Trail chip?

I would like to see RPi tested against any of the quad core processors from the Intel "Gemini Lake" series.

You mentioned N5000, which could be hard to find, but I am thinking a J5005, J4105, and just for fun, a J4005.

Yes, they are "refreshed" parts with slightly better clock speeds compared to N5000, N4100, and N4000. Still, I think the newer J-series parts are more likely available as SoCs on motherboards in the marketplace.

ASRock builds a few models of boards with the J-series Gemini Lake SoCs. I own a few and use one daily for Kodi @1920x1080, but with an add-on video card, because the Intel graphics have always disappointed me when playing back full-motion 1920x1080x30 and 1920x1080x60 video - YMMV. I would not play games using a SoC like these, but I think they would be quite adequate for typical office tasks.

**starshipeleven** · 07 August 2020, 05:02 PM

I assume that the SD card did affect the performance. With the latest firmware updates, the Raspi can boot from USB 3.0, so you can store the OS on an SSD, which is better on all metrics (and also not prone to random corruption like SD cards used as OS drives).

**edwaleni** · 07 August 2020, 05:21 PM

Originally posted by hotaru

why run 32-bit on the Pi 4 instead of 64-bit? 64-bit is faster for a lot of workloads due to having more registers.

I was going to ask the same thing. An ARM64 OS is faster on the Pi 4 for certain operations with more than 1 GB of RAM.

**CochainComplex** · 07 August 2020, 05:32 PM

Ok. Intel needs to get benchmarked against Raspi to keep on winning.... : )

**cjcox** · 07 August 2020, 05:34 PM

Originally posted by schmidtbag

ARM isn't built to be performant. It's built to be efficient. Compare performance-per-watt (for both idle and load) and suddenly, those Intel chips don't look so great. The G6400 is marketed as 58W, and we all know Intel under-estimates their TDP. Worst-case scenario, the entire RPi4 uses 15W. There's a reason Intel gave up in the mobile market - they weren't able to compete with ARM's power draw without being slower.
Also, clock speed is a big part of it. That's 1.5GHz against 4GHz (when looking at the G6400). That's more than twice the clock speed at 4x the power consumption.

What you're doing is like mocking an economy car because it isn't fast like a sports car or powerful like a truck, failing to understand that it wasn't built to do either.

Excellent points. I mean everyone talks about how evil crazy great their higher end Ryzen is, but .... you know sometimes having an iGPU isn't a bad thing. That is to say, there can be more than one reason for making part decisions.

**schmidtbag** · 07 August 2020, 06:28 PM

Originally posted by vladpetric

Yet there is more than a 2.66x performance difference between the G6400 and the RPi4.

And how is that so bad when you consider the advantages the RPi4 has?

Increase the frequency and voltage of the RPi4 and things might not look so good. Though that's a hypothetical - there are hard limits as to how much you can overclock the RPi4.

You're right, ARM chips aren't known for overclocking well. But you're not supposed to push them to the performance of desktop x86 chips. Using the economy car example again, there's only so much a turbo or supercharger is going to do to give that car enough power without blowing it up.
Nobody in their right mind would buy an ARM chip in the hopes of having competitive processing power with a laptop or desktop x86 CPU. You buy ARM because it sips power with reasonable performance. I don't get why you're so focused on performance when that's not the point of going for ARM.

A good processor design can scale down quite easily. This is where your analogy breaks - a sports car won't be economical, and an economy car won't get you that acceleration and top speed. But with good processors you can have both. You can have a superfast processor that runs really fast at peak demand, but slows down when idle and doesn't kill your battery. Best example, IMO? Apple's ARM procs.

There is no such thing as a one-size-fits-all approach, not even in processors. So no, actually, my analogy works fine. As I already told you - Intel tried to compete with ARM, but x86 downscales terribly once you get below 20W. Not even AMD is truly competitive in this segment yet, and their design so far is more efficient than Intel's. Your example of a "superfast processor" is not necessary, viable, or practical in ARM's target markets (like phones, routers, or web servers).
Apple's architecture works well because of their own added instructions and a LOT of OS-level optimizations. Since they control the platform, they don't have to do generic builds of anything; they can fine-tune things with mediocre hardware. I'm sure Apple will be adding more cores to their desktop ARM CPUs instead of more MHz.

Finally, for performance per watt you need to have a third party measure both the performance and the Watts, and then publish the numbers. It's not something that one can do a back-of-the-envelope calculation for, with the TDP (a max value). Feel free to quote such numbers though, I'm actually quite curious.

Michael has done tests in the past showing performance-per-watt (maybe not with ARM+x86 in the same article, but you can get the data from both) using an external watt meter. But, many ARM SBC users get 15W power bricks and they work perfectly fine. If you don't overclock or use something power-hungry like wifi or a USB HDD, you can easily get by with a 10W adapter. I personally have powered a system with a 5W adapter, but all it had plugged into it was a keyboard, mouse, and ethernet.

How A Raspberry Pi 4 Performs Against Intel's Latest Celeron, Pentium CPUs
