Further Investigating The Raspberry Pi 32-bit vs. 64-bit Performance


  • #71
    Originally posted by vladpetric View Post

    Yeah, so geekbench is only good as a first order approximation. But for that it's fine.

    I don't mean to split hairs, but the aggressive power-saving tradeoffs in a mobile chip are probably deemed not worth the effort for a desktop (no-battery) application. After all, when the fridges and ACs consume thousands of watts, whether your CPU idles at 30 W or under a watt is not going to make a difference. In addition, a higher power budget allows for sustained near-peak performance (the resulting variation in mobile CPU performance is not typically caught by something like Geekbench, IIUC).
    Indeed, by definition benchmarks are approximations. The goal is just to show how large the gains in mobile chips have been over the last 5 years.

    There is definitely scope to significantly improve the power efficiency of PCs, and it is just as worthwhile as switching to LEDs for lighting or ensuring TVs draw 0.5W on standby. It doesn't move the needle much for any one user; however, if every PC in the world saved 50W during the working day, the total electricity saved would be about as much as e.g. the Netherlands or Belgium uses each year (or a third of the UK). I think we'll see servers move first since the gains are more obvious when you have many thousands of servers (like AWS).
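
    As a rough sanity check on that claim (assuming roughly a billion PCs in use, 8-hour working days, and a 250-day working year; all figures approximate):

        50 W × 8 h/day × 250 days ≈ 100 kWh per PC per year
        100 kWh × 10^9 PCs        ≈ 100 TWh per year

    which is indeed in the ballpark of the annual electricity consumption of the Netherlands (~110 TWh) or Belgium (~85 TWh), and about a third of the UK's (~300 TWh).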


    • #72
      Originally posted by PerformanceExpert View Post

      Indeed, by definition benchmarks are approximations. The goal is just to show how large the gains in mobile chips have been over the last 5 years.

      There is definitely scope to significantly improve the power efficiency of PCs, and it is just as worthwhile as switching to LEDs for lighting or ensuring TVs draw 0.5W on standby. It doesn't move the needle much for any one user; however, if every PC in the world saved 50W during the working day, the total electricity saved would be about as much as e.g. the Netherlands or Belgium uses each year (or a third of the UK). I think we'll see servers move first since the gains are more obvious when you have many thousands of servers (like AWS).
      To some degree, the shift to small form factors (with the associated energy savings) is happening already. Not for gaming computers, but for everybody else.

      It's not my intention to derail the conversation, but assuming you're North American, it'd be much better to improve insulation standards and paint buildings with highly reflective white paint. Sure, it's good to save tens of watts with a computer, but much better to save hundreds or thousands with building standards.


      • #73
        Originally posted by vladpetric View Post

        To some degree, the shift to small form factors (with the associated energy savings) is happening already. Not for gaming computers, but for everybody else.

        It's not my intention to derail the conversation, but assuming you're North American, it'd be much better to improve insulation standards and paint buildings with highly reflective white paint. Sure, it's good to save tens of watts with a computer, but much better to save hundreds or thousands with building standards.
        Yes, but the market is fairly small. There are some standard silent PCs, but they use slow cores and are way overpriced. I built my own SFF PC, and while it's as powerful as a big PC box, it is significantly smaller and quieter, so worth the premium.

        I'm not American, but improving building standards would be a good idea in much of the world. You can certainly make major savings with insulation, solar panels, etc.; however, it also costs a lot (especially when retrofitting old houses). So it is still worth doing all the cheap efficiency improvements, even if they seem small.


        • #74
          Originally posted by PerformanceExpert View Post

          I'm not American, but improving building standards would be a good idea in much of the world. You can certainly make major savings with insulation, solar panels, etc.; however, it also costs a lot (especially when retrofitting old houses). So it is still worth doing all the cheap efficiency improvements, even if they seem small.
          Believe me when I tell you that implementing EU-style regulations on construction and insulation (just catching up...) would bring huge benefits.
          Last edited by vladpetric; 19 February 2022, 11:56 PM.


          • #75
            Originally posted by PerformanceExpert View Post
            The simpler benchmarks show that the kernel and drivers use about twice the memory. While many kernel structures and page tables are obviously larger, it's hard to believe there are 31 million pointers (an extra 120 MBytes in 64-bit mode) in an idle kernel...
            I know it wouldn't fully explain the discrepancy, but maybe there are changes like the minimum granularity of heap allocations. Is there any chance the 64-bit kernel is also using larger pages?
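
            Neither of those is directly observable for the kernel itself from user space, but a minimal C sketch shows the two user-visible knobs in question (malloc_usable_size() is glibc-specific):

                #include <malloc.h>    /* malloc_usable_size(), glibc-specific */
                #include <stdio.h>
                #include <stdlib.h>
                #include <unistd.h>

                int main(void)
                {
                    /* Page size the kernel was built with: 4 KiB on most
                       configs, but some arm64 kernels use 16 KiB or 64 KiB. */
                    printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));

                    /* Minimum heap granularity: glibc rounds tiny requests up,
                       and the minimum chunk is larger on 64-bit than 32-bit. */
                    void *p = malloc(1);
                    printf("malloc(1) usable size: %zu bytes\n", malloc_usable_size(p));
                    free(p);
                    return 0;
                }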


            • #76
              Originally posted by vladpetric View Post
              Most of the time my surprise was quite negative, as daxpy loops are pretty rare. Though feel free to give me an example of successful vectorization of ARM benchmark code.
              Not speaking specifically about ARM, but I believe one reason compilers don't do a better job of auto-vectorization is the lack of information about which loops are the hot spots. With PGO, this should improve considerably. How much impact it'll have, though, is a question I haven't seen enough evidence to answer.

              That being said, I've done enough code optimization with "SIMD" extensions to know that properly vectorizing some things requires large-scale changes to the code structure or even algorithms. So, I think the gains are mostly limited to easier cases.
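
              For reference, the PGO round-trip with GCC looks like this (Clang's flags are similar; the file name and loop are made-up stand-ins, not anyone's benchmark):

                  /* pgo_demo.c - a hypothetical hot loop, to illustrate the round-trip:
                   *
                   *   gcc -O2 -fprofile-generate pgo_demo.c -o pgo_demo
                   *   ./pgo_demo        (run on representative input; writes .gcda files)
                   *   gcc -O2 -fprofile-use pgo_demo.c -o pgo_demo
                   *
                   * With the profile available, the compiler knows this loop dominates
                   * the runtime and can justify vectorizing/unrolling it aggressively. */
                  #include <stdio.h>

                  #define N 1000000
                  static float a[N], b[N];

                  int main(void)
                  {
                      float sum = 0.0f;
                      for (int i = 0; i < N; i++)   /* the profile marks this loop hot */
                          sum += a[i] * b[i];
                      printf("%f\n", sum);
                      return 0;
                  }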


              • #77
                Originally posted by coder View Post
                Not speaking specifically about ARM, but I believe one reason compilers don't do a better job of auto-vectorization is the lack of information about which loops are the hot spots. With PGO, this should improve considerably. How much impact it'll have, though, is a question I haven't seen enough evidence to answer.

                That being said, I've done enough code optimization with "SIMD" extensions to know that properly vectorizing some things requires large-scale changes to the code structure or even algorithms. So, I think the gains are mostly limited to easier cases.
                Typically the compiler treats the innermost loops as hot, and by and large that's not a bad heuristic for vectorization. Could the compiler end up over-vectorizing this way, i.e., vectorizing cold code? Sure, but honestly, who cares? The downsides of doing that are minimal.

                The issues, IMO, are as follows:

                1. It's really hard to vectorize loops with the x86 (SSE et al.) and ARM (NEON) vector instruction sets.

                2. Even if the compiler manages to vectorize a loop, it needs to prove that doing so is safe for every correct execution of the program. That primarily means ruling out loop-carried dependencies (i.e., does one iteration of the loop affect another?). That may seem like a tractable problem, but keep in mind that in C(++), in the general case, any pointer can alias any other pointer. Fortran doesn't have that problem (see the sketch after this list).

                And no, I'm not advocating for writing code in Fortran. Just saying that there's a better match between Fortran and the vectorized instructions.

                3. The cost of vectorization for integer code. Here, a bit of microarchitecture knowledge helps. It's best to think of vector instructions as running on an in-order machine (an abstraction/over-simplification, but for the most part true), while the generic integer instructions run dynamically scheduled (out of order) at superscalar width (typically 4 issued per cycle). Since the compiler needs to insert additional instructions to manipulate parts of vectors (e.g., shuffles, among others), it's fairly hard to come up with an integer vectorization scheme that beats the regular un-vectorized code.
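
                To make point 2 concrete, here is a minimal sketch using the daxpy loop mentioned earlier: with plain pointers the compiler must assume x and y might overlap, while restrict (or Fortran's no-alias guarantee on arguments) removes that obstacle.

                    /* daxpy: y = a*x + y, the canonical easy-to-vectorize loop.
                       Without 'restrict' the compiler must assume x and y may
                       overlap (a potential loop-carried dependence through
                       memory), so it either gives up on vectorizing or emits a
                       runtime overlap check. 'restrict' promises no aliasing,
                       the guarantee Fortran dummy arguments carry for free. */
                    void daxpy(int n, double a,
                               const double *restrict x, double *restrict y)
                    {
                        for (int i = 0; i < n; i++)
                            y[i] = a * x[i] + y[i];
                    }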
                Last edited by vladpetric; 20 February 2022, 12:36 PM.


                • #78
                  Originally posted by vladpetric View Post
                  The purpose of register renaming is to get rid of WAR and WAW false dependencies, so that you can effectively use a larger instruction window for dynamic (out of order) scheduling. Register renaming by itself does not address spills. That's an extension to register renaming which I proposed a while back https://repository.upenn.edu/cis_papers/217/ and AMD 3000 series seems to implement https://www.agner.org/forum/viewtopic.php?t=41
                  I've read that Apple does move-elimination at the rename stage, and probably a couple other optimizations you mentioned. If you haven't seen it, here's a compilation of what's known about their M1. I think you might find it very interesting:


                  Originally posted by vladpetric View Post
                  most of ARM designs are toy-ish (as in, not high performance computing,
                  Their Neoverse V-series cores are pitched at HPC applications.

                  Originally posted by vladpetric View Post
                  I doubt they would go that far in trying to extract performance.
                  Apple has shown there are big perf/W wins to be had from greater microarchitecture sophistication, so even non-HPC applications can benefit.

                  For ARM (proper), a lot of it comes down to perf/mm^2, since physical size affects costs and they have to cater to a wider range of markets than many other cores out there. Even their X-series cores are basically just A7xx cores with some enlarged structures (e.g. buffers, caches, windows, etc.). Still, I think competitive pressures will force them to go further in pushing IPC.


                  • #79
                    Originally posted by coder View Post
                    I've read that Apple does move-elimination at the rename stage, and probably a couple other optimizations you mentioned. If you haven't seen it, here's a compilation of what's known about their M1. I think you might find it very interesting:



                    Their Neoverse V-series cores are pitched at HPC applications.


                    Apple has shown there are big perf/W wins to be had from greater microarchitecture sophistication, so even non-HPC applications can benefit.

                    For ARM (proper), a lot of it comes down to perf/mm^2, since physical size affects costs and they have to cater to a wider range of markets than many other cores out there. Even their X-series cores are basically just A7xx cores with some enlarged structures (e.g. buffers, caches, windows, etc.). Still, I think competitive pressures will force them to go further in pushing IPC.
                    Thanks for the link, I'll definitely read it. Someone else explained why it's not worth doing zero-cycle renaming for spills (communicating store/load pairs on the stack): on 64-bit ARM, with 31 GPRs, spills are rare enough not to matter, while legacy 32-bit ARM (with 14 usable GPRs, and thus many more spills) is generally irrelevant, as apps are required to be built for 64-bit these days anyway. And I agree.

                    IPC can vary hugely between a processor like the one in an RPi (granted, those cores are pretty old, even in the RPi 4) and a good core design like the M1X (I have seen differences as high as 9x... yes, seriously). The problem is that, unlike clock frequency, IPC is not a constant (it depends on the code being run), and it's oftentimes hard to wrap one's head around.
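
                    The relationship itself is simple:

                        execution time = instruction count / (IPC × clock frequency)

                    With purely illustrative numbers (not measurements of any particular chip): a core sustaining 1 instruction/cycle at 1.5 GHz retires 1.5 billion instructions per second, while one sustaining 6/cycle at 3.2 GHz retires 19.2 billion, i.e. 12.8x faster on the same instruction stream despite barely a 2x clock difference.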


                    • #80
                      Originally posted by L_A_G View Post
                      I'm something of a neat freak who does things like loop unrolling without even thinking about it and my code vectorizes pretty well.
                      Loop unrolling is one of the worst things you can do to code from a readability and maintainability standpoint. For me, it's a last resort.

                      It also ties the compiler's hands somewhat. For instance, if your unrolled version requires too many registers, you could end up hurting IPC. Or, if there's spare capacity beyond what your unrolled version uses, that makes it harder for the compiler to adjust your loop to exploit that capacity without going overboard.

                      It definitely helps to understand how to unroll loops if you want the compiler to have a fighting chance of doing it for you. One thing I've started experimenting with is using __builtin_expect() on loop conditions, to give the compiler more of the hints it would tend to get from PGO. __builtin_unreachable() is another that I've been advised can facilitate certain compiler optimizations.
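
                      For example, a hypothetical function (GCC/Clang builtins; not from any real codebase):

                          #define likely(x)   __builtin_expect(!!(x), 1)
                          #define unlikely(x) __builtin_expect(!!(x), 0)

                          int sum_positive(const int *v, int n)
                          {
                              /* Assert (to the compiler) that n is a positive
                                 multiple of 4: if the condition below held,
                                 execution would be undefined, so the optimizer
                                 may assume it never does. That can enable
                                 unrolling/vectorizing without a scalar tail. */
                              if (n <= 0 || (n & 3) != 0)
                                  __builtin_unreachable();

                              int sum = 0;
                              for (int i = 0; i < n; i++) {
                                  if (unlikely(v[i] < 0))   /* hint: negatives are rare */
                                      continue;
                                  sum += v[i];
                              }
                              return sum;
                          }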


                      And, of course, learn when to use restrict!

                      Last edited by coder; 20 February 2022, 12:57 PM.
