Further Investigating The Raspberry Pi 32-bit vs. 64-bit Performance


  • #81
    Originally posted by coder View Post
    Loop-unrolling is one of the worst things you can do to code, from a readability and maintainability standpoint. For me, it's a last-resort.

    It also ties the compiler's hands, somewhat. For instance, if your unrolled version requires too many registers, then you could end up hurting IPC. Or, if there's spare capacity beyond what your unrolled version uses, that makes it harder for the compiler to adjust your loop to exploit that capacity without going overboard.

    It definitely helps to understand how to unroll loops, if you want the compiler to have a fighting chance to do it for you. One thing I've started experimenting with is using __builtin_expect() on loop conditions, to give the compiler hints like those it would get from PGO. __builtin_unreachable() is another I've been advised can facilitate certain compiler optimizations.
    And, of course, use restrict!
    Yeah, for x86 (with 15 gprs) in particular register spills matter a lot. Though maybe I wouldn't go as far as saying that it's the worst optimization. I think that there are plenty of cases where the compiler unrolls without causing additional spills, and that can be totally fine.

    I've seen people, coming primarily from a big Chicagoan hedge fund, who proclaim that inlining is the be-all and end-all of optimization. Yet over-inlining (especially of error handling code, which essentially never gets called) leads to lots of register spills. Typically over-inlining is done by hand, though (people abusing the inline keyword and also the force-inline attribute).

    Finally, do you have a good explanation for the semantics of __builtin_unreachable()? Semantically, and separately from an optimization perspective if you know, is there a difference between that and #if 0? (much appreciated btw)
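    For concreteness, here's a minimal sketch of the restrict / __builtin_expect hints from the quote above (hypothetical function, GCC/Clang-specific builtins, not from the original posts):

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Illustrative scale-and-add loop: restrict promises dst and src don't
     * alias, and __builtin_expect marks the loop condition as likely-taken,
     * similar to what the compiler would infer from PGO. */
    static void scale_add(float *restrict dst, const float *restrict src,
                          float k, size_t n)
    {
        for (size_t i = 0; __builtin_expect(i < n, 1); i++)
            dst[i] = src[i] * k + dst[i];
    }

    int main(void)
    {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        scale_add(a, b, 2.0f, 4);
        assert(a[0] == 21.0f && a[3] == 84.0f);
        return 0;
    }
    ```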
    Last edited by vladpetric; 20 February 2022, 01:23 PM.

    Comment


    • #82
      Originally posted by vladpetric View Post
      I don't have a problem with that at all, assuming that you are correct about ARM 32 being obsolete. Are you implying that most of the apps on my phone and Android 10 system code are all 64 bits? It's a recently bought Android 10.
      ARM is phasing out AArch32. Of their new ARMv9 cores, only the A710 still supports it. And even that was allegedly done mostly for the Chinese market.

      All the way back in 2019, the Google Play Store stopped accepting 32-bit only apps.

      Comment


      • #83
        Originally posted by PerformanceExpert View Post
        if each PC in the world draws 50 W less during the working day, then the total electricity saved is as much as e.g. the Netherlands or Belgium uses each year (or a third of what the UK uses).
        A lot of that isn't even the CPU. Anandtech is showing Ryzen 5950X with package power of < 20 W @ idle. Lower-end CPUs and their APUs will be even lower.


        But the typical PC user is most likely using a laptop of some sort, which obviously idles in the sub-10W range.

        BTW, it's hard to talk about wasted electricity without mentioning crypto.
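        As a rough sanity check of the quoted figure, with assumed round numbers (~1 billion PCs in active use, ~2000 working hours per year):

        ```c
        #include <assert.h>

        /* Back-of-envelope check: 1e9 PCs * 50 W * 2000 h/yr, converted to TWh.
         * All inputs here are assumptions for illustration, not measured data. */
        int main(void)
        {
            double pcs            = 1e9;    /* assumed PCs in active use */
            double watts_saved    = 50.0;
            double hours_per_year = 2000.0; /* ~250 working days * 8 h */

            double twh = pcs * watts_saved * hours_per_year / 1e12; /* Wh -> TWh */
            assert(twh == 100.0);
            return 0;
        }
        ```

        ~100 TWh/yr is indeed on the order of a small European country's annual electricity consumption, so the quoted comparison is at least arithmetically plausible.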

        Comment


        • #84
          Originally posted by PerformanceExpert View Post
          Yes, but the market is fairly small. There are some standard silent PCs but they use slow cores and are way overpriced. I built my own SFF PC and while as powerful as a big PC box, it is significantly smaller and quieter, so worth the premium.
          NUCs and mini-PCs are gaining a lot of popularity among business users. Less so for home users, but they're an option that didn't really exist 10 years ago. If we had more mini-PCs with M1-class horsepower, it'd probably put a serious dent in the desktop market.

          Comment


          • #85
            Originally posted by coder View Post
            A lot of that isn't even the CPU. Anandtech is showing Ryzen 5950X with package power of < 20 W @ idle. Lower-end CPUs and their APUs will be even lower.


            But the typical PC user is most likely using a laptop of some sort, which obviously idles in the sub-10W range.

            BTW, it's hard to talk about wasted electricity without mentioning crypto.
            My 3700X idles at 40W (at socket) which is apparently as good as it gets for a desktop. Many articles show 50-70W idle for Zen 2 and there are ones that are almost 100W idle. This is outrageously high since laptops can do many times better.

            Crypto is a whole other matter of course, but so are 300+W video cards that are only marginally faster than a 200W version. There is a lot of wasted energy in the PC industry.

            Comment


            • #86
              Originally posted by vladpetric View Post
              one needs to keep in mind that in C(++) in the general case any pointer can alias with any other pointer. With Fortran you don't have that problem.

              And no, I'm not advocating for writing code in Fortran. Just saying that there's a better match between Fortran and the vectorized instructions.
              That's what C99's restrict keyword is for. Sadly, C++ doesn't officially support it, but all C++ compilers have offered something equivalent for ages.
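              To illustrate the point about restrict and aliasing (hypothetical function, C99 syntax; in C++ the non-standard __restrict__ gives the same effect):

              ```c
              #include <assert.h>
              #include <stddef.h>

              /* Without restrict, the compiler must assume out may overlap a or b
               * and reload after every store, which blocks vectorization; restrict
               * lifts that assumption, as Fortran's argument rules do by default. */
              static void add_arrays(int *restrict out, const int *restrict a,
                                     const int *restrict b, size_t n)
              {
                  for (size_t i = 0; i < n; i++)
                      out[i] = a[i] + b[i];
              }

              int main(void)
              {
                  int a[3] = {1, 2, 3}, b[3] = {4, 5, 6}, out[3];
                  add_arrays(out, a, b, 3);
                  assert(out[0] == 5 && out[2] == 9);
                  return 0;
              }
              ```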

              Originally posted by vladpetric View Post
              As the compiler needs to insert additional instructions to manipulate vector parts (e.g., shuffle, though not only), it's fairly hard to come up with an integer vectorization scheme that beats the regular un-vectorized code.
              Integer vectorization tends to be tricky, due to limitations of the instruction set. Just about every time I've vectorized integer code by hand, I've had to make tweaks to the precision & range of my data. In a couple of cases, I even had to switch from 2's-complement representation to using a bias, to get around the fact that some instruction I needed didn't support signed operands.
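              A scalar sketch of that bias trick (my example, not from the post): some vector ISAs, e.g. SSE2's pcmpgtb, only offer signed byte compares, but XOR-ing both operands with 0x80 re-biases unsigned values so the signed compare yields the unsigned ordering.

              ```c
              #include <assert.h>
              #include <stdint.h>

              /* Flipping the sign bit (a bias of 128) maps unsigned [0,255] onto
               * signed [-128,127] order-preservingly, so a signed compare of the
               * biased values matches an unsigned compare of the originals. */
              static int unsigned_gt_via_signed(uint8_t a, uint8_t b)
              {
                  int8_t sa = (int8_t)(a ^ 0x80);
                  int8_t sb = (int8_t)(b ^ 0x80);
                  return sa > sb; /* same result as a > b on the unsigned values */
              }

              int main(void)
              {
                  assert(unsigned_gt_via_signed(200, 100) == 1);
                  assert(unsigned_gt_via_signed(100, 200) == 0);
                  assert(unsigned_gt_via_signed(255, 0)   == 1);
                  return 0;
              }
              ```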

              The last thing I hand-vectorized was image scaling with area sampling, and that was completely nuts in terms of how much I had to deconstruct the entire algorithm to achieve good data coherence and compute density. There were simpler approaches that could've used scatter-gather instructions, but my implementation is much more efficient. And, as I mentioned above, careful precision-management was necessary here, as well.

              Comment


              • #87
                Originally posted by vladpetric View Post
                IPC can vary hugely between a processor like the one in an RPi (granted, those are pretty old, even in the rpi4) and a good core design like the M1X (I have seen differences as high as 9x ... yes, seriously). The problem is that unlike the clock speed, IPC is not a constant, and it's oftentimes hard to wrap one's head around it.
                It's funny to me how some people use benchmarks to infer things about IPC. What they really tend to mean is "useful work per clock", and you can do that even by doing things like using wider vectors that actually have lower throughput. Or doing things that have nothing to do with instructions, such as boosting your cache sizes to reduce the amount of stalls on memory ops.

                I guess "useful work per clock" is what really counts, but I'd prefer if the world reserved "IPC" to actually describe peak instruction throughput of the ALU. I know that ship has sailed...

                Obviously, there's inter-process communication, where I probably first encountered that initialism.

                Comment


                • #88
                  Originally posted by coder View Post
                  That's what C99's restrict keyword is for. Sadly, C++ doesn't officially support it, but all C++ compilers have offered something equivalent for ages.
                  Sure, I've been using __restrict__ for a while now. But even with abusive use of it, gcc doesn't vectorize some loops that fortran does ...

                  Comment


                  • #89
                    Originally posted by coder View Post
                    It's funny to me how some people use benchmarks to infer things about IPC. What they really tend to mean is "useful work per clock", and you can do that even by doing things like using wider vectors that actually have lower throughput. Or doing things that have nothing to do with instructions, such as boosting your cache sizes to reduce the amount of stalls on memory ops.

                    I guess "useful work per clock" is what really counts, but I'd prefer if the world reserved "IPC" to actually describe peak instruction throughput of the ALU. I know that ship has sailed...

                    Obviously, there's inter-process communication, where I probably first encountered that initialism.
                    This is the definition from Hennessy and Patterson (1st chapter, if I remember correctly, though I have not read the latest edition). Peak values are not that helpful, because they are very rarely reached. It is an unwritten rule of marketing to quote the biggest number you can ... the joke in comp arch is that peak values are limits you're guaranteed never to exceed.

                    Anyway, if you have a simple single-threaded program, where you care about end-to-end execution (just to keep things simple), the number of instructions for a run is fixed, and the clock is fixed too, then you can calculate the IPC fairly easily.
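                    Spelled out with made-up numbers, that's just the textbook relation: for a fixed instruction count at a fixed clock, IPC follows directly, and execution time = instructions / (IPC * frequency).

                    ```c
                    #include <assert.h>

                    /* Illustrative numbers only: 8e9 retired instructions over
                     * 2e9 cycles at a fixed 4 GHz clock. */
                    int main(void)
                    {
                        double instructions = 8e9;
                        double cycles       = 2e9;
                        double freq_hz      = 4e9;

                        double ipc     = instructions / cycles;   /* 4.0 */
                        double seconds = cycles / freq_hz;        /* = insns/(ipc*freq) */

                        assert(ipc == 4.0);
                        assert(seconds == 0.5);
                        return 0;
                    }
                    ```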

                    Is it a perfect metric? Hardly. The principal caveat is what you described. But with exactly the same program (vectorized or not vectorized for both runs) it is comparable.
                    Last edited by vladpetric; 20 February 2022, 02:02 PM.

                    Comment


                    • #90
                      Originally posted by vladpetric View Post
                      Yeah, for x86 (with 15 gprs) in particular register spills matter a lot. Though maybe I wouldn't go as far as saying that it's the worst optimization. I think that there are plenty of cases where the compiler unrolls without causing additional spills, and that can be totally fine.
                      I was specifically talking about hand-unrolling, there. Given what a mess of the code it tends to make, it felt a bit ironic to me that L_A_G at the same time mentioned being a neat-freak. I mean, I get that you need to be well-organized to successfully pull off loop-unrolling by hand.

                      Originally posted by vladpetric View Post
                      I've seen people coming primarily from a big Chicagoan hedge fund that proclaim that inlining is the end of it all for optimization.
                      That's oddly specific. Also, I tend to think of hedge funds as having a handful of employees, but my knowledge of such things is very dated.

                      Originally posted by vladpetric View Post
                      And over-inlining (especially of error handling code, that essentially never gets called) leads to lots of register spills.
                      Depends on the type of code, but my experience is that a lot of error handling tends to be looking at the results of functions that aren't usually inlinable.

                      Originally posted by vladpetric View Post
                      Typically over-inlining is done by hand though (people abusing the inline keyword and also the force inline attribute).
                      Of those, only the force-inline attribute actually forces the compiler's hand. Even if you use the inline keyword, it's still up to the optimizer whether to actually inline or not.
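                      The distinction looks like this in GCC/Clang syntax (hypothetical functions for illustration):

                      ```c
                      #include <assert.h>

                      /* `inline` is only a suggestion the optimizer may ignore;
                       * always_inline actually forces inlining, and noinline forbids
                       * it, e.g. for cold error-handling paths. */
                      static inline int suggested(int x) { return x + 1; }

                      __attribute__((always_inline)) static inline int forced(int x)
                      {
                          return x * 2;
                      }

                      __attribute__((noinline)) static int kept_out_of_line(int x)
                      {
                          return x - 3;
                      }

                      int main(void)
                      {
                          assert(suggested(1) == 2);
                          assert(forced(2) == 4);
                          assert(kept_out_of_line(10) == 7);
                          return 0;
                      }
                      ```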

                      My rule of thumb is that I don't encourage inlining of anything that contains a variable-iteration loop, or that can call more than one function that's unlikely to get inlined itself. It's sort of a gut-feel thing, but you have to ask whether the function call overhead is significant relative to what the function does, as well as whether the added visibility would enable the compiler to make any significant optimizations.

                      I find it interesting that C++ compilers will even do things like optimize away temporary objects that internally use heap allocation. A classic example of this would be string concatenation.

                      Originally posted by vladpetric View Post
                      Finally, do you have a good explanation for the semantics of __builtin_unreachable()? Semantically, and separately from an optimization perspective if you know, is there a difference between that and #if 0? (much appreciated btw)
                      I'm not a good authority on this. I know enough to use it, but you should check out the GCC docs and whatever else you can find from people who seem to actually know something about GCC's internals.
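                      That said, the gist per the GCC docs is that __builtin_unreachable() is very different from #if 0: the preprocessor directive deletes code before compilation, while the builtin is a live statement asserting "control never reaches here", which the optimizer can exploit (reaching it at runtime is undefined behavior). A sketch with a hypothetical function:

                      ```c
                      #include <assert.h>

                      /* Assumed contract: callers only ever pass 0 or 1. The builtin
                       * lets the compiler drop the default-case handling entirely. */
                      static int day_length(int month_type) /* 0 = short, 1 = long */
                      {
                          switch (month_type) {
                          case 0: return 30;
                          case 1: return 31;
                          default: __builtin_unreachable(); /* UB if ever reached */
                          }
                      }

                      int main(void)
                      {
                          assert(day_length(0) == 30);
                          assert(day_length(1) == 31);
                          return 0;
                      }
                      ```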

                      Comment
