Glibc 2.29 Released With getcpu() On Linux, New Optimizations

  • coder
    replied
    Originally posted by Weasel View Post
    Ehm, of course you can process 128 bits in a single clock cycle; we have out-of-order CPUs... not executing one instruction at a time.
    Ehm, you're confusing OoO with superscalar. They're orthogonal: you can have a superscalar CPU that's in-order (see VLIW for more extreme examples of this), or an OoO CPU that's single-dispatch (though I don't have an example in mind). More importantly, your superscalar CPU will also need a multi-ported L1 cache to support multiple concurrent loads.

    Anyway, what memcpy() and friends usually do to maximize the amount of data processed per cycle is to utilize wide, vector registers. To support this, L1 cache typically has at least one port that's wider than the CPU's wordlength.
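    As a minimal sketch of that idea (not glibc's actual implementation, which also handles alignment, overlap, and runtime dispatch via ifuncs), a 128-bit-per-iteration copy using SSE2 intrinsics looks something like this; copy128 is a hypothetical name:

    #include <emmintrin.h> /* SSE2 intrinsics, x86 only */
    #include <stddef.h>

    /* Hypothetical sketch: move 16 bytes per iteration through the
     * 128-bit XMM registers, then finish any remainder byte-by-byte. */
    static void copy128(void *dst, const void *src, size_t n)
    {
        size_t i;
        for (i = 0; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)((const char *)src + i));
            _mm_storeu_si128((__m128i *)((char *)dst + i), v);
        }
        for (; i < n; i++) /* byte tail */
            ((char *)dst)[i] = ((const char *)src)[i];
    }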



  • coder
    replied
    Originally posted by cbxbiker61 View Post
    Actually I do... it moves large amounts of data around in memory with memcpy/memmove. 64-bit instructions do these ops more efficiently. Look at the glibc source code.
    Okay, I'll explain myself. As I've said, ISAs usually change more than just the size of their registers when they're extended to 64-bit. You first compiled your program for ARMv7-A, then ARMv8-A, correct? Do you know all the ways in which those ISAs differ? Have a look:

    https://en.wikipedia.org/wiki/ARM_ar...rch64_features

    It doesn't list the differences exhaustively, but it touches on all the key areas. Note that they doubled both the GP and NEON register files. Also, NEON is mandatory, whereas it's optional in ARMv7-A. Do you even know if it was enabled in your 32-bit build?
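    If you're unsure, one quick check (assuming GCC or Clang, which define the ACLE macros) is to compile and run something like this in the 32-bit build:

    #include <stdio.h>

    /* __ARM_NEON (ACLE) / __ARM_NEON__ (older GCC) are only defined
     * when the compiler is actually generating NEON code; on AArch32
     * that typically requires e.g. -mfpu=neon. */
    int main(void)
    {
    #if defined(__ARM_NEON) || defined(__ARM_NEON__)
        puts("NEON enabled in this build");
    #else
        puts("NEON NOT enabled in this build");
    #endif
        return 0;
    }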

    Another generic point worth considering is that some 64-bit CPUs might have optimizations that favor 64-bit mode over 32-bit. One hypothetical difference might be in the instruction decoder. Intel's Pentium Pro was famously slower at executing 16-bit code than its Pentium predecessor. The continuing popularity of 16-bit applications led Intel to position the PPro as a professional workstation/server CPU and extend the Pentium (e.g. with MMX) to cater to the consumer market.

    Originally posted by cbxbiker61 View Post
    Anyway, I find it interesting how many people want to jump on the bandwagon and attack the only one in this discussion who has put up a real-world comparison with data that backs up what he has said. It appears the attackers are simply offering conjecture. Where's the real-world data showing that 32-bit is superior in any workload?
    So, how can you say how much each of those differences contributed to the end result? What I took issue with was your assertion that the improvement was due to improved memory bandwidth. That's a very specific claim which you haven't provided data to support.



  • F.Ultra
    replied
    Originally posted by Weasel View Post
    Ehm, of course you can process 128 bits in a single clock cycle; we have out-of-order CPUs... not executing one instruction at a time.
    In order to process 128 bits, your 32-bit ALUs have to perform twice as many loads compared with 64-bit ALUs, leaving you with fewer possible out-of-order executions.
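    To illustrate the point (a toy sketch, not real memcpy code): moving the same 16 bytes word-by-word takes four load/store pairs with 32-bit words but only two with 64-bit words, freeing issue slots:

    #include <stdint.h>

    /* 4 x 32-bit loads/stores to move 128 bits... */
    void copy16_u32(uint32_t *dst, const uint32_t *src)
    {
        for (int i = 0; i < 4; i++)
            dst[i] = src[i];
    }

    /* ...versus only 2 x 64-bit loads/stores for the same 128 bits. */
    void copy16_u64(uint64_t *dst, const uint64_t *src)
    {
        for (int i = 0; i < 2; i++)
            dst[i] = src[i];
    }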



  • F.Ultra
    replied
    Originally posted by coder View Post
    Superscalar CPUs often have > 1 load/store unit, enabling multiple word-lengths of data to be loaded from L1 cache in a single cycle.

    Also, let's not forget vector instructions (like ARM's 128-bit NEON or Intel's 256-bit AVX).

    But, you were talking about the memory bus. And that generally operates at the granularity of cachelines. These days, that's often 64 bytes (512 bits).
    And with execution units that large, do you not think it would be a waste not to also widen the ALUs from 32 bits? AFAIK all ARM cores with 128-bit NEON and all Intel CPUs with 256-bit AVX are 64-bit architectures.



  • cbxbiker61
    replied
    Originally posted by Weasel View Post
    WTF are you even saying? The bus width has nothing to do with the architecture. Sure, the minimum bus width is usually the size of a pointer in the given architecture, but we're talking about 32-bit chips today, and those are built with a bus width greater than 32 bits even if the arch is 32-bit.

    The point is that bus widths are larger than the pointer size in today's CPUs and have nothing to do with the pointer size. Nobody cares about the "minimum" bus width in the real world, only your theoretical nonsense.
    I'm not the one who started talking about bus width.

    memcpy/memmove on ARM move data more efficiently on 64-bit than on 32-bit. Look at the source code.

    Phoronix: Glibc 2.29 Released With getcpu() On Linux, New Optimizations
    Ending out January is the version 2.29 release of the GNU C Library (glibc)...
    http://www.phoronix.com/scan.php?page=news_item&px=Glibc-2.29-Released
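    For reference, the getcpu() wrapper the article mentions can be called like this on glibc >= 2.29 (it needs _GNU_SOURCE, and either argument may be NULL):

    #define _GNU_SOURCE
    #include <sched.h> /* getcpu() wrapper, added in glibc 2.29 */
    #include <stdio.h>

    int main(void)
    {
        unsigned int cpu, node;
        if (getcpu(&cpu, &node) == 0)
            printf("running on CPU %u, NUMA node %u\n", cpu, node);
        else
            perror("getcpu");
        return 0;
    }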



  • cbxbiker61
    replied
    Originally posted by coder View Post
    Huh? How do you know that? I think you don't actually know why your program is faster, in 64-bit mode. You're certainly not changing how many lines of the databus are active.
    Actually I do... it moves large amounts of data around in memory with memcpy/memmove. 64-bit instructions do these ops more efficiently. Look at the glibc source code.

    Anyway, I find it interesting how many people want to jump on the bandwagon and attack the only one in this discussion who has put up a real-world comparison with data that backs up what he has said. It appears the attackers are simply offering conjecture. Where's the real-world data showing that 32-bit is superior in any workload?
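    For anyone who wants numbers of their own, a rough sketch of a memcpy throughput test (a hypothetical example, not the comparison referenced above) is to build the same source once for ARMv7-A and once for AArch64 and compare:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void)
    {
        const size_t N = 64 * 1024 * 1024; /* 64 MiB buffers */
        char *src = malloc(N), *dst = malloc(N);
        if (!src || !dst)
            return 1;
        memset(src, 0xA5, N); /* touch pages so timing isn't dominated by faults */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < 16; i++)
            memcpy(dst, src, N);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.2f GiB/s\n", 16.0 * N / (1024.0 * 1024 * 1024) / s);
        free(src);
        free(dst);
        return 0;
    }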
    Last edited by cbxbiker61; 03 February 2019, 08:48 PM.



  • Weasel
    replied
    Originally posted by F.Ultra View Post
    Yes, I know that we have moved on from that in recent decades. However, if the ALUs are still 64-bit, then even if the databus is 128-bit the CPU cannot process all 128 bits in one cycle arithmetically. Now, there is of course "nothing" that prevents CPU makers from creating a CPU with 128-bit ALUs that still uses a 32-bit address bus and thus 32-bit pointers; it would of course violate the currently defined data models, but then this is embedded (which is sometimes more of a wild-west kind of land).
    Ehm, of course you can process 128 bits in a single clock cycle; we have out-of-order CPUs... not executing one instruction at a time.



  • coder
    replied
    Originally posted by F.Ultra View Post
    Yes, I know that we have moved on from that in recent decades. However, if the ALUs are still 64-bit, then even if the databus is 128-bit the CPU cannot process all 128 bits in one cycle arithmetically.
    Superscalar CPUs often have > 1 load/store unit, enabling multiple word-lengths of data to be loaded from L1 cache in a single cycle.

    Also, let's not forget vector instructions (like ARM's 128-bit NEON or Intel's 256-bit AVX).

    But, you were talking about the memory bus. And that generally operates at the granularity of cachelines. These days, that's often 64 bytes (512 bits).
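    On a glibc system you can query that granularity at runtime; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension to sysconf() and may report 0 or -1 where the kernel doesn't expose it:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (line > 0)
            printf("L1D cacheline: %ld bytes\n", line);
        else
            puts("cacheline size not reported on this system");
        return 0;
    }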
    Last edited by coder; 03 February 2019, 04:01 PM.



  • F.Ultra
    replied
    Originally posted by coder View Post
    Databus width and CPU word-length are independent. A common Intel desktop CPU has a 128-bit memory interface, but that doesn't make it a 128-bit CPU.
    Yes, I know that we have moved on from that in recent decades. However, if the ALUs are still 64-bit, then even if the databus is 128-bit the CPU cannot process all 128 bits in one cycle arithmetically. Now, there is of course "nothing" that prevents CPU makers from creating a CPU with 128-bit ALUs that still uses a 32-bit address bus and thus 32-bit pointers; it would of course violate the currently defined data models, but then this is embedded (which is sometimes more of a wild-west kind of land).
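    The data models are easy to check from C, for what it's worth; the x32 ABI (gcc -mx32, where supported) is a real case of 32-bit pointers on a CPU with 64-bit ALUs:

    #include <stdio.h>

    /* ILP32 prints int=4 long=4 void*=4, LP64 prints int=4 long=8
     * void*=8; x32 prints 4/4/4 while still using the full 64-bit
     * register file. */
    int main(void)
    {
        printf("int=%zu long=%zu void*=%zu\n",
               sizeof(int), sizeof(long), sizeof(void *));
        return 0;
    }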



  • cybertraveler
    replied
    Originally posted by coder View Post
    That assumption is often not valid, because some (most?) ISAs add performance-enhancing features (e.g. more registers, new instructions) in their 64-bit mode. The transition from 32-bit to 64-bit usually provides a good opportunity for the chip maker to update the ISA in ways that also improve performance. Larger pointers can be a small price to pay for this.

    Of course, most microcontrollers and CPUs for wearables are still 32-bit, for the reasons you mentioned. But, you have to look beyond the Pi's A53 and go for ARM's Cortex-M cores.

    https://en.wikipedia.org/wiki/ARM_Cortex-M
    You quoted me as saying "if all other things equal; e.g. equivalent instruction sets".

