Glibc 2.29 Released With getcpu() On Linux, New Optimizations

Phoronix: Ending out January is the version 2.29 release of the GNU C Library (glibc)... http://www.phoronix.com/scan.php?page=news_item&px=Glibc-2.29-Released


  • #31
    Originally posted by F.Ultra View Post
    Yes, I know that we have moved on from that in the last few decades. However, if the ALUs are still 64-bit, then even if the databus is 128-bit the CPU cannot process all 128 bits in one cycle arithmetically. There is of course "nothing" that prevents CPU makers from creating a CPU with 128-bit ALUs that still uses a 32-bit address bus, and thus 32-bit pointers; it would of course violate the currently defined data models, but then this is embedded (which is sometimes more of a wild-west kind of land).
    Ehm, of course you can process 128 bits in a single clock cycle; we have out-of-order CPUs... not executing one instruction at a time.



    • #32
      Originally posted by coder View Post
      Huh? How do you know that? I think you don't actually know why your program is faster, in 64-bit mode. You're certainly not changing how many lines of the databus are active.
      Actually, I do... it moves large amounts of data around in memory with memcpy/memmove, and 64-bit instructions do these ops more efficiently. Look at the glibc source code.
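
      To illustrate the mechanism with a minimal sketch (this is not glibc's actual implementation, just the underlying idea): a word-at-a-time copy loop moves twice as many bytes per iteration when the native word is 8 bytes instead of 4.

      Code:
      #include <stddef.h>
      #include <stdint.h>

      /* Naive word-at-a-time copy, NOT glibc's implementation. uintptr_t
       * is 4 bytes on a 32-bit build and 8 bytes on a 64-bit build, so the
       * 64-bit loop issues half as many load/store pairs for the same
       * amount of data. Assumes n is a multiple of the word size and that
       * the buffers are suitably aligned and do not overlap. */
      static void word_copy(void *dst, const void *src, size_t n) {
          uintptr_t *d = dst;
          const uintptr_t *s = src;
          for (size_t i = 0; i < n / sizeof(uintptr_t); i++)
              d[i] = s[i];
      }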

      Anyway, I find it interesting how many people want to jump on the bandwagon and attack the only one discussing this topic who has put up real-world comparison data that backs up what he has said. It appears the attackers are simply offering conjecture. Where's the real-world data that backs up the claim that 32-bit is superior in any workload?
      Last edited by cbxbiker61; 03 February 2019, 08:48 PM.



      • #33
        Originally posted by Weasel View Post
        WTF are you even saying? The bus width has nothing to do with the architecture. Sure, the minimum bus width is usually the size of a pointer in the given architecture, but we're talking about 32-bit chips today, and those are built with a wider bus than 32 bits even if the arch is 32-bit.

        The point is that bus widths are larger than the pointer size in today's CPUs and have nothing to do with the pointer size. Nobody cares about the "minimum" bus width in the real world, only your theoretical nonsense.
        I'm not the one who started talking about bus width.

        memcpy/memmove on ARM move data more efficiently on 64-bit vs. 32-bit. Look at the source code.




        • #34
          Originally posted by coder View Post
          Superscalar CPUs often have > 1 load/store unit, enabling multiple word-lengths of data to be loaded from L1 cache in a single cycle.

          Also, let's not forget vector instructions (like ARM's 128-bit NEON or Intel's 256-bit AVX).

          But, you were talking about the memory bus. And that generally operates at the granularity of cachelines. These days, that's often 64 bytes (512 bits).
          And with execution units that large, do you not think it would be a waste not to also widen the ALUs from 32 bits? AFAIK all ARM cores with 128-bit NEON and Intel cores with 256-bit AVX are 64-bit architectures.



          • #35
            Originally posted by Weasel View Post
            Ehm, of course you can process 128 bits in a single clock cycle; we have out-of-order CPUs... not executing one instruction at a time.
            In order to process 128 bits, your 32-bit ALUs have to perform twice as many loads compared with 64-bit ALUs, leaving you with fewer possible out-of-order executions.



            • #36
              Originally posted by cbxbiker61 View Post
              Actually, I do... it moves large amounts of data around in memory with memcpy/memmove, and 64-bit instructions do these ops more efficiently. Look at the glibc source code.
              Okay, I'll explain myself. As I've said, ISAs usually change more than just the size of registers, when they're extended to 64-bit. You first compiled your program for ARMv7-A, then ARMv8-A, correct? Do you know all the ways in which those ISAs differ? Have a look:

              https://en.wikipedia.org/wiki/ARM_ar...rch64_features

              It doesn't exactly list the differences, but it touches on all the key areas. Note that they doubled both the GP and NEON register files. Also, NEON is mandatory, whereas it's optional in ARMv7-A. Do you even know if it was enabled in your 32-bit build?
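
              One quick way to check is a compile-time probe (a sketch; __ARM_NEON is the ACLE macro that compilers define only when NEON code generation is actually enabled):

              Code:
              #include <stdio.h>

              /* Prints whether NEON was enabled for this translation unit.
               * On 32-bit ARM this requires a flag such as -mfpu=neon; on
               * AArch64, NEON is part of the baseline and is always on. */
              int main(void) {
              #ifdef __ARM_NEON
                  puts("NEON code generation is enabled");
              #else
                  puts("NEON code generation is NOT enabled");
              #endif
                  return 0;
              }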

              Another generic point worth considering is that some 64-bit CPUs might have optimizations that favor 64-bit mode over 32-bit. One hypothetical difference might be in the instruction decoder. Intel's Pentium Pro was famously slower at executing 16-bit code than its Pentium predecessor. The continuing popularity of 16-bit applications led Intel to position the PPro as a professional workstation/server CPU and to extend the Pentium (e.g. with MMX) to cater to the consumer market.

              Originally posted by cbxbiker61 View Post
              Anyway, I find it interesting how many people want to jump on the bandwagon and attack the only one discussing this topic who has put up real-world comparison data that backs up what he has said. It appears the attackers are simply offering conjecture. Where's the real-world data that backs up the claim that 32-bit is superior in any workload?
              So, how can you say how much each of those differences contributed to the end result? What I took issue with was your assertion that the improvement was due to improved memory bandwidth. That's a very specific claim which you haven't provided data to support.



              • #37
                Originally posted by Weasel View Post
                Ehm, of course you can process 128 bits in a single clock cycle; we have out-of-order CPUs... not executing one instruction at a time.
                Ehm, you're confusing OoO with superscalar. They're orthogonal: you can have a superscalar CPU that's in-order (see VLIW, for more extreme examples of this), or an OoO CPU that's single-dispatch (though I don't have an example in mind). More importantly, your superscalar CPU will also need a multi-ported L1 cache to support multiple concurrent loads.

                Anyway, what memcpy() and friends usually do to maximize the amount of data processed per cycle is to utilize wide, vector registers. To support this, L1 cache typically has at least one port that's wider than the CPU's wordlength.
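
                As a concrete sketch of that idea (SSE2 intrinsics, available even in 32-bit x86 builds; this is not glibc's actual code path), each iteration here moves 16 bytes through a single 128-bit XMM register:

                Code:
                #include <emmintrin.h> /* SSE2 */
                #include <stddef.h>

                /* Copies 16 bytes per iteration through a 128-bit XMM
                 * register, regardless of whether the build is 32-bit or
                 * 64-bit. Assumes n is a multiple of 16 and the buffers
                 * do not overlap. */
                static void vec_copy(void *dst, const void *src, size_t n) {
                    __m128i *d = (__m128i *)dst;
                    const __m128i *s = (const __m128i *)src;
                    for (size_t i = 0; i < n / 16; i++)
                        _mm_storeu_si128(&d[i], _mm_loadu_si128(&s[i]));
                }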



                • #38
                  Originally posted by F.Ultra View Post
                  In order to process 128 bits, your 32-bit ALUs have to perform twice as many loads compared with 64-bit ALUs, leaving you with fewer possible out-of-order executions.
                  The loads can be fused.

                  BTW, you can have a "64-bit load" on 32-bit processors, even if it's not pointers, e.g. vectors (MMX as an example on 32-bit CPUs). Obviously you can also have 256-bit loads on 32-bit CPUs (AVX2 in 32-bit mode, assuming the CPU has no 64-bit mode), etc. Like I said, it has nothing to do with the pointer/arch size.

                  Originally posted by coder View Post
                  Ehm, you're confusing OoO with superscalar. They're orthogonal: you can have a superscalar CPU that's in-order (see VLIW, for more extreme examples of this), or an OoO CPU that's single-dispatch (though I don't have an example in mind). More importantly, your superscalar CPU will also need a multi-ported L1 cache to support multiple concurrent loads.

                  Anyway, what memcpy() and friends usually do to maximize the amount of data processed per cycle is to utilize wide, vector registers. To support this, L1 cache typically has at least one port that's wider than the CPU's wordlength.
                  Yeah I meant superscalar, thanks.

                  Note that it depends on the CPU design. On recent x86 processors you just use a specialized instruction for that, like rep movsb (yes, with byte copies), and it ends up as the most efficient option for large copy operations.
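
                  For reference, a rep movsb copy looks roughly like this in GCC extended asm (a minimal sketch; real implementations add size and alignment cutoffs before choosing this path):

                  Code:
                  #include <stddef.h>

                  /* rep movsb copies ECX/RCX bytes from [ESI/RSI] to
                   * [EDI/RDI]. The "+D", "+S" and "+c" constraints bind
                   * dst, src and n to exactly those registers; the
                   * "memory" clobber tells the compiler the destination
                   * buffer was modified. */
                  static void rep_movsb_copy(void *dst, const void *src, size_t n) {
                      __asm__ volatile("rep movsb"
                                       : "+D"(dst), "+S"(src), "+c"(n)
                                       :
                                       : "memory");
                  }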



                  • #39
                    Originally posted by Weasel View Post
                    The loads can be fused.

                    BTW, you can have a "64-bit load" on 32-bit processors, even if it's not pointers, e.g. vectors (MMX as an example on 32-bit CPUs). Obviously you can also have 256-bit loads on 32-bit CPUs (AVX2 in 32-bit mode, assuming the CPU has no 64-bit mode), etc. Like I said, it has nothing to do with the pointer/arch size.

                    Yeah I meant superscalar, thanks.

                    Note that it depends on the CPU design. On recent x86 processors you just use a specialized instruction for that, like rep movsb (yes, with byte copies), and it ends up as the most efficient option for large copy operations.
                    Can they? Now, I have very little insight into the microcode that x86 processors convert the x86 mnemonics into, or where the real execution occurs, but whenever I see examples of this the loads always go to registers. So, since your CPU has only 32-bit registers, it can perform only one load per 32 bits of data even though your cache has prefetched 64 bytes, and thus you have to spend 4 slots in the ALUs just to read those 128 bits, versus 2 slots.

                    The SIMD examples are somewhat different, since those are special utility registers that you cannot touch with the normal ALU, only with the SIMD instruction set, and you could say that in that instance the CPU is actually 256-bit (for AVX2). More to the point: if you have already designed your CPU to be able to manipulate a 256-bit value, then why not also expand the ALUs to the same width? The only reason, IMHO, that Intel did not do that when they introduced the SIMD instructions was politics (aka they wanted people to move to Itanium).



                    • #40
                      Originally posted by coder View Post

                      So, how can you say how much each of those differences contributed to the end result? What I took issue with was your assertion that the improvement was due to improved memory bandwidth. That's a very specific claim which you haven't provided data to support.
                      My interest in moving Mycroft started after I had run synthetic benchmarks showing that 64-bit memory performance was much greater than 32-bit. Only after I had implemented Mycroft on 64-bit did I conclude there was a correlation. The bottom line is that it wouldn't really matter "what" caused the improvement, just that there is an improvement. But again, it's interesting that the only one putting up real-world results is getting attacked.

                      ******************
                      [email protected] 32 bit ramsmp run
                      ******************
                      ramsmp -b 1 -p 4
                      RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

                      8Gb per pass mode, 4 processes

                      INTEGER & WRITING 1 Kb block: 15322.46 MB/s
                      INTEGER & WRITING 2 Kb block: 15308.33 MB/s
                      INTEGER & WRITING 4 Kb block: 15064.17 MB/s
                      INTEGER & WRITING 8 Kb block: 15822.48 MB/s
                      INTEGER & WRITING 16 Kb block: 15695.01 MB/s
                      INTEGER & WRITING 32 Kb block: 15439.36 MB/s
                      INTEGER & WRITING 64 Kb block: 13341.39 MB/s
                      INTEGER & WRITING 128 Kb block: 6101.51 MB/s
                      INTEGER & WRITING 256 Kb block: 3367.37 MB/s
                      INTEGER & WRITING 512 Kb block: 2117.88 MB/s
                      INTEGER & WRITING 1024 Kb block: 1798.23 MB/s
                      INTEGER & WRITING 2048 Kb block: 1677.35 MB/s
                      INTEGER & WRITING 4096 Kb block: 1665.76 MB/s
                      INTEGER & WRITING 8192 Kb block: 1640.90 MB/s
                      INTEGER & WRITING 16384 Kb block: 1641.77 MB/s
                      INTEGER & WRITING 32768 Kb block: 1657.22 MB/s

                      ****************
                      [email protected] 64 bit ramsmp run
                      ****************
                      ramsmp -b 1 -p 4 1
                      RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

                      8Gb per pass mode, 4 processes

                      INTEGER & WRITING 1 Kb block: 32144.15 MB/s
                      INTEGER & WRITING 2 Kb block: 32387.36 MB/s
                      INTEGER & WRITING 4 Kb block: 33787.87 MB/s
                      INTEGER & WRITING 8 Kb block: 32081.99 MB/s
                      INTEGER & WRITING 16 Kb block: 31863.77 MB/s
                      INTEGER & WRITING 32 Kb block: 26767.92 MB/s
                      INTEGER & WRITING 64 Kb block: 20042.62 MB/s
                      INTEGER & WRITING 128 Kb block: 11616.84 MB/s
                      INTEGER & WRITING 256 Kb block: 3262.39 MB/s
                      INTEGER & WRITING 512 Kb block: 2158.54 MB/s
                      INTEGER & WRITING 1024 Kb block: 1798.58 MB/s
                      INTEGER & WRITING 2048 Kb block: 1685.76 MB/s
                      INTEGER & WRITING 4096 Kb block: 1652.56 MB/s
                      INTEGER & WRITING 8192 Kb block: 1646.55 MB/s
                      INTEGER & WRITING 16384 Kb block: 1625.21 MB/s
                      INTEGER & WRITING 32768 Kb block: 1630.52 MB/s

