Glibc 2.29 Released With getcpu() On Linux, New Optimizations
-
Originally posted by coder: Huh? How do you know that? I think you don't actually know why your program is faster in 64-bit mode. You're certainly not changing how many lines of the data bus are active.
Anyway. I find it interesting how many people want to jump on the bandwagon and attack the only one discussing this topic who has put up a real-world comparison with data that backs up what he has said. It appears the attackers are simply offering conjecture. Where's the real-world data that backs up 32-bit being superior in any workload?
Last edited by cbxbiker61; 03 February 2019, 08:48 PM.
-
Originally posted by Weasel: WTF are you even saying? The bus width has nothing to do with the architecture. Sure, the minimum bus width is usually the size of a pointer in the given architecture, but we're talking about 32-bit chips today, and those are built with bus widths wider than 32 bits even if the arch is 32-bit.
The point is that bus widths are larger than the pointer size in today's CPUs and have nothing to do with the pointer size. Nobody cares about the "minimum" bus width in the real world, only your theoretical nonsense.
memcpy/memmove on ARM move data more efficiently on 64-bit vs 32-bit. Look at the source code.
Phoronix: Glibc 2.29 Released With getcpu() On Linux, New Optimizations Ending out January is the version 2.29 release of the GNU C Library (glibc)... http://www.phoronix.com/scan.php?page=news_item&px=Glibc-2.29-Released
-
Originally posted by coder: Superscalar CPUs often have more than one load/store unit, enabling multiple words of data to be loaded from L1 cache in a single cycle.
Also, let's not forget vector instructions (like ARM's 128-bit NEON or Intel's 256-bit AVX).
But, you were talking about the memory bus. And that generally operates at the granularity of cachelines. These days, that's often 64 bytes (512 bits).
-
Originally posted by Weasel: Ehm, of course you can process 128 bits in a single clock cycle; we have out-of-order CPUs... not executing one instruction at a time.
-
Originally posted by cbxbiker61: Actually I do... it moves large amounts of data around in memory with memcpy/memmove. 64-bit instructions do these ops more efficiently. Look at the glibc source code.
https://en.wikipedia.org/wiki/ARM_ar...rch64_features
It doesn't exactly list the differences, but it touches on all the key areas. Note that they doubled both the GP and NEON register files. Also, NEON is mandatory, whereas it's optional in ARMv7-A. Do you even know if it was enabled in your 32-bit build?
Another generic point worth considering is that some 64-bit CPUs might have optimizations that favor 64-bit mode vs. 32-bit. One hypothetical difference might be in the instruction decoder. Intel's Pentium Pro was famously slower at executing 16-bit code than its Pentium predecessor. The continuing popularity of 16-bit applications led Intel to position the PPro as a professional workstation/server CPU, and extend the Pentium (e.g. with MMX) to cater to the consumer market.
Originally posted by cbxbiker61: Anyway. I find it interesting how many people want to jump on the bandwagon and attack the only one discussing this topic who has put up a real-world comparison with data that backs up what he has said. It appears the attackers are simply offering conjecture. Where's the real-world data that backs up 32-bit being superior in any workload?
-
Originally posted by Weasel: Ehm, of course you can process 128 bits in a single clock cycle; we have out-of-order CPUs... not executing one instruction at a time.
Anyway, what memcpy() and friends usually do to maximize the amount of data processed per cycle is to utilize wide, vector registers. To support this, L1 cache typically has at least one port that's wider than the CPU's wordlength.
-
Originally posted by F.Ultra: In order to process 128 bits, your 32-bit ALUs have to perform twice as many loads as 64-bit ALUs, leaving you with fewer possible out-of-order executions.
BTW you can have a "64-bit load" on 32-bit processors, even if it's not pointers, e.g. vectors (MMX as an example on 32-bit CPUs). Obviously you can also have 256-bit loads on 32-bit CPUs (AVX2 in 32-bit mode, assuming the CPU has no 64-bit mode), etc. Like I said, it has nothing to do with the pointer/arch size.
Originally posted by coder: Ehm, you're confusing OoO with superscalar. They're orthogonal. You can have a superscalar CPU that's in-order (see VLIW for more extreme examples of this), or an OoO CPU that's single-dispatch (though I don't have an example in mind). More importantly, your superscalar CPU will also need a multi-ported L1 cache to support multiple concurrent loads.
Anyway, what memcpy() and friends usually do to maximize the amount of data processed per cycle is to utilize wide, vector registers. To support this, L1 cache typically has at least one port that's wider than the CPU's wordlength.
Note that it depends on the CPU design. On recent x86 processors you just use a specialized instruction for that, like rep movsb (yes, with byte copies), and it ends up as the most efficient option for large copy operations.
-
Originally posted by Weasel: The loads can be fused.
BTW you can have a "64 bit load" on 32 bit processors, even if it's not pointers. e.g. vectors (MMX as an example on 32-bit CPUs). Obviously you can also have 256-bit loads on 32-bit CPUs (AVX2 in 32-bit mode, assuming no 64-bit mode in CPU), etc. Like I said it has nothing to do with the pointer/arch size.
Yeah I meant superscalar, thanks.
Note that it depends on the CPU design. On recent x86 processors you just use a specialized instruction for that, like rep movsb (yes, with byte copies), and it ends up as the most efficient option for large copy operations.
The SIMD examples are somewhat different, since those are special utility registers that you cannot touch with the normal ALU, only with the SIMD instruction set; you could say that in that instance the CPU is actually 256-bit (for AVX2). And more to the point: if you have already designed your CPU to manipulate a 256-bit value, why not also widen the ALUs to match? The only reason, imho, that Intel did not do that when they introduced the SIMD instructions was politics (aka they wanted people to move to Itanium).
-
Originally posted by coder:
So, how can you say how much each of those differences contributed to the end result? What I took issue with was your assertion that the improvement was due to improved memory bandwidth. That's a very specific claim which you haven't provided data to support.
******************
[email protected] 32 bit ramsmp run
******************
ramsmp -b 1 -p 4
RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09
8Gb per pass mode, 4 processes
INTEGER & WRITING 1 Kb block: 15322.46 MB/s
INTEGER & WRITING 2 Kb block: 15308.33 MB/s
INTEGER & WRITING 4 Kb block: 15064.17 MB/s
INTEGER & WRITING 8 Kb block: 15822.48 MB/s
INTEGER & WRITING 16 Kb block: 15695.01 MB/s
INTEGER & WRITING 32 Kb block: 15439.36 MB/s
INTEGER & WRITING 64 Kb block: 13341.39 MB/s
INTEGER & WRITING 128 Kb block: 6101.51 MB/s
INTEGER & WRITING 256 Kb block: 3367.37 MB/s
INTEGER & WRITING 512 Kb block: 2117.88 MB/s
INTEGER & WRITING 1024 Kb block: 1798.23 MB/s
INTEGER & WRITING 2048 Kb block: 1677.35 MB/s
INTEGER & WRITING 4096 Kb block: 1665.76 MB/s
INTEGER & WRITING 8192 Kb block: 1640.90 MB/s
INTEGER & WRITING 16384 Kb block: 1641.77 MB/s
INTEGER & WRITING 32768 Kb block: 1657.22 MB/s
****************
[email protected] 64 bit ramsmp run
****************
ramsmp -b 1 -p 4 1
RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09
8Gb per pass mode, 4 processes
INTEGER & WRITING 1 Kb block: 32144.15 MB/s
INTEGER & WRITING 2 Kb block: 32387.36 MB/s
INTEGER & WRITING 4 Kb block: 33787.87 MB/s
INTEGER & WRITING 8 Kb block: 32081.99 MB/s
INTEGER & WRITING 16 Kb block: 31863.77 MB/s
INTEGER & WRITING 32 Kb block: 26767.92 MB/s
INTEGER & WRITING 64 Kb block: 20042.62 MB/s
INTEGER & WRITING 128 Kb block: 11616.84 MB/s
INTEGER & WRITING 256 Kb block: 3262.39 MB/s
INTEGER & WRITING 512 Kb block: 2158.54 MB/s
INTEGER & WRITING 1024 Kb block: 1798.58 MB/s
INTEGER & WRITING 2048 Kb block: 1685.76 MB/s
INTEGER & WRITING 4096 Kb block: 1652.56 MB/s
INTEGER & WRITING 8192 Kb block: 1646.55 MB/s
INTEGER & WRITING 16384 Kb block: 1625.21 MB/s
INTEGER & WRITING 32768 Kb block: 1630.52 MB/s