Glibc 2.29 Released With getcpu() On Linux, New Optimizations

  • #41
    Originally posted by F.Ultra View Post
    The SIMD examples are somewhat different since they are special utility registers that you cannot touch with the normal ALU and only with the SIMD instruction set and you could say that in that instance the CPU is actually 256-bit (for AVX2),
    Um, no.

    Originally posted by F.Ultra View Post
    and more to the point, if you already have designed your CPU to be able to manipulate a 256-bit value then why not also expand the ALUs to the same?
    Because nobody needs 256-bit addressing and only a few more need to compute with 256-bit scalars.

    And, just maybe because register and ALU width ain't free. Worse, if you look at how ALU operations are implemented, you're increasing the critical path length by at least log2(n) for n-bit wordlength. So, a wider chip will not only be hotter and bigger (and thus more expensive), but also slower.

    Compare that with vector arithmetic, where element-wise operations on a k-element vector occupy only k times the logic and datapath needed to operate on a single one of those elements.
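    To make the critical-path point concrete, here's a toy model (my own sketch, not how real adders are laid out in silicon): in an n-bit ripple-carry adder the worst-case carry chain is n stages long, while a parallel-prefix (carry-lookahead) design needs roughly log2(n) stages. Either way, depth grows with width.

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Toy model: count how many gate "stages" a carry must ripple
     * through in an n-bit ripple-carry adder (worst case: n stages),
     * versus the ~log2(n) stages of a parallel-prefix adder. */

    /* Longest consecutive carry chain when adding a and b. */
    static int ripple_depth(uint64_t a, uint64_t b, int bits) {
        int carry = 0, longest = 0, run = 0;
        for (int i = 0; i < bits; i++) {
            int x = (int)((a >> i) & 1), y = (int)((b >> i) & 1);
            int new_carry = (x & y) | (x & carry) | (y & carry);
            run = new_carry ? run + 1 : 0;  /* consecutive carry stages */
            if (run > longest) longest = run;
            carry = new_carry;
        }
        return longest;
    }

    /* Stages in a parallel-prefix carry network: ceil(log2(bits)). */
    static int prefix_depth(int bits) {
        int stages = 0;
        while ((1 << stages) < bits) stages++;
        return stages;
    }

    int main(void) {
        /* 0xFFFFFFFF + 1 forces the carry through every bit. */
        printf("32-bit ripple worst case: %d stages\n",
               ripple_depth(0xFFFFFFFFu, 1, 32));
        printf("32-bit parallel prefix:   %d stages\n", prefix_depth(32));
        printf("64-bit parallel prefix:   %d stages\n", prefix_depth(64));
        return 0;
    }
    ```

    Even with the best-known prefix networks, going from 64-bit to 256-bit adds stages to every single scalar add, which is exactly the latency cost described above.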

    Originally posted by F.Ultra View Post
    The only reason imho that Intel did not do that when they introduced the SIMD instructions was due to politics (aka they wanted people to move to Itanium).
    Seriously, WTF? Quit yer trollin'.

    Comment


    • #42
      Originally posted by cbxbiker61 View Post
      The bottom line is it wouldn't really matter "what" caused the improvement, just that there is an improvement.
      Yes, no argument there.

      Originally posted by cbxbiker61 View Post
      But again it's interesting that the only one putting up real-world results is getting attacked.
      Attacked? All I did was point out that the picture is more complex than you seemed to realize.

      Go back and read my first reply - I just asked a question - one that, if you'd really tried to answer it, should've led you to my point.

      Is your ego really so fragile that you're threatened by such a question? Would you prefer I said nothing, only to miss a chance to learn something? I'd happily leave you alone, if that's what you want.

      Originally posted by cbxbiker61 View Post
      ******************
      [email protected] 32 bit ramsmp run
      ******************
      ramsmp -b 1 -p 4
      Again, a question: what is this really showing us? Is it showing how fast the CPU can write data to cache (or RAM), or is it showing how fast it can write one word at a time?

      I wonder what you'd get by just using the most-optimized memset(). I guess probably little or no difference between 32-bit and 64-bit modes.

      (Edit: I'm imagining it would use 128-bit NEON instructions, in both cases. Of course, the 32-bit code would need a fallback implementation for chips without NEON.)
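      A minimal sketch of the memset() comparison I'm suggesting (buffer size and iteration count are my own arbitrary picks; a serious run would pin the core and repeat many times):

      ```c
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <time.h>

      /* Rough memset() bandwidth probe. glibc's IFUNC machinery should
       * pick the widest available path (NEON on ARM, SSE/AVX on x86),
       * in both 32-bit and 64-bit builds. */
      #define BUF_SIZE (64 * 1024 * 1024)  /* 64 MiB, larger than typical L2 */
      #define ITERS    8

      int main(void) {
          unsigned char *buf = malloc(BUF_SIZE);
          if (!buf) return 1;

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (int i = 0; i < ITERS; i++)
              memset(buf, i & 0xFF, BUF_SIZE);
          clock_gettime(CLOCK_MONOTONIC, &t1);

          double secs = (double)(t1.tv_sec - t0.tv_sec)
                      + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
          double mib = (double)BUF_SIZE * ITERS / (1024.0 * 1024.0);
          printf("memset: %.0f MiB in %.3f s = %.0f MiB/s\n",
                 mib, secs, mib / secs);

          free(buf);
          return 0;
      }
      ```

      If the 32-bit and 64-bit numbers come out about the same here, that would support the idea that ramsmp's word-at-a-time loops are measuring instruction throughput more than memory bandwidth.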

      Originally posted by cbxbiker61 View Post
      ****************
      [email protected] 64 bit ramsmp run
      ****************
      ramsmp -b 1 -p 4 1
      I don't know if it matters, but this has an extra argument - '1'.
      Last edited by coder; 02-04-2019, 08:38 PM.

      Comment


      • #43
        Originally posted by coder View Post
        Seriously, WTF? Quit yer trollin'.
        So you really don't agree that Intel dragged their feet with 64-bit x86 due to them wanting to push the enterprises to IA64? That's the story I've heard over and over (which of course does not make it true).

        Comment


        • #44
          Originally posted by coder View Post
          Because nobody needs 256-bit addressing and only a few more need to compute with 256-bit scalars.

          And, just maybe because register and ALU width ain't free. Worse, if you look at how ALU operations are implemented, you're increasing the critical path length by at least log2(n) for n-bit wordlength. So, a wider chip will not only be hotter and bigger (and thus more expensive), but also slower.

          Compare that with vector arithmetic, and element-wise operations on a k-element vector only occupy k times as much as the same logic and datapath for operating on a single one of those elements.
          Exactly this. A vector scales linearly in power consumption and space wasted on the die. Double the width, double the cost (and performance).

          Meanwhile, a multiply, for example, scales roughly quadratically. So a 64-bit multiply is about 4 times more complex than a 32-bit multiply, a 128-bit multiply about 16 times, and so on. And if you only ever use it for small numbers, that's a lot of wasted power for no reason.

          For vectors, obviously 32-bit is twice as fast as 64-bit, because you can simply fit twice as many elements in a single instruction.
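          A quick way to see the quadratic scaling: building a 2n-bit multiply out of n-bit multiplies takes four partial products. (This is schoolbook decomposition; real hardware multipliers use Wallace trees and the like, but the operation count grows the same way.)

          ```c
          #include <stdint.h>
          #include <stdio.h>

          /* Schoolbook 64x64 -> 128-bit multiply built from four
           * 32x32 -> 64-bit partial products: doubling the operand
           * width quadruples the number of n-bit multiplies. */
          static void mul64x64(uint64_t a, uint64_t b,
                               uint64_t *hi, uint64_t *lo) {
              uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
              uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

              uint64_t p0 = a_lo * b_lo;  /* four partial products */
              uint64_t p1 = a_lo * b_hi;
              uint64_t p2 = a_hi * b_lo;
              uint64_t p3 = a_hi * b_hi;

              uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
              *lo = (mid << 32) | (uint32_t)p0;
              *hi = p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
          }

          int main(void) {
              uint64_t hi, lo;
              /* (2^64 - 1) * 2 = 2^65 - 2, so hi = 1. */
              mul64x64(0xFFFFFFFFFFFFFFFFull, 2, &hi, &lo);
              printf("hi=%llu lo=0x%llx\n",
                     (unsigned long long)hi, (unsigned long long)lo);
              return 0;
          }
          ```

          Go from 64-bit to 128-bit operands and each of those four partial products itself splits into four, hence 16 times the work.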

          Comment


          • #45
            Originally posted by Weasel View Post
            Exactly this. A vector scales linearly in power consumption and space wasted on the die. Double the width, double the cost (and performance).

             Meanwhile, a multiply, for example, scales roughly quadratically. So a 64-bit multiply is about 4 times more complex than a 32-bit multiply, a 128-bit multiply about 16 times, and so on. And if you only ever use it for small numbers, that's a lot of wasted power for no reason.

             For vectors, obviously 32-bit is twice as fast as 64-bit, because you can simply fit twice as many elements in a single instruction.
            I often like your posts, like that one; when you're calm & friendly.

            Just sayin'

            Comment


            • #46
              Originally posted by F.Ultra View Post
              So you really don't agree that Intel dragged their feet with 64-bit x86 due to them wanting to push the enterprises to IA64? That's the story I've heard over and over (which of course does not make it true).
               No, you were talking about 256-bit scalars, or some such. That doesn't even make sense, not least because Itanium itself was only 64-bit.

              SSE is 128-bit, so even if you were suggesting Intel didn't add 128-bit scalar support for the sake of market segmentation, I still say it's nonsense.

              Comment


              • #47
                Originally posted by coder View Post
                No, you were talking about 256-bit scalars, or some such. That doesn't even make sense, not least because Itanium only had 64-bit.

                SSE is 128-bit, so even if you were suggesting Intel didn't add 128-bit scalar support for the sake of market segmentation, I still say it's nonsense.
                OK, I might have been a little unclear on that one, sorry. What I was referring to was the address bus width, since addressing large amounts of RAM is what the big enterprise users were after.

                Comment
