Glibc 2.29 Released With getcpu() On Linux, New Optimizations
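
The headline feature: glibc 2.29 adds a library wrapper for the Linux getcpu() system call, declared in <sched.h> under _GNU_SOURCE, so callers no longer have to go through syscall(). A minimal usage sketch, assuming glibc 2.29 or newer:

    #define _GNU_SOURCE
    #include <sched.h>   /* getcpu() wrapper, new in glibc 2.29 */
    #include <stdio.h>

    int main(void) {
        unsigned int cpu, node;

        /* Returns 0 on success, -1 on error (with errno set). */
        if (getcpu(&cpu, &node) != 0) {
            perror("getcpu");
            return 1;
        }
        printf("running on CPU %u, NUMA node %u\n", cpu, node);
        return 0;
    }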


  • F.Ultra
    replied
    Originally posted by coder View Post
    No, you were talking about 256-bit scalars, or some such. That doesn't even make sense, not least because Itanium only had 64-bit.

    SSE is 128-bit, so even if you were suggesting Intel didn't add 128-bit scalar support for the sake of market segmentation, I still say it's nonsense.
    OK, I might have been a little unclear on that one, sorry. What I was referring to there was the address bus width, since addressing large amounts of RAM is what the big enterprise users were after.



  • coder
    replied
    Originally posted by F.Ultra View Post
    So you really don't agree that Intel dragged their feet with 64-bit x86 due to them wanting to push the enterprises to IA64? That's the story I've heard over and over (which of course does not make it true).
    No, you were talking about 256-bit scalars, or some such. That doesn't even make sense, not least because Itanium only had 64-bit.

    SSE is 128-bit, so even if you were suggesting Intel didn't add 128-bit scalar support for the sake of market segmentation, I still say it's nonsense.



  • cybertraveler
    replied
    Originally posted by Weasel View Post
    Exactly this. A vector scales linearly in power consumption and space wasted on the die. Double the width, double the cost (and performance).

    Meanwhile, a multiply, for example, scales roughly quadratically. So a 64-bit multiply is 4 times more complex than a 32-bit multiply, a 128-bit multiply is 16 times more complex, etc. And if you only ever use it for small numbers, that's a lot of wasted power for no reason.

    For vectors, obviously 32-bit is twice as fast as 64-bit, because you can simply fit twice as much data in a single instruction.
    I often like your posts, like that one, when you're calm & friendly.

    Just sayin'



  • Weasel
    replied
    Originally posted by coder View Post
    Because nobody needs 256-bit addressing and only a few more need to compute with 256-bit scalars.

    And, just maybe because register and ALU width ain't free. Worse, if you look at how ALU operations are implemented, you're increasing the critical path length by at least log2(n) for n-bit wordlength. So, a wider chip will not only be hotter and bigger (and thus more expensive), but also slower.

    Compare that with vector arithmetic, where element-wise operations on a k-element vector only occupy k times as much area as the same logic and datapath for operating on a single one of those elements.
    Exactly this. A vector scales linearly in power consumption and space wasted on the die. Double the width, double the cost (and performance).

    Meanwhile, a multiply, for example, scales roughly quadratically. So a 64-bit multiply is 4 times more complex than a 32-bit multiply, a 128-bit multiply is 16 times more complex, etc. And if you only ever use it for small numbers, that's a lot of wasted power for no reason.

    For vectors, obviously 32-bit is twice as fast as 64-bit, because you can simply fit twice as much data in a single instruction.
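
    To make the quadratic-scaling point concrete, here is a minimal C sketch of a schoolbook 64x64-to-128-bit multiply built from 32-bit halves. It is only an illustration of the partial-product count, not how any particular CPU implements it; real hardware uses parallel multiplier arrays, but the number of partial products grows the same way:

    #include <stdint.h>
    #include <stdio.h>

    /* 64x64 -> 128-bit multiply out of 32-bit halves ("limbs").
     * n limbs take n^2 partial products: 2 limbs -> 4 multiplies,
     * 4 limbs -> 16, which is the quadratic scaling described above. */
    static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
        uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
        uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

        uint64_t p0 = a_lo * b_lo;                 /* partial product 1 */
        uint64_t p1 = a_lo * b_hi;                 /* partial product 2 */
        uint64_t p2 = a_hi * b_lo;                 /* partial product 3 */
        uint64_t p3 = a_hi * b_hi;                 /* partial product 4 */

        uint64_t mid  = p1 + (p0 >> 32);           /* cannot overflow */
        uint64_t mid2 = p2 + (uint32_t)mid;        /* cannot overflow */

        *lo = (mid2 << 32) | (uint32_t)p0;
        *hi = p3 + (mid >> 32) + (mid2 >> 32);
    }

    int main(void) {
        uint64_t hi, lo;
        mul64x64(UINT64_MAX, UINT64_MAX, &hi, &lo);   /* (2^64 - 1)^2 */
        printf("hi=%016llx lo=%016llx\n",
               (unsigned long long)hi, (unsigned long long)lo);
        return 0;
    }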



  • F.Ultra
    replied
    Originally posted by coder View Post
    Seriously, WTF? Quit yer trollin'.
    So you really don't agree that Intel dragged their feet with 64-bit x86 due to them wanting to push the enterprises to IA64? That's the story I've heard over and over (which of course does not make it true).



  • coder
    replied
    Originally posted by cbxbiker61 View Post
    The bottom line is it wouldn't really matter "what" caused the improvement, just that there is an improvement.
    Yes, no argument there.

    Originally posted by cbxbiker61 View Post
    But again it's interesting that the only one putting up real-world results is getting attacked.
    Attacked? All I did was point out that the picture is more complex than you seemed to realize.

    Go back and read my first reply - I just asked a question - one that, if you'd really tried to answer it, should've led you to my point.

    Is your ego really so fragile that you're threatened by such a question? Would you prefer I said nothing, only to miss a chance to learn something? I'd happily leave you alone, if that's what you want.

    Originally posted by cbxbiker61 View Post
    ******************
    [email protected] 32 bit ramsmp run
    ******************
    ramsmp -b 1 -p 4
    Again, a question: what is this really showing us? Is it showing how fast the CPU can write data to cache (or RAM), or is it showing how fast it can write one word at a time?

    I wonder what you'd get by just using the most-optimized memset(). I guess probably little or no difference between 32-bit and 64-bit modes.

    (Edit: I'm imagining it would use 128-bit NEON instructions, in both cases. Of course, the 32-bit code would need a fallback implementation for chips without NEON.)
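
    A rough sketch of that experiment, assuming a POSIX system with clock_gettime(); the buffer size and iteration count below are arbitrary picks for illustration, not anything from the thread:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Rough bandwidth test: let libc's optimized memset() fill a large
     * buffer repeatedly and report MB/s, for comparison against a
     * word-at-a-time loop like the one ramsmp appears to use. */
    int main(void) {
        size_t size = 32u << 20;            /* 32 MiB: large enough to spill out of cache */
        int iters = 64;
        char *buf = malloc(size);
        if (!buf) return 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++)
            memset(buf, i, size);           /* vary the byte so the work can't be elided */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("memset: %.1f MB/s\n", (double)size * iters / secs / 1e6);
        free(buf);
        return 0;
    }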

    Originally posted by cbxbiker61 View Post
    ****************
    [email protected] 64 bit ramsmp run
    ****************
    ramsmp -b 1 -p 4 1
    I don't know if it matters, but this has an extra argument - '1'.
    Last edited by coder; 04 February 2019, 08:38 PM.



  • coder
    replied
    Originally posted by F.Ultra View Post
    The SIMD examples are somewhat different, since those are special registers that you cannot touch with the normal ALU, only with the SIMD instruction set, and you could say that in that instance the CPU is actually 256-bit (for AVX2),
    Um, no.

    Originally posted by F.Ultra View Post
    and more to the point: if you have already designed your CPU to be able to manipulate a 256-bit value, then why not also expand the ALUs to the same?
    Because nobody needs 256-bit addressing and only a few more need to compute with 256-bit scalars.

    And, just maybe because register and ALU width ain't free. Worse, if you look at how ALU operations are implemented, you're increasing the critical path length by at least log2(n) for n-bit wordlength. So, a wider chip will not only be hotter and bigger (and thus more expensive), but also slower.

    Compare that with vector arithmetic, where element-wise operations on a k-element vector only occupy k times as much area as the same logic and datapath for operating on a single one of those elements.
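
    A minimal C illustration of that contrast: a 128-bit scalar add has a carry that must ripple from the low half into the high half (the serial dependency behind the critical-path growth), while an element-wise vector add has no cross-lane dependency at all. This assumes GCC or Clang for the vector_size extension:

    #include <stdint.h>
    #include <stdio.h>

    /* 128-bit scalar add built from two 64-bit limbs: the carry out of
     * the low limb feeds the high limb, so the halves cannot be computed
     * independently -- this chain is the "critical path" above. */
    typedef struct { uint64_t lo, hi; } u128;

    static u128 add128(u128 a, u128 b) {
        u128 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);  /* carry out of the low limb */
        return r;
    }

    /* By contrast, each lane of a vector add is an independent small
     * adder; widening the vector just replicates the lane logic. */
    typedef uint32_t v4u32 __attribute__((vector_size(16)));  /* GCC/Clang extension */

    static v4u32 add4(v4u32 a, v4u32 b) { return a + b; }  /* 4 independent 32-bit adds */

    int main(void) {
        u128 x = { UINT64_MAX, 0 }, y = { 1, 0 };
        u128 s = add128(x, y);
        printf("128-bit add: hi=%llu lo=%llu\n",
               (unsigned long long)s.hi, (unsigned long long)s.lo);

        v4u32 a = {1, 2, 3, 4}, b = {10, 20, 30, 40};
        v4u32 c = add4(a, b);
        printf("vector add: %u %u %u %u\n", c[0], c[1], c[2], c[3]);
        return 0;
    }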

    Originally posted by F.Ultra View Post
    The only reason, imho, that Intel did not do that when they introduced the SIMD instructions was politics (aka they wanted people to move to Itanium).
    Seriously, WTF? Quit yer trollin'.



  • cbxbiker61
    replied
    Originally posted by coder View Post

    So, how can you say how much each of those differences contributed to the end result? What I took issue with was your assertion that the improvement was due to improved memory bandwidth. That's a very specific claim which you haven't provided data to support.
    My interest in moving Mycroft started after I had run synthetic benchmarks showing that 64-bit memory performance was much greater than 32-bit. Only after I had implemented Mycroft on 64 bits did I conclude there was a correlation. The bottom line is it wouldn't really matter "what" caused the improvement, just that there is an improvement. But again it's interesting that the only one putting up real-world results is getting attacked.

    ******************
    [email protected] 32 bit ramsmp run
    ******************
    ramsmp -b 1 -p 4
    RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

    8Gb per pass mode, 4 processes

    INTEGER & WRITING 1 Kb block: 15322.46 MB/s
    INTEGER & WRITING 2 Kb block: 15308.33 MB/s
    INTEGER & WRITING 4 Kb block: 15064.17 MB/s
    INTEGER & WRITING 8 Kb block: 15822.48 MB/s
    INTEGER & WRITING 16 Kb block: 15695.01 MB/s
    INTEGER & WRITING 32 Kb block: 15439.36 MB/s
    INTEGER & WRITING 64 Kb block: 13341.39 MB/s
    INTEGER & WRITING 128 Kb block: 6101.51 MB/s
    INTEGER & WRITING 256 Kb block: 3367.37 MB/s
    INTEGER & WRITING 512 Kb block: 2117.88 MB/s
    INTEGER & WRITING 1024 Kb block: 1798.23 MB/s
    INTEGER & WRITING 2048 Kb block: 1677.35 MB/s
    INTEGER & WRITING 4096 Kb block: 1665.76 MB/s
    INTEGER & WRITING 8192 Kb block: 1640.90 MB/s
    INTEGER & WRITING 16384 Kb block: 1641.77 MB/s
    INTEGER & WRITING 32768 Kb block: 1657.22 MB/s

    ****************
    [email protected] 64 bit ramsmp run
    ****************
    ramsmp -b 1 -p 4 1
    RAMspeed/SMP (GENERIC) v3.5.0 by Rhett M. Hollander and Paul V. Bolotoff, 2002-09

    8Gb per pass mode, 4 processes

    INTEGER & WRITING 1 Kb block: 32144.15 MB/s
    INTEGER & WRITING 2 Kb block: 32387.36 MB/s
    INTEGER & WRITING 4 Kb block: 33787.87 MB/s
    INTEGER & WRITING 8 Kb block: 32081.99 MB/s
    INTEGER & WRITING 16 Kb block: 31863.77 MB/s
    INTEGER & WRITING 32 Kb block: 26767.92 MB/s
    INTEGER & WRITING 64 Kb block: 20042.62 MB/s
    INTEGER & WRITING 128 Kb block: 11616.84 MB/s
    INTEGER & WRITING 256 Kb block: 3262.39 MB/s
    INTEGER & WRITING 512 Kb block: 2158.54 MB/s
    INTEGER & WRITING 1024 Kb block: 1798.58 MB/s
    INTEGER & WRITING 2048 Kb block: 1685.76 MB/s
    INTEGER & WRITING 4096 Kb block: 1652.56 MB/s
    INTEGER & WRITING 8192 Kb block: 1646.55 MB/s
    INTEGER & WRITING 16384 Kb block: 1625.21 MB/s
    INTEGER & WRITING 32768 Kb block: 1630.52 MB/s



  • F.Ultra
    replied
    Originally posted by Weasel View Post
    The loads can be fused.

    BTW you can have a "64 bit load" on 32-bit processors, even if it's not pointers, e.g. vectors (MMX as an example on 32-bit CPUs). Obviously you can also have 256-bit loads on 32-bit CPUs (AVX2 in 32-bit mode, assuming no 64-bit mode in the CPU), etc. Like I said, it has nothing to do with the pointer/arch size.

    Yeah I meant superscalar, thanks.

    Note that it depends on the CPU design. In recent x86 processors you just use a specialized instruction for that, like rep movsb (yes, with byte copy), which ends up as the most efficient for large copy operations.
    Can they? Now, I have very little insight into the microcode that x86 processors convert the x86 mnemonics into, or where the real execution occurs, but whenever I see examples of this the loads always go to registers. So since your CPU has only 32-bit registers, it can only perform one load per 32 bits of data even though your cache has prefetched 64 bytes, and thus you have to spend 4 slots in the ALUs just to read those 128 bits vs 2 slots.

    The SIMD examples are somewhat different, since those are special registers that you cannot touch with the normal ALU, only with the SIMD instruction set, and you could say that in that instance the CPU is actually 256-bit (for AVX2). And more to the point: if you have already designed your CPU to be able to manipulate a 256-bit value, then why not also expand the ALUs to the same? The only reason, imho, that Intel did not do that when they introduced the SIMD instructions was politics (aka they wanted people to move to Itanium).
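
    For what it's worth, the SIMD side of this is easy to demonstrate: even in a 32-bit x86 build, SSE2 intrinsics move 16 bytes through an XMM register with a single load and a single store, never touching the 32-bit general-purpose registers. A minimal sketch, assuming GCC or Clang with -m32 -msse2:

    #include <emmintrin.h>  /* SSE2 intrinsics */
    #include <stdint.h>
    #include <stdio.h>

    /* Copy 16 bytes with one 128-bit load and one 128-bit store.
     * This works in a 32-bit build: the data goes through an XMM
     * register, not the 32-bit general-purpose registers. */
    static void copy16(void *dst, const void *src) {
        __m128i v = _mm_loadu_si128((const __m128i *)src);
        _mm_storeu_si128((__m128i *)dst, v);
    }

    int main(void) {
        uint8_t src[16] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
        uint8_t dst[16] = {0};
        copy16(dst, src);
        printf("%u ... %u\n", dst[0], dst[15]);
        return 0;
    }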



  • Weasel
    replied
    Originally posted by F.Ultra View Post
    In order to process 128 bits, your 32-bit ALUs have to perform twice as many loads compared with 64-bit ALUs, leaving you with fewer possible out-of-order executions.
    The loads can be fused.

    BTW you can have a "64 bit load" on 32-bit processors, even if it's not pointers, e.g. vectors (MMX as an example on 32-bit CPUs). Obviously you can also have 256-bit loads on 32-bit CPUs (AVX2 in 32-bit mode, assuming no 64-bit mode in the CPU), etc. Like I said, it has nothing to do with the pointer/arch size.

    Originally posted by coder View Post
    Ehm, confusing OoO with superscalar. They're orthogonal. You can have a superscalar CPU that's in-order (see VLIW, for more extreme examples of this), or an OoO CPU that's single-dispatch (though I don't have an example in mind). More importantly, your superscalar CPU will also need a multi-ported L1 cache, to support multiple, concurrent loads.

    Anyway, what memcpy() and friends usually do to maximize the amount of data processed per cycle is to utilize wide, vector registers. To support this, L1 cache typically has at least one port that's wider than the CPU's wordlength.
    Yeah I meant superscalar, thanks.

    Note that it depends on the CPU design. In recent x86 processors you just use a specialized instruction for that, like rep movsb (yes, with byte copy), which ends up as the most efficient for large copy operations.
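
    A minimal sketch of that rep movsb idiom, using GCC/Clang extended inline assembly (x86 only); whether it actually beats a vectorized memcpy depends on the microarchitecture, with the ERMSB ("enhanced rep movsb") feature flag being the usual signal:

    #include <stdio.h>
    #include <string.h>

    /* memcpy via `rep movsb`. On CPUs with ERMSB, the microcode picks
     * the widest internal transfers itself, so a byte-granular
     * instruction can still run at cache-line speed for large copies. */
    static void *movsb_copy(void *dst, const void *src, size_t n) {
        void *ret = dst;
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
        return ret;
    }

    int main(void) {
        char src[] = "hello, rep movsb";
        char dst[sizeof src];
        movsb_copy(dst, src, sizeof src);
        puts(dst);
        return 0;
    }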

