AVX / AVX2 / AVX-512 Performance + Power On Intel Rocket Lake


  • Qaridarium
    replied
    Originally posted by coder View Post
    That's actually a problem with how the threads are scheduled by the OS. And the OS is probably limited in what it can do by the threading APIs not having a rich enough way to express concepts like one thread being compute-heavy, another being I/O-heavy, and which ones are on the critical path.
    No, it's a win for scaling, due to its usefulness in latency-hiding. That's why GPUs all use SMT in the extreme!
    What you're referring to is that floating-point heavy workloads tend to benefit from it less (with some highly-optimized AVX benchmarks actually suffering), because a single thread can generally keep the CPU resources busy and maybe the data access patterns are predictable enough for hardware prefetchers to hide most of the latency. In such cases, adding more threads hurts by increasing cache contention.
    However, a lot of real-world systems aren't just spending the vast majority of their time on one specific vector compute kernel -- there are lots of other threads doing other types of work that's less disruptive.
    And look at these SPEC INT 2017 benchmarks, where EPYC was 18.2% faster with SMT than without! That's real performance, and a big win for perf/area! SPEC INT more closely models a lot of server workloads than SPEC FP.
    There's another reason why x86 CPUs tend to use it, which is that it's very difficult for them to scale their front end beyond 4-wide. So, adding another thread is a way of getting past that bottleneck to generate more work for ever-wider cores.
    It does seem to help less, as the number of cores grows large enough to saturate the memory interface. However, when comparing 64-core Threadripper and Threadripper Pro, performance benefitted only slightly by doubling the memory interface. So, it seems that 64/128 cores/threads still doesn't always saturate an 8-channel interface. Therefore, I'd expect 128 non-SMT cores probably wouldn't, either.
    I agree that they should stay clear of AVX-512, but it seems almost inevitable that they'll fall into that well-laid trap set by Intel.
    Yes, exactly, everything you said is true.

    I also dislike hyperthreading because I found that I need more RAM for the same task.
    With a 16-core, 32-thread CPU at the highest settings in 7-Zip, an AMD 2950X with 128 GB of RAM runs out of memory.
    So for me, as a desktop user with significantly less RAM than a server system, this hurts.
    It means you first have to buy more RAM, and if you are already at the maximum your mainboard supports, you simply run out.
    Turning off hyperthreading in the BIOS fixes this for me.

    Sure, for desktop users with 1-10 cores hyperthreading was an important feature, but as soon as people buy 16 cores or more for a desktop, it's not a feature anymore.

    And yes, some benchmarks benefit from it ("SPEC INT 2017 benchmarks, where EPYC was 18.2% faster with SMT than without"),
    but that's all server stuff, not desktop use cases...

    If any company ever builds a competitive x86_64 desktop CPU with more than 16 cores and without hyperthreading, I would buy it...
    Last edited by Qaridarium; 12 April 2021, 02:17 PM.



  • coder
    replied
    Originally posted by Qaridarium View Post
    Other Intel bullshit technologies like hyperthreading are the same shitshow: most games on modern high-core-count CPUs (12-16 cores)
    are 5% faster with hyperthreading disabled.
    That's actually a problem with how the threads are scheduled by the OS. And the OS is probably limited in what it can do by the threading APIs not having a rich enough way to express concepts like one thread being compute-heavy, another being I/O-heavy, and which ones are on the critical path.

    Originally posted by Qaridarium View Post
    Also, hyperthreading has no positive effect beyond 32-64 cores... Benchmarks show that it slows down 128-core systems and has no value for 64-core systems...
    No, it's a win for scaling, due to its usefulness in latency-hiding. That's why GPUs all use SMT in the extreme!

    What you're referring to is that floating-point heavy workloads tend to benefit from it less (with some highly-optimized AVX benchmarks actually suffering), because a single thread can generally keep the CPU resources busy and maybe the data access patterns are predictable enough for hardware prefetchers to hide most of the latency. In such cases, adding more threads hurts by increasing cache contention.

    However, a lot of real-world systems aren't just spending the vast majority of their time on one specific vector compute kernel -- there are lots of other threads doing other types of work that's less disruptive.

    And look at these SPEC INT 2017 benchmarks, where EPYC was 18.2% faster with SMT than without! That's real performance, and a big win for perf/area! SPEC INT more closely models a lot of server workloads than SPEC FP.


    There's another reason why x86 CPUs tend to use it, which is that it's very difficult for them to scale their front end beyond 4-wide. So, adding another thread is a way of getting past that bottleneck to generate more work for ever-wider cores.

    Originally posted by Qaridarium View Post
    A 128-core ARM server CPU would not get any benefit from implementing this.
    It does seem to help less, as the number of cores grows large enough to saturate the memory interface. However, when comparing 64-core Threadripper and Threadripper Pro, performance benefitted only slightly by doubling the memory interface. So, it seems that 64/128 cores/threads still doesn't always saturate an 8-channel interface. Therefore, I'd expect 128 non-SMT cores probably wouldn't, either.
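    A rough back-of-the-envelope for the bandwidth point above. This is only a sketch: the DDR4-3200 speed and the 4-channel versus 8-channel split are assumptions for illustration, and theoretical peak bandwidth says nothing about what a given workload actually demands.

```python
# Illustrative arithmetic only. DDR4-3200 and the channel counts are
# assumptions, not figures taken from the thread.

def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal gigabytes)."""
    return channels * mega_transfers_per_s * bus_bytes / 1000.0


def per_core_gbs(channels: int, mega_transfers_per_s: int, cores: int) -> float:
    """Peak bandwidth divided evenly across cores."""
    return peak_bandwidth_gbs(channels, mega_transfers_per_s) / cores


if __name__ == "__main__":
    for channels in (4, 8):
        total = peak_bandwidth_gbs(channels, 3200)
        print(
            f"{channels} channels: {total:.1f} GB/s peak, "
            f"{per_core_gbs(channels, 3200, 64):.2f} GB/s per core at 64 cores, "
            f"{per_core_gbs(channels, 3200, 128):.2f} GB/s per core at 128 cores"
        )
```

    If doubling the channels only helps slightly, that suggests most of the tested workloads need less than the per-core share of the narrower configuration.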

    Originally posted by Qaridarium View Post
    Here is my message for AMD: do a good job with AVX2 (256-bit), increase the core count, max out the GHz, and do not implement AVX-512...

    We do not need another technology that burns more power.
    I agree that they should stay clear of AVX-512, but it seems almost inevitable that they'll fall into that well-laid trap set by Intel.
    Last edited by coder; 11 April 2021, 02:19 AM.



  • coder
    replied
    Originally posted by mtk520 View Post
    Not at all. Integer vector stuff is where the future is. Floating point is also important and will stay, but the integer part has far more addressable use cases and more complexity yet to be addressed. And the real power is in mixing integer and FP interpretations of the same data, which is just a set of bits of the same width.
    You sound as if you think people will keep doing AI on CPUs. That's your use case for integer and mixed FP. Before AI blew up, GPUs were mostly floating point. But, I got news for you: AI is fast moving to special-purpose accelerators.

    Sure, there are some imaging-related uses of integer vectors, which largely explains MMX, SSE2, and AVX2. However, by the time one gets to 256-bit or 512-bit vectors of packed ints, maybe a CPU isn't the best place for that stuff. If you can scale up that wide, why not wider or with more vector pipelines? Why not move the work to a GPU/DSP-type core, now that we have solutions like SYCL and DPC++ for addressing the programmability issues?

    Originally posted by mtk520 View Post
    There needs to be a conceptual separation between SIMD in general and vectorization as implemented on x86 platforms in particular. “Canonical” SIMD has no concept of flow control within a vector.
    I'm keenly aware that x86 vector instructions are mislabeled as SIMD. They're not. They're vector-processing instructions. Pure SIMD not only lacks flow-control within a vector, it also tends not to have horizontal operations as a first class primitive.
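    To make the terminology concrete, here is a minimal sketch in plain Python (not real vector intrinsics) of the difference between a lane-wise, or "vertical", operation and a "horizontal" one. The function names are made up for illustration.

```python
from typing import List


def vertical_add(a: List[float], b: List[float]) -> List[float]:
    """Lane-wise add: lane i of the result depends only on lane i of each input."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")
    return [x + y for x, y in zip(a, b)]


def horizontal_sum(v: List[float]) -> float:
    """Horizontal reduction: combines the lanes of a single vector.

    Done as a pairwise tree, so every step needs data to cross lanes.
    """
    lanes = list(v)
    while len(lanes) > 1:
        if len(lanes) % 2:
            lanes.append(0.0)  # pad odd lane counts with the additive identity
        half = len(lanes) // 2
        lanes = [lanes[i] + lanes[i + half] for i in range(half)]
    return lanes[0] if lanes else 0.0


if __name__ == "__main__":
    print(vertical_add([1, 2, 3, 4], [10, 20, 30, 40]))
    print(horizontal_sum([1, 2, 3, 4]))
```

    The vertical case never moves data between lanes, while the horizontal case does so at every step, which is the kind of cross-lane primitive the post describes as untypical of pure SIMD.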



  • coder
    replied
    Originally posted by mtk520 View Post
    That is quite an incorrect message overall. The limitations are technological, and what was true for Skylake does not hold true any more for current platforms.
    By "current platforms", you mean the Ice Lake CPUs that just launched this week, and some laptop chips that only hit real volumes about a year ago? In that case, yours is quite an incorrect message overall. A conventional understanding of "current platforms" is what most people are using in production and, for AVX-512, that's overwhelmingly Skylake and Cascade Lake.

    Originally posted by mtk520 View Post
    The linked article was correct at the time, it is not any more. There is still an impact to both per-core and per-socket performance, but the profiles of impact are quite radically different. And they will continue to change from platform to platform.
    Right, because some IT faeries flew in overnight and magically swapped out all the Skylake SP and Cascade Lake systems for Ice Lake? No, player.

    And it's like you didn't just see the article where Rocket Lake almost blew a fuse, when he cranked up the AVX-512! That was a K-series CPU. Has anyone stopped to think about what these same benchmarks would look like on a model with a lower thermal headroom? That's right! The CPU would have no choice but to throttle in much the same manner we've seen with other 14 nm CPUs that have it.

    Originally posted by mtk520 View Post
    Yes, that was the price to pay for the introduction of the technology. A reasonable price given the overall situation with process technology and architectural developments. Things have changed quite a lot since then. It is time to stop spreading that horror story.
    So, to tip my hand on this front, my team got burned by using an AVX-512 accelerated library on a Skylake Xeon platform only last month! Once we disabled all the AVX-512 code, performance shot up almost 60%!! That's why I'm so passionate about this issue, and I will continue shouting down shills like yourself, as long as these systems remain in production!

    I know you wish the sins of the past would just magically vanish, but it will take time for those scars to fade. In the meantime, maybe what Intel should be doing is to add a function to popular CPUID libraries that can be used to query the degree of AVX-512 clock-throttling on the CPU, so that software can make a more intelligent decision about whether to use it. Otherwise, those of us who've gotten burned by it are going to continue disabling it everywhere possible, for the next couple years.
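    A sketch of what that kind of decision could look like in software. Everything here is hypothetical: no library is known to report a throttle ratio, so `throttle_ratio` stands in for whatever such a query would return, the 0.9 threshold is arbitrary, and `APP_MAX_VECTOR_BITS` is an invented override variable. The only real interface used is the `flags` line of Linux's /proc/cpuinfo.

```python
import os


def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Collect feature names from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def choose_vector_width(flags, throttle_ratio=None, max_bits=None, min_ratio=0.9):
    """Pick a vector width in bits.

    throttle_ratio is a HYPOTHETICAL value: clock under heavy 512-bit load
    divided by the normal clock. If it is unknown (None), play it safe and
    stay at 256 bits or below.
    """
    if "avx512f" in flags and throttle_ratio is not None and throttle_ratio >= min_ratio:
        width = 512
    elif "avx2" in flags:
        width = 256
    else:
        width = 128
    if max_bits is not None:
        width = min(width, max_bits)  # user override always wins downward
    return width


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            detected = parse_cpu_flags(f.read())
    except OSError:
        detected = set()
    cap = os.environ.get("APP_MAX_VECTOR_BITS")  # invented name, for illustration
    print(choose_vector_width(detected, max_bits=int(cap) if cap else None))
```

    Treating an unknown ratio as "don't use 512-bit" mirrors the conservative stance in the post: without data, disable it.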



  • Qaridarium
    replied
    Originally posted by willmore View Post
    Until you look at efficiency and then it's horrible.
    Absolutely right... AMD should NOT implement AVX-512....

    Other Intel bullshit technologies like hyperthreading are the same shitshow: most games on modern high-core-count CPUs (12-16 cores)
    are 5% faster with hyperthreading disabled. Also, hyperthreading has no positive effect beyond 32-64 cores... Benchmarks show that it slows down 128-core systems and has no value for 64-core systems...

    A 128-core ARM server CPU would not get any benefit from implementing this. Future 96-core AMD CPUs are in danger of being hit by this hyperthreading bullshit hurting performance.

    Here is my message for AMD: do a good job with AVX2 (256-bit), increase the core count, max out the GHz, and do not implement AVX-512...

    We do not need another technology that burns more power.



  • mtk520
    replied
    Originally posted by coder View Post

    But, I tend to agree that all the packed-int stuff in AVX2 is maybe getting to the point where you'd rather be using a DSP/GPU-like architecture. As for true SIMD-programming, GPUs just do that so much better.

    Not at all. Integer vector stuff is where the future is. Floating point is also important and will stay, but the integer part has far more addressable use cases and more complexity yet to be addressed. And the real power is in mixing integer and FP interpretations of the same data, which is just a set of bits of the same width.

    There needs to be a conceptual separation between SIMD in general and vectorization as implemented on x86 platforms in particular. “Canonical” SIMD has no concept of flow control within a vector.



  • mtk520
    replied
    Originally posted by TemplarGR View Post

    It is even worse. Intel had been adding AVX versions because they weren't able to use gpgpu better. There is no reason for AVX-512 when you can have gpgpu for that job. AMD had the right idea with Bulldozer and Fusion (RIP), trying to decouple some math from the cpu cores with the intention of potentially moving them to a gpu. Sadly AMD at the time didn't have the marketing power to force such a paradigm shift, Intel didn't want it because their gpu tech sucked, and Nvidia didn't want it because they didn't have x86 cpu cores (they did experiment with ARM cores though).
    That would mean really separate execution contexts, far more separate than the scalar and vector domains are today. The latency of anything using masks or another form of lane control would be unnecessarily and unjustifiably high in such a scenario, restricting it to use cases with purely sequential vector code. Making it fast would require some form of state sharing, which would be nontrivially complex to implement at anything close to reasonable latency. That is not practical. GPGPUs have their own use cases where they excel, and for good reason, but vector processing as it is defined today in GPCPUs does not align with the approach and direction being taken by GPGPUs.


    Originally posted by TemplarGR View Post
    How much better would the x86 world be today if x86 cores were stripped of all this SIMD bullshit and these functions were moved to GPU cores? Perhaps with an emulation layer for legacy code. I bet the CPU clocks could be raised higher, and the TDPs would be lower for them. Perhaps this would push memory technology like HBM to mature faster and be cheaper.

    Latency is what limits the overall performance. The benefits of parallel processing would be immediately diminished by an interconnect narrower than the vector length between such a GPGPU and the GPCPU, as well as by the need for an equivalent of the decoding, scheduling, and retiring logic on the GPGPU side, which would only increase latency with no added benefit. Vectors are a perfect fit for GPCPUs.
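    The latency argument can be put into a toy cost model. This is a sketch under made-up numbers: the throughput, link bandwidth, and launch latency below are illustrative assumptions, not measurements of any real CPU, GPU, or interconnect.

```python
import math


def cpu_time(n: float, cpu_rate: float) -> float:
    """Seconds to process n elements in the CPU's own vector units."""
    return n / cpu_rate


def offload_time(n, accel_rate, link_bytes_per_s, bytes_per_elem, latency_s):
    """Seconds to ship n elements to a separate device, process them, and pay a fixed latency."""
    return latency_s + n * bytes_per_elem / link_bytes_per_s + n / accel_rate


def break_even_elements(cpu_rate, accel_rate, link_bytes_per_s, bytes_per_elem, latency_s):
    """Smallest batch size at which offloading stops losing, or inf if it never wins."""
    gain_per_elem = 1.0 / cpu_rate - bytes_per_elem / link_bytes_per_s - 1.0 / accel_rate
    if gain_per_elem <= 0:
        return math.inf  # the link alone costs more per element than the CPU does
    return latency_s / gain_per_elem


if __name__ == "__main__":
    # All values are assumptions for illustration.
    n = break_even_elements(
        cpu_rate=1e9, accel_rate=1e10, link_bytes_per_s=1e10,
        bytes_per_elem=4, latency_s=1e-5,
    )
    print(f"offload only pays off above roughly {n:.0f} elements per batch")
```

    With a link that is slow relative to the CPU's own per-element cost, `break_even_elements` returns infinity, which is the "narrow interconnect" case from the post: no batch size makes the offload worthwhile.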



  • mtk520
    replied
    Originally posted by coder View Post
    AVX-512 can be a disaster, for performance! Note what Michael said about compilers now defaulting the vector width to 256-bits, to try and limit clock-throttling.

    Here's the worst-case scenario, for AVX-512: https://blog.cloudflare.com/on-the-d...uency-scaling/
    That is quite an incorrect message overall. The limitations are technological, and what was true for Skylake does not hold true any more for current platforms. The linked article was correct at the time, it is not any more. There is still an impact to both per-core and per-socket performance, but the profiles of impact are quite radically different. And they will continue to change from platform to platform.


    Originally posted by coder View Post
    Intel screwed themselves by getting ahead of what the process technology could support. Just like they did with AVX2, except worse. Dropping the CPU from a base clock of 2.1 GHz to 1.4 is just not forgivable! Especially when you're just executing 512-bit instructions for a small % of the time!
    Yes, that was the price to pay for the introduction of the technology. A reasonable price given the overall situation with process technology and architectural developments. Things have changed quite a lot since then. It is time to stop spreading that horror story.
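    For a sense of scale, the 2.1 GHz to 1.4 GHz drop quoted above can be plugged into a simple worst-case model. This is a sketch with two loud assumptions: the whole run executes at the reduced clock even though only a fraction `p` of the work is vectorized, and 512-bit code is an ideal 2x faster than 256-bit code on that fraction.

```python
def relative_runtime(p: float, speedup: float, base_ghz: float, throttled_ghz: float) -> float:
    """Runtime with wide vectors relative to the narrow baseline (1.0 means no change).

    p is the fraction of baseline runtime spent in vectorizable code.
    Assumes the reduced clock applies to the entire run (worst case).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    remaining_work = (1.0 - p) + p / speedup
    return remaining_work * (base_ghz / throttled_ghz)


def break_even_fraction(speedup: float, base_ghz: float, throttled_ghz: float) -> float:
    """Vectorized fraction at which the wide-vector build matches the baseline."""
    return (1.0 - throttled_ghz / base_ghz) / (1.0 - 1.0 / speedup)


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        print(f"p={p:.1f}: relative runtime {relative_runtime(p, 2.0, 2.1, 1.4):.3f}")
    print(f"break-even fraction: {break_even_fraction(2.0, 2.1, 1.4):.3f}")
```

    Under these assumptions, the vectorized part has to make up about two thirds of the runtime before the wider vectors pay for the clock drop. Real chips do not stay throttled for the entire run, so this is an upper bound on the damage, not a prediction.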

    Originally posted by coder View Post
    Now, my hope and expectation is for Ice Lake SP (and maybe even Rocket Lake) to exercise more care and less latency around clock-speed adjustments, so that it wouldn't be a liability. However, I have yet to see good data on whether Intel managed to effectively mitigate the performance pitfalls of moderate AVX-512 usage, in Ice Lake (or Rocket Lake).

    Take a look at the changes related to power licensing for AVX2 and AVX-512 (or experiment, if you do not have access to the documentation). You might be pleasantly surprised.




  • mtk520
    replied
    Originally posted by coder View Post
    Yeah, they usually have a build-time and/or runtime option to override utilization of certain CPU features. So, PTS should really tie into that, for the results to be meaningful (although we don't necessarily know if the hand-coded stuff tries to use only 256-bit or goes for the full 512).
    The code is written to be vector-width agnostic to a degree. It appears to be more focused on AVX512VL at this time (this is the impression that I get from the current code), in order to have masks available. There is a quite clever and nice (and nontrivial) layer of macros that tries to abstract away the specific vector length from the higher layers of the codebase.
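    As a rough illustration of the idea, and not a reproduction of that codebase's macro layer, here is a plain-Python sketch of a kernel written against a `lanes` parameter, using a per-lane mask for the final partial block instead of a separate scalar tail loop. The function name is made up.

```python
from typing import List


def masked_axpy(a: float, x: List[float], y: List[float], lanes: int) -> List[float]:
    """Compute a*x + y in blocks of `lanes` elements, masking off lanes past the end."""
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    out = list(y)
    for start in range(0, len(x), lanes):
        # One mask bit per lane: set only where the lane maps to a real element.
        mask = [start + i < len(x) for i in range(lanes)]
        for i, active in enumerate(mask):
            if active:
                out[start + i] = a * x[start + i] + y[start + i]
    return out


if __name__ == "__main__":
    x = [float(i) for i in range(10)]
    y = [1.0] * 10
    # The same kernel gives the same answer for 4, 8, or 16 lanes.
    for lanes in (4, 8, 16):
        print(lanes, masked_axpy(2.0, x, y, lanes))
```

    The caller never sees the vector length, which is the property the macro layer is described as providing for the higher layers.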



  • coder
    replied
    Originally posted by TemplarGR View Post
    This pretty much sums it up. Of course Intel can't have that because -at the moment- they don't sell GPUs.
    Their stated strategy is to offer some level of capability on all devices, from CPU -> GPU -> FPGA -> AI processors. That's what oneAPI is all about.


    Of course, some of that is probably working backwards from the situation they got into by trying to use their x86 CPUs as the singular hammer to attack all compute problems.


    If we didn't know any better, that wouldn't sound like a bad approach.

    Source: https://www.anandtech.com/show/15123...pm-mt-11pm-utc

