GCC Lands AVX-512 Fully-Masked Vectorization

  • coder
    replied
    Originally posted by smitty3268 View Post
The reason people hated the initial implementation was that adding AVX instructions to your app affected the performance of all the other non-AVX code running on the CPU. It was unpredictable and could screw you over in various ways.
    You mean by clock-throttling?

    Originally posted by smitty3268 View Post
    Compilers could have added heuristics about whether it made sense to add AVX instructions or not.
No, they can't. They almost never have enough visibility to determine whether it makes sense. They already have enough trouble trying to figure out whether it makes sense to unroll a loop. You'd really need some PGO-based feedback to know how much time is spent in code that could actually use AVX-512, and the answer would change for different workloads.
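Concretely, the experiment a static heuristic can't replace looks something like this. A minimal sketch, assuming a toy saxpy kernel and synthetic input stand in for the real workload; -fprofile-generate, -fprofile-use, -march and -mprefer-vector-width are existing GCC flags, but which width actually wins is exactly the part only the measurement can answer:

```c
/*
 * Hypothetical PGO experiment:
 *
 *   gcc -O3 -march=x86-64-v4 -fprofile-generate pgo_demo.c -o pgo_demo
 *   ./pgo_demo                                  # run on representative input
 *   gcc -O3 -march=x86-64-v4 -mprefer-vector-width=512 -fprofile-use pgo_demo.c -o demo512
 *   gcc -O3 -march=x86-64-v4 -mprefer-vector-width=256 -fprofile-use pgo_demo.c -o demo256
 *
 * The profile tells GCC which loops are hot; timing demo512 against demo256
 * on the real workload answers the 256- vs. 512-bit question empirically.
 */
#include <stddef.h>
#include <stdio.h>

/* A loop GCC auto-vectorizes at -O3. How much of total runtime it accounts
 * for is invisible at compile time without profile feedback. */
static void saxpy(float *restrict y, const float *restrict x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    enum { N = 1 << 20 };
    static float x[N], y[N];

    for (size_t i = 0; i < N; i++) {
        x[i] = (float)i;
        y[i] = 1.0f;
    }
    for (int rep = 0; rep < 1000; rep++)
        saxpy(y, x, 0.5f, N);

    printf("%f\n", y[N - 1]);   /* keep the result observable */
    return 0;
}
```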

    Originally posted by smitty3268 View Post
    Pretty much everything about a chip depends on the process it is on. If Intel couldn't implement a decent version of AVX512 on the node they had, that's their decision to make and it's 100% fair to judge them on it. No one forced them to add a feature that wasn't ready yet to a new CPU that didn't need to have it. "They were on a bad node" is a crappy excuse.
    I don't think we disagree on that. I'm not trying to make excuses for them, just chipping in my own $0.02 on where they made their misstep.



  • smitty3268
    replied
    Originally posted by coder View Post
    That can achieve the same theoretical throughput, but increasing latency can also be a performance killer.
The reason people hated the initial implementation was that adding AVX instructions to your app affected the performance of all the other non-AVX code running on the CPU. It was unpredictable and could screw you over in various ways. If they had just made it predictable, people would have been able to see the performance of AVX code and determine directly whether it was worth using or not. Compilers could have added heuristics about whether it made sense to add AVX instructions or not. Instead, you got something that seemed to speed up the executable you were running, but would also cripple anything else that was running at the same time.

    There's no free lunch, here.
Yeah, that's what I said from the start.

    Going back to that: Pretty much everything about a chip depends on the process it is on. If Intel couldn't implement a decent version of AVX512 on the node they had, that's their decision to make and it's 100% fair to judge them on it. No one forced them to add a feature that wasn't ready yet to a new CPU that didn't need to have it. "They were on a bad node" is a crappy excuse.
    Last edited by smitty3268; 25 June 2023, 06:52 PM.



  • coder
    replied
    Originally posted by smitty3268 View Post
    Double the # of AVX pipelines each core has and run each at half speed, and you get the same performance while using less power.
    That can achieve the same theoretical throughput, but increasing latency can also be a performance killer. Case in point: look at the changes between AMD's GCN and RDNA. They kept the same throughput per CU, but slashed latency to 1/4th, resulting in a substantial performance improvement.

    Getting back to AVX-512, it also creates more work for the scheduler and increases register pressure. There's no free lunch, here.
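A minimal scalar C sketch of that latency point, assuming a simple dot-product reduction (nothing here is AVX-512-specific; the effect just scales up with wider vectors): the single-accumulator loop is bound by the latency of the dependent adds, while splitting the sum across independent accumulators restores throughput at the price of more live registers.

```c
#include <stddef.h>
#include <stdio.h>

/* One loop-carried dependency chain: every multiply-add must wait for the
 * previous sum, so throughput is capped by the add/FMA latency. */
static float dot_single(const float *a, const float *b, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Four independent chains overlap in the pipeline, hiding that latency,
 * but now four partial sums must stay live in registers at once. */
static float dot_quad(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0] * b[i + 0];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; i++)          /* scalar tail */
        s0 += a[i] * b[i];
    return (s0 + s1) + (s2 + s3);
}

int main(void)
{
    enum { N = 1 << 16 };
    static float a[N], b[N];

    for (size_t i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }
    printf("%f %f\n", dot_single(a, b, N), dot_quad(a, b, N));
    return 0;
}
```

Vectorizing compilers do the same splitting at vector width when they're allowed to reassociate floating-point math (e.g. with -ffast-math), which is part of where the extra register pressure comes from.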

Xeon Phi (KNL) implemented AVX-512 on 14 nm, but ran at just 1.5 GHz (1.7 GHz turbo). I think that's very telling, and shows where the proper design point for the technology was.



  • smitty3268
    replied
    Originally posted by coder View Post
That hasn't been true for a couple decades, at least. You used to be able to plow a larger transistor budget into hard-wiring things that had previously been microcoded. However, modern (vector) FPUs are fully hardwired, perhaps with the exception of operations like sqrt. These days, the main tradeoff is how much latency the instructions have. That's really about critical path, which is a matter of clock speed vs. density, but not outright transistor budget.

    What Intel does with greater transistor budgets is things like increasing cache sizes, adding more shadow registers, enlarging reorder buffers, adding more pipelines, adding more instructions... not to mention adding more cores! I think the individual pipeline stages don't change much.
    Double the # of AVX pipelines each core has and run each at half speed, and you get the same performance while using less power.

    Yes, I'm well aware that would make things more complicated/etc. It's all just tradeoffs.

The whole point of AVX512 was to increase the throughput a CPU could achieve. If you only cared about latency for limited amounts of data, you'd stick with AVX2, or even SSE. Nobody would have cared nearly as much about that, versus the reality of what Intel shipped.
    Last edited by smitty3268; 24 June 2023, 09:18 PM.



  • coder
    replied
    Originally posted by smitty3268 View Post
    I'm aware, but my comment stands. And there are generally ways to work around such things if you have an unlimited budget. You just have to design things differently, and the downside is it costs a lot more.
That hasn't been true for a couple decades, at least. You used to be able to plow a larger transistor budget into hard-wiring things that had previously been microcoded. However, modern (vector) FPUs are fully hardwired, perhaps with the exception of operations like sqrt. These days, the main tradeoff is how much latency the instructions have. That's really about critical path, which is a matter of clock speed vs. density, but not outright transistor budget.

    What Intel does with greater transistor budgets is things like increasing cache sizes, adding more shadow registers, enlarging reorder buffers, adding more pipelines, adding more instructions... not to mention adding more cores! I think the individual pipeline stages don't change much.
    Last edited by coder; 23 June 2023, 04:20 AM.



  • smitty3268
    replied
    Originally posted by coder View Post
    It's not about transistor count, but circuit complexity vs. frequency. Smaller nodes tend to come with not only density but also efficiency benefits. That's primarily where 14 nm was lacking -- on the efficiency front.
    I'm aware, but my comment stands. And there are generally ways to work around such things if you have an unlimited budget. You just have to design things differently, and the downside is it costs a lot more.



  • coder
    replied
    Originally posted by smitty3268 View Post
    At some level, virtually everything is a node thing. Throw enough transistors at whatever issue you're having and it tends to go away.
    It's not about transistor count, but circuit complexity vs. frequency. Smaller nodes tend to come with not only density but also efficiency benefits. That's primarily where 14 nm was lacking -- on the efficiency front.



  • smitty3268
    replied
    Originally posted by coder View Post
    Yes, it was a node thing. Intel's AVX-512 implementation hasn't markedly changed (aside from adding more instructions), yet their more recent implementations don't exhibit the same power problems seen in their 14 nm CPUs.
    At some level, virtually everything is a node thing. Throw enough transistors at whatever issue you're having and it tends to go away.

    Smart chip manufacturers know what their node limitations are and design a chip with those in mind, rather than jumping the gun and adding features that can't be implemented very well. It's tough to change the first impression you give to people.



  • coder
    replied
    Originally posted by raystriker View Post
    We don't know if it's a node thing though?
    Yes, it was a node thing. Intel's AVX-512 implementation hasn't markedly changed (aside from adding more instructions), yet their more recent implementations don't exhibit the same power problems seen in their 14 nm CPUs.

Furthermore, Rocket Lake virtually proves it was the node and not the microarchitecture, because Rocket Lake was a 14 nm backport of their Ice Lake core. AVX-512 works a lot better on Ice Lake, but Rocket Lake exhibited all the high power utilization problems seen on other 14 nm CPUs that supported it.

    Originally posted by brucethemoose View Post
This was even an issue for AVX2 in desktop Haswell; people just didn't complain much because voltage/power was increased instead of throttling the way the early AVX512 Xeons did. That whole clocking/power scheme was (in hindsight) a mistake.
They had no choice. Server CPUs can't just stray outside their power envelope the way desktop CPUs can get away with doing. About the only thing they could've done was to substantially increase the TDP rating of their Xeons, but that would've meant servers having overbuilt power & cooling for non-AVX-512 workloads. And don't forget: when Skylake SP first launched, AVX-512 workloads were almost nonexistent. So, you'd be overbuilding servers (i.e. making them more expensive) for virtually no benefit.

The other thing they could've done is not implement dual-FMA, but that would've significantly limited the performance gains of AVX-512 in key workloads.
    Last edited by coder; 21 June 2023, 09:56 PM.



  • coder
    replied
    Originally posted by user556 View Post
Intel's lacklustre efforts to remedy the power consumption problem probably have more than a little to do with them betting on the E-cores. If those didn't exist, then the P-cores would be better for it, IMHO.
    The P-cores might be more efficient, but that would certainly come at the cost of some performance.

However, I think your analysis is flawed. I believe Intel came to the idea of a hybrid CPU somewhat late, after Golden Cove was very far along and they could see how far short the Intel 7 (10 nm ESF) node fell on efficiency. It was at that point they accepted the need to go hybrid, if they were to have any chance of competing on heavily-threaded workloads. This late decision-making explains their muddled messaging and approach to AVX-512 support in Alder Lake.

