Announcement

**NobodyXu** · 06 August 2022, 10:52 PM

Originally posted by coder View Post

Everything has overheads. I don't know specifically what form of branchlessness you're talking about, but there's usually no such thing as a free lunch.

Yes, and the cost for branchlessness is you have to be careful when implementing the algorithm, check the generated assembly to ensure it doesn't end up having a branch, and check the perform to see if it worths it.

I read somewhere that it is better to avoid branches in quick sort, because the CPU can't make a good prediction of it.

Originally posted by coder View Post

If some non-local change is made that invalidates a set of hints, you might see a performance regression -- if you're looking for carefully enough for them.

Yeah, that's why I don't think hinting should be promoted everywhere as it can only be used in very specific scenarios.

For most softwares, it makes more sense to use PGO.

**coder** · 06 August 2022, 11:07 PM

Originally posted by NobodyXu View Post

Yes, and the cost for branchlessness is you have to be careful when implementing the algorithm, check the generated assembly to ensure it doesn't end up having a branch, and check the perform to see if it worths it.

Can you please cite some examples, so I can understand what you have in mind?

The main form of "branchlessness" I'm aware of is predication, but that's only a win for cheap operations and doesn't scale at all to more deeply-nested control flow.

Another technique applicable to processing vast amounts of data (GPU scale or bigger) is to sort data into batches, according to which exit it would take from a basic block. You can then process each batch of data over each basic block, for instance using a SIMD processor. The downside of this approach is the loss of data locality.

Originally posted by NobodyXu View Post

I read somewhere that it is better to avoid branches in quick sort, because the CPU can't make a good prediction of it.

Well, modern CPUs have memory prefetchers that would be equally stymied even if you avoid branches.

I think merge & radix sort tend to parallelize better, and perhaps have more friendly data-access patterns. Maybe I'll see if I can find some good benchmarks on that.

Originally posted by NobodyXu View Post

Yeah, that's why I don't think hinting should be promoted everywhere as it can only be used in very specific scenarios.

If I ever use hints, it's either on hotspot loops that I know will be large, or to pessimize error handling.

In GLib, the latter is actually pretty common: https://www.freedesktop.org/software...-UNLIKELY:CAPS

**NobodyXu** · 07 August 2022, 05:40 AM

Originally posted by coder View Post

Can you please cite some examples, so I can understand what you have in mind?

Checkout this video https://youtu.be/bVJ-mWWL7cE

Originally posted by coder View Post

The main form of "branchlessness" I'm aware of is predication, but that's only a win for cheap operations and doesn't scale at all to more deeply-nested control flow.

Well, it is mainly for math code, so yeah, it has its limitations.
Though I think it is more robust than giving branch prediction hints.

Originally posted by coder View Post

Another technique applicable to processing vast amounts of data (GPU scale or bigger) is to sort data into batches, according to which exit it would take from a basic block. You can then process each batch of data over each basic block, for instance using a SIMD processor. The downside of this approach is the loss of data locality.

Interesting, sounds like comparing matrix/array in R/numpy which gives you a new boolean array to index the original array.

Yeah it loses the data locality and might requires a new allocation.

Originally posted by coder View Post

Well, modern CPUs have memory prefetchers that would be equally stymied even if you avoid branches.

True, nowadays the bottleneck for most programs running on CPU is the latency of the memory.
Even with DDR5, accessing memory is still way slower than the CPU cycle.

Programs that are really computation intensive are either offload to the GPU or to some specialized hardwares.

Originally posted by coder View Post

I think merge & radix sort tend to parallelize better, and perhaps have more friendly data-access patterns. Maybe I'll see if I can find some good benchmarks on that.

Merge sort are very predictor-friendly since it performs linear access on every iteration on each subset.
Radix sort seems to more similar to hashmap, but still very friendly when compared to quick sort's picking a random pivot and partition the array.

Neither merge sort nor quick sort can be easily parallelized.
Merge sort starts with having n blocks that can be parallelized, then on the second iteration it only have n / 2 blocks to parallelize, then n / 4 until it finally has only 3 to merge.
Quick sort starts with 1 block, splitting it into 2, then 4, until it is smaller enough to run insertion sort on it.

I think Radix sort can be parallelized when collecting the elements into the buckets using a local bucket per thread, though this means that it becomes an unstable sorting algorithm.

Originally posted by coder View Post

If I ever use hints, it's either on hotspot loops that I know will be large, or to pessimize error handling.

Yep, though for loops, doesn't GCC and LLVM assumes that the looping condition will likely to be true?

Originally posted by coder View Post

In GLib, the latter is actually pretty common: https://www.freedesktop.org/software...-UNLIKELY:CAPS

Thanks for the information!

I can certainly imagine why it is useful for libc: memcpy, memmove, strchr and others can all use these hints to some degrees.

**coder** · 07 August 2022, 06:32 PM

Originally posted by NobodyXu View Post

Checkout this video https://youtu.be/bVJ-mWWL7cE

Didn't watch. Maybe later. Post anything else, if you've got it.

Originally posted by NobodyXu View Post

Well, it is mainly for math code, so yeah, it has its limitations.
Though I think it is more robust than giving branch prediction hints.

ARM actually removed predication from most of its instructions, in the transition from v7 to v8. That's a datapoint which might be relevant.

It's certainly a key technique in writing SIMD code. AVX-512 added some additional facilities (I think making it a first class feature, rather than forcing programmers to emulate predication through bit-masking).

Originally posted by NobodyXu View Post

Interesting, sounds like comparing matrix/array in R/numpy which gives you a new boolean array to index the original array.

Yeah it loses the data locality and might requires a new allocation.

I used to think about writing a ray-tracing engine for GPUs that sorted rays into batches, based on which paths they took through a BSP tree or BVH. In that way, my hope was to trade a bit more memory traffic by sorting these batches, for the benefit of substantially improving coherency of leaf node & polygon accesses. Not sure if it'd have been worth the trouble, other than perhaps for global illumination.

It seems to me that ray tracing seems to involve lots of compute, but really it's more of a data coherency problem. At least for something like a GPU, which has oodles of raw compute power.

Originally posted by NobodyXu View Post

True, nowadays the bottleneck for most programs running on CPU is the latency of the memory.
Even with DDR5, accessing memory is still way slower than the CPU cycle.

Memory prefetchers don't get enough attention, IMO. They're probably second only to OoO execution in terms of their importance in hiding memory access latency. So, while you might avoid tripping up OoO by avoiding branch prediction, if the memory prefetcher can't guess which data to load, you will still end up stalling the pipeline for >= 100 ns on a full cache miss (~500 cycles on a 5 GHz CPU).

BTW, the access latency of external DRAM DIMMs never really changes, I think. If anything, each new DDR standard tends to make it a little worse (though, I think that ground tends to be regained as higher-speed modules are developed). Newer DDR standards are really about increasing bandwidth, capacity, and energy efficiency. DDR5 also splits each 64-bit DIMM into two independent 32-bit subchannels, to enable more parallelism, which is something LPDDR4 already did.

Originally posted by NobodyXu View Post

Programs that are really computation intensive are either offload to the GPU or to some specialized hardwares.

Offload to GPUs and other accelerators is still happening only for the minority of workloads. Mainly number-crunching (including AI), video compression, and graphics. Most server & cloud software still relies just on CPU cores.

During my normal workday, my web browser & GUI are probably the only things touching my GPU. Not my compiler, though -- and that is doing lots of heavy computation for me.

Originally posted by NobodyXu View Post

Neither merge sort nor quick sort can be easily parallelized.

Well, merge sort one of the first algorithms I ever looked at writing in OpenCL. To effectively parallelize it, you do need data redistribution between each iteration.

Bitonic sorter - Wikipedia

https://en.wikipedia.org/wiki/Bitonic_sorter

Originally posted by NobodyXu View Post

Yep, though for loops, doesn't GCC and LLVM assumes that the looping condition will likely to be true?

I've not had occasion to analyze what sort of effect the hints actually have. There was one case where I used it rather hastily and under deadline pressure. It had a very small, but seemingly real impact on that loop. The compiler we used at the time was GCC 9.3, I believe.

**NobodyXu** · 08 August 2022, 04:39 AM

Originally posted by coder View Post

It's certainly a key technique in writing SIMD code. AVX-512 added some additional facilities (I think making it a first class feature, rather than forcing programmers to emulate predication through bit-masking).

Yep, branches is hard to vectorize and I think that's why compiler needs to unroll the loops before they can vectorize them.

Originally posted by coder View Post

I used to think about writing a ray-tracing engine for GPUs that sorted rays into batches, based on which paths they took through a BSP tree or BVH. In that way, my hope was to trade a bit more memory traffic by sorting these batches, for the benefit of substantially improving coherency of leaf node & polygon accesses. Not sure if it'd have been worth the trouble, other than perhaps for global illumination.

It seems to me that ray tracing seems to involve lots of compute, but really it's more of a data coherency problem. At least for something like a GPU, which has oodles of raw compute power.

Amazing, I have no experience in GPU programming and ray tracing though.

Originally posted by coder View Post

Memory prefetchers don't get enough attention, IMO. They're probably second only to OoO execution in terms of their importance in hiding memory access latency. So, while you might avoid tripping up OoO by avoiding branch prediction, if the memory prefetcher can't guess which data to load, you will still end up stalling the pipeline for >= 100 ns on a full cache miss (~500 cycles on a 5 GHz CPU).

Yep, in modern computing once you have a cache-miss you have already lost the game of performance.

Originally posted by coder View Post

BTW, the access latency of external DRAM DIMMs never really changes, I think. If anything, each new DDR standard tends to make it a little worse (though, I think that ground tends to be regained as higher-speed modules are developed). Newer DDR standards are really about increasing bandwidth, capacity, and energy efficiency. DDR5 also splits each 64-bit DIMM into two independent 32-bit subchannels, to enable more parallelism, which is something LPDDR4 already did.

It's a shame that we never get to reduce the latency of RAM even if it is the main bottleneck.

Originally posted by coder View Post

Offload to GPUs and other accelerators is still happening only for the minority of workloads. Mainly number-crunching (including AI), video compression, and graphics. Most server & cloud software still relies just on CPU cores.

During my normal workday, my web browser & GUI are probably the only things touching my GPU. Not my compiler, though -- and that is doing lots of heavy computation for me.

Yep, GPU are only good for highly-parallelizable computing with little to no branch and other accelerators are only good for what they are designed for.

Originally posted by coder View Post

Well, merge sort one of the first algorithms I ever looked at writing in OpenCL. To effectively parallelize it, you do need data redistribution between each iteration.

Bitonic sorter - Wikipedia

https://en.wikipedia.org/wiki/Bitonic_sorter

Bitonic sorter does not scale well over number of inputs though, so it is only works when number of inputs is known ahead of time.

Originally posted by coder View Post

I've not had occasion to analyze what sort of effect the hints actually have. There was one case where I used it rather hastily and under deadline pressure. It had a very small, but seemingly real impact on that loop. The compiler we used at the time was GCC 9.3, I believe.

I read somewhere that for loop conditions, they are default to "likely" so adding hints for them doesn't make much sense.

**coder** · 08 August 2022, 01:39 PM

Originally posted by NobodyXu View Post

Yep, in modern computing once you have a cache-miss you have already lost the game of performance.

SMT helps, which is the main solution GPUs employ. Or, if you just have so many small cores that you just accept some are going to be stalled on memory at any given time.

Originally posted by NobodyXu View Post

It's a shame that we never get to reduce the latency of RAM even if it is the main bottleneck.

You can reduce it by putting it in package. I'm not sure how much.

Part of the problem is the physics of how long it takes signals to travel those distances, and the other part is a matter of how DRAM works. These simple facts have given rise to things like 3 & 4-level cache hierarchies and memory prefetchers. Of course, the cache hierarchy almost certainly adds some latency to DRAM fetches, but probably not that much.

Originally posted by NobodyXu View Post

Yep, GPU are only good for highly-parallelizable computing with little to no branch and other accelerators are only good for what they are designed for.

GPUs are good at branches, as long as you can guarantee all of the SIMD lanes are taking the same branch. Then, you don't have to bother with things like predication.

Originally posted by NobodyXu View Post

Bitonic sorter does not scale well over number of inputs though, so it is only works when number of inputs is known ahead of time.

I don't see why it couldn't be written to handle a variable input size.

**NobodyXu** · 08 August 2022, 11:27 PM

Originally posted by coder View Post

SMT helps, which is the main solution GPUs employ. Or, if you just have so many small cores that you just accept some are going to be stalled on memory at any given time.

SMT does help, though it requires using multithreading and its implementation is complicated by the spectre/meltdown problem.

Does GPU also have SMT? I thought it has >1000 small simple cores without branch prediction and shared the cache.

Originally posted by coder View Post

You can reduce it by putting it in package. I'm not sure how much.

Part of the problem is the physics of how long it takes signals to travel those distances, and the other part is a matter of how DRAM works. These simple facts have given rise to things like 3 & 4-level cache hierarchies and memory prefetchers. Of course, the cache hierarchy almost certainly adds some latency to DRAM fetches, but probably not that much.

The root problem probably is just DRAM is too slow compared to CPU.
If it gets faster, then it will be put into the package.

Originally posted by coder View Post

GPUs are good at branches, as long as you can guarantee all of the SIMD lanes are taking the same branch. Then, you don't have to bother with things like predication.

Hmm, yeap, I missed that.

Originally posted by coder View Post

I don't see why it couldn't be written to handle a variable input size.

You would have to manually construct the Bitonic sorter for every input size you want to support then.

**coder** · 09 August 2022, 02:23 AM

Originally posted by NobodyXu View Post

SMT does help, though it requires using multithreading and its implementation is complicated by the spectre/meltdown problem.

In spite of these side-channel attacks, Intel and AMD keep it in their x86 CPUs. I think that shows how important it is for maintaining competitive performance.

Originally posted by NobodyXu View Post

Does GPU also have SMT?

Yes. All of them, as far as I'm aware. I know less about mobile GPUs, but certainly AMD, Nvidia, and Intel all use it to a substantial degree.

Originally posted by NobodyXu View Post

I thought it has >1000 small simple cores without branch prediction and shared the cache.

Sort of. First, what you're calling a "core" is better thought of as a SIMD lane. Thus far, GPUs have only reached a couple hundred of what would be equivalent to a CPU core. Each GPU vendor has their own terminology for them, but Nvidia's SM, AMD's CU, and Intel's EU are all roughly similar to a CPU core. I think AMD has a new name for them, in RDNA, and Intel renamed them in Xe, but I forget the new terminology.

Second, they're in-order. This minimizes area spent on anything other than computation and memory, but it makes them extremely sensitive to latency.

That leads us to their solution for latency hiding, which is to use many SMT threads (Nvidia calls these Warps, AMD calls them Wavefronts, and I think Intel just calls them threads) to hide memory latency. As you surely know, graphics is an incredibly bandwidth-intensive affair, which means lots of memory accesses and that makes latency among their biggest problems.

BTW, GPUs depend quite heavily on directly-addressable, on-die SRAM. In some implementations, a configurable amount of this can be treated as cache. Usually, both cache and addressable SRAM are situated in a 2-level hierarchy (i.e. there's L1 and L2 cache; likewise, local SRAM and shared SRAM).

Originally posted by NobodyXu View Post

The root problem probably is just DRAM is too slow compared to CPU.
If it gets faster, then it will be put into the package.

Here's something interesting to consider: HBM is DRAM, and is also stacked in-package.

You've no doubt heard about Apple's M1 product line? They use LPDDR in-package. IIRC, they started with LPDDR4 and moved up to LPDDR5, in Pro/Max.

You can find whitepapers and Hot Chips presentations for many recent GPUs. Intel started including much less detail, after their Gen 9 iGPUs. Here are a couple:

**NobodyXu** · 09 August 2022, 03:03 AM

Originally posted by coder View Post

Sort of. First, what you're calling a "core" is better thought of as a SIMD lane. Thus far, GPUs have only reached a couple hundred of what would be equivalent to a CPU core. Each GPU vendor has their own terminology for them, but Nvidia's SM, AMD's CU, and Intel's EU are all roughly similar to a CPU core. I think AMD has a new name for them, in RDNA, and Intel renamed them in Xe, but I forget the new terminology.

Second, they're in-order. This minimizes area spent on anything other than computation and memory, but it makes them extremely sensitive to latency.

It sounds like a GPU core is actually quite large and each core have a lot of arithmetic engines.

This makes me wonder will we have CPU and GPU unified, e.g. issuing GPU instructions directly from CPU.
I know M1/M2 is probably a step towards this direction, but not sure about the details.

Originally posted by coder View Post

Here's something interesting to consider: HBM is DRAM, and is also stacked in-package.

You've no doubt heard about Apple's M1 product line? They use LPDDR in-package. IIRC, they started with LPDDR4 and moved up to LPDDR5, in Pro/Max.[/LIST]

Interesting, I think having it in-package does reduce the latency at the cost of harder to replace/upgrade.
Still it does not improve on latency and has to rely entirely on cache to fix the gap.
Perhaps the only way to improve latency is to add more cache, like the AMD's new 3D vcache.

**coder** · 09 August 2022, 03:38 AM

Originally posted by NobodyXu View Post

It sounds like a GPU core is actually quite large and each core have a lot of arithmetic engines.

Well, 32-way SIMD is 1024-bit. You can compare that to CPU cores with AVX-512.

When comparing GPU cores to CPU cores, note that GPUs still fit a lot more cores per area.

Originally posted by NobodyXu View Post

This makes me wonder will we have CPU and GPU unified, e.g. issuing GPU instructions directly from CPU.

That's not too different from CPUs gaining SIMD instruction sets. Compute-wise, they've never been very competitive with GPUs, however.

Intel is moving a little further in this direction, with AMX, which has 8x 8192-bit tile registers. I think it's a little more loosely coupled to the core than usual.

Originally posted by NobodyXu View Post

I know M1/M2 is probably a step towards this direction, but not sure about the details.

No, it's not. Its GPU is just like other integrated GPUs, in that it's a distinct block and has its own programmable cores that are completely independent of the CPU cores.

Originally posted by NobodyXu View Post

Interesting, I think having it in-package does reduce the latency at the cost of harder to replace/upgrade.

I'm sure it reduces latency, but I'm not sure by how much. A significant part of DRAM's latency is due to the protocol, which is intrinsic to the way it works.

Originally posted by NobodyXu View Post

Perhaps the only way to improve latency is to add more cache, like the AMD's new 3D vcache.

Not the only way. As I said before, hardware prefetching is a key latency-hiding technique, in modern CPUs. Almost as important as Out-of-Order execution.

And just like branch prediction, the prefetchers get smarter and smarter in each successive generation of CPUs.

Announcement

GCC 12 Profile Guided Optimization Benchmarks With The AMD Threadripper 3990X

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment