@oiaohm: Yeah, again you are the master of hardware, and Intel/AMD and all the others clearly haven't known what they're doing with branch predictors for three decades, hindering their own performance (while even Wikipedia claims otherwise, with a reference). I don't CARE about what YOU consider a single-threaded workload, or what RISC-V considers one. Not everything can be parallelized (at least not trivially).
Here's a simple situation for you: your application has, literally, one thread. You programmed it, and it literally has one thread.
Find ways to speed it up without branch prediction. I don't give a FUCK about what other workloads you run on your PC, or how "efficient overall" it is. I want this one single-threaded app to complete fast (which means in less time!). Get it?
Everything you said specifically requires programming for more threads, but there's only one thread in this situation. Deal with the facts.
Real example: you use a strong compression algorithm whose decompression has zero parallelization potential (any parallelization decreases the compression ratio, and we want the maximum possible, so it must be single-threaded!). For example, LZMA with a single block for maximum compression ratio. I want this decoded as fast as possible. Tell me again how you do that without a branch predictor.
The decoder has only one thread, since literally every bit/byte depends on the previous data (so it can't be done in parallel). I don't give a flying fuck about what other workload is on my PC at the same time as the decoder. All I care about is the time between the moment I start the decoder and the moment it finishes.