
An Introduction To Intel's Tremont Microarchitecture


  • #71
    Originally posted by duby229 View Post

    You're not making any sense. AVX2 is an instruction set extension whose instructions are specifically designed for scenarios where data-oriented processing is important, something x86 in general isn't well suited for... you can't just accelerate "whatever"... you can only accelerate specifically designed workloads that -can- be data-oriented. Most workloads can't be, exactly -because- of Amdahl's law.

    EDIT: I have to ask, you do know what vectors and scalars are, right? You do know the differences in design choices between scalar and vector pipelines, right? Integer units and floating point units and memory management units, right? If you don't, then this conversation is a dead end.

    Because your responses to mine make it seem like you don't understand how ISA instructions are processed in different types of execution units depending on the type of data being processed. Bottom line is that bit depth for scalar pipelines, or bit precision for vector pipelines, is not the same thing as execution parallelism. Nor is bit banging or bit trunking the same thing as execution parallelism.
    Maybe you should try reading my response again and learn something?

    A few notes: AVX2 is x86. What it operates on are normal things that happen all the time; if they didn't, it wouldn't be part of a general-purpose CPU. The very existence of those situations completely refutes your claim in the comment I replied to, which is why I brought it up. Failing to recognize this and instead attacking me for being right is what I like to call "being aggressively wrong". Unfortunately, it's getting quite common here.
    Last edited by carewolf; 31 October 2019, 06:32 PM.

    Comment


    • #72
      Originally posted by carewolf View Post
      Maybe you should try reading my response again and learn something?

      A few notes: AVX2 is x86. What it operates on are normal things that happen all the time; if they didn't, it wouldn't be part of a general-purpose CPU. The very existence of those situations completely refutes your claim in the comment I replied to, which is why I brought it up. Failing to recognize this and instead attacking me for being right is what I like to call "being aggressively wrong". Unfortunately, it's getting quite common here.
      No, I'm not wrong, you just don't know what you're talking about. AVX2 is not x86, it is an instruction set extension on x86; it does not displace or replace x86, it only extends it. And the instructions in that extension are only useful in very specific scenarios that have to be specifically designed for...

      The point that I'm trying to make is that no x86 architecture is wider than 4-issue. And those can be any instructions, including AVX2, but still no more than 4.

      EDIT: Theoretically, Bulldozer-derived architectures could have been much wider than 4 if you only consider how the front end of its CMT architecture issued instructions to 2 pipelines. But in actual products it was only ever 4-wide, because each of its two pipelines only ever had two integer units. Anyway, CMT-like architectures still have the potential to be the widest CPU architectures conceivable.
      Last edited by duby229; 01 November 2019, 02:21 AM.

      Comment


      • #73
        Originally posted by duby229 View Post

        No, I'm not wrong, you just don't know what you're talking about. AVX2 is not x86, it is an instruction set extension on x86; it does not displace or replace x86, it only extends it. And the instructions in that extension are only useful in very specific scenarios that have to be specifically designed for...
        Then take SSE2, which is included in x86-64. The point remains the same. And I would argue the situations are not very specific: processing a lot of data is an extremely common use case for computers.

        In any case, the original point I was making is just that Amdahl's law is a formula for calculating total speed-up when optimizing only part of a program. What conclusions you draw from it depend on the exact composition of the program and on what you are optimizing. If a large part of the execution time is in something data-centric that can be optimized by SIMD, then optimizing it 10x or 100x is often very worthwhile. And those cases are also the ones that remain interesting for adding even more cores. If instead you have a program where only a small fraction of the run-time is trivially parallelizable, then of course such speed-ups would be irrelevant.

        I am not arguing for parallelizing code that isn't trivially parallelizable. As you mention, any small amount of parallel execution that is possible in such code is likely already exploited by an OOO architecture.

        Comment


        • #74
          Originally posted by carewolf View Post
          Then take SSE2, which is included in x86-64. The point remains the same. And I would argue the situations are not very specific: processing a lot of data is an extremely common use case for computers.

          In any case, the original point I was making is just that Amdahl's law is a formula for calculating total speed-up when optimizing only part of a program. What conclusions you draw from it depend on the exact composition of the program and on what you are optimizing. If a large part of the execution time is in something data-centric that can be optimized by SIMD, then optimizing it 10x or 100x is often very worthwhile. And those cases are also the ones that remain interesting for adding even more cores. If instead you have a program where only a small fraction of the run-time is trivially parallelizable, then of course such speed-ups would be irrelevant.

          I am not arguing for parallelizing code that isn't trivially parallelizable. As you mention, any small amount of parallel execution that is possible in such code is likely already exploited by an OOO architecture.
          So here's a quick rundown. x86 was originally designed with variable-length instructions. But when x86 transitioned to scalar pipelined architectures, implementations added a RISC-like back end: the front end decodes x86 instructions into back-end instructions called micro-ops, where the simplest x86 instructions could be decoded into single micro-ops but the complex ones couldn't be. That's why MMX was invented: MMX displaced the longest and most complex x86 instructions with new ones that could be decoded into a single micro-op. SSE was invented because of SIMD, as you say. But SIMD is not the same as execution parallelism; SSE instructions still get decoded into single micro-ops. AVX was invented because of MIMD, but again it's not the same as execution parallelism, because those instructions still get decoded into single micro-ops.

          Here's an analogy: say you need to cut one notch in 4 sheets of paper. With MMX you would have to cut one notch, one time, on each sheet of paper, one at a time. With SSE it's like stacking the sheets, so when you cut one notch one time, it cuts all four sheets at the same time. Extending this analogy to AVX: say you had to cut four notches into four sheets; with AVX it's like four scissors cutting the notches in the stack of sheets all at the same time. But it's still not the same as execution parallelism, because those AVX instructions still get decoded into single micro-ops.

          Comment
