An Introduction To Intel's Tremont Microarchitecture


  • #71
    Originally posted by carewolf View Post

    No, it says no such thing, though your conclusions are closer. Amdahl's law is just a way to calculate the speed-up if you only speed up a certain part of the execution (for instance, speeding up half the running time 2x gives a total speed-up of 1/0.75 = 1.33x). By adding more cores you only speed up CPU-dependent code that can be parallelized; by increasing single-thread performance you speed up anything CPU-dependent, so a larger section of the code, but by a smaller margin. Still, a lot of execution time is spent waiting on memory or disk, and Amdahl's law would still give you the diminished returns even for improvements to single-thread performance.

    But Amdahl's law says nothing about not being able to parallelize indefinitely, just that the returns are limited by the share of the execution time that you parallelize. For instance, speeding up half the execution time infinitely with infinite cores gives a total speed-up of 1/0.5 = 2x, since the rest of the execution time is unchanged.
    You're definitely wrong; although your math is right, your conclusion doesn't make sense. For example, most x86 processors are OoO superscalar pipelined architectures. Amdahl's law is the reason why none of them have more than 4 execution units per pipeline; they are internally parallelized.

    The math itself -is- the point that you can't parallelize indefinitely, although you can extract parallelism at various scales: at the instruction level, thread level, or process level. Amdahl's law applies independently at each scale.
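The numbers carewolf quotes (speeding up half the runtime 2x gives 1/0.75 ≈ 1.33x; infinitely speeding up half the runtime caps out at 2x) can be checked with a short sketch. The function name is my own illustration, not something from the thread:

```python
def amdahl_speedup(parallel_fraction, factor):
    """Overall speedup when `parallel_fraction` of the runtime
    is accelerated by `factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / factor)

# Speeding up half the runtime 2x: 1 / (0.5 + 0.25) = 1.33x overall.
print(round(amdahl_speedup(0.5, 2), 2))             # 1.33
# Half the runtime with effectively infinite cores: capped at 2x.
print(amdahl_speedup(0.5, float("inf")))            # 2.0
```

The second call shows the ceiling: the serial half of the runtime is untouched, so no amount of extra parallelism pushes the overall speed-up past 2x.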



    • #72
      Originally posted by DavidC1 View Post

      What? No they don't. The RTX 2080 Ti has 4352 CUDA cores. The Vega VII has 3840 CUs. GFLOPS-wise they are the same. Each CU or CUDA core is capable of 2 FLOPs per cycle.
      Except that 1 AMD compute unit is not equivalent to 1 Nvidia CUDA core. At the execution-unit level, AMD's architecture has a lot more of them than Nvidia's.

      AMD's architecture favors compute capacity over compute efficiency, while Nvidia's favors compute efficiency over compute capacity. One CUDA core is essentially a complete processor, while one CU is not...
      Last edited by duby229; 10-29-2019, 06:50 PM.



      • #73
        Originally posted by duby229 View Post

        You're definitely wrong; although your math is right, your conclusion doesn't make sense. For example, most x86 processors are OoO superscalar pipelined architectures. Amdahl's law is the reason why none of them have more than 4 execution units per pipeline; they are internally parallelized.
        No it isn't. The reason is that there are very rarely situations where you can reliably look far enough into the future to execute more in parallel, and power efficiency is more important now. Besides, you can evaluate eight 32-bit integer operations in parallel with a single AVX2 instruction, and you can evaluate two AVX2 instructions in parallel on most Intel CPUs, so 16 operations in parallel is quite common; and when you offload code to the GPU you move to hundreds if not thousands of operations in parallel, which is quite worthwhile if you have ever seen benchmarks of hardware-accelerated "whatever".
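carewolf's "eight 32-bit integer operations per AVX2 instruction" refers to lane-wise packed arithmetic such as the real VPADDD instruction. A plain-Python sketch can mimic the lane-wise behavior (this only models the semantics; it does not issue actual vector instructions):

```python
# Simulate one AVX2-style lane-wise add: eight 32-bit integer lanes
# processed by a single "instruction" (here, one list operation).
MASK32 = 0xFFFFFFFF  # lanes wrap like 32-bit unsigned integers

def vpaddd(a, b):
    """Lane-wise 32-bit add of two 8-element vectors (models VPADDD)."""
    assert len(a) == len(b) == 8
    return [(x + y) & MASK32 for x, y in zip(a, b)]

a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10, 20, 30, 40, 50, 60, 70, 80]
print(vpaddd(a, b))  # [11, 22, 33, 44, 55, 66, 77, 88]
```

With two vector ports executing such instructions per cycle, that is where the "16 operations in parallel" figure in the post comes from.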



        • #74
          Originally posted by carewolf View Post
          No it isn't. The reason is that there are very rarely situations where you can reliably look far enough into the future to execute more in parallel, and power efficiency is more important now. Besides, you can evaluate eight 32-bit integer operations in parallel with a single AVX2 instruction, and you can evaluate two AVX2 instructions in parallel on most Intel CPUs, so 16 operations in parallel is quite common; and when you offload code to the GPU you move to hundreds if not thousands of operations in parallel, which is quite worthwhile if you have ever seen benchmarks of hardware-accelerated "whatever".
          You're not making any sense. AVX2 is an instruction set extension whose instructions are specifically designed for scenarios where data-oriented processing is important, something x86 in general isn't well suited for... You can't just accelerate "whatever"... you can only accelerate specifically designed workloads that -can- be data oriented. Most workloads can't be, exactly -because- of Amdahl's law.

          EDIT: I have to ask, you do know what vectors and scalars are, right? You do know the differences in design choices between scalar and vector pipelines, right? Integer units and floating-point units and memory management units, right? If you don't, then this conversation is a dead end.

          Because your responses to mine make it seem like you don't understand how ISA instructions are processed in different types of execution units depending on the type of data being processed. Bottom line is that bit depth for scalar pipelines, or bit precision for vector pipelines, is not the same thing as execution parallelism. Nor is bit banging or bit trunking the same thing as execution parallelism.
          Last edited by duby229; 10-31-2019, 12:09 AM.



          • #75
            Originally posted by duby229 View Post

            You're not making any sense. AVX2 is an instruction set extension whose instructions are specifically designed for scenarios where data-oriented processing is important, something x86 in general isn't well suited for... You can't just accelerate "whatever"... you can only accelerate specifically designed workloads that -can- be data oriented. Most workloads can't be, exactly -because- of Amdahl's law.

            EDIT: I have to ask, you do know what vectors and scalars are, right? You do know the differences in design choices between scalar and vector pipelines, right? Integer units and floating-point units and memory management units, right? If you don't, then this conversation is a dead end.

            Because your responses to mine make it seem like you don't understand how ISA instructions are processed in different types of execution units depending on the type of data being processed. Bottom line is that bit depth for scalar pipelines, or bit precision for vector pipelines, is not the same thing as execution parallelism. Nor is bit banging or bit trunking the same thing as execution parallelism.
            Maybe you should try reading my response again and learn something?

            A few notes: AVX2 is x86. What it operates on are normal things that happen all the time; if they didn't, it wouldn't be part of a general-purpose CPU. The very existence of those situations completely refutes your claim in the comment I replied to, which is why I brought it up. Failing to recognize this and instead attacking me for being right is what I like to call "being aggressively wrong". Unfortunately it's getting quite common here.
            Last edited by carewolf; 10-31-2019, 06:32 PM.



            • #76
              Originally posted by carewolf View Post
              Maybe you should try reading my response again and learn something?

              A few notes: AVX2 is x86. What it operates on are normal things that happen all the time; if they didn't, it wouldn't be part of a general-purpose CPU. The very existence of those situations completely refutes your claim in the comment I replied to, which is why I brought it up. Failing to recognize this and instead attacking me for being right is what I like to call "being aggressively wrong". Unfortunately it's getting quite common here.
              No, I'm not wrong, you just don't know what you're talking about. AVX2 is not x86; it is an instruction set extension to x86. It does not displace or replace x86, it only extends it. And the instructions in that extension are only useful in very specific scenarios that have to be specifically designed for...

              The point that I'm trying to make is that no x86 architecture is wider than 4-issue. The instructions can be anything, including AVX2, but still no more than 4.

              EDIT: Theoretically, Bulldozer-derived architectures could have been much wider than 4 if you only consider how the front end of the CMT architecture issued instructions to two pipelines, but in actual products it was only ever 4-wide because each of its two pipelines only had two integer units. Anyway, CMT-like architectures still have the potential to be the widest CPU architectures conceivable.
              Last edited by duby229; 11-01-2019, 02:21 AM.



              • #77
                Originally posted by duby229 View Post

                No, I'm not wrong, you just don't know what you're talking about. AVX2 is not x86; it is an instruction set extension to x86. It does not displace or replace x86, it only extends it. And the instructions in that extension are only useful in very specific scenarios that have to be specifically designed for...
                Then take SSE2, which is included in x86-64. The point remains the same. And I would argue the situations are not very specific: processing a lot of data is an extremely common use case for computers.

                In any case, the original point I was making is just that Amdahl's law is a formula for calculating total speed-up when optimizing only a part of a program. What conclusions you draw from it depend on the exact composition of the program and what you are optimizing. If a large part of the execution time is in something data-centric that can be optimized with SIMD, then optimizing it 10x or 100x is often very worthwhile. And those cases are also the ones that remain interesting for adding even more cores. If instead you have a program where only a small fraction of the run-time is trivially parallelizable, then of course such speed-ups would be irrelevant.

                I am not arguing for parallelizing code that isn't trivially parallelizable. As you mention, any small amount of parallel execution that is possible in such code is likely already exploited by an OoO architecture.
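carewolf's point that the worth of a 10x or 100x SIMD speed-up depends entirely on the program's composition can be put in numbers with the same Amdahl formula (illustrative fractions, mine, not from the thread):

```python
def amdahl(fraction, factor):
    """Total speed-up when `fraction` of the runtime is sped up by `factor`."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# 90% of runtime is data-centric: a 100x SIMD speed-up pays off handsomely.
print(round(amdahl(0.90, 100), 1))   # 9.2
# Only 10% is accelerable: even a 100x speed-up barely moves the needle.
print(round(amdahl(0.10, 100), 2))   # 1.11
```

Same formula, opposite conclusions, which is exactly the post's argument: the law itself doesn't say whether parallelizing is worthwhile, the workload's composition does.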



                • #78
                  Originally posted by carewolf View Post
                  Then take SSE2, which is included in x86-64. The point remains the same. And I would argue the situations are not very specific: processing a lot of data is an extremely common use case for computers.

                  In any case, the original point I was making is just that Amdahl's law is a formula for calculating total speed-up when optimizing only a part of a program. What conclusions you draw from it depend on the exact composition of the program and what you are optimizing. If a large part of the execution time is in something data-centric that can be optimized with SIMD, then optimizing it 10x or 100x is often very worthwhile. And those cases are also the ones that remain interesting for adding even more cores. If instead you have a program where only a small fraction of the run-time is trivially parallelizable, then of course such speed-ups would be irrelevant.

                  I am not arguing for parallelizing code that isn't trivially parallelizable. As you mention, any small amount of parallel execution that is possible in such code is likely already exploited by an OoO architecture.
                  So here's a quick rundown. x86 was originally designed with variable-length instructions. But when x86 transitioned to scalar pipelined architectures, implementations added a RISC-like back end: the front end decodes x86 instructions into back-end operations called micro-ops, where the simplest x86 instructions can be decoded into single micro-ops but the complex ones can't be. That's why MMX was invented: MMX displaced the longest and most complex x86 instructions with new ones that could be decoded into a single micro-op. SSE was invented because of SIMD, as you say. But SIMD is not the same as execution parallelism; SSE instructions still get decoded into single micro-ops. AVX was invented because of MIMD, but again it's not the same as execution parallelism, because those instructions still get decoded into single micro-ops.

                  Here's an analogy: say you need to cut one notch in four sheets of paper. With MMX you would have to cut one notch on each sheet of paper, one at a time. With SSE it's like stacking the sheets, so when you cut one notch one time it cuts all four sheets at the same time. Extending the analogy to AVX, say you had to cut four notches into four sheets: with AVX it's like four scissors cutting the notches in the stack of sheets all at the same time. But it's still not the same as execution parallelism, because those AVX instructions still get decoded into single micro-ops.
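duby229's paper-cutting analogy boils down to operation counts: scalar code issues one operation per element, packed (SIMD) code issues one operation per stack. A rough sketch of the bookkeeping, with the sheet/notch encoding being my own illustration:

```python
# "One notch in four sheets": each sheet is a bitfield, the notch is a bit.
sheets = [0b0000, 0b0000, 0b0000, 0b0000]
NOTCH = 0b0100

# Scalar (MMX-analogy): one cut operation per sheet, four in total.
scalar_ops = 0
scalar_result = []
for sheet in sheets:
    scalar_result.append(sheet | NOTCH)
    scalar_ops += 1

# Packed (SSE-analogy): one operation cuts the whole stack at once.
packed_ops = 1
packed_result = [sheet | NOTCH for sheet in sheets]  # one "instruction"

print(scalar_ops, packed_ops)          # 4 1
print(scalar_result == packed_result)  # True
```

The results are identical; only the number of issued operations differs, which is the data-parallelism the analogy describes, as distinct from the issue-width parallelism discussed earlier in the thread.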

