AMD Ryzen 7 5800X Linux Performance


  • #31
    Originally posted by marios View Post

    Unless we know the operating system used, the compiler, the compilation flags and the code of the benchmark itself, the score is as valuable as the output of /dev/random.
    They might measure useless workloads, the code might be written by Apple, the competition CPU might be running Windows Vista, the different architectures might be using different compilers with different optimizations, and so on.
    True comparisons can only be made with open-source benchmarks, compiled with -march=native and with equivalent compilers and optimizations. It seems it will take a while until we have that, though (we have neither -march=M1 nor -march=znver3).
    This. However, in this case I bet /dev/random is more predictable. Those benchmark results are useless shit.
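
    As a rough sketch of the kind of like-for-like setup the quoted post is asking for (my own illustration, not anything from the article): the same open-source benchmark source, here a hypothetical bench.c, built with the same compiler and the same -O3 -march=native flags on each machine under test, then timed.

    ```python
    import shutil
    import subprocess
    import time

    # Hypothetical single-file benchmark; -march=native tunes the build for
    # whatever CPU this script is running on, so each machine under test
    # gets an equivalently optimized binary from the same compiler.
    SOURCE = "bench.c"
    FLAGS = ["-O3", "-march=native"]

    def build_and_time(compiler: str = "gcc") -> float:
        if shutil.which(compiler) is None:
            raise RuntimeError(f"{compiler} not found on PATH")
        subprocess.run([compiler, *FLAGS, SOURCE, "-o", "bench"], check=True)
        start = time.perf_counter()
        subprocess.run(["./bench"], check=True)
        return time.perf_counter() - start

    if __name__ == "__main__":
        print(f"{SOURCE} with {' '.join(FLAGS)}: {build_and_time():.3f} s")
    ```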



    • #32
      Originally posted by duby229 View Post
      It's the -ENTIRE- reason why x86 traditionally has only 3 or 4 integer units per pipeline; it's -the- most parallelism that can be extracted when decoding x86 instructions.
      I have to gently disagree here - it's not the most parallelism that can be extracted, but it is definitely the point of diminishing returns for typical code and typical compilers, whether the ISA is x86 or something else.

      It's only the combination of tiny fab processes and a new arms race between CPU vendors that is prompting recent increases in both width (# of ALUs, AGUs, load/store paths, etc.) and depth (reorder buffer, physical register file, load/store queue depth, prefetcher complexity, etc.). There is additional parallelism to be exploited, but it takes a big increase in width and depth to get a fairly small increase in performance, and that just hasn't been worth doing until recently.

      The other relatively recent change is heavy use of micro-op caches, which has largely removed what used to be a bottleneck at the instruction decoder stage. Fixed-length ISAs used to have an advantage here, but even Zen 2 has an 8-wide path from the micro-op cache into the execution pipeline.
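
      As a toy illustration of that width-versus-returns point (a made-up instruction mix, not a model of any real core): the little scheduler below issues up to `width` instructions per cycle, but only once their dependencies have completed, and the cycle count stops improving much once the width exceeds the parallelism available in the dependency chains.

      ```python
      import random

      def schedule_cycles(deps, width):
          """Idealized out-of-order core: unlimited window, 1-cycle latency,
          up to `width` instructions issued per cycle once their deps are done."""
          done = {}                      # instruction index -> cycle it completed
          pending = list(range(len(deps)))
          cycle = 0
          while pending:
              cycle += 1
              issued = []
              for i in pending:
                  if len(issued) == width:
                      break
                  if all(done.get(d, cycle) < cycle for d in deps[i]):
                      issued.append(i)
              for i in issued:
                  done[i] = cycle
                  pending.remove(i)
          return cycle

      def synthetic_stream(n=2000, seed=0):
          """Each instruction depends on up to two of the four instructions
          just before it - short chains, roughly like typical integer code."""
          rng = random.Random(seed)
          deps = []
          for i in range(n):
              deps.append({rng.randrange(max(0, i - 4), i)
                           for _ in range(2) if i and rng.random() < 0.6})
          return deps

      if __name__ == "__main__":
          deps = synthetic_stream()
          base = schedule_cycles(deps, 2)
          for width in (2, 4, 6, 8, 12):
              cycles = schedule_cycles(deps, width)
              print(f"width {width:2d}: {cycles:5d} cycles ({base / cycles:.2f}x vs width 2)")
      ```

      The exact numbers don't matter; the point is that once the width exceeds the parallelism in the dependency chains, the extra issue slots mostly sit idle.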
      Last edited by bridgman; 12 November 2020, 06:03 PM.



      • #33
        Originally posted by discordian View Post

        Sorry, you are talking about architectural instructions, which is very shaky stuff even on RISC.

        Every comparison between CPUs is done with benchmarks that have their own definition of instructions, like arithmetic operations on floats.

        This is the topic: this is how CPUs are measured by benchmarks, and the A16 is leading there by a good margin (it varies between benchmarks, of course). The architecture details just give insight into how that is achieved.

        On x86, the implicit restrictions (variable-length instructions, strong memory ordering) hinder parallelism, which limits how many architectural instructions can run concurrently. But that's the cause; the effect is better benchmark scores than x86 at the same frequency, or in other words more "logical/source code" operations per clock.
        OK, but I think what you're talking about is shaky stuff. SPEC results, for example, rarely if ever conform to real-world usage. But I think I'm splitting hairs about purely synthetic benchmarks now.



        • #34
          Originally posted by bridgman View Post

          I have to gently disagree here - it's not the most parallelism that can be extracted, but it is definitely the point of diminishing returns for typical code and typical compilers, whether the ISA is x86 or something else.

          It's only the combination of tiny fab processes and a new arms race between CPU vendors that is prompting recent increases in both width (# of ALUs, AGUs, load/store paths, etc.) and depth (reorder buffer, physical register file, load/store queue depth, prefetcher complexity, etc.). There is additional parallelism to be exploited, but it takes a big increase in width and depth to get a fairly small increase in performance, and that just hasn't been worth doing until recently.
          I agree with you. I just think we've already passed beyond the point of diminishing returns for x86. Anything further will be measured in tens of percent, not hundreds or thousands of percent. If we need to go wider, then more pipelines are the only feasible option. (And if you make each of those pipelines too big, then -that- would produce diminishing returns on the number of pipelines possible.)
          Last edited by duby229; 12 November 2020, 06:09 PM.



          • #35
            Originally posted by duby229 View Post
            I just think we've already passed beyond the point of diminishing returns for x86.
            Agree, but it is not specific to x86 since (a) micro-op caches already isolate the execution pipeline from decoder throughput in most cases and (b) building a wider decoder for a variable length ISA is not impossible, just another case of diminishing returns.



            • #36
              Originally posted by bridgman View Post

              Agree, but it is not specific to x86 since (a) micro-op caches already isolate the execution pipeline from decoder throughput in most cases and (b) building a wider decoder for a variable length ISA is not impossible, just another case of diminishing returns.
              Yeah, I don't fully understand that. It is capable of queuing more instructions than it has the capability of issuing, which, it seems to me, would add latency between queuing and retiring. I don't know, maybe a longer pipeline hides it somehow.



              • #37
                Originally posted by duby229 View Post
                EDIT: x86 instructions get decoded into RISC-like micro instructions. Those micro instructions are called uops and they have a minimum complexity that is derived from the microarchitecture itself.
                To be clear, Arm's Neoverse N1, IBM's Power9, and likely Apple's recent cores break instructions into internal micro-operations during decode.



                • #38
                  Originally posted by duby229 View Post
                  Yeah, I don't fully understand that. It is capable of queuing more instructions than it has the capability of issuing, which, it seems to me, would add latency between queuing and retiring. I don't know, maybe a longer pipeline hides it somehow.
                  Right, there can be a big latency between queueing and executing (waiting for dependencies) AND between executing and retiring (waiting for instructions from earlier in the program flow to get their dependencies and execute).

                  If you think about a big queue where instructions are entered sequentially, execute whenever their dependencies are satisfied, stay in the queue after execution until all the instructions before them have retired, and are retired sequentially, that's pretty close. What makes it work is that instructions don't have to "work their way through the queue" unless there are earlier instructions waiting for cache or memory - it's more like a ring buffer than a physical FIFO.

                  If you really want a headache, think about the fact that a lot of those already-finished instructions may have been speculatively executed, so their results need to be tossed if a branch goes in a different direction than the predictor expected.
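
                  A minimal sketch of that ring-buffer-style behaviour (my own toy model, nothing to do with any real core's implementation): entries are allocated in program order, can be marked complete in any order, but only retire from the head once everything older has completed.

                  ```python
                  from collections import deque

                  class ReorderBuffer:
                      """Toy reorder buffer: allocate in program order, complete out of
                      order, retire strictly in order from the head."""

                      def __init__(self, size):
                          self.size = size
                          self.entries = deque()          # each entry: [name, completed?]

                      def allocate(self, name):
                          if len(self.entries) == self.size:
                              raise RuntimeError("ROB full - front end must stall")
                          self.entries.append([name, False])

                      def complete(self, name):
                          # Out-of-order completion: mark the entry done wherever it sits.
                          for entry in self.entries:
                              if entry[0] == name:
                                  entry[1] = True
                                  return
                          raise KeyError(name)

                      def retire(self):
                          # In-order retirement: pop only completed entries from the head.
                          retired = []
                          while self.entries and self.entries[0][1]:
                              retired.append(self.entries.popleft()[0])
                          return retired

                  if __name__ == "__main__":
                      rob = ReorderBuffer(size=8)
                      for name in ("load_A", "add_1", "add_2", "mul_1"):
                          rob.allocate(name)

                      # Younger work finishes first, but the stalled load at the head
                      # blocks retirement without blocking execution of later entries.
                      rob.complete("add_1")
                      rob.complete("add_2")
                      print(rob.retire())   # [] - head (load_A) still waiting on memory
                      rob.complete("load_A")
                      print(rob.retire())   # ['load_A', 'add_1', 'add_2'] - in order
                  ```

                  Speculation would add another flag per entry so mispredicted work can be squashed instead of retired; that's left out to keep the sketch small.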
                  Last edited by bridgman; 12 November 2020, 08:24 PM.



                  • #39
                    Originally posted by Space Heater View Post

                    To be clear, Arm's Neoverse N1, IBM's Power9, and likely Apple's recent cores break instructions into internal micro-operations during decode.
                    Also to be clear, of those only Power9 comes close to x86 in instruction set complexity.



                    • #40
                      Originally posted by birdie View Post
                      In terms of performance per dollar and thermals, the Ryzen 5800X is the worst CPU of this lineup.

                      You can pay $100 (22%) more and get 50% more cores with the same thermal package, i.e. the 5900X, and it runs significantly cooler too.

                      Or you can pay $150 (33%) less and lose just 25% of the cores, i.e. the 5600X.
                      The 5600X and 5900X use hexa-core CCXs, so the 5900X would have some overhead, such as extra latency, when a workload jumps between the two CCXs. That may not be much of an issue depending on what you run, and on whether you keep it constrained to a specific CCX's cores (see the sketch at the end of this post). But if you wanted to run something with 8 cores/16 threads instead of 6/12, you'd have a slight perf dip AFAIK.

                      For many the difference is probably negligible enough that it's not worth worrying about, but I'd rather have a full 8-core CCX personally. Those who don't care can enjoy the savings of the models using 6-core CCXs.

                      ---

                      Benchmarks generally focus on single-core/thread perf or on multi-threaded runs using all the cores/threads, so this type of workload obviously isn't going to show its advantages in such a scenario. Or do you know of a benchmark testing 8-core/16-thread workloads on a 5900X?
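
                      As a rough sketch of what constraining a workload to one CCX could look like on Linux (the CPU numbering below is an assumption for illustration - check `lscpu -e` or the L3 topology in sysfs for the actual layout on a given 5900X):

                      ```python
                      import os

                      # Assumed layout, for illustration only: logical CPUs 0-5 plus their
                      # SMT siblings 12-17 as one six-core CCX of a 5900X. Verify against
                      # /sys/devices/system/cpu/cpu*/cache/index3/shared_cpu_list.
                      CCX0_CPUS = set(range(0, 6)) | set(range(12, 18))

                      def pin_to_ccx(pid=0, cpus=CCX0_CPUS):
                          """Restrict a process (0 = the calling process) to one CCX so its
                          threads share a single L3 slice instead of hopping between CCXs."""
                          os.sched_setaffinity(pid, cpus)

                      if __name__ == "__main__":
                          pin_to_ccx()
                          print("running on CPUs:", sorted(os.sched_getaffinity(0)))
                          # ...start the 6-core/12-thread workload from here...
                      ```

                      (The same thing can be done from the shell with taskset -c.)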
                      Last edited by polarathene; 12 November 2020, 09:05 PM.

