CPUs From 2004 Against AMD's New 64-Core Threadripper 3990X + Tests Against FX-9590


  • #51
    Originally posted by boitano View Post
    AMD completely outsells Intel in the DIY market for performance reasons. Intel outsells AMD in the OEM market for business reasons that have nothing to do with CPU performance. Unfortunately the DIY market is tiny compared to the OEM one.
    Quite true, and unfortunately history has shown us that the "reasons" in the OEM market are due to anti-competitive behavior from intel. The Zen product line has been so well received, however, that OEMs may be forced to go against intel's wishes. It will be very interesting indeed if Apple goes AMD in the next iteration of their laptops or desktops...

    Comment


    • #52
      Originally posted by Raka555 View Post

      With int64_t:

      r7-3700x
      real 0m8.138s vs 0m8.138s

      i7-3770:
      real 0m13.938s vs 0m4.083s

      i7-4600u:
      real 0m15.719s vs 0m4.883s
      Oh cool, Intel has an optimized divider for <64-bit operands. Probably very useful for binary-only benchmark programs conceived in the Windows XP era.
      But that aside, the test program shows that even for a plain "int" current compilers emit 32-bit division, so maybe this is common enough to be worth optimizing the hardware for.
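
      For reference, a minimal trial-division counter along these lines should reproduce the effect. Raka555 didn't post the exact source, so the 10,000,000 limit and the loop shape below are my own guesses; only the int vs int64_t switch matters:

      /* Hypothetical stand-in for the prime.c being timed above -- the exact
         source wasn't posted, so the limit and the trial-division approach
         are assumptions. The hot loop is dominated by the integer modulo,
         and INT controls whether the compiler emits 32-bit or 64-bit divide
         instructions. */
      #include <stdio.h>
      #include <stdint.h>

      typedef int INT;   /* swap for int64_t to build the "prime64" variant */

      int main(void)
      {
          INT limit = 10000000;
          long count = 0;

          for (INT n = 2; n < limit; n++) {
              int is_prime = 1;
              for (INT d = 2; d * d <= n; d++) {
                  if (n % d == 0) {   /* the expensive division/modulo */
                      is_prime = 0;
                      break;
                  }
              }
              count += is_prime;
          }
          printf("%ld\n", count);
          return 0;
      }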
      Last edited by mlau; 02-10-2020, 03:17 PM.

      Comment


      • #53
        Originally posted by boitano View Post
        Maybe I'm wrong, but it seems to me AMD put a 64-core workstation CPU on the market more because they can than because there's a sizeable market for it. Feels more like a trollish move against Intel.
        I guess intel also launched the $10,000 Xeon Platinum 8280 just to troll AMD, not to actually sell product because there's a market for it?

        It's true there isn't a ton of software yet that can take advantage of that many threads at once. But those who have that kind of workload already own the required software and have pockets deep enough to buy these flagship chips. Think MATLAB, Maya, 4K video editing, or finite element analysis (structural stress, etc.). Outside of these specialized segments, threading has always been a chicken-or-egg scenario. The hardware doesn't exist because there's no software to take advantage of it. The software doesn't exist because why put in the effort when there's no hardware to run it on?

        AMD is doing something bold here, and they're walking the walk when it comes to delivering massive performance in a single socket. This is the future. IPC has not increased dramatically over the years, as the benchmarks clearly show. If it were up to intel, we'd all still be running 4-core chips based on sandy bridge and 14nm++++. Or maybe even 32-bit chips, because Itanium.
        Last edited by torsionbar28; 02-10-2020, 04:56 PM.

        Comment


        • #54
          Originally posted by Raka555 View Post

          With int64_t:

          r7-3700x
          real 0m8.138s vs 0m8.138s

          i7-3770:
          real 0m13.938s vs 0m4.083s

          i7-4600u:
          real 0m15.719s vs 0m4.883s

          This is a nice find. So 64bit integers are hurting Intel big time, while they are doing very well with 32bit integers. No wonder they were pushing x32 so hard.
          I wonder what happened to the x32 efforts...

          Something similar seems to be happening on the RPIs as well. "lilunxm12" reported the following on their pi3b+ on 64bit Ubuntu 20.04:
          real 0m18.248s
          user 0m18.193s
          sys 0m0.005s

          I am getting the following on my rpi3b+ with Raspbian 10 32bit:
          real 1m26.870s
          user 1m26.827s
          sys 0m0.022s

          That is a huge difference. But my RPI might be throttling as I don't have active cooling on it.
          On 20.04 arm64, using int64_t gives
          real 0m33.329s
          user 0m33.280s
          sys 0m0.009s
          You just shouldn't do that on a 32-bit system, as a single register can't hold the operand.
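
          To illustrate the point (my own example, not from the thread): on 32-bit ARM a 64-bit divide can't be done in a single register, so the compiler emits a call into the compiler runtime instead of one instruction, while the same C compiles to a single divide on aarch64.

          /* Illustration only: on armv7 (32-bit) gcc/clang lower a 64-bit
             division to a runtime helper call (typically __aeabi_ldivmod from
             libgcc/compiler-rt), while on aarch64 it becomes a single sdiv. */
          #include <stdint.h>

          int64_t div64(int64_t n, int64_t d)
          {
              return n / d;   /* armv7: call __aeabi_ldivmod; aarch64: sdiv */
          }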

          Comment


          • #55
            Originally posted by torsionbar28 View Post
            $10,000 Xeon Platinum 8280
            $10,000 for 28 cores vs. $4,000 for 64 cores. Brutal.

            Originally posted by torsionbar28 View Post
            threading has always been a chicken or egg scenario.
            One totally non-professional and mundane task I would like to see these many-core CPUs used for in the future is AI simulation in video games, where there are hundreds of NPCs in a scene, each one simulated with a dedicated thread. NPCs are usually so stupid that it ruins all the immersion.

            Comment


            • #56
              Originally posted by Raka555 View Post

              If you mean what I get when timing it:

              r7-3700x:
              real 0m8.138s

              i7-3770:
              real 0m4.083s

              i7-4600u:
              real 0m4.883s
              I know that's not an apples-to-apples comparison, but just out of curiosity I compiled and tested it on a phone:

              Snapdragon 820:
              $ clang --version
              clang version 8.0.1 (tags/RELEASE_801/final)
              Target: aarch64-unknown-linux-android
              $ clang prime.c -o prime -O3
              $ time ./prime
              664580

              real 0m2.402s
              user 0m2.380s
              sys 0m0.000s


              Update:

              Int64 version
              $ clang prime64.c -O3 -o prime64
              $ time ./prime64
              664580

              real 0m2.489s
              user 0m2.470s
              sys 0m0.000s

              So either clang generates significantly better code than gcc, or it's time to switch to arm64. But I'd rather say this benchmark is not that representative of real-world CPU performance.
              Last edited by klokik; 02-10-2020, 06:33 PM.

              Comment


              • #57
                Originally posted by TemplarGR View Post
                Bulldozer was a great architecture and was a step towards Fusion. AMD's grand plan was to eventually eliminate the FPU and SIMD from the cpu cores completely and move those calculations to the iGPU. This makes a metric ton of sense, since cpu cores only rarely calculate floating point math, and those calculations are better suited for gpgpu, which is only hindered these days by pcie latency. AMD Fusion was the best idea for cpus in 2 decades. But AMD didn't have the software and marketing grunt to push for such a change, and Intel, realising they would lose if AMD went that road, doubled up on AVX and their floating point calculations, especially per thread.
                Bulldozer was not a good architecture. Bulldozer was an architecture designed around some very specific bets. AMD bet that it could slim down the core and gain size advantages, even though it was competing at n-1 versus Intel. It bet that it could offset the performance impacts of that slimming by scaling clock, and it bet that it could keep the IPC impact of certain Bulldozer design decisions to a minimum.

                The first bet worked: Bulldozer cores were smaller than K10 cores had been. The second bet succeeded marginally with Piledriver -- AMD was definitely able to get higher clocks out of the uarch compared to K10. But the third bet -- the one about IPC? That failed. Instead of being confined to a small number of scenarios, the "corner cases" of Bulldozer dominated its performance. L1 cache contention between the two CPU cores caused lower than expected CPU scaling in multi-threaded workloads, which exacerbated the problem. Kaveri and Carrizo would later improve this penalty by increasing the size of the L1 caches and making other improvements to the chip, but those changes came much later. The original BD design was downright bad, which is why the CPU was delayed from launch in 2011. AMD had good yield on Bulldozer (a CPU they knew wouldn't be competitive) and poor yield on Llano, the APU it could have actually sold. AMD delayed Bulldozer and forced GF to eat a mountain of losses by only paying them for good Llano die. GF paid them back the following year by forcing AMD to give up its interest in the foundry in exchange for being able to manufacture parts at TSMC.

                The idea you have -- namely that you can just replace your FPU with a GPU -- never would have worked. For starters, there's non-zero latency involved in spinning a workload off to the GPU. That workload has to be set up and initialized by the CPU, and GPUs are high-latency devices. Yes, AMD made some noise about this idea at the very beginning of the Fusion PR rollout, but there's a reason they never pursued it. GPU acceleration is about using the GPU for workloads and in areas where it makes sense to do so, not in areas where it does not. In many cases, the latency hit for setting a problem up on the GPU is larger than the increased performance the video card can bring to bear on the problem.
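
                A quick back-of-the-envelope version of that argument (the numbers below are invented for illustration, not measurements): with a fixed setup cost per offload, small floating-point workloads never recoup the dispatch latency.

                /* Toy model of the offload trade-off described above. All three
                   numbers are assumptions chosen for illustration. */
                #include <stdio.h>

                int main(void)
                {
                    double overhead_us   = 10.0;   /* assumed fixed cost per GPU dispatch  */
                    double cpu_ns_per_el = 1.0;    /* assumed CPU time per FP element      */
                    double gpu_ns_per_el = 0.125;  /* assumed GPU time per FP element (8x) */

                    /* Offload only pays off once n * (cpu - gpu) exceeds the overhead. */
                    double breakeven = overhead_us * 1000.0 / (cpu_ns_per_el - gpu_ns_per_el);
                    printf("break-even problem size: ~%.0f elements\n", breakeven);
                    return 0;
                }

                Anything smaller than that -- like the handful of scalar floating-point ops a typical CPU-side code path does -- is faster to just keep on the FPU.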

                Originally posted by TemplarGR View Post
                These days on 7nm, cpu cores, even with all those SIMD parts, are TINY. It would have made a lot more sense to have even tinier cpu cores by removing the floating point units (which cost a LOT of silicon), adding tons of cache and a beefy igpu, and moving those calculations there. It would have been far more performant. It would allow the cpu cores to stop bothering with things they are not best at, and let the igpu do what it is best suited for... But this failed to evolve because idiots thought Bulldozer was a failure just because video games still relied on single and dual cores and, as we all know, gaming is the most important thing in computing.... Even today intel sells a ton of cpus because it has slightly better per-core performance and this matters for gaming. People are cretins. Now all AMD is doing is copying Intel's design but selling it at a far lower profit margin.... Yay.
                Wouldn't have worked. Again, a GPU is not a low-latency device. It isn't (and cannot be) as tightly coupled as the FPU, because the GPU *does* far more than the FPU and cannot be integrated into the CPU in the same fashion. It has its own cache, its own internal buses, its own data paths. GPUs don't implement x87 or SIMD instruction sets, so you're basically arguing that AMD should have either 1) found a way to build a translation layer between GPU and FPU/SIMD, or 2) invented a new standard. The first wouldn't be performant -- we're now emulating a different instruction set on a much slower card. The second is unrealistic. Fusion didn't get adopted as it was.

                I was at AMD's 2013 HSA event, where Sun executives pledged that Java 8 would be fully HSA-aware and capable. It was not. Bulldozer was a bad architecture. According to Jim Keller (who told me this in-person), when he was hired, AMD had to make a choice between fixing BD or building Zen. The effort to do one was judged to be as difficult as the other. He decided to go for Zen, and the team backed him. Bulldozer didn't get put down because it sucked in gaming. Bulldozer got put down because it was a poor design, period, and AMD felt it wasn't worth the effort of fixing. The decision to build Zen was made about a month after Keller was hired, and they never looked back. Nor should they have. Best decision AMD ever made.

                Comment


                • #58
                  There is a market for everything... e.g. shi*** RGB lighting. So I think Threadripper is more reasonable.

                  Comment


                  • #59
                    Originally posted by klokik View Post

                    So either clang generates significantly better code than gcc, or it's time to switch to arm64. But I'd rather say this benchmark is not that representative of real-world CPU performance.
                    I did try with clang as well, but the code gcc produced was faster or the same.

                    This was originally not a "benchmark". I actually needed to produce prime numbers, so it stems from a real-world application.
                    (And I know this is not the optimized version.)
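
                    For comparison, the "optimized version" would look more like a sieve, which does no division at all in the hot loop -- which is exactly why the trial-division variant is so sensitive to divide latency. The sketch below is mine, not the actual code, and it assumes the same 10,000,000 limit as earlier:

                    /* Sieve-based counter (illustration only, not the actual code).
                       No division in the hot loop, so it would not show the 32- vs
                       64-bit divide gap that the trial-division version exposes. */
                    #include <stdio.h>
                    #include <stdlib.h>

                    int main(void)
                    {
                        const int limit = 10000000;          /* assumed limit */
                        char *composite = calloc(limit, 1);  /* zero-initialized */
                        long count = 0;

                        for (int i = 2; i < limit; i++) {
                            if (composite[i])
                                continue;
                            count++;                         /* i is prime */
                            for (long long j = (long long)i * i; j < limit; j += i)
                                composite[j] = 1;
                        }
                        printf("%ld\n", count);
                        free(composite);
                        return 0;
                    }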
                    Last edited by Raka555; 02-11-2020, 01:38 AM.

                    Comment


                    • #60
                      Originally posted by lilunxm12 View Post


                      You just shouldn't do that on a 32-bit system, as a single register can't hold the operand.
                      I hear you, but I find it a bit hard to believe that they ran ARM processors for decades without being able to get the most out of 32-bit integer math.
                      I would rather put my money on "modern compilers" not generating optimal code for 32-bit ARM.

                      Comment
