AMD Ryzen 7 5800X3D On Linux: Not For Gaming, But Very Exciting For Other Workloads

  • #31
    Originally posted by nicalandia View Post

    LZ4 Does not appear to take advantage of the 3D V-Cache

    LZ4 is very cache friendly. It reads through its input buffer and copies that to the output buffer, occasionally reading back a little way in the output buffer and copying that to the current end of the output. This makes it very tolerant of cache eviction, and it's very easy for simple fetch predictors to keep up with the way LZ4 accesses memory. Other compression programs use methods with much larger cache footprints, but LZ4 is not one of them. It's suitable for microcontrollers: if you have enough memory for the input buffer and the output buffer, you can do LZ4 without extra storage. You can't do that with BZ2 or other transform-based systems.
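A toy sketch of the LZ77-style match copy described above (hypothetical names, not LZ4's actual code): the decompressor copies bytes from a short distance back in the output buffer to its current end, byte by byte so an overlapping match repeats a pattern. Only the tail of the output buffer is ever touched, which is why the access pattern is so cache friendly.

```c
#include <stddef.h>

/* Copy `len` bytes starting `dist` bytes behind the current end of the
 * output buffer to the end. Byte-by-byte so an overlapping match
 * (dist < len) repeats the pattern, as LZ77-family formats require. */
static void copy_match(unsigned char *out, size_t *pos, size_t dist, size_t len)
{
    size_t src = *pos - dist;
    for (size_t i = 0; i < len; i++)
        out[*pos + i] = out[src + i];
    *pos += len;
}
```

For example, with "abcabc" already decoded, a match of distance 3 and length 6 appends "abcabc" again.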

    Originally posted by Raka555
    Being a bit pedantic here, but the apps don't "take advantage" of a larger cache.
    It is more like bloated apps that require larger caches.

    If LZ4 were well written, then you wouldn't see much of a boost.
    You're speaking in the context of compression programs here, and you are dead wrong. Saying that a compression program is using too much memory is like saying that an algorithm is lazy because it didn't find a way to solve the Traveling Salesman problem in polynomial time. I don't want to quote the Cat in the Hat, but he's right.

    To address your overly pedantic opinion, a program is well written if it takes advantage of the hardware on which it runs. LZ4 doesn't happen to be able to do that, as it's already fully satisfied by even a basic processor, but that's because it's designed for a little processor. Honestly, it's a bit silly to use it as a benchmark in this way. It's about as good a benchmark as 'grep "cpu MHz" /proc/cpuinfo'.

    Comment


    • #32
      I would be interested to see how it compares to the 5900X on the professional workloads, given they are the same price.

      Comment


      • #33
        That's the benchmark I was looking for. Especially OpenFOAM and other HPC workloads are benefiting a lot.

        Comment


        • #34
          Originally posted by skeevy420 View Post

          Zstd as well. Like Michael points out in the article, that probably had good benefits for file systems using Zstd for compression. I wonder if LZ4, XZ, and other codecs get performance improvements as well.
          Likely depends on the dictionary size and whether it fits in cache. More cache makes a dictionary more efficient. Zstd can have a very large dictionary.

          LZ4 has a sliding window for compression (see LZ77), so a tiny dictionary of sorts, and it'll likely be slower on the 3D variant.

          XZ can have a big dictionary. It would likely benefit from the increased cache allowing fast operation with bigger dictionaries.

          Comment


          • #35
            Originally posted by domih View Post
            Thanks for this article as well as the one on Milan X!

            I'm not a gamer so I'm mostly interested in the possible performance increase brought by 3D V-cache in development related tools and servers. What about (beyond ML/DL):
            - JSON parsing (in various languages),
            - XML parsing and other XML operations (in various languages),
            - MySQL, PostgreSQL,
            - Cassandra,
            - Large Python or PHP list and dict handling,
            - JIT compilation in Java, Python, PHP,
            - Crypto (AES, RSA),
            - JavaScript,
            - Web servers (Apache, Nginx).
            For JSON and XML, it may help. Each typically builds up a DOM in memory, so a file of many megabytes may get a boost. That would be a kind of niche size, though.

            For MySQL, it may help with filesorts, though you'll need to increase sort_buffer_size to take advantage of it (and if you're doing filesorts where that matters your schema/indexes/queries really need a rethink).

            For Cassandra, I can't see it helping much. Usually the heap in a Java app is much bigger than 96 MB. When I run Cassandra in production, I generally set newgen to several gigabytes. Compactions spew new objects that end up polluting oldgen if newgen is too small.

            PHP is unlikely to benefit. In all my own benchmarking, PHP cares mostly about memory latency. It suffers from the higher cache latency in Epyc (even Zen 3) compared to Xeon. Ditto memcache. It wouldn't surprise me if Python behaves differently, but I don't have any highly used Python running.

            Java apps can be a mixed bag. Some care about cache latency, like API servers. Others don't, like Kafka. I could see the 3D variant being useful in some situations.

            For PHP, Python, Java, and JavaScript, a lot of what the programs do often results in pointer-chasing. If your working set fits in the 96 MB cache and not in the 32 MB cache, that can make a big difference. The JIT process doesn't get run often.
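A hypothetical microbenchmark sketch of that pointer-chasing pattern (names and sizes illustrative, not from the article): each 64-byte node points to the next in shuffled order, so the hardware prefetcher can't help and every hop costs a cache or DRAM access. Sizing the pool between 32 MiB and 96 MiB is where the extra L3 would show up.

```c
#include <stdlib.h>

/* A 64-byte node: one pointer plus padding to a full cache line. */
typedef struct node { struct node *next; long pad[7]; } node;

/* Link the pool's nodes in a random order (Fisher-Yates shuffle) so
 * traversal defeats the prefetcher; return the head of the chain. */
static node *build_chain(node *pool, size_t n)
{
    size_t *order = malloc(n * sizeof *order);
    for (size_t i = 0; i < n; i++) order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < n; i++)
        pool[order[i]].next = &pool[order[i + 1]];
    pool[order[n - 1]].next = NULL;
    node *head = &pool[order[0]];
    free(order);
    return head;
}

/* Walk the chain; the latency per hop is what the cache size changes. */
static size_t chase(const node *head)
{
    size_t hops = 0;
    for (const node *p = head; p; p = p->next) hops++;
    return hops;
}
```

Timing `chase()` over pools of increasing size is the classic way to plot the cache-size cliffs.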

            Crypto is all frequency oriented, as keys fit in the L1 cache. The 3D cache won't help here.

            Nobody is using Apache where performance matters. Context switches galore. The extra cache might help keep all the processes' memory in L3, maybe.

            It could be a boost for Nginx. The extra cache could keep more static files in L3 (great for mapped IO) and could also be used for TCP buffers if Nginx is producing traffic... but on a highly loaded server with tens of thousands of connections, TCP buffers consume hundreds of MB of memory even if default sizes are small. Though Nginx would have no trouble saturating a 10 Gb connection with a 5800X3D well before the cache would help. You'd probably have to go 100 Gb before it would matter: but then the number of connections would need too much buffer memory anyway.

            I'd be interested to see how it impacts over-subscribed environments. I bet you could run more VMs/containers/processes comfortably on a 5800X3D compared to the 5800X. Maybe on a busy dev box with a replicated stack it would give better performance when working from a bigger L3. But if you're running that many processes, a CPU with more cores is probably a better idea. On a dev box, I'd probably spend the money on a better NVMe drive first.

            Comment


            • #36
              Originally posted by Michael View Post

              Click the openbenchmarking.org link on the last page of the article, some of those are covered. Others are coming.

              Also, navigating from this page will yield in-progress metrics for other tests - https://openbenchmarking.org/s/AMD+R...5800X3D+8-Core
              Awesome! I've been eagerly anticipating these benchmarks, and look forward to the 5950X comparison as well! Theory is good, but benchmarks are better!

              Comment


              • #37
                Michael

                Why haven't you enabled ReBAR on this test system?
                The visible VRAM size is capped at just 256 MB.
                Maybe that would have helped in some GPU-limited games...

                Other than that, though, I really appreciate that you enforced the performance governor for these benchmarks.

                Keep it up!

                Comment


                • #38
                  Originally posted by LinuxID10T View Post

                  I think it is just the type of games and applications being used. For games, most Windows games are larger and more complex and need more space in the cache. You only see performance increases where the core logic of a game is too big to fit in a traditional L3 cache and has to be fetched from memory. As for applications, most publications didn't really test any specialized deep learning or HPC applications due to their audience. Linux is just much bigger than Windows in those spaces.
                  Yeah, this list of "games" is very much not representative of what you see in Windows gaming benchmarks.

                  DDraceNetwork? Xonotic? Tesseract? Those aren't going to show any benefit no matter what OS you're on.

                  It would be interesting to see someone test actual Windows AAA games through Proton to see how they perform. I'm not saying they will definitely show any benefit, but it's certainly possible they'll show a large one, and the Deus Ex results make me think it's likely they would. And yes, I know why Michael doesn't do that. It doesn't change the fact that the results would be far more interesting.
                  Last edited by smitty3268; 25 April 2022, 08:45 PM.

                  Comment


                  • #39
                    Originally posted by willmore View Post
                    You're speaking in the context of compression programs here, and you are dead wrong. Saying that a compression program is using too much memory is like saying that an algorithm is lazy because it didn't find a way to solve the Traveling Salesman problem in polynomial time. I don't want to quote the Cat in the Hat, but he's right.

                    To address your overly pedantic opinion, a program is well written if it takes advantage of the hardware on which it runs. LZ4 doesn't happen to be able to do that, as it's already fully satisfied by even a basic processor, but that's because it's designed for a little processor. Honestly, it's a bit silly to use it as a benchmark in this way. It's about as good a benchmark as 'grep "cpu MHz" /proc/cpuinfo'.
                    On top of that, I wouldn't call HPC software, which is clearly benefiting from larger L3 caches, bloatware. On the contrary, the main loop of some of these programs is as simple as it gets: looping over an array and doing FMAs.
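For illustration, a minimal sketch of that kind of hot loop (a SAXPY; illustrative, not taken from any specific HPC code). The code is trivial; it's the data that wants the cache: with n large enough that the vectors spill past a 32 MiB L3 but still fit in 96 MiB, the loop goes from memory-bound to L3-resident.

```c
#include <stddef.h>

/* y[i] = a*x[i] + y[i] -- compilers typically contract this into a
 * fused multiply-add instruction at -O2 on FMA-capable targets. */
static void saxpy(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

A loop like this streams linearly, so the prefetcher hides latency until the working set outgrows the last-level cache, which is exactly where the benchmark cliffs appear.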

                    Comment


                    • #40
                      Originally posted by willmore View Post
                      LZ4 is very cache friendly. It reads through its input buffer and copies that to the output buffer and occasionally reads back a little bit in the output buffer and copies that to the current end of the output buffer. This makes it very tolerant of cache eviction, etc. It's very easy for simple fetch predictors to keep up with the way LZ4 accesses memory. Other compression programs use different methods with much larger cache footprints, but LZ4 is not one of them. It's suitable for microcontrollers, etc. If you have enough memory for the input buffer and the output buffer, you can do LZ4 without extra storage. You can't do that with BZ2 or other transform based systems. You're speaking in the context of compression programs here and you are dead wrong. Saying that a compression program is using too much memory is like saying that an algorithm is lazy because it didn't find a way to solve the Traveling Salesman problem in polynomial time. I don't want to quote the Cat in the Hat, but he's right. To address your overly pedantic opinion, a program is well written if it takes advantage of the hardware on which it runs. LZ4 doesn't happen to be able to do that as it's already fully satisfied by even a basic processor, but that's because it's designed for a little processor. Honestly, it's a bit silly to be using it as a benchmark in this way. It's about as good a benchmark as 'grep "cpu MHz" /proc/cpuinfo'
                      Could also be that this is simply maxing out LZ4 performance on the CPU: the latency of compression/decompression at this CPU's speed is right at the threshold where more cache doesn't help, i.e. the prefetch is as fast as or faster than the algorithm.

                      Comment
