Mesa Turns To BLAKE3 For Faster Vulkan Shader Hashing


  • #21
    Originally posted by markg85 View Post
    As for benchmarks: I don't have fancy graphs or numbers, as this is from memory and from about a year ago. On the Raspberry Pi (I tried a Pi 4), BLAKE3 was stupidly fast for calculating a checksum of a file, like ~3x faster. You can try that yourself with the "b3sum" tool.

    On desktop hardware the difference wasn't that extreme, but still about 1.5x.

    Which, in my book, means it's just a vastly superior hashing algorithm compared to SHA (I used sha256sum). If you want benchmarks, run them yourself. You likely already have sha256sum on your PC; just install b3sum and compare to your heart's content 😉
    sha256sum may not be a representative SHA-256 benchmark; the SHA-256 implementation in coreutils' sha256sum is substantially slower than the OpenSSL SHA-256 implementation:

    Code:
    > time dd if=/dev/zero bs=1M count=500 status=none | sha256sum
    a08a92258f621b55d08ad1e84c90c2ea6286fc6b6c9a4dfa7156afb16c190170  -
    
    real    0m1.903s
    user    0m1.844s
    sys    0m0.162s
    > time dd if=/dev/zero bs=1M count=500 status=none | openssl sha256
    SHA2-256(stdin)= a08a92258f621b55d08ad1e84c90c2ea6286fc6b6c9a4dfa7156afb16c190170
    
    real    0m0.382s
    user    0m0.296s
    sys    0m0.196s
    > sha256sum --version | head -1
    sha256sum (GNU coreutils) 9.1
    > openssl version
    OpenSSL 3.0.9 30 May 2023 (Library: OpenSSL 3.0.9 30 May 2023)
    (That said, I'm generally a big fan of BLAKE2 and BLAKE3)
    Last edited by pabs; 23 June 2023, 05:51 PM. Reason: add clarifying note about blake2

    Comment


    • #22
      Wonder if Git would/could ever drop SHA1?

      Comment


      • #23
        I'd been maintaining b3sum on the AUR before it moved to the official repos; happy to see modern hashing get more adoption.

        Comment


        • #24
          Originally posted by AndyChow View Post

          You have a benchmark on that? I've heard the same for Blake2b, but in my direct test with AES-NI it absolutely wasn't the case.

          Blake3 can be faster than any other current cryptographic hash by using multiple CPU threads because, unlike most older hashes, it allows the input to be divided into parts that are hashed independently.

          When Blake3 is restricted to a single CPU thread, on modern CPUs (i.e. not on Skylake derivatives or on the Raspberry Pi, but on any Zen, on Ice Lake or newer, on Apollo Lake or newer Intel Atom CPUs, and on most 64-bit ARM CPUs) it is slower than optimized implementations of SHA-1 or SHA-256, which use the hardware instructions.

          Blake3 works great on mostly idle computers, but it may fail to accelerate applications that already have useful work to do on the CPU threads that computing Blake3 occupies.
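
          A quick way to see how much of that speed comes from the extra threads is to pin b3sum to a single thread and compare (a rough sketch only; it assumes your b3sum build has the --num-threads flag, and the file path is just an example):

          Code:
          # create a 1 GiB test file (path and size are just examples)
          > dd if=/dev/urandom of=/tmp/testfile bs=1M count=1024 status=none
          
          # BLAKE3 restricted to a single thread
          > time b3sum --num-threads 1 /tmp/testfile
          
          # BLAKE3 using all available threads (the default)
          > time b3sum /tmp/testfile
          
          # single-threaded SHA-256 via OpenSSL for comparison
          > time openssl sha256 /tmp/testfile

          On an idle many-core machine, the gap between the two b3sum runs shows how much of the headline number is parallelism rather than per-core speed.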

          Last edited by AdrianBc; 24 June 2023, 07:55 AM.

          Comment


          • #25
            Originally posted by pabs View Post

            sha256sum may not be a representative SHA-256 benchmark; the SHA-256 implementation in coreutils' sha256sum is substantially slower than the OpenSSL SHA-256 implementation:

            ... (snipped) ...
            (That said, I'm generally a big fan of BLAKE2 and BLAKE3)
            Interesting! I didn't know that.
            I tried it locally but am getting somewhat different results.
            Do note that I'm clearing caches in between, as this data tends to stay in memory if it fits.

            Code:
            ❯ su -c "echo 3 >'/proc/sys/vm/drop_caches' && swapoff -a && swapon -a && printf '\n%s\n' 'Ram-cache and Swap Cleared'"
            Password:
            
            Ram-cache and Swap Cleared
            ❯ time cat 10gb_random_data | openssl sha256
            SHA2-256(stdin)= ac2a1f6ff9bf5a507a9ce239b390d765e633419defb2350fa75a4f5282957d64
            cat 10gb_random_data  0.02s user 3.08s system 34% cpu 9.067 total
            openssl sha256  4.54s user 0.67s system 57% cpu 9.067 total
            ❯ su -c "echo 3 >'/proc/sys/vm/drop_caches' && swapoff -a && swapon -a && printf '\n%s\n' 'Ram-cache and Swap Cleared'"
            Password:
            
            Ram-cache and Swap Cleared
            ❯ time sha256sum 10gb_random_data
            ac2a1f6ff9bf5a507a9ce239b390d765e633419defb2350fa75a4f5282957d64  10gb_random_data
            sha256sum 10gb_random_data  4.14s user 2.00s system 70% cpu 8.744 total
            ❯ su -c "echo 3 >'/proc/sys/vm/drop_caches' && swapoff -a && swapon -a && printf '\n%s\n' 'Ram-cache and Swap Cleared'"
            Password:
            
            Ram-cache and Swap Cleared
            ❯ time b3sum 10gb_random_data
            0b7dbbf80444deb0877e3d25d2251754e0e7fdb7325801a31fc3a13a39ca2458  10gb_random_data
            b3sum 10gb_random_data  2.53s user 2.91s system 340% cpu 1.601 total
            So in my case - on an AMD Ryzen 9 5950X - b3sum is about 5.6x faster than either sha256sum or openssl sha256. The CPU has the sha_ni instruction set, though I don't know if OpenSSL is using it. Regardless, it's not going to get any faster than loading 10 GB in 1.6 seconds, as I'm guessing much of that time is spent loading the data from the NVMe drive. That is confirmed by running b3sum again (now that the data is in memory): it completes in 0.4 seconds, meaning b3sum only needs about 0.4 seconds to checksum 10 GB. sha256sum, while faster on subsequent runs, still needs ~5.3 seconds every time. Comparing those two - keeping I/O out of the equation and looking only at checksum throughput - b3sum (0.4 seconds) is about 10.7x faster than sha256sum (~5.3 seconds).

            I'm happy to have rerun my tests, as I thought SHA was only marginally slower on my current hardware. It turns out BLAKE3 does even better on this hardware than on what I had before. The core count probably helps here, but it yet again shows that SHA has had its day and we should consider moving to BLAKE3 for hashing.

            Comment


            • #26
              Originally posted by markg85 View Post

              Interesting! I didn't know that.
              I tried it locally but am getting somewhat different results.
              Do note that I'm clearing caches in between, as this data tends to stay in memory if it fits.

              ... (snipped) ...
              So in my case - on an AMD Ryzen 9 5950X - b3sum is about 5.6x faster than either sha256sum or openssl sha256. The CPU has the sha_ni instruction set, though I don't know if OpenSSL is using it. Regardless, it's not going to get any faster than loading 10 GB in 1.6 seconds, as I'm guessing much of that time is spent loading the data from the NVMe drive. That is confirmed by running b3sum again (now that the data is in memory): it completes in 0.4 seconds, meaning b3sum only needs about 0.4 seconds to checksum 10 GB. sha256sum, while faster on subsequent runs, still needs ~5.3 seconds every time. Comparing those two - keeping I/O out of the equation and looking only at checksum throughput - b3sum (0.4 seconds) is about 10.7x faster than sha256sum (~5.3 seconds).

              I'm happy to have rerun my tests, as I thought SHA was only marginally slower on my current hardware. It turns out BLAKE3 does even better on this hardware than on what I had before. The core count probably helps here, but it yet again shows that SHA has had its day and we should consider moving to BLAKE3 for hashing.



              Blake3 achieves its speed by spawning as many threads as the hardware provides, so it is normal for it to be much faster on a 5950X than on anything else, as long as the computer is idle. On a busy computer, e.g. a server, Blake3 may reduce throughput, because it actually does more work than SHA-256. The extra speed on a mostly idle computer, like a desktop or laptop, comes from using all the cores, while SHA-256 uses a single thread.


              The right way to test openssl is to invoke it the same way you invoke sha256sum, not by piping through cat:

              Code:
              time openssl dgst -r -sha256 10gb_random_data


              Also, on a slower 5900X I get double the speed for openssl and sha256sum and quadruple the speed for b3sum, thanks to much higher CPU utilization, so something is different in the configuration of your computer. Perhaps you have a slower SSD, so your results only partly reflect the difference in speed between the algorithms. Instead of clearing the cache, it may be better to run each time command several times, so that 10gb_random_data is read from the file cache, and then look at the last results, which differentiate the algorithms better, without the SSD's influence.
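
              For example, something like this (just a sketch; it assumes the 10 GB file fits in RAM, so the later runs are served from the page cache):

              Code:
              # the first run warms the page cache; look at the later runs for pure hashing speed
              > for i in 1 2 3; do time openssl dgst -r -sha256 10gb_random_data; done
              > for i in 1 2 3; do time sha256sum 10gb_random_data; done
              > for i in 1 2 3; do time b3sum 10gb_random_data; done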


              It seems that after many years coreutils has finally added support for the SHA instructions.

              Now, with coreutils 9.3, I see the same speed for openssl and sha256sum, while one year ago openssl was twice as fast as sha256sum.


              Last edited by AdrianBc; 24 June 2023, 03:11 PM.

              Comment


              • #27
                > When Blake3 is restricted to a single CPU thread, on modern CPUs (i.e. not on Skylake derivatives or on the Raspberry Pi, but on any Zen, on Ice Lake or newer, on Apollo Lake or newer Intel Atom CPUs, and on most 64-bit ARM CPUs) it is slower than optimized implementations of SHA-1 or SHA-256, which use the hardware instructions.

                It's more complicated than this, and it really depends on the architecture.

                On Ice Lake and Tiger Lake with both SHA-NI and AVX-512, single-threaded BLAKE3 is 4.3x faster than SHA-256 for long inputs: https://bench.cr.yp.to/results-hash.html#amd64-panther

                On more recent Alder Lake CPUs with SHA-NI but only AVX2, single-threaded BLAKE3 is 2.2x faster: https://bench.cr.yp.to/results-hash....626960,5600000

                ARM NEON is half as wide as AVX2, and BLAKE3 tends to be slower than hardware-accelerated SHA-256 on ARM, unless it uses multiple threads. Microarchitectural details also matter a lot, e.g. https://github.com/BLAKE3-team/BLAKE...ent-1595705202. All the SVE implementations I'm aware of are the same width as NEON, and BLAKE3 doesn't have any SVE code yet, but if wider SVE implementations come out in the future things could get more interesting.
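
                Since the answer depends so heavily on which extensions a given CPU exposes, a quick way to check what you have on Linux is something like this (x86 flag names; on ARM look for sha2 and asimd in the Features line instead):

                Code:
                # list the relevant x86 feature flags present on this machine
                > grep -o -E 'sha_ni|avx2|avx512f' /proc/cpuinfo | sort -u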

                Comment


                • #28
                  Originally posted by markg85 View Post
                  Interesting! I didn't know that.
                  I tried it locally but am getting somewhat different results.
                  Do note that I'm clearing caches in between, as this data tends to stay in memory if it fits.
                  I investigated, and it looks like I was half-wrong; the coreutils utilities (sha1sum, sha256sum, etc.) use the cryptographic hash implementations from gnulib. Depending on the build flags, gnulib will either use its own (slower) internal implementation or the (faster) OpenSSL implementation of a given cryptographic hash.

                  In lib/sha256.c, for example, if HAVE_OPENSSL_SHA256 is defined then gnulib will use the OpenSSL SHA-256 implementation; otherwise it falls back to the slower internal SHA-256 implementation defined in that same file.

                  The same thing is true for MD5, SHA-1, and the other supported SHA-2 hash functions (SHA-224, SHA-384, and SHA-512); you can see the preprocessor remapping in lib/gl_openssl.h.

                  So I suspect the reason "sha256sum" is slower than "openssl sha256" for me and approximately the same speed for you is because the "sha256sum" on my system (Debian Bookworm, x86-64) was built using the slower gnulib SHA-256 implementation and the "sha256sum" on your system is using the faster OpenSSL SHA-256 implementation.

                  I don't know what distribution you're using, but I can see that the Arch coreutils package links against OpenSSL and the Debian Bookworm coreutils package does not.
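
                  If anyone wants the faster behaviour from coreutils itself, recent releases have a configure switch for it; roughly like this (going from memory, so check ./configure --help on your version first):

                  Code:
                  # build coreutils with the OpenSSL-backed hash implementations
                  > ./configure --with-openssl=yes
                  > make
                  # the resulting binary should now link against libcrypto
                  > ldd src/sha256sum | grep -i libcrypto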

                  Originally posted by markg85 View Post
                  ... (snipped) ...

                  So in my case - on an AMD Ryzen 9 5950X - b3sum is about 5.6x faster than either sha256sum or openssl sha256. The CPU has the sha_ni instruction set, though I don't know if OpenSSL is using it. Regardless, it's not going to get any faster than loading 10 GB in 1.6 seconds, as I'm guessing much of that time is spent loading the data from the NVMe drive. That is confirmed by running b3sum again (now that the data is in memory): it completes in 0.4 seconds, meaning b3sum only needs about 0.4 seconds to checksum 10 GB. sha256sum, while faster on subsequent runs, still needs ~5.3 seconds every time. Comparing those two - keeping I/O out of the equation and looking only at checksum throughput - b3sum (0.4 seconds) is about 10.7x faster than sha256sum (~5.3 seconds).

                  I'm happy to have rerun my tests, as I thought SHA was only marginally slower on my current hardware. It turns out BLAKE3 does even better on this hardware than on what I had before. The core count probably helps here, but it yet again shows that SHA has had its day and we should consider moving to BLAKE3 for hashing.
                  One quick and dirty way to check if your version of OpenSSL is using the Intel SHA extensions is to disassemble libcrypto.so and see if it's using the SHA-256 instructions, like so:

                  Code:
                  # check for sha256 instructions in libcrypto.so
                  > objdump -dMintel /usr/lib/x86_64-linux-gnu/libcrypto.so.3 | grep sha256msg1|wc -l
                  48
                  
                  # show that my sha256sum is *not* dynamically linked against openssl
                  > ldd /usr/bin/sha256sum
                      linux-vdso.so.1 (0x00007fff8dbc7000)
                      libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fe294ea9000)
                      /lib64/ld-linux-x86-64.so.2 (0x00007fe2950bd000)
                  
                  # check for sha256 instructions in my sha256sum
                  > objdump -dMintel /usr/bin/sha256sum |grep sha256msg1|wc -l
                  0
                  (For what it's worth, my guess is that your libcrypto.so is built with the Intel SHA extensions).

                  Regarding adoption of BLAKE3: I'm a fan of BLAKE3 and would like to see more adoption.
                  Last edited by pabs; 24 June 2023, 03:59 PM. Reason: fix small typos

                  Comment


                  • #29
                    Thank you for your reply @AdrianBc!

                    I just tried with "time openssl dgst -r -sha256 10gb_random_data" and, if anything, it's actually slower.

                    Whether to clear caches really depends on how you want to test things.
                    Readers here have a tendency to run a benchmark with X, then with Y, without clearing caches in between; then result Y "looks" much faster because X did the heavy work of pulling the data into memory. I just wanted to make a fair comparison.

                    But I also included numbers for when the data is already in memory (5.3 sec for SHA-256, 0.4 sec for BLAKE3).

                    I don't get your statement "...on a slower 5900X I get double the speed for openssl and sha256sum..." combined with "...Now, with coreutils 9.3, I see the same speed for openssl and sha256sum...". It's one or the other... Anyhow, the BLAKE3 speeds are fine on my end and are approaching the theoretical limits I can get.

                    Comment


                    • #30


                      My understanding is that BLAKE3 is faster than SHA-1 even when it's limited to a single thread and SHA-1 is using hardware acceleration. I believe the link above is testing those conditions.

                      The top machine listed is Alder Lake - older CPUs will presumably show a much larger difference, but even on Alder Lake it's almost twice as fast on >4K sizes. It gets slower on small inputs, which means it may make less sense for something like a hash table. But shader text is probably going to tend to be larger. The Zen 3 results there look similar.

                      I've heard BLAKE3 can take better advantage of AVX-512 for the same reason it's more multi-threading friendly, which helps it on the newer machines that also tend to have the SHA-1 hardware acceleration. If you are on a CPU that limits its clock speed when AVX instructions are used, that may be a problem that takes away some of BLAKE3's advantages.

                      That said, I know the devs were also interested in increasing the hash length. SHA-1 is only 160-bit, while BLAKE3 is 256-bit. If you plan on changing the hash size anyway, there's no particular reason to stick with one of the SHA variants; you might as well go for whatever works best.
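
                      The length difference is easy to see directly, and b3sum can emit even longer digests via its --length option since BLAKE3 is an extendable-output function (worth double-checking that flag on your version):

                      Code:
                      # SHA-1 produces a 160-bit digest (40 hex characters)
                      > printf '' | sha1sum
                      da39a3ee5e6b4b0d3255bfef95601890afd80709  -
                      
                      # BLAKE3 produces a 256-bit digest (64 hex characters) by default,
                      # and can be extended, e.g. to 64 bytes (512 bits)
                      > printf '' | b3sum --length 64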

                      There was also a recent merge request to start using an assembly-optimized SHA-1 implementation on ARM rather than the C version Mesa currently uses, and it was rejected because nobody wanted to maintain that ARM assembly code in the Mesa project. They want to use hash code that is heavily used (and therefore supported) elsewhere, so it's not a burden on their project. There are lots of ways that could have been resolved, obviously, but this will be a major win for current ARM support at the very least.
                      Last edited by smitty3268; 24 June 2023, 05:58 PM.

                      Comment
