
BLAKE3 Cryptographic Hashing Function Sees Experimental Vulkan Implementation


  • #11
    Originally posted by HadrienG
    Lack of a CPU implementation and poor support for CUDA-style single-source programming mean that if the bulk of your compute infrastructure is CPU-based and you have a few new GPU nodes, you need to write and maintain what is basically a copy of your codebase's core logic just to run on those occasional GPU nodes. To say that people are not thrilled about this would be an understatement.
    But there is a CPU implementation?

    Phoronix: Mesa "Vallium" - Software/CPU-Based Vulkan Based On LLVMpipe While there has been the CPU-based "Kazan" Vulkan driver (formerly Vulkan-CPU as a Google Summer of Code project) and Google's SwiftShader has been implementing CPU-based Vulkan support, it turns out Red Hat's David Airlie has been


    • #12
      Originally posted by Qaridarium
      But there is a CPU implementation?
      As far as I know, neither Kazan nor Vallium is currently in a shape where you'd actually want to use it in production.


      • #13
        Originally posted by Steffo
        Look at the github page. Blake 3 is also in a single threaded CPU case much faster than other algorithms.
        Ah, you meant using BLAKE3 instead of SHA-1. I thought you meant using this GPU offload. What I said applied to the GPU offload.


        • #14
          Originally posted by alpha_one_x86
          https://catchchallenger.first-world....onception#Hash Blake3 is just few bit faster than sha256 here and vulkan on server is not common...
          I don't think it was meant for servers. It could be very useful for some client devices (e.g. low-end computers, phones, etc.).


          • #15
            Hello! BLAKE3 author here. A comment on the original article:

            to date it's been just implemented in Rust for the multi-threaded version and a reference C implementation
            The official repo only contains Rust and C implementations, but BLAKE3 has been implemented by other people in several other languages already. We link to 3rd party implementations in the @BLAKE3-team Twitter feed. There's a Go implementation that's almost as fast as C/Rust, and there are bindings for Python and WASM.

            Also note that the reference implementation is written in Rust, not C. The official repo contains the reference implementation in Rust, an optimized implementation in Rust (the `blake3` crate), and an optimized implementation in C.

            Regarding some benchmarks linked to earlier in this thread:

            Blake3 is just few bit faster than sha256 here
            Those figures report SHA-256 (presumably `sha256sum`, which usually links against OpenSSL) hashing 1 GB in 587ms, and `b3sum` doing the same in 461ms, on a Ryzen 5 3400G CPU. I want to give some context for those figures, based on my best guesses about how they were measured. For comparison, the i5-8250U CPU on my laptop has 4 physical / 8 logical cores, the same as the Ryzen 5 3400G. Also very importantly, both CPUs support the AVX2 instruction set, so they dispatch to the same BLAKE3 implementation.

            When I create a 1 GB file `f` on my laptop and run `time b3sum f`, I get 119ms (best of 10 runs in a loop). That's because `b3sum` memory maps the whole file and splits the work across all 4 cores. However, if I run `time b3sum < f` instead, I get 488ms, about 4x slower. That's because when reading from stdin, `b3sum` can't memory map the file, and in the current implementation only one thread gets used to hash it. I assume the reported 461ms figure comes from this method, which is to say, it's measuring single-threaded BLAKE3 on a CPU that's somewhat faster than mine. (Probably a bit faster than it appears here, if the original figure wasn't a best-of-10 measurement.)
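
A rough sketch of the two read paths described above, for readers who want to see the difference in code. This is not how `b3sum` is implemented: BLAKE3 is not in the Python standard library, so `hashlib.blake2b` stands in for it, and the two-level combine step is a made-up scheme for illustration, not BLAKE3's tree mode. The names `hash_stream`, `hash_mmap_parallel`, and `CHUNK` are hypothetical.

```python
import hashlib
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

CHUNK = 1 << 20  # 1 MiB pieces; an arbitrary size chosen for this sketch


def hash_stream(fileobj):
    """Sequential path: one thread reads and hashes, as with `b3sum < f`."""
    h = hashlib.blake2b()  # stand-in for BLAKE3
    for block in iter(lambda: fileobj.read(CHUNK), b""):
        h.update(block)
    return h.hexdigest()


def hash_mmap_parallel(path, workers=4):
    """Mapped path: the whole file is addressable, so pieces can be hashed
    on several threads and then combined (illustrative two-level scheme)."""
    if os.path.getsize(path) == 0:
        return hashlib.blake2b(b"").hexdigest()  # an empty file cannot be mapped
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:

        def leaf(offset):
            # Slicing copies the piece out of the mapping before hashing it.
            return hashlib.blake2b(mm[offset:offset + CHUNK]).digest()

        with ThreadPoolExecutor(max_workers=workers) as pool:
            leaves = list(pool.map(leaf, range(0, len(mm), CHUNK)))
    # Combine the per-piece digests into a single root digest.
    return hashlib.blake2b(b"".join(leaves)).hexdigest()
```

Unlike real BLAKE3, where the digest is the same regardless of how many threads are used, the two functions in this stand-in return different digests for the same file. The point is only that a pipe must be consumed in order, while a mapped file lets each worker start at its own offset.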

            Now, when I run `time sha256sum f` on my laptop, it takes 2.502 seconds. This is much slower than the reported figure from the Ryzen 5 3400G. I think the reason for this difference is that the Ryzen 5 3400G supports SHA extensions, which provide hardware acceleration for SHA-256, and `sha256sum` is taking advantage of that. My CPU doesn't support SHA extensions, so I'm measuring performance in software.
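
One way to check the SHA extensions guess on a Linux machine is to look for the `sha_ni` flag in `/proc/cpuinfo`. The sketch below assumes Linux and that flag name; the helper names `has_sha_extensions` and `read_cpuinfo` are hypothetical. It says nothing about whether a given `sha256sum` build actually uses the instructions.

```python
def has_sha_extensions(cpuinfo_text):
    """Return True if the first `flags` line in cpuinfo text lists `sha_ni`."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "sha_ni" in value.split()
    return False  # no flags line found (for example, not an x86 Linux system)


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning an empty string if it is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return ""


if __name__ == "__main__":
    print("SHA extensions reported:", has_sha_extensions(read_cpuinfo()))
```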

            If all that's correct, I'd interpret these figures to mean that on the Ryzen 5 3400G, single-threaded BLAKE3 is slightly faster than hardware-accelerated SHA-256.
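
For reference, the timings quoted in this post convert to throughput and ratios as follows. This is plain arithmetic on the figures above, assuming "1 GB" means 10^9 bytes (the post does not say whether it is 10^9 or 2^30); the dictionary keys and function names are made up for this sketch.

```python
GB = 10**9  # assumption: decimal gigabyte

# Wall-clock times in milliseconds for hashing a 1 GB file, as quoted above.
timings_ms = {
    "ryzen_sha256sum": 587,        # reported, Ryzen 5 3400G
    "ryzen_b3sum": 461,            # reported, Ryzen 5 3400G
    "laptop_b3sum_mmap": 119,      # `time b3sum f`, i5-8250U
    "laptop_b3sum_stdin": 488,     # `time b3sum < f`, i5-8250U
    "laptop_sha256sum": 2502,      # `time sha256sum f`, i5-8250U
}


def throughput_gb_per_s(ms, size_bytes=GB):
    """Convert a wall-clock time in milliseconds to GB/s for the given size."""
    return (size_bytes / GB) / (ms / 1000)


def speedup(slow_ms, fast_ms):
    """How many times faster the second timing is than the first."""
    return slow_ms / fast_ms


if __name__ == "__main__":
    for name, ms in timings_ms.items():
        print(f"{name}: {throughput_gb_per_s(ms):.2f} GB/s")
    print(f"Ryzen, b3sum vs sha256sum: {speedup(587, 461):.2f}x")
    print(f"Laptop, mmap vs stdin: {speedup(488, 119):.2f}x")
```

On these numbers, `b3sum` on the Ryzen is about 1.27x faster than `sha256sum`, and the memory-mapped run on the laptop is about 4.1x faster than the stdin run, which matches the "about 4x" stated above.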