> to date it's been just implemented in Rust for the multi-threaded version and a reference C implementation
Also note that the reference implementation is written in Rust, not C. The official repo contains the reference implementation in Rust, an optimized implementation in Rust (the `blake3` crate), and an optimized implementation in C.
Regarding some benchmarks linked earlier in this thread:
> Blake3 is just few bit faster than sha256 here
When I create a 1 GB file `f` on my laptop and run `time b3sum f`, I get 119ms (best of 10 runs in a loop). That's because `b3sum` memory maps the whole file and splits the work across all 4 cores. However, if I run `time b3sum < f` instead, I get 488ms, about 4x slower. That's because when reading from stdin, `b3sum` can't memory map the file, and in the current implementation only one thread gets used to hash it. I assume the reported 461ms figure comes from this method, which is to say, it's measuring single-threaded BLAKE3 on a CPU that's somewhat faster than mine. (Probably a bit faster than it appears here, if the original figure wasn't a best-of-10 measurement.)
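For anyone who wants to repeat the comparison, here is a minimal sketch of the best-of-10 measurement in Python. The helper name `best_of` is hypothetical, and it assumes `b3sum` is on your `PATH` and that a file `f` exists. It measures wall-clock time including process startup, so it is closer to the `real` figure from `time` than to pure hashing time.

```python
import subprocess
import time


def best_of(cmd, runs=10, stdin_path=None):
    """Return the fastest wall-clock time (seconds) over `runs` executions.

    If stdin_path is given, the file is opened and passed as the child's
    stdin, which mimics `cmd < file` in a shell.
    """
    best = float("inf")
    for _ in range(runs):
        stdin = open(stdin_path, "rb") if stdin_path else None
        try:
            start = time.perf_counter()
            subprocess.run(cmd, stdin=stdin, stdout=subprocess.DEVNULL, check=True)
            elapsed = time.perf_counter() - start
        finally:
            if stdin is not None:
                stdin.close()
        best = min(best, elapsed)
    return best


# Example usage (assumes `b3sum` is installed and `f` exists):
#   mmap_time = best_of(["b3sum", "f"])               # file path: memory mapped
#   stdin_time = best_of(["b3sum"], stdin_path="f")   # stdin: single-threaded
#   print(f"path: {mmap_time:.3f}s  stdin: {stdin_time:.3f}s")
```

Taking the minimum rather than the mean filters out runs that were slowed down by cache misses or other processes, which is why a single-run figure can look somewhat worse than a best-of-10 one.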
Now, when I run `time sha256sum f` on my laptop, it takes 2.502 seconds. This is much slower than the reported figure from the Ryzen 5 3400G. I think the reason for this difference is that the Ryzen 5 3400G supports SHA extensions, which provide hardware acceleration for SHA-256, and `sha256sum` is taking advantage of that. My CPU doesn't support SHA extensions, so I'm measuring performance in software.
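A quick way to check whether a machine has the SHA extensions is to look for the `sha_ni` flag that Linux reports in `/proc/cpuinfo`. This is a rough sketch that only applies to x86 Linux; the function name `has_sha_extensions` is made up for illustration. Note that the flag only tells you the CPU supports the instructions, not whether a particular `sha256sum` build actually uses them.

```python
def has_sha_extensions(cpuinfo_text):
    """Return True if the first `flags` line in cpuinfo text lists `sha_ni`."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Flags are space-separated; match the whole token, not a substring.
            return "sha_ni" in value.split()
    return False


# Example usage on Linux:
#   with open("/proc/cpuinfo") as f:
#       print(has_sha_extensions(f.read()))
```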
If all that's correct, I'd interpret these figures to mean that on the Ryzen 5 3400G, single-threaded BLAKE3 is slightly faster than hardware-accelerated SHA-256.