Rav1e 0.4 Alpha Released With Much Faster Performance For Rust AV1 Encoding


    Phoronix: Rav1e 0.4 Alpha Released With Much Faster Performance For Rust AV1 Encoding

    After more than half a year of work on this new version, Rav1e 0.4 is on the way, but first the alpha milestone is out today...

  • #2
    It would be really nice if we could write code in some specific way and, as the output, get a huge speedup from vector instructions.
    But I want to be 100% sure that if I wrote the same code in assembler, I wouldn't get any more speedup.

    As it stands, I don't really see a big benefit in having 4-6 implementations:
    1. in C or Rust
    2/3. for Intel/AMD 32/64-bit SSE or AVX
    4. ARM NEON (32-bit/64-bit)
    5. IBM PowerPC
    6. RISC-V

    • #3
      Originally posted by miskol View Post
      As I don't really see a big benefit in having 4-6 implementations...
      How about function multiversioning?
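      In Rust, the hand-rolled form of function multiversioning is runtime CPU feature detection with `std`'s `is_x86_feature_detected!` macro plus `#[target_feature]` (crates like `multiversion` wrap this in a macro). A minimal sketch, assuming a simple sum kernel (the function names here are illustrative, not rav1e's):

```rust
// Hand-rolled function multiversioning: pick an implementation at runtime
// based on detected CPU features, with a portable fallback.
fn sum(xs: &[u32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 availability was just verified at runtime.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs) // portable fallback, compiled for every target
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[u32]) -> u32 {
    // Same body, but compiled with AVX2 enabled so the compiler
    // may auto-vectorize it more aggressively.
    xs.iter().copied().fold(0u32, u32::wrapping_add)
}

fn sum_scalar(xs: &[u32]) -> u32 {
    xs.iter().copied().fold(0u32, u32::wrapping_add)
}

fn main() {
    let data: Vec<u32> = (0..1000).collect();
    println!("sum = {}", sum(&data));
}
```

      The detection cost is paid per call here; real codecs typically resolve the function pointer once at startup and stash it in a dispatch table.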

      • #4
        Originally posted by miskol View Post
        As I don't really see a big benefit in having 4-6 implementations
        I bet no developers are happy with multiple implementations, but the reality is that it's unavoidable: there are old and new OSes (32-bit vs. 64-bit), SIMD instruction sets keep getting richer and more powerful (SSE vs. AVX), and architectures work differently (x86 vs. ARM).
        This still doesn't settle Rust vs. C, but you know, new languages arise for a reason, so someone starts using them.
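        For the "architectures work differently" part, Rust handles the per-target split at compile time with `cfg`, so each build only contains the code path for its own architecture. A toy sketch (module and constant names are illustrative, not rav1e's):

```rust
// Compile-time dispatch: exactly one `simd` module survives
// compilation for any given target architecture.
#[cfg(target_arch = "x86_64")]
mod simd {
    pub const PATH: &str = "x86_64 (SSE/AVX candidates)";
}

#[cfg(target_arch = "aarch64")]
mod simd {
    pub const PATH: &str = "aarch64 (NEON candidates)";
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
mod simd {
    pub const PATH: &str = "portable fallback";
}

fn main() {
    println!("compiled code path: {}", simd::PATH);
}
```

        Compile-time `cfg` selects per architecture; the finer-grained choice (SSE4 vs. AVX2, etc.) still has to happen at runtime, since one x86_64 binary runs on many CPU generations.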

        • #5
          There's no C code in rav1e; it just optionally exposes a C interface to allow calling into this Rust library from other languages.

          • #6
            Originally posted by miskol View Post
            As I don't really see a big benefit in having 4-6 implementations...
            Because you don't get how hardware works. Honestly, having 6 implementations is already a big compromise; if you really want the best performance possible, it should be way more than 6, like:

            1.) C/C++, Rust, Python, Go, Ada, whatever language you like, as the fallback code path ---- slowest
            2.) x86_64:
            a.) Sandy Bridge/Ivy Bridge --- SSSE3/SSE4.2 + architecture specifics
            b.) Haswell+ --- SSSE3/SSE4.2 for Xeons --- AVX2 for desktop + architecture specifics
            c.) Skylake+ --- AVX-512 for Xeons (would need testing, see later why) --- AVX2 for desktop + architecture specifics
            d.) Zen 1/+ Ryzen --- AVX2 only + architecture specifics
            e.) Zen 1/+ Threadripper --- AVX2 only + architecture specifics + NUMA awareness
            f.) Zen 2 --- AVX2 + architecture specifics

            and apply the same for the other architectures

            "Why? OMG, x86_64/ARM/etc. are standards, you are stupid!!! LOL?"

            Simple: even though x86/ARM/PowerPC/etc. --- SSE/AVX/NEON/AltiVec/etc. --- are standards, the standard only guarantees the availability of these features and nothing else, because the microarchitecture is not standard and every implementation has its own quirks that determine how SIMD will actually perform. For example, performance depends heavily on:

            1.) How the ALUs do fetches
            2.) How the caches (L1, L2, L3, L4) fetch and fall back (pass vs. victim)
            3.) How the prefetchers behave and which access patterns they favor alongside the ALUs
            4.) How memory locality works
            5.) How thermal limits are applied to the SIMD units (THIS IS A PAIN WITH XEONS AND AVX-512)
            6.) How frequency limits behave when using SIMD (AGAIN, XEONS)
            7.) How the application's DATA behaves; not all data is efficient at every SIMD width

            So, in practice, it is quite normal to kill yourself extracting every bit of performance for, let's say, an Ivy Bridge Xeon, just to see your implementation run almost as slowly as your fallback path on, say, a TR 1950X --- or to watch your beautiful (yet horrendous) AVX-512 code end up just 20% faster than AVX2 on your shiny Skylake Xeon, because the CPU drops its frequency to the ground before going full Chernobyl, since Intel's engineering geniuses put this monstrosity on 14nm.

            Outside of HPC, most developers will go for the GOOD ENOUGH SIMD implementation, which translates to "hey, it's not even using the CPU fully, but it's faster than the fallback code on almost all CPUs tested, so fuck it" --- which is what rav1e is doing.
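            One cheap way to stay "good enough" across quirky microarchitectures is to time the candidate kernels once on this machine and keep whichever wins, instead of hardcoding per-CPU choices. A minimal sketch, assuming a toy sum kernel (function names are illustrative; this is not rav1e's actual mechanism):

```rust
use std::time::Instant;

fn sum_scalar(xs: &[u32]) -> u32 {
    xs.iter().fold(0u32, |a, &b| a.wrapping_add(b))
}

fn sum_unrolled(xs: &[u32]) -> u32 {
    // Four independent accumulators; compilers often vectorize this shape.
    let mut acc = [0u32; 4];
    let chunks = xs.chunks_exact(4);
    let rem = chunks.remainder();
    for c in chunks {
        for i in 0..4 {
            acc[i] = acc[i].wrapping_add(c[i]);
        }
    }
    let tail = rem.iter().fold(0u32, |a, &b| a.wrapping_add(b));
    acc.iter().fold(tail, |a, &b| a.wrapping_add(b))
}

// Benchmark both kernels once on a sample and return the faster one.
fn pick_sum(sample: &[u32]) -> fn(&[u32]) -> u32 {
    let time = |f: fn(&[u32]) -> u32| {
        let t = Instant::now();
        std::hint::black_box(f(std::hint::black_box(sample)));
        t.elapsed()
    };
    if time(sum_unrolled) <= time(sum_scalar) {
        sum_unrolled
    } else {
        sum_scalar
    }
}

fn main() {
    let data: Vec<u32> = (0..1_000_000).collect();
    let sum = pick_sum(&data);
    println!("sum = {}", sum(&data));
}
```

            A single timing run is noisy; a real version would warm up and take the best of several iterations, but the point stands: measure on the deployed CPU rather than trust the ISA label.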
