Rav1e 0.4 Alpha Released With Much Faster Performance For Rust AV1 Encoding


    Phoronix: Rav1e 0.4 Alpha Released With Much Faster Performance For Rust AV1 Encoding

    After more than half a year of work on this new version, Rav1e 0.4 is on the way, but first the alpha milestone is out today...

  • #2
    It would be really nice if we could write code in some specific way and, as the output, get a huge speedup from vector instructions.
    But I want to be 100% sure that if I wrote the same code in assembler, I wouldn't get any more speedup.

    As it stands, I don't really see a big benefit in having 4-6 implementations:
    1. in C or Rust
    2/3. for Intel/AMD 32/64-bit SSE or AVX
    4. ARM NEON (32-bit/64-bit)
    5. IBM PowerPC
    6. RISC-V

    • #3
      Originally posted by miskol View Post
      As I don't really see a big benefit in having 4-6 implementations...
      How about function multiversioning?
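      In Rust, the hand-rolled form of function multiversioning is runtime CPU feature detection with `std`'s `is_x86_feature_detected!` macro plus `#[target_feature]` (crates like `multiversion` wrap this in a macro). A minimal sketch, assuming a simple sum kernel (the function names here are illustrative, not rav1e's):

```rust
// Hand-rolled function multiversioning: pick an implementation at runtime
// based on detected CPU features, with a portable fallback.
fn sum(xs: &[u32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: AVX2 availability was just verified at runtime.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_scalar(xs) // portable fallback, compiled for every target
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[u32]) -> u32 {
    // Same body, but compiled with AVX2 enabled so the compiler
    // may auto-vectorize it more aggressively.
    xs.iter().copied().fold(0u32, u32::wrapping_add)
}

fn sum_scalar(xs: &[u32]) -> u32 {
    xs.iter().copied().fold(0u32, u32::wrapping_add)
}

fn main() {
    let data: Vec<u32> = (0..1000).collect();
    println!("sum = {}", sum(&data));
}
```

      The detection cost is paid per call here; real codecs typically resolve the function pointer once at startup and stash it in a dispatch table.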

      • #4
        Originally posted by miskol View Post
        As I don't really see a big benefit in having 4-6 implementations
        I bet no developers are happy with multiple implementations, but the reality is that it's unavoidable: there are old and new OSes (32-bit vs. 64-bit), SIMD instruction sets keep getting richer and more powerful (SSE vs. AVX), and architectures work differently (x86 vs. ARM).
        This still doesn't settle Rust vs. C, but you know, new languages arise for a reason, so someone starts using them.
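        For the "architectures work differently" part, Rust handles the per-target split at compile time with `cfg`, so each build only contains the code path for its own architecture. A toy sketch (module and constant names are illustrative, not rav1e's):

```rust
// Compile-time dispatch: exactly one `simd` module survives
// compilation for any given target architecture.
#[cfg(target_arch = "x86_64")]
mod simd {
    pub const PATH: &str = "x86_64 (SSE/AVX candidates)";
}

#[cfg(target_arch = "aarch64")]
mod simd {
    pub const PATH: &str = "aarch64 (NEON candidates)";
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
mod simd {
    pub const PATH: &str = "portable fallback";
}

fn main() {
    println!("compiled code path: {}", simd::PATH);
}
```

        Compile-time `cfg` selects per architecture; the finer-grained choice (SSE4 vs. AVX2, etc.) still has to happen at runtime, since one x86_64 binary runs on many CPU generations.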

        • #5
          There's no C code in rav1e; it just optionally exposes a C interface to allow calling into this Rust library from other languages.

          • #6
            Originally posted by miskol View Post
            As I don't really see a big benefit in having 4-6 implementations...
            Because you don't get how hardware works. Honestly, having 6 implementations is already a big compromise; if you really want the best performance possible, it should be way more than 6, like:

            1.) C/C++, Rust, Python, Go, Ada, whatever language you like, as the fallback code path ---- slowest
            2.) x86_64:
            a.) Sandy Bridge/Ivy Bridge --- SSSE3/SSE4.2 + architecture specifics
            b.) Haswell+ --- SSSE3/SSE4.2 for Xeons --- AVX2 for desktop + architecture specifics
            c.) Skylake+ --- AVX-512 for Xeons (would need testing, see later why) --- AVX2 for desktop + architecture specifics
            d.) Zen 1/+ Ryzen --- AVX2 only + architecture specifics
            e.) Zen 1/+ Threadripper --- AVX2 only + architecture specifics + NUMA awareness
            f.) Zen 2 --- AVX2 + architecture specifics

            and apply the same for the other architectures

            "Why? OMG, x86_64/ARM/etc. are standards, you are stupid!!! LOL?"

            Simple: even though x86/ARM/PowerPC/etc. --- SSE/AVX/NEON/AltiVec/etc. --- are standards, the standard only guarantees the availability of these features and nothing else, because the microarchitecture is not standard and every implementation has its own quirks that determine how SIMD will actually perform. For example, performance depends heavily on:

            1.) How the ALUs do fetches
            2.) How the caches (L1, L2, L3, L4) fetch and fall back (pass vs. victim)
            3.) How the prefetchers behave and which access patterns they favor alongside the ALUs
            4.) How memory locality works
            5.) How thermal limits are applied to the SIMD units (THIS IS A PAIN WITH XEONS AND AVX-512)
            6.) How frequency limits behave when using SIMD (AGAIN, XEONS)
            7.) How the application's DATA behaves; not all data is efficient at every SIMD width

            So, in practice, it is quite normal to kill yourself extracting every bit of performance for, let's say, an Ivy Bridge Xeon, just to see your implementation run almost as slowly as your fallback path on, say, a TR 1950X --- or to watch your beautiful (yet horrendous) AVX-512 code end up just 20% faster than AVX2 on your shiny Skylake Xeon, because the CPU drops its frequency to the ground before going full Chernobyl, since Intel's engineering geniuses put this monstrosity on 14nm.

            Outside of HPC, most developers will go for the GOOD ENOUGH SIMD implementation, which translates to "hey, it's not even using the CPU fully, but it's faster than the fallback code on almost all CPUs tested, so fuck it" --- which is what rav1e is doing.
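            One cheap way to stay "good enough" across quirky microarchitectures is to time the candidate kernels once on this machine and keep whichever wins, instead of hardcoding per-CPU choices. A minimal sketch, assuming a toy sum kernel (function names are illustrative; this is not rav1e's actual mechanism):

```rust
use std::time::Instant;

fn sum_scalar(xs: &[u32]) -> u32 {
    xs.iter().fold(0u32, |a, &b| a.wrapping_add(b))
}

fn sum_unrolled(xs: &[u32]) -> u32 {
    // Four independent accumulators; compilers often vectorize this shape.
    let mut acc = [0u32; 4];
    let chunks = xs.chunks_exact(4);
    let rem = chunks.remainder();
    for c in chunks {
        for i in 0..4 {
            acc[i] = acc[i].wrapping_add(c[i]);
        }
    }
    let tail = rem.iter().fold(0u32, |a, &b| a.wrapping_add(b));
    acc.iter().fold(tail, |a, &b| a.wrapping_add(b))
}

// Benchmark both kernels once on a sample and return the faster one.
fn pick_sum(sample: &[u32]) -> fn(&[u32]) -> u32 {
    let time = |f: fn(&[u32]) -> u32| {
        let t = Instant::now();
        std::hint::black_box(f(std::hint::black_box(sample)));
        t.elapsed()
    };
    if time(sum_unrolled) <= time(sum_scalar) {
        sum_unrolled
    } else {
        sum_scalar
    }
}

fn main() {
    let data: Vec<u32> = (0..1_000_000).collect();
    let sum = pick_sum(&data);
    println!("sum = {}", sum(&data));
}
```

            A single timing run is noisy; a real version would warm up and take the best of several iterations, but the point stands: measure on the deployed CPU rather than trust the ISA label.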
