Linux 5.7 Netfilter To See AVX2 Optimizations For Big Performance Boost - Can Be Up To ~420%

ZeDestructor replied

30 March 2020, 11:32 PM
Originally posted by newwen

Doesn't executing AVX2 instructions consume a lot of power? If the processing time decreases, it's logical for the total energy consumed by the process to decrease, at the expense of higher peak/instantaneous power consumption, but has anyone tested it?

It does, but when you're looking at 0-30% more power for 30-420% better throughput, it's still more efficient overall.
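
As a back-of-the-envelope check: energy for a fixed amount of work is power times time, and time scales with 1/throughput. A minimal sketch in C, assuming the endpoints of the ranges above and reading "420% better" as 5.2x the baseline throughput (that reading and the function name `relative_energy` are assumptions for illustration, not measurements):

```c
#include <assert.h>

/* Relative energy needed to finish the same amount of work.
 * energy = power * time, and time scales as 1 / throughput,
 * so energy_ratio = power_ratio / throughput_ratio.
 * Both arguments are ratios against the non-AVX2 baseline (1.0 = unchanged). */
double relative_energy(double power_ratio, double throughput_ratio)
{
    assert(throughput_ratio > 0.0); /* a zero or negative ratio is meaningless */
    return power_ratio / throughput_ratio;
}
```

With these inputs, `relative_energy(1.30, 1.30)` breaks even at 1.0x the energy, while `relative_energy(1.30, 5.20)` comes out at about 0.25x, i.e. roughly a quarter of the energy for the same work.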
computerquip replied

16 March 2020, 11:04 PM
Originally posted by Mario Junior

This BTFO all the people who say: "but muhhh, the compiler knows how to optimize assembly code better than you do it manually."

There are obviously going to be cases where hand-optimized assembler is faster. The problem is that it's less readable, harder to understand, much more verbose, and far less portable than C in general. And the compiler *does*, in the vast majority of situations, generate better assembly than you could write by hand.
Likes 1
sophisticles replied

16 March 2020, 10:21 AM
Originally posted by Raka555

I assume it is hand-optimized assembler?

All SIMD programming is done using either assembler or compiler intrinsics, which are basically C-style functions that map 1-1 to assembler instructions.

By definition SIMD is hand-optimized, unless of course you rely on compiler optimizations.
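
To make the intrinsics point concrete, here is a minimal sketch of a bitwise AND over two byte buffers, with an AVX2 path and a scalar fallback. It is not the kernel's code: the function names (`and_bytes`, `and_bytes_avx2`, `and_bytes_scalar`) are hypothetical, and the `target("avx2")` attribute and `__builtin_cpu_supports` are GCC/Clang features, so other compilers would need a different dispatch mechanism.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable fallback: one byte per iteration. */
static void and_bytes_scalar(uint8_t *dst, const uint8_t *a,
                             const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] & b[i];
}

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* AVX2 path: 32 bytes per iteration. The target attribute lets this one
 * function use AVX2 without building the whole file with -mavx2. */
__attribute__((target("avx2")))
static void and_bytes_avx2(uint8_t *dst, const uint8_t *a,
                           const uint8_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        /* Unaligned 256-bit loads, one AND (vpand), one unaligned store. */
        __m256i va = _mm256_loadu_si256((const __m256i *)(a + i));
        __m256i vb = _mm256_loadu_si256((const __m256i *)(b + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_and_si256(va, vb));
    }
    /* Handle the tail that does not fill a whole 32-byte vector. */
    for (; i < n; i++)
        dst[i] = a[i] & b[i];
}
#endif

/* Picks the AVX2 path at run time when the CPU supports it. */
void and_bytes(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx2")) {
        and_bytes_avx2(dst, a, b, n);
        return;
    }
#endif
    and_bytes_scalar(dst, a, b, n);
}
```

Each intrinsic call in the loop corresponds to a single vector instruction, but register allocation, scheduling, and the surrounding loop are still left to the compiler.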
newwen replied

16 March 2020, 09:45 AM
Doesn't executing AVX2 instructions consume a lot of power? If the processing time decreases, it's logical for the total energy consumed by the process to decrease, at the expense of higher peak/instantaneous power consumption, but has anyone tested it?
Likes 1
ssokolow replied

16 March 2020, 07:11 AM
Originally posted by Mario Junior

This BTFO all the people who say: "but muhhh, the compiler knows how to optimize assembly code better than you do it manually."

To be fair, the best solution is sort of a middle-ground: compiler intrinsics.

The "either use the fast path or fail to compile" of explicitly calling SIMD instructions directly, but without the "compiler has no idea what optimizations are safe to perform on the data flow surrounding this instruction" of assembly.

That's why Microsoft decided to omit inline assembly support from Visual C++ for 64-bit targets.
Likes 2
Mario Junior replied

16 March 2020, 06:11 AM
Originally posted by tiennou

A look at the commit (https://git.kernel.org/pub/scm/linux...94d765c8eecbe1) points in that direction.

This BTFO all the people who say: "but muhhh, the compiler knows how to optimize assembly code better than you do it manually."
Likes 1
ldesnogu replied

16 March 2020, 05:48 AM
Originally posted by cbxbiker61

It's always welcome to see performance improvements. Optimizing ARM would no doubt hit a larger user base with all of the ARM-based OpenWRT and DD-WRT routers (I've got 5 routers running OpenWRT and one of them would benefit from netfilter performance improvements).

It seems the author agrees:

nft_set_pipapo: Introduce AVX2-based lookup implementation - kernel/git/pablo/nf-next.git - Netfilter's -next tree

https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git/commit/?id=7400b063969bdca4a06cd97f1294d765c8eecbe1

A similar strategy could be easily reused to implement specialised versions for other SIMD sets, and I plan to post at least a NEON version at a later time.
Likes 4
milkylainen replied

16 March 2020, 05:12 AM
Originally posted by r08z

This is what ClearLinux does on a regular basis for all libraries and programs with a few simple AVX2 intrinsics patches to help the program make better use of the -march=haswell compiler flag.

Vectorization and architecture support are not the same as structured, optimized assembly for complex data.
But if the code structure was different, maybe more standard vectorization would have helped.
Replacing a function or call with a hand-optimized intrinsic is not the same either.
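
As a rough illustration of why code structure matters for compiler vectorization, here is a sketch with two hypothetical functions (`add_arrays` and `prefix_sum` are made-up names, unrelated to the netfilter code). Whether a given compiler actually vectorizes either loop depends on its version and flags, so treat the comments as typical behavior rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Independent iterations and restrict-qualified pointers (no aliasing):
 * a typical candidate for auto-vectorization when building with
 * optimization enabled and a flag such as -march=haswell. */
void add_arrays(int32_t *restrict dst, const int32_t *restrict a,
                const int32_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Running sum: every iteration depends on the previous one. Loops with
 * this kind of carried dependency usually stay scalar unless the
 * algorithm is restructured by hand. */
void prefix_sum(int32_t *dst, const int32_t *a, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += a[i];
        dst[i] = acc;
    }
}
```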
Likes 3
r08z replied

16 March 2020, 04:52 AM
This is what ClearLinux does on a regular basis for all libraries and programs with a few simple AVX2 intrinsics patches to help the program make better use of the -march=haswell compiler flag.
Setif replied

16 March 2020, 03:22 AM
I hope it's not about some operations that occur once or twice a day, changed from taking 4.2 ms to 1.0 ms (420%).
Likes 1