Intel Adds AVX2/FMA Optimized Math Functions To Glibc 2.27

  • #21
    Originally posted by PuckPoltergeist View Post

    And this doesn't affect the optimizations mentioned here. We're speaking about assembler optimizations within glibc, not something the compiler generates.

    ... and there are already special libraries for this, with highly optimized code.
    Correct. Assembler written by Intel most likely runs faster on Intel CPUs and may or may not run as fast on a different architecture (there is no reason for them to spend time testing other architectures during the code-design phase; that is something to have considered when designing the architecture itself).

    The code submitted is LGPL (https://en.wikipedia.org/wiki/GNU_Le...Public_License), so while it is likely optimized in Intel's favor, anyone is free to rewrite it in any manner they choose (including optimizing it for a specific architecture).

    It's up to the CPU from AMD, ARM, Intel, etc. to eat whatever it is given and execute it as quickly as possible. Code that, however it is written, runs much slower on one particular architecture and much faster on another makes the faster CPU the winner (on speed alone, not necessarily wattage or performance per watt).

    It's up to the programmer to make the code short (particularly for ARM) and fast; it's up to the CPU to whip through it quickly and power-efficiently.

    Indeed, specialized libraries are developed for specific applications: some loops are unrolled and some instructions reordered (without caring about size, readability, or anything except speed), while other portions are written to conserve memory or to be particularly easy to understand; it depends upon the goal.

    Usually speed is the priority, but where a huge saving in memory usage can be had, that sometimes becomes the more important consideration.

    As a simplified example, you likely don't want half a dozen math instructions in a row all accessing the same register, followed by half a dozen memory-transfer instructions all accessing the same memory address; you would interleave them.

    When writing for a specific architecture, one would intersperse memory accesses with other instructions (particularly with Bulldozer) so that the CPU isn't thread-locked or sitting in a wait state (like multitasking at the single-thread level).
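
    As a rough sketch of that idea (hypothetical plain C, not the assembler under discussion): splitting a reduction across several independent accumulators breaks the single dependency chain, so the CPU can overlap the loads and multiply-adds instead of waiting on each result. Function names here are illustrative only.

    /* Hypothetical illustration only -- not the glibc submission itself.
     * A single accumulator forces each multiply-add to wait for the
     * previous one; several independent accumulators let the CPU keep
     * memory loads and arithmetic in flight at the same time. */
    #include <stddef.h>

    double dot_naive(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];          /* one long dependency chain */
        return sum;
    }

    double dot_interleaved(const double *a, const double *b, size_t n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {     /* four independent chains in flight */
            s0 += a[i + 0] * b[i + 0];
            s1 += a[i + 1] * b[i + 1];
            s2 += a[i + 2] * b[i + 2];
            s3 += a[i + 3] * b[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i] * b[i];           /* leftover elements */
        return (s0 + s1) + (s2 + s3);
    }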

    So everyone should feel free to read and improve the submission; the LGPL even allows Intel's submission to be used in proprietary software (so don't say that Intel never gives anything away).

    Thanks Intel.



    Comment


    • #22
      Originally posted by arjan_intel View Post

      Actually, I was bored last weekend and saw these show up in profiles... so I poked and optimized and stuck the result in Clear Linux.
      (well, half of them were done a few weeks earlier)
      HJ then made the code prettier and upstreamed it into glibc.

      I and others have been working on such optimizations via Clear Linux for over two years now (since the start of CL), well before Zen was known.

      Great work! Thank you very much for this (I have some software for which these transcendentals are the bottleneck).

      Comment


      • #23
        AVX2/FMA instructions have been available since 2013 with Haswell... four years to optimize basic scalar arithmetic functions for their processors, and no SIMD versions yet.

        Comment


        • #24
          Originally posted by thebear View Post

          Great work! Thank you very much for this (I have some software for which these transcendentals are the bottleneck).
          You can try this easily by doing:

          docker run -it clearlinux/machine-learning

          (that gets you a pretty complete OS image)

          and inside that, do a "swupd update" to get to the latest version.

          Comment


          • #25
            Both AVX2 and FMA were introduced four years ago in Haswell processors; it is nice to finally see the release of an optimized glibc today...
            Since glibc provides only arithmetic functions operating on scalar values, I will continue using intrinsics and libraries like Boost.SIMD to take full advantage of the SIMD units and process fp32 values in packs of 8...
            Can someone tell me why the optimized versions are written in assembly instead of intrinsics, thus preventing compiler optimizations that depend on the compilation/architecture flags? If I remember correctly, compilers are able to reorder intrinsics but not assembly instructions.
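
            On that point, here is a minimal, hypothetical sketch (C intrinsics, not the glibc code) of processing fp32 values in packs of 8 with AVX2/FMA; because it is intrinsics rather than inline assembly, the compiler remains free to schedule and reorder it. The function name and build flags below are assumptions for illustration.

            /* Hypothetical sketch: an 8-wide fp32 fused multiply-add using
             * AVX2/FMA intrinsics (built with something like: gcc -O2 -mavx2 -mfma).
             * Not the glibc implementation, just what "packs of 8" looks like. */
            #include <immintrin.h>
            #include <stddef.h>

            void fma8(const float *a, const float *b, float *c, size_t n)
            {
                size_t i = 0;
                for (; i + 8 <= n; i += 8) {
                    __m256 va = _mm256_loadu_ps(a + i);   /* load 8 floats */
                    __m256 vb = _mm256_loadu_ps(b + i);
                    __m256 vc = _mm256_loadu_ps(c + i);
                    vc = _mm256_fmadd_ps(va, vb, vc);     /* c = a*b + c, 8 lanes at once */
                    _mm256_storeu_ps(c + i, vc);
                }
                for (; i < n; i++)                        /* scalar tail */
                    c[i] = a[i] * b[i] + c[i];
            }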

            Comment


            • #26
              Originally posted by thebear View Post

              Great work! Thank you very much for this (I have some software for which these transcendentals are the bottleneck).
              I can only second this.
              Time goes by so fast, and Arjan has been on board for sooo long... ;-)

              BTW Arjan, can you help me with this: (Was: Re: Resizeable PCI BAR support V5)
              https://lists.freedesktop.org/archiv...ne/010636.html and


              I'm searching for the right (tm) 'Nehalem' (X34xx) Northbridge documentation.
              I will get my fingers dirty; maybe one of you will be faster than me.

              Greetings,
              Dieter

              Comment


              • #27
                I've just added the patches to OpenMandriva -- so there's another distribution to try them on.

                Comment


                • #28
                  Originally posted by Virtus View Post
                  AVX2/FMA instructions have been available since 2013 with Haswell... four years to optimize basic scalar arithmetic functions for their processors, and no SIMD versions yet.
                  The SIMD versions are available in libmvec, which has been in glibc for the last 2 or 3 releases at least, and they are used automatically with a new enough gcc (but IIRC they require -ffast-math or a subset thereof).
                  Last edited by arjan_intel; 21 August 2017, 09:47 AM. Reason: (typo fixed, gcc not glibc)
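
                  To make that concrete, here is a hedged, hypothetical example: with a recent gcc and glibc, a plain loop over expf() can be auto-vectorized into calls to the libmvec vector variants when fast-math style options are enabled. The function name, file name, and exact flags below are assumptions; behaviour depends on toolchain versions.

                  /* Hypothetical illustration of the libmvec path: built with
                   * something along the lines of
                   *     gcc -O2 -ffast-math -march=haswell vec_exp.c -lm
                   * a new enough gcc can replace the scalar expf() calls in this
                   * loop with a vectorized libmvec entry point instead of calling
                   * the scalar function once per element. */
                  #include <math.h>
                  #include <stddef.h>

                  void vec_exp(const float *in, float *out, size_t n)
                  {
                      for (size_t i = 0; i < n; i++)
                          out[i] = expf(in[i]);   /* candidate for libmvec's SIMD expf */
                  }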

                  Comment


                  • #29
                    Originally posted by arjan_intel View Post

                    Actually, I was bored last weekend and saw these show up in profiles... so I poked and optimized and stuck the result in Clear Linux.
                    (well, half of them were done a few weeks earlier)
                    HJ then made the code prettier and upstreamed it into glibc.

                    I and others have been working on such optimizations via Clear Linux for over two years now (since the start of CL), well before Zen was known.
                    Oh shush. Why did you have to go and poke holes in my perfectly convenient "evil, evil intel" conspiracy theory?!

                    (on a serious note: TYVM to you and your partners in crime -- it's always nice to see widespread open source software being improved to take advantage of newer platforms)

                    Comment
