Reducing The CPU Usage In Mesa To Improve Performance

  • #41
    Originally posted by tarceri View Post
    To use your own words, I'm not impressed by that benchmarking technique. As I said previously, the function in question seems to be hardly used at all in extremetuxracer, so if you are really seeing a difference (which would be very difficult to conclude, because you're not really comparing anything) it's very unlikely that it's caused by the patch.
    I am not here to impress anybody; I just said there are side effects of this that I am seeing. You only measure callgrind+openarena; that is a profiling benchmark you believe is OK, but it is not a real-world situation for me.

    I'm not trying to be rude here but I seriously think you need to try some real benchmarks before handing out this advice or providing feedback on patches.
    Eh, when someone wants to start being rude, he makes a sentence with an excuse like: "I'm not trying to be rude, but seriously...". Some people may need your advice, but I am not one of them. We can speak in a friendly way, but not with that "I'm not trying to be rude, but seriously...". In the real world I would answer that with: sit down before you start talking rubbish (that is a joke, of course).

    Also I'm pretty sure sse4.1 support in mesa is only currently use to build one function which is used in the intel driver, so its highly unlikely that removing this from configure.ac will do anything at all since you are using AMD hardware.
    I said that for other people to try; there is nothing wrong with that, is there? They may see a difference, Intel CPU users included. It is not only about me.

    Well, do it how you like; just make that optimization possible to disable, because performance seems to depend on the hardware driver... so I can disable it and not think about it.
    Last edited by dungeon; 28 October 2014, 08:53 AM.



    • #42
      Originally posted by tarceri View Post
      Just because SSE2 is enabled doesn't always mean gcc will know when it's best to use it.
      Well, that's what I meant: optimizations "by hand". x86-64 is 10 years old; I thought something like this had been done already.
      Anyway, thank you very much for your work.



      • #43
        Originally posted by dungeon View Post
        I am not here to impress anybody; I just said there are side effects of this that I am seeing. You only measure callgrind+openarena; that is a profiling benchmark you believe is OK, but it is not a real-world situation for me.
        Sigh. No, I already posted that I ran callgrind+extremetuxracer, and the function that I made changes to accounted for only 0.04% and was called just over 1200 times, which is nothing. I did see other areas where modifications could be made, but my point is that the results you are giving me are not concrete or reproducible, and there is no reason I can see that this optimisation should be disabled, as you suggest, based on your claims.

        Originally posted by dungeon View Post
        Eh, when someone wants to start being rude, he makes a sentence with an excuse like: "I'm not trying to be rude, but seriously...". Some people may need your advice, but I am not one of them. We can speak in a friendly way, but not with that "I'm not trying to be rude, but seriously...". In the real world I would answer that with: sit down before you start talking rubbish (that is a joke, of course).
        I was trying to emphasise that I was being sincere. I honestly think you should try something a bit more reliable than trying to eyeball results in the gallium hud.

        Originally posted by dungeon View Post
        I said that for other people to try; there is nothing wrong with that, is there? They may see a difference, Intel CPU users included. It is not only about me.

        Well, do it how you like; just make that optimization possible to disable, because performance seems to depend on the hardware driver... so I can disable it and not think about it.
        You can disable whatever you like, but you're trying to tell others it makes performance better without any real evidence that it's helpful at all.



        • #44
          SSE2/SSE41/AVX2 patch

          Originally posted by tarceri View Post
          If anyone's interested, I just got sent an email pointing me to this post [1] about auto-vectorization in gcc, i.e. automated use of SSE/AVX.

          There are likely to be at most 3 targets for this particular optimisation:

          SSE2 - because it's common and can be assumed by default in 64-bit builds.

          SSE4.1 - because it includes min/max instructions, which makes it faster, as that's the main thing the function does.

          AVX2 - because it has min/max instructions that can compare 8 values at once rather than 4 in SSE4.1.

          Also, for anyone curious, here is a list of the intrinsics available [2]

          [1] http://locklessinc.com/articles/vectorize/
          [2] https://software.intel.com/sites/lan...trinsicsGuide/
          I agree, so I have come up with a new patch which implements all three approaches.

          Also, calling the vector code is only triggered when it would yield an improvement in performance; maybe someone could benchmark this patch.



          • #45
            SSE2/SSE41/AVX2 patch

            Originally posted by cbxbiker61 View Post
            I agree, so I have come up with a new patch which implements all three approaches.

            Also, calling the vector code is only triggered when it would yield an improvement in performance; maybe someone could benchmark this patch.

            http://www.xilka.com/xilka/source/tm...elements.patch
            This version is better, since it takes alignment into consideration when deciding whether to call the vector code.



            • #46
              Originally posted by tarceri View Post
              You can disable whatever you like, but you're trying to tell others it makes performance better without any real evidence that it's helpful at all.
              I will continue to advise people to try that on their own hardware/drivers to be sure, because performance goes up for me by ~2% if I disable every Mesa optimization at build time. You can believe me or not; that is how it is for me. On the other hand, you said on your blog that you reduced CPU usage by 2.5% while not knowing what happens to performance, didn't you? That tells me you didn't actually look at the fps rate AT ALL, only measured CPU usage in callgrind.

              We can drop your callgrind profiling and my eyesight, which I trust more than any software, buggy or not... and ask some third party like Michel to benchmark on various hardware; I am sure the results will not always be good or bad, but varying. When something is vague like that, it should not be enabled by default, or if it is, I would like an easy switch to disable it.
              Last edited by dungeon; 29 October 2014, 01:58 AM.



              • #47
                Originally posted by tarceri View Post
                If anyone's interested, I just got sent an email pointing me to this post [1] about auto-vectorization in gcc, i.e. automated use of SSE/AVX.

                There are likely to be at most 3 targets for this particular optimisation:

                SSE2 - because it's common and can be assumed by default in 64-bit builds.

                SSE4.1 - because it includes min/max instructions, which makes it faster, as that's the main thing the function does.

                AVX2 - because it has min/max instructions that can compare 8 values at once rather than 4 in SSE4.1.

                Also, for anyone curious, here is a list of the intrinsics available [2]

                [1] http://locklessinc.com/articles/vectorize/
                [2] https://software.intel.com/sites/lan...trinsicsGuide/
                Would it be possible to have a runtime check for available instruction sets and use them as available, instead of making it a compile-time switch?
                That would make it a lot easier for distributions (and their users).



                • #48
                  Originally posted by geearf View Post
                  Would it be possible to have a runtime check for available instruction sets and use them as available, instead of making it a compile-time switch?
                  That would make it a lot easier for distributions (and their users).
                  Yeah, that's what it will be.



                  • #49
                    Originally posted by tarceri View Post
                    Yeah, that's what it will be.
                    Awesome!

                    Is GCC generating that switch for you, or do you have to add some code for this behavior?



                    • #50
                      Originally posted by geearf View Post
                      Would it be possible to have a runtime check for available instruction sets and use them as available, instead of making it a compile-time switch?
                      That would make it a lot easier for distributions (and their users).
                      I think you are misunderstanding the code.

                      The existing build check verifies that the compiler supports the switch required to compile SSE4.1 code. If it does, the SSE4.1 version is compiled, and in the case of my patch that also includes the SSE2 and AVX2 versions.

                      There is already runtime detection to decide whether to use the SSE/AVX versions.

