**popper** · 08 January 2011, 10:16 PM

Originally posted by marek

popper> You apparently live in the illusion that profilers or any kind of measuring tool can show you what to optimize.

ROTFL

*curls up in the corner as the hardcore Linux gfx and kernel devs come out in force to call popper names on the user support site*

Originally posted by marek

The graphics driver stack is huge, there are hundreds of functions that call each other, each spending very little time in itself.

The right question usually isn't "how to speed up this function", it's rather something like "can I somehow change the upper layers so that this function is called less often?"

Now if you start with that kind of question, you realize that if you don't know what's REALLY going on in the code at various levels, profiling is mostly USELESS and will only make you spend time on parts of code that will give you very little speedup, if any (e.g. you may end up wondering why atomic increment is so high on the profile).

Sure, but you can't rule something out just because it's hard to prove, and I agree it's better to code smarter (so functions are called less often).

You have not convinced me that "profiling is mostly USELESS" though. Profile wherever you can: even a small speed-up in one routine is still better than none at all, and it may just be that the dev, or someone reviewing his code, finds a better option because that profiling took place.

**bridgman** · 08 January 2011, 10:26 PM

Originally posted by popper

You have not convinced me that "profiling is mostly USELESS" though. Profile wherever you can: even a small speed-up in one routine is still better than none at all, and it may just be that the dev, or someone reviewing his code, finds a better option because that profiling took place.

Popper, you would have a lot more credibility if you just admitted you were wrong. I don't see any way you can say something like:

Originally posted by popper

WHAT, are all AMD Linux workflows this slow, and wasting lots of cycles! You're obviously doing it wrong, even a NON Head of AMD manager of Linux can see that.

You're saying you don't even have, or haven't yet written, a simple C app doing x264-type checkasm tests down to the decicycles range of all the C and assembly routines on all code?

... then come back and say that what you really meant was that we should be starting with something like checkasm (as if the driver devs don't already have CPU profilers) and build it up into a complex tool that could tell us where the bottlenecks are in a highly parallel hardware/software system.

Like it or not, you are still almost completely missing the point. Performance tuning for a device driver has very little to do with making individual routines run fast on the CPU and almost everything to do with using the hardware more efficiently.

**popper** · 08 January 2011, 10:43 PM

Arr, so in Your Opinion it's now about "credibility" and "admitted you [I] were wrong" rather than a conversation to convince a reader one way or the other that something could work to everyone's advantage.

It's interesting that the thread so far has not actually produced any potential fixes, speed-ups, etc.

Although I do appreciate glisse taking the initiative and laying it out in a quick sketch to try and HELP everyone reading who isn't hardcore, thanks.

**popper** · 08 January 2011, 10:51 PM

Damn the edit time limit. I refreshed, so I can't delete and re-edit, but opening the "credibility" can of worms is a very slippery slope....

**bridgman** · 08 January 2011, 11:00 PM

Popper, I'm trying to help. You don't have to accept it.

**popper** · 08 January 2011, 11:12 PM

Originally posted by bridgman

Popper, I'm trying to help. You don't have to accept it.

As am I, but remember this thread in the future. After all, they say it's good to learn new things every day.

**bridgman** · 08 January 2011, 11:18 PM

Sorry, perhaps I missed something. How exactly was this supposed to help, given that the developers were already using the kind of profilers you were talking about?

Originally posted by popper

WHAT, are all AMD Linux workflows this slow, and wasting lots of cycles! You're obviously doing it wrong, even a NON Head of AMD manager of Linux can see that.

You're saying you don't even have, or haven't yet written, a simple C app doing x264-type checkasm tests down to the decicycles range of all the C and assembly routines on all code?

**popper** · 08 January 2011, 11:43 PM

Originally posted by bridgman

Sorry, perhaps I missed something. How exactly was this supposed to help, given that the developers were already using the kind of profilers you were talking about?

You put the case:

Originally posted by bridgman

Performance optimization is basically :

- run some benchmarks & save the results
repeat forever {
- do some profiling
- form a theory re: where the bottleneck is
- change some code to test the theory
- re-run the benchmarks to see if things go faster
- (4 times out of 5) curse and discard the theory (or save as the basis for a more complex theory)
- (1 time out of 5) make happy noises and get some sleep
}

I put the case: perhaps try to make a new tool that will automatically cut down at least some of these separate steps, to help efficiency and keep the thought/coding process flowing in the right direction.
Nothing more, nothing less.

**bridgman** · 08 January 2011, 11:49 PM

(/me bangs head against wall)

I still don't get it. Are you talking about a "form a theory" tool? Everything else already has tools and those tools are constantly being improved.

**marek** · 09 January 2011, 12:29 AM

Originally posted by popper

It's interesting that the thread so far has not actually produced any potential fixes, speed-ups, etc.

Maybe not in this thread (and I don't think this thread can help us in any way unless someone can actually stand up and say what we do wrong in the code, rather than telling us what tools we should use). But just yesterday Calim, who's a nouveau developer, and I discussed on IRC that our user buffer uploads, which happen in any application that doesn't use VBOs, may be uploading about 4x more data than is actually needed. The solution is now known as well; two fixes are required:

1) We should only upload vertices in the range [min_index, max_index] instead of [0, buffer size], the range is given in the parameters of the draw_vbo function (the main drawing function). Already done for r300g (commit), any other driver will have to implement it too as it's a prerequisite for some other optimization I plan to do, so this is inevitable.

2) Interleaved vertex elements in user buffers are uploaded somewhat... redundantly? Not sure if this is the right word. Anyway, if the same user buffer happens to be set in slots 0,1,2,3,4,5, it gets uploaded... guess what... 6 times instead of once. This can only be fixed in the state tracker or in the Mesa core. There are things like varying buffer offsets in those slots, so you can't easily merge the slots together in a driver.
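To put rough numbers on the two fixes above, here is a minimal sketch in Python rather than driver code. It is not how r300g or the state tracker implements anything: `VertexElement`, `naive_upload_bytes` and `ranged_dedup_upload_bytes` are hypothetical names made up for illustration. It also assumes that slots sharing a buffer share one stride and that whole vertices are uploaded, which glosses over the varying per-slot offsets mentioned in 2).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VertexElement:
    """One vertex element slot (hypothetical model, not a Mesa structure)."""
    buffer_id: int  # which user buffer backs this slot
    offset: int     # byte offset of the element inside one vertex
    stride: int     # bytes per vertex in that buffer


def naive_upload_bytes(elements, buffer_sizes):
    """Upload the whole buffer [0, buffer size] once per slot."""
    return sum(buffer_sizes[e.buffer_id] for e in elements)


def ranged_dedup_upload_bytes(elements, min_index, max_index):
    """Upload only vertices [min_index, max_index], once per distinct buffer."""
    vertex_count = max_index - min_index + 1
    strides = {}
    for e in elements:
        # Assumption: every slot backed by the same buffer uses the same stride.
        strides[e.buffer_id] = e.stride
    return sum(stride * vertex_count for stride in strides.values())


# Six interleaved elements that all live in the same 1000-vertex user buffer.
elements = [VertexElement(buffer_id=0, offset=i * 8, stride=48) for i in range(6)]
buffer_sizes = {0: 48 * 1000}

print(naive_upload_bytes(elements, buffer_sizes))       # 288000
print(ranged_dedup_upload_bytes(elements, 100, 349))    # 12000
```

In this made-up draw call the naive path moves 24x more data than needed. The real ratio depends entirely on how the application sets up its buffers and index ranges.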

So now you can see that:
- Performance optimizations are being worked on and things are getting better. Some bottlenecks (or rather coding mistakes?) are now understood.
- No tool can tell you about the two issues described above.

Also, two days ago constant buffer uploads were optimized in r600g (commit), so you should already see a slightly higher frame rate in CPU-limited apps if you are using the latest code. r300g has had a faster winsys since yesterday (commit).

A Big Comparison Of The AMD Catalyst, Mesa & Gallium3D Drive
