Mesa Threaded OpenGL Dispatch Finally Landing, Big Perf Win For Some Games


  • #11
    Originally posted by indepe View Post

    Well, insofar as described in the article, in theory this technique can be used with any API, single- or multi-threaded, or at any level of an application's internal or external call stack. In the case of Vulkan, however, it would be less likely to be a meaningful way to split up the work onto multiple threads, since other options exist.

    Usually, I think, you would dispatch work at a higher level. Doing it at the external API level means it can be implemented by a library below the application level, transparently to the application. In a sense, it is a substitute for multithreading within the application, and it may work very well in some cases and not so well in others (where it involves unnecessary copying of data, for example).

    Makes me wonder if it could be combined with GLVND to optionally benefit any/all OpenGL drivers on Linux.
    This has nothing to do with anything but GL specifically.



    • #12
      Originally posted by funfunctor View Post

      This has nothing to do with anything but GL specifically.
      I know. What I am saying is that, in theory, instead of "Stash the GL calls in a batchbuffer" you could also "Stash the Vulkan calls in a batchbuffer" or "Stash the printf calls in a batchbuffer". Actually, the latter is something I was planning to do (or at least try out) in the coming week, even though printf is not a single-threaded API (or at least the printf implementation I am using for debugging purposes doesn't mix output from concurrent calls).
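
      Roughly what I have in mind, as a minimal sketch: one worker thread owns the output and drains a batch buffer that the other threads stash formatted messages into. The names (batched_printf, writer_thread) and the buffer layout are made up, and a real version would also need a proper flush/shutdown path:
      Code:
      #include <pthread.h>
      #include <stdarg.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <unistd.h>

      #define BATCH_CAP 256

      static char *batch[BATCH_CAP];
      static unsigned head, tail;            /* tail - head = queued messages */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
      static pthread_cond_t nonfull = PTHREAD_COND_INITIALIZER;

      /* Producer side: format up front ("stash the printf call"), copy the
         result into the batch buffer, and return without touching stdout. */
      static void batched_printf(const char *fmt, ...)
      {
          char buf[512];
          va_list ap;

          va_start(ap, fmt);
          vsnprintf(buf, sizeof buf, fmt, ap);
          va_end(ap);

          pthread_mutex_lock(&lock);
          while (tail - head == BATCH_CAP)   /* buffer full: the one sync point */
              pthread_cond_wait(&nonfull, &lock);
          batch[tail++ % BATCH_CAP] = strdup(buf);
          pthread_cond_signal(&nonempty);
          pthread_mutex_unlock(&lock);
      }

      /* Consumer side: a single worker drains the batch in order, so output
         from concurrent callers never interleaves. */
      static void *writer_thread(void *arg)
      {
          (void)arg;
          for (;;) {
              pthread_mutex_lock(&lock);
              while (head == tail)
                  pthread_cond_wait(&nonempty, &lock);
              char *msg = batch[head++ % BATCH_CAP];
              pthread_cond_signal(&nonfull);
              pthread_mutex_unlock(&lock);

              fputs(msg, stdout);            /* the real write happens here */
              free(msg);
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t worker;

          pthread_create(&worker, NULL, writer_thread, NULL);
          for (int i = 0; i < 4; i++)
              batched_printf("message %d from the batching layer\n", i);
          sleep(1);                          /* crude drain for the demo */
          return 0;
      }
      The producer only pays for the vsnprintf and a copy; the worker owns stdout. This is the same shape as the GL case, just with a much cheaper payload.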

      Of course, you were correct to point out that the single-threaded nature of OpenGL does not generally apply to GPUs, but is specific to the OpenGL API itself.
      Last edited by indepe; 06 February 2017, 03:24 AM.



      • #13
        I did some tests; for many apps (csgo, glxgears, gputest) mesa_glthread does not work. The log shows only:
        Code:
        _mesa_glthread_init
        _mesa_glthread_destroy

        Talos result (Ultra):
                             ren_bMultiThreadedRendering=1   ren_bMultiThreadedRendering=0
        mesa_glthread=true   44.5\61.4\33.9                  43.9\60.4\32.6
        mesa_glthread=false  45.5\62.1\33.3                  43.7\59.9\32.4



        • #14
          Originally posted by Mark Rose View Post
          I hope this helps with ETS2!
          I believe that game would benefit from a bit more polishing from the developer. Even on Windows, the difference between DirectX and OpenGL is huge.



          • #15
            Originally posted by indepe View Post

            I know. What I am saying is that, in theory, instead of "Stash the GL calls in a batchbuffer" you could also "Stash the Vulkan calls in a batchbuffer" or "Stash the printf calls in a batchbuffer". Actually, the latter is something I was planning to do (or at least try out) in the coming week, even though printf is not a single-threaded API (or at least the printf implementation I am using for debugging purposes doesn't mix output from concurrent calls).

            Of course, you were correct to point out that the single-threaded nature of OpenGL does not generally apply to GPUs, but is specific to the OpenGL API itself.
            Ah, I see where your misunderstanding is coming from now. Right, so with Vulkan you don't need to do that, as Vulkan already has things like command buffers, where you can configure as many concurrent streams as you like and handle all of that yourself. This is what is meant by Vulkan being "low level": you, the user, become responsible for setting up the threads, buffers and whatever else, packing them with data and sending them on their way. Hope this helps without being too technical; let me know if you still don't understand and I can explain more deeply.
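
            To make that concrete, here is a minimal sketch of per-thread command buffer recording. It assumes a VkDevice, VkQueue and queue family index already exist, skips all error handling, and leaks the pools for brevity:
            Code:
            #include <pthread.h>
            #include <vulkan/vulkan.h>

            /* Each recording thread gets its own VkCommandPool: a pool (and
               the command buffers allocated from it) must not be used by two
               threads at once, but separate pools record fully in parallel. */
            struct record_args {
                VkDevice device;
                uint32_t queue_family;
                VkCommandBuffer cmd;               /* filled in by the thread */
            };

            static void *record_thread(void *p)
            {
                struct record_args *a = p;

                VkCommandPoolCreateInfo pool_info = {
                    .sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO,
                    .queueFamilyIndex = a->queue_family,
                };
                VkCommandPool pool;
                vkCreateCommandPool(a->device, &pool_info, NULL, &pool);

                VkCommandBufferAllocateInfo alloc_info = {
                    .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO,
                    .commandPool = pool,
                    .level = VK_COMMAND_BUFFER_LEVEL_PRIMARY,
                    .commandBufferCount = 1,
                };
                vkAllocateCommandBuffers(a->device, &alloc_info, &a->cmd);

                VkCommandBufferBeginInfo begin = {
                    .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
                };
                vkBeginCommandBuffer(a->cmd, &begin);
                /* ... record this thread's slice of the frame here ... */
                vkEndCommandBuffer(a->cmd);
                return NULL;
            }

            /* One thread gathers the results and submits them in whatever
               order the application wants; only this step is serialized. */
            static void record_and_submit(VkDevice dev, VkQueue queue, uint32_t family)
            {
                enum { N = 4 };
                pthread_t threads[N];
                struct record_args args[N];
                VkCommandBuffer cmds[N];

                for (int i = 0; i < N; i++) {
                    args[i] = (struct record_args){ dev, family, VK_NULL_HANDLE };
                    pthread_create(&threads[i], NULL, record_thread, &args[i]);
                }
                for (int i = 0; i < N; i++) {
                    pthread_join(threads[i], NULL);
                    cmds[i] = args[i].cmd;
                }

                VkSubmitInfo submit = {
                    .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
                    .commandBufferCount = N,
                    .pCommandBuffers = cmds,
                };
                vkQueueSubmit(queue, 1, &submit, VK_NULL_HANDLE);
            }
            The point is that recording needs no locking at all because each thread owns its own pool; only the final vkQueueSubmit is serialized.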



            • #16
              Originally posted by funfunctor View Post

              Ah, I see where your misunderstanding is coming from now. Right, so with Vulkan you don't need to do that, as Vulkan already has things like command buffers, where you can configure as many concurrent streams as you like and handle all of that yourself. This is what is meant by Vulkan being "low level": you, the user, become responsible for setting up the threads, buffers and whatever else, packing them with data and sending them on their way. Hope this helps without being too technical; let me know if you still don't understand and I can explain more deeply.
              Not sure what the misunderstanding is. I'd say with OpenGL this technique may make sense because the CPU overhead of API calls is high, so you might try to move that overhead to a different core. With Vulkan you have lower overhead (aside from the fact that this approach has to compete with other ways of making use of multiple cores).



              • #17
                This is much more important than the on-disk cache.



                • #18
                  It often seems that the open-source AMD drivers are bottlenecked by the CPU. I could see how this feature may substantially improve performance, particularly in games that use post-processing.

                  Originally posted by atomsymbol
                  In my opinion, glthread (if enabled from command-line) will be slowing down a large number of OpenGL apps this year (2017) and this issue won't be resolved until year 2018+.

                  On-disk cache is much closer to being capable of working as expected/intended in year 2017 than glthread (mareko).
                  Though I agree that ODC is much closer to being readily usable in 2017, I don't really understand why glthread would slow down many GL apps. I could understand it having no impact on some games, or maybe slowing things down depending on the CPU used, but not so much depending on the app itself.



                  • #19
                    Originally posted by atomsymbol
                    Well. I compiled https://cgit.freedesktop.org/~mareko/mesa/?h=glthread, ran a game and observed a performance decrease of up to 60%. I am not claiming that glthread doesn't benefit some other games.
                    Have you narrowed down that glthread was the issue?
                    Originally posted by atomsymbol
                    In general, the only thing that decides whether multi-threaded code performing a task is faster than single-threaded code performing the same task is whether it is computable beforehand/in advance that the former is faster than the latter. If it cannot be computed that it is faster, it may just as well be slower.
                    Understood, but this is why I think there would be either a negligible performance improvement or a drastic one. The article states that the GL calls are queued, implying that even though the threads are processed in parallel, the calls are still meant to be executed in a specific order; it doesn't imply they're dependent on each other. In other words, if the GL calls were spread across multiple threads that all had to complete a single image, there could be drastic performance decreases, because all threads would be working toward a single task: if one of the threads wasn't done, all of the others would have to wait for it, which in turn hurts performance. But since they're queued, that suggests the threads are not explicitly dependent on each other, in which case there should be little to no decrease in performance, while any increase would depend on how much each CPU core is bottlenecked.
                    Last edited by schmidtbag; 06 February 2017, 12:34 PM.



                    • #20
                      Originally posted by schmidtbag View Post
                      Have you narrowed down that glthread was the issue?

                      Understood, but this is why I think there would be either a negligible performance improvement or a drastic one. The article states that the GL calls are queued, implying that even though the threads are processed in parallel, the calls are still meant to be executed in a specific order; it doesn't imply they're dependent on each other. In other words, if the GL calls were spread across multiple threads that all had to complete a single image, there could be drastic performance decreases, because all threads would be working toward a single task: if one of the threads wasn't done, all of the others would have to wait for it, which in turn hurts performance. But since they're queued, that suggests the threads are not explicitly dependent on each other, in which case there should be little to no decrease in performance, while any increase would depend on how much each CPU core is bottlenecked.
                      From Gregory Hainaut @ https://lists.freedesktop.org/archiv...ry/143190.html
                      Here is some feedback from my quick test.

                      On PCSX2 (a PS2 emulator), I noticed that synchronization badly impacts performance. In my case the syncs are mostly related to texture transfers (CPU->GPU) and clear-buffer functions. Strangely, I didn't notice anything related to BufferSubData*, but I guess it is the same. Those functions trigger a sync because of the pointer parameter. However, texture transfers could use a PBO, so it isn't a real pointer. And clear uses a pointer to a color, hence a small payload (the worst case is likely around 16/32 bytes); IMHO it can surely be inlined/memcpy'd in the GL dispatcher (otherwise the old GL2 clear API is sync-free).

                      I hacked the code to remove the sync on texture transfers and I got a major speed boost. I didn't count the number of draw calls or the sync ratio, but I suspect the perf impact could depend on how the syncs are distributed. Unlike my case, I guess Borderlands 2 uploads/clears buffers/textures/uniforms at the start of the frame, which means various small syncs at the start of the frame (and these might be optimized as a spin lock); the hot rendering loop might therefore be sync-free, hence the speed boost.

                      To conclude, based on my single test case, the current state of the code isn't yet optimal, and that might explain why few apps see any perf improvement so far. But the potential is here.
                      So by threading the various GL calls, additional synchronization is added. That synchronization *can* lead to performance degradation for some applications, depending on how they use the GL API. And this is why threaded dispatch isn't enabled by default, but would have an environment variable or drirc opt-in mechanism to selectively enable it for specific applications.
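
                      Schematically, and purely as an illustration of the trade-off Gregory describes (the names, the size cutoff and the queue stubs below are all made up, not Mesa's actual code), a call that takes a client pointer forces the dispatcher into one of two branches:
                      Code:
                      #include <stddef.h>
                      #include <string.h>

                      #define INLINE_LIMIT 128          /* made-up "small payload" cutoff */

                      struct gl_call {
                          int opcode;
                          size_t size;
                          unsigned char inline_data[INLINE_LIMIT];
                          const void *ptr;              /* used only for large payloads */
                      };

                      /* Stand-ins for the real queue machinery: enqueue() copies the
                         struct into the batch buffer, drain() blocks until the worker
                         thread has consumed every queued call. */
                      static void enqueue(const struct gl_call *c) { (void)c; }
                      static void drain(void) { }

                      static void marshal_ptr_call(int opcode, const void *data, size_t size)
                      {
                          struct gl_call c = { .opcode = opcode, .size = size, .ptr = NULL };

                          if (size <= INLINE_LIMIT) {
                              /* Small payload (a clear color is ~16/32 bytes): memcpy it
                                 into the batch and return at once, no synchronization. */
                              memcpy(c.inline_data, data, size);
                              enqueue(&c);
                          } else {
                              /* Large client-memory payload (e.g. a texture transfer):
                                 the worker reads through the pointer, so the dispatcher
                                 must wait before returning. This is the stall PCSX2 hit;
                                 uploading via a PBO avoids it, since the source is then
                                 a GPU buffer rather than a raw client pointer. */
                              c.ptr = data;
                              enqueue(&c);
                              drain();                  /* the expensive sync */
                          }
                      }
                      Which branch a given call lands in, and how those syncs cluster within a frame, is what decides whether an app sees a win or the kind of stall described above.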

