
GeForce GTX 1080 Ti: Core i7 7700K vs. Ryzen 7 1800X Linux Gaming Performance

  • #91
    Originally posted by Shevchen View Post

    ... just to chime in:
    I guess, if you want to optimize games for multithreading, one of the things I'd do is to decouple threads and game logic as far as possible and have as many processes running simultaneously as possible. Once a process needs data from another process, it would just access the variable that is currently available and be done with it. You may get small sync problems, as one variable might be from another frame than the other, but I'd try to code it in a way that it doesn't matter much.
    Terrible idea. You have to think in worker threads, one for each hardware thread. Do as much work as possible through a priority queue before the time for the frame runs out; when it runs out, just cut off the remaining work and carry it over to the next frame.
    Last edited by cj.wijtmans; 16 March 2017, 07:11 AM.
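    The "priority queue with a frame budget" idea could be sketched roughly like this. The class, the task fields and the cost units are all made up for illustration; a real engine would budget in actual elapsed time rather than abstract units.

    ```cpp
    #include <queue>
    #include <vector>

    // Sketch: tasks are ordered by priority, and each frame drains the
    // queue until the budget runs out; leftovers wait for the next frame.
    struct Task {
        int priority;  // higher runs first
        int cost;      // pretend time units this task takes
        bool operator<(const Task& o) const { return priority < o.priority; }
    };

    class FrameWorker {
        std::priority_queue<Task> queue_;
    public:
        void submit(Task t) { queue_.push(t); }

        // Run tasks in priority order until the budget is exhausted;
        // returns the priorities actually executed this frame.
        std::vector<int> runFrame(int budget) {
            std::vector<int> ran;
            while (!queue_.empty() && queue_.top().cost <= budget) {
                budget -= queue_.top().cost;
                ran.push_back(queue_.top().priority);
                queue_.pop();
            }
            return ran;  // anything left carries over to the next frame
        }
    };
    ```

    Note the deliberate choice to stop as soon as the highest-priority remaining task doesn't fit: that keeps strict priority order instead of opportunistically squeezing in cheaper low-priority work.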



    • #92
      Originally posted by indepe View Post
      Wishful thinking? Why shouldn't a game have a lot of data that is updated at the beginning of a frame's processing time, and then treated as read-only for the rest of the time, shared by several threads? That might be how I would try to do it.

      In such a case all your assumptions would be completely wrong.
      First of all, even the L3 cache is only a few MB, so any data there will be thrown out quickly anyway, unless it's in use every few thousand clock cycles or so. The only data you could share between different threads in such a case would be some read-only global state, which of course will be very limited. If you try to use more, it will just be thrown out due to how caches work.

      Originally posted by Shevchen View Post
      I guess, if you want to optimize games for multithreading, one of the things I'd do is to decouple threads and game logic as far as possible and have as many processes running simultaneously as possible. Once a process needs data from another process, it would just access the variable that is currently available and be done with it. You may get small sync problems, as one variable might be from another frame than the other, but I'd try to code it in a way that it doesn't matter much.
      Yes, to some extent a good rendering engine should separate different stages of the pipeline into different threads working independently of each other. By doing so, the render thread can work more undisturbed.

      Originally posted by Shevchen View Post
      So, for example, I have a radar system trying to show all the objects in the universe on my radar, and I have all the objects in the universe. It doesn't matter if I miss the correct position on the radar by 1 frame on a certain object, as the radar is a) low resolution and b) one frame later it's correct again. Thus I have a decoupled system that is still performant, as it doesn't have to wait on other threads to "refresh their latest variable". So it doesn't even need to be read-only.

      But... I'm not a coder... it's just how I'd imagine things to be "properly" done for games. On the other hand, if we do hit-reg by octree, we have dependent layers built on top of each other that can't be running amok in the background, and thus one layer has to wait for the previous one etc... *confused*
      It takes a coder to fully understand this, but I can try to explain.

      As a well-known game programmer once said: every programming problem is a data problem. This is also true for game engines, and a good engine should be structured according to the data flow:
      Device input -> Game logic/loop -> Frame rendering

      These steps can of course be separated into separate threads, or rather, they should be. In any real-time game, the game logic should run at a constant tick/clock rate, totally independent of the rendering. Simpler game engines run the game logic/loop in the same thread as the rendering. One good, well-known example of this is Euro Truck 2: if the rendering gets slow or the game is loading data, the game tick slows down with it, and when it catches up again it processes the input you gave while the frame rate was low. In this particular game it's very evident for anyone playing for a few hours, so it's a very good example of how not to do it.
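      A minimal sketch of such a fixed-tick loop, decoupled from the frame rate; the names and the 60 Hz tick length are illustrative, not from any particular engine:

      ```cpp
      // Fixed-timestep loop: the simulation advances in constant ticks
      // regardless of how long rendering takes. Each rendered frame feeds
      // its real elapsed time into an accumulator, and the logic runs as
      // many whole ticks as fit in it.
      const double kTickSeconds = 1.0 / 60.0;

      struct Game {
          long tick = 0;
          double accumulator = 0.0;

          // Called once per rendered frame; returns how many logic
          // ticks were run to catch up with real time.
          int advance(double frameSeconds) {
              accumulator += frameSeconds;
              int ran = 0;
              while (accumulator >= kTickSeconds) {
                  ++tick;                    // stand-in for one logic step
                  accumulator -= kTickSeconds;
                  ++ran;
              }
              return ran;
          }
      };
      ```

      A slow frame simply makes the next call run several ticks, so game time stays tied to wall-clock time instead of to the frame rate.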

      It has become common to process the game logic separately. This is of course not without problems either, as new challenges emerge. But at least it achieves the most important thing: the game logic can run at a constant tick while the rendering goes at its own pace. As anyone familiar with multithreading knows, when different threads modify data you can run into what we call data hazards. The game logic thread will constantly be changing the positions of all the objects in the game world, but what happens if you render a bunch of objects while the game logic is updating their positions? You'll get twitching animations and objects slightly "jumping around", a problem you can spot in most games even today. It astounds me how many graphically and creatively amazing effects the artists are able to make, yet the experience is ruined for anyone able to spot all the small glitches, twitches and stutter in the animation. This is not the artists' fault; it's purely due to faulty engine design.

      So the fundamental problem is sharing of data between threads: one is updating it while the other is using it to render things. But think about it for a moment: a rendered frame should be a snapshot in time, regardless of the frame rate. This means that the game state needs to be consistent throughout the rendering of a single frame. What you have here is a synchronization problem, and most programmers will jump to the conclusion of using locks (like a mutex) to prevent the threads from interfering with each other. But if you use locks, then you're back to the same speed as running both in the same thread, or probably even worse, since locks have overhead of their own.

      The good news is that the solution is actually "simple"; the bad news is that it can't easily be applied to an existing engine, since it touches a fundamental design choice. The solution is to create a snapshot of the game state for the renderer: for every iteration of the game logic, it creates a new snapshot of the finished state of all objects in the game, so it can continue working without interfering with the rendering.

      We can use something called "triple buffering", a kind of lock-free/non-blocking algorithm for passing data between threads, which works flawlessly as long as you have one thread producing states and one thread consuming them. You only need three pointers and a few lines of code to implement it, and the two threads can work asynchronously without any problems.

      It works like this:
      - The game logic thread creates a new state, when it's done it swaps two pointers, indicating which state is the latest.
      - The rendering thread starts each frame by taking the latest state, freeing up the previous one.
      Since this is triple buffering, there will always be one free state, so each end can always swap its pointer. The benefit of this approach is that it doesn't matter how long either end takes to produce or consume states, and the data hazards are still avoided.
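      A minimal single-producer/single-consumer triple buffer along those lines might look like this in C++. The `State` type and the bit-packing of the dirty flag are my own illustrative choices, not from the post:

      ```cpp
      #include <atomic>
      #include <array>

      struct State { long tick = -1; };  // stand-in for a real game-state snapshot

      // Lock-free triple buffer: one producer (game logic) and one
      // consumer (renderer). Each side owns one slot outright; the third
      // slot is handed back and forth through a single atomic word that
      // packs the slot index (bits 0-1) and a "new data" flag (bit 2).
      class TripleBuffer {
          std::array<State, 3> slots_;
          std::atomic<unsigned> middle_{1};  // slot 1, flag clear
          unsigned back_  = 0;               // owned by the producer
          unsigned front_ = 2;               // owned by the consumer
      public:
          State&       back()        { return slots_[back_]; }   // producer writes here
          const State& front() const { return slots_[front_]; }  // consumer reads here

          // Producer: publish the back buffer as the newest state and
          // reclaim whatever was in the middle.
          void publish() {
              unsigned prev = middle_.exchange(back_ | 4, std::memory_order_acq_rel);
              back_ = prev & 3;
          }

          // Consumer: grab the newest state if one was published since
          // the last fetch; returns true if front() changed.
          bool fetch() {
              if (!(middle_.load(std::memory_order_acquire) & 4))
                  return false;  // nothing new
              unsigned prev = middle_.exchange(front_, std::memory_order_acq_rel);
              front_ = prev & 3;
              return true;
          }
      };
      ```

      If the producer publishes twice before the consumer fetches, the older snapshot is simply reclaimed, so the renderer always sees the latest complete state and neither side ever waits.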

      As I mentioned, the principle is simple and sound, but it requires a rewrite of the game engine. First of all, it usually breaks completely with the OOP design in most engines, since there will now be multiple states of a single "object" in the game, and the object states are collected into snapshots rather than stored encapsulated inside each object like OOP wants. So the design has to partly "break" with OOP, or ditch it altogether. But when you do, you finally get frames that are perfect snapshots in time without any glitches, and if you're smart and produce a state the rendering thread can more or less pass straight to the GPU, the rendering thread will not be CPU bottlenecked at all.
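      One way to picture that "break with OOP": per-tick object state lives in flat snapshot structures produced by the logic thread, rather than inside each object. A toy sketch, with made-up field names:

      ```cpp
      #include <vector>

      // Plain, trivially copyable per-object state; no behavior attached.
      struct ObjectState {
          float x, y;
      };

      // A snapshot collects the state of every object for one tick.
      struct Snapshot {
          long tick;
          std::vector<ObjectState> objects;
      };

      // The logic thread builds a fresh snapshot each tick from the
      // previous one, leaving the old snapshot untouched so the renderer
      // can keep reading it.
      Snapshot simulateTick(const Snapshot& prev) {
          Snapshot next{prev.tick + 1, prev.objects};
          for (auto& o : next.objects)
              o.x += 1.0f;  // stand-in for real game logic
          return next;
      }
      ```

      Because each snapshot is immutable once produced, it can be handed to the renderer (e.g. through the triple buffer) without any locking.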



      • #93
        Originally posted by efikkan View Post
        First of all, even the L3 cache is only a few MB, so any data there will be thrown out quickly anyway, unless it's in use every few thousand clock cycles or so. The only data you could share between different threads in such a case would be some read-only global state, which of course will be very limited. If you try to use more, it will just be thrown out due to how caches work.
        L3 cache on Ryzen 7 is 16 MB. That's a nice amount.



        • #94
          Originally posted by indepe View Post
          L3 cache on Ryzen 7 is 16 MB. That's a nice amount.
          It still wouldn't matter for multiple threads crunching many GB/s, there is very little data sharing for such loads.



          • #95
            Originally posted by efikkan View Post
            It still wouldn't matter for multiple threads crunching many GB/s, there is very little data sharing for such loads.
            1 GB/s at a frame rate of 100 Hz is 10 MB per frame, less than the L3 cache size, and obviously it would matter if even part of that doesn't have to be loaded from main memory. So you seem to be contradicting yourself here.



            • #96
              Originally posted by indepe View Post
              1 GB/s at a frame rate of 100 Hz is 10 MB per frame, less than the L3 cache size, and obviously it would matter if even part of that doesn't have to be loaded from main memory. So you seem to be contradicting yourself here.
              I hope you are joking now, otherwise I'm starting to get worried about you; you know I said many GB/s. The game logic alone can touch up to several GB for each game tick, which is usually 60 Hz or more for desktop games. When it comes to rendering, it easily surpasses several hundred MB per frame, and then probably more within the driver itself.

              The L3 cache doesn't work by prioritizing what you think is "important" or what "will be shared". For two threads to share data in the cache, they have to be "synchronized" and access the same addresses within a few dozen ns. When you read 20 MB of data, most of the old L3 contents are already gone. The cache consists of a number of banks, each with its own LRU replacement; for something to stay there for a while, it has to be used continuously. In the real world, multiple threads will share nearly no data in the L3 cache.



              • #97
                Originally posted by efikkan View Post
                I hope you are joking now, otherwise I'm starting to get worried about you; you know I said many GB/s. The game logic alone can touch up to several GB for each game tick, which is usually 60 Hz or more for desktop games. When it comes to rendering, it easily surpasses several hundred MB per frame, and then probably more within the driver itself.
                ...
                A frame rate of 60 fps is a good number for 4K resolution, but otherwise not a winning number in this context; 100 fps is a better number.

                I suppose "it easily surpasses several hundred MB per frame" can be interpreted as more than 300 megabytes per frame? Because if you meant 200 MB, you would have said "a couple hundred", right?



                • #98
                  Originally posted by efikkan View Post
                  I hope you are joking now, otherwise I'm starting to get worried about you; you know I said many GB/s. The game logic alone can touch up to several GB for each game tick, which is usually 60 Hz or more for desktop games. When it comes to rendering, it easily surpasses several hundred MB per frame, and then probably more within the driver itself.

                  The L3 cache doesn't work by prioritizing what you think is "important" or what "will be shared". For two threads to share data in the cache, they have to be "synchronized" and access the same addresses within a few dozen ns. When you read 20 MB of data, most of the old L3 contents are already gone. The cache consists of a number of banks, each with its own LRU replacement; for something to stay there for a while, it has to be used continuously. In the real world, multiple threads will share nearly no data in the L3 cache.
                  P.S.: My previous post was a hint: your numbers are impossible. Not a joke. Most significantly, you are mistaken, by orders of magnitude, about how many main-memory cacheline accesses are possible within "a few dozen ns". At first I wondered if you meant "ms" rather than "ns", but a frame rate of "60 Hz or more" doesn't even allow two dozen ms.

                  Once you correct your numbers (and you will be surprised by how much you need to do so), you will find that a 16 MB L3 cache may easily matter in a 100 fps game with multiple threads.

