The Linux Kernel's Scheduler Apparently Causing Issues For Google Stadia Game Developers
Some of the issues that game developers have been facing in bringing their games to Linux for Google's Stadia cloud gaming service apparently stem from the kernel scheduler. We've known the Linux kernel scheduler could use some improvements, and independent developers like Con Kolivas with BFS / MuQSS have pushed for them, but hopefully in 2020 we'll see some real action.
Game/C++ developer Malte Skarupke wrote a post about how bad the Linux kernel scheduler is and that solutions like MuQSS are an improvement but not complete. Malte noted, "I found that most mutex implementations are really good, that most spinlock implementations are pretty bad, and that the Linux scheduler is OK but far from ideal. The most popular replacement, the MuQSS scheduler has other problems instead. (the Windows scheduler is pretty good though)."
The latest kernel scheduler woes appear to be game/engine developers hitting issues in readying their software for Google Stadia. "So this all started like this: I overheard somebody at work complaining about mysterious stalls while porting Rage 2 to Stadia. The only thing those mysterious stalls had in common was that they were all using spinlocks. I was curious about that because I happened to be the person who wrote the spinlock we were using. The problem was that there was a thread that spent several milliseconds trying to acquire a spinlock at a time when no other thread was holding the spinlock. Let me repeat that: The spinlock was free to take yet a thread took multiple milliseconds to acquire it. In a video game, where you have to get a picture on the screen every 16 ms or 33 ms (depending on if you’re running at 60hz or 30hz) a stall that takes more than a millisecond is terrible. Especially if you’re literally stalling all threads. (as was happening here) In our case we were able to make the problem go away by replacing spinlocks with mutexes."
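To make the spinlock-versus-mutex swap described above concrete, here is a minimal C++ sketch. It is not the spinlock from Malte's post or from any shipping game: the `SpinLock` class and the `count_with_lock` helper are hypothetical names invented for illustration, and the lock is a plain test-and-set design that yields while waiting. The point is only that a lock exposing `lock()` / `unlock()` can be exchanged for `std::mutex` without touching the calling code.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical minimal test-and-set spinlock, for illustration only.
class SpinLock {
public:
    void lock() {
        for (;;) {
            // exchange() returns the previous value: false means we got the lock.
            if (!locked_.exchange(true, std::memory_order_acquire)) {
                return;
            }
            // While the lock is held, wait on a cheap relaxed load and yield
            // so the scheduler has a chance to run the current holder.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() { locked_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> locked_{false};
};

// Hypothetical helper: increments a shared counter from several threads,
// protected by whichever lock type is passed as the template argument.
template <typename Lock>
std::int64_t count_with_lock(int num_threads, int iterations) {
    Lock lock;
    std::int64_t counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<Lock> guard(lock);  // works with any lock()/unlock() type
                ++counter;
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter;
}
```

Calling `count_with_lock<SpinLock>(4, 10000)` and `count_with_lock<std::mutex>(4, 10000)` runs the same workload under each lock. With `std::mutex` a waiting thread can be put to sleep by the kernel, whereas a spinning thread keeps consuming its time slice, which is why scheduler behavior matters so much more for spinlocks.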
Responding to a comment by MuQSS lead developer Con Kolivas, Malte wrote, "I know that we were not the only developers who had problems with the scheduler on Stadia. And Google is very aware of the problem. They care a lot about latency because latency is super important for the Stadia experience. And one of the ways they’ve reduced latency is to run games at 60hz that run at 30hz on console. But that means you only have 16ms to get a frame on the screen, and if the scheduler gives you a random hitch of a millisecond, you’re screwed. So this might be an opportunity for you to get [the MuQSS] scheduler used by more people and to maybe eventually get it into the mainline kernel. If you can solve the problem that ticket_spinlock ran into, I would recommend your scheduler over the default scheduler unreservedly. And maybe you can reach out to Google and see if they want to use your scheduler for Stadia."
He also posted some mutex benchmark code that I'm now looking at for possible use in the Phoronix Test Suite (PTS) for comparing kernels. More on spinlock vs. mutex performance can be found in his blog post.
Let's hope for scheduler improvements to the Linux kernel in 2020, and maybe even for MuQSS to be mainlined if it gains enough support.