A Low-Latency Kernel For Linux Gaming


  • minuseins
    replied
    @YoungManKlaus: Thanks for the link!

    @ownagefool: Due to the already lengthy post I had to cut and edit a few things. My 100-120 FPS figure relates only to non-interactive usage: determining whether a motion is fluid and natural, and whether the picture is stable. It also applies only to 2D content.
    3D content needs about 100-120 FPS per eye, and it is one of the few applications where you'll more easily notice differences if you raise that to 200-240 FPS per eye.

    As for the non-interactive 2D content: there isn't much difference between 120 FPS and 240 FPS, or for those with a 50Hz power grid: 100 and 200 FPS. Please don't confuse that with those 100/200/400Hz televisions or projectors. I'm talking about individual frames, not repeated or interpolated in-between frames.

    There is also a slight difference between interactive and non-interactive usage, which means you're going to have at least an 8-10 ms (100/120 Hz) delay before you receive the visual feedback. Interactive usage also improves if you speed things up here.

    But a 1k FPS system? At the moment I see only two technologies capable of that: laser projection and OLEDs (once we change how we drive those displays, because imagine a 4K image at 4 bytes per pixel and you'll very soon run into the problem of transporting ~33 GB/s. Ever wanted to fill a 2TB hard disk in a minute?).
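    That ~33 GB/s figure checks out with simple arithmetic. A quick sanity check (assuming 3840x2160 "4K UHD", 4 bytes per pixel, 1000 FPS, uncompressed, no protocol overhead):

    ```python
    # Sanity check of the ~33 GB/s figure: uncompressed 4K at 1000 FPS.
    # Assumptions: 3840x2160 resolution, 4 bytes per pixel, no compression.
    width, height = 3840, 2160
    bytes_per_pixel = 4
    fps = 1000

    bytes_per_frame = width * height * bytes_per_pixel   # ~33.2 MB per frame
    bytes_per_second = bytes_per_frame * fps             # ~33.2 GB/s

    print(f"{bytes_per_frame / 1e6:.1f} MB per frame")
    print(f"{bytes_per_second / 1e9:.1f} GB/s")

    # And indeed a 2 TB disk fills in about a minute at that rate:
    seconds_to_fill_2tb = 2e12 / bytes_per_second
    print(f"2 TB filled in {seconds_to_fill_2tb:.0f} s")
    ```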

    Then you would have to eliminate network gaming and international competitions, or only allow those on LANs, because you'll never be able to achieve true 1k FPS if the network data needs more than 1 ms to be transported.

    So while I won't claim that no one will ever notice the difference between true 120 FPS and true 1000 FPS, given my experience with high-FPS systems I have my doubts that a 1k FPS display would be that "clearly much better experience". Especially on its own.



  • ownagefool
    replied
    Originally posted by YoungManKlaus View Post
    That was a most awesome read, minuseins
    Another great demo of how even little input lag can be important to perception for those who still don't believe that this is noticeable: http://techcrunch.com/2012/03/09/mic...-it-to-market/
    That's an awesome video, and it pretty much rebuts anyone talking about higher figures.

    As for minuseins' post, it was very good, and it seems like he knows his stuff, but I'd still suggest high-end players can tell the difference at numbers beyond those suggested. As shown in the video above, one would assume we're looking at the difference between 100fps and 1000fps type displays, and the 1000 is clearly a much better experience.



  • YoungManKlaus
    replied
    That was a most awesome read, minuseins
    Another great demo of how even little input lag can be important to perception for those who still don't believe that this is noticeable: http://techcrunch.com/2012/03/09/mic...-it-to-market/



  • minuseins
    replied
    History, it seems, repeats itself... (sorry for the rather lengthy post, but I got carried away at the end...)

    ... or at least for those who have been using Linux a little bit longer... The discussion about RT-Kernels for Gaming, Servers, Default-Kernel, or default Nose-Hair-Smell is about as old as it gets (and was fought especially hard when they started introducing 250 different process scheduler schemes (was it 2.0, 2.2 or 2.4?)). As are the benchmarks, and the "findings".

    As for the "article"... I know that within the Linux world this site is often called "Moronix", and certainly there are reasons for that, and while I do appreciate the effort, this article is indeed unnecessary. It is one thing to just run artificial benchmarks on default configurations and then claim which distribution is the fastest, but it is - sorry about that - a dumb thing to do the same to "prove" that a common optimization "myth" doesn't work.
    I think the German proverb for that is: "Wer viel misst, misst Mist!" ("he who measures a lot measures rubbish").

    RT-Kernels especially (next to compiler options) are very tricky, and in almost all cases require far more maintenance and configuration than a simple "apt-get install" to get the full benefit. They usually also require an application specifically written to utilize such an environment.

    And the benefit of an RT-Kernel is in almost all cases not raw throughput, which is what was measured here with an arbitrary FPS number. No, RT-Kernels do not improve performance by default. Actually, the raw performance of a single big, CPU-hungry application usually suffers. Why? Well, that was already explained. RT-Kernels usually only guarantee that each process, interrupt, semaphore, etc. is granted a max. reaction/response time for its "request", and with that comes a certain amount of guaranteed CPU time to process the request (I am dumbing this down significantly, because of the complexity). And with x86 configurations that's usually only guaranteed for the CPU, because most other components (PCIe bus, graphics, etc.) DO NOT GUARANTEE any response or processing time. To compensate, any x86 configuration usually uses a lot of caching. That's why real RT systems are usually dedicated systems with dedicated components, far fewer caching options and only one or two applications running on them (which are also written specifically for that environment and purpose).

    Because the scheduler and working principles of an RT-Kernel treat each process equally (same response time, only a fixed slice of CPU time before the process is force-switched to wait until its turn comes again), there is less overall CPU time available for one single heavy process, which in return limits its raw throughput. That's why - for example - you see Doom 3 (an older application, written mostly as a monolithic single-threaded process) suffer the most from the use of a "default" RT-Kernel, while Unigine (a newer application, written with multi-processing and multi-threading in mind) is equal or actually performs slightly (very, very slightly) better.
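    A toy sketch of that capping effect (pure illustration; nothing like the real kernel scheduler, just strict fixed-quantum round-robin): a single CPU-hungry task is limited to 1/N of the CPU while N tasks are runnable, no matter how much work it has queued.

    ```python
    # Toy model: strict round-robin with fixed time slices. The heavy
    # task owns every n_tasks-th quantum and nothing more, so its
    # throughput is capped at 1/n_tasks of the wall time.
    def round_robin_share(total_ms, quantum_ms, n_tasks):
        """CPU time (ms) one task receives under fixed-quantum round-robin."""
        slots = total_ms // quantum_ms
        return (slots // n_tasks) * quantum_ms

    # 1 second of wall time, 10 ms quanta:
    print(round_robin_share(1000, 10, 1))   # alone: 1000 ms, the full CPU
    print(round_robin_share(1000, 10, 4))   # among 4 runnable tasks: 250 ms
    ```

    A throughput-oriented scheduler would instead let the heavy task soak up whatever the lighter tasks leave idle, which is why the monolithic single-threaded game loses the most here.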

    If you want higher FPS, don't use an RT-Kernel; even a "renice -20" would work better. If it is just FPS you are after, there are several better options available to you: changing the WM/desktop system (Compiz/Unity still hogs performance), changing the distribution (less background bloat), compiler options, running X only when the application needs it, etc...

    However, as was pointed out already: the overall responsiveness of the system CAN improve with an RT-Kernel, which in return makes gaming more fluid and fun, especially if the system is already taxed to the max. Nothing is more frustrating than a drop in frames every now and then, which usually occurs at the moments when you can least afford it. What you usually see with an RT-Kernel is a more steady FPS while the system is under load, which in return reduces the overall latency of the whole system (including the device sitting in front of the display).

    As for human reaction time... there are some significant issues/myths here, especially with most of the numbers out there. Those numbers usually refer to untrained, conscious decision making. People who care THAT much about their system latency are usually not untrained at what they are doing, and it has been shown that decision making during highly volatile real-time reaction games is far closer to an unconscious/instinctive reaction than the general-purpose conscious decision making those numbers usually refer to (and let's not forget peripheral vision here; well, let's, because it would make this more complicated). I don't have the numbers on me, but I think only at about 100-120 FPS do we become truly incapable of telling/feeling the difference between 120 FPS and, let's say, 240 FPS (which would suggest a threshold of about 10 ms, not 100 ms). That's why I have an issue with people who throw numbers around or say that humans are incapable of telling the difference. There is a difference between tricking the brain into assuming a "fluid" motion at 24 FPS (cinema) and a truly stable picture with fluid motion. I appreciate that most of you will not notice any differences immediately and consciously, but I bet that, even if you can't put a finger on it, you will see the difference once you have the chance to compare those side-by-side.

    Also, modern interactive computing/gaming is basically applied time-dilation, because what we see and react to is usually already several frames in the past compared to what we think is happening now. And that's where the issue with display latency comes from.

    Mathematically it is really easy to determine that if we want to "react" in time (now) at 60 FPS, we've got about 15 ms (actually 16.6, but 15 is nicer to calculate with) for each frame. That includes computing and moving the data around quite a lot, until the final picture is encoded for transfer to the monitor and decoded again. And in fact, that's never going to happen. Not because humans are too slow; no, in fact the computers are too slow, and there are some limitations governed by the laws of physics, at least for a straightforward approach. The bandwidth alone needed to update a single picture, flawlessly and losslessly, on an LCD is mind-boggling. If we didn't resort to some dark magic and used only a single line to transmit all the bits and bytes in time, the HDMI cable could only be about 7 cm long to ENSURE that the picture is transferred in time (1080p/60). That's without any additional data like protocol overhead, sound, etc.

    Any longer and you'll experience delay/latency just because of the physical distance between computer and monitor. And while the overall bandwidth of HDMI (single-link) is about 3 times higher than my example requires, all that effectively means is that the HDMI cable could be about 3 times longer before you would experience any delay from that physical distance alone. I know the actual value is higher, but only because of some of those dark-magic tricks we use these days.
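    A rough reconstruction of that "7 cm" back-of-the-envelope figure (assumptions: 1080p60 at 24 bits per pixel, a single serial line, signals travelling at roughly two-thirds the speed of light in copper):

    ```python
    # How far does a signal travel in the time budget of ONE bit, if a
    # full 1080p60 stream had to fit on a single serial line?
    width, height, bpp, fps = 1920, 1080, 24, 60

    bits_per_second = width * height * bpp * fps    # ~3.0 Gbit/s raw
    bit_time_s = 1 / bits_per_second                # ~0.33 ns per bit
    signal_speed = 2e8                              # m/s, ~2/3 c in copper

    cable_length_m = bit_time_s * signal_speed
    print(f"{bits_per_second / 1e9:.2f} Gbit/s")
    print(f"{cable_length_m * 100:.1f} cm")         # ~6.7 cm
    ```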

    To complete my argument here: the minute you see the picture it is already older than you'd think; you are reacting to the past. Thanks to fancy trickery we can compensate for most of the time-dilation (the most fun we have here is with the network code; warping, anyone?) so it feels "real-time". The reason input lag became so important is that it ADDS lag to an already "old" situation: at 15 ms per frame (at least one frame, most likely 2 and in extreme cases 3) a display can add up to 45 ms before the picture is actually shown to a human. And again, that doesn't include lag from computing the actual picture, network lag, USB-polling lag, IO-interrupt lag, etc... These days, in the best of all worlds, you're already trailing about 4-5 pictures behind the "current" situation, and only 1-2 frames away from what common science attributes as the human conscious-distinction threshold (~100 ms).
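    The frame arithmetic above can be sketched out (the frame counts are the post's own rough assumptions, not measurements):

    ```python
    # Rough lag budget at 60 FPS, following the post's numbers.
    # Assumed: 3 frames of render/buffering pipeline plus 2 frames of
    # display input lag; network, USB polling and IO come on top.
    frame_ms = 1000 / 60            # ~16.7 ms per frame

    pipeline_frames = 3             # computing + buffering the picture
    display_frames = 2              # display input lag

    total_frames = pipeline_frames + display_frames
    total_ms = total_frames * frame_ms
    print(f"trailing ~{total_ms:.0f} ms ({total_frames} frames) behind 'now'")
    # -> ~83 ms, about one frame short of the ~100 ms conscious threshold
    ```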

    That's the real reason why input lag became so important for gamers: it nearly doubled the existing lag, and ~150-200 ms delays ARE noticeable, even to the slower ones among us. 50-75 ms just drops the bar from noticeably bad to noticeable but not bad, or even unnoticeable. This is also something some of us should still be familiar with if we used CRTs extensively. CRTs at 60Hz are just plain bad. CRTs at 75Hz still gave me headaches (as do cinema movies played at 24p). CRTs at 85Hz were bearable, but only at 100Hz did I feel I had a stable picture. Anything beyond that I didn't really notice. I was so sensitive to it that I could tell from 10 m away whether a monitor was set to 60, 75 or 85Hz; for 100Hz I needed to go right up to the computer. So whenever somebody tells you that humans can't hear beyond 44kHz, can't see beyond 30fps, or can't tell the difference between an MP3 conversion and the original, please tell them that

    a) it's not true, because they can (admittedly not everyone, and certainly not always consciously)
    b) there is a difference between not being bothered and not being able to recognize
    c) if you don't believe me: use VLC, launch two copies of a video side-by-side at the same time, then use one instance to jump forward a frame or two. How many frames can you jump before you notice that one video is ahead of the other, and how many before you notice that something is odd?

    To finally close the circle:

    Pro-Gaming usually has an application profile that benefits from constant performance over a peak-performance configuration. That's why constant lag (network & video & IO) is far more important than minimum lag or max FPS. RT-Kernels COULD therefore help create a more reliable system, but there are so many things I would do first before even considering an RT-Kernel. Trust me, your time is better spent implementing QoS on your router to optimize the flow of packets there than repeating this exercise.

    But that's not as easy to benchmark as peak performance.

    p.s.

    The only reason to consider using an RT-Kernel would be if you were trying to create a new, next-generation, PC-hardware-based console gaming platform, but then you'd

    a) change the overall application profile of your environment and
    b) have the chance to write the games specifically for that environment

    RT-Kernel timing-computing can be extremely powerful, as the Cell processor has shown with the PS3 (OK, less RT-Kernel than RT hardware, but let's keep this simple).



  • YoungManKlaus
    replied
    For everyone complaining about <1ms kernel lag, consider that Xorg with its defaults can cause lag up to 30ms:
    https://bugs.freedesktop.org/show_bug.cgi?id=46578
    Good point ... X is an abomination per se
    Luckily, with Wayland coming up, we have a fair chance that X will get eliminated from the default graphics stack in the future



  • ownagefool
    replied
    Originally posted by set135
    I believe the latency numbers for rt kernels on good hardware are measured in *microseconds*, say average <10us and maximum <30us on a core2 duo e6300.
    That'd make a lot more sense, and would be competitive with Windows (better than Windows 7, about the same as XP, in my experience). I was specifically replying to the folks above me, not trying to say the kernel was actually slow, because I had no idea what the actual numbers were.

    The next worry for a gamer would be mouse acceleration, which is a problem on Windows and which X/desktop environments make difficult to get rid of. Again, something like raw input would fix this, but I have no idea how Linux tackles input from mice -> game.
    Last edited by ownagefool; 23 June 2012, 07:20 AM.



  • silenceoftheass
    replied
    Low Latency

    I have a little scheduling latency tester (I do realtime audio, so it's something of interest).

    Basically it runs (zero or N*2) looping background threads (niced) and then 1 or N (high priority/realtime) threads that attempt to be scheduled using nanosleep to wake up at predefined times.

    Take the numbers with a grain of salt, I've only used it for internal stuff. The testing could be better too, using locked memory and forcing swapping during execution.
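    For the curious, the core of such a tester can be sketched in a few lines. This is a plain-Python approximation of my own assumptions about it (time.sleep instead of clock_nanosleep, no SCHED_RR, no locked memory), so its numbers will be far noisier than the figures below:

    ```python
    # Sketch of an "oversleep" measurement: wake up at a fixed frequency
    # and record how far past each deadline the wakeup actually lands.
    import time

    def measure_oversleep(freq_hz, iterations):
        period_ns = int(1e9 / freq_hz)
        deadline = time.monotonic_ns() + period_ns
        oversleeps = []
        for _ in range(iterations):
            remaining = deadline - time.monotonic_ns()
            if remaining > 0:
                time.sleep(remaining / 1e9)
            # how late past the deadline did we wake up? (nanoseconds)
            oversleeps.append(time.monotonic_ns() - deadline)
            deadline += period_ns
        return oversleeps

    if __name__ == "__main__":
        over = measure_oversleep(freq_hz=2756, iterations=500)
        print(f"avr {sum(over) / len(over):10.0f} ns  mnan {max(over)} ns")
    ```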

    I'm aware I'm not testing response to IO events here, but it does give some interesting numbers:

    These figures are for kernels with a wakeup frequency around 2756 Hz (I test a lot more frequencies, but this is around where the CPU shows up some issues on the laptop it's running on):

    Quick explanation of the figures:

    its = iterations
    freq/frq = wakeup frequency
    perNan = Nano delta between wakeups
    perMic = Microseconds delta between wakeups
    perMil = Milliseconds delta between wakeups
    nhi = number high priority (SCHED_RR)
    nlo = number regular priority (normal linux)
    bbac = number background looping threads
    fails = number of iterations that failed to meet the wakeup deadline
    avr = average "oversleep"
    mnan = maximum "oversleep" in nanoseconds
    mmic = maximum "oversleep" in microseconds
    mmil = maximum "oversleep" in milliseconds

    Code:
    2.6.39-preempt-gentoo-r3
    Frequency pass: freq(2756) perNan(362844.70) perMic(  362.84) perMil( 0.36)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(01) bbac(00) fails(    1) avr(   78600.29) mnan(   383190) mmic(  383.19) mmil( 0.38)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(01) bbac(08) fails(    0) avr(   50682.76) mnan(   228595) mmic(  228.59) mmil( 0.23)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(02) bbac(00) fails(   13) avr(   73451.72) mnan(   978501) mmic(  978.50) mmil( 0.98)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(02) bbac(08) fails(    8) avr(   54678.49) mnan(  2271642) mmic( 2271.64) mmil( 2.27)
    One pass: its(2000) frq( 2756.00) nhi(01) nlo(00) bbac(00) fails(    5) avr(   36309.15) mnan(   839371) mmic(  839.37) mmil( 0.84)
    One pass: its(2000) frq( 2756.00) nhi(01) nlo(00) bbac(08) fails(    1) avr(    4686.12) mnan(  1854080) mmic( 1854.08) mmil( 1.85)
    One pass: its(2000) frq( 2756.00) nhi(02) nlo(00) bbac(00) fails(    0) avr(   22805.04) mnan(   172117) mmic(  172.12) mmil( 0.17)
    One pass: its(2000) frq( 2756.00) nhi(02) nlo(00) bbac(08) fails(    0) avr(    6372.65) mnan(    63957) mmic(   63.96) mmil( 0.06)
    For the above, there are some failures even when using high-priority tasks (i.e. nhi != 0), which missed their wakeups by an excessive margin
    Code:
    3.3.2-preempt-c2T7400-2.16
    Frequency pass: freq(2756) perNan(362844.70) perMic(  362.84) perMil( 0.36)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(01) bbac(00) fails(    0) avr(   88775.06) mnan(   211109) mmic(  211.11) mmil( 0.21)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(01) bbac(08) fails(    0) avr(   54785.60) mnan(   239303) mmic(  239.30) mmil( 0.24)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(02) bbac(00) fails(    0) avr(   90745.55) mnan(   341535) mmic(  341.54) mmil( 0.34)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(02) bbac(08) fails(    1) avr(   52708.94) mnan(   543801) mmic(  543.80) mmil( 0.54)
    One pass: its(2000) frq( 2756.00) nhi(01) nlo(00) bbac(00) fails(    0) avr(   87453.27) mnan(   154366) mmic(  154.37) mmil( 0.15)
    One pass: its(2000) frq( 2756.00) nhi(01) nlo(00) bbac(08) fails(    0) avr(    4731.63) mnan(    17300) mmic(   17.30) mmil( 0.02)
    One pass: its(2000) frq( 2756.00) nhi(02) nlo(00) bbac(00) fails(    0) avr(   35838.07) mnan(   168112) mmic(  168.11) mmil( 0.17)
    One pass: its(2000) frq( 2756.00) nhi(02) nlo(00) bbac(08) fails(    0) avr(    5730.70) mnan(    71996) mmic(   72.00) mmil( 0.07)
    For the above, we see it's a little better: fewer failures, and on the whole the oversleep amount (avr) is a lot less.
    Code:
    3.0.14-rt31-preempt-c2T7400-2.16
    Frequency pass: freq(2756) perNan(362844.70) perMic(  362.84) perMil( 0.36)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(01) bbac(00) fails(    0) avr(   78913.17) mnan(   345092) mmic(  345.09) mmil( 0.35)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(01) bbac(08) fails(    0) avr(   52385.20) mnan(   162347) mmic(  162.35) mmil( 0.16)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(02) bbac(00) fails(    0) avr(   88372.10) mnan(   349752) mmic(  349.75) mmil( 0.35)
    One pass: its(2000) frq( 2756.00) nhi(00) nlo(02) bbac(08) fails(    1) avr(   55378.42) mnan(   751052) mmic(  751.05) mmil( 0.75)
    One pass: its(2000) frq( 2756.00) nhi(01) nlo(00) bbac(00) fails(    0) avr(   28074.12) mnan(   109365) mmic(  109.36) mmil( 0.11)
    One pass: its(2000) frq( 2756.00) nhi(01) nlo(00) bbac(08) fails(    0) avr(    5074.74) mnan(    18434) mmic(   18.43) mmil( 0.02)
    One pass: its(2000) frq( 2756.00) nhi(02) nlo(00) bbac(00) fails(    0) avr(   65620.09) mnan(   137933) mmic(  137.93) mmil( 0.14)
    One pass: its(2000) frq( 2756.00) nhi(02) nlo(00) bbac(08) fails(    0) avr(    5588.25) mnan(    27105) mmic(   27.11) mmil( 0.03)
    For the above, the realtime kernel fares more or less in the same ballpark as the 3.3.2 kernel, but mostly with longer average wakeup latencies (avr).

    Let's look at some heavier loads (higher wakeup Hz):

    Code:
    3.3.2-preempt-c2T7400-2.16
    Frequency pass: freq(3000) perNan(333333.33) perMic(  333.33) perMil( 0.33)
    One pass: its(2000) frq( 3000.00) nhi(00) nlo(01) bbac(00) fails(    0) avr(  147932.56) mnan(   189177) mmic(  189.18) mmil( 0.19)
    One pass: its(2000) frq( 3000.00) nhi(00) nlo(01) bbac(08) fails(    1) avr(   54985.65) mnan(   846406) mmic(  846.41) mmil( 0.85)
    One pass: its(2000) frq( 3000.00) nhi(00) nlo(02) bbac(00) fails(    1) avr(   99805.41) mnan(  1078824) mmic( 1078.82) mmil( 1.08)
    One pass: its(2000) frq( 3000.00) nhi(00) nlo(02) bbac(08) fails(    1) avr(   50858.96) mnan(   935250) mmic(  935.25) mmil( 0.94)
    One pass: its(2000) frq( 3000.00) nhi(01) nlo(00) bbac(00) fails(    0) avr(   86170.02) mnan(   109395) mmic(  109.39) mmil( 0.11)
    One pass: its(2000) frq( 3000.00) nhi(01) nlo(00) bbac(08) fails(    0) avr(    4755.44) mnan(    38145) mmic(   38.15) mmil( 0.04)
    One pass: its(2000) frq( 3000.00) nhi(02) nlo(00) bbac(00) fails(    0) avr(   40428.37) mnan(   142389) mmic(  142.39) mmil( 0.14)
    One pass: its(2000) frq( 3000.00) nhi(02) nlo(00) bbac(08) fails(    0) avr(    5128.04) mnan(    55881) mmic(   55.88) mmil( 0.06)
    Code:
    3.0.14-rt31-preempt-c2T7400-2.16
    Frequency pass: freq(3000) perNan(333333.33) perMic(  333.33) perMil( 0.33)
    One pass: its(2000) frq( 3000.00) nhi(00) nlo(01) bbac(00) fails(    1) avr(   85895.76) mnan(   364207) mmic(  364.21) mmil( 0.36)
    One pass: its(2000) frq( 3000.00) nhi(00) nlo(01) bbac(08) fails(    2) avr(   55065.84) mnan(   995272) mmic(  995.27) mmil( 1.00)
    One pass: its(2000) frq( 3000.00) nhi(00) nlo(02) bbac(00) fails(    2) avr(   96075.35) mnan(   654147) mmic(  654.15) mmil( 0.65)
    One pass: its(2000) frq( 3000.00) nhi(00) nlo(02) bbac(08) fails(    0) avr(   64900.84) mnan(   146107) mmic(  146.11) mmil( 0.15)
    One pass: its(2000) frq( 3000.00) nhi(01) nlo(00) bbac(00) fails(    0) avr(   53375.44) mnan(   134737) mmic(  134.74) mmil( 0.13)
    One pass: its(2000) frq( 3000.00) nhi(01) nlo(00) bbac(08) fails(    0) avr(    5056.19) mnan(     9248) mmic(    9.25) mmil( 0.01)
    One pass: its(2000) frq( 3000.00) nhi(02) nlo(00) bbac(00) fails(    0) avr(   93658.56) mnan(   140582) mmic(  140.58) mmil( 0.14)
    One pass: its(2000) frq( 3000.00) nhi(02) nlo(00) bbac(08) fails(    0) avr(    5230.94) mnan(     9641) mmic(    9.64) mmil( 0.01)
    So even under heavy scheduling loads and timing constraints, the 3.3.2 kernel is actually pretty damn good when it comes to scheduling.

    I've not yet found timing constraints tight enough that the realtime kernel fares better than the stock 3.3.2.

    Now for gaming it gets tricky - as other posters have mentioned up thread, you really want to be measuring the latency from say

    mouse click -> graphical response

    So in an ideal world, you create a USB device with a little light on it, and using a high-speed camera you record when you click (light comes on) and when the screen displays a result. Then compute the latency from the number of frames the camera recorded in between.
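    The arithmetic for that setup reduces to one line (the 12-frame / 240 FPS numbers below are hypothetical, just to show the shape of the calculation):

    ```python
    # Click-to-photon latency from a high-speed camera recording:
    # frames counted between the LED lighting up and the screen reacting,
    # divided by the camera's frame rate.
    def click_to_photon_ms(frames_between, camera_fps):
        return frames_between / camera_fps * 1000

    # e.g. 12 frames apart on a 240 FPS camera:
    print(f"{click_to_photon_ms(12, 240):.0f} ms")   # -> 50 ms
    ```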

    I know I should release it (peer review, eyes over the code etc) but for the moment it's tied to the internal code base here and I have higher priority fish to fry. I just thought some might find the figures interesting.

    Cheers,

    D
    Last edited by silenceoftheass; 23 June 2012, 06:53 AM.



  • curaga
    replied
    For everyone complaining about <1ms kernel lag, consider that Xorg with its defaults can cause lag up to 30ms:
    https://bugs.freedesktop.org/show_bug.cgi?id=46578


    The only way to avoid that in the current situation is to run a dedicated X server for your game. Then it can't ignore your client, because it's the only one there.



  • YoungManKlaus
    replied
    I agree with most of the previous posters: the benchmark only measures one aspect of many, and some of the others are much more important than "pure" fps.
    - input latency
    - input latency jitter
    - fps jitter
    are the three factors that would be most interesting.



  • kraftman
    replied
    Originally posted by ownagefool View Post
    FYI, if we're seriously talking about a 30ms kernel, I'd suspect we're seriously way too slow already. We're talking about a group of people who raise their USB polling rates from 8ms to 1-2ms, use monitors that need to have < 10ms lag, and play at high FPS (thus meaning < 10ms delay between screens) with generally 100hz+ refresh rates. A 30ms response time is already way too slow, though I suppose that's a maximum as opposed to an average.
    I think you're mixing things up. Where are we too slow, actually? The Linux kernel is much more responsive than Windows, but whether that matters for the things you described I'm not so sure. When I had stuttering in Skyrim under Windows after the 1.5 patch I had to limit FPS, and the game became playable once again, so there are more important things affecting smoothness than kernel responsiveness. I bet your responsiveness will be messed up with a real-time kernel and games will simply be much worse. Just a bet, but I played a lot with custom kernels in the past and the generic one was usually the best when it comes to gaming.

