Ubuntu Looking At Applying Low-Latency Optimizations To Its Generic Kernel


  • Volta
    replied
    Originally posted by ddriver View Post

    Hard-coded values are still stored in program memory you know, they can't be obtained by magic.
    Some values are accessed faster.

    It may theoretically be a tad cheaper to have an immediate value, but that would be negligible, and if I had to guess why it was hardcoded in the first place, I'd put "performance" after "legacy design" and "backward practices".
    I'd put smart design and performance first. I bet many more choices depend on this timer, so keeping it 'static' allows better optimizations than modifying all of the dependent values on the fly.

    I am also gonna guess they don't have a "dynamic version" for a side by side to compare and evaluate the hard-coded version as sufficiently superior so as to be worth the inconvenience.


    What's more, it doesn't really make sense for this variable to be accessed that frequently. Something like that will likely be passed to and kept in the CPU hardware timers; more plausibly, it is only accessed when the timeout needs to be changed.
    When it's 'static' it doesn't need to change at all. When it comes to preemption behavior defined at boot, it's not so obvious:

    The runtime overhead is negligible with HAVE_STATIC_CALL_INLINE enabled but if runtime patching is not available for the specific architecture then the potential overhead should be considered.
    I guess the same applies to frequency.

    P.S. I wonder how you would want to program HPET timers for this?
    Last edited by Volta; 01 February 2024, 12:17 PM.



  • ddriver
    replied
    Originally posted by Volta View Post
    I'm not so sure:
    We're probably talking about the most accessed variable ever, so it should matter.
    Hard-coded values are still stored in program memory you know, they can't be obtained by magic.

    It may theoretically be a tad cheaper to have an immediate value, but that would be negligible, and if I had to guess why it was hardcoded in the first place, I'd put "performance" after "legacy design" and "backward practices".

    I am also gonna guess they don't have a "dynamic version" for a side by side to compare and evaluate the hard-coded version as sufficiently superior so as to be worth the inconvenience.


    What's more, it doesn't really make sense for this variable to be accessed that frequently. Something like that will likely be passed to and kept in the CPU hardware timers; more plausibly, it is only accessed when the timeout needs to be changed.



    Last edited by ddriver; 30 January 2024, 12:08 PM.



  • Volta
    replied
    Originally posted by ddriver View Post


    Or it might have absolutely no negative impact whatsoever; it is just impossible to accomplish, since dated legacy codebases are riddled with shortsighted hardcoding that makes it extremely cumbersome to facilitate design changes, so you are committed to doing things the stupid way till you run yourself into a corner and have no choice but to redo everything.
    When it comes to dated legacy codebases, we're not talking about Windows here. I bet there's no sense in making this dynamic. Going to 0 to be tickless (in which scenarios?) is an example of stupid design. Btw, does Windows set the HPET timers to 0 as well in such a case? Or does it perhaps switch to the mentioned software timer for tickless mode? This is hardcoding.

    There's not much to optimize in passing a timeout value to a system timer. It doesn't matter if the value is static or dynamic in the slightest.
    I'm not so sure:

    When you declare a const in your program,

    Compiler can optimize away this const by not providing storage for this variable; instead it can be added to the symbol table. So a subsequent read just needs indirection into the symbol table rather than instructions to fetch value from memory.


    We're probably talking about the most accessed variable ever, so it should matter.
    Last edited by Volta; 30 January 2024, 05:09 AM.



  • ddriver
    replied
    Can't shake the feeling preemptive MT is just throwing performance away.

    It is quite amazing that in this day and age, CPUs are still executing blind binaries, that we can't pass buffering requirements down to prefetchers, so there's a lot of redundant prefetching, overshooting and cache pollution and wastage.

    I am not against preemptive as an option, but I'd like cooperative intrinsics and access requirement hints, so that no CPU cycles are wasted preempting and prefetchers know exactly what our requirements are for each buffered access duration. System resources would then be optimally distributed based on controllable dynamic behavior rather than hardcoding and hit-or-miss black boxes.

    Optimizing for either or with a preemptive scheduler is leaving a lot of performance on the table. I'd say as a consequence of legacy design in the software and hardware, we are throwing away a good 20-25% of the performance.

    The "highest isolate throughput" is pretty much the "standard" for "benchmarking" hardware. And certain cpu vendors even design cpus to win such scores rather than do real world work.

    I've come to realize that a system where every single thing is "maximum throughput" on its own usually doesn't run that well as a whole, because there's poor synergy between all those resource-greedy operations when they are executing concurrently, in a structure and configuration that is impossible to know ahead of time.


    Mayhaps it is time to have a user-programmable logic synchronization enclave and entirely offload that stuff from the compute pipeline? It wouldn't take much more than adding an API for explicit access to the prefetchers, telling them what to do rather than having them take time and chances guessing.
    Last edited by ddriver; 30 January 2024, 12:21 AM.



  • mrg666
    replied
    I have been building the kernel with CONFIG_NO_HZ_FULL=y, CONFIG_PREEMPT=y, CONFIG_HZ=300, and CONFIG_SCHED_HRTICK=y. It is running great. I can build a kernel with all available threads (j=32 with Ryzen 5950) in the background and use the computer as usual without any slowdown. The I/O scheduler is BFQ.



  • ddriver
    replied
    Originally posted by MadCatX View Post

    The length of a jiffy (= the internal kernel time unit) is derived from the CONFIG_HZ value. Conversions between jiffies and milliseconds are done with macros that, ideally, evaluate at compile time because the value of CONFIG_HZ is known. Making it adjustable might come with some performance penalty because the compiler would have fewer opportunities to optimize it. Additionally, we have had high resolution timers in the kernel for a very long time. Whenever the kernel needs to schedule something with sufficient precision, it will use those.

    Or it might have absolutely no negative impact whatsoever; it is just impossible to accomplish, since dated legacy codebases are riddled with shortsighted hardcoding that makes it extremely cumbersome to facilitate design changes, so you are committed to doing things the stupid way till you run yourself into a corner and have no choice but to redo everything.

    There's not much to optimize in passing a timeout value to a system timer. It doesn't matter if the value is static or dynamic in the slightest.



  • RealNC
    replied
    Originally posted by uid313 View Post
    How many Hz does Android, iOS, macOS, Windows and Windows Server have?
    Dunno about the others, but Windows goes up to 2000Hz, and it dynamically alters the rate depending on what applications have requested (there's a Win32 API for it) or whether Windows has detected that a game is running. It can also become 0 if nothing requires timer interrupts (tickless).
    Last edited by RealNC; 29 January 2024, 09:50 AM.



  • darkbasic
    replied
    Originally posted by chromer View Post
    Please include Arch kernel in benchmarks.
    Arch Linux already has all of these except CONFIG_HZ which is set at 300.



  • MadCatX
    replied
    Originally posted by ddriver View Post
    Typical ... let's change the kernel to change this one hardcoded parameter value. Cuz it is not like it can be made a dynamic, configurable value the user can set to their needs on a per-use-case basis.
    The length of a jiffy (= the internal kernel time unit) is derived from the CONFIG_HZ value. Conversions between jiffies and milliseconds are done with macros that, ideally, evaluate at compile time because the value of CONFIG_HZ is known. Making it adjustable might come with some performance penalty because the compiler would have fewer opportunities to optimize it. Additionally, we have had high resolution timers in the kernel for a very long time. Whenever the kernel needs to schedule something with sufficient precision, it will use those.



  • ddriver
    replied
    Typical ... let's change the kernel to change this one hardcoded parameter value. Cuz it is not like it can be made a dynamic, configurable value the user can set to their needs on a per-use-case basis.

    Originally posted by uid313 View Post
    How many Hz does Android, iOS, macOS, Windows and Windows Server have?
    Android not many, most likely - most Android devices still can't do real-time audio, for example.

    iOS, in contrast, has had real-time audio since day one, so I assume it's pretty responsive. They took Core Audio for their mobile devices; quite possibly other core internals were also brought over from macOS.

    Windows is barely OK, but the added bloatware often breaks real-time requirements.
    Last edited by ddriver; 29 January 2024, 07:00 AM.

