The XanMod Kernel Is Working Well To Boost Ubuntu Desktop / Workstation Performance


  • #21
    Originally posted by Linuxxx View Post

    Just recently I configured an AMD Ryzen laptop for someone, and this is what I observed:

    - the performance governor is not able to hit the boost speeds
    - ondemand would lead to noticeable stuttering
    - best config: schedutil with /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us set to 0 (zero)

    All of the above was on Ubuntu 19.10 with the Linux 5.3 lowlatency kernel.
    Looks like it's time to revisit schedutil. I know it's historically behaved badly, especially on MuQSS, but maybe it's worth another try.

    As for performance not boosting properly, at least on the 5960X, the CPU only goes into turbo once there's mild load. If your system is actually idle, it tends to run the cores at or below the maximum non-boost frequency. This behavior might be different on AMD processors.

    And as for hitting single-core boosts with MuQSS, that's probably nearly impossible. MuQSS aggressively hits deadlines at the cost of migrating threads between cores. You would need to configure ondemand to aggressively ramp frequency up at lower core utilization to hit single-core boosts, but this would destroy battery life on laptops.

    I think the right way forward (if you want to use MuQSS) is to look into why boosting performs so much worse on Ryzen than on any modern Intel processor. My theory is simply lack of attention, since Ryzen processors are relatively new and most people don't run the performance governor and so never notice.



    • #22
      I tested schedutil briefly; it looks like it still ramps frequencies as aggressively as it used to. In the screenshot below, I have a load of less than 1.0 with 8 cores, and it's running all cores except one over 2 GHz. Its behavior is very similar to running straight "performance" on an Intel processor with MuQSS.

      [Meant to put screenshot here but apparently I don't have permission to add attachments]

      However, I can see the value if this lets Ryzen processors turbo boost correctly. This gets us the behavior of "performance" on Intel processors and enables proper boosting on Ryzen processors. The only disadvantage is that there's no way to configure schedutil as the default for acpi-cpufreq while also getting performance on intel_pstate. At least, not without changing the default governor for intel_pstate in the source code, which I'm not against:

      We would just need to modify this code block to pick performance if schedutil is the default:

      Code:
      static int intel_pstate_cpu_init(struct cpufreq_policy *policy)
      {
              int ret = __intel_pstate_cpu_init(policy);
      
              if (ret)
                      return ret;
      
              if (IS_ENABLED(CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE))
                      policy->policy = CPUFREQ_POLICY_PERFORMANCE;
              else
                      policy->policy = CPUFREQ_POLICY_POWERSAVE;
      
              return 0;
      }
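
      A minimal sketch of that change (CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is the existing Kconfig symbol for a schedutil default; treating it as a cue for "performance" here is just the idea being proposed, not merged code):

      Code:
      static int intel_pstate_cpu_init(struct cpufreq_policy *policy)
      {
              int ret = __intel_pstate_cpu_init(policy);

              if (ret)
                      return ret;

              /* Sketch: treat a schedutil default like a performance default,
               * so intel_pstate runs "performance" while acpi-cpufreq systems
               * still get schedutil as their default governor. */
              if (IS_ENABLED(CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE) ||
                  IS_ENABLED(CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL))
                      policy->policy = CPUFREQ_POLICY_PERFORMANCE;
              else
                      policy->policy = CPUFREQ_POLICY_POWERSAVE;

              return 0;
      }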
      And Linuxxx , setting rate_limit_us to zero didn't make any difference on my system. What effect do you get by changing it? According to the code that uses it, it's there to stop cpufreq from changing the core frequency too often. If you set it to zero, it always changes the core frequency. I think maybe it's best to reduce this value but not set it to zero. For instance, set it to 1000 so you restrict frequency updates to once every millisecond (1000 Hz), or match it to rr_interval in MuQSS (2 ms on Liquorix).
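
      For reference, the check in question boils down to something like this (simplified from the kernel's sugov_should_update_freq(); freq_update_delay_ns is just rate_limit_us converted to nanoseconds):

      Code:
      /* Simplified sketch: skip this frequency update if less than
       * rate_limit_us (stored as freq_update_delay_ns) has passed
       * since the last one. With rate_limit_us = 0 this is always true. */
      static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time)
      {
              s64 delta_ns = time - sg_policy->last_freq_update_time;

              return delta_ns >= sg_policy->freq_update_delay_ns;
      }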



      • #23
        Originally posted by damentz View Post
        And Linuxxx , setting rate_limit_us to zero didn't make any difference on my system. What effect do you get by changing it? According to the code that uses it, it's there to stop cpufreq from changing the core frequency too often. If you set it to zero, it always changes the core frequency. I think maybe it's best to reduce this value but not set it to zero. For instance, set it to 1000 so you restrict frequency updates to once every millisecond (1000 Hz), or match it to rr_interval in MuQSS (2 ms on Liquorix).
        Why use a value of zero?
        Simple:
        Because Android does it this way, too!

        The person I set up the Ryzen notebook for complained that his mid-range Android smartphone could scroll through heavy websites smoothly while a default install of Ubuntu 19.10 struggled to do the same.

        So googling a bit revealed that rate_limit_us is actually set to 0 on Android kernels in combination with schedutil!

        At first I also thought that this must lead to heavy overhead, but a simple "sudo cpupower monitor" showed that the deepest sleep state was still being hit.

        So the way I understood it was that schedutil with rate_limit_us equaling 0 is not changing the frequency constantly, but instead reacting to any change in load as quickly as possible, without any artificially defined limit.

        Really, I used to be a MuQSS user myself, but nowadays a Linux kernel with CFS + 1000 Hz + PREEMPT (i.e. Ubuntu's "lowlatency" kernel) is more than adequate to get the job done!
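
        (For reference, that amounts to a kernel built with something like the following options; a sketch of what I understand Ubuntu's lowlatency flavour to set:)

        Code:
        # CFS + full preemption + 1000 Hz tick, as in Ubuntu's "lowlatency" kernel
        CONFIG_PREEMPT=y
        CONFIG_HZ_1000=y
        CONFIG_HZ=1000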



        • #24
          Originally posted by Linuxxx View Post
          Why use a value of zero?
          Simple:
          Because Android does it this way, too!

          So the way I understood it was that schedutil with rate_limit_us equaling 0 is not changing the frequency constantly, but instead reacting to any change in load as quickly as possible, without any artificially defined limit.
          There is this patch from Feb 2017:

          This patch changes the way rate_limit_us is used, i.e. It now governs "How frequently we change the frequency" instead of "How frequently we reevaluate the frequency". One may think that this change may have increased the number of times we reevaluate the frequency after a period of rate_limit_us has expired since the last change, if the load isn't changing. But that is protected by the scheduler as normally it doesn't call into the schedutil governor before 1 ms (Hint: "decayed" in update_cfs_rq_load_avg()) since the last call.
          It's difficult to create a test case (tried rt-app as well) where this patch will show a lot of improvements as the target of this patch is a real corner case. I.e. Current load is X (resulting in freq change), load after rate_limit_us is also X, but right after that load becomes Y. Undoubtedly this patch would improve the responsiveness in such cases.
          I don't know if that got merged, but looking at the description for the property in the official kernel docs, it seems it hasn't?

          rate_limit_us
          Minimum time (in microseconds) that has to pass between two consecutive runs of governor computations (default: 1000 times the scaling driver’s transition latency).
          The purpose of this tunable is to reduce the scheduler context overhead of the governor which might be excessive without it.
          No idea about the Android kernel, but they might have patched it. Just because their kernel sets the value to X doesn't mean it will work well for us, since their kernel potentially has other differences that could affect that?



          • #25
            Originally posted by polarathene View Post

            There is this patch from Feb 2017:

            I don't know if that got merged, but looking at the description for the property in the official kernel docs, it seems it hasn't?

            No idea about the Android kernel, but they might have patched it. Just because their kernel sets the value to X doesn't mean it will work well for us, since their kernel potentially has other differences that could affect that?
            I have since tried out schedutil with rate_limit_us set to 0 with intel_cpufreq (i.e. intel_pstate=passive) [the default value on Ubuntu 18.04.3 LTS with Linux 5.0-lowlatency was 500], and here too I couldn't observe any overhead at all with "sudo cpupower monitor".
            While the system was idle, frequencies stayed at their minimum, while loading the different cores would immediately result in an appropriate response.
            Also, smoothness while scrolling through some heavy websites was still a given.

            Maybe You could also give this a try & share Your experience with us?



            • #26
              Originally posted by damentz View Post
              I tested schedutil briefly; it looks like it still ramps frequencies as aggressively as it used to. In the screenshot below, I have a load of less than 1.0 with 8 cores, and it's running all cores except one over 2 GHz. Its behavior is very similar to running straight "performance" on an Intel processor with MuQSS.
              What kernel did you test with? The Linux 5.5 kernel is meant to get the "frequency invariance" patches that improve schedutil on Intel CPUs, and iirc that should make it less aggressive about ramping frequencies? So once that arrives, its behavior may be less comparable to the "performance" mode of pstate.

              Btw, with your Liquorix kernel you must have experimented with the timer frequency enough; did you not find Con's recommended 100 Hz value for MuQSS to work well? I see quite a few users running 1000 Hz with MuQSS as if it were any other scheduler, but my understanding was that the issues some users brought up about it were because they weren't including the highres timer patches from linux-ck?

              I have noticed that Con at one point made further modifications to time management in the kernel and shared some good insights about them on his blog, but later reverted those; still, he continued to advise 100 Hz for better throughput and lower power, while still getting low latency via the highres timer patches. In the comments on one of the updates he refers to it being equivalent to 10,000 Hz (10x), but it wasn't clear whether that was relative to increasing the timer rate to 1000 Hz or to his recommended 100 Hz.

              Originally posted by damentz View Post
              I think maybe it's best to reduce this value but not set it to zero. For instance, set it to 1000 so you restrict frequency updates to once every millisecond (1000 Hz), or match it to rr_interval in MuQSS (2 ms on Liquorix).
              As mentioned in my previous post, a value of 0 is apparently equivalent to 1 ms (if you drill all the way down the discussion, that's not entirely true; in some circumstances it's less, but that was deemed unlikely to be a real issue, though it might have been based on changes in the patch, I can't recall).

              Originally posted by Linuxxx View Post
              Maybe You could also give this a try & share Your experience with us?
              Perhaps; I'm still on a research binge regarding proper optimizations for power saving, and then performance, with custom kernels. As mentioned to damentz above, it might be best to wait a bit first and try the changes with a kernel that has the schedutil update. Is there a community (forum/reddit/discord/etc.) in particular for this?

              I'm curious if it can actually do better than the pstate governor; what I'd really like, though, are proper benchmark/measurement/monitoring tools for this, rather than a naive approach. I'd need to also look up the pstate docs again, but depending on the age of the chip, at least with the active modes, the effectiveness of pstate can vary (newer models with HWP support, for example). Note that you can also completely disable pstate rather than just use passive mode if you really want to evaluate schedutil; iirc passive mode isn't the same, and thus you may get different results vs. pstate completely disabled (probably less favorable).
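
              For anyone wanting to try both, the relevant (documented) kernel command-line options are:

              Code:
              intel_pstate=passive   # keep the driver but let generic governors like schedutil drive it
              intel_pstate=disable   # turn intel_pstate off entirely and fall back to acpi-cpufreq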



              • #27
                Originally posted by polarathene View Post

                Perhaps; I'm still on a research binge regarding proper optimizations for power saving, and then performance, with custom kernels. As mentioned to damentz above, it might be best to wait a bit first and try the changes with a kernel that has the schedutil update. Is there a community (forum/reddit/discord/etc.) in particular for this?

                I'm curious if it can actually do better than the pstate governor; what I'd really like, though, are proper benchmark/measurement/monitoring tools for this, rather than a naive approach. I'd need to also look up the pstate docs again, but depending on the age of the chip, at least with the active modes, the effectiveness of pstate can vary (newer models with HWP support, for example). Note that you can also completely disable pstate rather than just use passive mode if you really want to evaluate schedutil; iirc passive mode isn't the same, and thus you may get different results vs. pstate completely disabled (probably less favorable).
                That's why I had already asked Michael if he could do another round of testing with the schedutil governor on Linux 5.5, since there is also the fundamental change to the CFS scheduler with closer integration of PELT [Per-Entity Load Tracking], which should be of particular help for schedutil.

                Preferably those tests should also include a custom run with "rate_limit_us" equalling 0;
                that's why I think an AMD Ryzen system would be particularly interesting, since not only does one not need to worry about 'acpi-cpufreq' vs. 'intel-cpufreq', but the default value of "rate_limit_us" was also set to an insanely high number!
                (10k, i.e. 10000 µs, so 10 ms - which is just way too high when you consider that a frame on a 60 Hz monitor only has 16.7 ms to be drawn; and now imagine using one of those shiny new 360(!) Hz monitors nVidia announced @ CES 2020!)
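
                Worked out: at 60 Hz a frame budget is 1000 ms / 60 ≈ 16.7 ms, so a 10 ms rate limit allows at most one governor re-evaluation within a single frame; at 360 Hz the budget is 1000 ms / 360 ≈ 2.8 ms, so the governor could hold a stale frequency for three to four consecutive frames.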



                • #28
                  Originally posted by Linuxxx View Post
                  if he could do another round of testing
                  I'm sure he will once the new kernel is actually out. Slightly newer 5.4 kernels (5.4.10, maybe earlier) also got this merged:

                  Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes

                  For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period.

                  For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes.

                  This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on 80 CPU machine), this commit resulted in almost 30x performance improvement, while still maintaining correct cpu quota restrictions.

                  Source
                  Which might not be relevant to some users, but it's a rather notable boost in that situation. I think Michael wrote an article about it a while back? (I saw the source link shared in the recent Zen article comments.)
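
                  For scale: that artificial testcase amounts to cpu.cfs_quota_us = 10000 against cpu.cfs_period_us = 100000, i.e. 10 ms of CPU time per 100 ms period (10% of one CPU) shared by all threads in the cgroup on an 80-core machine; exactly the high-thread-count, low-quota situation the commit message describes as being over-throttled.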

                  Originally posted by Linuxxx View Post
                  (10k, i.e. 10000 µs, so 10 ms - which is just way too high when you consider that a frame on a 60 Hz monitor only has 16.7 ms to be drawn; and now imagine using one of those shiny new 360(!) Hz monitors nVidia announced @ CES 2020!)
                  I don't think a high monitor refresh rate equates to outputting that many unique frames. Often in games things like physics run at a lower rate anyway, even before the trend to push refresh rates beyond 60 Hz iirc. It really depends on what's being done per frame; the higher the frame rate, the more likely optimizations are in place to skip unnecessary work, especially if it would result in effectively the same frame being rendered.

                  Just like how laptops have PSR (Panel Self Refresh) as well as FBC (Frame Buffer Compression), and Intel has DRRS (Dynamic Refresh Rate Switching); these all reduce the need to update the display, or run it at lower refresh rates, when content isn't actually changing.

                  Adjusting frequencies or power states can itself involve some latency afaik? The AMD 4000 Renoir series reduced this by 80% compared to the 3000 series, which apparently made quite a difference.

                  The patch I linked to improved on it so that the 10 ms wasn't necessarily a bad thing and was used more smartly iirc.

                  At 1000 Hz, you'd only benefit by up to 1-2 ms afaik; the MuQSS author points out how a bunch of kernel code timing is Hz-limited (so 100 Hz was no faster than 10 ms, or 20 ms reliably?), although afaik the highres timer patches mentioned there were reverted in a future release, possibly due to problems some users experienced. This sort of thing is also what I see from some users in comments on his release posts about adjusting the timer to 1000 Hz instead of his advised 100 Hz, but iirc those users often (if they mentioned it at all) weren't using the other highres timer patches that linux-ck provides separately from MuQSS:

                  Additionally, I've added a number of APIs to the kernel to do specified millisecond schedule timeouts which use the highres timers which are mandatory now for MuQSS. The reason for doing this is there are many timeouts in the kernel that specify values below 10ms and the timer resolution at 100Hz only guarantees timeouts under 20ms. - Source
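
                  (In other words: at 100 Hz a tick is 10 ms, and a timeout can only expire on a tick boundary, so a request for anything under 10 ms may take up to two ticks, i.e. just under 20 ms, to actually fire.)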
                  I'm all for seeing tests that can show worthwhile gains before/after.

                  Did the friend with Ubuntu 19.10 try this tweak you suggest and get performance as smooth as on their Android device? Both Chrome and Firefox (especially for smooth scrolling, I hear) tend not to have the greatest hardware (GPU/HW accel) support enabled out of the box, whereas on Android it was probably a given, similar to how it can be with Windows. It wasn't clear if the issue was related to CPU frequency (e.g. full clock speeds resolve it; if you need boost for a prolonged period just for web page scrolling, that doesn't sound good..). Fixing it with CPU performance might work, but if using the GPU is possible, it'd probably be much better/more efficient.

                  Thus, you might be chasing after the wrong improvements to solve the issue correctly? (does sound like you got it working better, but such a specific setting/change seems a bit odd to resolve it)



                  • #29
                    Originally posted by polarathene View Post

                    Did the friend with Ubuntu 19.10 try this tweak you suggest and get performance as smooth as on their Android device? Both Chrome and Firefox (especially for smooth scrolling, I hear) tend not to have the greatest hardware (GPU/HW accel) support enabled out of the box, whereas on Android it was probably a given, similar to how it can be with Windows. It wasn't clear if the issue was related to CPU frequency (e.g. full clock speeds resolve it; if you need boost for a prolonged period just for web page scrolling, that doesn't sound good..). Fixing it with CPU performance might work, but if using the GPU is possible, it'd probably be much better/more efficient.

                    Thus, you might be chasing after the wrong improvements to solve the issue correctly? (does sound like you got it working better, but such a specific setting/change seems a bit odd to resolve it)
                    Yes, schedutil with rate_limit_us set to 0 really did solve the issue (though admittedly I haven't tried with the default value of 10k, since I wrote a simple udev rule to automatically set the value to zero upon boot).
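
                    (A minimal rule of that sort, using the global tunable path from above, looks something like this; a sketch rather than the exact rule:)

                    Code:
                    # /etc/udev/rules.d/99-schedutil.rules (sketch)
                    # When a CPU device appears at boot, zero schedutil's rate limit.
                    ACTION=="add", SUBSYSTEM=="cpu", RUN+="/bin/sh -c 'echo 0 > /sys/devices/system/cpu/cpufreq/schedutil/rate_limit_us'"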

                    Also, if a value below 1 ms (so rate_limit_us of less than 1000) really shouldn't make a difference, then why does Intel prefer a value of 500 with "intel_cpufreq"?



                    • #30
                      Originally posted by Linuxxx View Post
                      Also, if a value below 1 ms (so rate_limit_us of less than 1000) really shouldn't make a difference, then why does Intel prefer a value of 500 with "intel_cpufreq"?
                      Dunno; if it's used by the firmware internally it'd be a different story? According to the MuQSS release blogpost I quoted/linked, the 1 ms value had to do with how often kernel calls or whatever could be performed (and 1 ms needed a 1000 Hz kernel to manage that). I've been reading quite a lot, and might have mixed up or misunderstood some of what I said. It's best just to confirm with proper tests/measurements.

                      MuQSS at that point (2016), with the linux-ck patches for that linked release, modified parts of the kernel logic that were dependent on Hz to use the highres timers instead, since MuQSS was meant to provide the low latency of 1000 Hz on a 100 Hz kernel (when combined with the highres timer patches in linux-ck); but as I mentioned, later that year some of that was reverted iirc.

                      All that matters in your case is that for you the default value caused scrolling issues in a web browser and setting it to 0 resolved them with no noticeable drawbacks. There might have been better ways to resolve it, or not; I don't seem to have such issues, and damentz wasn't able to notice a difference when he compared? (Just to clarify, I've not tested it myself; I just don't seem to have any web browser scrolling issues personally.)

