
How to tell if a driver is gallium or just mesa? (Slow rendering with radeon)


  • #11
    Disable unneeded drivers in mesa config to speed up compilation.
    You are right, but my bad... Mesa compilation was not that slow; it was the kernel build where I should have disabled unneeded modules, because that was SLOW as hell... Will learn from this :-)
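    For reference, restricting Mesa to just the r300 Gallium driver at configure time looks roughly like this with the meson build system (option names as in Mesa's meson_options.txt; double-check them against your checkout):

    ```shell
    # Configure a Mesa build with only the r300 Gallium driver enabled.
    # Run inside a Mesa git checkout; the build directory name is arbitrary.
    meson setup build/ -Dgallium-drivers=r300
    ninja -C build/
    ```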


    • #12
      You can use the make localmodconfig target.
      That seems like a good idea...
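      For anyone else trying this, the usual flow is to capture the currently loaded modules and let the kernel build system strip everything else (make sure all your usual hardware is plugged in first, so its modules are loaded):

      ```shell
      # Inside the kernel source tree, starting from the running kernel's config:
      zcat /proc/config.gz > .config    # or copy /boot/config-$(uname -r) if IKCONFIG is off
      lsmod > /tmp/lsmod.txt            # snapshot of loaded modules
      make LSMOD=/tmp/lsmod.txt localmodconfig
      make -j"$(nproc)"
      ```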

      Btw I am now running on my own kernel - hurray! I think this is the first time I have done so ;-)

      The sad news is that barely anything has changed in performance despite the whole lot of changes. In glxgears the frame rate is now a constant 356..357 FPS, which is 1-3 FPS better than ever before. In Extreme Tux Racer - which, along with glxgears, I have been using to test every configuration so far - the change is negligible: in the menu I get maybe 1-5 FPS more, but in the game itself it is either the same or only a very small amount better, and I actually think it is sometimes slower. My method is to run through the same map along the same path as far as I can, and check the FPS counter at predefined points. One of these points is crossing the goal, when the finishing animation plays - it is not a video or anything, just the controls are taken away and the penguin turns towards the player, so I can always watch the FPS counter while the scene stays essentially identical. At that point the FPS now varies more than before: earlier it was basically always around 16 FPS, now it jumps between 14 and 18. But it may just be that the measurement is more fine-grained now, and before it was smoothed out and averaged to 16 FPS.

      In any case there is maybe a very slight noticeable difference, so it seems a good idea to keep these settings. But the difference is barely noticeable, while the original problem is anything but a "barely noticeable" performance drop from the earlier system.

      I think there is some bigger issue here, and maybe we can rule out the kernel playing the biggest part in it. Still, the kernel changes did make a very slight difference in speed too.

      - Practically nothing is running in the background
      - The kernel is now optimized like crazy (of course I could still do multiple builds with different configs to squeeze out 1-2% more)
      - Mesa is built from latest git, and as far as I can tell from the source code there is only a Gallium driver for r300
      - The /etc/xorg.conf is as shown above, while the /etc/X11/xorg.conf.d/ directory is empty and not overriding anything (are there other files?)
      - I have not built X myself so far, but I doubt it would help. :-)

      I am not sure whether AGP acceleration is on or not. In dmesg I only see a single line about some driver being activated, which is why I tried "agp=try_unsupported" as an extra parameter in the grub config. It made no difference either.
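      A quick way to see whether the AGP side came up at all is to grep the kernel log and confirm the boot parameters really arrived (module naming here assumes the usual radeon/agpgart setup):

      ```shell
      dmesg | grep -iE 'agp|radeon'   # agpgart interface + bridge detection messages
      cat /proc/cmdline               # confirm agp=try_unsupported was really passed
      lsmod | grep -i agp             # is any agpgart bridge module loaded at all?
      ```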

      PS.: For the latest runs I did not put the video card into the "high" power profile, so I might gain 1 more FPS. I might gain another one if removing RETPOLINE=y did not disable all the Spectre mitigations either. So maybe there is still a 1-2% gain left in this kernel if I configure more, but the original problem is more like a 50-70% performance loss compared to my earlier system, on the same machine, with the same open source radeon driver.

      PS.: Earlier I used lightdm; now I just start an X server and dwm manually. Are there any userspace tricks a display manager might do that I am not doing? Also, is there any comprehensive information anywhere about what to check in the graphics pipeline? Maybe I am missing something very easy that makes the performance bad... :-(


      • #13
        Btw I have uploaded the current kernel config here so that in the future people can see it if interested:

        Btw I am not sure if it is needed, but my user is in the "video" group among others:

        $ groups
        sys ftp log http games video storage wheel adm prenex
        Also I hope that "card0" is my video card here; it does have the video group on it properly:

        $ ls -la /dev/dri/
        total 0
        crw-rw----+  1 root video  226,   0 May  21 11:10 card0
        drwxr-xr-x  19 root root       3.2K May  21 11:10 ..
        drwxr-xr-x   2 root root         80 May  21 11:10 by-path
        drwxr-xr-x   3 root root        100 May  21 11:10 .
        crw-rw-rw-   1 root render 226, 128 May  21 11:10 renderD128
        I have no idea what renderD128 is, but I have added my user to the render group too just in case, and it didn't change a thing (I haven't rebooted though, as I am writing this message). Also, glxgears runs at the same speed whether I am root or not.
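        As a sanity check on the permission side, a small helper like this reports whether the current user can actually open a node for read/write (device paths assumed from the ls output above; note that group changes only take effect after a fresh login):

        ```shell
        # Report whether a file/device node is readable and writable by us.
        check_node() {
            if [ -r "$1" ] && [ -w "$1" ]; then
                echo "$1: accessible"
            else
                echo "$1: NOT accessible"
            fi
        }

        check_node /dev/dri/card0
        check_node /dev/dri/renderD128
        ```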


        • #14
          Okay... I went ahead and asked "perf" where the roughly 100% of CPU time is being spent in extreme-tux-racer.

          The perf recording and a simple text report are here:

          Better to download the files, because they display badly in my browser; the lines are too wide for it.

          Also from the manpage of perf-report this is relevant to understand the textual output:
          The overhead can be shown in two columns as Children and Self when perf collects callchains. The self overhead is simply calculated by adding all period values of the entry - usually a function (symbol). This is the value that perf shows traditionally and sum of all the self overhead values should be 100%. The children overhead is calculated by adding all period values of the child functions so that it can show the total overhead of the higher level functions even if they don’t directly execute much. Children here means functions that are called from another (parent) function.
          So the "children" field is basically the sum of all the operations of the child functions up to the level where it is printed, and the file is sorted in that order. It shows that around 46% of the time goes into a sysenter that ends up in the radeon driver (so making the kernel faster actually helped here, I guess), and from there, without any big overhead, the radeon driver itself accounts for a further 43% of the CPU time.
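          For reference, the kind of invocation that produces this children/self breakdown is roughly the following (flags as documented in perf-record(1) and perf-report(1); "etr" is just the Extreme Tux Racer binary name here):

          ```shell
          perf record -g -- etr              # sample with call graphs while playing
          perf report --children --stdio     # text report sorted by children overhead
          perf report --no-children --stdio  # alternative: sort by self overhead only
          ```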

          Following the call stack, it seems some memory moving is going on, as indicated by the name of this function: "ttm_bo_handle_move_mem".

          Most of the time from there on (38%) is spent in a function called "get_page_from_freelist". It might be some kind of memory-management problem in the driver - or of course it could mean that I would get the same call stacks even on a system doing 200 FPS in extreme-tux-racer, which would do the same things, just not slowly. The perf tool can only measure where the CPU is; it cannot easily tell whether that point is GPU-bound, I/O-bound or whatever else.

          Or maybe it is still the overhead that does the most harm? If I scroll down I get to this part:

              38.90%    36.28%  etr              [kernel.vmlinux]             [k] get_page_from_freelist
                      |          entry_SYSENTER_32
                      |          do_fast_syscall_32
                      |          sys_ioctl
                      |          do_vfs_ioctl
                      |          radeon_drm_ioctl
                      |          drm_ioctl
                      |          drm_ioctl_kernel
                      |          radeon_gem_create_ioctl
                      |          radeon_gem_object_create
                      |          radeon_bo_create
                      |          ttm_bo_init
                      |          ttm_bo_init_reserved
                      |          ttm_bo_validate
                      |          ttm_bo_handle_move_mem
                      |          ttm_tt_bind
                      |          radeon_ttm_tt_populate
                      |          ttm_populate_and_map_pages
                      |          ttm_pool_populate
                      |          __alloc_pages_nodemask
                      |          get_page_from_freelist
                                 |          smp_apic_timer_interrupt
                                 |          |          
                                 |           --1.26%--hrtimer_interrupt
                                 |                     |          
                                 |                      --0.95%--__hrtimer_run_queues.constprop.5
                                 |                                |          
                                 |                                 --0.64%--tick_sched_timer
          It was quite a long time ago that I last analyzed perf output, and back then I was optimizing code against cache and branch misses, so I used different parameters than now... If the top part means that in this call stack the earlier functions do most of the work, then the overhead is still the biggest factor. In any case we know that the hottest code the game ends up in is get_page_from_freelist.

          Maybe there should not be so many memory moves at all, and that indicates some underlying (configuration?) problem.


          • #15
            I went further in this direction because I now have the source code for the kernel and mesa both at hand...

            One can find the relevant file in the linux kernel repository:

            This file contains this function (40.59% children CPU time according to perf):

            static int radeon_ttm_tt_populate(struct ttm_tt *ttm,
                        struct ttm_operation_ctx *ctx)
            {
                struct radeon_ttm_tt *gtt = radeon_ttm_tt_to_gtt(ttm);
                struct radeon_device *rdev;
                bool slave = !!(ttm->page_flags & TTM_PAGE_FLAG_SG);

                if (gtt && gtt->userptr) {
                    ttm->sg = kzalloc(sizeof(struct sg_table), GFP_KERNEL);
                    if (!ttm->sg)
                        return -ENOMEM;

                    ttm->page_flags |= TTM_PAGE_FLAG_SG;
                    ttm->state = tt_unbound;
                    return 0;
                }

                if (slave && ttm->sg) {
                    drm_prime_sg_to_page_addr_arrays(ttm->sg, ttm->pages,
                                     gtt->ttm.dma_address, ttm->num_pages);
                    ttm->state = tt_unbound;
                    return 0;
                }

                rdev = radeon_get_rdev(ttm->bdev);
            #if IS_ENABLED(CONFIG_AGP)
                if (rdev->flags & RADEON_IS_AGP) {
                    return ttm_agp_tt_populate(ttm, ctx);
                }
            #endif

            #ifdef CONFIG_SWIOTLB
                if (rdev->need_swiotlb && swiotlb_nr_tbl()) {
                    return ttm_dma_populate(&gtt->ttm, rdev->dev, ctx);
                }
            #endif

                return ttm_populate_and_map_pages(rdev->dev, &gtt->ttm, ctx);
            }
            Also, according to the perf output I am always hitting "ttm_populate_and_map_pages" here. That means neither the DMA populate nor the AGP populate is used before it, either because the ifdefs are false or because the conditions in the if statements are false for them.

            I can see my card in the lspci output, but that doesn't prove it is PCIe and not AGP, does it? I always set AGP values for it on my earlier system, but on this one they don't really seem to do anything, even when set in xorg.conf. If the card is AGP-accelerated but now cannot use AGP at all, that would easily explain the radical performance drop from my earlier system on the same machine...

            Looking at the kernel configuration, AGP support should be built as kernel module(s):

            # Graphics support
            This is from the kernel I built. But doesn't 'm' mean that a module is compiled for it, and once I copied the modules into place it should be there and in effect? How can I tell if it is missing?

            Btw CONFIG_SWIOTLB is not in my kernel config file at all - not even commented out...
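            To answer my own "how can I tell if it is missing" question, checks like these should show whether the options ended up in the config and whether the modules were actually installed and loaded (paths assume a standard module install):

            ```shell
            grep -E 'CONFIG_AGP|CONFIG_SWIOTLB' .config            # in the kernel source tree
            zgrep -E 'CONFIG_AGP|CONFIG_SWIOTLB' /proc/config.gz   # running kernel, if IKCONFIG is enabled
            find /lib/modules/"$(uname -r)" -name '*agp*'          # were the .ko files installed?
            lsmod | grep -i agp                                    # are they actually loaded?
            ```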

            It can also be that this line decides my card is not AGP even though it is in real life:

            if (rdev->flags & RADEON_IS_AGP) {
            This would also explain why the function ends up calling the generic, non-AGP solution instead. If I still had my old system, it would be really easy to see whether "ttm_agp_tt_populate" gets called on that system.... I usually don't remove old installs, but I neither had enough space, nor was it usable after a failed update to Ubuntu 18.04 (that was when I thought that if the update failed so miserably, I might as well swap to a different distro where at least the hand work pays off and which is more lightweight).

            But I kind of have the feeling that this is the problem: the card is capable of AGP while the system either thinks it is not, or is configured not to use it...
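            Just to illustrate the guard in question: the driver picks the populate path from a bit in rdev->flags, so if that bit never gets set, the AGP branch is dead code. A tiny standalone sketch (the DEMO_* values are made up stand-ins; the real RADEON_IS_AGP flag lives in the kernel's radeon.h):

            ```c
            #include <stdio.h>

            /* Made-up stand-ins for the real flag bits defined in radeon.h. */
            enum { DEMO_IS_PCI = 1 << 0, DEMO_IS_AGP = 1 << 1, DEMO_IS_PCIE = 1 << 2 };

            /* Mirrors the guard in radeon_ttm_tt_populate(): the AGP populate
             * path is taken only when the AGP bit is set in the device flags. */
            static int takes_agp_path(unsigned int flags)
            {
                return (flags & DEMO_IS_AGP) != 0;
            }

            int main(void)
            {
                /* If the driver decided the card is plain PCI, the bit is clear
                 * and the generic ttm_populate_and_map_pages() path runs instead. */
                printf("PCI flags -> AGP path: %d\n", takes_agp_path(DEMO_IS_PCI));
                printf("AGP flags -> AGP path: %d\n",
                       takes_agp_path(DEMO_IS_PCI | DEMO_IS_AGP));
                return 0;
            }
            ```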


            • #16
              A really ugly and dirty hack would be to remove the checks around the call and force the kernel driver to call the AGP version regardless. It might also help to change the [m] in the configuration into a [*], so that the AGP-related code is built directly into the kernel instead of being a module (this kernel is made for my machine only now, so who cares).

              But I think there might be some configuration issue, so before doing these radical things that might also cause harm, I should maybe look around a bit more...


              • #17
                Hmm... According to this page:


                This is what the dmesg output should look like if AGP acceleration is working:

                Sep 19 11:29:54 Debian-G5 kernel: pmac_zilog: 0.6 (Benjamin Herrenschmidt <[email protected]>)
                Sep 19 11:29:54 Debian-G5 kernel: Linux agpgart interface v0.103
                Sep 19 11:29:54 Debian-G5 kernel: agpgart-uninorth 0000:f0:0b.0: Apple U3 chipset
                Sep 19 11:29:54 Debian-G5 kernel: agpgart-uninorth 0000:f0:0b.0: configuring for size idx: 64
                Sep 19 11:29:54 Debian-G5 kernel: agpgart-uninorth 0000:f0:0b.0: AGP aperture is 256M @ 0x0
                My dmesg, however, only looks like this:

                [   10.422564] battery: ACPI: Battery Slot [BAT0] (battery present)
                [   10.707533] Linux agpgart interface v0.103
                [   10.901497] asus_laptop: Asus Laptop Support version 0.42
                I remember it looked like this before I ran my own kernel too, so the further lines are likely NOT missing because of a lower debug level or anything; I suspect they are simply not there. But I do not really understand this, because I have the AGP parameter set (radeon.agpmode=8), and I have even tried adding Option "NoAccel" "True" to my xorg.conf to see whether that would, for whatever reason, magically make AGP acceleration start working. It neither became visibly faster compared to export LIBGL_ALWAYS_SOFTWARE=1; nor logged anything different in dmesg. Actually, maybe it was around 5-6 FPS faster in glxgears than software rendering via the environment variable, but that might be due to other things, like the browser running in the background... I would say setting NoAccel does not enable AGP for me, so I only pay with no hardware acceleration at all and do not gain better memory throughput, contrary to what the linked page claims for their cases.

                I find it really likely, however, that AGP acceleration is now gone for my card, and I have no idea why. That seems to be the cause of my problem, and it would explain what I see very well...


                • #18
                  I think I see that my card has both an AGP and a PCI bridge (which seems natural to me):

                  $ lspci -knn
                  00:00.0 Host bridge [0600]: Advanced Micro Devices, Inc. [AMD/ATI] RC410 Host Bridge [1002:5a31] (rev 01)
                          Subsystem: ASUSTeK Computer Inc. RC410 Host Bridge [1043:13d7]
                          Kernel modules: ati_agp
                  00:01.0 PCI bridge [0604]: Advanced Micro Devices, Inc. [AMD/ATI] RC4xx/RS4xx PCI Bridge [int gfx] [1002:5a3f]
                  00:13.0 USB controller [0c03]: Advanced Micro Devices, Inc. [AMD/ATI] IXP SB4x0 USB Host Controller [1002:4374] (rev 80)
                          Subsystem: ASUSTeK Computer Inc. IXP SB4x0 USB Host Controller [1043:13d7]
                          Kernel driver in use: ohci-pci
                          Kernel modules: ohci_pci
                  Here you can see the ati_agp kernel module for the AGP part. Maybe I should really compile that module into the kernel itself instead of using it as a module, or I don't know what is wrong... Or maybe my card ended up on some AGP blacklist for some reason, and even "try_unsupported" doesn't help... weird...


                  • #19
                    Use the top command to find out whether dwm or other processes use too many hardware resources. Arch Linux has too many moving parts that can cause problems, and when using a single core only, you will see big changes.

                    This is my top output now:

                    top - 15:10:35 up  1:19,  4 users,  load average: 1,36, 1,45, 1,21
                    Tasks:  78 total,   1 running,  75 sleeping,   0 stopped,   2 zombie
                    %Cpu(s): 64,9 us,  6,5 sy,  0,0 ni, 28,6 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
                    MiB Mem :   1380,8 total,    136,6 free,    478,1 used,    766,0 buff/cache
                    MiB Swap:    988,3 total,    985,6 free,      2,8 used.    719,4 avail Mem
                      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                            
                      475 prenex     1   0 1004996 375044  95104 S  42,1  26,5  20:37.38 palemoon                                                                                           
                      349 prenex     1   0  161416  40212  22896 S   3,6   2,8   1:37.86 Xorg                                                                                               
                      361 prenex     1   0   22328  11544   8256 S   0,7   0,8   0:00.85 xterm                                                                                              
                      372 prenex     1   0    9096   2700   2388 S   0,3   0,2   0:00.32 dwm                                                                                                
                        1 root       1   0   16952   2696   2096 S   0,0   0,2   0:02.43 systemd                                                                                            
                        2 root       1   0       0      0      0 S   0,0   0,0   0:00.00 kthreadd                                                                                           
                        4 root       1 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/0:0H-kblockd                                                                               
                        6 root       1 -20       0      0      0 I   0,0   0,0   0:00.00 mm_percpu_wq                                                                                       
                        7 root       1   0       0      0      0 S   0,0   0,0   0:01.94 ksoftirqd/0                                                                                        
                        8 root     -51   0       0      0      0 S   0,0   0,0   0:00.00 idle_inject/
                    But palemoon has 10 open tabs.

                    Memory usage:

                    $ free -m
                                  total        used        free      shared  buff/cache   available
                    Mem:           1380         462         169           0         748         735
                    Swap:           988           2         985
                    I think that is pretty reasonable for a browser with phoronix and arch forums, gmail, facebook, reddit and some other random wiki pages open for the various distros.

                    For testing I always close the browser first. I will measure it for you that way, as that is how I run my tests. Also, while running my tests nothing but the tested app is running; not even pulseaudio is present, only alsa (and glxgears needs no sound anyway).

                    Also, the "perf" output shows the CPU spends most of its time in the driver, and I now highly suspect it does so because AGP acceleration is missing for some reason. I might be wrong, of course.


                    • #20
                      Top output after closing the browser:

                      top - 15:19:08 up  1:28,  4 users,  load average: 0,1, 0,75, 1,00
                      Tasks:  79 total,   1 running,  76 sleeping,   0 stopped,   2 zombie
                      %Cpu(s):  1,4 us,  0,5 sy,  0,0 ni, 98,1 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
                      MiB Mem :   1380,8 total,    553,2 free,     78,9 used,    748,7 buff/cache
                      MiB Swap:    988,3 total,    985,6 free,      2,8 used.   1119,4 avail Mem
                        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                            
                        361 prenex     1   0   22460  11628   8256 S   0,3   0,8   0:01.23 xterm                                                                                              
                        612 prenex     6   0   10884   3304   2792 R   0,3   0,2   0:00.04 top                                                                                                
                          1 root       1   0   16952   2696   2096 S   0,0   0,2   0:02.51 systemd                                                                                            
                          2 root       1   0       0      0      0 S   0,0   0,0   0:00.00 kthreadd                                                                                           
                          4 root       1 -20       0      0      0 I   0,0   0,0   0:00.00 kworker/0:0H-kblockd                                                                               
                          6 root       1 -20       0      0      0 I   0,0   0,0   0:00.00 mm_percpu_wq                                                                                       
                          7 root       1   0       0      0      0 S   0,0   0,0   0:01.96 ksoftirqd/0
                      Free memory after the browser has closed:

                      $ free -m
                                    total        used        free      shared  buff/cache   available
                      Mem:           1380          81         542           0         757        1116
                      Swap:           988           2         985
                      The latter surprises me, because right after boot the idle memory usage was originally only 50-60 MB, so something stayed cached even after the browser was closed. For all performance measurements (games, glxgears, anything) I always close the browser first, because I can easily save its open tabs and continue from that point later.

                      I am really starting to feel it is an AGP acceleration problem. Looking around in the driver source code, the dmesg message I find missing should have been printed: it is still in the code, and at a relevant place, it seems.

                      I have nothing against debian neither, once I even installed one on S/390 years ago with xfce (I see your nickname :-). I just I feel like I want to understand the root of the problem here and it seems like it is not a distro issue - or if it is still, then it is some valuable piece of configuration that I better know about anyways as it is key to good/acceptable 3D performance.