RadeonSI Lands Improved Scaling For Shader Compiler Threads


  • RadeonSI Lands Improved Scaling For Shader Compiler Threads

    Phoronix: RadeonSI Lands Improved Scaling For Shader Compiler Threads

    Merged this week were some minor changes to AMD's RadeonSI Gallium3D OpenGL open-source driver around the shader selector code. One of the changes in particular though is noteworthy...

    https://www.phoronix.com/scan.php?pa...r-Comp-Threads

  • #2
    I'm a little curious how the driver knows what to scale to. For example, does it add another thread for every 10ms that it still isn't done compiling?

    • #3
      Originally posted by schmidtbag View Post
      I'm a little curious how the driver knows what to scale to. For example, does it add another thread for every 10ms that it still isn't done compiling?
      I assume it only scales when additional parallel work is available, and that it has a thread pool with a maximum size which instantiates threads on an as-needed basis.

      • #4
        Originally posted by schmidtbag View Post
        I'm a little curious how the driver knows what to scale to.
        Probably, each shader compilation is single-threaded. So, it should scale based on the number of shaders that need to be compiled (limited by available hardware threads, of course).

        I think we should get away from each process spinning up its own worker threads. The OS should provide work queues and dispatch work to available cores for you. This lets the OS decide how to allocate cores between different processes without unduly hampering them. It also supports co-scheduling tasks from a single app, so that minimal time is lost during synchronization if the system isn't otherwise idle.

        It would also solve the problem of apps having multiple thread pools - one for each of several libraries - which then fight over the available CPU cores. The problem gets even worse if you have multiple such apps running at a given point in time.

        • #5
          Originally posted by coder View Post
          Probably, each shader compilation is single-threaded. So, it should scale based on the number of shaders that need to be compiled (limited by available hardware threads, of course).

          I think we should get away from each process spinning up its own worker threads. The OS should provide work queues and dispatch work to available cores for you. This lets the OS decide how to allocate cores between different processes without unduly hampering them. It also supports co-scheduling tasks from a single app, so that minimal time is lost during synchronization if the system isn't otherwise idle.

          It would also solve the problem of apps having multiple thread pools - one for each of several libraries - which then fight over the available CPU cores. The problem gets even worse if you have multiple such apps running at a given point in time.
          "This lets the OS decide how to allocate cores between different processes, without unduly hampering them."
          Except that is exactly what an OS doesn't know how to do... what you are suggesting is exactly backwards.

          Often what DOES know this information is the userland software...

          • #6
            Originally posted by schmidtbag View Post
            I'm a little curious how the driver knows what to scale to. For example, does it add another thread for every 10ms that it still isn't done compiling?
            A prior patch in the same series is instructive:

            The original code waited for the queue to be full before adding more threads. This made the thread count grow slowly, especially if the queue also uses UTIL_QUEUE_INIT_RESIZE_IF_FULL.

            This commit changes this behavior: now a new thread is spawned if we're adding a job to a non-empty queue because this means that the existing threads fail to process jobs faster than they're queued.

            • #7
              Originally posted by cb88 View Post
              "This lets the OS decide how to allocate cores between different processes, without unduly hampering them."
              Except that is exactly what an OS doesn't know how to do... what you are suggesting is exactly backwards.

              Often what DOES know this information is the userland software...
              I think it's best to consider an example:

              Process A spins up a worker thread for each vCPU core on the machine where it's running. When it has some big pile of work to do, it dispatches tasks to each of these workers and then waits for them to complete. The wait might be explicit, or maybe the next action is triggered by the completion of the last work item.

              What process A doesn't know is that processes B and C are also doing stuff, occupying some of the cores.

              Some thread from process A could get enough runtime to complete some of the work items, but gets preempted in the middle of one. Now, the objective of process A is delayed until that thread can be scheduled again. In the worst case, process A is pipelined and keeps packing its work queues with items from different objectives, so that thread won't run until some time later, and the completion of its objective is delayed until some other worker is switched out to give the first thread some runtime. The net effect is higher latency until objectives complete, plus more context-switching.

              The problem is that the OS has too little visibility into what the threads are doing and what dependency relationships exist between them, so it can't schedule the threads optimally. If the OS instead provided a work-queue API, it could ensure that work items are completed according to what the submitting process requested, and it could avoid starting more work items, at any given time, than can actually be scheduled for that process.

              Another problem this addresses is work items using some shared resource or data structure. In that case, one work item could block on a mutex held by another whose worker thread isn't currently running. Ideally, all of a process's threads would run at the same time; that would not only reduce lock contention and context-switching, but also improve cache efficiency. Yet it can't happen if the process starts more threads than there are available vCPU cores.
