amdgpu questions


  • #21
    Thanks for the answers.
    I've got something new again. This one is not about graphics in particular but about recent AMD GPUs, so I hope it still applies to amdgpu.

    According to this Wikipedia infographic (https://upload.wikimedia.org/wikiped...hics_stack.svg), there is already support for TrueAudio in Linux.
    I understand TrueAudio to be an on-die DSP for audio that provides real-time sound mixing, effects processing, etc. I think it is primarily used for games, as it is also used in the Xbox and PlayStation.
    1) Is this even true? I have never read about it being used on Linux, nor have I found detailed info.
    2) Could this DSP be utilised by ALSA or is this done in the application itself, e.g. a game or video player?



    • #22
      Not 100% sure but I think ACP = TrueAudio.

      https://www.phoronix.com/scan.php?pa...osed-Linux-4.5



      • #23
        The ACP support upstream just exposes the ACP DMA engine for feeding the i2s codec. The DSP is not currently supported. AFAIK, there is no audio API on Linux at the moment that could make use of it.



        • #24
          Thanks. So adding DSP support is not on your agenda because you don't see any software using it?


          Could you tell me something about the GPU scheduler changes going on?
          I know that the GPU has a GCP (graphics command processor) as well as some ACEs (asynchronous compute engines). The block diagram for Fiji also shows 2 HWS (hardware schedulers?) as a replacement for 4 ACEs. Is the "gpu scheduler" support introduced last year in any way related to these units?

          If so, how does somebody access the ACEs/HWS to achieve the performance benefits*? Is this only possible with Vulkan's async compute features or could OpenGL or OpenCL workloads (graphics only, graphics+compute, compute only) also benefit from better scheduling?

          *I'm referring to game-targeting sites stating that ACEs do raise performance in console and DX12 games lately.



          • #25
            Originally posted by juno
            I know that the GPU has a GCP (graphics command processor) as well as some ACEs (asynchronous compute engines). The block diagram for Fiji also shows 2 HWS (hardware schedulers?) as a replacement for 4 ACEs. Is the "gpu scheduler" support introduced last year in any way related to these units?

            If so, how does somebody access the ACEs/HWS to achieve the performance benefits*? Is this only possible with Vulkan's async compute features or could OpenGL or OpenCL workloads (graphics only, graphics+compute, compute only) also benefit from better scheduling?
            There are two different "command processor" blocks in CI and above:

            - "ME" (Micro Engine, aka the graphics command processor, called CP on pre-CI when it was the only command processor in the chip)

            - "MEC" (Micro Engine Compute, aka the compute command processor)

            Some chips have two MECs; other parts have only one. So far one MEC (up to 32 queues) seems to be more than enough to keep the shader core fully occupied.

            The MEC block has 4 independent threads, referred to as "pipes" in engineering and "ACEs" (Asynchronous Compute Engines) in marketing. One MEC => 4 ACEs, two MECs => 8 ACEs. Each pipe can manage 8 compute queues, or one of the pipes can run HW scheduler microcode which assigns "virtual" queues to queues on the other 3/7 pipes.
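
The pipe and queue counts above can be tallied with a short sketch. The class and field names (`MecConfig`, `hws_pipes`) are made up for illustration and are not driver identifiers, and the sketch assumes that a pipe running HWS microcode manages no compute queues of its own.

```python
from dataclasses import dataclass

PIPES_PER_MEC = 4    # "pipes" in engineering terms, "ACEs" in marketing terms
QUEUES_PER_PIPE = 8  # compute queues each pipe can manage


@dataclass(frozen=True)
class MecConfig:
    """Hypothetical description of a chip's compute front end."""
    mec_count: int
    hws_pipes: int = 0  # pipes given over to HW scheduler microcode (assumption)

    @property
    def total_pipes(self) -> int:
        return self.mec_count * PIPES_PER_MEC

    @property
    def worker_pipes(self) -> int:
        # Pipes left to manage compute queues once HWS pipes are set aside.
        return self.total_pipes - self.hws_pipes

    @property
    def hardware_queues(self) -> int:
        return self.worker_pipes * QUEUES_PER_PIPE
```

With these numbers, `MecConfig(mec_count=1).hardware_queues` is 32, and `MecConfig(mec_count=2, hws_pipes=1).worker_pipes` is 7, matching the "3/7 pipes" figure.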

            IIRC the upstream radeon and amdgpu drivers expose a small number of compute queues via MEC today. I don't remember if clover OpenCL is using them - tstellar or agd5f would know.

            The GPU scheduler in amdgpu is similar but different - it is software that exposes per-process queues to the rest of the driver and then copies packets from per-process queues to the ME's HW queue in order to actually execute the work. It allows each process to stuff a lot of work into its queue without that process monopolizing the hardware. It also implements dependency logic to avoid having the HW locked up waiting for some other activity (eg an SDMA transfer) to complete when there is work from other processes that could run immediately.
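
As a rough illustration of the idea (not the amdgpu implementation), the toy model below keeps one queue per process, copies at most one ready job per process to the hardware queue in each round, and skips a process whose next job is still waiting on another activity. All names here (`SoftwareScheduler`, `submit`, `signal`, `schedule_round`) are hypothetical.

```python
from collections import deque


class SoftwareScheduler:
    """Toy model: per-process queues feeding a single hardware queue."""

    def __init__(self):
        self.queues = {}        # process name -> deque of (job, dependency)
        self.completed = set()  # finished activities, e.g. an SDMA transfer

    def submit(self, process, job, depends_on=None):
        self.queues.setdefault(process, deque()).append((job, depends_on))

    def signal(self, activity):
        # Called when some other activity (e.g. an SDMA transfer) finishes.
        self.completed.add(activity)

    def schedule_round(self):
        """Copy at most one ready job per process to the hardware queue."""
        hw_queue = []
        for process, queue in self.queues.items():
            if not queue:
                continue
            job, dependency = queue[0]
            if dependency is not None and dependency not in self.completed:
                continue  # blocked: leave it queued, let other processes run
            queue.popleft()
            hw_queue.append((process, job))
        return hw_queue


sched = SoftwareScheduler()
for i in range(3):
    sched.submit("A", f"a{i}")                  # A stuffs its queue with work
sched.submit("B", "b0", depends_on="sdma-copy")
sched.submit("C", "c0")

first = sched.schedule_round()   # [("A", "a0"), ("C", "c0")]: B is blocked
sched.signal("sdma-copy")
second = sched.schedule_round()  # [("A", "a1"), ("B", "b0")]
```

Process A cannot monopolize the hardware queue despite having the most work queued, and B's unfinished dependency never stalls C.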

            The GPU scheduler is also a prerequisite for things like the OpenGL robustness extension, where driver software needs to have a fairly precise understanding of exactly what work is in flight on the GPU so that if a GPU reset is required the driver can minimize the amount of lost work.

            Don't think OpenGL is using the MEC queues yet, but there is some work going on to let OpenCL use them. HSA has used them from day one, and any language running over the HSA stack takes advantage of them automatically. DirectX compute has been used by game developers for a long time, but AFAIK use of OpenGL's corresponding compute features is quite recent.

            We may end up having to implement a SW scheduler for the HSA / Radeon Open Compute stack in order to support things like tight cgroup integration, but so far the HW scheduler in MEC has been very useful.

            Even if we implemented a SW scheduler for the HSA/ROC stack we would still run it over the HWS since HWS takes care of things like Compute Wave Save/Restore (CWSR), ie the ability to preempt shader programs running on the shader core during execution, run shader programs from different queues, then resume the original shader programs later.
            Last edited by bridgman; 12 March 2016, 02:49 PM.



            • #26
              Wow, that was a lot of detail.

              The marketing material is confusing, e.g. this presentation from Hot Chips: http://www.hotchips.org/wp-content/u...i-AMD-GPU2.pdf
              Slide 14 suggests that there are in fact 4 ACEs as independent HW blocks and also 2 "HWS". The text even states "4 core Async Compute".
              I also didn't find this specific info in the ISA books.

              So by "one MEC (up to 32 queues) seems to be more than enough" you mean one MEC would be enough even for Fiji with its 64 CUs?

              Originally posted by bridgman
              Don't think OpenGL is using the MEC queues yet, but there is some work going on to let OpenCL use them. HSA has used them from day one, and any language running over the HSA stack takes advantage of them automatically. DirectX compute has been used by game developers for a long time, but AFAIK use of OpenGL's corresponding compute features is quite recent.
              Given the info from the marketing side and the media, I thought the ACEs were separate hardware blocks that had never been used until Vulkan and DirectX 12 brought async compute features. Based on your info above, it now seems to me that using the "ACEs"/the MEC (as it is the compute command processor) for pure compute tasks is quite common. But that contradicts your statement about "work going on to let OpenCL use them". Could you clarify that for me?
              I guess the same then goes for DirectX's compute shaders and OpenGL 4.3's ARB_compute_shader.



              • #27
                Is the Vulkan driver part of the new hybrid driver? How do they depend on each other? Is one ready and waiting for the other?



                • #28
                  Originally posted by juno
                  The marketing material is confusing, e.g. this presentation from Hot Chips: http://www.hotchips.org/wp-content/u...i-AMD-GPU2.pdf Slide 14 suggests that there are in fact 4 ACEs as independent HW blocks and also 2 "HWS". The text even states "4 core Async Compute".
                  I believe the slides were correct at the time they were presented -- we were thinking about running two HWS instances, and an HWS instance doesn't *have* to completely take over the pipe but IIRC we run it that way today -- and they might be correct in the future as well. The MEC block has a bunch of sub-blocks so you could draw the ACEs separately if you wanted, but the driver code interacts with MEC as a unit.

                  For better or worse there are a lot of different ways to draw a complex system, and for each of those ways there is someone who believes that representation is clearer and easier to understand than all the others.

                  Originally posted by juno
                  I also didn't find this specific info in the ISA books.
                  You won't see any mention in the ISA because this is one level of abstraction up from ISA. The ME/MEC blocks process DRAW and DISPATCH commands and work with other HW blocks in the graphics & compute pipelines to explode high-level workloads (triangle lists/fans etc. for graphics, N-dimensional ranges for compute) into a lot of individual shader threads, and each of those shader threads executes ISA code.
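
To make the "N-dimensional range into individual shader threads" step concrete for the compute case, here is a minimal counting sketch. It is plain Python enumeration, not how the hardware does it (the real blocks are fixed-function and pack threads into wavefronts), and `expand_dispatch` is a name invented for this example.

```python
from itertools import product


def expand_dispatch(grid, workgroup):
    """Yield (workgroup_id, local_id) for every thread in an N-D dispatch.

    `grid` is the number of workgroups per dimension and `workgroup` is the
    number of threads per workgroup in each dimension.
    """
    for group_id in product(*(range(n) for n in grid)):
        for local_id in product(*(range(n) for n in workgroup)):
            yield group_id, local_id


# A single 2-D dispatch of 2x3 workgroups, each holding 4x2 threads.
threads = list(expand_dispatch(grid=(2, 3), workgroup=(4, 2)))
```

One DISPATCH-style command with a 2x3 grid of 4x2 workgroups expands to 48 threads here, each of which would run the same ISA program with a different ID.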

                  Originally posted by juno
                  So by "one MEC (up to 32 queues) seems to be more than enough" you mean one MEC would be enough even for Fiji with its 64 CUs?
                  Yes, and the one MEC corresponds to the four ACEs in the diagram.

                  Originally posted by juno
                  Given the info from the marketing side and the media, I thought the ACEs were separate hardware blocks that had never been used until Vulkan and DirectX 12 brought async compute features. Based on your info above, it now seems to me that using the "ACEs"/the MEC (as it is the compute command processor) for pure compute tasks is quite common. But that contradicts your statement about "work going on to let OpenCL use them". Could you clarify that for me?
                  I guess the same then goes for DirectX's compute shaders and OpenGL 4.3's ARB_compute_shader.
                  It depends on the driver. For Catalyst, DX compute and OpenGL compute shaders pushed work to the GPU through the graphics queue, so async compute (using MEC queues) is relatively new there. I believe API support for async compute arrived in DX12, not sure if OpenGL has it or not. OpenCL may be using MEC queues today (it probably is, at least for OpenCL 2.0).

                  For amdgpu hybrid, work is being done now to use MEC queues for OpenCL, again I think it's for OpenCL 2.0 only. Again the HSA/ROC stack has used MEC queues from day one.

                  To add to the confusion, SI parts did not have MEC blocks but the ME block supported 1 graphics and 2 compute queues (so "two single-queue ACEs"). Note that "ring" and "queue" are interchangeable here since queues are implemented as ring buffers - driver code generally talks about rings, while HW docco generally talks about queues.
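
For readers unfamiliar with the term, a queue implemented as a ring buffer looks roughly like the sketch below: a fixed array plus a write pointer advanced by the producer and a read pointer advanced by the consumer, both wrapping around. The class is illustrative only; the names `wptr` and `rptr` are used here as labels for the two pointers, and the power-of-two size with one slot kept empty is a common convention assumed for simplicity.

```python
class RingBuffer:
    """Minimal single-producer, single-consumer ring of packets."""

    def __init__(self, size):
        if size < 2 or size & (size - 1):
            raise ValueError("size must be a power of two, at least 2")
        self.slots = [None] * size
        self.mask = size - 1
        self.wptr = 0  # next slot the producer (driver) writes
        self.rptr = 0  # next slot the consumer (hardware) reads

    def free_slots(self):
        # One slot stays empty so that "full" and "empty" are distinguishable.
        return (self.rptr - self.wptr - 1) & self.mask

    def push(self, packet):
        if self.free_slots() == 0:
            return False  # ring is full; the producer has to wait
        self.slots[self.wptr] = packet
        self.wptr = (self.wptr + 1) & self.mask
        return True

    def pop(self):
        if self.rptr == self.wptr:
            return None  # ring is empty
        packet = self.slots[self.rptr]
        self.rptr = (self.rptr + 1) & self.mask
        return packet
```

A ring of size 4 therefore holds at most 3 packets at once, and packets come out in the order they were pushed even after the pointers wrap.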

                  Originally posted by ernstp
                  Is the Vulkan driver part of the new hybrid driver? How do they depend on each other? Is one ready and waiting for the other?
                  Yes, part of the hybrid driver. The Vulkan userspace driver hooks in the same way as, say, OpenGL - command submission & resource management via the kernel driver, windowing interface via (not sure, haven't been keeping up on the details). The Vulkan userspace driver is the same for Linux & Windows, although the interface to kernel driver and windowing is obviously different.
                  Last edited by bridgman; 13 March 2016, 04:46 AM.



                  • #29
                    Originally posted by bridgman
                    It depends on the driver. For Catalyst, DX compute and OpenGL compute shaders pushed work to the GPU through the graphics queue, so async compute (using MEC queues) is relatively new there. I believe API support for async compute arrived in DX12, not sure if OpenGL has it or not. OpenCL may be using MEC queues today (it probably is, at least for OpenCL 2.0).

                    For amdgpu hybrid, work is being done now to use MEC queues for OpenCL, again I think it's for OpenCL 2.0 only. Again the HSA/ROC stack has used MEC queues from day one.
                    Okay. Can you tell me why it has not been possible to use the MECs before? Would it cause too much driver overhead to pick the compute parts out of the single, sequential (graphics) command buffer, put them in a separate compute buffer and assign that to the MECs? From my unqualified point of view, this approach would have allowed graphics and compute tasks to run asynchronously, even with DX<12 and OpenGL.
                    Also, there is this statement about driver overhead in the document "GCN Performance Tweets" (http://amd-dev.wpengine.netdna-cdn.c...ceTweets.pdf):
                    Avoid heavy switching between compute and rendering jobs. Jobs of the same type should be done consecutively.
                    Notes: GCN drivers have to perform surface synchronization tasks when switching between compute and rendering tasks. Heavy back-and-forth switching may therefore increase synchronization overhead and reduce performance.
                    Is this because of not using the MECs or is it just a 'natural restriction'?

                    Also, I think AnandTech got their table very wrong, which is sad. They state that 1 gfx and 8 compute queues simultaneously, and 8 compute queues for compute-only, are possible on both GCN 1.1 and 1.2.
                    http://www.anandtech.com/show/9124/a...ronous-shading



                    • #30
                      Originally posted by juno
                      Okay. Can you tell me why it has not been possible to use the MECs before? Would it cause too much driver overhead to pick the compute parts out of the single, sequential (graphics) command buffer, put them in a separate compute buffer and assign that to the MECs? From my unqualified point of view, this approach would have allowed graphics and compute tasks to run asynchronously, even with DX<12 and OpenGL.
                      Just lack of API support AFAIK -- all the driver knows is that the app submitted a bunch of different tasks in sequence, it doesn't know which of them can be executed out of sequence although it can make a guess by looking at src/dst buffer info. The general problem though is that AFAIK apps weren't written with a lot of parallelizable work until the latest round of APIs, so the src/dst buffer info is likely to suggest that no parallelizing is possible.
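
A sketch of the kind of guess described here, assuming each task simply lists the buffers it reads (`src`) and writes (`dst`). The task dictionaries and function names are invented for the example; a real driver would track far more state.

```python
def conflicts(earlier, later):
    """True if `later` must stay ordered after `earlier`."""
    return bool(
        earlier["dst"] & later["src"]     # read-after-write
        or earlier["src"] & later["dst"]  # write-after-read
        or earlier["dst"] & later["dst"]  # write-after-write
    )


def independent_tasks(tasks):
    """Return the names of tasks that conflict with no earlier task."""
    free = []
    for i, task in enumerate(tasks):
        if not any(conflicts(prev, task) for prev in tasks[:i]):
            free.append(task["name"])
    return free


# A chain where every task consumes the previous task's output.
chained = [
    {"name": "shadow_pass", "src": {"scene"}, "dst": {"shadow_map"}},
    {"name": "lighting", "src": {"shadow_map"}, "dst": {"lit"}},
    {"name": "post_fx", "src": {"lit"}, "dst": {"frame"}},
]

# The same chain with one compute task that touches unrelated buffers.
mixed = chained[:1] + [
    {"name": "particles", "src": {"particles_in"}, "dst": {"particles_out"}},
] + chained[1:]
```

For `chained`, only the first task is free to start, which is the "no parallelizing is possible" outcome mentioned above. For `mixed`, `particles` also comes out as independent and could in principle go to a separate queue.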

                      Originally posted by juno
                      Also, there is this statement about driver overhead in the document "GCN Performance Tweets" (http://amd-dev.wpengine.netdna-cdn.c...ceTweets.pdf):

                      Is this because of not using the MECs or is it just a 'natural restriction'?
                      Geez, you could have told me it was the very last tip and saved me a bunch of time.

                      I think that's just a consequence of having separate compute and graphics pipelines. My guess is that the tip should have specified that synchronization is needed only for read-after-write cases, ie where the output of a compute operation is input to a render operation or vice versa, but not 100% sure. The doc seemed to pre-date use of async compute IIRC.
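
If that guess holds, the cost of interleaving depends on which rule applies. The sketch below counts synchronization points for a job list under both readings: one per switch between job kinds, or one only when the later job reads what the earlier one wrote. It looks at adjacent jobs only, and the job format and function names are assumptions made for this example.

```python
def syncs_on_every_switch(jobs):
    """Count switches between compute and render in submission order."""
    return sum(1 for a, b in zip(jobs, jobs[1:]) if a["kind"] != b["kind"])


def syncs_on_raw_only(jobs):
    """Count switches where the later job reads what the earlier one wrote."""
    return sum(
        1
        for a, b in zip(jobs, jobs[1:])
        if a["kind"] != b["kind"] and a["writes"] & b["reads"]
    )


gbuffer = {"kind": "render", "reads": {"scene"}, "writes": {"gbuffer"}}
particles = {"kind": "compute", "reads": {"particles"}, "writes": {"particles_next"}}
shading = {"kind": "render", "reads": {"gbuffer"}, "writes": {"frame"}}
tonemap = {"kind": "compute", "reads": {"frame"}, "writes": {"tonemapped"}}

interleaved = [gbuffer, particles, shading, tonemap]
```

The `interleaved` order switches kind three times but contains only one read-after-write across a switch (`tonemap` reading `frame`), so the two rules give 3 and 1 respectively.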

                      Originally posted by juno
                      Also, I think AnandTech got their table very wrong, which is sad. They state that 1 gfx and 8 compute queues simultaneously, and 8 compute queues for compute-only, are possible on both GCN 1.1 and 1.2.
                      http://www.anandtech.com/show/9124/a...ronous-shading
                      Yeah, the table looks a bit off, seems to be mixing "driver supported queues", "HW supported queues", "HW supported ACEs" and other numbers.
                      Last edited by bridgman; 13 March 2016, 06:27 PM.

