AMD Radeon HD 7970 On Linux


  • #16
    Originally posted by liam View Post
    I'm not sure I follow you. It SEEMS as though you are saying you will use the open sourced compiler for the old r600 while the new driver will switch to the new one. I apologize if I've misunderstood.
    It's confusing because we used to only talk about graphics workloads (with the assumption that compute could flow through the same path) but now we're talking about different paths for graphics and compute, at least in some cases.

    Current thinking (if I remember it correctly ) is to tweak the pipe driver API to accept different IR formats (ie other than TGSI) then :

    For 6xx-NI graphics the Gallium3D pipe driver will receive TGSI shader programs from Mesa then use the existing code which converts vector TGSI operations directly into VLIW GPU operations

    For 6xx-NI compute the Gallium3D pipe driver will receive LLVM IR kernel programs from clover (or other compute front ends), then use "part 2" of the newly released code to generate VLIW GPU instructions from the LLVM IR (may not make great use of the VLIW at first)

    For GCN graphics the Gallium3D pipe driver will receive TGSI shader programs from Mesa, use "part 1" of the newly released code to convert TGSI to LLVM IR, then use "part 2" to generate GCN GPU instructions from the LLVM IR

    For GCN compute the Gallium3D pipe driver will receive LLVM IR kernel programs from clover (or other compute front ends) then use "part 2" of the newly released code to generate GCN GPU instructions from the LLVM IR

    I drew a picture on a whiteboard at work of all this yesterday showing Catalyst OpenCL, clover OpenCL and Mesa with all the different code paths and scenarios. Should have taken a picture of it before erasing, sorry.
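
    In lieu of the whiteboard, here's a very rough C sketch of the dispatch described above. The enum and function names are invented purely for illustration -- they are not the actual Gallium3D interfaces:

    ```c
    /* Illustrative sketch only -- these types and names are invented for
     * the example, not the real Gallium3D interfaces. */
    #include <stdio.h>

    enum ir_type    { IR_TGSI, IR_LLVM };
    enum gpu_family { FAMILY_R600_NI, FAMILY_GCN };

    struct shader_state { enum ir_type ir; const void *tokens; };

    static void compile_shader(enum gpu_family fam, const struct shader_state *s)
    {
        if (fam == FAMILY_R600_NI) {
            if (s->ir == IR_TGSI)
                puts("graphics: existing TGSI -> VLIW translation");
            else
                puts("compute: 'part 2' backend, LLVM IR -> VLIW");
        } else { /* FAMILY_GCN */
            if (s->ir == IR_TGSI)
                puts("graphics: 'part 1' TGSI -> LLVM IR, 'part 2' -> GCN ISA");
            else
                puts("compute: 'part 2' backend, LLVM IR -> GCN ISA");
        }
    }

    int main(void)
    {
        struct shader_state gfx    = { IR_TGSI, NULL };  /* from Mesa   */
        struct shader_state kernel = { IR_LLVM, NULL };  /* from clover */

        compile_shader(FAMILY_R600_NI, &gfx);
        compile_shader(FAMILY_GCN, &kernel);
        return 0;
    }
    ```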

    Originally posted by liam View Post
    Lastly, regarding DRM: I was attempting to give the impression that since hybrid mode will do most of the work through the CUs, we could ignore the small bit that the UVD handles and thus not use UVD at all, thus ignoring DRM
    That is definitely doable, and we'll probably do more work in that direction if it turns out we can't open UVD etc.., but the "small bit" handled by the fixed function hardware is the entropy encode/decode (CABAC, CAVLC etc..) which is still stubbornly expensive to implement on CPU or GPU. Motion estimation, motion comp and filtering are inherently parallel and a decent fit for shaders, but I don't think anyone has come up with a sufficiently parallel algorithm for entropy encode/decode yet.
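
    To illustrate why that part is so stubbornly serial, here is a toy sketch -- not real CABAC/CAVLC, just the shape of the dependency: each symbol's meaning depends on adaptive state updated by every previously decoded bit, so the bitstream can't simply be split across threads:

    ```c
    /* Toy sketch of why entropy decode is hard to parallelize.  Not real
     * CABAC/CAVLC -- just the serial dependency chain. */
    #include <stdint.h>
    #include <stdio.h>

    struct bitreader { const uint8_t *data; size_t len; size_t pos; };

    static int read_bit(struct bitreader *br)
    {
        if (br->pos >= br->len * 8)
            return 0;
        int bit = (br->data[br->pos / 8] >> (7 - br->pos % 8)) & 1;
        br->pos++;
        return bit;
    }

    int main(void)
    {
        const uint8_t stream[] = { 0xB5, 0x1C };
        struct bitreader br = { stream, sizeof(stream), 0 };
        unsigned context = 0;   /* adaptive decoder state */

        for (int i = 0; i < 8; i++) {
            int bit = read_bit(&br);
            /* The symbol depends on 'context', and 'context' depends on
             * every previously decoded symbol -- a strict serial chain. */
            int symbol = bit ^ (context & 1);
            context = context * 2 + symbol;
            printf("symbol %d = %d\n", i, symbol);
        }
        return 0;
    }
    ```
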
    Last edited by bridgman; 12-22-2011, 10:18 PM.



    • #17
      Much clearer to me now.

      Originally posted by bridgman View Post
      It's confusing because we used to only talk about graphics workloads (with the assumption that compute could flow through the same path) but now we're talking about different paths for graphics and compute, at least in some cases.

      Current thinking (if I remember it correctly ) is to tweak the pipe driver API to accept different IR formats (ie other than TGSI) then :

      For 6xx-NI graphics the Gallium3D pipe driver will receive TGSI shader programs from Mesa then use the existing code which converts vector TGSI operations directly into VLIW GPU operations

      For 6xx-NI compute the Gallium3D pipe driver will receive LLVM IR kernel programs from clover (or other compute front ends), then use "part 2" of the newly released code to generate VLIW GPU instructions from the LLVM IR (may not make great use of the VLIW at first)

      For GCN graphics the Gallium3D pipe driver will receive TGSI shader programs from Mesa, use "part 1" of the newly released code to convert TGSI to LLVM IR, then use "part 2" to generate GCN GPU instructions from the LLVM IR

      For GCN compute the Gallium3D pipe driver will receive LLVM IR kernel programs from clover (or other compute front ends) then use "part 2" of the newly released code to generate GCN GPU instructions from the LLVM IR
      Obviously you would know better than the guys at Anandtech, but I got a different impression as to the amount of difference in the compute bitstream from VLIW->non-VLIW SIMD. They seemed to say that the compiler was the ALL IMPORTANT COMPONENT in order to get decent utilization (meaning it was necessary to keep vast amounts of the program branches in memory and be REALLY good at best guesses for dependencies). Though they didn't say this part, I assume that since the compiler was so important for <=NI it becomes less so with >=SI, and would even need to be rewritten for the new architecture.
      What you seem to be saying is that <=NI will make good use of the IR->VLIW, which makes sense (also, graphics stays the same).
      For >=SI EVERYTHING goes through the new code (which I didn't know, but I also didn't know there were 2 code drops). Again, what surprises me is that the same compiler code used to generate VLIW is also being used to generate the new SIMD code.
      Again, if what Anandtech says is accurate it seems like the old compiler was so hideously complex that you'd want to jettison it as soon as possible instead of making it able to output to yet another kind of architecture (so presumably it now addresses VLIW4/5/SIMD).


      Originally posted by bridgman View Post
      I drew a picture on a whiteboard at work of all this yesterday showing Catalyst OpenCL, clover OpenCL and Mesa with all the different code paths and scenarios. Should have taken a picture of it before erasing, sorry.
      Well, it is the holidays after all, so just this once

      Originally posted by bridgman View Post
      That is definitely doable, and we'll probably do more work in that direction if it turns out we can't open UVD etc.., but the "small bit" handled by the fixed function hardware is the entropy encode/decode (CABAC, CAVLC etc..) which is still stubbornly expensive to implement on CPU or GPU. Motion estimation, motion comp and filtering are inherently parallel and a decent fit for shaders, but I don't think anyone has come up with a sufficiently parallel algorithm for entropy encode/decode yet.
      Yikes! I didn't realise that was being run by the UVD, but CAVLC shouldn't be that big of a deal to hand off to the CPU, so even if nothing comes of UVD (though we still hope, obviously) it's still good to know that there is some light emerging from the tunnel.

      Best/Liam



      • #18
        Originally posted by liam View Post
        Again, what surprises me is that the same compiler code used to generate VLIW is also being used to generate the new SIMD code.
        That's not really the case. The "GLSL compiler" is what generates the TGSI output that is currently used by the Gallium drivers and will be converted into LLVM IR in the new one. However, this is only about 1/3rd of the "complete" compiler.

        It doesn't know anything about the underlying hardware architecture; it's just a front-end for the GLSL language, much like a GCC frontend for C or C++ is shared across multiple architectures, with a separate backend portion that takes care of all the hardware details.

        The new LLVM IR code drop from AMD is that part, and it will completely replace the hardware dependent portion of the compiler in Mesa.


        Compilers tend to have 3 parts - the frontend reads the source code and stores an intermediate representation of it (IR), and only depends on the language being parsed.

        The back end is responsible for converting that IR into something the hardware can understand (typically binary instructions, though you can also have backends that output other languages, as the emscripten project does).

        The middle is where optimizations take place, and is typically split between the front and back end. Generic optimizations for the language go in the shared front-end, while hardware specific optimizations go in the backend.

        In Mesa, the GLSL compiler is the front end. TGSI and LLVM IR are both forms of IR used internally (so is the GLSL IR). The old 600g driver contains one backend, and the new code based on LLVM contains a new backend to be used for new hardware.
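
        A toy C analogy of that split (nothing to do with the real Mesa code, just the shape): one front-end parses a tiny "language" into an IR, and two different "hardware" back-ends consume the same IR:

        ```c
        /* Toy front-end / IR / back-end split.  The "language" is
         * single-digit sums like "1+2+3"; the IR is a flat operand array. */
        #include <stdio.h>

        #define MAX_OPS 16

        /* Front end: depends only on the source language. */
        static int frontend_parse(const char *src, int ir[MAX_OPS])
        {
            int n = 0;
            for (const char *p = src; *p && n < MAX_OPS; p++)
                if (*p >= '0' && *p <= '9')
                    ir[n++] = *p - '0';
            return n;
        }

        /* Back end A: "hardware" that wants one add per instruction. */
        static void backend_scalar(const int *ir, int n)
        {
            printf("scalar:");
            for (int i = 0; i < n; i++)
                printf(" ADD r0, %d;", ir[i]);
            printf("\n");
        }

        /* Back end B: "hardware" that packs operands 4-wide. */
        static void backend_wide(const int *ir, int n)
        {
            printf("wide:  ");
            for (int i = 0; i < n; i += 4) {
                printf(" VADD r0, [");
                for (int j = i; j < n && j < i + 4; j++)
                    printf("%d%s", ir[j],
                           (j + 1 < n && j + 1 < i + 4) ? "," : "");
                printf("];");
            }
            printf("\n");
        }

        int main(void)
        {
            int ir[MAX_OPS];
            int n = frontend_parse("1+2+3+4+5", ir);
            backend_scalar(ir, n);   /* same IR ...                   */
            backend_wide(ir, n);     /* ... different hardware target */
            return 0;
        }
        ```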



        • #19
          The open-source driver support is not yet available. No patches have magically landed this morning and they haven't released any support work in advance of the hardware's availability.
          This is one of the biggest problems in Linux for both developers and users IMHO. There has to be an easy and safe way for end-users to install/update drivers without having to depend on distributions and wait for months in order for code to be merged into the latest kernel.
          Last edited by zoomblab; 12-23-2011, 05:31 AM.



          • #20
            AMD recently had a developer conference that covered the technology. I believe Phoronix covered it. However, I'm not sure if any of that was for open source development. The architecture is such a massive change that I suspect it could be a year or more before we see optimized code for GCN.

            Originally posted by rinthos View Post
            @Bridgman and others looking for the 6 month cycle reference...I believe it's here:

            http://www.anandtech.com/show/5261/a...-7970-review/2

            "As a result both NVIDIA and AMD have begun revealing their architectures to developers roughly six months before the first products launch. This is very similar to how CPU launches are handled, where the basic principles of an architecture are publically disclosed months in advance".

            (the second page of the Anandtech review)
            ---
            I'm not sure why so much attention is paid to Anandtech's site. Sometimes I think the site is owned by Intel.



            • #21
              Southern Islands compute technologies.

              Would it be possible to work something out with Phoronix to write an article focused on the new compute capabilities of Southern Islands under Linux? One thing I'm interested in is how compute loads impact graphics usage: are we to the point where long-running compute jobs do not impact graphics significantly? I guess that is a question about support for threads. However, any info that you are free to spill that gives us a better mental image of improvements to Southern Islands compute performance would be welcomed.

              This question you probably can't answer, but when will we see a Fusion processor using Southern Islands technology?

              Originally posted by bridgman View Post
              I haven't looked much at UVD in the latest GPUs but I thought they were still UVD 3. Can't really talk much about UVD at this point because I don't know what we are going to be able to release, although we are going to take another look in the new year.

              I don't think we have looked at PowerTune specifically; my guess is that the first chance to do that will be a few months from now. There are some more fundamental power management improvements I would like to get out first, and they will probably be a pre-requisite for PowerTune anyways.



              • #22
                Originally posted by zoomblab View Post
                There has to be an easy and safe way for end-users to install/update drivers without having to depend on distributions and wait for months in order for code to be merged into the latest kernel.
                There is already a way to do this on Linux. It's the same way that you do it on Windows.

                If you switched to Linux because you want a pure open source platform, then yes, you will have to wait for a PPA of a future kernel RC (if you use Ubuntu) or compile a kernel RC or compile a release kernel+drm next (if you use another distro). This is due to the DRM component of the open-source drivers residing in the kernel.

                I predict that patches providing open-source support for the Radeon HD7xxx series will appear in 4 to 6 weeks time, given the following:
                • Support for the Radeon HD5xxx series took about 4.5 months
                • Support for the Radeon HD6xxx series took about 2.5 months
                • AMD has publicly stated that they are aiming for launch-day open-source support for the upcoming Radeon HD8xxx series

                Unfortunately this might be too late for the 3.3 kernel merge window, unless Dave and/or Alex can talk Linus into having an extended merge window. The last extended merge window was a bit worrying though.



                • #23
                  Originally posted by liam View Post
                  Lastly, regarding DRM: I was attempting to give the impression that since hybrid mode will do most of the work through the CUs, we could ignore the small bit that the UVD handles and thus not use UVD at all, thus ignoring DRM
                  I'm guessing it's probably not that simple, because the closed source driver has to protect the content all the way to the display. If, as you say, it does UVD -> *shader functions* -> display and the open source driver does CPU -> *shader functions* -> display and they share the same shader function, it'd make it a lot easier to look for that same call in the closed source driver and dump protected content. DRM is poison to openness. Anyway, as I understood it, hybrid mode was mostly for GPU-assisted encoding, not decoding - though I suppose it could implement some of the same functions.



                  • #24
                    Originally posted by madbiologist View Post
                    Unfortunately this might be too late for the 3.3 kernel merge window, unless Dave and/or Alex can talk Linus into having an extended merge window. The last extended merge window was a bit worrying though.
                    Linus has always stated that post-RC merges are allowed for bug fixes and support for bringing up new hardware - which this would qualify as



                    • #25
                      Originally posted by liam View Post
                      Obviously you would know better than the guys at Anandtech, but I got a different impression as to the amount of difference in the compute bitstream from VLIW->non-VLIW SIMD. They seemed to say that the compiler was the ALL IMPORTANT COMPONENT in order to get decent utilization (meaning it was necessary to keep vast amounts of the program branches in memory and be REALLY good at best guesses for dependencies). Though they didn't say this part, I assume that since the compiler was so important for <=NI it becomes less so with >=SI, and would even need to be rewritten for the new architecture.
                      Everything you are saying is correct, but you are assuming that the open source graphics driver already has that ALL IMPORTANT COMPILER for r6xx-NI and (quite reasonably) wondering why we don't use it for compute as well ? The answer is simple -- it doesn't have one.

                      The Catalyst driver has a fancy compiler but the r600/r600g open source drivers do not (at least they didn't the last time I looked). The current TGSI-VLIW compiler takes advantage of the fact that most of the TGSI instructions are 3- or 4-wide vector operations and translates them directly into 3- or 4-slot VLIW instructions.

                      The Catalyst shader compiler for pre-GCN analyzes dependencies and packs multiple operations into a single VLIW instruction... last time I looked the open source shader compiler did not. Either way, there is a lot of VLIW-specific code in the pre-GCN shader compilers which is not needed for GCN, to the point where it's easier to start over and leverage more conventional compilers which would have sucked on VLIW (absent something like LunarGLASS) but which should fit GCN peachy-fine.
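
                      As a toy illustration (not the real r600g or Catalyst code, names invented): a vec4 op fills a 4-slot bundle for free, while a chain of dependent scalar ops wastes slots unless the compiler can prove independence:

                      ```c
                      /* Toy sketch of why VLIW packing needs dependency analysis. */
                      #include <stdio.h>

                      #define SLOTS 4

                      struct op { const char *name; int dst; int src; };

                      static void pack(const struct op *ops, int n)
                      {
                          int used = 0, last_dst = -1;
                          for (int i = 0; i < n; i++) {
                              /* read-after-write on the previous op => new bundle */
                              if (used == SLOTS || ops[i].src == last_dst) {
                                  printf("bundle: %d/%d slots used\n", used, SLOTS);
                                  used = 0;
                              }
                              used++;
                              last_dst = ops[i].dst;
                          }
                          printf("bundle: %d/%d slots used\n", used, SLOTS);
                      }

                      int main(void)
                      {
                          /* vec4 MUL: four independent lanes, one full bundle */
                          const struct op vec4_mul[] = {
                              { "MUL.x", 1, 0 }, { "MUL.y", 2, 0 },
                              { "MUL.z", 3, 0 }, { "MUL.w", 4, 0 },
                          };
                          /* scalar chain: each ADD reads the previous result */
                          const struct op chain[] = {
                              { "ADD", 5, 0 }, { "ADD", 6, 5 }, { "ADD", 7, 6 },
                          };

                          pack(vec4_mul, 4);   /* 4/4 slots used  */
                          pack(chain, 3);      /* 1/4, 1/4, 1/4   */
                          return 0;
                      }
                      ```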

                      Originally posted by liam View Post
                      What you seem to be saying is that <=NI will make good use of the IR->VLIW, which makes sense (also, graphics stays the same).
                      Honestly, we weren't expecting great efficiency at first no matter which path we took -- LLVM IR to VLIW or LLVM IR to TGSI to VLIW. We went with the LLVM IR to VLIW route for a few reasons :

                      - it was the shortest path to getting GPU acceleration into clover
                      - since the TGSI to VLIW path didn't have much relevant optimization we didn't think we would lose performance by going direct from LLVM IR to VLIW
                      - it gave us a way to test out the LLVM to GPU instruction code on available hardware before we had GCN boards
                      - it produced code which was more in line with what other developers were looking for in order to build other compute stacks

                      Neither approach would take much advantage of VLIW hardware for compute at first. If the graphics shader compiler gets more optimized in the future (or already is and we missed it ) we would probably try the LLVM IR to TGSI to VLIW path, but I think we would have started with this approach anyways because of the other benefits above.

                      Originally posted by liam View Post
                      For >=SI EVERYTHING goes through the new code (which I didn't know, but I also didn't know there were 2 code drops). Again, what surprises me is that the same compiler code used to generate VLIW is also being used to generate the new SIMD code.
                      I don't remember if it was 2 code drops or 1 drop with 2 parts. I think it was 1 drop with some LLVM patches and some Mesa/Gallium3D driver patches.

                      Originally posted by liam View Post
                      Again, if what Anandtech says is accurate it seems like the old compiler was so hideously complex that you'd want to jettison it as soon as possible instead of making it able to output to yet another kind of architecture (so presumably it now addresses VLIW4/5/SIMD).
                      That's essentially what we are doing (even though in our case the old compiler was not hideously complex ).

                      For GCN both graphics and compute will go through the new LLVM paths.
                      Last edited by bridgman; 12-23-2011, 10:58 AM.



                      • #26
                        Originally posted by wizard69 View Post
                        Would it be possible to work something out with Phoronix to write an article focused on the new compute capabilities of Southern Islands under Linux?
                        Don't know but I'll ask around.

                        Originally posted by wizard69 View Post
                        One thing I'm interested in is how compute loads impact graphics usage: are we to the point where long-running compute jobs do not impact graphics significantly? I guess that is a question about support for threads. However, any info that you are free to spill that gives us a better mental image of improvements to Southern Islands compute performance would be welcomed.
                        We pushed code for "multiple ring support" a couple of months ago :

                        http://www.phoronix.com/scan.php?pag...tem&px=MTAwNzg

                        A number of things will use that code in the future, but one of them is allowing compute operations to go through a separate command queue from graphics operations so that the hardware can flip between tasks at a fairly fine-grained level. The multiple ring support started with Cayman but GCN is the first generation where I expect we will really use it.
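
                        Conceptually the submission path looks something like the sketch below -- the names are invented for illustration and are not the real libdrm/radeon interfaces; in the real driver the submit would be an ioctl carrying the command stream plus the ring it targets:

                        ```c
                        /* Conceptual sketch of "multiple ring" submission.  The
                         * names here are invented, NOT the real libdrm/radeon API. */
                        #include <stdio.h>

                        enum ring_id { RING_GFX, RING_COMPUTE };

                        struct cmd_stream { enum ring_id ring; const char *what; };

                        /* Hypothetical submit helper standing in for the ioctl. */
                        static void submit(const struct cmd_stream *cs)
                        {
                            printf("submit '%s' to %s ring\n", cs->what,
                                   cs->ring == RING_GFX ? "graphics" : "compute");
                        }

                        int main(void)
                        {
                            /* Long-running compute work no longer sits in the same
                             * queue as the frame's draw calls; the hardware can
                             * switch between rings at a fairly fine grain. */
                            struct cmd_stream frame = { RING_GFX, "draw calls" };
                            struct cmd_stream job   = { RING_COMPUTE, "OpenCL kernel" };

                            submit(&job);
                            submit(&frame);
                            return 0;
                        }
                        ```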

                        Originally posted by wizard69 View Post
                        This question you probably can't answer, but when will we see a Fusion processor using Southern Islands technology?
                        Correct, I can't answer

                        Originally posted by FireBurn View Post
                        Linus has always stated that post-RC merges are allowed for bug fixes and support for bringing up new hardware - which this would qualify as
                        We are trying to get all the invasive changes (multiple rings, memory management etc..) pushed out in time for the merge window. Hopefully the remaining changes for GCN will be specific to new HW, but I don't think we have discussed getting them in post-merge yet.

                        BTW from this point on I'm probably going to switch from talking about GCN to talking about SI (the first generation of GCN parts), partly because it's one less letter (I'm big into efficiency) and partly because that's the terminology we use internally and I'm getting tired of typing SI, backspacing over it and typing GCN instead.
                        Last edited by bridgman; 12-23-2011, 10:59 AM.



                        • #27
                          Bridgman, Anandtech states GCN cards have an IOMMU and can access the full system RAM.

                          - how will this affect the driver?
                          - GPU malware. What, if anything, can the driver/kernel/compiler do against these?



                          • #28
                            Originally posted by curaga View Post
                            Bridgman, Anandtech states GCN cards have an IOMMU and can access the full system RAM.

                            - how will this affect the driver?
                            - GPU malware. What, if anything, can the driver/kernel/compiler do against these?
                            The IOMMU will actually be in the CPU/NB, not the GPU :

                            http://www.anandtech.com/show/4455/a...-for-compute/6

                            Note that current GPUs can access the full system RAM already, and one of the important jobs of the kernel driver is making sure that the GPU only accesses the bits it is supposed to access. If you see "CS checker" mentioned in the driver discussions or patches that's the relevant piece.

                            The cool thing is that future GPUs will be able to use future IOMMUs to manage system RAM accesses rather than having to maintain a parallel implementation using different hardware. As a consequence, the GPU will need to work with virtual addresses rather than the pre-translated physical addresses it uses today. The memory management changes we are hoping to push for the upcoming merge window are a first step in that direction.

                            One of the design challenges is making sure that future GPUs can still work well on hardware which does not have an ATS/PRI-capable IOMMU, and the initial code we are pushing out will be aimed at the more general case, i.e. running on existing CPU/NB hardware without relying on an IOMMU or ATS/PRI support.
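
                            For anyone wondering what the CS checker conceptually does, here's a toy model (invented structures, not the real kernel code): every buffer reference in a submitted command stream is validated against the buffers the submitter actually owns before the GPU ever sees it:

                            ```c
                            /* Toy model of a "CS checker".  Invented names and
                             * structures, not the real radeon kernel code. */
                            #include <stdbool.h>
                            #include <stdio.h>

                            struct bo       { unsigned handle; unsigned size; };
                            struct cs_reloc { unsigned handle; unsigned offset;
                                              unsigned length; };

                            static bool cs_check(const struct bo *bos, int nbos,
                                                 const struct cs_reloc *r)
                            {
                                for (int i = 0; i < nbos; i++)
                                    if (bos[i].handle == r->handle)
                                        /* reference must stay inside the buffer */
                                        return r->offset + r->length <= bos[i].size;
                                return false;   /* buffer the process doesn't own */
                            }

                            int main(void)
                            {
                                const struct bo owned[] = { { 1, 4096 }, { 2, 65536 } };
                                const struct cs_reloc ok  = { 2, 1024, 256 };
                                const struct cs_reloc bad = { 7, 0, 16 };

                                printf("ok reloc:  %d\n", cs_check(owned, 2, &ok));
                                printf("bad reloc: %d\n", cs_check(owned, 2, &bad));
                                return 0;
                            }
                            ```
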
                            Last edited by bridgman; 12-23-2011, 11:59 AM.



                            • #29
                              Well, the AMD PR so far implied that the card could bypass the CPU and access system RAM on its own. It was also mentioned that new frameworks could be used instead of just OpenCL and DX11 (C++, pointer support etc).

                              I believe the CS checker can only validate what gets sent to the GPU. Could you not write a GPGPU program that calculates a pointer address at runtime (= on the GPU, thus after the CS checker)?



                              • #30
                                Originally posted by curaga View Post
                                Well, the AMD PR so far implied that the card could bypass the CPU and access system RAM on its own.
                                True, but that is not new. The new part is...

                                Originally posted by curaga View Post
                                It was also mentioned that new frameworks could be used instead of just OpenCL and DX11 (C++, pointer support etc).
                                Yep. AFAIK the big change with SI is that shader programs are able to contain and generate addresses, so...

                                Originally posted by curaga View Post
                                I believe the CS checker can only validate what gets sent to the GPU. Could you not write a GPGPU program that calculates a pointer address at runtime (= on the GPU, thus after the CS checker)?
                                Correct, and that's why we had to change the memory management design as a pre-requisite to implementing SI support.

                                On previous GPUs you could more or less control system memory accesses by controlling the commands used to set up the GPU, so the CS checker could do that. Starting with SI, we run all system memory accesses through the on-chip page tables; the page tables protect system memory, and the CS checker protects the page tables.

                                The open source drivers have used the page tables for a few years; they're just getting used more now because they're the most practical way to (a) deal with relocations (translating handles to physical addresses) in shader programs and (b) limit shader program access to specific areas of memory. The alternative was to re-patch the shader programs every time a buffer moved and to inspect each shader program to make sure it couldn't go outside its allocated buffers.
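
                                A rough sketch of the idea (invented names, not the real radeon code): the shader keeps using a stable GPU virtual address, and only the page-table entry changes when the buffer's backing pages move, so nothing needs re-patching:

                                ```c
                                /* Toy model: stable GPU VA, mutable translation. */
                                #include <stdint.h>
                                #include <stdio.h>

                                struct mapping {
                                    uint64_t gpu_va;  /* address the shader uses */
                                    uint64_t phys;    /* where the pages live    */
                                };

                                static void show(const struct mapping *m)
                                {
                                    printf("VA 0x%llx -> phys 0x%llx\n",
                                           (unsigned long long)m->gpu_va,
                                           (unsigned long long)m->phys);
                                }

                                int main(void)
                                {
                                    struct mapping buf = { 0x100000, 0xA0000000 };

                                    show(&buf);             /* shader sees VA 0x100000 */
                                    buf.phys = 0xB8000000;  /* pages moved: update PTE */
                                    show(&buf);             /* same VA, no re-patching */
                                    return 0;
                                }
                                ```
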
                                Last edited by bridgman; 12-23-2011, 05:23 PM.

