"Ask ATI" dev thread

  • Originally posted by bridgman View Post
    I can't really comment on unreleased products, unfortunately.

    My understanding is that the current F@H implementation uses essentially the same code for 6xx and 7xx parts, so it does not take advantage of LDS/GDS on the 7xx parts. I imagine that's where the discussion of "calculate twice vs store and re-use" comes from.
    It's strange that ATI didn't work with Folding@home to fix such a high-profile GPGPU application.

    I guess everything will be better and easier to fix once we have an open-source OpenCL implementation and once (if) Folding@home releases their code. Since FAH is based on GROMACS, and GROMACS is open source, I guess this shouldn't be impossible. I'll go back to my little hole now and wait for OpenCL and 57xx, in that order.



    • Originally posted by nanonyme View Post
      Why not just ask them to cure cancer while they're at it? And maybe bring world peace too? It's practically impossible to hit every use case every user has in mind. Release versions are always buggy, period. Sure, you can probably stabilize it very far by doing what some enterprise distros do with their software, as in freeze the inclusion of new features for so long that users get frustrated, but who would be happy if ATi announced there will be no new features for the next year, only bug fixes? (features including support for new kernels and X.org versions) The only other alternative is that each version has slightly different bugs, but all have bugs. (regressions happen)
      Oh yes, regressions... the Catalyst 9.10 beta on the Ubuntu 9.10 release brings back the black-skin bug in Wine 1.1.30/Oblivion 1.2 AGAIN! Catalyst 9.7 fixed the bug and now 9.10 is broken again...

      Overall... the radeon/radeonhd driver is the bugfix for Catalyst...



      • When will the documentation for the RS780 series of graphics chips be coming out, along with documentation on handling the power management features of these chips?



        • Scroll down to the "Chipset Guides and Documentation" section :

          http://developer.amd.com/documentati...s/default.aspx

          Most of the information required for power management is already out there - the main issue is that dynamic power management really needs to be implemented in a KMS-enabled DRM so that the PM code can (a) have access to all the required activity information, and (b) avoid hardware access conflicts between PM code (which needs to be in drm) and modesetting code. The issue there is that a couple of register locations are used for both PM and modesetting functions.
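
          To illustrate the register-sharing point, here is a purely hypothetical sketch in C (nothing below is actual radeon DRM code, and the register/bit names are made up): if power management and modesetting both do read-modify-write cycles on the same register, they need to take a common lock, which is only practical when both live in the same (KMS-enabled) driver.

          #include <stdint.h>
          #include <pthread.h>

          /* Hypothetical illustration only -- not actual radeon DRM code.
           * Two code paths doing unserialized read-modify-write on a shared
           * register would clobber each other's bits, so both need to hold
           * the same lock, i.e. live in the same driver. */
          static pthread_mutex_t mmio_lock = PTHREAD_MUTEX_INITIALIZER;
          static uint32_t shared_reg;                /* stands in for an MMIO register */

          #define PM_CLOCK_DIV_MASK  0x000000ffu     /* made-up bit fields */
          #define MODE_TIMING_MASK   0x00ffff00u

          static void pm_set_clock_divider(uint32_t div)
          {
              pthread_mutex_lock(&mmio_lock);
              uint32_t v = shared_reg;                          /* read          */
              shared_reg = (v & ~PM_CLOCK_DIV_MASK) | div;      /* modify, write */
              pthread_mutex_unlock(&mmio_lock);
          }

          static void modeset_set_timing(uint32_t timing)
          {
              pthread_mutex_lock(&mmio_lock);
              uint32_t v = shared_reg;
              shared_reg = (v & ~MODE_TIMING_MASK) | (timing << 8);
              pthread_mutex_unlock(&mmio_lock);
          }

          int main(void)
          {
              pm_set_clock_divider(0x42);
              modeset_set_timing(0x0400);
              return (int)(shared_reg != 0x00040042u);   /* sanity check */
          }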

          There are a couple of things we still need to document -- some missing bits in the AtomBIOS power-related tables and the on-chip fan controller for sure. They're next on the list after we get interrupts working on 6xx/7xx.
          Last edited by bridgman; 10-13-2009, 05:52 PM.



          • Superscalar vs. VLIW

            I'm not sure this is the right thread, but I think it is better to ask an AMD/ATI dev. Various journalists, portals, and forum members all around the internet call the ATI R600-R800 architecture superscalar.

            But AFAIK the ATI architecture is VLIW, not superscalar. Both superscalar and VLIW achieve the same goals, but the implementations are different. A superscalar architecture uses HW dependency checking among the instructions, which means the chip is bigger. VLIW, on the other hand, uses SW dependency checking, so it depends heavily on the compiler and thus the chips can be smaller.

            So it seems to me that ATI chose the VLIW way (the HD5870 has 320 VLIW cores) and nVidia the superscalar way (the GT200 has 120 superscalar, or 240 scalar, cores).

            Do I understand it right? Is the ATI architecture VLIW, relying heavily on the compiler to do instruction dependency checking?



            • To those of you who asked most of the questions in this thread: it doesn't look like AMD is ever going to finish the formal Q&A... so you can probably stop asking questions.
              Michael Larabel
              http://www.michaellarabel.com/



              • Originally posted by next9 View Post
                I'm not sure this is the right thread, but I think it is better to ask an AMD/ATI dev. Various journalists, portals, and forum members all around the internet call the ATI R600-R800 architecture superscalar.

                But AFAIK the ATI architecture is VLIW, not superscalar. Both superscalar and VLIW achieve the same goals, but the implementations are different. A superscalar architecture uses HW dependency checking among the instructions, which means the chip is bigger. VLIW, on the other hand, uses SW dependency checking, so it depends heavily on the compiler and thus the chips can be smaller.

                So it seems to me that ATI chose the VLIW way (the HD5870 has 320 VLIW cores) and nVidia the superscalar way (the GT200 has 120 superscalar, or 240 scalar, cores).

                Do I understand it right? Is the ATI architecture VLIW, relying heavily on the compiler to do instruction dependency checking?
                Are you Spyhawk on Beyond3D ? I just answered the same question there

                Anyways, most definitions of superscalar include VLIW as a subset. Some distinguish between "static superscalar" (VLIW) and "dynamic superscalar". I haven't found any definitions of superscalar which exclude VLIW but I'm sure they exist.

                ATI GPUs are superscalar via VLIW, or just "VLIW" if you don't consider VLIW to be a subset of superscalar. They do depend on having the shader compiler identify instruction level parallelism, but since most graphics operations deal with 3- or 4-element vectors anyways (pixels are almost always RGBA, vertices and normals are either float3 or float4) you can get decently high ALU utilization even with a simple translator like we use in the r600 mesa driver today. The approach is similar to the vector+scalar ALUs we used in r3xx-r5xx, but more general and so more useful for compute workloads.

                Extracting instruction-level-parallelism in the compiler is much more difficult with a typical CPU workload, where most of the operations are scalar. It's the high proportion of short vectors in a graphics or HPC workload which makes a VLIW approach to superscalar GPU hardware attractive.
                Last edited by bridgman; 10-16-2009, 06:59 PM.
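
                For illustration, here is a rough sketch in plain C (not shader code or actual r600 ISA) of the kind of instruction-level parallelism the shader compiler gets almost for free with short vectors: the four component multiply-adds below are independent of each other, so they can fill the x/y/z/w slots of a single VLIW bundle instead of issuing as four serial scalar instructions.

                #include <stdio.h>

                /* Rough illustration only -- plain C, not real shader code or r600 ISA. */
                typedef struct { float x, y, z, w; } vec4;

                static vec4 mad4(vec4 a, vec4 b, vec4 c)
                {
                    vec4 r;
                    r.x = a.x * b.x + c.x;   /* independent -> VLIW slot x */
                    r.y = a.y * b.y + c.y;   /* independent -> VLIW slot y */
                    r.z = a.z * b.z + c.z;   /* independent -> VLIW slot z */
                    r.w = a.w * b.w + c.w;   /* independent -> VLIW slot w */
                    return r;
                }

                int main(void)
                {
                    vec4 color   = {0.5f, 0.25f, 1.0f, 1.0f};
                    vec4 light   = {0.8f, 0.80f, 0.8f, 1.0f};
                    vec4 ambient = {0.1f, 0.10f, 0.1f, 0.0f};
                    vec4 o = mad4(color, light, ambient);
                    printf("%.2f %.2f %.2f %.2f\n", o.x, o.y, o.z, o.w);
                    return 0;
                }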



                • Originally posted by bridgman
                  Are you Spyhawk on Beyond3D ? I just answered the same question there
                  No. I'm not.

                  Anyways, most definitions of superscalar include VLIW as a subset. Some distinguish between "static superscalar" (VLIW) and "dynamic superscalar". I haven't found any definitions of superscalar which exclude VLIW but I'm sure they exist.
                  There are many academic presentations claiming that superscalar and VLIW are opposite approaches.

                  http://www.haenni.info/thesis/presen...tml/sld006.htm
                  http://csd.ijs.si/courses/trends/tsld008.htm

                  The most important thing is, Eric Demers claimed the same thing:
                  Originally posted by Eric Demers
                  Actually, it's not really superscalar...more like VLIW...
                  http://www.rage3d.com/interviews/atichats/undertheihs/

                  That's why I'm asking: it seems most of the sites just copy and paste the same nonsense.

                  Extracting instruction-level-parallelism in the compiler is much more difficult with a typical CPU workload, where most of the operations are scalar. It's the high proportion of short vectors in a graphics or HPC workload which makes a VLIW approach to superscalar GPU hardware attractive.
                  And what about GPGPU? What about scientific applications? Do they have to be compiled with VLIW in mind to run fast on a Radeon, or is it just a matter for the driver's compiler?



                  • Does Radeon 4200 support OpenCL? Does it support compute shaders in CAL? AMD has made big claims about 4200 being Stream-friendly so I am confused. Is it based on RV7xx SIMDs with shared memory and the whole enchilada?



                    • Originally posted by codedivine View Post
                      Does Radeon 4200 support OpenCL? Does it support compute shaders in CAL? AMD has made big claims about 4200 being Stream-friendly so I am confused. Is it based on RV7xx SIMDs with shared memory and the whole enchilada?
                      According to http://en.wikipedia.org/wiki/Compari....2C_HD_4xxx.29, the integrated HD 4200 GPU is an RV620 core, like my Mobility Radeon 3470, and as such it doesn't support double precision or the memory-related requisites for AMD's OpenCL driver. Real r700 or newer cores are required.

                      Bridgman, please correct me.



                      • Originally posted by next9 View Post
                        There are many academic presentations claiming that superscalar and VLIW are opposite approaches.

                        http://www.haenni.info/thesis/presen...tml/sld006.htm
                        http://csd.ijs.si/courses/trends/tsld008.htm
                        Yeah, that's the problem. Some academic presentations say one thing, others include VLIW in the definition of superscalar :

                        http://suif.stanford.edu/papers/isca90.pdf
                        http://courses.ece.ubc.ca/476/www200.../Lecture29.pdf

                        The second one is particularly interesting, since it distinguishes between "static superscalar" and VLIW, but using those definitions our core falls into the static superscalar bucket because instructions can use the results of the previous instruction.

                        I think there is a slight trend towards reserving the "superscalar" term for dynamic extraction of instruction-level parallelism and using "VLIW" for compile-time ILP extraction, but it seems to be pretty recent (ie after the chips were designed). Today you can find both definitions fairly easily.

                        Originally posted by next9 View Post
                        The most important thing is, Eric Demers claimed the same thing:

                        http://www.rage3d.com/interviews/atichats/undertheihs/

                        That's why I'm asking: it seems most of the sites just copy and paste the same nonsense.
                        If the trend towards defining "superscalar" to exclude VLIW I mentioned above is real, I imagine we will shift our usage accordingly (and Eric's comment supports that). In the meantime I think the big question is "which definition of superscalar do you subscribe to ?". If you don't consider VLIW to be a subset of superscalar, then we're VLIW. If you do consider VLIW to be a subset of superscalar, then we're superscalar via VLIW. I guess I don't understand all the fuss.

                        Originally posted by next9 View Post
                        And what about GPGPU? What about scientific applications? Do they have to be compiled with VLIW in mind to run fast on a Radeon, or is it just a matter for the driver's compiler?
                        The compiler usually seems to be able to optimize to the point where the algorithm is running fetch-limited, ie where further ALU optimization would not make a difference. Tweaking for a specific architecture (whether ours or someone else's) usually seems to focus on optimizing memory accesses more than ALU operations.

                        There are probably exceptions where tweaking the code to match the ALU architecture can get a speedup but in general it seems that optimizing I/O is what makes the biggest difference on all architectures these days.
                        Last edited by bridgman; 10-17-2009, 11:55 AM.
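
                        As a toy illustration of "optimize the memory accesses first" (plain C, not tuned GPU code, and every name below is made up): the two functions compute exactly the same sums, so the ALU work is identical, but the blocked version stages a tile of y[] in a small buffer (standing in for on-chip shared memory) and re-uses each loaded element many times, which is what matters once a kernel is fetch-limited.

                        #include <stddef.h>

                        #define TILE 64

                        /* Streams y[] from memory n times over. */
                        void interact_naive(const float *x, const float *y, float *out, size_t n)
                        {
                            for (size_t i = 0; i < n; i++) {
                                float acc = 0.0f;
                                for (size_t j = 0; j < n; j++)
                                    acc += x[i] * y[j];
                                out[i] = acc;
                            }
                        }

                        /* Same arithmetic, but each y element is loaded once per tile and
                         * then re-used n times from the small on-chip-sized buffer. */
                        void interact_blocked(const float *x, const float *y, float *out, size_t n)
                        {
                            float tile[TILE];
                            for (size_t i = 0; i < n; i++)
                                out[i] = 0.0f;
                            for (size_t j0 = 0; j0 < n; j0 += TILE) {
                                size_t len = (n - j0 < TILE) ? (n - j0) : TILE;
                                for (size_t k = 0; k < len; k++)
                                    tile[k] = y[j0 + k];
                                for (size_t i = 0; i < n; i++) {
                                    float acc = out[i];
                                    for (size_t k = 0; k < len; k++)
                                        acc += x[i] * tile[k];
                                    out[i] = acc;
                                }
                            }
                        }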



                        • Originally posted by codedivine View Post
                          Does Radeon 4200 support OpenCL? Does it support compute shaders in CAL? AMD has made big claims about 4200 being Stream-friendly so I am confused. Is it based on RV7xx SIMDs with shared memory and the whole enchilada?
                          As Loris said, the HD4200 IGP uses a 3D engine from the RV620 so it has Stream Processors (what we call the unified shaders introduced with r600) and supports the Stream framework (CAL etc..) but does not have all the features from the RV7xx 3D engine. It does not have the per-SIMD LDS, not sure about GDS. I don't believe the OpenCL implementation supports the HD4200, since OpenCL makes heavy use of the shared memory blocks.

                          Not sure about DX11 Compute Shaders but I believe they will run on the HD4200 hardware. Be aware that there are different levels of Compute Shader support, however (CS 4.0, 4.1, 5.0 IIRC), and Compute Shader 5.0 requires DX11 hardware (ie HD5xxx).
                          Last edited by bridgman; 10-17-2009, 01:32 PM.



                          • Originally posted by bridgman
                            If you don't consider VLIW to be a subset of superscalar, then we're VLIW. If you do consider VLIW to be a subset of superscalar, then we're superscalar via VLIW. I guess I don't understand all the fuss.
                            Great, now I understand. I prefer engineers over marketing guys, and this seemed to me like:

                            "Hey, nVidia has a scalar architecture. Let's tell our customers we have a superscalar architecture" - marketing bullshit.

                            nVidia started to use the term "stream processor". After that, ATI started to use the term "stream processor" too. But an ATI SP and an nVidia SP are different things. A higher number of SPs looks better in marketing material, even though it's apples to oranges. That's how it works every day.

                            That's why I ask a developer or engineer instead of a marketing guy. No matter what definition we use, it is clear how it works.


                            The compiler usually seems to be able to optimize to the point where the algorithm is running fetch-limited, ie where further ALU optimization would not make a difference. Tweaking for a specific architecture (whether ours or someone else's) usually seems to focus on optimizing memory accesses more than ALU operations.

                            There are probably exceptions where tweaking the code to match the ALU architecture can get a speedup but in general it seems that optimizing I/O is what makes the biggest difference on all architectures these days.
                            I think it is clear. Let me ask another question. If VLIW is not the problem in GPGPU, what is the reason for the lower Radeon performance in the popular GPGPU application Folding@home? I have seen some graphs where a 9600/9800GT was faster than a Radeon HD4890, which does not make sense to me.



                            • Originally posted by next9 View Post
                              Great, now I understand. I prefer engineers over marketing guys, and this seemed to me like:

                              "Hey, nVidia has a scalar architecture. Let's tell our customers we have a superscalar architecture" - marketing bullshit.
                              Yeah, I dread the day when someone develops an architecture that can reasonably be described as "superduperscalar".

                              For what it's worth, we did talk about the design as "superscalar" inside engineering, it's not just something marketing created. I suspect the tendency to exclude VLIW from the definition of superscalar mostly happened after the unified shader core was designed.

                              Originally posted by next9 View Post
                              nVidia started to use the term "stream processor". After that, ATI started to use the term "stream processor" too. But an ATI SP and an nVidia SP are different things. A higher number of SPs looks better in marketing material, even though it's apples to oranges. That's how it works every day.

                              That's why I ask a developer or engineer instead of a marketing guy. No matter what definition we use, it is clear how it works.
                              AFAIK the SPs are relatively similar in terms of what they can do. The tradeoff is partly "a smaller number of SPs at a higher clock speed vs a larger number of SPs at a lower clock speed" and partly "scalar vs superscalar... err... VLIW". Every vendor chooses the approach they think is best, and eventually they converge on something that isn't quite what any of them had in mind at the start.
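
                              As a back-of-the-envelope illustration of that clocks-vs-width tradeoff (using the commonly quoted figures for these parts, counting a multiply-add as 2 FLOPs, and very much a sketch rather than a benchmark):

                              #include <stdio.h>

                              /* Peak single-precision throughput, MAD counted as 2 FLOPs.
                               * Clock/ALU figures are the commonly quoted ones; illustration only. */
                              int main(void)
                              {
                                  double hd4890 = 800 * 0.850 * 2;  /* 800 ALUs @ 850 MHz  -> ~1360 GFLOPS */
                                  double gtx285 = 240 * 1.476 * 2;  /* 240 ALUs @ 1476 MHz -> ~708 GFLOPS  */
                                  printf("HD 4890 peak: ~%.0f GFLOPS\n", hd4890);
                                  printf("GTX 285 peak: ~%.0f GFLOPS (NVidia quotes ~1063 by also "
                                         "counting a co-issued MUL)\n", gtx285);
                                  return 0;
                              }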

                              Originally posted by next9 View Post
                              I think it is clear. Let me ask another question. If VLIW is not the problem in GPGPU, what is the reason for the lower Radeon performance in the popular GPGPU application Folding@home? I have seen some graphs where a 9600/9800GT was faster than a Radeon HD4890, which does not make sense to me.
                              Just going from what I have read, the core issue is that the F@H client is running basically the same code paths on 6xx and 7xx rather than taking advantage of the additional capabilities in 7xx hardware. Rather than rewriting the GPU2 client for 7xx and up I *think* the plan is to focus on OpenCL and the upcoming GPU3 client.

                              The current F@H implementation on ATI hardware seems to have to do the force calculations twice rather than being able to store and re-use them -- storing and re-using is feasible on newer ATI GPUs but not on the earlier 6xx parts. BTW it appears that FLOPs for the duplicated calculations are not counted in the stats.

                              There also seems to be a big variation in relative performance depending on the size of the protein, with ATI and competing hardware being quite close on large proteins even though we are doing some of the calculations twice. There have been a couple of requests from folding users to push large proteins to ATI users and small proteins to NVidia users, not sure of the status.

                              There also seem to be long threads about the way points are measured. Some of the discussions (see link, around page 4) imply that the performance difference on small proteins may be a quirk of the points mechanism rather than an actual difference in throughput, but I have to admit I don't fully understand the argument there :

                              http://foldingforum.org/viewtopic.php?f=50&t=8134
                              Last edited by bridgman; 10-17-2009, 09:15 PM.
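
                              For what it's worth, the "calculate twice vs store and re-use" point can be sketched in a few lines of plain C (this is a toy, nothing to do with the actual GPU2 kernels): since force(i,j) = -force(j,i), hardware that can store and scatter partial results only needs to evaluate each pair once, while hardware that can't ends up evaluating every pair from both sides.

                              #include <stddef.h>

                              /* Toy 1-D "force" -- stands in for the expensive pair calculation. */
                              static float pair_force(const float *pos, size_t i, size_t j)
                              {
                                  return pos[i] - pos[j];
                              }

                              /* Every (i, j) pair evaluated from both sides: ~2x the work. */
                              void forces_calculate_twice(const float *pos, float *f, size_t n)
                              {
                                  for (size_t i = 0; i < n; i++) {
                                      f[i] = 0.0f;
                                      for (size_t j = 0; j < n; j++)
                                          if (j != i)
                                              f[i] += pair_force(pos, i, j);
                                  }
                              }

                              /* Each pair evaluated once, result stored and re-used with the sign
                               * flipped (Newton's third law); needs somewhere to accumulate f[j]. */
                              void forces_store_and_reuse(const float *pos, float *f, size_t n)
                              {
                                  for (size_t i = 0; i < n; i++)
                                      f[i] = 0.0f;
                                  for (size_t i = 0; i < n; i++)
                                      for (size_t j = i + 1; j < n; j++) {
                                          float fij = pair_force(pos, i, j);
                                          f[i] += fij;
                                          f[j] -= fij;
                                      }
                              }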



                              • Radeon 58xx vs Fermi

                                I saw that the 57xx doesn't have double-precision floating point support, so the 57xx is out of the question for me. Will AMD's OpenCL implementation support double-precision floating point emulation using the GPU hardware?

                                Also, what are the numbers on integer crunching?

                                What about the parallel kernel execution support that nVidia has announced? Does AMD support parallel execution of multiple compute kernels?

                                Other things I've noticed that are cool about Fermi are ECC support, syscall support, and developer-configurable caching/manageable memory schemes (for SP local memory).
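
                                For background, "emulating doubles" on single-precision hardware usually means something like the standard double-single (two-float) trick sketched below in plain C; this is just the well-known technique, not a statement about what AMD's OpenCL implementation actually does or will do.

                                #include <stdio.h>

                                /* Double-single add: each value is an unevaluated sum hi + lo of two
                                 * floats. Uses Knuth's two-sum to recover the rounding error.
                                 * (Requires strict single-precision evaluation to work as intended.) */
                                static void ds_add(float a_hi, float a_lo, float b_hi, float b_lo,
                                                   float *r_hi, float *r_lo)
                                {
                                    float s = a_hi + b_hi;
                                    float t = s - a_hi;
                                    float e = (a_hi - (s - t)) + (b_hi - t);  /* rounding error of s */
                                    e += a_lo + b_lo;
                                    *r_hi = s + e;
                                    *r_lo = e - (*r_hi - s);
                                }

                                int main(void)
                                {
                                    /* 1 + 1e-8 rounds to 1.0f in one float, but survives as a pair. */
                                    float hi, lo;
                                    ds_add(1.0f, 0.0f, 1e-8f, 0.0f, &hi, &lo);
                                    printf("hi=%.9g lo=%.9g sum~=%.17g\n",
                                           hi, lo, (double)hi + (double)lo);
                                    return 0;
                                }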
