Patches For AMD GPUs On Loongson Point To "Massive Platform Bug" For These Chinese CPUs

  • Patches For AMD GPUs On Loongson Point To "Massive Platform Bug" For These Chinese CPUs

    Phoronix: Patches For AMD GPUs On Loongson Point To "Massive Platform Bug" For These Chinese CPUs

    A set of patches was posted on Monday aiming to get aging AMD Radeon GFX7/GFX8-era graphics processors working on Loongson LoongArch platforms. These patches for handling old Radeon Hawaii~Polaris GPUs on Loongson point to a "massive platform bug" with these domestic Chinese systems...


  • #2
    Does the same problem appear in MIPS? I thought LoongArch was just a fork of MIPS?



    • #3
      Originally posted by Estranged1906
      Does the same problem appear in MIPS? I thought LoongArch was just a fork of MIPS?
      The architecture used for the CPU cores is unlikely to have much to do with bugs in their PCIe controller block.



      • #4
        Originally posted by Estranged1906
        Does the same problem appear in MIPS? I thought LoongArch was just a fork of MIPS?
        Mm. You're conflating the MIPS ISA specification with the communications bus of the specific implementation, in this case Loongson's chipsets. It appears that Loongson's implementation of the PCIe bus standard is broken in whichever hardware revision(s) they're talking about.

        Reads as a pretty serious and potentially crippling hardware bug.

        Edit to add a bit more from Konig later on in the thread since I've read further:
        Well that sounds like the usual re-ordering problems we have seen
        patches for on Loongson multiple times now.

        And I can only repeat what I've wrote before: We don't accept
        workarounds in drivers for problems cause by severely platform issues.

        Especially when that is clearly against any PCIe specification.

        Regards,
        Christian.
        ​
        Loongson's hardware apparently has a history of brokenness. Reminds me of VIA, which also has a pretty notorious reputation for out-of-spec/broken hardware.
        Last edited by stormcrow; 18 June 2024, 09:16 AM. Reason: extra context from the thread...



        • #5
          "Well to be honest on a platform where even two consecutive writes to the same location doesn't work I would have strong doubts that it is stable in general."
          This is rich coming from AMD who had processors that corrupted the stack pointer (and crashed as a result) if it was popped multiple times in a row. Ahhem.

          Come on! Every CPU manufacturer has had its own painful errata. Remember Intel's FDIV bug? Even today, all processors are full of bugs; they just tend to be either minor or worked around in software. Intel and AMD have simply been around long enough to burn through the worst of them and build up enough know-how to avoid any major ones today. Loongson is a relatively new player, so of course they have to go through the learning curve too. This is no reason for one manufacturer to bash another, especially for somebody like AMD, who has been in the same shoes before.



          • #6
            Originally posted by ultimA

            This is rich coming from AMD who had processors that corrupted the stack pointer (and crashed as a result) if it was popped multiple times in a row. Ahhem.

            Come on! Every CPU manufacturer has had its own painful errata. Remember Intel's FDIV bug? Even today, all processors are full of bugs; they just tend to be either minor or worked around in software. Intel and AMD have simply been around long enough to burn through the worst of them and build up enough know-how to avoid any major ones today. Loongson is a relatively new player, so of course they have to go through the learning curve too. This is no reason for one manufacturer to bash another, especially for somebody like AMD, who has been in the same shoes before.

            The Intel FDIV bug was indeed massive; the AMD stack pointer corruption you cite does not seem to carry the same weight. I read that it happens with:
            ... a very specific sequence of
            consecutive back-to-back pops and (near) return instructions, can
            create a condition where the processor incorrectly updates the
            stack pointer.
            By the way, a better example of a massive AMD issue was the TLB bug on the first Phenom processors.

            The Loongson issue, where two consecutive writes over PCIe need a workaround, also seems like quite an important bug, one that hinders hardware compatibility in a very broad way.
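
To illustrate why the ordering of two consecutive writes to the same location matters, here is a minimal toy model in Python. It is only a sketch: the thread does not describe the exact failure mode on Loongson, and the register offset, the values, and the `swap_adjacent` "faulty fabric" are all hypothetical. The only assumption taken from PCIe itself is that ordinary posted writes are expected to reach the device in the order the CPU issued them.

```python
from typing import Dict, Iterable, List, Tuple

Write = Tuple[int, int]  # (register offset, value), both hypothetical


def final_registers(writes: Iterable[Write]) -> Dict[int, int]:
    """Apply writes in the order the device receives them."""
    regs: Dict[int, int] = {}
    for offset, value in writes:
        regs[offset] = value  # a later arrival overwrites an earlier one
    return regs


def swap_adjacent(writes: List[Write]) -> List[Write]:
    """Model a faulty fabric that lets every second write pass the first."""
    out = list(writes)
    for i in range(0, len(out) - 1, 2):
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


DOORBELL = 0x10  # hypothetical register offset

# What the driver issues: first value 1, then value 2, to the same register.
program_order = [(DOORBELL, 1), (DOORBELL, 2)]

in_order = final_registers(program_order)
reordered = final_registers(swap_adjacent(program_order))
```

With in-order delivery the register ends up holding the last value the driver wrote. In the reordered case it ends up holding the older value, which is the kind of result a driver written against the PCIe ordering rules has no reason to expect.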



            • #7
              Originally posted by ultimA

              This is rich coming from AMD who had processors that corrupted the stack pointer (and crashed as a result) if it was popped multiple times in a row. Ahhem.

              Come on! Every CPU manufacturer has had its own painful errata. Remember Intel's FDIV bug? Even today, all processors are full of bugs; they just tend to be either minor or worked around in software. Intel and AMD have simply been around long enough to burn through the worst of them and build up enough know-how to avoid any major ones today. Loongson is a relatively new player, so of course they have to go through the learning curve too. This is no reason for one manufacturer to bash another, especially for somebody like AMD, who has been in the same shoes before.
              You are correct. Every platform has bugs, and every company makes mistakes. The big question is how the company resolves those bugs. Intel responded to its FDIV bug by offering replacement hardware at a high cost to itself. After a bit of digging, I found the erratum for the AMD stack bug you mentioned.


              Erratum 721: Processor May Incorrectly Update Stack Pointer
              Description: Under a highly specific and detailed set of internal timing conditions, the processor may incorrectly update the stack pointer after a long series of push and/or near-call instructions, or a long series of pop and/or near-return instructions. The processor must be in 64-bit mode for this erratum to occur.
              Potential Effect on System: The stack pointer value jumps by a value of approximately 1024, either in the positive or negative direction. This incorrect stack pointer causes unpredictable program or system behavior, usually observed as a program exception or crash (for example, a #GP or #UD).
              Suggested Workaround: System software may set MSRC001_1029[0] = 1b.
              Fix Planned: No
              Doing some more digging led me to the Linux kernel mailing list.


              In fact, there is a whole bunch of similar "CFG" registers in this range,
              and they are rather sparsely documented by AMD.

              Let's have a comment about that.

              MSRs in 0xC001102x range (and a few close to this range)
              allow to modify some internal actions of the pipeline.
              (There is one non-debug MSR in this range, introduced in Fam15h:
              MSR 0xC0011027 Address Mask For DR0 Breakpoints, aka DR0_ADDR_MASK).

              Sometimes these MSRs are used to fix erratas.
              Denys Vlasenko then specifically calls out several dozen errata that are mitigated by setting specific values in various registers. The code attached to the email indicates that those registers are handled in platform-specific configuration files, where they can be set without affecting anyone else.
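
For a sense of how small such a platform-local workaround is, here is a rough Python sketch of the erratum 721 fix quoted above, i.e. setting bit 0 of MSR 0xC0011029. It assumes a Linux system with the `msr` driver loaded, which exposes `/dev/cpu/N/msr` and treats the file offset as the MSR number. The helper names are made up for illustration, and in practice this kind of bit is set by firmware or the kernel rather than from userspace.

```python
import os
import struct

MSR_ERRATUM_721 = 0xC0011029  # register named in the erratum text
WORKAROUND_BIT = 0            # MSRC001_1029[0]


def with_bit_set(value: int, bit: int) -> int:
    """Return value with the given bit set, leaving all other bits alone."""
    return value | (1 << bit)


def apply_workaround(cpu: int = 0) -> None:
    """Read-modify-write the MSR on one CPU (hypothetical helper).

    Assumes the Linux msr driver and root privileges. A real fix would
    have to be applied on every core, not just one.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDWR)
    try:
        raw = os.pread(fd, 8, MSR_ERRATUM_721)   # MSRs are 64 bits wide
        (value,) = struct.unpack("<Q", raw)
        new_value = with_bit_set(value, WORKAROUND_BIT)
        os.pwrite(fd, struct.pack("<Q", new_value), MSR_ERRATUM_721)
    finally:
        os.close(fd)
```

The point of the sketch is that the change touches only a register on the affected CPU family. Nothing in it alters how a driver behaves on other platforms.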

              These patches affecting Loongson weren't like that:
              The patches were immediately rejected as they disable behavior needed by the driver for other platforms.
              Yeah. That's not going to fly. Fixing your problems by shunting them onto other folks will make you enemies real fast.

              I ran into a mixed 8-bit/16-bit math bug on a microcontroller about a year ago and it was a pain to diagnose, but the manufacturer had issued an erratum for it and working around it only took about five lines of code. Welcome to low-level and embedded programming, where you get to deal with all the quirks of your platform.



              • #8
                To be clear, this has nothing to do with the processor or ISA; it's presumably an issue with the PCIe implementation on the platform. This is actually a pretty common issue on small non-x86 platforms that add on PCIe in various strange ways. PCI (and PCIe) requires cache coherency with the host CPU (i.e., the bus must support snooping of the CPU's cache from the device). Some processors, e.g., ARM, need special cache coherency IP in the CPU for this to work properly, but not all ARM CPU vendors include it, as they don't always have PCIe as part of the platform initially and then someone adds it on later.
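
The snooping requirement can be pictured with a toy model. The Python below is only a sketch under simplifying assumptions: a single write-back cache, a dictionary standing in for memory, and a hypothetical `ToySystem` class. It is not how any real interconnect is implemented.

```python
class ToySystem:
    """CPU with a write-back cache plus a device that reads via DMA."""

    def __init__(self, coherent: bool) -> None:
        self.coherent = coherent
        self.memory = {}  # backing RAM
        self.cache = {}   # dirty lines not yet written back

    def cpu_write(self, addr: int, value: int) -> None:
        # Write-back cache: the new value sits in the cache, RAM is stale.
        self.cache[addr] = value

    def flush(self, addr: int) -> None:
        # Explicit cache maintenance, needed on non-coherent platforms.
        if addr in self.cache:
            self.memory[addr] = self.cache.pop(addr)

    def device_read(self, addr: int) -> int:
        # A coherent bus snoops the CPU cache before falling back to RAM.
        if self.coherent and addr in self.cache:
            return self.cache[addr]
        return self.memory.get(addr, 0)


ADDR = 0x1000  # hypothetical buffer address

snooping = ToySystem(coherent=True)
snooping.cpu_write(ADDR, 42)
seen_with_snoop = snooping.device_read(ADDR)

no_snoop = ToySystem(coherent=False)
no_snoop.cpu_write(ADDR, 42)
seen_without_snoop = no_snoop.device_read(ADDR)  # stale data from RAM
no_snoop.flush(ADDR)
seen_after_flush = no_snoop.device_read(ADDR)
```

In the coherent case the device sees the CPU's latest write straight away. Without snooping it reads stale memory until software flushes the line, which is why drivers written for coherent PCI platforms can misbehave on hardware that skips the coherency logic.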



                • #9
                  Originally posted by agd5f
                  (i.e., the bus must support snooping of the CPU's cache from the device).
                  Wait wat?



                  • #10
                    Originally posted by Jonjolt
                    Wait wat?
                    PCI is required to be cache coherent with the host CPU.

