PCIe bus errors on Linux with GTX 980

  • PCIe bus errors on Linux with GTX 980

    I'm running Fedora 21 with Linux kernel 3.18.3-201.fc21.x86_64 and Nvidia binary driver 346.35. The problem I'll describe also happened with the previous driver and kernel. When I boot up, I immediately see error messages like this:

    [ 20.681038] pcieport 0000:00:01.1: AER: Multiple Corrected error received: id=0009
    [ 20.681051] pcieport 0000:00:01.1: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0009(Receiver ID)
    [ 20.681053] pcieport 0000:00:01.1: device [8086:2f03] error status/mask=00000001/00002000
    [ 20.681055] pcieport 0000:00:01.1: [ 0] Receiver Error (First)
    [ 20.803313] pcieport 0000:00:02.0: AER: Corrected error received: id=0010
    [ 20.803321] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)
    [ 20.803322] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00000080/00002000
    [ 20.803324] pcieport 0000:00:02.0: [ 7] Bad DLLP
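For anyone reading the `error status/mask` fields in logs like these, here is a minimal Python sketch that decodes them. It assumes the values are the AER Correctable Error Status and Mask registers and uses the bit names from the PCI Express specification. The function names (`decode_correctable`, `decode_status_mask`) are made up for illustration and the table only covers the commonly seen bits.

```python
# Bit positions in the AER Correctable Error Status/Mask registers.
# Assumption: the names follow the PCI Express Base Specification; only the
# commonly reported bits are listed here.
AER_CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_correctable(value_hex):
    """Return the names of the bits set in a hex status or mask value."""
    value = int(value_hex, 16)
    names = []
    for bit in range(32):
        if value & (1 << bit):
            names.append(AER_CORRECTABLE_BITS.get(bit, f"bit {bit} (not in table)"))
    return names


def decode_status_mask(field):
    """Split a 'status/mask' field such as '00000001/00002000' and decode both halves."""
    status_hex, mask_hex = field.split("/")
    return decode_correctable(status_hex), decode_correctable(mask_hex)


if __name__ == "__main__":
    # The two status/mask values from the log lines quoted in this post.
    for field in ("00000001/00002000", "00000080/00002000"):
        status, mask = decode_status_mask(field)
        print(field, "status:", status, "mask:", mask)
```

Read this way, status `00000001` is bit 0 (Receiver Error) and `00000080` is bit 7 (Bad DLLP), which matches the `[ 0]` and `[ 7]` lines the kernel prints. The mask `00002000` is bit 13.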

    Based on various Google searches, there seem to be some suggestions that this issue is related to the Nvidia driver and my GTX 980 (EVGA GTX980 SC ACX 2.0). According to those same searches, passing "pci=nomsi" on the kernel command line makes the error messages disappear.

    Is anyone else seeing this?

  • #2
    I'll just go ahead and say this in this thread too,

    I doubt this will help, but I saw a very similar error message on a Radeon 6850 card using the OSS driver a while ago, when 2D color tiling first got implemented. It did eventually get fixed, but for a while I had to leave that option disabled.

    It's probably entirely unrelated, but maybe it can help your searches?



    • #3
      Okay, I'll move my posts here since they'll be far more on-topic.

      Asus released BIOS update 1001 yesterday for the Rampage V Extreme. It didn't fix the issue.

      I was able to do a bit of testing. It seems that booting into Windows first doesn't help with the issue.

      Here are the errors that I see (interestingly, I see 2f04 whereas duby229 sees 2f03, but I doubt it matters):
      [Thu Jan 29 10:01:46 2015] pcieport 0000:00:02.0: AER: Corrected error received: id=0010
      [Thu Jan 29 10:01:46 2015] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)
      [Thu Jan 29 10:01:46 2015] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00000001/00002000
      [Thu Jan 29 10:01:46 2015] pcieport 0000:00:02.0: [ 0] Receiver Error
      [Thu Jan 29 10:01:47 2015] pcieport 0000:00:02.0: AER: Multiple Corrected error received: id=0010
      [Thu Jan 29 10:01:47 2015] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)
      [Thu Jan 29 10:01:47 2015] pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00000001/00002000
      [Thu Jan 29 10:01:47 2015] pcieport 0000:00:02.0: [ 0] Receiver Error
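To compare error rates between configurations (for example before and after a BIOS update), it can help to count these messages per port and per layer. Below is a rough Python sketch, assuming the log lines look like the ones posted in this thread. The regular expression and the function name `tally_bus_errors` are illustrative, not taken from any existing tool.

```python
import re
from collections import Counter

# Matches the "PCIe Bus Error" summary line, with either timestamp style, e.g.
#   pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)
BUS_ERROR_RE = re.compile(
    r"pcieport (?P<port>[0-9a-f:.]+): PCIe Bus Error: "
    r"severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+),"
)


def tally_bus_errors(lines):
    """Count bus error reports keyed by (port, severity, layer)."""
    counts = Counter()
    for line in lines:
        match = BUS_ERROR_RE.search(line)
        if match:
            counts[(match["port"], match["severity"], match["layer"])] += 1
    return counts


# Sample lines copied from the logs in this thread.
SAMPLE = [
    "[Thu Jan 29 10:01:46 2015] pcieport 0000:00:02.0: AER: Corrected error received: id=0010",
    "[Thu Jan 29 10:01:46 2015] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)",
    "[Thu Jan 29 10:01:47 2015] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0010(Receiver ID)",
    "[ 20.803321] pcieport 0000:00:02.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0010(Receiver ID)",
]

if __name__ == "__main__":
    for key, count in sorted(tally_bus_errors(SAMPLE).items()):
        print(key, count)
```

In practice the lines could come from saved `dmesg` output instead of the `SAMPLE` list.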

      Here are my system specs in case it's helpful to anyone:
      i7 5960
      Asus Rampage V Extreme (x99 chipset)
      3ware 9750-i4 PCIE card (x8)
      32GB memory



      • #4
        Part of the problem for me is that I don't understand this issue deeply enough to know what needs to be fixed. Is it an X99 motherboard BIOS issue? An Nvidia binary driver issue? A Linux kernel issue? Or a combination? Has anyone contacted a vendor yet (the motherboard manufacturer, Nvidia, or someone else)?

        I've posted a bug report with the Fedora project for my case: https://bugzilla.redhat.com/show_bug.cgi?id=1181012

        But I'm not entirely sure this is something a Linux kernel update can solve.



        • #5
          Originally posted by BLinux
          Part of the problem for me is that I don't understand this issue deeply enough to know what needs to be fixed. Is it an X99 motherboard BIOS issue? An Nvidia binary driver issue? A Linux kernel issue? Or a combination? Has anyone contacted a vendor yet (the motherboard manufacturer, Nvidia, or someone else)?

          I've posted a bug report with the Fedora project for my case: https://bugzilla.redhat.com/show_bug.cgi?id=1181012

          But I'm not entirely sure this is something a Linux kernel update can solve.
          That's just it... I don't think any of us really knows yet. We just know it's related to the X99 chipset, and probably to using Nvidia GPUs as well (I haven't seen reports of Windows users having this issue with AMD GPUs).



          • #6
            Originally posted by hiryu
            That's just it... I don't think any of us really knows yet. We just know it's related to the X99 chipset, and probably to using Nvidia GPUs as well (I haven't seen reports of Windows users having this issue with AMD GPUs).
            hiryu, you mentioned that your system has some instability problems when you see these error messages. Have you tried the pci=nomsi option? If so, do the instability problems go away with it? I'm wondering whether pci=nomsi only masks the problem or truly avoids it by going back to legacy interrupts.
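When comparing runs with and without pci=nomsi, it is worth confirming which `pci=` options the running kernel actually booted with. Here is a small Python sketch that extracts them from a kernel command line string. It assumes `pci=` options are comma-separated, as in `pci=nomsi,nommconf`. The helper name `pci_options` and the example command line are hypothetical.

```python
def pci_options(cmdline):
    """Return the set of options passed via pci=... on a kernel command line."""
    options = set()
    for token in cmdline.split():
        if token.startswith("pci="):
            # Several options can share one pci= token, separated by commas.
            options.update(opt for opt in token[len("pci="):].split(",") if opt)
    return options


# Hypothetical command line, for illustration only.
EXAMPLE_CMDLINE = "root=/dev/sda1 ro pci=nomsi,nommconf quiet"

if __name__ == "__main__":
    print(sorted(pci_options(EXAMPLE_CMDLINE)))
```

On a running Linux system the string can be read from `/proc/cmdline`. This only shows that the option was passed. It does not tell you whether the underlying link errors still occur.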



            • #7
              Originally posted by BLinux
              hiryu, you mentioned that your system has some instability problems when you see these error messages. Have you tried the pci=nomsi option? If so, do the instability problems go away with it? I'm wondering whether pci=nomsi only masks the problem or truly avoids it by going back to legacy interrupts.
              Yes, I have, probably during the first week I had this computer. It makes the errors go away, but I saw no improvement in stability.



              • #8
                Originally posted by hiryu
                Yes, I have, probably during the first week I had this computer. It makes the errors go away, but I saw no improvement in stability.
                OK, thanks. On my system, everything is stable with pci=nomsi. I have an ASRock X99 Extreme11 board with an i7-5960X and an EVGA GTX 980 SC ACX 2.0.

                I think you mentioned that some people with the same problem on an X99 board got a BIOS update that seems to have fixed it. Can you please share a link to where you found that claim? I want to pass it on to ASRock in the hope that it motivates them to look into it and perhaps do the same. They have not released any BIOS updates for my board yet. A link to where Windows users report the same problem would be helpful too, so ASRock doesn't think it's just a Linux problem.

                BTW, thanks for the conversation on this issue. I haven't found any other Linux users to discuss it with, and frankly, I'm really surprised that more people here haven't seen it. Is it because they are not using a GTX 980 on X99 boards?



                • #9
                  Interesting error. You would expect it only with a bad PCIe connection, which is not that likely with a new board. Maybe reseat the card or try another board (not X99). I have lots of cards and boards, but no GTX 980 or X99. There are sometimes firmware options to disable PCIe error checking, so the messages might go away with that too. The Nvidia kernel module also has an option to disable MSI.



                  • #10
                    Originally posted by BLinux
                    OK, thanks. On my system, everything is stable with pci=nomsi. I have an ASRock X99 Extreme11 board with an i7-5960X and an EVGA GTX 980 SC ACX 2.0.

                    I think you mentioned that some people with the same problem on an X99 board got a BIOS update that seems to have fixed it. Can you please share a link to where you found that claim? I want to pass it on to ASRock in the hope that it motivates them to look into it and perhaps do the same. They have not released any BIOS updates for my board yet. A link to where Windows users report the same problem would be helpful too, so ASRock doesn't think it's just a Linux problem.

                    BTW, thanks for the conversation on this issue. I haven't found any other Linux users to discuss it with, and frankly, I'm really surprised that more people here haven't seen it. Is it because they are not using a GTX 980 on X99 boards?
                    Here are some links:

                    Other Linux users with this problem:
                    I have a 780Ti on an EVGA X99 board, 2 LG monitors attached to a single card. I am getting random hard-locks that are more frequent when doing graphics intensive things, like playing with compositing or gaming. The locks never occur when I am on a TTY - only when I’m on X. The system is not overclocked, and it is mprime and memtest stable as far as I can tell (over an hour in each.) I get things like this in my logs if I do not boot with pci=nommconf in my kernel parameters: [ 3.986470] ...

                    I'm dual posting here due to the PCIE errors that I am getting in my system log - I'm also posting at NVidia because I cannot decipher which could be at fault: I have an NVidia 780Ti in PCIEport 4 at x16 mode. That is the only card I have installed. The processor is a 5930k (not overclocked) and...


                    Windows:
                    http://forums.evga.com/WheaLogger-Ev...-m2217198.aspx (Look for "I've been getting the same error in event viewer on my x99 FTW motherboard too and bios 1.06.")

                    I think I found one or two more Windows-specific threads in the past, but I can't find them at the moment.

                    One of the threads mentions that people have had some luck with forcing Gen2. For me, this reduces the rate of the errors, and does improve my stability somewhat, but that's all it does for me.

                    The problem doesn't seem to be GTX 9xx specific. People have issues with GTX 680s too. It seems to be more of an X99 issue with Nvidia GPUs (if someone sees this with a Radeon on X99, drop us a line).

                    There was also an issue with using external sound cards on X99 boards. When X99 boards were first released, people were unable to get their sound cards working until a BIOS update or two. I have an external sound card too (and it works for me), but removing it didn't fix this issue. However, the fact that external sound cards used to be broken, combined with our Nvidia problems, suggests to me that the PCIe controller found on X99 boards is probably really finicky.
