Gigabyte Motherboard WMI Temperature Driver Queued Ahead Of Linux 5.13

  • Gigabyte Motherboard WMI Temperature Driver Queued Ahead Of Linux 5.13

    Phoronix: Gigabyte Motherboard WMI Temperature Driver Queued Ahead Of Linux 5.13

    Earlier this month I reported on a WMI temperature driver for Gigabyte motherboards being worked on by an independent developer. That "gigabyte-wmi" driver is now slated for inclusion in the upcoming Linux 5.13 cycle...


  • #2
    Meanwhile ASUS users are screwed.

    • #3
      A screenshot showing the result of this feature would be appreciated.

      • #4
        Originally posted by Azrael5
        A screenshot showing the result of this feature would be appreciated.
        Type "sensors" in the terminal. I posted my terminal output in the last thread. This added the gigabyte_wmi-virtual-0 entries.
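For anyone who would rather read the values from a script than from the "sensors" output, here is a minimal sketch that walks the standard hwmon sysfs tree. It assumes the driver registers its hwmon device under the name "gigabyte_wmi" (inferred from the gigabyte_wmi-virtual-0 entry mentioned above) and that the usual temp*_input files are present; the function name read_hwmon_temps is made up for this example.

```python
from pathlib import Path


def read_hwmon_temps(chip_name, root="/sys/class/hwmon"):
    """Return {label: degrees Celsius} for every hwmon device named chip_name."""
    temps = {}
    for dev in sorted(Path(root).glob("hwmon*")):
        name_file = dev / "name"
        # Skip devices that belong to another chip (or have no name file).
        if not name_file.is_file() or name_file.read_text().strip() != chip_name:
            continue
        for entry in sorted(dev.glob("temp*_input")):
            label = entry.name[: -len("_input")]
            try:
                # hwmon exposes temperatures in millidegrees Celsius.
                temps[label] = int(entry.read_text().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor not readable right now; skip it
    return temps


if __name__ == "__main__":
    # "gigabyte_wmi" is an assumed hwmon name, see the note above.
    for label, celsius in read_hwmon_temps("gigabyte_wmi").items():
        print(f"{label}: {celsius:.1f} C")
```

If the driver is not loaded, or the board does not expose sensors through WMI, the function simply returns an empty dict.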

        • #5
          Knowing your temps is nice I guess. But knowing when you have memory errors is even more important. A board (and/or CPU) that doesn't support ECC is useless. Memory sizes are large enough now that you WILL experience bit flips in RAM. From just this morning, dmesg on my home server reports the following two events:

          [14401.374881] mce: [Hardware Error]: Machine check events logged
          [14401.374909] [Hardware Error]: Corrected error, no action required.
          [14401.374971] [Hardware Error]: CPU:4 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c5c400002080a13
          [14401.375053] [Hardware Error]: Error Addr: 0x0000001adc63ffc0
          [14401.375098] [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
          [14401.375177] EDAC MC1: 1 CE on mc#1csrow#0channel#1 (csrow:0 channel:1 page:0x1adc63f offset:0xfc0 grain:0 syndrome:0x2b8)
          [14401.375181] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)

          [14851.395420] mce: [Hardware Error]: Machine check events logged
          [14851.395440] [Hardware Error]: Corrected error, no action required.
          [14851.395488] [Hardware Error]: CPU:4 (15:2:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c5c400002080a13
          [14851.395545] [Hardware Error]: Error Addr: 0x0000001adc63ffc0
          [14851.395578] [Hardware Error]: MC4 Error (node 1): DRAM ECC error detected on the NB.
          [14851.395634] EDAC MC1: 1 CE on mc#1csrow#0channel#1 (csrow:0 channel:1 page:0x1adc63f offset:0xfc0 grain:0 syndrome:0x2b8)
          [14851.395637] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: RD, part-proc: RES (no timeout)


          Notice where it says "Corrected error, no action required"? That's ECC doing its job. Without ECC, the errors would not be corrected, and they would not even be logged. They would be silent errors, leading to data corruption and/or software crashes. I'm with Linus on this one: ECC is a required feature.
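To keep an eye on these events without reading dmesg by hand, a small parser is enough. The sketch below pulls the corrected-error (CE) lines out of a log like the one above and checks that the "Error Addr" value splits into the page and offset that EDAC reports. The 4 KiB page size (page_shift=12) is an assumption, and the names corrected_errors and split_address are invented for this example.

```python
import re

# Two lines copied from the dmesg output above.
SAMPLE = """\
[14401.375053] [Hardware Error]: Error Addr: 0x0000001adc63ffc0
[14401.375177] EDAC MC1: 1 CE on mc#1csrow#0channel#1 (csrow:0 channel:1 page:0x1adc63f offset:0xfc0 grain:0 syndrome:0x2b8)
"""

EDAC_CE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) CE .*?"
    r"page:0x(?P<page>[0-9a-f]+) offset:0x(?P<offset>[0-9a-f]+)"
)


def corrected_errors(log_text):
    """Return one dict per EDAC corrected-error line found in log_text."""
    events = []
    for match in EDAC_CE.finditer(log_text):
        events.append({
            "mc": int(match["mc"]),
            "count": int(match["count"]),
            "page": int(match["page"], 16),
            "offset": int(match["offset"], 16),
        })
    return events


def split_address(addr, page_shift=12):
    """Split a physical address into (page, offset), assuming 4 KiB pages."""
    return addr >> page_shift, addr & ((1 << page_shift) - 1)


if __name__ == "__main__":
    for event in corrected_errors(SAMPLE):
        print(event)
    page, offset = split_address(0x0000001ADC63FFC0)
    print(hex(page), hex(offset))
```

Both events in the log point at the same page and offset, which hints at one weak cell rather than random flips.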

          • #6
            Originally posted by torsionbar28
            Notice where it says "Corrected error, no action required"? That's ECC doing its job. Without ECC, the errors would not be corrected, and they would not even be logged. They would be silent errors, leading to data corruption and/or software crashes. I'm with Linus on this one: ECC is a required feature.
            I'm pretty sure you're at least partially wrong there. I had those same messages in dmesg - including the "Corrected error, no action required" part - but I don't have any ECC RAM. I suspect these errors are the same ones that get reported on Windows as "WHEA" errors, and they occur with certain BIOS versions (more likely when overclocking).

            Other than that I agree that ECC would be a good thing to have everywhere.

            On topic: it's great to get more support in the sensors area - but it seems to me that all the big motherboard manufacturers (Asus, ASRock, MSI, Gigabyte) are similarly bad at supporting Linux. So we depend on guesswork and independent contributors (thanks a lot to Thomas Weißschuh!).
            For example, my motherboard, a Gigabyte X570 Aorus Master, also has the IT8688 chip, but no sensors are exposed via WMI. There is a second ITE chip that is recognized by the it87 driver and that does show some sensors, but I have no idea if that's all of them or how the chips work together. And Gigabyte doesn't give out any info, not even to developers doing their work for them.
            Last edited by mazumoto; 14 April 2021, 09:24 AM.

            • #7
              Originally posted by mazumoto
              I'm pretty sure you're at least partially wrong there. I had those same messages in dmesg - including the "Corrected error, no action required" part - but I don't have any ECC RAM. I suspect these errors are the same ones that get reported on Windows as "WHEA" errors, and they occur with certain BIOS versions (more likely when overclocking).
              The errors are in a standard format, so while they may appear similar at first glance, the details are important. AFAIK modern x86 CPU caches use ECC even if the main memory doesn't. So potentially you could have seen a similar error on a PC lacking ECC main memory, if a bit flip occurred in L2/L3 cache. In the examples I posted above, however, the message specifically states "detected on the NB", NB being the "North Bridge", aka the main memory controller. So this error must have occurred in main memory, as L2/L3 cache accesses do not go through the NB.

              To be clear, if you don't have ECC main memory and a bit flip occurs there, you will never be notified - not by the BIOS, not by WHEA, not by anything - as that capability does not exist without ECC. And the error will not be corrected, which means you'll have unexplained corrupted data and/or software crashes.
              Last edited by torsionbar28; 14 April 2021, 10:51 AM.
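Since the details of the status word matter here, this is a rough sketch of how the MC4_STATUS value from the log can be picked apart. The bit positions are my understanding of the layout the kernel's AMD MCE decoder uses (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, CECC 46, UECC 45); treat them as an assumption and check the processor documentation before relying on them. The name decode_mc_status is made up for this example.

```python
# Assumed bit positions of the flags in an AMD MCi_STATUS register.
STATUS_FLAGS = {
    "Val": 63,    # register contains a valid error
    "Over": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (clear means corrected, "CE")
    "En": 60,     # error reporting was enabled
    "MiscV": 59,  # the MISC register holds valid data
    "AddrV": 58,  # the ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
    "CECC": 46,   # correctable ECC error
    "UECC": 45,   # uncorrectable ECC error
}


def decode_mc_status(status):
    """Return the set of flag names whose bit is set in status."""
    return {name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1}


if __name__ == "__main__":
    # Value taken from the dmesg output quoted earlier in the thread.
    print(sorted(decode_mc_status(0x9C5C400002080A13)))
```

For the value in the log, UC is clear and CECC is set, which matches the "CE" and "CECC" fields the kernel printed.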

              • #8
                Originally posted by mazumoto

                I'm pretty sure you're at least partially wrong there. I had those same messages in dmesg - including the "Corrected error, no action required" part - but I don't have any ECC RAM. I suspect these errors are the same ones that get reported on Windows as "WHEA" errors, and they occur with certain BIOS versions (more likely when overclocking).

                Other than that I agree that ECC would be a good thing to have everywhere.

                On topic: it's great to get more support in the sensors area - but it seems to me that all the big motherboard manufacturers (Asus, ASRock, MSI, Gigabyte) are similarly bad at supporting Linux. So we depend on guesswork and independent contributors (thanks a lot to Thomas Weißschuh!).
                For example, my motherboard, a Gigabyte X570 Aorus Master, also has the IT8688 chip, but no sensors are exposed via WMI. There is a second ITE chip that is recognized by the it87 driver and that does show some sensors, but I have no idea if that's all of them or how the chips work together. And Gigabyte doesn't give out any info, not even to developers doing their work for them.

                My Aorus Gaming K7 X370 also has two sets of sensors. The standard IT87 driver detects one set, but those are not all the sensors available. I need to use the patched IT87 driver and add acpi_enforce_resources=lax to my kernel boot line so that the other set of sensors gets exposed. Hopefully with this driver I will no longer have to do that.
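To check whether a running system was booted with that option, the kernel command line can be inspected. This is only a sketch: the helper name acpi_enforce_resources_mode is invented here, and taking the last occurrence when the option appears more than once is a choice made for the example, not documented kernel behavior.

```python
from pathlib import Path


def acpi_enforce_resources_mode(cmdline):
    """Return the acpi_enforce_resources value from a kernel command line, or None."""
    mode = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "acpi_enforce_resources" and sep:
            mode = value  # this sketch keeps the last occurrence
    return mode


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.is_file():
        mode = acpi_enforce_resources_mode(proc_cmdline.read_text())
        print(mode if mode is not None else "not set on the command line")
```

A result of "lax" means the resource conflict check has been relaxed, so native drivers such as it87 may touch I/O ranges that ACPI also claims.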

                • #9
                  Originally posted by birdie
                  Meanwhile ASUS users are screwed.
                  I was super hopeful that I just didn't have the experience to understand the issue... Crap

                  • #10
                    Originally posted by birdie
                    Meanwhile ASUS users are screwed.
                    Lol, I just love how you're linking this "discussion" you had with the kernel devs, instead of trying to hide it in a deep dark cave and erase its existence from the world (as I would do if I had been publicly ridiculed like that). Could it be that you're actually proud of yourself acting like a 12yo?

                    Some choice quotes from that bug report:

                    Originally posted by The Troll
                    And this still paints Linux in a very bad light, as users hardly care whether ACPI is implemented according to the specifications or not; what they really care about is whether their hardware works and is supported under Linux out of the box. Most Linux users don't even know `dmesg` exists, so they have no way of knowing how to fix the issue.
                    No dude, you have it all wrong: we Linux users DO care if ACPI (or whatever other standard) is properly implemented according to spec or not. We take pride in a platform that is technically sound. We do NOT want a platform that's full of hacks and workarounds, just to cater to vendors like ASUS and their half-broken crap.

                    Also, hardware DOES work properly under Linux because it's not Linux that makes it work, it's the firmware. That's what controls my fans and monitors the system's temps and voltages. What does NOT work in some cases under Linux is MONITORING those sensors, which is something that MOST USERS won't ever want or need to do.

                    Originally posted by Sane & polite kernel dev #1
                    This isn't a bug - the ACPI tables claim the resource in question, and there's no way we can verify there are no conflicts between ACPI methods that touch that range and the native driver. If you're confident that this is safe on your system then you can boot with acpi_enforce_resources=lax, but we can't make that the default. This will still produce the warning, but the driver will be permitted to load.
                    Originally posted by The Troll
                    This bug needs to be fixed because

                    1) It doesn't affect Windows
                    2) Average people will never know how to deal with issue
                    3) I cannot ask my motherboard vendor (ASUS) to fix this issue in BIOS because they don't provide support for Linux - they barely provide any support at all.
                    Originally posted by Sane & polite kernel dev #1
                    tl;dr - the kernel message you're seeing is correct. Avoiding it requires a new driver to be written. If you *personally* feel safe in ignoring the risks, you can pass the acpi_enforce_resources=lax option, but that can't be the default because it's unsafe in the general case, and so it isn't the solution to the wider problem.
                    Originally posted by The Troll
                    This message [talking about a new message he's proposed to be printed out, even though he's just been told the current message is correct] will at least allow various Linux distros to enable the option by default [talking about "acpi_enforce_resources=lax" even though he's just been told it's freaking NOT SAFE to be enabled by default] because many are not aware of the bug.
                    Originally posted by Sane & polite kernel dev #2
                    The root cause here is the production model used by world of Windows and world of Linux (and besides the downsides like above I prefer the latter). For Windows the drivers are made for *THE product* while in *nix world the drivers try to cover as many products as they can with regard to the similarities and compatibility of the corresponding IPs.

                    That's why people often say "oh, hey, it works in Windows!" Yes, it works, but if and only if you are using the very same *THE product*. A step to the right or left is suicidal in that model. The Windows model is very fragile because of this and requires 10x more resources to develop the code.
                    Originally posted by Sane & polite kernel dev #3
                    Hmm,

                    asus-wmi-sensors also is not such a great solution, it seems the WMI interface is buggy on some boards and causes fans to stop or get stuck at max speed, which is quite bad, see:



                    So it seems that the situation with sensors on these boards simply sucks and Asus is to blame here. If even the "official" method of accessing the sensors is buggy, then Asus needs to get their firmware fixed, and until that is done users are better off without sensors support.
                    Originally posted by Sane & polite kernel dev #3
                    Matthew rightly advises against using "acpi_enforce_resources=lax" because that opens races between the firmware and Linux which could result in writing to another superIO register than intended. This can definitely lead to e.g. stopping the fans even though the CPU is running hot, which is not good, but all modern CPUs have built-in overtemp protection, so at worst the system will simply shut down (1).
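The kind of race described in that quote can be illustrated with a toy model. The sketch below assumes the chip is reached through a shared index/data register pair (select a register, then write to it), with no lock shared between the firmware and the driver. The register numbers, values, and class name are invented for illustration and do not describe any real chip.

```python
class ToySuperIO:
    """Toy index/data register pair; not a model of a real chip."""

    def __init__(self):
        self.index = 0
        self.regs = {0x10: 0x00, 0x20: 0x00}

    def select(self, reg):
        self.index = reg

    def write(self, value):
        # The write lands on whichever register was selected last.
        self.regs[self.index] = value


def run(steps):
    """Apply (operation, argument) steps in order and return the final registers."""
    chip = ToySuperIO()
    for op, arg in steps:
        getattr(chip, op)(arg)
    return chip.regs


# Driver wants 0xAA in register 0x10; firmware wants 0x55 in register 0x20.
SERIAL = [("select", 0x10), ("write", 0xAA),   # driver, uninterrupted
          ("select", 0x20), ("write", 0x55)]   # firmware, uninterrupted

RACY = [("select", 0x10),                      # driver selects its register
        ("select", 0x20), ("write", 0x55),     # firmware runs in between
        ("write", 0xAA)]                       # driver write hits 0x20 instead

if __name__ == "__main__":
    print(run(SERIAL))
    print(run(RACY))
```

In the racy ordering the driver's value ends up in the firmware's register and its own register is never written, which is the "wrong register" outcome the kernel developers are warning about.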
                    Originally posted by The Troll
                    Multiple users use acpi_enforce_resources=lax and I haven't seen a single report that it's ever broken anything.
                    Originally posted by Sane & polite kernel dev #3
                    <sigh> Yet I have been on the receiving end of a bug-report where I had to explain to a user that the lm_sensors sensors-detect script had overvolted his RAM ruining both his expensive high-end RAM as well as his expensive top of the line CPU. The user was surprisingly relaxed about all this, which I really appreciated. And that was while the script was not doing anything which we (the developers) considered dangerous. But the motherboard had a funky setup causing a SMbus *read* transaction to change the voltage. Mucking with this stuff can be dangerous and as Matthew has explained in his thorough analysis of the DSDT the DSDT is actually accessing the superio and if that races with a Linux kernel access a wrong register may be read from, or worse written to. Using acpi_enforce_resources=lax simply is dangerous and we are not going to change the default, period, full-stop. I welcome further discussions here about how we can *safely* solve hwmon access on various motherboards. Please stop discussing acpi_enforce_resources=lax, that is not a safe option to use and more discussion about it is not productive.
                    Originally posted by The Troll
                    I'm not a programmer, let alone a person who understands the innards of the Linux kernel, to even attempt to fix the issue
                    TL;DR #1: Then please STFU and go back to your Windows.

                    TL;DR #2: In other words, what we have here is a self-declared non-programmer who's b*tching to the effing KERNEL programmers (i.e. not your average script kid) about something that he's arbitrarily decided to perceive as a kernel issue on the basis that "it works on Windows lol", and who demands that it be fixed by THEM because he "can't go asking the vendors to support Linux". All the while, the kernel devs are kindly trying to explain to him that it's not a problem of Linux but a problem of the vendors, and that unfortunately it's one of those things where, without official vendor support, the kernel devs can't provide a proper solution, like e.g. Nouveau.

                    But what am I saying - this is the same guy that's been bitching for years and years against Linux and the open source model, and praising Nvidia for their utterly horrible handling of GPU driver support or Microsoft for their great product that is Windows (not).

                    Seriously dude, you're a very sad person.
