The dangers of Linux kernel development

  • The dangers of Linux kernel development

    Hello,
    I have recently started to take part in kernel dev, for the time being by helping people debug their problems.
    I am a bit worried about the possibility of damaging someone's hardware.
    How much risk is there? Has this happened to anybody here?

    I have heard that someone here destroyed someone else's stuff by making them dump registers (I/O or MMIO space) over a wide range.

    On
    http://training.linuxfoundation.org/...g-requirements
    it is written:
    "However, please beware that we can't be responsible if your system winds up getting damaged. This warning is particularly important for classes in Kernel Internals and Device Drivers, where you will be compiling and installing kernels and kernel modules. Such operating system damage, while rare, is still possible. "
    Last edited by Mat2; 08-04-2014, 02:41 PM.

  • #2
    Bump!


    • #3
      Bump! Anyone?
      This may happen to anybody, even to the best kernel hackers.


      • #4
        I don't think anyone can give you a specific answer. Schmidt's law applies here just like everywhere else...

        If you mess with a thing long enough, it'll break
        ... so I guess there are two ways to interpret the question you are asking.

        1. Since hardware fails at a certain rate during normal use, does that rate increase when debugging kernel and driver issues, and if so, by how much?

        2. If I am helping someone to troubleshoot an issue, is there a risk that the hardware will fail during that effort and that I will be considered at fault?

        For question 1, my impression is that if you are constantly locking up and resetting a PC you do increase the chances of a random register read/write hitting something that "lets the smoke out", but (a) HW manufacturers try really hard to protect against this, to the point that I haven't heard about it happening recently, (b) the actual risk is a function of the implementation details of the specific hardware components in each PC, and (c) while failures do happen the increase in failure rate relative to normal use seems to be sufficiently low that it hasn't been considered measurable, although it does seem to be non-zero.

        For question 2, this is something you need to consider every time you give advice since systems are going to fail whether or not you are debugging kernel/driver issues. One time I was showing someone how to configure an email client and the system died part way through. To this day I think they suspect I broke their hardware. There's no good answer here other than discussing with the HW owner before you start and getting a clear agreement that you are only helping and take no liability.
        AFAICS the disclaimer you mentioned in the original post is aimed primarily at the second question -- system failure during development/debugging whether or not the failure is actually related to what you are doing.

        I don't think there are any good answers here. Most countries/provinces have established "Good Samaritan" laws for cases where bystanders help people in serious trouble -- without them, even helping someone out of a burning vehicle (which I've had to do a couple of times) puts you at risk if something goes wrong. Those laws don't apply to kernel/driver debugging, of course, at least not in Canada.

        Bottom line / recap: There's no good answer here other than discussing with the HW owner before you start and getting a clear agreement that you are only helping and take no liability. Never hurts to make sure their system is backed up before you start, since most people don't seem to do that either.
        Last edited by bridgman; 08-09-2014, 12:59 PM.


        • #5
          Many thanks for your answer.
          Originally posted by bridgman
          I don't think anyone can give you a specific answer. Schmidt's law applies here just like everywhere else...
          I'm sorry, but I couldn't find anything about Schmidt's law.

          Originally posted by bridgman
          ... so I guess there are two ways to interpret the question you are asking.

          1. Since hardware fails at a certain rate during normal use, does that rate increase when debugging kernel and driver issues, and if so, by how much?

          2. If I am helping someone to troubleshoot an issue, is there a risk that the hardware will fail during that effort and that I will be considered at fault?
          I meant directly breaking the hardware, e.g. modifying the BIOS by mistake, or changing a setting in a chip so that the battery has to be removed to reset it (which may be a problem if the battery is non-replaceable).


          Originally posted by bridgman
          For question 1, my impression is that if you are constantly locking up and resetting a PC you do increase the chances of a random register read/write hitting something that "lets the smoke out", but (a) HW manufacturers try really hard to protect against this, to the point that I haven't heard about it happening recently, (b) the actual risk is a function of the implementation details of the specific hardware components in each PC, and (c) while failures do happen the increase in failure rate relative to normal use seems to be sufficiently low that it hasn't been considered measurable, although it does seem to be non-zero.
          Could you give any rules for preventing it? What things are especially risky?

          For example, is this risky?

          + /* for (unsigned int i = 0x0120; i <= 0x013f; ++i) {
          + unsigned char output;
          + i8042_command(&output, i);
          + }*/
          This reads every byte of the 8042 keyboard controller's internal RAM. Modern 8042s are probably implemented in software in the embedded controller (EC), and these commands are rarely issued nowadays (nobody interfaces with an 8042 by hand), so the programmer/designer may not have accounted for them, and the controller might lock up to the point that taking the battery out becomes necessary.

          After dumping the contents of the RAM in a different way, it turned out that bytes at addresses other than 0 do not really exist: their meaning was never defined, so the EC does not implement them.
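
As a rough illustration of what the loop above asks the controller to do, here is a minimal user-space sketch that decodes each command word without touching any hardware. It assumes the Linux i8042 driver's convention that the low byte is the 8042 command, bits 8-11 are the number of reply bytes, and bits 12-15 are the number of parameter bytes. `decode_i8042_cmd`, `print_dump_plan`, and `struct i8042_cmd_fields` are hypothetical names made up for this example, not kernel APIs.

```c
#include <stdio.h>

/* Hypothetical helper type for this sketch; not part of the kernel. */
struct i8042_cmd_fields {
    unsigned int opcode;      /* byte written to the 8042 command port */
    unsigned int reply_bytes; /* bytes the driver expects to read back */
    unsigned int param_bytes; /* bytes the driver sends after the opcode */
    unsigned int ram_index;   /* "byte N" for the 0x20-0x3f read commands */
};

/* Split a command word using the assumed bit layout described above. */
static struct i8042_cmd_fields decode_i8042_cmd(unsigned int command)
{
    struct i8042_cmd_fields f;

    f.opcode = command & 0xffu;
    f.reply_bytes = (command >> 8) & 0xfu;
    f.param_bytes = (command >> 12) & 0xfu;
    f.ram_index = f.opcode & 0x1fu;
    return f;
}

/* Print what each iteration of the debugging loop would request. */
static void print_dump_plan(void)
{
    for (unsigned int cmd = 0x0120; cmd <= 0x013f; ++cmd) {
        struct i8042_cmd_fields f = decode_i8042_cmd(cmd);

        printf("word 0x%04x: opcode 0x%02x, RAM byte %u, %u reply byte(s)\n",
               cmd, f.opcode, f.ram_index, f.reply_bytes);
    }
}
```

Calling `print_dump_plan()` from a `main()` lists 32 reads, covering opcodes 0x20 to 0x3f and RAM bytes 0 to 31. Under this reading, every iteration except the first asks for a byte whose meaning is not standardized.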

          Originally posted by bridgman
          For question 2, this is something you need to consider every time you give advice since systems are going to fail whether or not you are debugging kernel/driver issues. One time I was showing someone how to configure an email client and the system died part way through. To this day I think they suspect I broke their hardware. There's no good answer here other than discussing with the HW owner before you start and getting a clear agreement that you are only helping and take no liability.
          AFAICS the disclaimer you mentioned in the original post is aimed primarily at the second question -- system failure during development/debugging whether or not the failure is actually related to what you are doing.

          I don't think there are any good answers here. Most countries/provinces have established "Good Samaritan" laws for cases where bystanders help people in serious trouble -- without them even helping someone out of a burning vehicle (which I've had to do a couple of times) puts you at risk if something goes wrong. Those laws don't apply to kernel/driver debugging, of course, at least not in Canada.
          Originally posted by bridgman
          Bottom line / recap: There's no good answer here other than discussing with the HW owner before you start and getting a clear agreement that you are only helping and take no liability. Never hurts to make sure their system is backed up before you start, since most people don't seem to do that either.
          That's the best way to make sure people don't apply your patches, unfortunately.
          I also have to make a backup now, thank you.

          Have you ever personally destroyed any hardware by hacking? (I mean a general-purpose PC, not an embedded device.)

          BTW, if we did not take any risks, we would do nothing at all.
          Last edited by Mat2; 08-09-2014, 02:13 PM.


          • #6
            Bridgman is a pretty well-known contributor; I would take what he says very seriously. I don't pay much attention and even I know who he is, which should tell you something.


            • #7
              Originally posted by Mat2
              I'm sorry, but didn't find anything about the Schmidt law.
              It was just the text I quoted -- "If you mess with a thing long enough, it'll break". Sort of a joke, but with an important message -- you need to consider not only failures that happen as a direct result of your advice but also failures which are not actually related but may appear that way because of the circumstances.

              Originally posted by Mat2
              I meant directly breaking the hardware, e.g. modifying the BIOS by mistake, or changing a setting in a chip so that the battery has to be removed to reset it (which may be a problem if the battery is non-replaceable). Could you give any rules for preventing it? What things are especially risky? For example, is this risky?

              + /* for (unsigned int i = 0x0120; i <= 0x013f; ++i) {
              + unsigned char output;
              + i8042_command(&output, i);
              + }*/

              This reads every byte of the 8042 keyboard controller's internal RAM. Modern 8042s are probably implemented in software in the embedded controller (EC), and these commands are rarely issued nowadays (nobody interfaces with an 8042 by hand), so the programmer/designer may not have accounted for them, and the controller might lock up to the point that taking the battery out becomes necessary. After dumping the contents of the RAM in a different way, it turned out that bytes at addresses other than 0 do not really exist: their meaning was never defined, so the EC does not implement them.
              I know you're looking for more specific advice but I think the general rule here is to treat everything as potentially risky unless there is anecdotal evidence to the contrary. Dumping specific registers based on someone else's advice is generally safe, but dumping the whole range is more likely to have side effects so I wouldn't recommend that to a stranger UNLESS there is enough evidence that others have done this successfully on the same hardware.
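
One way to act on the "specific registers, not the whole range" rule is to have a debugging helper refuse anything that is not on a list already known to be safe on that hardware. The following is only a sketch under that assumption: `reg_read_fn`, `reg_is_allowed`, `dump_allowed_regs`, and `fake_read` are hypothetical names, the addresses are placeholders, and the fake reader stands in for real port or MMIO access so the example never touches hardware.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback that performs one register read. */
typedef uint8_t (*reg_read_fn)(uint16_t addr);

/* Return true if addr appears in the list of registers known to be safe. */
static bool reg_is_allowed(uint16_t addr, const uint16_t *allow, size_t n_allow)
{
    for (size_t i = 0; i < n_allow; ++i) {
        if (allow[i] == addr)
            return true;
    }
    return false;
}

/*
 * Read only the requested registers that are on the allowlist.
 * Values are stored in out[] in request order; out must have room for
 * n_wanted entries. Returns the number of registers actually read.
 */
static size_t dump_allowed_regs(const uint16_t *wanted, size_t n_wanted,
                                const uint16_t *allow, size_t n_allow,
                                reg_read_fn read_reg, uint8_t *out)
{
    size_t n_read = 0;

    for (size_t i = 0; i < n_wanted; ++i) {
        if (!reg_is_allowed(wanted[i], allow, n_allow))
            continue; /* skip anything nobody has vouched for */
        out[n_read++] = read_reg(wanted[i]);
    }
    return n_read;
}

/* Stand-in reader for testing: counts calls and echoes the low address byte. */
static unsigned int fake_read_calls;

static uint8_t fake_read(uint16_t addr)
{
    ++fake_read_calls;
    return (uint8_t)(addr & 0xffu);
}
```

With an allowlist of {0x60, 0x64} and a request for {0x60, 0x61, 0x64}, only two reads are issued and 0x61 is never touched. The point of the design is that widening the range requires a deliberate edit to the list rather than a change to a loop bound.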

              Originally posted by Mat2
              That's the best way to make sure people don't apply your patches, unfortunately.
              True, but IMO that may be the right answer in that specific case. You can't make decisions for others unless you're willing to take responsibility for the consequences, so the best you can do is give them your best understanding of risk, point them to online cases where others have done the same thing (if they exist) and let them evaluate the trade-off between potentially getting their problem fixed and potentially mucking up their system. You shouldn't be doing that for them.

              IMO this is the right thing to do even when talking hacker-to-hacker; the only difference is that the other person will understand and decide a *lot* faster.

              Originally posted by Mat2
              Have you ever personally destroyed any hardware by hacking? (I mean a general-purpose PC, not an embedded device.)
              No, other than the embarrassing time where I apparently mucked up the PCIE drivers (output blocks in the HW, not SW drivers) on a chipset by not double-checking the power was really off before pulling the gfx card. New box, with reset & power buttons in opposite locations from what I was used to.

              Originally posted by Mat2
              BTW, if we did not take any risks, we would do nothing at all.
              Yep. There's always a risk when you do anything -- all you can do is try to make sure the hardware owner understands the risk before going ahead. You're asking the right questions, even if there are no easy answers.
              Last edited by bridgman; 08-10-2014, 10:43 AM.


              • #8
                One last point...

                A lot of devs end up using a crappy old machine for all their important programs/data and using the shiny new machine for development & hacking. Seems backwards but there's a lot to be said for it -- just remember that the other person probably doesn't think that way.


                • #9
                  Don't do driver development or similar work on a machine with irreplaceable data. The hardware is expendable-ish; your data is priceless.


                  • #10
                    If I may chime in here ...

                    I'd like to make one point here: do not just give advice or instructions; educate your "clients" to the point that they *understand* the possible consequences. I know that Americans generally prefer simple do-this-do-that instructions, but this leaves you with the dilemma of liability if something breaks. If you take the longer road of education, your "client" is able to decide for himself about any risks.

                    After all, Linux is open source, which means that *real knowledge exchange* can happen!


                    • #11
                      Code:
                      + /* for (unsigned int i = 0x0120; i <= 0x013f; ++i) {
                      + unsigned char output;
                      + i8042_command(&output, i);
                      + }*/
                      Obvious danger: a future version of the compiler results in an unsigned int being a different size than you expect. In which case, 0x0120 might instead be 0x00000120, and your code fails. Or the range you are trying to dump gets expanded (or worse, shrunk); is there some way to get these values from the device, or do they have to be hardcoded? And so on.

                      As a general rule, the more assumptions you have to make, the more likely things will eventually break.
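
For what it is worth, the width concern can be checked at compile time rather than assumed. This is a minimal C11 sketch, not part of the original patch: leading zeros in a hex literal do not change its value, the C standard requires `unsigned int` to hold at least 0xffff, and a fixed-width type removes the question altogether. `count_dump_commands` is a hypothetical helper used only to show that the loop bounds behave the same with a 16-bit counter.

```c
#include <limits.h>
#include <stdint.h>

/* Leading zeros are only notation: both literals denote the value 288. */
_Static_assert(0x0120 == 0x00000120, "leading zeros changed the literal's value");

/* The build fails here if unsigned int could not hold the upper bound. */
_Static_assert(UINT_MAX >= 0x013f, "unsigned int too narrow for the range");

/* Count the command words in 0x0120..0x013f using a fixed-width counter. */
static unsigned int count_dump_commands(void)
{
    unsigned int count = 0;

    for (uint16_t cmd = 0x0120; cmd <= 0x013f; ++cmd)
        ++count;
    return count;
}
```

The second concern, that the valid range itself may differ between devices, is not something a compile-time check can settle; that still depends on what the hardware documents or reports.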


                      • #12
                        Originally posted by lowflyer
                        I'd like to make one point here: do not just give advice or instructions; educate your "clients" to the point that they *understand* the possible consequences. I know that Americans generally prefer simple do-this-do-that instructions, but this leaves you with the dilemma of liability if something breaks.
                        LOL. I am not an American.

                        Originally posted by gamerk2
                        Code:
                        + /* for (unsigned int i = 0x0120; i <= 0x013f; ++i) {
                        + unsigned char output;
                        + i8042_command(&output, i);
                        + }*/
                        Obvious danger: a future version of the compiler results in an unsigned int being a different size than you expect. In which case, 0x0120 might instead be 0x00000120, and your code fails. Or the range you are trying to dump gets expanded (or worse, shrunk); is there some way to get these values from the device, or do they have to be hardcoded? And so on.

                        As a general rule, the more assumptions you have to make, the more likely things will eventually break.
                        It was just an example and a debugging hack.
                        The 8042 commands are described here:
                        http://wiki.osdev.org/%228042%22_PS/...oller_Commands
                        Command 0x21 to 0x3F: Read "byte N" from internal RAM (where 'N' is the command byte & 0x1F). Response: unknown (only the first byte of internal RAM has a standard purpose).

                        I would claim that the risk there is really small. The area was specified as RAM, not I/O registers. As the 8042 is currently implemented in software AFAIK, I thought that, if I were lucky, it would just dump the embedded controller RAM (which would probably be very similar to dumping the EC using the ACPI protocol).
                        There was a risk that an implementation would steal these opcodes for itself, though.
                        In the end I opted for a different debugging method.


                        • #13
                          I broke Linus's printer once and I'm still here

                          https://lkml.org/lkml/2009/12/14/538


                          • #14
                            There was this problem with the e1000 driver destroying either EEPROM data or something in the hardware; the most direct vector for problems is a graphics card set to the wrong refresh rate, which can occasionally fry the display hardware. It probably doesn't happen anymore with today's smarter displays. Maybe something botched in power/thermal management could damage the hardware, but I am not sure about anything directly harmful.


                            • #15
                              Originally posted by yoshi314
                              There was this problem with the e1000 driver destroying either EEPROM data or something in the hardware; the most direct vector for problems is a graphics card set to the wrong refresh rate, which can occasionally fry the display hardware. It probably doesn't happen anymore with today's smarter displays. Maybe something botched in power/thermal management could damage the hardware, but I am not sure about anything directly harmful.
                              How can a driver for a PCI network card do anything based on the display refresh rate?

                              Yep, I am aware that there were some issues with lm-sensors flashing the BIOS and even today sensors-detect carries a big warning.

                              EDIT: http://lwn.net/Articles/300202/
                              http://blog.vodkamelone.de/archives/...interface.html

                              Fortunately, many of the cases where Linux "breaks" hardware can be fixed, as with the Samsung laptop and this NIC.
                              Last edited by Mat2; 08-16-2014, 04:48 AM.
