The dangers of Linux kernel development


  • #11
    Code:
    + /* for (unsigned int i = 0x0120; i <= 0x013f; ++i) {
    + unsigned char output;
    + i8042_command(&output, i);
    + }*/
    Obvious danger: a future version of the compiler could make unsigned int a different size than you expect, in which case 0x0120 might instead be 0x00000120, and your code fails. Or the range you are trying to dump gets expanded (or worse, shrunk); is there some way to get these values from the device, or do they have to be hardcoded? And so on.

    As a general rule, the more assumptions you have to make, the more likely things will eventually break.
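One way to make such assumptions visible is to name them and let the compiler check them. The following is only a sketch, assuming a C11 compiler; `DUMP_FIRST_CMD`, `DUMP_LAST_CMD` and `dump_cmd_count` are hypothetical names introduced for illustration and are not part of any driver.

```c
#include <limits.h>

/* Hypothetical names for the bounds hardcoded in the loop above. */
#define DUMP_FIRST_CMD 0x0120u
#define DUMP_LAST_CMD  0x013fu

/* Compile-time checks: the build fails if an assumption stops holding. */
_Static_assert(UINT_MAX >= DUMP_LAST_CMD,
               "unsigned int cannot hold the last command value");
_Static_assert(DUMP_FIRST_CMD <= DUMP_LAST_CMD,
               "dump range is empty or reversed");

/* Number of commands the loop would issue, derived from the named bounds. */
static unsigned int dump_cmd_count(void)
{
    return DUMP_LAST_CMD - DUMP_FIRST_CMD + 1u;
}
```

If the range ever changes, only the two constants need touching, and a reversed or oversized range is caught at build time rather than on someone's machine.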


    • #12
      Originally posted by lowflyer
      I'd like to make one point here: Do not just give advice or instructions, educate your "clients" to the point that they *understand* the possible consequences. I know that Americans generally prefer simple do-this-do-that instructions - but this leaves you with the dilemma of liability if something breaks.
      LOL. I am not an American.

      Originally posted by gamerk2
      Code:
      + /* for (unsigned int i = 0x0120; i <= 0x013f; ++i) {
      + unsigned char output;
      + i8042_command(&output, i);
      + }*/
      Obvious danger: a future version of the compiler could make unsigned int a different size than you expect, in which case 0x0120 might instead be 0x00000120, and your code fails. Or the range you are trying to dump gets expanded (or worse, shrunk); is there some way to get these values from the device, or do they have to be hardcoded? And so on.

      As a general rule, the more assumptions you have to make, the more likely things will eventually break.
      It was just an example and a debugging hack.
      The 8042 commands are described here:

      0x21 to 0x3F: Read "byte N" from internal RAM (where 'N' is the command byte & 0x1F). Response: unknown (only the first byte of internal RAM has a standard purpose).

      I would claim that the risk there would be really small. It was specified as RAM not I/O registers. As the 8042 is currently implemented in software AFAIK, I just thought that if I were lucky it would just dump the Embedded Controller RAM (which would probably be very similar to the EC being dumped using the ACPI protocol).
      There was a risk that an implementation would steal these opcodes for itself, though.
      In the end I opted for a different debugging method.
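To make the arithmetic concrete, here is a minimal sketch of how the loop values map to RAM indices. It assumes, from my reading of the Linux i8042 driver, that the low byte of the value passed to i8042_command() is the opcode sent to the controller and that bits 8-11 give the number of reply bytes to read. `struct cmd_fields` and `decode_cmd` are hypothetical names, not driver API.

```c
#include <stdint.h>

/* Hypothetical helper type: the pieces of one command value. */
struct cmd_fields {
    uint8_t opcode;      /* byte written to the controller */
    uint8_t reply_bytes; /* assumed: bits 8-11 of the command value */
    uint8_t ram_index;   /* 'N' from the description: opcode & 0x1F */
};

/* Split a command value such as 0x0120 into its assumed fields. */
static struct cmd_fields decode_cmd(unsigned int command)
{
    struct cmd_fields f;

    f.opcode = (uint8_t)(command & 0xffu);
    f.reply_bytes = (uint8_t)((command >> 8) & 0xfu);
    f.ram_index = (uint8_t)(f.opcode & 0x1fu);
    return f;
}
```

Under that assumption, 0x0120 is opcode 0x20 (RAM byte 0) with one reply byte, and 0x013f is opcode 0x3F (RAM byte 31), so the loop would walk all 32 bytes of the internal RAM.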


      • #13
        I broke Linus's printer once and I'm still here


        • #14
          Originally posted by Mat2
          Hello,
          I have recently started to take part in kernel dev, for the time being by helping people debug their problems.
          I am a bit worried about the possibility of damaging someone's hardware.
          How much risk is there? Has this happened to anybody here?

          I have heard that someone here destroyed someone else's hardware by making them dump the registers (IO or MMIO space) from a wide range.

          On

          it is written:
          "However, please beware that we can't be responsible if your system winds up getting damaged. This warning is particularly important for classes in Kernel Internals and Device Drivers, where you will be compiling and installing kernels and kernel modules. Such operating system damage, while rare, is still possible."


          There was this problem with the e1000 driver destroying either EEPROM data or something in hardware. The most direct vector for problems is a graphics card set to a wrong refresh rate, which can occasionally fry the display hardware. That probably doesn't happen anymore with smarter displays. Maybe something botched in power/thermal management could damage the hardware, but I am not sure about anything directly harmful.


          • #15
            Originally posted by yoshi314
            There was this problem with the e1000 driver destroying either EEPROM data or something in hardware. The most direct vector for problems is a graphics card set to a wrong refresh rate, which can occasionally fry the display hardware. That probably doesn't happen anymore with smarter displays. Maybe something botched in power/thermal management could damage the hardware, but I am not sure about anything directly harmful.
            How can a driver for a PCI network card do anything based on the display refresh rate?

            Yep, I am aware that there were some issues with lm-sensors flashing the BIOS and even today sensors-detect carries a big warning.

            EDIT: http://lwn.net/Articles/300202/


            Fortunately many of the problems when Linux "breaks" hardware can be fixed, just like with the Samsung laptop and this NIC.
            Last edited by Mat2; 16 August 2014, 04:48 AM.


            • #16
              "There is one thing that will not have changed, though. Testers of unstable software - especially the kernel - have often been warned that said software can do all kinds of terrible things to their systems. It is easy to ignore those warnings; even -rc1 kernels actually work for most people, most of the time. But, as we have seen in this case, the potential for catastrophic bugs is real. Development code can brick your network adapter, scramble your filesystems, open up severe security holes, or save your documents as OOXML. When experimenting with unstable code - even if it has been neatly packaged by your distributor - it is always prudent to have good backups and an even better sense of humor."


              • #17

                "So in the interests of adding some closure to this bug. The issue turns out to have never been the e1000e driver's fault. The fault lies with the CONFIG_DYNAMIC_FTRACE option. So specifically when the FTRACE code was enabled, it was doing a locked cmpxchg instruction on memory that had been previously used as __INIT code from some other module.

                a) some other module loads
                b) that module's init code calls into ftrace which stores the EIP
                c) that module discards its init code
                d) e1000e loads
                e) e1000e asks the kernel for memory to ioremap onto, and gets the memory location of the code at b) and maps the flash/NVM control registers there.
                f) ftraced runs and rewrites onto bytes 4-8 of the memory location from b/e
                g) since the lock/cmpxchg instruction is undefined for memory mapped registers, random junk is written to the b/e location
                h) depending on the contents of the junk in g) the NVM is either byte corrupted or block erased, which is detected the next time the e1000e driver is loaded.

                a short term workaround is in 2.6.27.1 (disable CONFIG_DYNAMIC_FTRACE) and the longer term fix is rewrites of the cmpxchg code (which is already done and will be in 2.6.28-rc1)"
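The sequence in the quoted message can be mimicked with a toy user-space model. This is only a sketch and not kernel code: every name in it is hypothetical, and it assumes (as I understand the x86 manuals) that a locked cmpxchg issues a write to its destination even when the comparison fails, which is harmless for ordinary RAM but not for a register whose writes have side effects.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: every name below is hypothetical, none is a kernel API. */
enum owner { OWNER_INIT_CODE, OWNER_FREE, OWNER_DEVICE_REGS };

struct region {
    enum owner owner;        /* who currently owns this address range */
    uint32_t word;           /* contents at the recorded address */
    unsigned int dev_writes; /* writes that arrived while a device owned it */
};

/* What the patcher remembers from step b). */
struct patch_record {
    struct region *where;
    uint32_t expected;
};

static void region_write(struct region *r, uint32_t value)
{
    if (r->owner == OWNER_DEVICE_REGS)
        r->dev_writes++;     /* a real device could treat this as a command */
    r->word = value;
}

/* Compare-and-swap modelled so that it writes even on a mismatch
 * (an assumption about locked cmpxchg behaviour, not a measured fact). */
static bool toy_cmpxchg(struct region *r, uint32_t expected, uint32_t desired)
{
    bool match = (r->word == expected);

    region_write(r, match ? desired : r->word);
    return match;
}

/* Runs steps b) to f) and returns how many writes hit the "device". */
static unsigned int simulate_stale_patch(void)
{
    struct region r = { OWNER_INIT_CODE, 0xE8000000u, 0 };
    struct patch_record rec = { &r, 0xE8000000u };     /* b) address recorded */

    r.owner = OWNER_FREE;                               /* c) init code discarded */
    r.owner = OWNER_DEVICE_REGS;                        /* e) range reused for registers */
    r.word = 0;

    toy_cmpxchg(rec.where, rec.expected, 0x90909090u); /* f) stale patch attempt */
    return r.dev_writes;
}
```

In this model the comparison fails, yet one write still lands on the range that now belongs to the device, which is the kind of stray access steps g) and h) describe.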


                • #18
                  I think it really depends on what exactly you're working on. If you're working on GPU drivers in combination with fan control, you might be able to break someone's hardware by accidentally disabling the fan while the GPU load is at 100%. If you're working on disk drivers, you might be able to mess up someone's data badly enough that they need to reinstall, basically destroying their data. In most cases the worst that should happen is crashing someone else's computer. Of course, you might trigger a hidden hardware bug, rendering a random piece of hardware nonfunctional.

                  The questions you should be asking yourself are: What exactly am I working on? What is the worst that could happen? How can I make sure it never does? The answer would probably be along the lines of: Be aware of what you're doing, think carefully about what might happen if there's a bug, and test early and often.

                  However, even if the chance of permanently damaging stuff is very small, you can't guarantee that it'll never happen. That is why you tell people not to test this kind of stuff on production machines and to have a backup handy. Breaking hardware is far less likely than corrupting data, and in most cases replacing hardware is simply a question of money; data, on the other hand, can be irreplaceable/invaluable. That is why you put the warning sticker on there. You don't want to be held responsible when Corporate Boss decides it's a good idea to test a new kernel on all production machines and ends up causing his company a multi-million loss because the machines are unavailable for a few days and need to be reinstalled.

                  TL;DR: Those warnings and liability waivers are there to protect you no matter how small the chance of actually breaking things is. The cost associated with a tiny flaw could be astronomical based on where and when it happens.


                  • #19
                    Hello,
                    I once messed up a hard disk badly enough that it caught fire (well, it was just a chip on the controller that caught fire).
                    Some notes about it: it was a hard disk I didn't care about anymore, an old locked-to-the-machine 10GB IDE disk that used to be in an Xbox until I replaced it with a custom one.
                    So I was just playing around with it, with the disk attached to a USB port. I can't remember exactly, but I was playing around with a serial connection and a hex editor.
                    When it caught fire, I quickly unplugged it, then re-plugged it. It caught fire again. It ended its life opened up, so I now have a demonstration hard disk to show how one is made...
                    (Just keep in mind I was really young at the time, maybe ~12 years old, and I didn't care about this particular hardware, so I am pretty sure your computer won't catch fire if I help you through the Internet.)

                    So, my point is: yes, pretty bad things can happen, even if you only mess with the software side, but you have to be looking for breakage for it to actually happen, I think.
                    This just confirms Schmidt's law. However, I can't assure you that bad things won't happen, even if you have only just started messing with something.


                    • #20
                      Originally posted by M@yeulC
                      (Just keep in mind I was really young at the time, maybe ~12 years old, and I didn't care about this particular hardware, so I am pretty sure your computer won't catch fire if I help you through the Internet.)
                      Born a hacker!
