The Linux 4.0 Kernel Currently Has An EXT4 Corruption Issue


  • #41
    Originally posted by waxhead View Post

    Debian pulls in the Linux kernel, and if you are a bit boring and happy with things being rock solid, you stick with Debian's stable branch. Debian is the mother distro for many other distributions, so indirectly the QA process Debian has benefits the whole ecosystem. The way I see it, QA for Linux is a non-issue simply because of the many users who test it. Any serious usage would tend to use an older kernel, so this is considered a non-issue. Remember, there is always one more bug.
    That is completely beside the point. Any critical modern project should be using modern QA methodologies as part of being the upstream, rather than relying on others to do it for you. That includes continuous integration and an appropriate test suite run on every pull request. Your users should be the last thing relied upon to discover regressions, not the first or only.



    • #42
      Originally posted by torsionbar28 View Post
      A data corruption bug that only affects those who have invested extra effort in system stability and reliability is the polar opposite of a "positive note".
      It's positive because those are the same people who:
      • Have a sane backup strategy
      • Don't jump onto the latest kernel in production
      • Test their setup on a testing machine before deploying it to production
      So the likelihood of any actual data being lost is much lower than for people who have no RAID, no backups, and install whatever, whenever.



      • #43
        Guys, in case anyone is wondering: this bug affects systems with kernel 3.19.7+ or 4.0.2+ running any filesystem on top of md raid0 that supports and enables TRIM. If you don't use fstrim and don't have the discard option enabled in fstab (neither is enabled by default in most distros), you wouldn't be affected. For those asking, Fedora 21 is also vulnerable; that is where I originally ran into this problem and debugged it. When I found out the latest snapshot of Fedora 22 also caused random corruption, I started to dig into the issue. BTW, it has a number of other annoying raid installer bugs.

        I reported the issue here, https://bugzilla.kernel.org/show_bug.cgi?id=98501 in case anyone wants to see the exact fix. It's basically a one-line change that was improperly placed. The original bug was committed back on April 10th but went unnoticed until a large number of users updated their kernel. This week Fedora updated to 3.19.7 and Arch Linux to 4.0.3.
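        Putting the conditions above together: one rough way to see whether a box matches the affected configuration is to look for the discard mount option in fstab (the file path and fstab contents below are purely illustrative, not from any real system):

```shell
# Sketch: does a given fstab enable continuous TRIM via the 'discard' option?
uses_discard() {
    # match non-comment lines that carry a 'discard' mount option
    grep -Eq '^[^#].*\bdiscard\b' "$1"
}

# Hypothetical example fstab:
cat > /tmp/fstab.example <<'EOF'
# /etc/fstab (example)
UUID=1234-abcd  /      ext4  defaults,discard  0 1
UUID=5678-efgh  /home  ext4  defaults          0 2
EOF

if uses_discard /tmp/fstab.example; then
    echo "discard enabled"
else
    echo "no discard option"
fi
```

        On a real system you would also check whether a periodic fstrim job/timer is active and whether /proc/mdstat shows a raid0 array; with neither discard nor fstrim in play, the bug wouldn't be triggered.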

        Those of us who are software developers know there are corner cases you can't always catch. Anyone suggesting full coverage testing on millions of lines of code is crazy given the rapid rate of change; they'd need an insane amount of hardware, not to mention how many tests they would need to write. I would agree RAID 0 on SSDs is quite common now, but if you look at where the bug is located, it's not obvious at all that TRIM would be affected and hence needed testing. The original committer has admitted he made a mistake and has at least explained why it's related to the fix. I had a total filesystem meltdown like the rest of you, but I knew this was a remote possibility when I chose Fedora over Debian stable or CentOS/RHEL. Let's just hope it stays uncommon.



        • #44
          Originally posted by wargames View Post
          It's fun to see the Linux zealots in this thread trying to defend the indefensible. Linux 4.0.2 is considered a stable kernel, so stop saying things like "you should not use it in production" as a lame excuse for the horrendous development model. NTFS is rock solid and has been for more than a decade.
          1. I'll start by pointing out that I'm a former Windows system (and kernel) developer, and that most likely I know the internals of Windows at least as well as you do (if not far better).

          2. 4.0.2 is not a stable kernel. I would venture to say that anything newer than whatever is currently being used by EL distributions (RHEL, CentOS, Oracle, SUSE, Debian stable) should not be considered stable. Though I personally use Fedora for certain production servers (and am well aware of the risk attached to using non-Enterprise Linux in production), 99.99% of all sysadmins should opt for stable EL distributions.

          3. Last but not least: bugs happen, even FS-trashing ones.
          Back when Windows NT4 SP4 was released, I stumbled upon an idiotic bug within the NT kernel that would BSOD Windows on the spot and, in 20-30% of cases, trigger NTFS corruption (to the point of no-boot). The bug was triggered by modifying the security token of the currently running process (*from user space*) with a certain sequence of unaligned SIDs. As my company was not a "valued customer" (as it was called back then), we could not report this bug, and it lingered on for close to a *decade*. As far as I remember, I could no longer trigger it once Windows 2K8 Server (!!!!) was released. (I no longer have access to the code that triggered this bug, so I can't really test Windows 7 and 8, but I would imagine that whatever fix was applied to Windows 2K8 Server was also applied to Windows 7 and 8.)

          - Gilboa
          Last edited by gilboa; 21 May 2015, 03:18 AM.
          oVirt-HV1: Intel S2600C0, 2xE5-2658V2, 128GB, 8x2TB, 4x480GB SSD, GTX1080 (to-VM), Dell U3219Q, U2415, U2412M.
          oVirt-HV2: Intel S2400GP2, 2xE5-2448L, 120GB, 8x2TB, 4x480GB SSD, GTX730 (to-VM).
          oVirt-HV3: Gigabyte B85M-HD3, E3-1245V3, 32GB, 4x1TB, 2x480GB SSD, GTX980 (to-VM).
          Devel-2: Asus H110M-K, i5-6500, 16GB, 3x1TB + 128GB-SSD, F33.



          • #45
            Originally posted by wargames View Post
            It's fun to see the Linux zealots in this thread trying to defend the indefensible. Linux 4.0.2 is considered a stable kernel, so stop saying things like "you should not use it in production" as a lame excuse for the horrendous development model. NTFS is rock solid and has been for more than a decade.
            Remember the Windows Home Server NTFS corruption bug?

            The one that Microsoft took *7 months* to fix?

            And the only workaround from MS was: "have a backup copy of any important program files before you store these files on a system that is running Windows Home Server."?

            "When certain programs are used to edit or transfer files that are stored on a Windows Home Server-based system that has more than one hard drive, the files may become corrupted"



            • #46
              Shit happens. Anything from a kernel bug to a firmware bug to the R/W head of an HDD falling off the arm can trash your filesystem. I've seen ALL of these happen over the years and have plenty of dead hard drives. Thanks to multiple backups I don't lose data, though when something happens to one of my RAID 0 arrays of three 2TB drives, it takes over 5 hours to repopulate the replacement filesystem. Last time that happened I was changing out an iffy drive when ANOTHER drive chose that moment to drop off the SATA bus forever, so I was already committed to copying everything back anyway.

              ANY AND ALL data you have on only one device should be considered volatile; don't count on seeing it again after an arbitrary number of reboots. BTW, don't count on a backup drive that is always in the same machine as your main drive: a PSU failure or a power surge can take out both at once! That's another reason to keep at least three copies of everything, since two are mounted during any copyback job.
