Making Use Of Btrfs 3-Copy/4-Copy Support For RAID1 With Linux 5.5+

  • #11
    Originally posted by polarathene View Post

    Oh, cool tool, thanks for that link!

    So 1TB in that case, and if the initial disk setup was only 2x 1TB and 1x 500GB, then the two 1TB disks would lose half of their available storage capacity, right? (Only 500GB available in RAID1C3.) At least until more capacity/devices are added to the pool.
    Quite right

    Comment


    • #12
      Originally posted by polarathene View Post
      How does the feature work with differing device storage capacities?
      The same as with btrfs RAID1: as long as it can split the data across multiple drives it's fine; it does not strictly need same-size drives.

      It makes eyeballing final size a bit more complicated though.
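      To make the eyeballing less painful, here is a rough sketch of the greedy allocation btrfs uses (and which the linked calculator models): each chunk puts one copy on each of the N drives with the most free space. The function name and the 1 GiB chunk granularity are my own assumptions, not btrfs internals.

```shell
# raid_capacity COPIES SIZE_GIB [SIZE_GIB...]
# Greedy chunk-allocation sketch: each 1 GiB chunk stores one copy on
# each of the COPIES drives that currently have the most free space.
raid_capacity() {
    copies=$1; shift
    usable=0
    while :; do
        # keep the free-space list sorted, largest first
        set -- $(printf '%s\n' "$@" | sort -rn)
        # a chunk needs COPIES drives with free space left
        eval "nth=\${$copies:-0}"
        [ "$nth" -gt 0 ] || break
        # take 1 GiB from each of the COPIES largest drives
        i=1; new=""
        for d in "$@"; do
            [ "$i" -le "$copies" ] && d=$((d - 1))
            new="$new $d"
            i=$((i + 1))
        done
        set -- $new
        usable=$((usable + 1))
    done
    echo "$usable"
}

# 2x 1000 GiB + 1x 500 GiB in RAID1C3: limited by the smallest drive
raid_capacity 3 1000 1000 500   # prints 500
```

      With plain RAID1 (two copies) the same three drives yield 1250 GiB, since the two 1 TB drives can pair with each other once the 500 GB drive fills up.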

      Comment


      • #13
        Originally posted by scineram View Post
        What happens currently when one device fails in RAID1?
        btrfs throws an error and then starts writing in Single mode on the remaining drives in the array (i.e. it picks one of the surviving drives and writes to it, even though the array is still seen as a single volume by the OS).

        If you reboot at this point, it will refuse to mount read-write until you add a replacement disk and fix the issue (as below).

        After you replace the drive, you need to run a balance converting to the RAID level you want, so everything written as Single gets converted and replicated across the drives as normal.
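        A sketch of that recovery sequence; the device names, the failed devid, and the mount point here are hypothetical, so adjust them for your system:

```shell
# Mount the degraded array read-write (needs the 'degraded' mount option):
mount -o degraded /dev/sdb /mnt/array

# Replace the dead drive (devid 2 in this example) with the new disk:
btrfs replace start 2 /dev/sdd /mnt/array
btrfs replace status /mnt/array

# Convert chunks written as "single" while degraded back to RAID1
# (the 'soft' filter skips chunks already in the target profile):
btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt/array
```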

        Comment


        • #14
          Originally posted by starshipeleven View Post
          The same as with btrfs RAID1: as long as it can split the data across multiple drives it's fine; it does not strictly need same-size drives.

          It makes eyeballing final size a bit more complicated though.
          RAID1, at least for HDDs IIRC, is used for metadata, but that's unwise for SSDs. So these new RAID1 versions, I guess for HDDs, could also allocate extra copies to the same disk? (Not sure if this only worked for metadata or needed some other mode; it might actually have been called DUP or something.)

          In RAID1, if a metadata copy corrupts, then it's detected, but it can't repair it as it's not sure which copy is valid or something? With data it can, though, based on related metadata I think? So RAID1C3 for metadata and RAID1 for data would be a good setup?

          Comment


          • #15
            Originally posted by polarathene View Post
            RAID1, at least for HDDs IIRC, is used for metadata, but that's unwise for SSDs.
            That applies if you are running RAID1 on a single drive (the "dup" profile).

            That's open to debate. It all depends on how the SSD controller deals with it. If it tries to be smart and runs deduplication, it will deduplicate the RAID1 data on its own, and you get no real benefit in practice, as both copies end up in the same cells in hardware.

            I personally doubt most SSDs are smart enough (or powerful enough) to do that on a regular basis unless they are under serious space pressure (space is running out), as they always favor performance over space efficiency.

            Of course, if you are placing multiple SSDs in RAID1 then this is not an issue at all, as the two copies will be on different drives altogether.

            So these new RAID1 versions, I guess for HDDs, could also allocate extra copies to the same disk? (Not sure if this only worked for metadata or needed some other mode; it might actually have been called DUP or something.)
            No, these work only across different drives. The "RAID1 on a single volume" mode has always been called "dup". If you want multiple copies on the same disk with this system, you must make more partitions on the same drive.

            For example, you can make two partitions on a single drive and create a btrfs RAID1 setup on them, since btrfs then sees two different volumes. Even though it's technically the same drive, it will not complain that what you are doing is weird and possibly dumb.
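            A sketch of both options (the partition names are hypothetical):

```shell
# RAID1 across two partitions of the same disk: two copies, but no
# protection if the disk itself dies.
mkfs.btrfs -d raid1 -m raid1 /dev/sda1 /dev/sda2

# The supported single-device way to keep two copies is the dup profile:
mkfs.btrfs -d dup -m dup /dev/sda1
```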

            In RAID1, if a metadata copy corrupts, then it's detected, but it can't repair it as it's not sure which copy is valid or something?
            In RAID1, if metadata corrupts it will be fixed: checksums detect the error, and since it is RAID1 there are two copies of the metadata (and usually one of them is good).
            The same goes for data.
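            That self-repair can also be triggered explicitly with a scrub (the mount point here is hypothetical):

```shell
# Reads every block, verifies checksums, and rewrites any bad copy
# from the good mirror:
btrfs scrub start /mnt/array
btrfs scrub status /mnt/array
```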

            With data, it can though based on related metadata I think?
            Kind of. File system metadata means "file system infrastructure".

            In a filesystem without any redundancy:

            -If you have metadata corruption, you have lost a large amount (or all) of your drive's data, as there is now no way to know where the data is on the drive, and you may be unable to mount the filesystem at all. I remember fondly the times when a sudden power-off in XP or Win98 would corrupt the drive like this. This is the most dangerous type of corruption.

            -If you have data corruption, you have lost only the file that was corrupted. It's bad, but the other files on the drive are still fine.

            Decent filesystems have some form of metadata protection, such as the journaling in NTFS/ext4/whatever, which allows recovering consistent filesystem metadata after a crash or sudden shutdown while writing. But journaling can't protect data, so whatever you were writing at the moment of the crash will most likely be corrupted in some way.

            Btrfs and ZFS are CoW filesystems, so even in the event of a crash or power-off on a single partition without redundancy they won't corrupt data or metadata. The same is true for log-based filesystems like NILFS, F2FS, and UDF.

            So RAID1C3 for metadata and RAID1 for data would be a good setup?
            If you have more than one drive, yeah, you can do that, although it's not strictly required.
            Btrfs RAID1 is plenty reliable for a small setup with 4-8 drives; RAID1 with multiple redundancy is more for huge arrays with dozens of drives (where the chances of more than one drive failing are actually significant).

            Metadata does not occupy a whole lot of space, though, so if you put metadata in RAID1C3 you won't be wasting space at all: whatever isn't allocated to metadata will be used by data (and seriously, metadata in your 1TB/500GB setup isn't going to use anywhere near 500GB anyway).
            Last edited by starshipeleven; 12 February 2020, 07:08 AM.

            Comment


            • #16
              Originally posted by starshipeleven View Post
              In RAID1, if metadata corrupts it will be fixed: checksums detect the error, and since it is RAID1 there are two copies of the metadata (and usually one of them is good).
              The same goes for data.
              Ah, OK, the checksums allow identifying which metadata copy is valid for repair. While nocow data has no checksums, does its related metadata still use CoW and thus checksums? Does that allow identifying which copy is corrupt and which is valid for repair? I thought there was some scenario where RAID1 with btrfs was able to detect corruption but not repair it, as it wasn't able to assert which copy was bad, just that they were no longer the same; I think it was with nocow data?

              If so, then I guess RAID1C3 on metadata chunks doesn't help as much as it would on nocow data chunks?

              If I'm totally off base here, then I need to go look through my notes, but it was one of the concerns I had with adopting BTRFS, other than just keeping regular backups elsewhere.

              Comment


              • #17
                Originally posted by polarathene View Post
                Ah, OK, the checksums allow identifying which metadata copy is valid for repair.
                Checksums are at block level: every block written has a checksum, be it metadata or data. The corruption-detection logic in btrfs is the same either way and does not care what the block actually contains.

                While nocow data has no checksums, does its related metadata still use CoW and thus checksums?
                Yes; the options for that are called nodatacow and nodatasum, after all.

                There is no sane reason to NOT protect metadata with checksums and CoW: the performance impact is negligible, and corruption of metadata usually hoses the filesystem (any filesystem).
                AFAIK there is no option to turn off CoW and checksums for metadata in btrfs, for this reason.
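                For completeness, nodatacow can also be set per file or directory rather than mount-wide, via the C file attribute. The path here is hypothetical, and the attribute only takes effect if set while the file or directory is still empty:

```shell
# New files created here inherit nodatacow, and therefore have no
# data checksums either:
mkdir /mnt/array/vm-images
chattr +C /mnt/array/vm-images
```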

                Does that allow identifying which copy is corrupt and which is valid for repair? I thought there was some scenario where RAID1 with btrfs was able to detect corruption but not repair it, as it wasn't able to assert which copy was bad, just that they were no longer the same; I think it was with nocow data?
                nodatacow and nodatasum have the same effect (both CoW and checksums are disabled for data in either case).

                If you have no checksums it can't detect any issues. Kind of obvious.

                The only scenario where btrfs can detect but not fix corruption is when you have checksums enabled but don't have a good copy. If your data is "Single", for example (a single partition formatted with btrfs and default settings), there is a checksum but only one copy: if the checksum check detects corruption, it will throw an error but won't be able to correct it, as there is no good copy to use.

                In RAID1 this can only happen after a drive fails: as I said, once a drive fails it starts writing in "Single" mode, so there is only one copy of that data. This hardly matters while the array is degraded, since if you lose another drive you have lost the whole array anyway.

                After you replace the drive you MUST run btrfs balance start -dconvert=raid1 -mconvert=raid1 /path/to/filesystem to convert all the chunks written in "Single" mode back to RAID1. If you don't, some data may still be stored as "Single", and that will be a problem later if there is corruption or another drive fails.

                If so, then I guess RAID1C3 on metadata chunks doesn't help as much as it would on nocow data chunks?
                RAID1C3 with nodatacow/nodatasum does not make much sense, as you are storing three copies of the data but have no way of knowing which of them is the right one.

                It will work like an mdadm or hardware RAID1 (though more flexible, as it can still use drives of different sizes), in the sense that yes, it will survive drive failures. But now you are relying on the drive's own ECC to always report good data even as sectors start failing (which usually happens a few times before a drive actually dies), and that defeats the whole point of btrfs.

                Comment


                • #18
                  Originally posted by starshipeleven View Post
                  The only scenario where btrfs can detect but not fix corruption is when you have checksums enabled but don't have a good copy. If your data is "Single", for example (a single partition formatted with btrfs and default settings), there is a checksum but only one copy: if the checksum check detects corruption, it will throw an error but won't be able to correct it, as there is no good copy to use.

                  RAID1C3 with nodatacow/nodatasum does not make much sense, as you are storing three copies of the data but have no way of knowing which of them is the right one.

                  It will work like an mdadm or hardware RAID1 (though more flexible, as it can still use drives of different sizes), in the sense that yes, it will survive drive failures. But now you are relying on the drive's own ECC to always report good data even as sectors start failing (which usually happens a few times before a drive actually dies), and that defeats the whole point of btrfs.
                  Ah, thanks! I just looked through my notes, and it was the single-copy bitrot detection (with no ability to repair) that I had been thinking of all this time; nothing to do with RAID copies. Sorry about all that. If Phoronix had the equivalent of reddit gold I'd gift some!

                  I guess any nodatacow content will just need to rely on regular backups then. Cheers for all the clarification and advice!

                  Comment
