
Some Quick Tests With ZFS, F2FS, Btrfs & Friends On Linux 4.4


  • #11
    Originally posted by lumks View Post
    It would be good to see these tests on USB 2 and USB 3 flash drives too, including exFAT and vFAT, because I never know which is best to use on them.
    Don't forget another possibility that hasn't been mentioned yet but works marvelously well: UDF (see also here)
    • it works out of the box on all major OSes (Linux, Mac OS X and even Windows, using a small partition trick).
    • basically, if your device can read CDs and DVDs, it can read UDF sticks too (even some driverless video players still ship ISO9660 and UDF drivers so they can open disk images); only your photo camera might be limited
    • it doesn't have the file size limitation of FAT32 (it can work with files bigger than 4GiB)
    • and it is gentler on flash wear, due to being a log-structured file system (that's why it is used for packet writing on CDs and DVDs. If you thought that the "can only erase full erase blocks, not single sectors" limitation of flash was silly, consider that some optical media (CD-RW) can only be erased as a whole, in one go)

    That's much better than the alternatives:
    • FAT32 - Is universally accessible. BUT it has severe size limitations (e.g.: max file size is 4GiB). And the file allocation table structure that gives the filesystem its name is horrendous.
    • NTFS - used to be the best recommendation for reformatting among Windows users. But it is a Microsoft-only filesystem. It can be supported on unices, but that requires installing a 3rd-party driver, typically the NTFS-3G FUSE driver (and Mac OS X users aren't that much into this kind of "not-out-of-the-box" approach). It is supported by some embedded systems due to its popularity (and because these embedded systems usually run Linux, which can use NTFS-3G). It has a journal, which puts some milder stress on the wear levelling.
    • exFAT - the current recommendation among Windows users. It is also mandatory for putting an "SDXC" logo on something (there is otherwise no difference from SDHC flash media). It can be supported on unices, but that requires installing a 3rd-party driver (and, in the case of Linux, some legal questions regarding IP rights remain problematic). It is supported by some modern photo cameras (those with SDXC instead of SDHC logos). It is still a file allocation table-based system. I don't have that much experience with it (since I've been using UDF instead), but it can't be that light on the wear-levelling.
    • ext2/3/4 - As Linux is very popular in embedded devices, nearly every one of these devices can support it (even if they forgot to advertise it). And because there is more than one way to access it from Windows, some have actually considered it as a filesystem for flash media.

    I would DEFINITELY recommend using UDF on flash media that has to be shared between lots of different computers running different OSes.

    Originally posted by grigi View Post
    A word of warning with flash drives: They are often optimised to work with their pre-formatted fs, iow vFAT or exFAT.

    You can get rock solid performance using ext4, but you likely have to profile the flash unit and custom configure the partition offset and configure striping for optimal performance.
    In theory, YES. Some flash could be more optimized for FAT32.

    What I've read about online:
    • weird partition alignment (both the starting sector of the partition and the internal structure of the FAT32), so that the partition's boot sector ends up in the same rarely rewritten erase block as the partition table, each subsequent FAT sits in its own erase block, and the root directory is tweaked to exactly fill yet another erase block
    • the whole flash drive has an internal offset, so that by default the various FAT32 sections end up on different erase blocks as mentioned above
    • BUT SSDs (where such optimisations tend to be useful) will actually advertise such peculiarities, and Linux partitioners can take them into account.
    • BUT flash drives are much smaller, and thus the whole FAT32 metadata spans a single erase block anyway. There's no point in splitting it
    • In practice all the cards I've seen have rather regular layouts.

    Also other things
    • in theory different sections of the drives could be expected to be written to differently (FAT are more often written to than data).
    • thus in theory flash media could use different erase block sizes or different flash technologies
    • BUT in practice I've never seen that
    • in practice, what I've seen is that SSDs (and some SD cards) have a separate SLC zone that is used as a cache. Blocks (no matter which) that are often written to tend to stay in the SLC, which can handle much more erasing/rewriting. Blocks which are written to less often will eventually get moved to the MLC/TLC
    • such flash media is of course better suited for FAT/exFAT than regular flash media (the file allocation table will naturally stay in the SLC cache due to being often overwritten)
    • but such flash could as well be used for any other file system (or even... GASP... Swap! On flash!) as the most rewritten block will stay on the flash technology that is much more resilient to repeated erase/rewrite cycles.

    Basically, except for the very few weird no-name flash media that you've bought from China over eBay and that use some of the weirder optimisation schemes (and then you've been lucky, because very often those are just scams), most of the flash media you will ever encounter (especially from big brands) is rather simple:
    • you need to align partitions to 1MiB or 16MiB boundaries to coincide with erase blocks (when in doubt, look at how your SD card was partitioned before). For SATA drives, your SSD and fdisk will communicate properly and handle this automatically
    • you need to minimize the erase/rewrite cycles for longevity
    • you need to minimize the read/modify/rewrite cycles for performance
    • (also, TLC especially tends to slowly degrade over time and needs to be periodically rewritten to keep performance, but that is usually handled by the firmware: static wear levelling, etc.)
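
As a rough illustration of the alignment point above, here is a minimal Python sketch that checks whether a partition's start sector falls on a 1MiB (or 16MiB) boundary. The function names and the 512-byte sector default are assumptions made for the example; the real erase block size of a given card may differ.

```python
MIB = 1024 * 1024

def is_aligned(start_sector, sector_size=512, boundary=MIB):
    """True if the partition's first byte sits on a multiple of `boundary`."""
    return (start_sector * sector_size) % boundary == 0

def next_aligned_sector(start_sector, sector_size=512, boundary=MIB):
    """Round a start sector up to the next sector on a `boundary` multiple."""
    sectors_per_boundary = boundary // sector_size
    # Ceiling division, then back to a sector number.
    return -(-start_sector // sectors_per_boundary) * sectors_per_boundary

if __name__ == "__main__":
    # 63 is the old DOS-era default start, 2048 is the usual 1MiB start.
    for start in (63, 2048, 8192, 32768):
        print(start, is_aligned(start), next_aligned_sector(start))
```

A start sector that is aligned to 1MiB is not necessarily aligned to 16MiB, so when in doubt it is safer to check against the larger boundary.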

    And for the last 2 points, it boils down to the technology of the file system:
    • file allocation table - like FAT32 and exFAT - are the worst offenders. Their tables are constantly written to. That was the only viable solution with the limited computing resources of 30 years ago when they were introduced, but that's absolutely no excuse to keep using this kind of shit in the current century (even less to invent yet another one and make it mandatory, as Microsoft did with exFAT). Basically anything bigger than an Arduino today tends to have at least an ARM core, which can run drivers for better filesystems. Luckily, at least some flash media use some form of caching to alleviate the problem.
    • journal log - like ext3 and above, NTFS, etc. - to avoid ending up with a corrupted filesystem in case of power loss or crash, these systems keep a log of the modifications they intend to perform on the filesystem. (So after a power failure, the system knows how to return to a consistent state by replaying the log.) This tends to put some mild stress on the wear-levelling
    • copy on write (COW) - like BTRFS and ZFS - and log structured - like UDF and F2FS - by design they never overwrite.
      • in log structured filesystems, the filesystem itself is the log. Each write operation appends a new log entry, i.e.: each incremental modification of the file system is a new line in the log. The last line of the log contains the latest state of the file system. Simply move back in the log to roll back to an earlier version. Eventually, when the file system is full, the system starts overwriting the oldest entries which aren't relevant any more (it sort of garbage collects the earlier versions of files that have been changed since). Think of it as a COW system in which the old copies are kept for as long as possible before being reclaimed as free space, or as a sort of giant ring buffer. (That's why it can also be used with write-once media, like packet writing on CD-R and DVD-R: you only append to it most of the time.) That's also why there is no such thing as "udf fsck" - you can't corrupt the whole filesystem, because you never modify it, you only append to it. Fixing corruption simply means rolling back the latest modification, which is simply moving up a few lines in the log.
      • in COW filesystems, you never overwrite previous data. You always write a new copy of the data that you modify. Then eventually you can reclaim the old copy as free space. Or keep it as part of a different "snapshot". You can think of it as a log-structured file system where the log is kept as short as possible and garbage collected as quickly as possible, and where the log is multi-headed and can have several concurrent tips (one per volume). That's why snapshots are easy and efficient in ZFS and BTRFS: they are just a result of the COW technology. That's also why there's less need for fsck: there's very often an older copy you can roll back to. COW filesystems still have some minimalistic form of journal.

      Given this tendency not to overwrite data, COW and log-structured file systems are much nicer to flash media, which doesn't like constant rewriting (because of the constant erasing and moving of data that it requires). That's why most flash-oriented file systems (F2FS for flash, and UDF even for optical media) are log structured. The drawback is that, by design, these filesystems tend to fragment a lot more than ext3/4, which has a negative impact on spinning media. But with the rise of solid state media, the impact of that drawback is minimised.
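
To make the "the filesystem is the log" idea above a bit more concrete, here is a toy Python sketch. It is only a mental model, not how UDF, F2FS, BTRFS or ZFS are actually laid out on disk: the `AppendOnlyLog` class and its methods are hypothetical names invented for this example. Every change is appended, nothing is modified in place, and rolling back (or taking a snapshot) is just a matter of remembering an earlier position in the log.

```python
class AppendOnlyLog:
    """Toy model: every change is a new entry; old entries are never modified."""

    def __init__(self):
        # Each entry is (path, content); content None marks a deletion.
        self.entries = []

    def write(self, path, content):
        self.entries.append((path, content))
        return len(self.entries)  # log position right after this write

    def delete(self, path):
        return self.write(path, None)

    def state(self, upto=None):
        """Replay the log up to position `upto` (default: the current tip)."""
        files = {}
        for path, content in self.entries[:upto]:
            if content is None:
                files.pop(path, None)
            else:
                files[path] = content
        return files


log = AppendOnlyLog()
v1 = log.write("notes.txt", "first draft")
v2 = log.write("notes.txt", "second draft")
v3 = log.delete("notes.txt")

print(log.state())    # latest state: the file is gone
print(log.state(v2))  # "rolled back" by one entry: second draft is back
print(log.state(v1))  # rolled back by two entries: first draft
```

A real implementation also has to garbage collect old entries once space runs out, which this sketch leaves out entirely.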

    Note that BTRFS is still under heavy development, so it's not as stable as ZFS. On the other hand, thanks to COW, it's not as awful as other still-in-development filesystems are, and it's easy to back up thanks to send/receive.

    Originally posted by lumks View Post
    for internal SSD there is only f2fs for me. because it works well and even after a crash my data was still there
    This resilience of F2FS is built in, because it is log structured by design. F2FS never overwrites (hence the "F2" - "Flash friendly" - part of its name: very few of the evil erase/rewrite cycles). Your data is still there because it has always been there. New things have been written as new, separate entries in the log. If an entry is corrupt (due to a crash or power failure), just move back in the log until the last consistent entry. That entry will almost never be overwritten, until the system is full and it gets garbage collected.
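
The "move back in the log until the last consistent entry" step can be sketched in Python as follows. This is only an illustration of the principle: the entry format (a CRC32 in hex, a space, then a JSON body) and the function names are made up for the example, and F2FS's real on-disk checkpoints look nothing like this.

```python
import json
import zlib

def make_entry(payload):
    """Serialize one log entry as '<crc32 hex> <json body>'."""
    body = json.dumps(payload, sort_keys=True).encode()
    return f"{zlib.crc32(body):08x} ".encode() + body

def is_consistent(line):
    """An entry is consistent if its stored checksum matches its body."""
    checksum, _, body = line.partition(b" ")
    try:
        return int(checksum, 16) == zlib.crc32(body)
    except ValueError:
        return False

def last_consistent_index(lines):
    """Walk backwards from the tip until a consistent entry is found."""
    for i in range(len(lines) - 1, -1, -1):
        if is_consistent(lines[i]):
            return i
    return -1

good_a = make_entry({"file": "a.txt", "data": "hello"})
good_b = make_entry({"file": "a.txt", "data": "hello world"})
# Simulate a write interrupted by power loss: the last byte reads back as 0x00.
torn = make_entry({"file": "b.txt", "data": "new"})[:-1] + b"\x00"

log = [good_a, good_b, torn]
tip = last_consistent_index(log)
print(tip)
print(json.loads(log[tip].partition(b" ")[2]))
```

The older entries were never touched by the interrupted write, which is why the data written before the crash is still readable.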

    Originally posted by grigi View Post
    One example is the Samsung EVO+ 32GB microSD card I have. If I run a 4k random write bench on the pre-formatted vFAT, I get about 3.1MB/s. On a straight re-format to ext4 I got about 0.6MB/s, and after using flashbench to determine the flash cell sizes, then aligning the partition and configuring striping correctly on ext4, I got 3.1MB/s 4k random write again.
    It's funny because with very similar cards (Samsung EVO 128GB and Samsung EVO+ 128GB) I haven't seen any alignment problem. On the other hand, I went straight for the "align it to a 32k sector boundary" approach.
    Maybe your previous vFAT partition was aligned at boundary-1 (so the partition boot sector is at the end of the previous erase block and the FATs start aligned directly with the boundary).

    It's also funny that your card was vFAT formatted. Usually 32GB cards tend to be sold as SDXC and thus exFAT formatted (but not necessarily: 32GiB is the largest capacity admitted by the SDHC format, and also the largest FAT32 that Windows accepts).
    exFAT is slightly different inside: the boot sector has a pointer to the actual position of the FAT copies. Thus you can align a partition to boundaries AND still align your FAT: just put it at a properly aligned position and specify the corresponding address in the boot sector.
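
The two alignment strategies discussed above boil down to simple arithmetic, sketched here in Python. Everything in it is an assumption for illustration: the 4MiB erase block, the reserved sector counts, the function names, and the minimum exFAT FAT offset of 24 sectors. On a FAT-style layout the first FAT starts right after the reserved sectors, so the partition start has to be shifted back by that amount; on exFAT the partition can stay aligned and only the FAT offset stored in the boot sector is chosen.

```python
def round_up(value, multiple):
    """Round `value` up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple

def fat_partition_start(boundary_sectors, reserved_sectors, min_start=1):
    """Smallest partition start >= min_start whose first FAT lands on a boundary."""
    first_fat = round_up(min_start + reserved_sectors, boundary_sectors)
    return first_fat - reserved_sectors

def exfat_fat_offset(boundary_sectors, min_offset=24):
    """FAT offset (sectors from volume start) for a partition that is itself aligned."""
    return round_up(min_offset, boundary_sectors)

BOUNDARY = 8192  # hypothetical 4MiB erase block, in 512-byte sectors

start = fat_partition_start(BOUNDARY, reserved_sectors=1)
print(start, (start + 1) % BOUNDARY)  # the "boundary-1" layout, FAT on the boundary
print(fat_partition_start(BOUNDARY, reserved_sectors=32))
print(exfat_fat_offset(BOUNDARY))
```

With a single reserved sector this reproduces the "boundary-1" guess: the boot sector is the last sector of one erase block and the FAT opens the next one.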

    Originally posted by grigi View Post
    The controllers often also have oddities such as they perform much better in the first 64MB where the FAT tables sit, and then after that their performance drops severely.
    As I've said above: I've read about this too, but in my experience the more recent cards (e.g.: a few "Transcend industrial" flash media) and lots of SSDs tend to have a cache for heavily accessed blocks (RAM and SLC), and are less sensitive to the exact address on the media.

    Originally posted by grigi View Post
    There are lots of useful info from the RPi community as to how to configure flash for better performance.
    Thanks for the pointer (including mentioning flashbench).
    I would also recommend contacting the engineers of the big-brand cards. I've had success getting more information about the last few Transcend cards I've used.

    • I use UDF for my USB sticks that need to be shared between multiple OSes and hold big files
    • I use FAT32 for my bootable USB stick (System Rescue CD) so that it can also boot from UEFI
    • I use BTRFS for my smartphone's SD card. I would probably have gone for F2FS, but as the rest of the phone is also powered by BTRFS (it's a Jolla Sailfish phone), I stuck with that filesystem.


    • #12
      TL;DR (or until my longer post gets approved)

      - UDF is also a valid solution for flash media shared between OSes (supported by Linux, Mac OS X and Windows)
      - But format it using tools such as this or this because of a partition trick for Windows
      - UDF is log structured so very gentle toward flash (and write-once media), and very resilient against data loss
      (log structured doesn't overwrite data, but appends new entries in the log documenting the change)

      - F2FS is also log structured so that too is gentle toward flash, and that's why it survived crashes

      - BTRFS and ZFS are copy on write.
      - COW too, makes your flash happy by also avoiding overwriting data, but instead making copies.
      - COW also makes these FS a little bit more resilient against data loss.
      - (but BTRFS is still in development)

      - spinning media doesn't like COW and log-structure that much: they tend to fragment data (by design)


      • #13
        I don't get it. Judging from the performance benchmark, could there be a reason to use NTFS as the main file system? I.e. when I want to share the same home partition with Windows?


        • #14
          With the 4.2 and 4.3 tests, F2FS was clearly ahead of EXT4 and XFS but is now behind with 4.4. Did those two filesystems get big improvements in the 4.4 kernel or did F2FS regress?


          • #15
            Originally posted by Steffo View Post
            I don't get it. Judging from the performance benchmark, could there be a reason to use NTFS as the main file system? I.e. when I want to share the same home partition with Windows?
            For that case ext2 would be more suitable. There are multiple ways to access it from Windows.
            The Ext2Fsd project also mounts ext3/4 as read-only:
            Ext2 File System Driver for Windows: a Linux ext2/ext3 file system driver for Windows


            • #16
              Originally posted by DrYak View Post
              32GiB is the largest capacity admitted by the SDHC format - and also the largest FAT32 that Windows accepts.
              IIRC, 32GB is the largest that Windows will format. If formatted by some other tool, Windows has no problems using larger FAT32 filesystems.


              • #17
                ZFS is certainly a peculiar filesystem - both due to the amount of (old and no longer valid) information floating around about its innards, and due to the large number of tunables that make it adaptable to quite a lot of application spaces. One thing that nearly always improves performance substantially, and is regarded as a safe default, is the use of "-o ashift=12" during zpool create on modern disks. This aligns writes to 4K instead of 512B, which is the (wrongly) reported sector size for quite a lot of hard disks. For many workloads where access is paged in known sizes (like databases) it is also useful to set recordsize=<knownrecordsize> (for example 8K for PostgreSQL, or 16K for MySQL's InnoDB with its default page size).
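
For reference, ashift is simply the base-2 logarithm of the sector size ZFS should assume, so 12 corresponds to 4096 bytes. The short Python sketch below shows that relationship and, under the simplifying assumption that updating part of a record rewrites the whole record, why matching recordsize to the database page size matters. The function names are hypothetical; 128K is the ZFS default recordsize.

```python
def ashift_for(sector_bytes):
    """Return the ashift value (log2 of the sector size) for a power-of-two size."""
    if sector_bytes <= 0 or sector_bytes & (sector_bytes - 1):
        raise ValueError("sector size must be a power of two")
    return sector_bytes.bit_length() - 1

def bytes_rewritten(update_bytes, recordsize):
    """Bytes rewritten for an aligned update, assuming whole records are rewritten."""
    records = -(-update_bytes // recordsize)  # ceiling division
    return records * recordsize

print(ashift_for(512), ashift_for(4096))

page = 8 * 1024
for recordsize in (128 * 1024, 8 * 1024):
    written = bytes_rewritten(page, recordsize)
    print(recordsize, written, written // page)
```

Under that assumption, an 8K page update on the default 128K recordsize rewrites sixteen times more data than it strictly needs to.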

                Another tunable is the prefetch feature:
                The module option line "options zfs zfs_prefetch_disable=1" will disable the automatic prefetch heuristics, which try to identify which parts of the I/O are sequential and may be helped by prefetch. On a purely SSD pool (and in synthetic benchmarks) prefetch is totally useless and destroys the cache hit ratio (it is quite useful on SSD+rotational, though). There are lots of other tips possible, but they would take an article of their own...


                • #18
                  Originally posted by DrYak View Post
                  - BTRFS and ZFS are copy on write.
                  - COW too, makes your flash happy by also avoiding overwriting data, but instead making copies.
                  - COW also makes these FS a little bit more resilient against data loss.
                  - (but BTRFS is still in development)

                  - spinning media doesn't like COW and log-structure that much: they tend to fragment data (by design)
                  lol, but the advantages of btrfs and zfs are still mainly on big hard disks and RAID systems built from hard disks. Where do I fear bitrot? Not on a system SSD. Where do I not care about speed? Not on a system SSD. Where do I primarily need RAID-like features? Again, not on a system SSD.

                  Maybe you are right and I am too brainwashed by the zfs propaganda, and they made a big design error by targeting zfs specifically at big data centers, and things like FreeNAS really do suck and people just don't know it.

                  And believe me, I know what you are talking about. A few weeks ago I had to investigate why my disk, holding data worth maybe 30% of its full capacity, told me it was full. I had to find an esoteric command to get access to my free space before Linux stopped lying (imho) that it was full.

                  (btrfs's biggest weakness is its horrible integration with standard Linux tools like df.) That was on a system SSD, by the way, so I don't see the advantage there, except maybe in extreme circumstances, of being able to control how much gets rewritten. That does not matter for usual consumer loads: an SSD will last nearly forever without COW-like filesystems.

                  At some point btrfs stopped eating your data, which is good, but sadly it's very hard to make use of any of its advantages. It's very hard to maintain: even with very conservative usage you run into problems such as the fs being full, and you have to type crazy, magic, fs-specific commands to get space back from it.

                  So even though your data is safe, I've started avoiding it again. Heck, even the Fedora boot installer doesn't install a new grub when you have installed btrfs directly on your hard disk. If you call grub-install and grub2-mkconfig manually they both work without problems, but the dnf post-kernel script will not do it. And I got a WONTFIX from the Red Hat guy who was responsible for that part.

                  So you can't skip the partitioning part either, at least on a system disk. Dedup also doesn't seem very accessible to me. Speed we don't have to talk about... yes, it's still fast enough. There is also no easy, useful way to use the snapshot feature automatically.

                  So I guess the integration of the fs into Linux is the main problem; it would help if GNOME or systemd or something used it automatically in some nice way. By the way, the very slow SQLite performance also doesn't make btrfs a good system SSD fs.

                  But if you were right, the first non-COW fs with features like anti-bitrot would then easily make zfs nearly useless, until SSDs are standard for file servers.


                  • #19
                    Some obvious things to benchmark:
                    1) Mechanical drives. Uhm, running a filesystem on an 80GB drive is nice, but it only tests the "system drive" scenario. SSDs are still too expensive for storing terabytes of data.
                    2) I wonder if Michael performs a full-surface TRIM (plus some waiting, to let the drive actually perform the erases) before trying the next filesystem in his list, to put all competitors in equal starting conditions, aka "factory drive state".
                    3) Some RAID setups would be nice. These are advanced setups, though, not meant for Average Joe. For SSDs it can be meaningful to try adjusting blocking factors and/or force-enabling SSD tuning (if the filesystem allows it). But enterprise admins would be jealous and would always suggest you're doing it wrong. Which may or may not be the case, or they may just be frustrated with the benchmark results - happens a lot.
                    Last edited by SystemCrasher; 21 January 2016, 10:30 AM.


                    • #20
                      Originally posted by kobblestown View Post

                      IIRC, 32GB is the largest that Windows will format. If formatted by some other tool, Windows has no problems using larger FAT32 filesystems.
                      Exactly. Windows can READ larger FAT32 volumes, but can't format anything above 32GiB. But larger volumes imply non-default cluster sizes. This means more lost space ("slack") on small files, and it does not negate the fact that a file on FAT can't exceed 2^32 bytes, aka 4GiB. Which can be quite annoying. E.g. transferring a large movie or a single-file backup would prove to be impossible. You can split the file, but that's an MS-DOS-aged solution, screw that...
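
The two costs mentioned above are easy to put into numbers. A minimal Python sketch follows; the function names are made up for the example, and the limit used is the 2^32 - 1 byte maximum that a 32-bit file size field allows.

```python
FAT32_MAX_FILE = 2**32 - 1  # largest size a 32-bit file size field can hold
GIB = 2**30

def fits_on_fat32(size_bytes):
    """True if a single file of this size can be stored on FAT32."""
    return size_bytes <= FAT32_MAX_FILE

def parts_needed(size_bytes):
    """How many pieces a file must be split into to fit on FAT32."""
    return max(1, -(-size_bytes // FAT32_MAX_FILE))

def slack(size_bytes, cluster_bytes):
    """Bytes wasted in the last, partially filled cluster of a file."""
    return -size_bytes % cluster_bytes

print(fits_on_fat32(4 * GIB - 1), fits_on_fat32(4 * GIB))
print(parts_needed(10 * GIB))

# The same 5000-byte file on small and large clusters.
for cluster in (4096, 32768, 65536):
    print(cluster, slack(5000, cluster))
```

The slack per file grows with the cluster size, so a volume full of small files loses noticeably more space on a large FAT32 volume than on a small one.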