F2FS In Linux 4.19 Will Fix Big Performance Issue For Multi-Threaded Reads


  • #11
    Originally posted by DrYak View Post

    I've got a bit of experience with F2FS (mostly on my Pi, where it's a bootable FS even without any initrd because the driver is compiled into the kernel),
    and tons of experience with BTRFS from embedded (the data partition on the Pi) all the way through smartphones (Jolla) and up to servers.

    BOTH are file systems that attempt to avoid in-place overwrites as much as possible, because that's where most flash sucks (it leads to write amplification: erase the whole erase block, write the new data, then re-write all the other bits of the erase block that weren't part of the current write).

    F2FS does it by being a log-structured file system (i.e. the filesystem itself is the log, and most writes are append-only - think multisession CD - with the older parts eventually getting garbage collected; thus there's rarely any overwrite). As far as I've read, F2FS tries to optimize as much as possible by keeping a cache in RAM and grouping writes (and over-writes), so that I/O-intensive software doesn't blow up the log but instead only writes a final copy at the end.
    (So database-type "lots of random writes" workloads don't suck too much, in theory.)

    BTRFS does so by being a copy-on-write system: it never overwrites. It always writes a new copy of the data elsewhere, and only then (once the data is updated) does it update the pointers (and thus snapshots come literally for free: they are just 2 pointers pointing to 2 different versions!).
    Performance-wise it's on the slower side, mainly because it tries to cram in more features, like checksums of *everything* (not only the metadata) - that's great for data integrity (it's one of the few FSes offering this), but it eats CPU and a bit of I/O. It also provides optional compression, great for saving bandwidth and space, but that eats CPU (and adds latency) too.
    Basically, BTRFS is in the same class as ZFS, whereas F2FS is in the same class as EXTn.

    BTRFS might also potentially suffer from the "lots of random writes" problem, like any other non-overwriting filesystem. You basically have a few options:
    - mount with "autodefrag", which tries to detect and group together multiple small writes (a bit like F2FS does) and might help mitigate the problem.
    - manually "btrfs fi defrag" a single file that has become too big a labyrinth of indirect pointers.
    - mark some files with the extended attribute "+C" (no CoW), which is the way to go for large files with lots of random writes that usually have their own internal integrity (databases, virtual machine disk images, torrents of large files, etc.).
    (NOTE: you can't directly "chattr +C" a file that already contains data, only an empty one. So you need to "touch" a new empty file, "chattr +C" it, then "cat" the data over from the original. Don't forget to check chmod/chown. See the sketch after this list.)
    - mount the whole partition as nocow, which is a bit stupid because you lose one of the key characteristics of btrfs.
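    As a rough sketch of the first three options (the device name and file paths here are just placeholder examples):

      # mount-time mitigation: let btrfs detect and regroup small random writes
      mount -o autodefrag /dev/sdb1 /data

      # one-off defrag of a single file that has grown too many fragments
      btrfs filesystem defragment -v /data/db/main.sqlite

      # the +C (no CoW) workaround for a file that already contains data:
      # the flag only sticks on empty files, so create a new one, flag it,
      # copy the data in, then swap the files (and check owner/permissions)
      touch /data/vm/disk.img.nocow
      chattr +C /data/vm/disk.img.nocow
      cat /data/vm/disk.img >> /data/vm/disk.img.nocow
      chown user:user /data/vm/disk.img.nocow
      mv /data/vm/disk.img.nocow /data/vm/disk.img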

    BTRFS eats more memory than F2FS (but comes with more features).

    Warning: RAID5/6 on BTRFS still isn't considered production-ready for environments where power loss can happen. Use other RAID modes instead.

    Both F2FS and BTRFS being "never overwrite" systems means that even in case of power failure, you can usually recover an older version of the data.
    I always appreciate your posts about filesystems because you have experience with many different setups and use cases, and actually know what you're talking about.

    I have a question. How does F2FS compare to UDF for "dumb" flash media like USB sticks or SD cards? (TRIM support is irrelevant here, I know about that difference)



    • #12
      Originally posted by discordian View Post
      I don't see why not having extensive RAM and controllers to accommodate unfit filesystems is a valid criticism of eMMC.
      I'm not critiquing eMMC, I'm just stating facts. It's not anywhere near as good as a decent SSD, and it's not even supposed to be - that's a fact.

      Filesystems catering to those characteristics (like F2FS) won't need an overly complex controller, and should still cause fewer overhead writes to NAND pages.
      In practice on SSDs (which have to be designed with NTFS in mind), F2FS is trading blows with ext4 more often than not; it's NOT a clear winner.

      Having a powerful controller and a drive cache basically offsets most of the gains you would get with F2FS.

      If we were talking about some form of "Open-channel SSD", where the controller is dumber (on purpose) and does not try to optimize stuff on its own, then you would see what F2FS can actually do. I think it probably could do better; that's one of the main selling points of Open-channel SSDs (for company use, anyway).



      • #13
        Originally posted by DrYak View Post

        I've got a bit of experience with F2FS (mostly on my Pi, where it's a bootable FS even without any initrd because the driver is compiled into the kernel),
        and tons of experience with BTRFS from embedded (the data partition on the Pi) all the way through smartphones (Jolla) and up to servers.
        Starting up BTRFS is pretty slow. This is probably most obvious in an embedded context. It can spend a few seconds just initializing the driver and benchmarking the RAID algorithms.



        • #14
          Originally posted by starshipeleven View Post
          I'm not critiquing eMMC, I'm just stating facts. It's not anywhere near as good as a decent SSD, and it's not even supposed to be - that's a fact.
          We agree then
          Originally posted by starshipeleven View Post
          In practice on SSDs (which have to be designed with NTFS in mind), F2FS is trading blows with ext4 more often than not; it's NOT a clear winner.
          In speed - certainly. The controller will still have to scrape together NAND pages that only have a few bytes changed; with enough cache and internal bandwidth you won't see this in benchmarks. The larger strain on the finite erase cycles will still be there, versus a filesystem which tries not to address less than a page.

          I haven't found tests comparing that; I would be interested in practical numbers myself.

          Originally posted by starshipeleven View Post
          If we were talking about some form of "Open-channel SSD", where the controller is dumber (on purpose) and does not try to optimize stuff on its own, then you would see what F2FS can actually do. I think it probably could do better; that's one of the main selling points of Open-channel SSDs (for company use, anyway).
          You would need to get the characteristics right; that's probably a bit too much for a regular user.
          Btw, raw NAND was common just a few years ago, and I don't know anyone who misses it after eMMC took its place. NAND is really a bitch to support; even reading a page can result in neighbouring pages losing some state. I don't know what an "Open-channel SSD" manages, but if it's like plain NAND then it's a nightmare to support.



          • #15
            Originally posted by discordian View Post
            In speed - certainly. The controller will still have to scrape together NAND pages that only have a few bytes changed; with enough cache and internal bandwidth you won't see this in benchmarks. The larger strain on the finite erase cycles will still be there, versus a filesystem which tries not to address less than a page.
            The general consensus is that eMMC is used in devices where you either don't have heavy writes (mostly read-only embedded devices) or don't care (the device won't last more than a few years anyway).

            SSDs are long past any sort of write limitations, at least for consumer use.

            I haven't found tests comparing that; I would be interested in practical numbers myself.
            Heh, I think it would involve investing a few hundred bucks in a few cheapo Windows tablets, booting Linux on them, formatting them as ext4 or f2fs, and then running a script that writes and deletes stuff.

            You would need to get the characteristics right; that's probably a bit too much for a regular user.
            No, the point of OpenChannel SSDs is the same as that of mdadm software RAID: the OS usually knows more about what is going on than an embedded controller working at or below the block level.

            Btw, raw NAND was common just a few years ago, and I don't know anyone who misses it after eMMC took its place. NAND is really a bitch to support; even reading a page can result in neighbouring pages losing some state. I don't know what an "Open-channel SSD" manages, but if it's like plain NAND then it's a nightmare to support.
            I'm not sure what you mean by "NAND is a bitch to support"; filesystems for raw NAND (with wear leveling and such) exist, like jffs2 or the more modern ubifs.

            Although from what I understand, OpenChannel SSDs move the FTL (flash translation layer) and all the shenanigans done by the controller, like wear leveling and so on, from the embedded storage controller into the kernel, using a subsystem/driver called LightNVM. So yeah, the kernel is more or less managing each flash chip individually; that's the whole point. It won't be using the shitty slow interfaces used to deal with NAND chips in embedded though, as each "chip" is an SSD-grade NAND package.

            To the actual operating system though (the programs and stuff), the disk still appears as a device with a filesystem (either through emulating a block device so you can place a normal filesystem on top, or by using custom filesystems designed for this type of device, or by giving applications direct control over it).




            • #16
              Originally posted by starshipeleven View Post
              In practice on SSDs (which have to be designed with NTFS in mind), F2FS is trading blows with ext4 more often than not; it's NOT a clear winner.
              Having a powerful controller and a drive cache basically offsets most of the gains you would get with F2FS.
              Very detailed LWN post about the internals of flash media

              Modern SSDs try to avoid actual write amplification basically by having a huge cache in RAM for grouping writes together (I'm really over-simplifying here), so that in the end they only flush (write-only) the cache to a brand-new segment taken from the unallocated rotating pool (the difference between, say, the 100GiB of advertised capacity and the 128GiB power-of-two size of the actual flash chips used on the SSD), while also optionally copying over the last few pages of another block that hasn't been completely garbage collected, thus freeing that block while achieving static wear-levelling at the same time (refreshing old pages that are nowadays only re-read and never overwritten; that helps against decay - see the Samsung firmware fix to prevent read speed degrading over time on static data).

              So instead of doing the textbook case:
              - read all the non-overwritten pages
              - erase the whole block
              - write all the pages, both the newly overwritten ones and all those read in step 1

              (which takes a lot of time - even a single page overwrite requires reading / erasing / writing the whole block),

              what a modern SSD does is:
              - (optionally: read a few pages from old blocks)
              - take an already-erased, ready-to-write new block from the wear-levelling pool (no delay)
              - start writing the new pages
              - (optionally: write the pages read in step 1)
              --- at this point the new data has finished writing, no more necessary delays
              - mark the old content in its original block as deprecated
              - if there is no data left in that block that isn't deprecated, schedule the block for erasure and return it to the pool
              - (optionally: any old block from step 1 is also scheduled for erasing and returned to the pool).

              thus writes take only the time to write the new content (plus some extra for refreshing old static data)

              I'm grossly over-simplifying (and completely ignoring things like allocation units; also, multi-gigabyte media are never going to keep track of every single page, so obviously you have layers of indirection and pooling).
              The LWN source above has all the tiny details of what goes on under the hood.

              The more RAM and the better the CPU in the controller, the better the SSD can do the above, and you approach the situation where you virtually have a whole (block) journaling layer underneath the filesystem on the partition - one that can compensate for asinine filesystems such as NTFS, FAT32, exFAT, etc.

              (That's what's behind all the "optimized for FAT" flash media that you might hear about. It's not that they have some weird FAT-specific code in the firmware that is going to break badly if you put anything else on them. It's more likely that they have enough RAM so that extremely often overwritten structures like the FAT table can be handled correctly without any write amplification going on.)

              Still, a workstation has even more RAM and better CPUs and could do even more complex optimisations - that's the premise behind flash-oriented file systems (and strategies like the above-mentioned open channel, whose whole purpose is letting the application (filesystem, raw database partition, etc.) organise its logs on a dumb flash device itself).

              Originally posted by caligula View Post
              Starting up BTRFS is pretty slow. This is probably most obvious in an embedded context. It can spend a few seconds just initializing the driver and benchmarking the RAID algorithms.
              Depending on the use case, "a few seconds" isn't necessarily a huge sacrifice for things like checksums everywhere (even on data) and optional compression (including CPU-light algos like LZO and Zstd).
              (That's the reason why the data partition on my Pi is BTRFS - I like the checksumming.)
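              As an illustration of that kind of setup, a typical fstab line (the UUID is a placeholder; the zstd level suffix needs a reasonably recent kernel, otherwise plain compress=zstd or compress=lzo works just as well):

                # /etc/fstab - btrfs data partition with data checksums and light compression
                UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  btrfs  noatime,compress=zstd:3  0  0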

              But yeah, BTRFS has definitely more stuff to initialize.
              (Also, on a modern high-end smartphone it isn't that noticeable.)

              That's very likely also the reason why F2FS is the flash-friendly filesystem proposed by default on SBCs and even on lots of smartphones.

              Originally posted by Weasel View Post
              I always appreciate your posts about filesystems because you have experience with many different setups and use cases, and actually know what you're talking about.
              I'm mostly just over-obsessed with constantly researching all my possible options.

              Originally posted by Weasel View Post
              I have a question. How does F2FS compare to UDF for "dumb" flash media like USB sticks or SD cards? (TRIM support is irrelevant here, I know about that difference)
              I've never used UDF for system storage (for internal partitions), only for throw-away storage to copy files over, so I've never looked at the performance side of things.
              I use it mostly as a "readable by almost any OS under the sun" cross-platform FS for copying files around, including files larger than 2GiB (I use format-udf to get around the partitioning corner cases between Windows and Mac OS X).

              For the rest, well, UDF is a log-structured filesystem too (so that it can be used efficiently for packet writing, and on write-once media), so it's very gentle on flash in theory, and in practice I've never had problems with the USB keys I've been using it on.

              I've had a single no-name el-cheapo USB key die on me while performing the initial TRIM and formatting, so I ended up with an empty read-only UDF partition.
              (But technically, even there I didn't lose any data.)
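              (For reference, the manual equivalent of what I do is roughly the following, assuming udftools' mkudffs and util-linux's blkdiscard are installed; format-udf additionally handles the partition-table quirks, and the exact flags vary by udftools version:)

                # TRIM the whole stick first - this is the step where that cheap key died
                blkdiscard /dev/sdX
                # whole-device UDF with the "hard disk" media profile, readable by most OSes
                mkudffs --media-type=hd --label=EXCHANGE /dev/sdX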





              • #17
                Originally posted by starshipeleven View Post
                The general consensus is that eMMC is used in devices where you either don't have heavy writes (mostly read-only embedded devices) or don't care (the device won't last more than a few years anyway).
                Well, you are wrong about that; they are preferred over SSDs in industry. First, as I said - even if you only read, the neighboring cells will be affected and the information degrades (read disturb errors), and if you don't read cells at all then they might lose their charge.
                Originally posted by starshipeleven View Post
                SSDs are long past any sort of write limitations, at least for consumer use.
                One can be curious?
                Originally posted by starshipeleven View Post
                Heh, I think it would involve investing a few hundred bucks in a few cheapo Windows tablets, booting Linux on them, formatting them as ext4 or f2fs, and then running a script that writes and deletes stuff.
                I can't follow you here. A simple test would be to use an SSD (better: 2 identical SSDs, so the second test doesn't start with degraded hardware), then run a workload once with ext4 and once with f2fs, and then figure out how many pages got erased (hopefully there is a way to get at these stats).
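                A rough sketch of such a test (assuming fio and smartctl are available, and a drive whose SMART data exposes both host writes and NAND/flash writes - the attribute names are vendor-specific and some consumer drives don't report the latter at all):

                  # repeat once on an ext4-formatted partition, once on f2fs
                  smartctl -A /dev/sdX > smart_before.txt
                  fio --name=wa-test --directory=/mnt/test --rw=randwrite --bs=4k --size=4G --fsync=32
                  smartctl -A /dev/sdX > smart_after.txt
                  diff smart_before.txt smart_after.txt   # how far did the flash-writes counter move?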
                Originally posted by starshipeleven View Post
                No, the point of OpenChannel SSDs is the same as that of mdadm software RAID: the OS usually knows more about what is going on than an embedded controller working at or below the block level.
                The embedded controller is often custom-made for exactly that "below the block level" work.
                Originally posted by starshipeleven View Post
                I'm not sure what you mean by "NAND is a bitch to support"; filesystems for raw NAND (with wear leveling and such) exist, like jffs2 or the more modern ubifs.
                Yes, and implementing support for NAND is a bitch, particularly because of the many parameters. eMMC can do, and does, a better job dealing with the low-level stuff.
                I.e. a filesystem shouldn't have to count accesses to NAND, by storing counters on NAND, just to know when it needs to re-check and copy data.
                Originally posted by starshipeleven View Post
                Although from what I understand, OpenChannel SSDs move the FTL (flash translation layer) and all the shenanigans done by the controller, like wear leveling and so on, from the embedded storage controller into the kernel, using a subsystem/driver called LightNVM. So yeah, the kernel is more or less managing each flash chip individually; that's the whole point. It won't be using the shitty slow interfaces used to deal with NAND chips in embedded though, as each "chip" is an SSD-grade NAND package.
                I think this only makes sense if you merge the filesystem and the NAND layer. As I said, I don't know anything about OpenChannel SSDs, but I think not doing some basic block accounting and swapping/remapping at the controller level is a big mistake. Maybe it does do that, but allows some control over it.
                Originally posted by starshipeleven View Post
                To the actual operating system though (the programs and stuff), the disk still appears as a device with a filesystem (either through emulating a block device so you can place a normal filesystem on top, or by using custom filesystems designed for this type of device, or by giving applications direct control over it).

                OK, thanks. It seems like page 5 indicates that they want to have control over buffering and write allocation, but wear-leveling and error handling should be done at the controller level (as a long-term goal). That's a long way from raw NAND access then.



                • #18
                  Originally posted by Weasel View Post
                  I always appreciate your posts about filesystems because you have experience with many different setups and use cases, and actually know what you're talking about.

                  I have a question. How does F2FS compare to UDF for "dumb" flash media like USB sticks or SD cards? (TRIM support is irrelevant here, I know about that difference)
                  I can't speak to performance, but last I checked the UDF ecosystem stinks. I.e. you don't have a fsck at all on Linux, and you may have weird issues once you are dealing with drives bigger than 80GB.
                  Other OSes have their own weird interactions with it.
                  USB keys are usually formatted with FAT, which has serious limitations. UDF can be used as an open alternative without these limitations.


                  On this alone I'd say F2FS wins, although I do remember an F2FS release where the fsck was actually hosing the partition on some OpenWrt ARM something-something architecture, six months ago or so.



                  • #19
                    Originally posted by discordian View Post
                    Well, you are wrong about that; they are preferred over SSDs in industry. First, as I said - even if you only read, the neighboring cells will be affected and the information degrades (read disturb errors), and if you don't read cells at all then they might lose their charge.
                    I thought SSDs run refreshes and checks regularly, as long as they are on, anyway.

                    Also, afaik most industrial devices aren't really writing heavily to their embedded flash or eMMC, so they would fall into the first category in my statement. Only when you need good write endurance must you get an SSD.

                    The embedded controller is often custom-made for exactly that "below the block level" work.
                    And caching is more effective the more info you have about the stuff you are caching.



                    • #20
                      Originally posted by starshipeleven View Post
                      I thought SSDs run refreshes and checks regularly, as long as they are on, anyway.
                      Yes, as does MMC (albeit typically at a lesser rate). The point is that that's something I would never move into the kernel or filesystem, like it was with raw NAND.
                      Originally posted by starshipeleven View Post
                      Also, afaik most industrial devices aren't really writing heavily to their embedded flash or eMMC, so they would fall into the first category in my statement. Only when you need good write endurance must you get an SSD.
                      Not heavily, but continuously (logging + persistent counters). There's a huge difference in how log-based filesystems act in this case; an SSD with ext4 wouldn't matter for performance, but it would multiply the amount written.
                      And caching is more effective the more info you have about the stuff you are caching.
                      Sure.

