**ferry** · 28 December 2017, 08:50 AM

Maybe now would be the right time to work on what has been called 'hot data tracking' in one place on the btrfs wiki, and 'hybrid data storage' in another. With mixed (hybrid) storage pools this could potentially work much better than caching as bcache does. Writing the data directly to the best-suited devices (as determined by the 'temperature') eliminates the overhead introduced by caching.

Moreover, as there would be no cache hierarchy, SSDs or HDDs can be added or replaced over time with ease to adapt to changing workloads.

**linuxgeex** · 28 December 2017, 09:04 AM

"At the moment though Btrfs RAID makes no determination of the fastest RAID device for its writes."

Well, no... of course not. The writes need to be pushed to all affected volumes regardless so there's no point pushing it to the fastest first... if anything it would be ideal to push it to the slowest volume first so that a barrier can complete with the lowest total latency.
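
To put toy numbers on that: a minimal sketch, assuming writes are submitted one after another with a small gap and the barrier waits for every copy. The function name `barrier_latency_ms`, the latencies, and the gap are all made up for illustration, not measurements or anything from the Btrfs code.

```python
def barrier_latency_ms(latency_ms, order, issue_gap_ms):
    """Time until every copy has landed when writes are issued one by one.

    latency_ms:   per-device write latency (hypothetical numbers)
    order:        device indices in the order the writes are submitted
    issue_gap_ms: assumed delay between two consecutive submissions
    """
    finish_times = [
        slot * issue_gap_ms + latency_ms[dev] for slot, dev in enumerate(order)
    ]
    # The barrier completes only when the slowest copy is done.
    return max(finish_times)


latency_ms = [0.1, 8.0]  # index 0: SSD, index 1: HDD (made-up values)
fastest_first = barrier_latency_ms(latency_ms, [0, 1], issue_gap_ms=0.05)
slowest_first = barrier_latency_ms(latency_ms, [1, 0], issue_gap_ms=0.05)
print(fastest_first, slowest_first)
```

Under these assumptions the slow device dominates either way, and submitting to it first only shaves off the submission gap.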

**waxhead** · 28 December 2017, 09:34 AM

Originally posted by linuxgeex

"At the moment though Btrfs RAID makes no determination of the fastest RAID device for its writes."

Well, no... of course not. The writes need to be pushed to all affected volumes regardless so there's no point pushing it to the fastest first... if anything it would be ideal to push it to the slowest volume first so that a barrier can complete with the lowest total latency.

Not so fast, I think you miss the point a bit. BTRFS does "read balancing" using simply pid % mirror; the patch tries to make this smarter when you happen to have a hybrid setup of both spinning and solid-state storage. With the patch, if you have two copies and one of them is located on spinning storage while the other is on solid-state storage, BTRFS will prefer to read from the solid-state one.
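
Roughly, the difference between the two read policies can be sketched like this. It is an illustration only, not the kernel code or the patch itself; the function names and the list of "is rotational" flags are hypothetical.

```python
def pick_mirror_by_pid(pid, num_mirrors):
    """Existing behaviour as described above: pid % mirror."""
    return pid % num_mirrors


def pick_mirror_prefer_ssd(pid, rotational):
    """Hypothetical smarter policy: read from non-rotational copies if any exist.

    rotational: one bool per mirror, True if that copy lives on a spinning disk.
    """
    ssd_mirrors = [i for i, is_rot in enumerate(rotational) if not is_rot]
    # Fall back to all mirrors when every copy is on spinning storage.
    candidates = ssd_mirrors if ssd_mirrors else list(range(len(rotational)))
    # Still spread readers over the candidates by pid.
    return candidates[pid % len(candidates)]


# Copy 0 on an HDD, copy 1 on an SSD.
print(pick_mirror_by_pid(1234, 2))
print(pick_mirror_prefer_ssd(1234, [True, False]))
```

With plain pid % mirror an even pid ends up on the HDD copy; the preference-based sketch sends every reader to the SSD copy as long as one exists.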

If you have a 4-disk (raid1-like) setup, for example with 2x solid-state and 2x spinning, it would make perfect sense to try to write to the solid-state storage instead of the spinning disks, and therefore it is perfectly correct to write that BTRFS makes no determination of the fastest RAID device for its writes. Remember that BTRFS raid1 means two copies only, regardless of the number of disks! In fact, BTRFS's usage of the RAID terminology is a huge mistake, as it confuses people all the time when they don't know how it works.
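
A small sketch of what "two copies regardless of the number of disks" means for placement. It assumes, as I understand the allocator to behave, that each raid1 chunk goes to the two devices with the most free space and that device speed plays no role; the device names, sizes, and the function `allocate_raid1_chunk` are hypothetical.

```python
def allocate_raid1_chunk(free_gib, chunk_gib=1):
    """Place one raid1 chunk: exactly two copies, on two different devices.

    free_gib: dict mapping device name -> free space in GiB (mutated in place).
    Assumption: the two devices with the most free space are chosen.
    """
    candidates = [name for name, free in free_gib.items() if free >= chunk_gib]
    if len(candidates) < 2:
        raise RuntimeError("raid1 needs two devices with free space")
    chosen = sorted(candidates, key=lambda name: free_gib[name], reverse=True)[:2]
    for name in chosen:
        free_gib[name] -= chunk_gib
    return chosen


# Four devices, but every chunk still lands on only two of them.
free_gib = {"ssd0": 100, "ssd1": 100, "hdd0": 1000, "hdd1": 1000}
first_chunk = allocate_raid1_chunk(free_gib)
print(first_chunk, free_gib)
```

In this toy setup the larger HDDs win the placement even though the SSDs are faster, which is the "no determination of the fastest device for writes" point.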

Now if you really want to dive into the details then of course always preferring solid-state drives for writing will cause other issues as well. In the example above the pair of solid-state drives would get full at some point, and it would then make sense to balance "half-and-half" between solid-state and spinning. One potential drawback with this patch (if it gets merged) is that reads will not be as distributed in a hybrid storage mix as they would otherwise be, since it makes BTRFS prefer the fastest devices. That makes sense, but scrubbing would be even more important to verify that both copies are good!

**linuxgeex** · 06 January 2018, 03:50 PM

Originally posted by waxhead

Not so fast, I think you miss the point a bit. BTRFS does "read balancing" using simply pid % mirror; the patch tries to make this smarter when you happen to have a hybrid setup of both spinning and solid-state storage. With the patch, if you have two copies and one of them is located on spinning storage while the other is on solid-state storage, BTRFS will prefer to read from the solid-state one.

If you have a 4-disk (raid1-like) setup, for example with 2x solid-state and 2x spinning, it would make perfect sense to try to write to the solid-state storage instead of the spinning disks, and therefore it is perfectly correct to write that BTRFS makes no determination of the fastest RAID device for its writes. Remember that BTRFS raid1 means two copies only, regardless of the number of disks! In fact, BTRFS's usage of the RAID terminology is a huge mistake, as it confuses people all the time when they don't know how it works.

Now if you really want to dive into the details then of course always preferring solid-state drives for writing will cause other issues as well. In the example above the pair of solid-state drives would get full at some point, and it would then make sense to balance "half-and-half" between solid-state and spinning. One potential drawback with this patch (if it gets merged) is that reads will not be as distributed in a hybrid storage mix as they would otherwise be, since it makes BTRFS prefer the fastest devices. That makes sense, but scrubbing would be even more important to verify that both copies are good!

I agree that in an ideal world BTRFS could use the SSDs as a write cache to the RAID volume in the manner that you are speaking of. That would require it to have a background thread busily mirroring the cache off to the rotating rust so that it could free up some of the SSD to be used as initial RAID1 redundancy. It's a great idea, but TBH that would be better implemented in a different layer than the FS, so that it could be used as a write acceleration layer for all MD storage, and avoid poorly duplicating the wheel within BTRFS. Heck that could serialize the writes so that even rotating rust could make an effective acceleration volume.

**scineram** · 22 January 2018, 04:28 AM

Originally posted by linuxgeex

I agree that in an ideal world BTRFS could use the SSDs as a write cache to the RAID volume in the manner that you are speaking of. That would require it to have a background thread busily mirroring the cache off to the rotating rust so that it could free up some of the SSD to be used as initial RAID1 redundancy. It's a great idea, but TBH that would be better implemented in a different layer than the FS, so that it could be used as a write acceleration layer for all MD storage, and avoid poorly duplicating the wheel within BTRFS. Heck that could serialize the writes so that even rotating rust could make an effective acceleration volume.

So you invented ZFS with SLOG, or something close.

**waxhead** · 22 January 2018, 05:41 PM

Originally posted by linuxgeex

I agree that in an ideal world BTRFS could use the SSDs as a write cache to the RAID volume in the manner that you are speaking of. That would require it to have a background thread busily mirroring the cache off to the rotating rust so that it could free up some of the SSD to be used as initial RAID1 redundancy. It's a great idea, but TBH that would be better implemented in a different layer than the FS, so that it could be used as a write acceleration layer for all MD storage, and avoid poorly duplicating the wheel within BTRFS. Heck that could serialize the writes so that even rotating rust could make an effective acceleration volume.

Absolutely agree with your points here. The duplicating of hot data to a cache device was abandoned in BTRFS some years back because this feature was being considered for the virtual filesystem layer as a generic feature, so that in effect all filesystems could benefit from it transparently.

The implementation of such a feature would need to take lots of stuff into account to avoid doing operations on only hot data storage that may not be migrated to cold data storage. It can probably be done (bcache has done it), but it is difficult to do on a generic level in a totally transparent way.

Some years ago I suggested that BTRFS create its own "supercache". Imagine you have a filesystem with 10 disks. In a RAID1-like configuration you would access at most 5 disks (one disk per thread). In a RAID10 configuration you would access at most 5 disks (5 disks per thread). What would be possible with BTRFS is that the filesystem itself creates a raid0 block spread over all the disks where it caches hot data, for reads only of course. If a read from the raid0 array fails you could always try the original data, which in raid1 or raid10 would have at least two copies.
If such a feature is ever implemented it makes sense to move hot data tracking to the filesystem instead of the VFS layer. While I agree with you that this would be a layering violation, it would at the same time probably yield better results for non-write-heavy workloads, not to mention redundancy.

**linuxgeex** · 22 January 2018, 07:23 PM

Originally posted by waxhead

Absolutely agree with your points here. The duplicating of hot data to a cache device was abandoned in BTRFS some years back because this feature was being considered for the virtual filesystem layer as a generic feature, so that in effect all filesystems could benefit from it transparently.

The implementation of such a feature would need to take lots of stuff into account to avoid doing operations on only hot data storage that may not be migrated to cold data storage. It can probably be done (bcache has done it), but it is difficult to do on a generic level in a totally transparent way.

Some years ago I suggested that BTRFS create its own "supercache". Imagine you have a filesystem with 10 disks. In a RAID1-like configuration you would access at most 5 disks (one disk per thread). In a RAID10 configuration you would access at most 5 disks (5 disks per thread). What would be possible with BTRFS is that the filesystem itself creates a raid0 block spread over all the disks where it caches hot data, for reads only of course. If a read from the raid0 array fails you could always try the original data, which in raid1 or raid10 would have at least two copies.
If such a feature is ever implemented it makes sense to move hot data tracking to the filesystem instead of the VFS layer. While I agree with you that this would be a layering violation, it would at the same time probably yield better results for non-write-heavy workloads, not to mention redundancy.

"where it caches hot data for reads only of course." - write caching is fine where consistency isn't important...

To some extent the optimisation you speak of with hot data is already done in VFS: operations which don't encounter the writeback interval, fsync, or sync are done entirely in memory, and if the file is unlinked before then the changes never even make it to the FS layer. After the various sync/writeback/barrier operations, the best optimisations are write combining/serialization/striping, and choosing a faster storage device. :-)

I agree that a layering violation can even be ideal... where best use is made of the layers first and the layering violation is only a workaround... given that there's no serializing write layer existing in VFS, or MD/RAID, then I guess doing it in BTRFS is good so long as it's not masturbation... i.e. if it would be an equal effort to implement in VFS or the block layer, and the devs decide to do it within the BTRFS FS code base anyhow. ZFS is definitely an example of that masturbation in action in many regards, and I fear that BTRFS attempting to be a better ZFS is destined to make the same mistake... but that's an armchair opinion. I would feel more entitled to say it with some authority if I was actually working on it, or MD, or VFS... and I'm not lol.

**Apteryx** · 10 June 2021, 02:48 PM

For those wondering, this patch hasn't materialized into something that can be used today (2021), but I was told in the #btrfs libera.chat channel that a framework has appeared that should make implementing such a policy simple and configurable at run time. So don't expect much of a speed boost from adding an SSD to your HDD RAID1 array just yet.

Btrfs Gets A RAID1/10 Speed Patch, Helping Out SSDs