Hot-Data Tracking Still Baking For The Linux Kernel


  • KjetilK
    replied
    Anyone know what happened to this? Is there any alternative functionality in the kernel that does the same thing?

    I still have the same use case: many TB on my home file server, but just a fraction of it is really hot, so a small SSD as part of the BTRFS volume would be a killer feature.



  • tomato
    replied
    Originally posted by LasseKongo View Post
    The majority of Linux systems are not desktops. My personal file server at home has around 7 TB of data; that would be pretty expensive with just SSDs. My guess is that less than 10% of my data is "active", so hot data migration of that 10% to SSD would be ideal for me.
    QFT. I have a similar data profile: most of those TB are media files (movies, pictures, software ISO files). The difference between 0.1ms and 750ms access times to them is basically unnoticeable (in fact, I'm planning to migrate to a MAID, as I can live with a 10 second initial access time for a 90 minute movie...). My mail archive or DE profile files are a completely different matter...

    A file system that was able to migrate data automagically would be a killer feature for any file server, IMNSHO.



  • LasseKongo
    replied
    Originally posted by johnc View Post
    Still, I'm not really seeing the point of this. Any serious storage solution is going to use a caching mechanism for hot data.
    And how will you cache if you don't keep track of the hot data? Hot data tracking can be used both for caching and for hot data migration.
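    (For illustration only, a minimal Python sketch of the idea, not the actual kernel VFS code: count accesses per block and let either a cache warmer or a migration job consume the resulting hot set.)

    Code:
    # Toy illustration of hot-data tracking: count accesses per block and
    # expose the hottest blocks, which either a caching layer (copy to SSD)
    # or a migration job (move to SSD) could consume.
    from collections import Counter

    class HotDataTracker:
        def __init__(self):
            self.access_counts = Counter()

        def record_access(self, block_id):
            # called on every read/write of a block
            self.access_counts[block_id] += 1

        def hottest(self, n):
            # return the n most frequently accessed block ids
            return [block for block, _ in self.access_counts.most_common(n)]

    tracker = HotDataTracker()
    for block in [7, 7, 3, 7, 9, 3, 7]:   # pretend I/O trace
        tracker.record_access(block)

    hot = tracker.hottest(2)               # -> [7, 3]
    # A caching layer would copy these blocks to the SSD and keep the originals;
    # a tiering/migration job would move them and update the block pointers.
    print(hot)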

    Originally posted by johnc View Post
    And it makes no sense in ordinary desktop use, as a user concerned about performance would just use an SSD, where all accesses perform equally.
    The majority of Linux systems are not desktops. My personal file server at home has around 7 TB of data; that would be pretty expensive with just SSDs. My guess is that less than 10% of my data is "active", so hot data migration of that 10% to SSD would be ideal for me.



  • johnc
    replied
    Still, I'm not really seeing the point of this. Any serious storage solution is going to use a caching mechanism for hot data.

    And it makes no sense in ordinary desktop use, as a user concerned about performance would just use an SSD, where all accesses perform equally.



  • LasseKongo
    replied
    Originally posted by ryao View Post
    You certainly are familiar with ZFS. However, you are wrong about ZIL only speeding up metadata operations. It applies to data as well, although only for small synchronous writes.
    You learn new stuff every day. I played around with a Sun Unified Storage box (horrible piece of crap, by the way) about 3 years ago which had both a large L2ARC SSD and 2 x 18GB ZIL SSDs; since the ZIL disks were so small I assumed they were not used for caching data.

    Originally posted by ryao View Post
    Anyway, you are correct about the memory requirements of the L2ARC map. That is not much of a problem if you have a sufficiently large amount of memory to hold the L2ARC map in the memory ZFS has for metadata. It might be best to consider it separately from other metadata to eliminate the cannibalization of cache space. Avoiding a situation where the hottest metadata is forced to the L2ARC by virtue of the L2ARC map being large is definitely something that could be achieved by doing that.
    I'm sitting right now with a 12TB pool with some deduped data that crashed because I had too little memory for the dedup tables; it has been trying to recover for the last 8 days now, so memory is really critical when working with ZFS. I will abort it tomorrow, put in a new motherboard with 4x the amount of memory, and see if it helps.

    Originally posted by ryao View Post
    By the way, I don't know what you mean by "add or remove disks to vdevs". You can certainly take disks away, although doing that leaves the vdevs in a degraded state until you replace them.
    I mean it is not possible to expand or shrink a vdev by adding and removing disks (except by replacing all disks with larger ones, failing and replacing them one at a time), which would mean re-balancing the blocks to keep the desired redundancy. Neither is it possible to remove a whole vdev from a pool without destroying the pool and recreating it with fewer disks. The former ZFS developers mentioned the "block pointer rewrite" feature back in 2009, which would be able to do this, among other cool stuff like "re-dedup" and "re-compress" of existing data. I have not heard anything about it in a while.

    Here I believe BTRFS has a major advantage: it is very easy to expand/reduce the filesystem online by adding/removing disks, and BTRFS transparently handles the rebalancing of the blocks to keep the desired redundancy level. This also makes it possible to change the RAID level on the fly. I have played with it at an experimental level and it works well from what I can see, but we will need support for RAID5/6 as well before I'm happy. It also has online defragmentation, which tends to be useful for COW filesystems, since they fragment data more than a regular *NIX filesystem.
    It feels like BTRFS was designed from the beginning to allow this kind of online restructuring of the filesystem.
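    (For illustration, a rough sketch of what that reshaping looks like in practice, driven from Python; the mount point and device names are made up, but the btrfs-progs commands themselves are the standard ones, and every step runs while the filesystem stays mounted.)

    Code:
    # Rough sketch (assumed device names and mount point) of online BTRFS
    # reshaping using the standard btrfs-progs CLI.
    import subprocess

    MOUNT = "/mnt/data"      # hypothetical mount point
    NEW_DISK = "/dev/sdd"    # hypothetical new disk
    OLD_DISK = "/dev/sdb"    # hypothetical disk to retire

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Grow the filesystem by adding a disk, then rebalance so existing data
    # is spread across all devices at the desired redundancy level.
    run("btrfs", "device", "add", NEW_DISK, MOUNT)
    run("btrfs", "balance", "start", MOUNT)

    # Change RAID level on the fly, e.g. convert data and metadata to RAID1.
    run("btrfs", "balance", "start", "-dconvert=raid1", "-mconvert=raid1", MOUNT)

    # Shrink by removing a disk; BTRFS migrates its blocks to the remaining devices.
    run("btrfs", "device", "delete", OLD_DISK, MOUNT)

    # Online defragmentation.
    run("btrfs", "filesystem", "defragment", "-r", MOUNT)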



  • ryao
    replied
    Originally posted by LasseKongo View Post
    I am using L2ARC for caching 2 RAID-Z devices in my ZFS box, and yes, read operations will benefit if they are in the cache. There are a couple of downsides as I see it:

    * After a clean boot the L2ARC is invalid, and depending on your setup it can take some time to warm up with new data.
    * Just like with ZFS dedup there is a table keeping track of which blocks are located on the L2ARC; I believe it is 320 bytes/block, as with the dedup tables. If you have a large L2ARC this consumes a lot of memory: a 100GB L2ARC with 8K blocks means the table consumes 4GB of RAM. In FreeBSD the default setting is to allow 25% of main memory for ZFS metadata, which means I would need 16GB in the system to keep the table in memory.
    * Writes still go to the slow disks first and are then eventually copied to the L2ARC for caching. A ZIL can speed up metadata operations, but not data.

    The data migration can be a scheduled job; it doesn't have to be in real time, or it can be done during idle I/O cycles.

    It will certainly be interesting to see how the BTRFS guys are going to use this. I heard the RAID5/6 patches are going to be included in 3.8, which would clear another obstacle to adoption for me. I will stick with ZFS for the time being, but I think BTRFS will be really good in maybe a year's time. Just the fact that I cannot add or remove disks in the vdevs, or even remove an entire vdev in ZFS, is starting to piss me off.
    You certainly are familiar with ZFS. However, you are wrong about ZIL only speeding up metadata operations. It applies to data as well, although only for small synchronous writes. You also always have a ZIL (unless you set sync=disabled on your datasets and zvols). You can make it external to your normal vdevs by using a SLOG device. As for L2ARC, it should work very well for those who do not reboot frequently. People have discussed making it persistent across reboots for a while, although the fact that the hottest data remains cached in RAM limits the usefulness of doing that. In theory, hibernation could be used instead of reboots, although I have yet to test that with swap on a ZFS zvol.
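    (For illustration, roughly what that looks like with the standard zpool/zfs commands; the pool, dataset, and device names here are made up.)

    Code:
    # Rough sketch (assumed pool/dataset/device names) of attaching a SLOG and
    # controlling synchronous-write behaviour with the standard zpool/zfs CLI.
    import subprocess

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    # Put the ZIL on a dedicated fast device (a SLOG) instead of the main vdevs.
    run("zpool", "add", "tank", "log", "/dev/sdc1")

    # sync=standard: only synchronous writes go through the ZIL (the default).
    # sync=always:   every write is logged to the ZIL before completing.
    # sync=disabled: synchronous semantics are dropped and the ZIL is not used.
    run("zfs", "set", "sync=standard", "tank/data")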

    Anyway, you are correct about the memory requirements of the L2ARC map. That is not much of a problem if you have a sufficiently large amount of memory to hold the L2ARC map in the memory ZFS has for metadata. It might be best to consider it separately from other metadata to eliminate the cannibalization of cache space. Avoiding a situation where the hottest metadata is forced to the L2ARC by virtue of the L2ARC map being large is definitely something that could be achieved by doing that.

    By the way, I don't know what you mean by "add or remove disks to vdevs". You can certainly take disks away, although doing that leaves the vdevs in a degraded state until you replace them.
    Last edited by ryao; 24 December 2012, 02:30 PM.



  • liam
    replied
    Originally posted by ryao View Post
    ZFS already does this through the ARC algorithm.
    super fantastic awesome brah



  • LasseKongo
    replied
    Originally posted by ryao View Post
    Try out L2ARC. It is a cache on faster storage. Migrating things (like Apple's Fusion drive) is bad for performance because it requires additional IOs. Having a copy somewhere faster does not have such a penalty. There is no reason that ARC could not be used in either scenario, but moving data around does not make sense when you can just cache it.
    I am using L2ARC for caching 2 RAID-Z devices in my ZFS box, and yes, read operations will benefit if they are in the cache. There are a couple of downsides as I see it:

    * After a clean boot the L2ARC is invalid, and depending on your setup it can take some time to warm up with new data.
    * Just like with ZFS dedup there is a table keeping track of which blocks are located on the L2ARC; I believe it is 320 bytes/block, as with the dedup tables. If you have a large L2ARC this consumes a lot of memory: a 100GB L2ARC with 8K blocks means the table consumes 4GB of RAM. In FreeBSD the default setting is to allow 25% of main memory for ZFS metadata, which means I would need 16GB in the system to keep the table in memory (rough numbers sketched below).
    * Writes still go to the slow disks first and are then eventually copied to the L2ARC for caching. A ZIL can speed up metadata operations, but not data.
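    (The rough math behind those numbers, with 320 bytes taken as the commonly quoted approximation for an in-RAM L2ARC header, works out like this.)

    Code:
    # Back-of-the-envelope check of the L2ARC map numbers quoted above.
    l2arc_size   = 100 * 1024**3      # 100 GB L2ARC
    block_size   = 8 * 1024           # 8K blocks
    header_bytes = 320                # approx. in-RAM header per cached block

    blocks  = l2arc_size // block_size           # ~13.1 million blocks
    map_ram = blocks * header_bytes              # ~4 GB of RAM for the map

    metadata_fraction = 0.25                     # FreeBSD default: 25% of RAM for ZFS metadata
    ram_needed = map_ram / metadata_fraction     # ~16 GB total RAM to fit the map

    print(f"{map_ram / 1024**3:.1f} GiB map, needs ~{ram_needed / 1024**3:.0f} GiB RAM")
    # -> 3.9 GiB map, needs ~16 GiB RAM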

    The data migration can be a scheduled job; it doesn't have to be in real time, or it can be done during idle I/O cycles.

    It will certainly be interesting to see how the BTRFS guys are going to use this. I heard the RAID5/6 patches are going to be included in 3.8, which would clear another obstacle to adoption for me. I will stick with ZFS for the time being, but I think BTRFS will be really good in maybe a year's time. Just the fact that I cannot add or remove disks in the vdevs, or even remove an entire vdev in ZFS, is starting to piss me off.



  • ryao
    replied
    Originally posted by LasseKongo View Post
    No it doesn't; the ARC keeps a copy of the hottest blocks, which also still reside in their original location. A better solution would migrate the blocks to faster/slower storage depending on the usage pattern, which also means that it would be persistent across reboots, which the ZFS ARC is not. Also, the ARC only does read caching; with a true tiering solution even writes are faster, since they get migrated to a faster storage tier.
    Hopefully the Linux VFS implementation will make it possible to achieve this functionality.
    Try out L2ARC. It is a cache on faster storage. Migrating things (like Apple's Fusion drive) is bad for performance because it requires additional IOs. Having a copy somewhere faster does not have such a penalty. There is no reason that ARC could not be used in either scenario, but moving data around does not make sense when you can just cache it.
    Last edited by ryao; 23 December 2012, 04:28 PM.



  • LasseKongo
    replied
    Originally posted by ryao View Post
    ZFS already does this through the ARC algorithm.
    No it doesn't; the ARC keeps a copy of the hottest blocks, which also still reside in their original location. A better solution would migrate the blocks to faster/slower storage depending on the usage pattern, which also means that it would be persistent across reboots, which the ZFS ARC is not. Also, the ARC only does read caching; with a true tiering solution even writes are faster, since they get migrated to a faster storage tier.
    Hopefully the Linux VFS implementation will make it possible to achieve this functionality.
    Last edited by LasseKongo; 23 December 2012, 02:05 PM.

