Btrfs With Linux 5.10 Brings Some Sizable FSync Performance Improvements

  • #11
    Originally posted by F.Ultra View Post
    My only problem with BTRFS at the moment is that directories that contain more than a few thousand files take 10-20s to list from cold cache (this is on a BTRFS RAID1 system with 110T, so it could be a case-specific problem).
    Are you using space_cache v2? I was under the impression that you just had to clear the v1 space cache and then enable the v2 cache, but this is not the case. It was a rather confusing and complex discussion on IRC a month or two back, but all I got out of it was that simply switching the space cache was not that easy after all.

    Depending on how many storage devices you use and what kind of HBAs you use, I would suggest rebalancing data to raid10. If I remember correctly there was a patch posted a while ago (that I think was merged) that allowed btrfs' raid10 to potentially handle losing more than one drive. If that is true you **may** have a slightly better chance of surviving two dropped devices if you are both unlucky and lucky at once. Of course you would need your metadata to be in raid10 or raid1c3 or raid1c4 to benefit from that.
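
    If you go that route, a minimal sketch of the conversion could look like the following (not from the thread; /mnt/pool is a placeholder mountpoint, and raid1c3 assumes kernel 5.5+ with a matching btrfs-progs):

    # Hypothetical example: convert data chunks to raid10 and metadata to raid1c3.
    # /mnt/pool is a placeholder; run on a mounted, healthy filesystem.
    btrfs balance start -dconvert=raid10 -mconvert=raid1c3 /mnt/pool

    # Check the resulting profiles afterwards.
    btrfs filesystem df /mnt/pool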

    And just a quick heads up to everybody - BTRFS RAID terminology is not really RAID in the classical sense - close enough yes, but quite different still.

    http://www.dirtcellar.net

    • #12
      Originally posted by piorunz View Post

      110TB with RAID1? Wouldn't you be better off with a RAID configuration other than RAID1?
      BTRFS RAID1 is disconnected from classic RAID1; in BTRFS it just means that you have a duplicate of each COW block.

      • #13
        Originally posted by waxhead View Post

        Are you using space_cache v2? I was under the impression that you just had to clear the v1 space cache and then enable the v2 cache, but this is not the case. It was a rather confusing and complex discussion on IRC a month or two back, but all I got out of it was that simply switching the space cache was not that easy after all.

        Depending on how many storage devices you use and what kind of HBAs you use, I would suggest rebalancing data to raid10. If I remember correctly there was a patch posted a while ago (that I think was merged) that allowed btrfs' raid10 to potentially handle losing more than one drive. If that is true you **may** have a slightly better chance of surviving two dropped devices if you are both unlucky and lucky at once. Of course you would need your metadata to be in raid10 or raid1c3 or raid1c4 to benefit from that.

        And just a quick heads up to everybody - BTRFS RAID terminology is not really RAID in the classical sense - close enough yes, but quite different still.
        No, I'm not using space_cache (unless it's on by default); by cold cache I meant the Linux buffer cache. Strangely enough, ls was fast today, 24h later, even though files have been added to the directories, but I guess the Linux VFS simply cached that as well; the machine has 64GB of free RAM after all. I have 24 SAS drives in that setup with an LSI 9207-8i as the HBA.

        • #14
          Originally posted by F.Ultra View Post

          No, I'm not using space_cache (unless it's on by default); by cold cache I meant the Linux buffer cache. Strangely enough, ls was fast today, 24h later, even though files have been added to the directories, but I guess the Linux VFS simply cached that as well; the machine has 64GB of free RAM after all. I have 24 SAS drives in that setup with an LSI 9207-8i as the HBA.
          Ok, BTRFS has two cache mechanisms for free space. You are probably using v1 if you are just using the defaults. On a multi-terabyte filesystem the performance may be degraded. When you list your directories the access time will be updated (which may result in a write), so it may just be that this was the reason you were seeing delays. You can try to switch to space_cache=v2, which is not as straightforward as it may seem from the manpage ( https://btrfs.wiki.kernel.org/index....e/btrfs%285%29 ). Another thing you can do is to try to put large directories in subvolumes. A couple of years ago I did this on a server with about 7.5 million files which was a bit slow on lots of (heavy) small-file operations.
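
          For reference, the manpage procedure looks roughly like the sketch below (not from the thread; /dev/sdb and /mnt/pool are placeholders, and as discussed here the switch may not be quite this clean in practice):

          # Clear the old v1 cache while the filesystem is unmounted.
          umount /mnt/pool
          btrfs check --clear-space-cache v1 /dev/sdb

          # The v2 free-space tree is built on the first mount with this option;
          # later mounts pick it up automatically.
          mount -o space_cache=v2 /dev/sdb /mnt/pool

          # Optionally keep a very large directory in its own subvolume.
          btrfs subvolume create /mnt/pool/bigdir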

          http://www.dirtcellar.net

          • #15
            Originally posted by waxhead View Post
            Are you using space_cache v2? I was under the impression that you just had to clear the v1 space cache and then enable the v2 cache, but this is not the case. It was a rather confusing and complex discussion on IRC a month or two back, but all I got out of it was that simply switching the space cache was not that easy after all.
            What I understood is that it is possible to switch to the V2 space cache quite easily; however, some bits of (unneeded) V1 data will survive the switch.

            Originally posted by waxhead View Post
            Depending on how many storage devices you use and what kind of HBAs you use, I would suggest rebalancing data to raid10. If I remember correctly there was a patch posted a while ago (that I think was merged) that allowed btrfs' raid10 to potentially handle losing more than one drive. If that is true you **may** have a slightly better chance of surviving two dropped devices if you are both unlucky and lucky at once. Of course you would need your metadata to be in raid10 or raid1c3 or raid1c4 to benefit from that.
            In RAID 1/10 you can lose up to half of the disks in the best scenario. However, in the worst one you lose your filesystem when only two disks fail. It depends on which pair of disks you lose.

            When you lose a disk, you cannot also lose the disk that holds the other half of the copies of its data.

            E.g. if you have the following setup:

            DISK1 DISK2
            DISK3 DISK4
            DISK5 DISK6

            Where disk2, disk4 and disk6 are the mirrors of disk1, disk3 and disk5, you can lose disk1, disk4 and disk6 and everything keeps working. However, if you lose disk1 and disk2 the filesystem is gone.
            BTRFS complicates things in the sense that there are pairs of chunks (== slices of a disk) and not pairs of disks. So a chunk of disk1 may be mirrored on disk2, and the next chunk of disk1 may be mirrored on another disk...

            With RAID1/10 it is guaranteed that the filesystem will survive the loss of *one* (any) disk. But if you lose another disk it is *not guaranteed* that the filesystem will survive (it may or may not).
            RAID6, for example, has the guarantee that the filesystem will survive even if two disks are lost.
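
            To see how btrfs has actually spread those mirrored chunks across your devices (the "pairs of chunks, not pairs of disks" point above), something like the following can help; /mnt/pool is a placeholder mountpoint:

            # Profiles currently in use for data and metadata (e.g. RAID1, RAID10, RAID1C3).
            btrfs filesystem df /mnt/pool

            # Per-device breakdown of how much of each profile's chunks sits on each disk.
            btrfs device usage /mnt/pool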

            • #16
              Originally posted by kreijack View Post

              What I understood is that it is possible to switch to the V2 space cache quite easily; however, some bits of (unneeded) V1 data will survive the switch.

              ...

              In RAID 1/10 you can lose up to half of the disks in the best scenario. However, in the worst one you lose your filesystem when only two disks fail. It depends on which pair of disks you lose.

              When you lose a disk, you cannot also lose the disk that holds the other half of the copies of its data.

              E.g. if you have the following setup:

              DISK1 DISK2
              DISK3 DISK4
              DISK5 DISK6

              Where disk2, disk4 and disk6 are the mirrors of disk1, disk3 and disk5, you can lose disk1, disk4 and disk6 and everything keeps working. However, if you lose disk1 and disk2 the filesystem is gone.
              BTRFS complicates things in the sense that there are pairs of chunks (== slices of a disk) and not pairs of disks. So a chunk of disk1 may be mirrored on disk2, and the next chunk of disk1 may be mirrored on another disk...

              With RAID1/10 it is guaranteed that the filesystem will survive the loss of *one* (any) disk. But if you lose another disk it is *not guaranteed* that the filesystem will survive (it may or may not).
              RAID6, for example, has the guarantee that the filesystem will survive even if two disks are lost.
              First of all, the space cache. What I understood is that if there is still something left of the V1 space cache it will continue to be used even if V2 is present. I may be totally or partly wrong about this, but it was an interesting and confusing discussion. It is not as easy as clearing v1 and then mounting with V2, *as I understood it*, even if the manual page does not indicate this at all.

              I do understand how BTRFS RAID10 works. Until now, losing two disks has always been a problem for BTRFS RAID10. There was a patch proposed a while ago ( https://patchwork.kernel.org/[email protected]/ ) but as you can see the author asked to discard that patch until a problem with not being able to create degraded chunks is solved (and I have no clue if it is solved now).

              Actually there is a theoretically much higher chance of recovering any data from a BTRFS filesystem due to the way data is stored (in chunks, slices or even "partitions" if you like) compared to traditional RAID10/5/6. As long as your metadata is safe you could in theory reconstruct whatever is readable even if you are missing more disks than what would otherwise be possible in a traditional RAID.

              http://www.dirtcellar.net

              • #17
                Originally posted by waxhead View Post

                Ok, BTRFS has two cache mechanisms for free space. You are probably using v1 if you are just using the defaults. On a multi-terabyte filesystem the performance may be degraded. When you list your directories the access time will be updated (which may result in a write), so it may just be that this was the reason you were seeing delays. You can try to switch to space_cache=v2, which is not as straightforward as it may seem from the manpage ( https://btrfs.wiki.kernel.org/index....e/btrfs%285%29 ). Another thing you can do is to try to put large directories in subvolumes. A couple of years ago I did this on a server with about 7.5 million files which was a bit slow on lots of (heavy) small-file operations.
                ok so BTRFS does not honor the nodiratime mount option?

                • #18
                  Originally posted by F.Ultra View Post

                  ok so BTRFS does not honor the nodiratime mount option?
                  I have no clue.

                  http://www.dirtcellar.net

                  • #19
                    Originally posted by F.Ultra View Post

                    ok so BTRFS does not honor the nodiratime mount option?
                    You need to use noatime. That implies nodiratime too. You can't do only nodiratime like in ext4.

                    You should really use space_cache=v2. Really a very big difference if you have an FS of over a TiB.
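
                    As a concrete (hypothetical) example, an fstab entry combining both suggestions could look like this; the UUID, device and mountpoint are placeholders:

                    # /etc/fstab sketch: noatime (which also covers directory atimes) plus the v2 free-space cache.
                    UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /mnt/pool  btrfs  noatime,space_cache=v2  0  0

                    # Or as a one-off mount to test before committing it to fstab:
                    mount -o noatime,space_cache=v2 /dev/sdb /mnt/pool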

                    • #20
                      Originally posted by Spam View Post

                      You need to use noatime. That implies nodiratime too. You can't do only nodiratime like in ext4.

                      You should really use space_cache=v2. Really a very big difference if you have an FS of over a TiB.
                      Thanks, I used both noatime and nodiratime in fstab, so that should have covered it then. I will experiment with space_cache later; I will install a similar server soon, so I will test it out there.
