Btrfs Will Finally "Strongly Discourage" You When Creating RAID5 / RAID6 Arrays


  • #71
    Originally posted by horizonbrave View Post

    ZFS has a built-in email alert system; doesn't BTRFS have one, or do you know of any plans to implement one?
    Thanks
    You can easily set up a cron job that scrubs a BTRFS array and sends you an email if it finds any errors.
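
    For example, here is a minimal sketch of such a cron script in Python (the mount point, recipient address and script path are assumptions, not anything from this thread):

      #!/usr/bin/env python3
      # Sketch of a cron-driven scrub-and-report script. Assumptions: the Btrfs
      # array is mounted at /mnt/data, a local SMTP server on localhost accepts
      # mail, and alerts should go to admin@example.com.
      import smtplib
      import subprocess
      from email.message import EmailMessage

      MOUNTPOINT = "/mnt/data"          # hypothetical mount point
      RECIPIENT = "admin@example.com"   # hypothetical alert address

      # -B keeps the scrub in the foreground so we get the final summary.
      scrub = subprocess.run(["btrfs", "scrub", "start", "-B", MOUNTPOINT],
                             capture_output=True, text=True)

      # Per-device error counters; any non-zero counter is worth an email.
      stats = subprocess.run(["btrfs", "device", "stats", MOUNTPOINT],
                             capture_output=True, text=True)
      errors = scrub.returncode != 0 or any(
          line.split()[-1] != "0"
          for line in stats.stdout.splitlines() if line.strip()
      )

      if errors:
          msg = EmailMessage()
          msg["Subject"] = f"btrfs scrub found problems on {MOUNTPOINT}"
          msg["From"] = "root@localhost"
          msg["To"] = RECIPIENT
          msg.set_content(scrub.stdout + "\n" + stats.stdout)
          with smtplib.SMTP("localhost") as smtp:
              smtp.send_message(msg)

    A crontab entry like "0 3 * * 0 /usr/local/bin/btrfs-scrub-report.py" (a made-up path) would then run it every Sunday night and only mail you when something looks wrong.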

    Comment


    • #72
      Originally posted by mppix View Post

      Is raid 5/6 really still a thing with BTRFS or ZFS?
      Even raid 0 and raid 10 can be questioned for SATA SSDs (higher sequential throughput, lower IOPS), and we are moving more and more toward NVMe.
      Depends what the question is.. in enterprise, no. The interview with Adam Leventhal pretty much details why. ZFS supports the equivalents of RAID 5 and 6 plus triple parity (raidz1, raidz2 and raidz3). raidz3 should be OK for enterprise, but it is slower than a stripe of mirrors, and performance is usually more important than capacity. (If you really need capacity you'd just go Gluster on ZFS.) So it depends on your use case.
      ZFS raidz1/raidz2 is OK for home use, where capacity is more of an issue; raidz2 is preferred if you like your data. The larger the drives, the greater the chance they will hit uncorrectable errors on resilver. These levels do work bug-free and hole-free in ZFS, as well as they ever are going to given the physical limitations.
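
      As a rough illustration (the pool name and device names below are placeholders, not anything from this thread), the two layouts differ only in the vdev keyword:

        #!/usr/bin/env python3
        # Rough sketch: create a double-parity (raidz2) pool, with the
        # triple-parity (raidz3) variant shown for comparison.
        import subprocess

        disks = ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde", "/dev/sdf", "/dev/sdg"]

        # raidz2 survives two failed disks per vdev (the RAID 6 analogue).
        subprocess.run(["zpool", "create", "tank", "raidz2", *disks], check=True)

        # For triple parity you would use the raidz3 keyword instead:
        # subprocess.run(["zpool", "create", "tank", "raidz3", *disks], check=True)

        # Periodic scrubs are what catch latent errors before a resilver has to.
        subprocess.run(["zpool", "scrub", "tank"], check=True)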
      Last edited by k1e0x; 09 March 2021, 02:43 PM.

      Comment


      • #73
        Originally posted by mppix View Post

        1+2) Except that we live in a time where we can buy platforms with 128 PCIe 4.0 lanes. You can easily set up >=26 PCIe 4.0 NVMe drives on an Asrock Rack RomeD8 motherboard. The cumulative theoretical bandwidth, 26 x 7.877 GB/s = 204.8 GB/s, matches the 8-channel DDR4-3200 system bandwidth of 204.8 GB/s. Good luck doing any RAID 5/6 operations on top of that (ZFS or not). I'm currently trying to figure out whether RAID 10 makes any sense (spoiler: RAID 1 may beat it, at least for some relevant things).
        3) mdadm solved them as well.
        4) Sure, and I hope that TrueNAS on BSD and ZFS will have a long life. However, it cannot hope to compete with TrueNAS SCALE; KVM virtualization and GlusterFS will each make large user bases switch once SCALE is stable and reliable.
        Um.. so I work in enterprise and I work on ZFS storage arrays.. You are coming from the home use perspective here because you're talking about the Hypervisor on the same platform that the storage is on. That isn't really a thing, you'll have compute nodes and storage nodes on different systems designed for that.

        Using stripe on triple mirrors on ZFS is very fast with large ARC caches too. Using the ARC well is the ZFS secret sauce to performance. BTRFS isn't even a consideration here.. the question is often between things like ZFS, NetApp, EMC or DDN. So in my mind BTRFS and ZFS don't really compete with each other. The competition is between ZFS and NetApp. - Open source or closed source. BTRFS is a home user filesystem or a guest filesystem as far as I've seen in enterprise.

        And no.. not on FreeNAS (or TrueNAS) .. just vanilla FreeBSD. I have tried it and found TrueNAS buggy.. and problematic. A lot of the enterprise features are broken, missing or just designed wrong.. so what is the point of using it? .. With their design I would not really put hopes in scale.. It's a home user NAS product. meh

        I have never seen proxmox or lemonos or whatever linus tech tips uses in enterprise.. but I can tell you that enterprise is what drives the consumer market and that will have ZFS for a very long time as it's the only thing that competes with the big boys.

        I don't know if you know this but NetApp is also FreeBSD (UFS) based so.. You basically get FreeBSD or FreeBSD as options for storage right now.. safe to say that yes, it will have a long life.

        And yes they push upstream as you can see here. https://www.freebsd.org/status/repor...NetApp-Changes
        Last edited by k1e0x; 09 March 2021, 03:36 PM.

        Comment


        • #74
          Originally posted by k1e0x View Post
          Um.. so I work in enterprise and I work on ZFS storage arrays.. You are coming from the home use perspective here because you're talking about the Hypervisor on the same platform that the storage is on. That isn't really a thing, you'll have compute nodes and storage nodes on different systems designed for that.
          So just to understand my situation. I'm doing HPC in my homelab on nodes that run 24+2 nvme drives? I should be so lucky :P

          Running the same hypervisor on storage and compute does not mean that you mix storage and compute nodes. It just means you figured out how to run a hyperconverged architecture.

          Originally posted by k1e0x View Post
          Using stripe on triple mirrors on ZFS is very fast with large ARC caches too. Using the ARC well is the ZFS secret sauce to performance. BTRFS isn't even a consideration here.. the question is often between things like ZFS, NetApp, EMC or DDN. So in my mind BTRFS and ZFS don't really compete with each other. The competition is between ZFS and NetApp. - Open source or closed source. BTRFS is a home user filesystem or a guest filesystem as far as I've seen in enterprise.
          If you have any results on nvme nodes, I'm all ears.

          Originally posted by k1e0x View Post
          And no.. not on FreeNAS (or TrueNAS) .. just vanilla FreeBSD. I have tried it and found TrueNAS buggy.. and problematic. A lot of the enterprise features are broken, missing or just designed wrong.. so what is the point of using it? .. With their design I would not really put hopes in scale.. It's a home user NAS product. meh
          You may want to check the TrueNAS price list and see if their systems qualify for home users

          Originally posted by k1e0x View Post
          I have never seen proxmox or lemonos or whatever linus tech tips uses in enterprise.. but I can tell you that enterprise is what drives the consumer market and that will have ZFS for a very long time as it's the only thing that competes with the big boys.
          I guess I should run ESXi+ZFS storage architecture for my laptop.

          Originally posted by k1e0x View Post
          I don't know if you know this but NetApp is also FreeBSD (UFS) based so.. You basically get FreeBSD or FreeBSD as options for storage right now.. safe to say that yes, it will have a long life.
          And yes they push upstream as you can see here. https://www.freebsd.org/status/repor...NetApp-Changes
          Ceph disagrees with you.. but I was referring to TrueNAS CORE vs SCALE.
          Last edited by mppix; 09 March 2021, 10:40 PM.

          Comment


          • #75
            Originally posted by mppix View Post
            So just to understand my situation. I'm doing HPC in my homelab on nodes that run 24+2 nvme drives? I should be so lucky :P

            Running the same hypervisor on storage and compute does not mean that you mix storage and compute nodes. It just means you figured out how to run a hyperconverged architecture.

            If you have any results on nvme nodes, I'm all ears.
            Lol. Oh I see, yes how lucky of you. (Sarcastic) You should see what I got 🤣 (sadly I don't share details on public forums)

            It doesn't really work this way. You don't ordinarily start with the hardware and just try to make it "do something"; you start with a list of requirements. What problem are you trying to solve? What performance metrics are you trying to achieve? Build a list of requirements first. It sounds like you don't have a problem and you're just playing around.. since that is the case and it sounds like you are learning here (are you in college?), I'd say make one box a storage node, make one a cache node, and run application containers off a compute node (your laptop works) as a learning experience. I'll leave the implementation up to you. Do you think you can accomplish this? Post your solution here.

            Arrogance and confidence are good, but I think you may be a little too fixed in your opinions due to lack of experience. It's OK, you'll get there; just don't burn too many bridges before you do. There is a wealth of experience out there built from pain, blood, sweat and tears, and people will be happy to share it with you if you don't rub them the wrong way.

            Originally posted by mppix View Post
            You may want to check the TrueNAS price list and see if their systems qualify for home users
            Their price does not make the product good, nor enterprise. It's *still* IMO a home-user product, and I believe the reason is that they cater to the home-user crowd in development and problem solving. From my experience the business / enterprise features need a lot of work. The UI won't even let you build a nested array; raidz only. There is a need for this type of thing in small business, however, and TrueNAS could solve that well for people in that environment. It's not a bad product.. but they need to work on it. I see it as a somewhat do-it-yourself QNAP or Synology, and I'd recommend it over either of those.

            Originally posted by mppix View Post
            I guess I should run ESXi+ZFS storage architecture for my laptop.
            You may already, without realizing it; not *on the laptop*, but your laptop may very well access one and depend on it.. but I guess you'd rather just be a smart ass.

            Originally posted by mppix View Post
            Ceph disagrees with you. .. but I was referring to Trunas core vs scale.
            Yeah.. meh.. I disagree with a lot of people. It sits on top of a POSIX file system anyhow, so there is no reason you can't put it on ZFS, as far as I can tell (if you wanted to). ZFS generally works on block storage, not files, and there is an API for accessing that layer; you might be able to hook it into the block layer of ZFS and skip the POSIX layer entirely.. Red Hat's got its Stratis stuff too. Be careful of the new shiny.

            A problem I have with a lot of Linux solutions is that they continuously advance and build on existing stuff without redesigning the lower bits. You end up in Ceph situations where you have XFS on top of XFS, and I assume that means they are leaving performance on the table. Take LVM: it's extremely crude and feels like DOS to me.. but everyone uses it and just accepts it like it has to be there or something. ("Oh, but it's a great volume manager!" No, it's really not.. partitions? Really? It's DOS, in my opinion.) ZFS proves it does not need to exist; the storage system can auto-manage its own volumes, no different from the way your memory manager manages memory pages, and you don't partition DIMMs. Linux, please challenge the existing abstractions and stop mounting loopback on loopback on loopback and thinking "OK, that is great, next issue."
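
            To illustrate that point, a tiny sketch (pool and dataset names are made up): ZFS datasets simply draw from the pool's free space, with optional quotas and reservations instead of pre-sized partitions.

              #!/usr/bin/env python3
              # Tiny sketch of the "no partitioning" model: datasets share the
              # pool's free space and only get boundaries if you ask for them.
              import subprocess

              def zfs(*cmd):
                  subprocess.run(cmd, check=True)

              zfs("zfs", "create", "tank/home")                   # no size chosen up front
              zfs("zfs", "create", "tank/scratch")
              zfs("zfs", "set", "quota=500G", "tank/scratch")     # optional cap, changeable later
              zfs("zfs", "set", "reservation=100G", "tank/home")  # optional guaranteed space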
            Last edited by k1e0x; 10 March 2021, 04:24 PM.

            Comment


            • #76
              Originally posted by k1e0x View Post
              Lol. Oh I see, yes how lucky of you. (Sarcastic) You should see what I got 🤣 (sadly I don't share details on public forums)
              Yes, better be careful - corporate espionage is running high here.

              Originally posted by k1e0x View Post
              It doesn't really work this way. You don't ordinarily start with the hardware and just try to make it "do something"; you start with a list of requirements. What problem are you trying to solve? What performance metrics are you trying to achieve? Build a list of requirements first. It sounds like you don't have a problem and you're just playing around.. since that is the case and it sounds like you are learning here (are you in college?), I'd say make one box a storage node, make one a cache node, and run application containers off a compute node (your laptop works) as a learning experience. I'll leave the implementation up to you. Do you think you can accomplish this? Post your solution here.

              Arrogance and confidence are good, but I think you may be a little too fixed in your opinions due to lack of experience. It's OK, you'll get there; just don't burn too many bridges before you do. There is a wealth of experience out there built from pain, blood, sweat and tears, and people will be happy to share it with you if you don't rub them the wrong way.
              C'mon man, I've been doing this for quite some time now.
              Throughout this thread I did nothing but say that (a) NVMe is nontrivial at scale and (b) ZFS is not a silver bullet.
              Somehow that triggered your statements essentially saying that I don't get it and that ZFS is the solution (at least for most things). Good for you, but ZFS does not really solve anything for me. Also, based on your responses, I would guess that your enterprise has lower storage requirements than my "home lab" and you have not looked at qualifying NVMe yet. I humbly accept that my responses were somewhat pointed, but please also have a look in the mirror.
              BTW, why should I use cache nodes for NVMe storage nodes?

              Originally posted by k1e0x View Post
              Their price does not make the product good, nor enterprise. It's *still* IMO a home-user product, and I believe the reason is that they cater to the home-user crowd in development and problem solving. From my experience the business / enterprise features need a lot of work. The UI won't even let you build a nested array; raidz only. There is a need for this type of thing in small business, however, and TrueNAS could solve that well for people in that environment. It's not a bad product.. but they need to work on it. I see it as a somewhat do-it-yourself QNAP or Synology, and I'd recommend it over either of those.
              I don't see the problem. TrueNAS' web-interface is good for SMB. Enterprise usually needs the terminal anyway for one reason or another.

              Originally posted by k1e0x View Post
              You may already, without realizing it; not *on the laptop*, but your laptop may very well access one and depend on it.. but I guess you'd rather just be a smart ass.
              Sure, easiest way is to just get a Synology NAS
              I'm still looking at orders of magnitude higher performance.

              Originally posted by k1e0x View Post
              Yeah.. meh.. I disagree with a lot of people. It sits on top of a POSIX file system anyhow, so there is no reason you can't put it on ZFS, as far as I can tell (if you wanted to). ZFS generally works on block storage, not files, and there is an API for accessing that layer; you might be able to hook it into the block layer of ZFS and skip the POSIX layer entirely.. Red Hat's got its Stratis stuff too. Be careful of the new shiny.

              A problem I have with a lot of Linux solutions is that they continuously advance and build on existing stuff without redesigning the lower bits. You end up in Ceph situations where you have XFS on top of XFS, and I assume that means they are leaving performance on the table. Take LVM: it's extremely crude and feels like DOS to me.. but everyone uses it and just accepts it like it has to be there or something. ("Oh, but it's a great volume manager!" No, it's really not.. partitions? Really? It's DOS, in my opinion.) ZFS proves it does not need to exist; the storage system can auto-manage its own volumes, no different from the way your memory manager manages memory pages, and you don't partition DIMMs. Linux, please challenge the existing abstractions and stop mounting loopback on loopback on loopback and thinking "OK, that is great, next issue."
              You don't like Linux stuff because "they continuously advance and build on existing stuff without redesigning the lower bits".
              However, you also don't like Btrfs and Ceph, which explicitly don't do that (I don't think you understood Ceph, btw).
              Unfortunately, I also think your LVM knowledge dates back to about 2005.
              LVM now fully incorporates mdadm's RAID functions and covers a surprisingly large share of the reasons you'd use ZFS on a single node, and it does so very well (minus the file system of course - just use exFAT).
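
              For instance, a minimal sketch of LVM's md-backed RAID types plus a scrub pass (the volume group, LV names and sizes are made up):

                #!/usr/bin/env python3
                # Minimal sketch of LVM's built-in (md-backed) RAID plus a scrub pass.
                import subprocess

                def run(*cmd):
                    subprocess.run(cmd, check=True)

                # A RAID 10 logical volume carved out of an existing volume group "vg0".
                run("lvcreate", "--type", "raid10", "-L", "200G", "-n", "lv_fast", "vg0")

                # RAID 6 works the same way, just with a different segment type:
                # run("lvcreate", "--type", "raid6", "-L", "2T", "-n", "lv_bulk", "vg0")

                # Run the md "check" sync action; mismatches show up in raid_mismatch_count.
                run("lvchange", "--syncaction", "check", "vg0/lv_fast")
                run("lvs", "-o", "+raid_mismatch_count", "vg0/lv_fast")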


              Comment


              • #77
                Originally posted by mppix View Post
                Yes, better be careful - corporate espionage is running high here.


                C'mon man, I've been doing this for quite some time now.
                Throughout this thread I did nothing but say that (a) NVMe is nontrivial at scale and (b) ZFS is not a silver bullet.
                Somehow that triggered your statements essentially saying that I don't get it and that ZFS is the solution (at least for most things). Good for you, but ZFS does not really solve anything for me. Also, based on your responses, I would guess that your enterprise has lower storage requirements than my "home lab" and you have not looked at qualifying NVMe yet. I humbly accept that my responses were somewhat pointed, but please also have a look in the mirror.
                BTW, why should I use cache nodes for NVMe storage nodes?


                I don't see the problem. TrueNAS' web-interface is good for SMB. Enterprise usually needs the terminal anyway for one reason or another.


                Sure, easiest way is to just get a Synology NAS
                I'm still looking at orders of magnitude higher performance.


                You don't like Linux stuff because "they continuously advance and build on existing stuff without redesigning the lower bits".
                However, you also don't like Btrfs and Ceph, which explicitly don't do that (I don't think you understood Ceph, btw).
                Unfortunately, I also think your LVM knowledge dates back to about 2005.
                LVM now fully incorporates mdadm's RAID functions and covers a surprisingly large share of the reasons you'd use ZFS on a single node, and it does so very well (minus the file system of course - just use exFAT).
                I'm not really sure you have a use case. heh.

                So what's your solution to the problem? Buy something? The reason you're caching it is to learn how. Just buying more and faster stuff is a poor solution to the problem because it's not always viable or cost-efficient. Why do people still use tape drives? Because they can store a ton and they're cheap. S3 Glacier is probably stored on tape.


                There are a lot of ways you could have gone.. However, my solution to the problem would be something like this:

                In the real world I'd use SAS expanders or something, but let's just assume we are using all home-user equipment, and for proof-of-concept reasons we will use iSCSI. (Yes, I know, that causes all kinds of problems with I/O latency etc... not the point here, this is just a proof of concept.)

                Lay out the storage controller's NVMe drives each into their own iSCSI LUN. (As you pointed out, yes, you wouldn't really use NVMe drives for this part; they are expensive and the network will be your bottleneck, but since you have them, it won't hurt anything. The concepts here are more important than whether the hardware you are using fits. Because no, it doesn't, but don't worry about it.)

                Mount those individual LUNs into a pool on the head node (aka the cache server) and add the local NVMe drives as a large L2ARC / ZIL (about 90% L2ARC / 10% ZIL). You'll want to use a mirrored pool layout, not raidz, so you can just add shelves as needed and expand it.

                What happens if a shelf dies? You'll have to think about that in your layout.. since you only have one, you'd be hosed anyhow. With more shelves, each member of a mirror set needs to come from a different shelf.

                And for the application clients you can export zvol block storage or datasets however you like.

                How fast or big could this be? I don't really know; I've never done it over iSCSI before and there is some overhead, but I have no reason to believe it wouldn't work or get pretty close to saturating a 10GbE network. 100GbE? Yes, you will have problems, but that's something you won't need to worry about. The head node will be very fast working off that cache, though. Capacity-wise? You'll run out of wallet and might need to reinforce the floor before you hit the limit there.

                The nice part about this is that it's simple. You only installed one piece of software, FreeBSD's base OS image. That's all you need; ZFS takes care of everything. You can configure it for your redundancy requirements, add auto-failover LUNs, and expand it as much as you want.. but you know.. you go ahead with your fancy solutions.
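
                A minimal sketch of the head-node side under those assumptions (every device, pool and dataset name below is a placeholder):

                  #!/usr/bin/env python3
                  # Rough, FreeBSD-flavoured sketch of the head-node pool described
                  # above. da0-da3 stand in for the iSCSI-backed LUNs and nvd0/nvd1
                  # for the local NVMe devices; all names are placeholders.
                  import subprocess

                  def run(*cmd):
                      subprocess.run(cmd, check=True)

                  # Stripe of mirrors over the iSCSI LUNs; with more than one shelf
                  # you would pair each mirror's members across shelves.
                  run("zpool", "create", "tank",
                      "mirror", "da0", "da1",
                      "mirror", "da2", "da3")

                  # Local NVMe split roughly 90/10 between read cache (L2ARC) and SLOG.
                  run("zpool", "add", "tank", "cache", "nvd0")
                  run("zpool", "add", "tank", "log", "nvd1")

                  # Export to application clients as a block device (zvol)...
                  run("zfs", "create", "-V", "500G", "tank/vm-images")
                  # ...or as an ordinary dataset shared however you like.
                  run("zfs", "create", "tank/projects")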
                Last edited by k1e0x; 10 March 2021, 10:47 PM.

                Comment


                • #78
                  Originally posted by k1e0x View Post

                  I'm not really sure you have a use case. heh.

                  So what's your solution to the problem? Buy something? The reason you're caching it is to learn how. Just buying more and faster stuff is a poor solution to the problem because it's not always viable or cost-efficient. Why do people still use tape drives? Because they can store a ton and they're cheap. S3 Glacier is probably stored on tape.


                  There are a lot of ways you could have gone.. However, my solution to the problem would be something like this:

                  In the real world I'd use SAS expanders or something, but let's just assume we are using all home-user equipment, and for proof-of-concept reasons we will use iSCSI. (Yes, I know, that causes all kinds of problems with I/O latency etc... not the point here, this is just a proof of concept.)

                  Lay out the storage controller's NVMe drives each into their own iSCSI LUN. (As you pointed out, yes, you wouldn't really use NVMe drives for this part; they are expensive and the network will be your bottleneck, but since you have them, it won't hurt anything. The concepts here are more important than whether the hardware you are using fits. Because no, it doesn't, but don't worry about it.)

                  Mount those individual LUNs into a pool on the head node (aka the cache server) and add the local NVMe drives as a large L2ARC / ZIL (about 90% L2ARC / 10% ZIL). You'll want to use a mirrored pool layout, not raidz, so you can just add shelves as needed and expand it.

                  What happens if a shelf dies? You'll have to think about that in your layout.. since you only have one, you'd be hosed anyhow. With more shelves, each member of a mirror set needs to come from a different shelf.

                  And for the application clients you can export zvol block storage or datasets however you like.

                  How fast or big could this be? I don't really know; I've never done it over iSCSI before and there is some overhead, but I have no reason to believe it wouldn't work or get pretty close to saturating a 10GbE network. 100GbE? Yes, you will have problems, but that's something you won't need to worry about. The head node will be very fast working off that cache, though. Capacity-wise? You'll run out of wallet and might need to reinforce the floor before you hit the limit there.

                  The nice part about this is that it's simple. You only installed one piece of software, FreeBSD's base OS image. That's all you need; ZFS takes care of everything. You can configure it for your redundancy requirements, add auto-failover LUNs, and expand it as much as you want.. but you know.. you go ahead with your fancy solutions.
                  Thanks for the writeup. I might try this, and in that case I'll follow up here (I have 4 nodes here for testing but would scale this up for production).
                  For multi-node, we are also looking into NVMe-oF. Let's see how that goes.

                  Comment


                  • #79
                    Originally posted by flower View Post
                    Hardware RAID controllers are not a thing anymore. Software RAID (ZFS / mdadm, not Btrfs) has way more benefits. Just use a UPS though.
                    Not a thing anymore? Care to back this claim with actual usage numbers?
                    Because the last time I checked, HP, Dell, Lenovo, Fujitsu, Supermicro, et al. seem to be "stuck" in the old days, still relying on "legacy" RAID controllers with a beefy battery-backed cache.

                    Ironically a year (?) ago I spent a couple of weeks helping a friend recover VMs from a failed oVirt + Gluster + OpenZFS cluster (hardware RAID in JBOD) that suffered massive corruption due to a massive power surge + outage.
                    Ironically, just above this cluster sat a semi-identical cluster that used the on-board hardware RAID controller + XFS as opposed to OpenZFS pools, and it simply booted like nothing had happened. (We checked checksums and found no errors.)

                    Anecdotal evidence? Quite possibly. But unlike you, I do my best not to make broad comments...
                    Last edited by gilboa; 11 March 2021, 02:15 PM.
                    oVirt-HV1: Intel S2600C0, 2xE5-2658V2, 128GB, 8x2TB, 4x480GB SSD, GTX1080 (to-VM), Dell U3219Q, U2415, U2412M.
                    oVirt-HV2: Intel S2400GP2, 2xE5-2448L, 120GB, 8x2TB, 4x480GB SSD, GTX730 (to-VM).
                    oVirt-HV3: Gigabyte B85M-HD3, E3-1245V3, 32GB, 4x1TB, 2x480GB SSD, GTX980 (to-VM).
                    Devel-2: Asus H110M-K, i5-6500, 16GB, 3x1TB + 128GB-SSD, F33.

                    Comment


                    • #80
                      Originally posted by gilboa View Post

                      Not a thing anymore? Care to back this claim with actual usage numbers?
                      Because the last time I checked, HP, Dell, Lenovo, Fujitsu, Supermicro, et al. seem to be "stuck" in the old days, still relying on "legacy" RAID controllers with a beefy battery-backed cache.

                      Ironically a year (?) ago I spent a couple of weeks helping a friend recover VMs from a failed oVirt + Gluster + OpenZFS cluster (hardware RAID in JBOD) that suffered massive corruption due to a massive power surge + outage.
                      Ironically, just above this cluster sat a semi-identical cluster that used the on-board hardware RAID controller + XFS as opposed to OpenZFS pools, and it simply booted like nothing had happened. (We checked checksums and found no errors.)

                      Anecdotal evidence? Quite possibly. But unlike you, I do my best not to make broad comments...
                      JBOD is not RAID. HBAs are still widely used and a necessity.
                      Hardware RAID controllers just introduce an additional point of failure and make recovery more painful (e.g. you need to keep another one around).

                      By the way, onboard RAID controllers are even worse. The ones found on consumer hardware are just software RAID anyway (with the drawbacks of hardware RAID controllers).

                      Real software RAID is simply good enough (pure mdadm / ZFS at least). The only problem is that you need more bandwidth (a 4-disk NVMe RAID 10 behind a hardware RAID controller can make do with 8 lanes, while a 4-disk software RAID 10 needs 16 lanes for full performance).
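
                      As a minimal sketch (device names are made up), building that 4-disk software RAID 10 with mdadm could look like this:

                        #!/usr/bin/env python3
                        # Minimal sketch: a 4-disk NVMe software RAID 10 with mdadm.
                        # Device names are made up; each drive keeps its full x4 link,
                        # hence the 16-lane remark above.
                        import subprocess

                        drives = ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1", "/dev/nvme3n1"]

                        subprocess.run(["mdadm", "--create", "/dev/md0", "--level=10",
                                        "--raid-devices=4", *drives], check=True)

                        # Persist the array definition (config path varies by distro)
                        # and check the initial sync progress.
                        subprocess.run("mdadm --detail --scan >> /etc/mdadm.conf",
                                       shell=True, check=True)
                        subprocess.run(["cat", "/proc/mdstat"], check=True)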

                      Sure, there are still people using them, but I wouldn't advise anyone to build a new system with one.

                      I can't comment on your scenario as it depends on too many factors. I wouldn't use one or two personal events to decide which is better.

                      Comment
