Btrfs Will Finally "Strongly Discourage" You When Creating RAID5 / RAID6 Arrays


  • #61
    Originally posted by drjohnnyfever View Post

    1) Okay... Didn't say it wasn't.
    2) Yes, didn't say it wasn't.
    3) It's perfectly relevant when we're talking about the pitfalls of RAID5/6. These issues have been around a long time and people have done a bunch of work on resolving them.
    4) They are still doing a ton of new feature development on FreeBSD, so I'm not sure that is accurate. What we know for sure is that they came out with a product for running Linux containers. And they seem pretty committed to ZFS, considering they are paying devs to upstream work to OpenZFS.
    1+2) Except that we live in a time where we can buy platforms with 128 PCIe 4.0 lanes. You can easily set up >=26 PCIe 4.0 NVMe drives on an ASRock Rack ROMED8 motherboard. The cumulative theoretical bandwidth, 26 × 7.877 GB/s ≈ 204.8 GB/s, matches the 8-channel DDR4-3200 memory bandwidth of 204.8 GB/s. Good luck doing any RAID 5/6 operations on top of that (ZFS or not). I'm currently trying to figure out whether RAID 10 makes any sense (spoiler: RAID 1 may beat it, at least for some relevant workloads).
    3) mdadm solved them as well.
    4) Sure, and I hope that TrueNAS on BSD and ZFS will have a long life. However, it cannot hope to compete with TrueNAS SCALE. KVM virtualization and GlusterFS will each pull large user bases over once it is stable/reliable.
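
    Rough back-of-the-envelope behind those numbers, assuming x4 links and PCIe 4.0's ~1.969 GB/s per lane after 128b/130b encoding:

      PCIe 4.0 x4 drive:   4 × 1.969 GB/s        ≈ 7.877 GB/s
      26 drives:           26 × 7.877 GB/s       ≈ 204.8 GB/s aggregate
      DDR4-3200, 8 ch:     8 × 3200 MT/s × 8 B   = 204.8 GB/s memory bandwidth

    So parity RAID doing read-modify-write across that many devices is effectively fighting the memory bus itself.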



    • #62
      Originally posted by mppix View Post

      1+2) Except that we live in a time where we can buy platforms with 128 PCIe 4.0 lanes. You can easily set up >=26 PCIe 4.0 NVMe drives on an ASRock Rack ROMED8 motherboard. The cumulative theoretical bandwidth, 26 × 7.877 GB/s ≈ 204.8 GB/s, matches the 8-channel DDR4-3200 memory bandwidth of 204.8 GB/s. Good luck doing any RAID 5/6 operations on top of that (ZFS or not). I'm currently trying to figure out whether RAID 10 makes any sense (spoiler: RAID 1 may beat it, at least for some relevant workloads).
      3) mdadm solved them as well.
      4) Sure, and I hope that TrueNAS on BSD and ZFS will have a long life. However, it cannot hope to compete with TrueNAS SCALE. KVM virtualization and GlusterFS will each pull large user bases over once it is stable/reliable.
      Somewhere in one of these posts I said stripes and pools of mirrors are the best for performance. Nowhere did I ever say RAID5 or Z2 was ideal for everything.

      mdadm hasn't really solved any fundamental raid5/6 issues as far as I know. Maybe we're talking about different things?

      What you are describing is basically Proxmox. TrueNAS on BSD already runs VMs. FreeBSD supports GlusterFS, 9pfs, and all these other things. Yeah, some people are going to prefer a QEMU/KVM platform, but this isn't revolutionary and I'm not really expecting a giant mass migration to SCALE. Admittedly I could be wrong on this, but there are already a ton of KVM options out there; it's a crowded space.
      Last edited by drjohnnyfever; 08 March 2021, 02:45 PM.



      • #63
        Originally posted by scineram View Post
        What does btrfs do when a disc fails in RAID1?
        It depends on how many storage devices you have and what sizes they are. As long as BTRFS can keep a second copy of every chunk on some other storage device, it will keep working. BTRFS "RAID1" means two copies of each chunk, regardless of how many storage devices are in the pool.
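
        A rough sketch of what that looks like in practice (device names and the devid are placeholders, and this assumes reasonably recent btrfs-progs):

          # Three devices of any size; every chunk gets exactly two copies:
          mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd

          # If a device dies, mount the survivors writable in degraded mode:
          mount -o degraded /dev/sdb /mnt

          # Point 'replace' at the dead device's devid (see 'btrfs filesystem show');
          # it rebuilds the missing copies onto the new drive:
          btrfs replace start 3 /dev/sde /mnt
          btrfs filesystem usage /mnt   # check whether anything is still single-copy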


        http://www.dirtcellar.net



        • #64
          Originally posted by drjohnnyfever View Post
          Somewhere in one of these posts I said stripes and pools of mirrors are the best for performance.
          I am trying (and probably failing) to point out that this may not be the case in NVMe arrays.

          Originally posted by drjohnnyfever View Post
          mdadm hasn't really solved any fundamental raid5/6 issues as far as I know. Maybe we're talking about different things?
          AFAIK mdadm RAID 5/6 is stable, but RAID 5 in particular is rarely recommended for anything anymore.

          Originally posted by drjohnnyfever View Post
          What you are describing is basically Proxmox. TrueNAS on BSD already runs VMs. FreeBSD supports GlusterFS, 9pfs, and all these other things. Yeah, some people are going to prefer a QEMU/KVM platform, but this isn't revolutionary and I'm not really expecting a giant mass migration to SCALE. Admittedly I could be wrong on this, but there are already a ton of KVM options out there; it's a crowded space.
          AFAIK, TrueNAS SCALE is a hyperconverged solution that will compete 1:1 with Proxmox (which leans heavily toward Ceph), oVirt, and possibly even commercial offerings. It seems like TrueNAS SCALE can run an entire virtualized data center (whether or not you separate storage and compute).
          TrueNAS Core is a SAN/NAS that can do some containers. I don't know if it supports Gluster out of the box.
          We can probably debate when it makes sense to migrate existing systems from Core to SCALE. However, if SCALE delivers on its promises, it is a no-brainer over Core and oVirt (and possibly Proxmox as well) for new deployments.
          Last edited by mppix; 08 March 2021, 03:22 PM.



          • #65
            Originally posted by mppix View Post
            I am trying (and probably failing) to point out that this may not be the case in NVMe arrays.
            I sort of doubt raid 5/6 is going to be any faster than a bunch of independent disks so I'm not sure what your point is.

            Originally posted by mppix View Post
            AFAIK mdadm RAID 5/6 is stable, but RAID 5 in particular is rarely recommended for anything anymore.
            Wow, OK, so it doesn't just eat data out of spite. But you still have the write hole, the higher-level filesystem being unaware of how the blocks are actually laid out on disk, and rebuild times for large disks. The latter is mostly a problem for spinning rust, but a problem nonetheless.

            To be clear I meant issues with RAID5/6 in principle, not implementation.

            Originally posted by mppix View Post
            AFAIK, TrueNAS SCALE is a hyperconverged solution that will compete 1:1 with Proxmox (which leans heavily toward Ceph), oVirt, and possibly even commercial offerings. It seems like TrueNAS SCALE wants to run your entire virtualized data center (whether or not you separate storage and compute).
            TrueNAS Core is a SAN/NAS that can do some containers. I don't know if it supports Gluster out of the box.
            We can probably debate when it makes sense to migrate existing systems from Core to SCALE. However, if SCALE delivers on its promises, it is a no-brainer over Core and oVirt (and possibly Proxmox as well) for new deployments.
            TrueNAS Core does containers and VMs. Its Docker support is weak and some customers will only consider KVM. That doesn't mean you can't run Windows VMs on your hyperconverged zpool on it. But like I said, the KVM space is crowded. I'm already using Proxmox and FreeBSD/bhyve/ZFS for all sorts of things. We've pretty much eliminated ESXi at the office except for testing our product on it.



            • #66
              Originally posted by drjohnnyfever View Post

              To be clear I meant issues with RAID5/6 in principle, not implementation.
              To be clearer, RAIDZ is fundamentally different from RAID5/6 in the crucial detail that the thing doing the equivalent of journaling/logging and updating the file records is the same thing that is doing the writes and parity. So the FS is able to update the metadata explicitly, and only after the writes have completed and been flushed. RAID5/6 just presents the FS with a block interface, and btrfs, ext4, XFS, etc. on mdadm RAID have no way of knowing what operations actually occurred on disk underneath before updating their metadata. It's totally opaque.
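
              You can see that integration from the command line: the pool layer and the filesystem are the same piece of software, so parity, checksums, and repair are all driven by it. A rough sketch (disk names are placeholders):

                # Parity is a property of the pool that the filesystem itself manages:
                zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

                # The same layer verifies checksums and rewrites bad blocks from parity:
                zpool scrub tank
                zpool status -v tank   # per-device error counters and repair progress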



              • #67
                Originally posted by drjohnnyfever View Post
                I sort of doubt raid 5/6 is going to be any faster than a bunch of independent disks so I'm not sure what your point is.
                No, likely not; the problem is that RAID 10 may also not be faster than RAID 1 on NVMe arrays (talking IOPS here).

                Originally posted by drjohnnyfever View Post
                Wow, OK, so it doesn't just eat data out of spite. But you still have the write hole, the higher-level filesystem being unaware of how the blocks are actually laid out on disk, and rebuild times for large disks. The latter is mostly a problem for spinning rust, but a problem nonetheless.
                Do you mean this?

                Just to be sure: I don't advocate for RAID 5/6. As far as I am concerned, it should be deprecated on any system/platform, including mdadm, ZFS, and btrfs.

                Originally posted by drjohnnyfever View Post
                TrueNAS Core does containers and VMs. Its Docker support is weak and some customers will only consider KVM. That doesn't mean you can't run Windows VMs on your hyperconverged zpool on it. But like I said, the KVM space is crowded. I'm already using Proxmox and FreeBSD/bhyve/ZFS for all sorts of things. We've pretty much eliminated ESXi at the office except for testing our product on it.
                I'm somewhere in the same space, except we use Gluster and we don't use ZFS.
                TrueNAS SCALE may eliminate the need for Proxmox for us.
                How is the hyperconverged zpool performance?



                • #68
                  Originally posted by mppix View Post
                  No, likely not; the problem is that RAID 10 may also not be faster than RAID 1 on NVMe arrays (talking IOPS here).
                  I need to do some testing on those extreme use cases. On ZFS there isn't really a difference between a stripe and a bunch of mirrors on a fundamental level, so I'm not sure how that scales specifically. I know HPE/Cray were doing some performance work with high-end NVMe drives in ZFS pools of more than 10 devices.
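
                  For reference, the "stripe of mirrors" layout is just multiple mirror vdevs in one pool, and ZFS stripes across them automatically. Something like this (drive names are placeholders):

                    # Two mirror vdevs; writes are distributed across both (a RAID10-style pool):
                    zpool create fastpool mirror /dev/nvme0n1 /dev/nvme1n1 mirror /dev/nvme2n1 /dev/nvme3n1

                    # Adding another mirror later just widens the stripe:
                    zpool add fastpool mirror /dev/nvme4n1 /dev/nvme5n1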

                  Side note, there are definitely cases where ZFS does not scale as well as alternatives. Netflix, for instance, uses UFS on FreeBSD specifically because it handles their streaming workload better. In their case they aren't using pooled storage at all, just independent disks with a bunch of unrelated sequential reads coming off them.

                  Originally posted by mppix View Post
                  Do you mean this?

                  Just to be sure: I don't advocate for RAID 5/6. As far as I am concerned, it should be deprecated on any system/platform, including mdadm, ZFS, and btrfs.
                  That is interesting. That might well be a solution to the problem, and admittedly I was unaware of it. Although I don't know of anyone actually using it with their md RAID.
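
                  If the feature in question is md's RAID 5/6 write journal (or the partial parity log), a minimal sketch would look something like this; devices are placeholders and I haven't battle-tested it myself:

                    # Journal stripe updates to a fast device before touching the array,
                    # which closes the write hole:
                    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
                          /dev/sdb /dev/sdc /dev/sdd /dev/sde \
                          --write-journal /dev/nvme0n1p1

                    # Or, without a dedicated journal device, use the partial parity log
                    # (RAID5 only; it protects parity consistency, not in-flight data):
                    mdadm --create /dev/md1 --level=5 --raid-devices=4 \
                          --consistency-policy=ppl \
                          /dev/sdf /dev/sdg /dev/sdh /dev/sdi

                  The journal device itself becomes a throughput bottleneck and, unless mirrored, a single point of failure, which is probably part of why it sees so little use.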

                  Originally posted by mppix View Post
                  I'm somewhere in the same space, except we use Gluster and we don't use ZFS.
                  TrueNAS SCALE may eliminate the need for Proxmox for us.
                  How is the hyperconverged zpool performance?
                  Performance has been fairly good overall. I don't regret moving away from VMware with local datastores on RAID10 at all. I've had some interesting issues on Proxmox with zvols being slower than a big file on a dataset when used for VMs. Of course ZFS likes a lot of RAM and a SLOG, and it rewards you if you supply them.
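
                  For what it's worth, the knobs I'm talking about look roughly like this (pool, dataset, and device names are placeholders):

                    # A mirrored SLOG absorbs synchronous writes, which VM workloads generate a lot of:
                    zpool add tank log mirror /dev/nvme0n1p1 /dev/nvme1n1p1

                    # VM disk as a zvol...
                    zfs create -V 64G -o volblocksize=16k tank/vmdisk1

                    # ...versus an image file on a plain dataset, which in my case was faster:
                    zfs create -o recordsize=64k tank/vmfiles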



                  • #69
                    Originally posted by terrywang View Post

                    You can easily source PCIe M.2 NVMe adaptors for desktops, very cheap but still reliable if you know what brand to buy.

                    I recently got 3 M.2-to-PCIe cards (2 with passive cooling) but haven't used them yet on my only desktop (SFF, only 1 PCIe slot...).
                    Ah, why didn't I think of that!



                    • #70
                      Originally posted by drjohnnyfever View Post
                      I need to do some testing on those extreme use cases. On ZFS there isn't really a difference between a stripe and a bunch of mirrors on a fundamental level, so I'm not sure how that scales specifically. I know HPE/Cray were doing some performance work with high-end NVMe drives in ZFS pools of more than 10 devices.
                      Let me/us know what you find. I'd be interested.

                      Fun story: I changed my perspective on striping while working on our setup. We need high IOPS (and, to a lesser degree, low latency) and can easily invest in PCIe 3.0/4.0 drives, but all networking is currently Cat6A / 10GBASE-T. The problem is that each PCIe 3.0 NVMe drive can theoretically saturate about 2 LACP-bonded 10GbE links and PCIe 4.0 about 4. We are quite deep in "what's the point" territory. Also, our testing suggests that striping tends to reduce IOPS (we tested the typical stack: mdadm RAID 1 and 10 + LVM + ext4/XFS, as well as LVM's built-in mdadm RAID 1 and 10 + filesystems).
                      I'll likely try out btrfs next and play with faster/RDMA networking.
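
                      If anyone wants to reproduce that kind of comparison, a 4k random fio job along these lines should show the effect (paths, sizes, and parameters are placeholders, not our exact job files):

                        # Mixed 4k random I/O against a file on the layout under test:
                        fio --name=randrw --filename=/mnt/testvol/fio.dat --size=16G \
                            --rw=randrw --rwmixread=70 --bs=4k --ioengine=libaio --direct=1 \
                            --iodepth=32 --numjobs=8 --runtime=120 --time_based --group_reporting

                      Run the same job on RAID 1, RAID 10, and a single drive and compare the reported IOPS.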

                      Originally posted by drjohnnyfever View Post
                      Side note, there are definitely cases where ZFS does not scale as well as alternatives. Netflix, for instance, uses UFS on FreeBSD specifically because it handles their streaming workload better. In their case they aren't using pooled storage at all, just independent disks with a bunch of unrelated sequential reads coming off them.
                      Interesting!

                      Originally posted by drjohnnyfever View Post
                      Performance has been fairly good overall. I don't regret moving away from VMware with local datastores on RAID10 at all. I've had some interesting issues on Proxmox with zvols being slower than a big file on a dataset when used for VMs. Of course ZFS likes a lot of RAM and a SLOG, and it rewards you if you supply them.
                      Last edited by mppix; 08 March 2021, 08:28 PM.

