
4-Disk Btrfs Native RAID Performance On Linux 4.10


  • Zucca
    replied
    Originally posted by Zan Lynx View Post
    SATA and SAS protocol overhead is why NVMe was invented. Read a bit about it at http://www.nvmexpress.org/nvm-express-overview/

    With NVMe you can read every storage block in random order from multiple CPU cores at the same time, all at ALMOST the same speed as a single sequential read, because of multiple IO queues that can each be up to 64K commands deep.
    I hope NVMe is the future and that eventually we'll get ONE standard connector for it.
    The current situation with SATA, SAS, SATA Express and whatever else is a mess: many protocols, many connectors. I haven't even bothered to count them all.



  • jacob
    replied
    Originally posted by starshipeleven View Post
    I think you are confused about what the "block layer" is.
    It is an abstraction so the OS does not need to care how the fuck each storage device actually works internally. How it works internally is irrelevant; the device presents the block layer as a virtual interface to the outside world.

    The block layer is the same thing on HDDs, SSDs and USB thumb drives; block sizes and features may differ, but it works the same way.
    Drives have always remapped blocks, be it "bad sectors" on hard drives, flash drives doing the tricks I described for wear leveling and performance, and so on.
    The OS does not know about any of that; it only sees blocks in the same place it put them, not the physical cells/sectors those blocks are actually mapped to (which may change). Block-level fragmentation is an artifact of the filesystem, and it will happen (or not) regardless of the underlying storage technology. If ext4 does not fragment, it won't fragment on any block device, period.

    Let's take an example of what I was saying above. The first block is mapped to physical cell 1.

    The OS sees the first block, and if it asks for the contents of the first block, the controller sends over the contents of cell 1.
    After some writes, the SSD controller decides to remap it so as not to stress flash cell 1. It changes its own allocation table so that the first block is now mapped to cell 2.

    For the OS nothing has changed. The OS still sees the same first block, and if it asks for the contents of the first block, the controller sends over the contents of cell 2, which are the same as before. Only the controller knows that something has changed.

    This is also what happens with "reallocated sectors" on hard drives. A sector is marked bad and its block is remapped to another (spare) sector. The OS knows nothing about this.

    Let's take another example: a file must be written across 3 different blocks because it's big.

    The OS assigns 3 contiguous blocks at the block layer because the filesystem isn't a piece of shit like NTFS; let's say blocks 4, 5 and 6.
    It sends this to the SSD controller; the controller looks for free space in its allocation tables, assigns block 4 to cell 1235 on flash chip 1, block 5 to cell 5646 on flash chip 2 and block 6 to cell 21312 on flash chip 3, and the writes go there (each block is mapped to a cell on a different chip because the controller is basically doing RAID0).

    If the OS later asks for the contents of blocks 4, 5 and 6, the controller reads its own table and fetches the data from wherever it physically lives.

    Whether data is contiguous at the block level is decided by the filesystem; the physical layout is whatever suits the storage technology, is decided by the drive controller, and neither side knows or cares what the other is doing.
    You just keep missing my point. This is all irrelevant. Let's say I want to read 16 blocks which are logically contiguous. That is, the OS's block layer sees them as a single array of 16 blocks at block addresses A through A+15. Now for simplicity let's assume that the SSD maps them onto 4 disjoint physical regions of 4 cells each. What I'm asking is this (sketched in code after the two questions):

    1. When the block layer initiates a transfer of the 16 blocks, will the SSD indeed send all 16 blocks, or will it only send blocks A to A+3 (the first PHYSICALLY contiguous extent), after which the OS has to submit a second DMA request for blocks A+4 to A+7, and so forth?

    2. If the SSD can indeed transfer all 16 blocks in one operation even though they are stored in PHYSICALLY DISJOINT memory cells, is there a performance penalty in that case (lower throughput) compared to when they are all in physically contiguous cells?
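
    A minimal host-side sketch of what question 1 is about (mine, not from any of the posts; the device path, block size and offset are invented). From the application's side it is always a single request for the 16 logically contiguous blocks; what the drive does with it underneath is exactly what's being asked:

    /* One request for 16 logically contiguous blocks. Whether the SSD
     * satisfies it in one burst or one physical extent at a time is
     * invisible at this level. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLK 4096                 /* assumed logical block size */

    int main(void)
    {
        int fd = open("/dev/sda", O_RDONLY);   /* assumed device node */
        if (fd < 0) { perror("open"); return 1; }

        off_t A = 1000 * (off_t)BLK;           /* some logical address A */
        char *buf = malloc(16 * BLK);

        ssize_t n = pread(fd, buf, 16 * BLK, A);
        printf("got %zd bytes from one request\n", n);

        free(buf);
        close(fd);
        return 0;
    }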



  • Zan Lynx
    replied
    SATA and SAS protocol overhead is why NVMe was invented. Read a bit about it at http://www.nvmexpress.org/nvm-express-overview/

    With NVMe you can read every storage block in random order from multiple CPU cores at the same time, all at ALMOST the same speed as a single sequential read, because of multiple IO queues that can each be up to 64K commands deep.
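
    To picture the multi-queue point on the host side, here is a rough sketch of my own (not from the NVMe spec; the device path, thread count and sizes are all assumptions, and it needs read access to the raw device). Several cores fire independent random reads at one drive; with NVMe, the kernel's blk-mq layer can give each CPU its own submission queue, so the readers don't funnel through a single command queue the way they would over AHCI/SATA:

    /* Several threads doing independent 4 KiB random reads on one device. */
    #define _GNU_SOURCE               /* for O_DIRECT and rand_r() */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLOCK 4096                /* read size; O_DIRECT needs alignment */
    #define READS 1024                /* random reads per thread */
    #define SPAN  (1024L * 1024 * 1024) /* range to read: first 1 GiB */

    static const char *dev = "/dev/nvme0n1"; /* assumed device node */

    static void *reader(void *arg)
    {
        int fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return NULL; }

        void *buf;
        if (posix_memalign(&buf, BLOCK, BLOCK)) { close(fd); return NULL; }

        unsigned seed = (unsigned)(long)arg;
        for (int i = 0; i < READS; i++) {
            /* pick a random block-aligned offset and read one block */
            off_t off = (off_t)(rand_r(&seed) % (int)(SPAN / BLOCK)) * BLOCK;
            if (pread(fd, buf, BLOCK, off) != BLOCK)
                perror("pread");
        }
        free(buf);
        close(fd);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[8];               /* one reader per (assumed) 8 cores */
        for (long i = 0; i < 8; i++)
            pthread_create(&t[i], NULL, reader, (void *)i);
        for (int i = 0; i < 8; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

    The same program runs fine against a SATA SSD, but there everything funnels into a single queue at most 32 commands deep (NCQ), which is the overhead NVMe was designed to remove.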



  • starshipeleven
    replied
    Originally posted by jacob View Post
    That's exactly my point. The block layer knows nothing about this so it would issue DMA requests as if this mechanism didn't exist.
    I think you are confused about what the "block layer" is.
    It is an abstraction so the OS does not need to care how the fuck each storage device actually works internally. How it works internally is irrelevant; the device presents the block layer as a virtual interface to the outside world.

    The block layer is the same thing on HDDs, SSDs and USB thumb drives; block sizes and features may differ, but it works the same way.
    Drives have always remapped blocks, be it "bad sectors" on hard drives, flash drives doing the tricks I described for wear leveling and performance, and so on.
    The OS does not know about any of that; it only sees blocks in the same place it put them, not the physical cells/sectors those blocks are actually mapped to (which may change). Block-level fragmentation is an artifact of the filesystem, and it will happen (or not) regardless of the underlying storage technology. If ext4 does not fragment, it won't fragment on any block device, period.

    Let's take an example of what I was saying above. The first block is mapped to physical cell 1.

    The OS sees the first block, and if it asks for the contents of the first block, the controller sends over the contents of cell 1.
    After some writes, the SSD controller decides to remap it so as not to stress flash cell 1. It changes its own allocation table so that the first block is now mapped to cell 2.

    For the OS nothing has changed. The OS still sees the same first block, and if it asks for the contents of the first block, the controller sends over the contents of cell 2, which are the same as before. Only the controller knows that something has changed.

    This is also what happens with "reallocated sectors" on hard drives. A sector is marked bad and its block is remapped to another (spare) sector. The OS knows nothing about this.

    Let's take another example: a file must be written across 3 different blocks because it's big.

    The OS assigns 3 contiguous blocks at the block layer because the filesystem isn't a piece of shit like NTFS; let's say blocks 4, 5 and 6.
    It sends this to the SSD controller; the controller looks for free space in its allocation tables, assigns block 4 to cell 1235 on flash chip 1, block 5 to cell 5646 on flash chip 2 and block 6 to cell 21312 on flash chip 3, and the writes go there (each block is mapped to a cell on a different chip because the controller is basically doing RAID0).

    If the OS later asks for the contents of blocks 4, 5 and 6, the controller reads its own table and fetches the data from wherever it physically lives.

    Whether data is contiguous at the block level is decided by the filesystem; the physical layout is whatever suits the storage technology, is decided by the drive controller, and neither side knows or cares what the other is doing.
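
    To make the two examples concrete, here they are as a toy C model of my own invention (a real FTL is enormously more complex; every number here is made up for illustration):

    /* Toy flash translation layer: the OS addresses logical blocks, the
     * controller's table decides which physical cell backs each one. */
    #include <stdio.h>
    #include <string.h>

    #define NBLOCKS 8

    static int  map[NBLOCKS];            /* logical block -> physical cell */
    static char cell[4 * NBLOCKS][16];   /* pretend flash cells */

    /* What the OS sees: "give me logical block b". */
    static const char *read_block(int b) { return cell[map[b]]; }

    /* Wear leveling: the controller copies the data to a fresh cell and
     * updates its table. The logical block number never changes. */
    static void remap(int b, int new_cell)
    {
        strcpy(cell[new_cell], cell[map[b]]);
        map[b] = new_cell;
    }

    int main(void)
    {
        map[0] = 1;                      /* first block lives in cell 1 */
        strcpy(cell[1], "hello");
        printf("block 0 = %s (cell %d)\n", read_block(0), map[0]);

        remap(0, 2);                     /* controller moves it to cell 2 */
        printf("block 0 = %s (cell %d)\n", read_block(0), map[0]);
        return 0;
    }

    The result of read_block(0) never changes, however often the controller shuffles the mapping underneath; that is the whole trick.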



  • jacob
    replied
    Originally posted by starshipeleven View Post
    The block layer has no fucking idea how the physical layer works or what it does; what appears as a fully contiguous file at the block layer may be all over the disk at the physical layer, but that's completely irrelevant, or even beneficial, as it gets more read speed from the "RAID0"-like layout at the physical layer.
    That's exactly my point. The block layer knows nothing about this, so it issues DMA requests as if this mechanism didn't exist. Hence back to the question: does the SSD accept single, burst-mode operations for a series of blocks that are logically contiguous but physically scattered around, and does it perform them just as fast as if the blocks were physically contiguous? Or is the maximum DMA transfer it accepts limited to the size of a physically contiguous extent? In the first case it's all great; in the second there is a performance penalty.



  • starshipeleven
    replied
    Originally posted by jacob View Post
    The speed to fetch each individual block is the same on an SSD, but that's not what I'm talking about. Fetching 10 contiguous blocks over, say, SATA takes 1 DMA burst transfer. Fetching 10 blocks split into 2 extents takes 2 consecutive DMA transfers (SSD or not). So yes, there is a large performance penalty for fragmentation on SSDs: not as high as on rotating disks, but the difference is much smaller than what you seem to believe.
    I'm not talking about blocks, I'm talking about the physical layer. SSDs have fragmentation at the physical layer; then they can also get fragmented at the block layer if the filesystem allows it (ext4 usually does not), but that is a filesystem issue that would be the same on any other block device.

    The block layer has no fucking idea how the physical layer works or what it does; what appears as a fully contiguous file at the block layer may be all over the disk at the physical layer, but that's completely irrelevant, or even beneficial, as it gets more read speed from the "RAID0"-like layout at the physical layer.
    Last edited by starshipeleven; 31 January 2017, 07:41 PM.



  • jacob
    replied
    Originally posted by starshipeleven View Post
    No. The speed to fetch data from blocks anywhere on the SSD is the same, because the flash chips are random-access memory, not sequential like hard drives.
    There isn't a "somewhat lower" penalty; there is no penalty at all, because flash reads at the same speed and with the same latency from any cell.
    The speed to fetch each individual block is the same on an SSD, but that's not what I'm talking about. Fetching 10 contiguous blocks over, say, SATA takes 1 DMA burst transfer. Fetching 10 blocks split into 2 extents takes 2 consecutive DMA transfers (SSD or not). So yes, there is a large performance penalty for fragmentation on SSDs: not as high as on rotating disks, but the difference is much smaller than what you seem to believe.
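
    As a sketch (my illustration; the offsets and device path are invented, and a pread() here stands in for a request, not literally one DMA transfer):

    /* Unfragmented vs fragmented read of the same 10 blocks. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLK 4096                 /* assumed logical block size */

    int main(void)
    {
        int fd = open("/dev/sda", O_RDONLY);  /* assumed device node */
        if (fd < 0) { perror("open"); return 1; }
        char *buf = malloc(10 * BLK);

        /* Contiguous file: 10 blocks, one request. */
        pread(fd, buf, 10 * BLK, 0);

        /* File split into two logical extents: two requests, so the
         * command setup/completion overhead is paid twice. */
        pread(fd, buf,           5 * BLK, 0);
        pread(fd, buf + 5 * BLK, 5 * BLK, 500 * (off_t)BLK);

        free(buf);
        close(fd);
        return 0;
    }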



  • starshipeleven
    replied
    Originally posted by stiiixy View Post
    Someone's having their period.
    Someone managed to post everything wrong.

    Maybe if you knew our use case
    You stated it. You want performance and you don't care about other features. Because otherwise you would be talking about ZFS instead of mdadm RAID.

    But seeing as you can't even get the interpretation of RAID1 right
    Btrfs's "RAID1" is not actual RAID1; if you don't know basic facts about btrfs, that is not my problem.

    BTRFS 5/6 has a proven data loss bug. You want to risk someone else's 40 years of data on a bug like that?
    And what issues does ZFS have that you could not use it?

    Let me spell this out for you with regards to performance; SHIT.
    See? You only want performance. Please don't use btrfs, then; it's never going to be faster than the block-level RAID you use already.

    The existing system is rock-solid, working and...
    ...since it is using mdadm RAID, I still don't see why you really want btrfs, since you seem to be fine with that.



  • starshipeleven
    replied
    Originally posted by jacob View Post
    So my question remains: does the fact that the device remaps logical block addresses to reduce wear (which is a good thing) prevent it from being able to transfer logically contiguous buffers in a single operation, or not?
    No. The speed to fetch data from blocks anywhere on the SSD is the same, because the flash chips are random-access memory, not sequential like hard drives.
    There isn't a "somewhat lower" penalty; there is no penalty at all, because flash reads at the same speed and with the same latency from any cell.



  • stiiixy
    replied
    Originally posted by SystemCrasher View Post
    You write it almost as if there were people from a marketing department trying really hard to sell you something. Yet btrfs devs do not sell storage solutions, unlike Sun. They are merely hired by companies using btrfs for their own deployments and so on. Btrfs probably works for them and their scenarios if they dare to deploy it, not to mention the devs would fix things if that weren't the case. Btw, RAID 5/6 in btrfs is considered experimental and has some shortcomings, so using it in production is probably not the best idea ever.


    Waiting for [some time] on its own wouldn't do any magic, except getting you somewhat older, of course.
    Yes, that sales engine is called 'the Internet'. That's also supposed to be a joke.
    I waited years for the driver to mature before I tested it on some bigger iron than a home-job NAS. The simple management of BTRFS arrays is what initially sold me, as we could do away with all the legacy custom hardware stuff we've been relying on. The six months of 'waiting' was simply us pounding on the BTRFS server. Needless to say, it will likely be deployed when time permits. It just fell short of deployment for us in this instance because we would prefer RAID6 and the realities fell just short. No biggie. Not sure why others are getting their panties in a knot. I use BTRFS at home.

