Ted Ts'o: EXT4 Within Striking Distance Of XFS


  • #21
    Originally posted by kebabbert View Post
    You don't understand what data integrity is.

    It is not about a disk crashing or anything like that. It is about retrieving the same data you put on the disk. Imagine you put this data on disk: "1234567890", but a corruption occurred and you got back "2234567890". And the hardware does not even notice the data got corrupted. This is called silent corruption, and it occurs all the time.

    Now imagine you have a fast filesystem, but there is silent corruption now and then. You can NOT trust the data you get back. As I have shown in links, this happens to XFS, ReiserFS, JFS, ext3, etc. It even happens to hardware RAID, all the time.

    CERN did a test: their 3,000 Linux storage servers showed hundreds of instances of silent corruption. (CERN wrote a known bit pattern to the disks and compared what came back; the two differed.) CERN cannot trust the data on disk. Therefore CERN is now migrating to ZFS (which is actually the only modern solution designed from scratch to protect against silent corruption).

    I don't get it. Who wants a fast filesystem that gives you false data?
    You obviously don't get it, or you refuse to note the examples given to you, like the Google situation. You also don't get that there is no such thing as 100% data integrity, EVER. No chance. In every case, choices are made by balancing cost, performance, and data-integrity needs.

    Google doesn't need perfect data integrity, or perfect hardware quality for that matter. Given the volume of data they process and the amount of hardware they deal with, they don't care if a server is of iffy quality; they toss it. Large data volumes might also mean that a faster but more error-prone filesystem fits them best, ROI-wise.

    Like I asked you: do you use ECC in all your computers, in all situations? Do you slightly overvolt, overclock, or underclock any part of your computer? Back up on tape? Back up on 10-year DVDs? The point is, you personally make trade-offs on data integrity on a daily basis.

    Gee, I don't get it! Ohhh, how can a person or corporation live in a world without 100% data integrity?!

    Comment


    • #22
      Originally posted by energyman View Post
      And no matter what the devs write 'it was never guaranteed'... FUCK YOU.
      POSIX says what you, as an app, must do if you want your data safely stored. How much simpler does it need to be? Follow POSIX, or else it's the program author's fault when the data is not safely stored. Early filesystems like ext2 happened to be designed such that it wasn't really a problem if software didn't save properly, so people got away with bad code for too long. Now we have filesystems that take advantage of everything the standard allows, and broken software fails on them. So fix the software.

      Didn't know you were still around, Jade, since the Reiser confession. Or was that forced, or something, I suppose?

      Comment


      • #23
        Originally posted by tytso View Post
        Some people like to treat file system benchmarks as a competition, and want to score wins and losses. That's not the way I look at it. I hack file systems because I'm passionate about working on that technology. I'm more excited about how I can make ext4 better, and not whether I can "beat down" some other file system. That's not what it's all about.

        -- Ted
        That's a hell of a quote. Respect.

        Comment


        • #24
          Originally posted by yotambien View Post
          That's a hell of a quote. Respect.
          I think the first thought that popped into my head was "Thank you!", for the hard work and the sentiment.

          Comment


          • #25
            Originally posted by Tgui View Post
            You obviously don't get it, or you refuse to note the examples given to you, like the Google situation. You also don't get that there is no such thing as 100% data integrity, EVER. No chance. In every case, choices are made by balancing cost, performance, and data-integrity needs.

            Google doesn't need perfect data integrity, or perfect hardware quality for that matter. Given the volume of data they process and the amount of hardware they deal with, they don't care if a server is of iffy quality; they toss it. Large data volumes might also mean that a faster but more error-prone filesystem fits them best, ROI-wise.

            Like I asked you: do you use ECC in all your computers, in all situations? Do you slightly overvolt, overclock, or underclock any part of your computer? Back up on tape? Back up on 10-year DVDs? The point is, you personally make trade-offs on data integrity on a daily basis.

            Gee, I don't get it! Ohhh, how can a person or corporation live in a world without 100% data integrity?!
            Hmmm... you don't seem to understand what I mean.

            CERN stores lots of data because of the LHC (Large Hadron Collider), which cost billions and took decades to plan and build. They are trying to find the Higgs boson(?). CERN really thinks it is important that the bits from their experiments are stored correctly. Now they are migrating to ZFS:




            "Simultaneously, more LCH sites are beginning to use Sun's Thumper (Sun Fire x4500) ultra-dense disk storage systems.

            Having conducted testing and analysis of ZFS, it is felt that the combination of ZFS and Solaris solves the critical data integrity issues that have been seen with other approaches. They feel the problem has been solved completely with the use of this technology. There is currently about one Petabyte of Thumper storage deployed across Tier1 and Tier2 sites. That number is expected to rise to approximately four Petabytes by the end of this summer."



            Here is another link about LHC and ZFS:


            "Solaris 10/11 with ZFS solves critical data
            integrity cases
            ? seen on all types of HW (?/$ independent) <---------OBS!!!!
            ? we feel that this issue is solved completely
            Numerous sites now deploy ZFS + Thumper
            ? other storage + ZFS still minor
            ? >1 PB already in operation (T1 + T2)
            - just count the well known sites (as of April 07)
            - doubles soon"





            Also, in finance (which is where I work) it is extremely important that data is stored correctly. I hope you understand that there are some fields of work where performance is secondary.



            Of course I am not saying that ZFS is 100% safe, but it is far safer than other solutions. The thing is that ZFS uses end-to-end checksumming; no other solution does that. End to end means the entire chain: from RAM, down to the controller, down to the disk. There might be bit flips in any part of that chain. Other solutions only compare checksums within one realm, not when data passes from one realm to the next, so nothing ever checks the far end of the chain against the near end.
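
            To make the idea concrete, here is a toy sketch of the principle in C (my own illustration, not ZFS code; real ZFS stores Fletcher or SHA-256 checksums in the parent block pointers): checksum the block where the data originates, verify it where it is consumed, and any flip in between is caught.

            #include <stddef.h>
            #include <stdint.h>
            #include <stdio.h>

            /* Fletcher-style checksum over a byte buffer (toy version). */
            static uint32_t checksum(const uint8_t *buf, size_t len)
            {
                uint32_t a = 0, b = 0;
                for (size_t i = 0; i < len; i++) {
                    a = (a + buf[i]) % 65535;
                    b = (b + a) % 65535;
                }
                return (b << 16) | a;
            }

            int main(void)
            {
                uint8_t block[] = "1234567890";
                /* Checksum computed at the start of the chain (in RAM). */
                uint32_t sum_at_write = checksum(block, sizeof block);

                block[0] ^= 0x03; /* silent flip in transit: '1' becomes '2' */

                /* Verification at the end of the chain (on read-back). */
                if (checksum(block, sizeof block) != sum_at_write)
                    puts("silent corruption detected: checksum mismatch");
                else
                    puts("data verified");
                return 0;
            }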




            For instance, here is a case where ZFS's end-to-end checksums immediately detected an error in a switch. The switch was injecting faulty bits into the data stream on its way down to the server. Until then, nobody had noticed the faulty switch or the corrupt data:


            "As it turns out our trusted SAN was silently corrupting data due to a bad/flaky FC port in the switch. DMX3500 faithfully wrote the bad data and returned normal ACKs back to the server, thus all our servers reported no storage problems.
            ...
            ZFS was the first one to pick up on the silent corruption"




            There is also research from computer scientists showing that hardware RAID and ordinary filesystems are not safe; you cannot rely on them. And there is research showing ZFS to be safe: ZFS detected all artificially introduced errors. It would also have corrected them all if the researchers had used RAID; as it was, they used a single disk.

            The first step is detecting errors; only then can you correct them. Unfortunately, only ZFS is designed from scratch to detect errors.




            In summary, you can use a storage solution that scales to petabytes and is SAFE. So I don't see why you would focus on performance if the fast solution can't be used in, for instance, finance.

            (No, I do not use ECC RAM yet, because I am waiting to upgrade my old ZFS server. When I upgrade, I will certainly use ECC RAM. I agree ECC is important.)

            Comment


            • #26
              Originally posted by Ranguvar View Post
              POSIX says what you, as an app, must do if you want your data safely stored. How much simpler does it need to be? Follow POSIX, or else it's the program author's fault when the data is not safely stored. Early filesystems like ext2 happened to be designed such that it wasn't really a problem if software didn't save properly, so people got away with bad code for too long. Now we have filesystems that take advantage of everything the standard allows, and broken software fails on them. So fix the software.

              Didn't know you were still around, Jade, since the Reiser confession. Or was that forced, or something, I suppose?
              File A is on disk.

              You want to rename it to B.

              You call rename(). A crash at the wrong moment and both are gone. Or there is a file A or a file B, but its contents? Gone. That is fucking braindead idiocy.
              From the btrfs FAQ:

              What are the crash guarantees of rename?
              Renames NOT overwriting existing files do not give additional guarantees. This means a sequence like

              echo "content" > file.tmp
              mv file.tmp file
              # *crash*

              will most likely give you a zero-length "file". The sequence can give you either:
              • neither file nor file.tmp exists, or
              • file.tmp or file exists and is 0-size or contains "content".
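
              For reference, the crash-safe version of that sequence, the one "follow POSIX" actually means, adds explicit flushes. A rough C sketch (error handling omitted for brevity):

              #include <fcntl.h>
              #include <stdio.h>
              #include <unistd.h>

              int main(void)
              {
                  /* Write the new contents to a temporary file... */
                  int fd = open("file.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
                  write(fd, "content\n", 8);

                  /* ...force it to stable storage BEFORE the rename... */
                  fsync(fd);
                  close(fd);

                  /* ...then atomically swap it into place. After a crash,
                   * "file" is either the old version or the complete new one. */
                  rename("file.tmp", "file");

                  /* Strictly, the rename itself is durable only once the
                   * directory has been flushed as well. */
                  int dfd = open(".", O_RDONLY);
                  fsync(dfd);
                  close(dfd);
                  return 0;
              }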


              That is unacceptable, no matter what POSIX says. POSIX is crap anyway (Windows NT is POSIX-compliant too... yeah).

              Whoever thinks that some clusterfuck like that is acceptable has a major problem with reality.

              In reality, data is sacrosanct. Nuking it is not an option. An FS that nukes data is fucking broken by fucking design.

              Comment


              • #27
                Originally posted by Ranguvar View Post
                POSIX says what you, as an app, must do if you want your data safely stored. How much simpler does it need to be? Follow POSIX, or else it's the program author's fault when the data is not safely stored.
                fsync() is evil.

                I mean, really, truly, horribly, satanically evil.

                Flushing data on my laptop forces the disk to spin up just to write a file that I probably don't care that much about, thereby wasting my battery power. Flushing data on an e-commerce database server, on the other hand, is probably vital to ensure that databases are kept up to date.

                But that's a system configuration choice; it should never be something that applications randomly decide to do on their own. If I don't care that I might lose the last five minutes of files when I crash, then Firefox shouldn't be calling fsync() every time I visit a new web page. But it does, because there are so many crappy filesystems that will corrupt your files if you crash before everything has been flushed to disk.

                Having every application decide whether to force a write to the disk and waste my battery power is simply braindead. Filesystems should behave in a sensible manner so that we don't need this kind of hackery to make them work the way they should have worked in the first place. If I have file A on disk and I edit it and write it out, then the filesystem should ensure that when I read it back after a reboot I get either the old contents or the new contents, not an empty file or some corrupted mixture of the two. Anything else is unacceptable in a general-use filesystem (special-use filesystems may well prefer speed to consistency and be able to handle corruption issues).
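
                A minimal sketch of what "system configuration choice" could look like: the flush is gated on site policy instead of hard-coded into the app. (FLUSH_POLICY is a made-up knob for illustration only; a real system would push this into mount options or similar.)

                #include <fcntl.h>
                #include <stdlib.h>
                #include <string.h>
                #include <unistd.h>

                /* Called after an application finishes writing a file.
                 * Site policy, not the app, decides whether to flush. */
                static void save_done(int fd)
                {
                    /* FLUSH_POLICY is a hypothetical knob for this sketch. */
                    const char *policy = getenv("FLUSH_POLICY");
                    if (policy && strcmp(policy, "strict") == 0)
                        fsync(fd); /* database server: durability first */
                    /* laptop profile: skip the flush, let the disk spin down */
                }

                int main(void)
                {
                    /* Imagine this is a freshly written file. */
                    int fd = open("example.out", O_WRONLY | O_CREAT, 0644);
                    save_done(fd);
                    close(fd);
                    return 0;
                }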

                Comment


                • #28
                  Originally posted by kebabbert View Post
                  Hmmm... you don't seem to understand what I mean.
                  No, it's pretty clear you're the one who isn't understanding. Everyone agrees that in certain cases data integrity is important. What you don't seem to get is that in certain situations it is perfectly acceptable to have data errors, even if you don't know they are there. This has been explained to you, but you keep repeating the same stuff, so I'm not sure whether you're ignoring us or just don't understand the concept.

                  Comment


                  • #29
                    Originally posted by smitty3268 View Post
                    No, it's pretty clear you're the one who isn't understanding. Everyone agrees that in certain cases data integrity is important. What you don't seem to get is that in certain situations it is perfectly acceptable to have data errors.
                    OK, please enlighten me. I work in finance, so I have trouble seeing situations where it is acceptable that the data you get back is not correct. Maybe my line of work has colored my view, but please give me some real-life examples where it is acceptable to get erroneous data (not some contrived examples).

                    I suppose you also advocate fast CPUs that every once in a while insist that 1 + 1 = 7?

                    To me, it is strange to suggest such storage solutions or hardware. I can promise you that in finance your suggestions would get kicked out faster than greased lightning.

                    Comment


                    • #30
                      Originally posted by kebabbert View Post
                      OK, please enlighten me. I work in finance, so I have trouble seeing situations where it is acceptable that the data you get back is not correct.
                      Multimedia files usually tolerate slight corruption. It might show up as a small artefact in a movie or an image, or as a crackle in an audio stream.
                      Applications such as video hosting would probably be great candidates for fast but less safe storage.

                      Comment
