KDE Almost Lost All Of Their Git Repositories


  • Guest
    Guest replied
    Originally posted by jrch2k8 View Post
    For example, I worked with Sun [at that time] [not at Sun, but as a client-side analyst and certifier] in 1999 in a datacenter that cost 15 million USD, and the certification covered everything from the structure to the software. The last time any of those servers were shut down was in 2001, after a 90-hour power failure, and since then they have only been restarted for upgrades. [Oracle handles it now.] In all these years Oracle/SAP and Solaris have never failed once, the SPARC servers have never failed either, and in the SAN[ish] that came with it [Hitachi] only one disk has died; ZFS [this was close to the Solaris 11 release] recovered the RAID like a champ without failing any service. This datacenter is expected to stay in production until 2015 <-- this is the common case for Solaris, and the same applies to AIX.
    The point being: on a properly certified Solaris/AIX system, something like an abrupt shutdown is an intolerable issue to start with; it implies a grave mistake was made somewhere in the certification process, or by a very bad sysadmin. Data corruption is mostly nil, since those datacenters are heavily redundant and clustered, which makes it very hard to trigger this kind of failure in a fatal fashion.
    I respect you and your comments, but the same applies to Linux. Linux was already very strong in the enterprise in 2004. Today it runs stock exchanges (you should agree the hardware costs a lot in that case as well), which demand the highest reliability and stability. We can read comments from a Gentoo developer comparing enterprise Solaris to some home-made Linux distribution, but a comparison like that isn't sane.

    This kind of error is less likely to happen on Linux, though, since its main target is not those uber datacenters; you can expect power failures and many other issues, and the server has to recover properly every time, as far as possible.
    That's not true. Linux handled such workloads more than ten years ago. You can read about Linux usage in enterprise computing and critical workloads:


  • DrYak
    replied
    Git was the problem.

    Originally posted by droidhacker View Post
    To start with, it was a git server. Mirroring over git does build in a degree of safety, because you can step back in time to before the corruption began.
    THAT WAS THE WHOLE problem to begin with. They managed to hit a corner case (i.e. a specific type of behaviour that happens when "--mirror" is in play) where corrupted objects were getting copied over, and the servers began making perfect copies of the corruption that was on the master.

    Originally posted by droidhacker View Post
    Now if the git servers were being mirrored at the filesystem level, that would just replicate corruption and become impossible to recover.
    That's how it ended up looking: as if the corrupted files had been copied all over, because they hit a specific path where the usual sanity checks in git don't happen.


    The people at KDE *are* competent. They know their way around administration. But they ran into two sets of problems:
    - Technical constraints: it's not easy to deploy a perfect backup solution with limited resources.
    - A corner case: they hit a situation where git doesn't behave the way they thought it would, and corruption got passed around.


    Speaking of file systems, that's also a problem they pointed out. One of the problems they had is that there is no way to make sure that the files on the filesystem are consistent while git is live (well, that's normal in a way). The only way would be to force git to check its own consistency before making the tarball, rsync'ing, etc.
    But that is computationally expensive (and again, they have limited resources. Again, they know their way around administration. Any good sysadmin can in theory set up a perfect backup system; the problem is doing it at the scale of KDE while on a shoestring budget).
    Either you freeze the whole git service for a long time and wait until everything settles down in order to have consistent files to back up (but that blocks users).
    Or you need a dedicated machine which pulls an (imperfect) mirror of the repos, then does the consistency checks, and only then performs a classic filesystem backup (but in this case, that requires CPU cycles, which aren't cheap in a data center).
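
    To make that idea concrete, such a dedicated backup box could run something along these lines from cron. This is only a minimal sketch of the mirror + check + backup cycle described above, not KDE's actual script; the paths and the use of tarballs are assumptions:

        import datetime
        import pathlib
        import subprocess
        import sys

        MIRROR_DIR = pathlib.Path("/srv/git-mirror")   # bare "git clone --mirror" copies (assumed path)
        BACKUP_DIR = pathlib.Path("/srv/git-backup")   # where verified backups land (assumed path)

        def backup_repo(repo: pathlib.Path) -> bool:
            # 1. refresh the local mirror from the live server
            subprocess.run(["git", "remote", "update", "--prune"], cwd=repo, check=True)
            # 2. consistency check: refuse to back up anything git itself considers broken
            if subprocess.run(["git", "fsck", "--full"], cwd=repo).returncode != 0:
                print(f"corruption detected in {repo.name}, keeping the previous backup", file=sys.stderr)
                return False
            # 3. only now take a classic file-level backup of the (now quiescent) mirror
            tarball = BACKUP_DIR / f"{repo.name}-{datetime.date.today().isoformat()}.tar.gz"
            subprocess.run(["tar", "czf", str(tarball), "-C", str(repo.parent), repo.name], check=True)
            return True

        results = [backup_repo(repo) for repo in sorted(MIRROR_DIR.glob("*.git"))]
        sys.exit(0 if all(results) else 1)

    The expensive part is exactly that git fsck --full pass over 1500 repositories, which is the CPU cost mentioned above.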

    The alternative they have is to keep whole old git mirrors. Their problem there is that, at ~25GB per whole mirror, that quickly adds up to around 1TB of backup data. And although a simple 1TB drive is cheap in a parts shop, it's going to be quite expensive again in a data center.

    Originally posted by droidhacker View Post
    Backups are a second mechanism for ensuring data safety. Incremental backups using an incremental filesystem like btrfs, for example, or just brutal permanent backups to which you can roll back.
    Again, their difficulty is achieving this without needing to interrupt the git service, while still having files which are actually consistent and can be re-used in case of a restore. All within what they can afford.


    In the end they decided to:
    - update their current backup strategy so it doesn't hit the corner case that caused all this mess, and put a better strategy in place to detect when this kind of situation manages to happen anyway (intelligently monitor the repository list and trigger a warning if anything weird happens to it; a rough sketch of such a check follows below).
    - keep a couple of older mirrors (within what they can afford: it will only be a couple of backups, not a few dozen spanning the whole range of daily/weekly/monthly backups).
    - set up a new dedicated server which does the mirror + consistency-check + classic file-backup cycle.
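
    For the monitoring part, the idea boils down to comparing the current repository list against the last known-good list and raising an alarm if anything disappears. A minimal sketch of that check (the paths and the cron-based alerting are placeholder assumptions, not KDE's actual tooling):

        import json
        import pathlib
        import sys

        REPO_ROOT = pathlib.Path("/srv/git")                   # live bare repositories (assumed path)
        STATE_FILE = pathlib.Path("/var/lib/repo-watch.json")  # last known-good repository list

        current = sorted(p.name for p in REPO_ROOT.glob("*.git"))
        previous = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else current

        missing = set(previous) - set(current)
        if missing:
            # a repository silently vanishing from the list is exactly the "weird" event to warn about
            print(f"WARNING: repositories disappeared: {sorted(missing)}", file=sys.stderr)
            sys.exit(1)  # non-zero exit so cron/monitoring mails the alert; keep the old known-good list

        STATE_FILE.write_text(json.dumps(current))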

  • droidhacker
    replied
    The key take-home that everybody seems to be missing is VIRTUAL MACHINE. It doesn't matter what filesystem they were using if the filesystem itself isn't the cause of its own corruption.

    Now as for data protection in this kind of problem, there are multiple kinds of protection that can and should be implemented here. To start with, it was a git server. Mirroring over git does build in a degree of safety, because you can step back in time to before the corruption began. Now if the git servers were being mirrored at the filesystem level, that would just replicate corruption and become impossible to recover.

    Backups are a second mechanism for ensuring data safety. Incremental backups using an incremental filesystem like btrfs, for example, or just brutal permanent backups to which you can roll back.

  • DrYak
    replied
    Git-specific alternative

    Or, in their specific case, KDE could have a bunch of servers whose "git --mirror" copies lag behind on purpose, so that they contain older versions of the git repositories.

    Compared to the other solutions (btrfs/zfs, rsync+hardlinks+cronjob, etc.) it is a bit more expensive, as it requires more resources (this setup doesn't use any form of copy-on-write to reduce the duplication).
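
    As a rough illustration of the idea (not something KDE actually runs), each "lagging" box could simply refuse to refresh its mirrors until its copy is older than its assigned delay, so at least one machine always holds a week-old or month-old state. The paths and the delay here are made-up values:

        import pathlib
        import subprocess
        import time

        MIRROR_ROOT = pathlib.Path("/srv/lagging-mirror")  # bare mirrors made with "git clone --mirror" (assumed path)
        DELAY_DAYS = 7                                      # this particular box stays roughly a week behind

        def refresh_if_due(repo: pathlib.Path) -> None:
            stamp = repo / ".last-refresh"
            age_days = (time.time() - stamp.stat().st_mtime) / 86400 if stamp.exists() else float("inf")
            if age_days < DELAY_DAYS:
                return  # still inside the intentional lag window: keep the older state as-is
            subprocess.run(["git", "remote", "update", "--prune"], cwd=repo, check=True)
            stamp.touch()

        for repo in sorted(MIRROR_ROOT.glob("*.git")):
            refresh_if_due(repo)

    Run one copy of this per box with a different DELAY_DAYS and you get a crude, storage-hungry version of daily/weekly/monthly retention, which is exactly why it costs more than the copy-on-write approaches.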

  • DrYak
    replied
    ZFS (and BTRFS) *are* solutions

    ZFS (and BTRFS) are a *possible backup* solution, but that has not much to do with checksumming, bit-perfect control, or being bullet-proof in any other way.

    Shit happens. Today it was a bad fsck.ext4; tomorrow it might be something completely different (hardware failure, a bug in git itself, whatever)...
    No matter how much you argue that problem XyZ would have been prevented by special feature #1322 of filesystem AbC, a day will come when some unpredicted problem fries your setup. It's not a matter of robustness, it's a matter of *TIME*. All things fail eventually, for one reason or another.

    A proper backup solution isn't a mirror. A proper backup solution is one which can handle the case: "Oh shit! I don't have a proper copy of file Foo.Bar any more; I need to go back a few months, to when I still had the correct version around".

    ZFS and BTRFS also happen to be good solutions for that. Not because of robustness, but because they feature copy-on-write, data deduplication and so on. Thus snapshotting comes for "free".

    If your precious data is periodically duplicated (simply mirrored) to a backup server which runs BTRFS and takes periodic snapshots (daily for the last 2 weeks, weekly for the last 2 months, monthly for the last year, and yearly since the beginning of the project), you would be safe even if something happened to your precious data, even if it got silently corrupted with nobody noticing and the corrupt files got mirrored/duplicated.
    Should you realise that garbage has appeared in your files due to some broken fsck you ran last week, just grab the copy from the day before the problem from the backup.

    BTRFS and ZFS are just more modern alternatives to what was previously achieved with regular POSIX-compliant filesystems and a combo of rsync + hardlinks + cron. rsync isn't a magical backup bullet because it can make bit-perfect copies by checksumming everything; it's a good backup solution because, when configured correctly, the hardlinks can keep older versions without eating too much space.
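
    For reference, that rsync + hardlinks trick usually revolves around --link-dest: every run produces what looks like a full dated copy, but unchanged files are just hardlinks into the previous run. A minimal sketch of the generic pattern (the paths are assumptions; this is not KDE's setup):

        import datetime
        import pathlib
        import subprocess

        SOURCE = "/srv/git-mirror/"                 # data to back up (assumed path; trailing slash matters to rsync)
        DEST = pathlib.Path("/backup/git")          # backup destination (assumed path)
        today = DEST / datetime.date.today().isoformat()
        latest = DEST / "latest"                    # symlink pointing at the previous run

        cmd = ["rsync", "-a", "--delete"]
        if latest.exists():
            # unchanged files become hardlinks to the previous copy instead of new data on disk
            cmd += ["--link-dest", str(latest.resolve())]
        cmd += [SOURCE, str(today)]
        subprocess.run(cmd, check=True)

        # repoint "latest" at today's copy for the next run
        if latest.is_symlink():
            latest.unlink()
        latest.symlink_to(today.name)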

    So yes, ZFS would be a good solution. Not because it has some magical immunity to problems that makes it a *substitute* for classic backups (yes, it is resilient to some types of problems; that comes as a plus), but because it can be configured to work as an *actual backup*: something which helps you move back in time to before the problem happened. (As can BTRFS, and as did the rsync/hardlink/cronjob trio before CoW filesystems became the latest craze.)

  • ryao
    replied
    Originally posted by Sidicas View Post
    I was pretty sure it was a git fsck that took it down? It's not to be confused with an fsck that checks the filesystem; it's a git sanity check. It's not clear that filesystem corruption was the cause of the problem. It could have been data corruption caused by a software glitch or by the VM not terminating properly.

    In any case, ZFS is still not a substitute for backups, so a slightly different failure mode could potentially have been equally catastrophic.
    Your belief is wrong. fsck is supposed to leave repairs to the system administrator. `git fsck` adheres to that concept, but fsck.ext4 violates it. What damaged the git repository was fsck.ext4 when it tried to do the system administrator's job for him. ZFS is designed to prevent the issues that fsck is meant to detect. Consequently, it cannot be broken in the manner that ext4 was.

    With that said, the people who claim ZFS is a substitute for backups are those trying to refute the idea, such as yourself. No one who uses ZFS makes such claims.
    Last edited by ryao; 25 March 2013, 10:55 PM.

  • smitty3268
    replied
    Originally posted by Sidicas View Post
    In any case, ZFS is still not a substitute for backups, so a slightly different failure mode could potentially have been equally catastrophic.
    What's the saying? When you've got a nice hammer, everything starts looking like a nail?

    ZFS is nice. It's not the answer to everything.

  • Sidicas
    replied
    Originally posted by ryao View Post
    The fsck that took down their mirror network would not have occurred had ZFS been used. Snapshots are irrelevant as far as that point is concerned.
    I was pretty sure it was a git fsck that took it down? It's not to be confused with an fsck that checks the filesystem; it's a git sanity check. It's not clear that filesystem corruption was the cause of the problem. It could have been data corruption caused by a software glitch or by the VM not terminating properly.

    In any case, ZFS is still not a substitute for backups, so a slightly different failure mode could potentially have been equally catastrophic.
    Last edited by Sidicas; 25 March 2013, 09:47 PM.

  • ryao
    replied
    Originally posted by Sidicas View Post
    Read the KDE blogs. KDE claims there is too much churn in the data on a git server with 1500 repositories, so you wouldn't have been able to get a workable ZFS snapshot at any point without taking the server or one of the mirrors offline. Since everything was mirroring from the same server, they pretty much had a central point of failure, and they also didn't want to take it offline because they wanted 24/7 availability.
    The fsck that took down their mirror network would not have occurred had ZFS been used. Snapshots are irrelevant as far as that point is concerned.

  • Sidicas
    replied
    Originally posted by ryao View Post
    This is the silent corruption that ZFS was created to prevent. Unfortunately, KDE's servers were not using it.
    Read the KDE blogs. KDE claims there is too much churn in the data on a git server with 1500 repositories, so you wouldn't have been able to get a workable ZFS snapshot at any point without taking the server or one of the mirrors offline. Since everything was mirroring from the same server, they pretty much had a central point of failure, and they also didn't want to take it offline because they wanted 24/7 availability.

    As for ZFS stopping the corruption altogether, you don't know that. The corruption could have been caused by a software glitch in the VM software, git, or the VM guest OS, and wouldn't have been detected at all by ZFS, since the data could have been bit-perfect but still invalid. It might have been data corruption, but perhaps not filesystem-level corruption.

    So while ZFS may or may not have helped here, it's still no substitute for backups.

    Originally posted by ryao View Post
    Unfortunately, proper backups would not have helped as much as one would think without a way to know whether or not the data was bad prior to doing them. It is not clear when the repositories became corrupted, although the blog post suggests that fsck was the trigger.
    Agreed. Although KDE cited as their problem that backing up git servers isn't properly documented.

    There is too much churn in the data, which makes it difficult, if not impossible, to take a snapshot or backup of their main server at any given time and be guaranteed that it's consistent.

    Typically you would take the server offline for maintenance in these kinds of situations, but KDE believed they had a better solution for backups by using --mirror.

    Unfortunately, it appears git --mirror is a lot like rsync --delete: if things get corrupted at the source, that corruption will be mirrored on the next sync. KDE claims this isn't properly documented behavior of --mirror, and they're probably right.

    Yes, KDE had tarballs, but they weren't tarballs for backup purposes. The tarballs were individual tarballs of each repository, meant to make it easier to download repository contents; they did not contain everything needed to restore all the content that was on the git server. git bundle suffers from the same problem in that it just doesn't copy everything. The only way to copy everything and keep the main server online 24/7, AFAIK, is with --mirror.


    So KDE was right that git --mirror was probably the best way to go if they didn't want to take the server down for maintenance. What they failed to do was:
    1. Take a mirror server offline and run a git fsck to make sure that what it mirrored was sane.

    1a. AFAIK, all of KDE's mirrors were always online, and KDE claims that since the servers were all online, the data churn on them made it impossible to get a sane snapshot for tarballs, ZFS, or any other backup purpose. Thus, it should have been obvious to KDE, IMO, that they needed to take the main server or one of the mirrors offline to do proper backups. Since everything mirrors off the main server, and git fsck is known to take forever on 1500 repositories (as KDE claims), it makes sense to take one of the mirrors down and use it to do the backups.


    1b. KDE claims the corruption might have started months ago. If they had run git fsck on any of the mirrors after mirroring and taking it offline, they would have discovered the problem months ago.

    2. After running git fsck on the mirror, they should then take a snapshot/backup of it. They'd be guaranteed that all their snapshots/backups were sane, because the mirror is offline (so people aren't pushing to it) and it's also git fsck clean.

    3. Keep backups that date back a solid year. I'm shocked that people still don't do this, but it should be obvious. Also, make sure those backups are stored at least a couple of hundred miles away from your nearest online server, because I've known plenty of companies that spent a lot of money doing backups to tape, only to have them wiped out in the same natural disaster. (A rough sketch of such a retention policy follows below.)
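
    To illustrate point 3, pruning dated backup directories while keeping at least a year of history could look roughly like this. The directory layout, the one-year cut-off and the keep-the-first-of-the-month rule are assumptions for the example, not anything KDE documented:

        import datetime
        import pathlib
        import shutil

        BACKUP_DIR = pathlib.Path("/backup/git")  # dated snapshot directories, e.g. 2013-03-25/ (assumed layout)
        KEEP_DAYS = 365                           # keep at least one solid year of history

        today = datetime.date.today()
        for snap in sorted(BACKUP_DIR.iterdir()):
            try:
                taken = datetime.date.fromisoformat(snap.name)
            except ValueError:
                continue  # not a dated snapshot directory; leave it alone
            age_days = (today - taken).days
            # keep everything from the last year; beyond that, keep only the snapshot from the 1st of each month
            if age_days <= KEEP_DAYS or taken.day == 1:
                continue
            shutil.rmtree(snap)
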
    Last edited by Sidicas; 25 March 2013, 09:30 PM.
