KDE Almost Lost All Of Their Git Repositories

  • phoronix
    started a topic KDE Almost Lost All Of Their Git Repositories

    Phoronix: KDE Almost Lost All Of Their Git Repositories

    There was almost "The Great KDE Disaster Of 2013" when the KDE project almost lost all of their 1,500+ Git repositories...

    http://www.phoronix.com/vr.php?view=MTMzNTc

  • Pawlerson
    replied
    Originally posted by jrch2k8 View Post
    For example, I worked with Sun [at that time; not at Sun itself, but as a client-side analyst and certifier] in 1999 in a datacenter that cost 15 million USD, and the certification went from the building structure to the software. I can say the last time any of those servers were shut down was in 2001, after a 90-hour power failure, and since then they have only been restarted for upgrades [Oracle handles it now]. In all these years Oracle/SAP and Solaris never failed once, the SPARC servers have never failed either, and on the SAN(-ish) storage that came with it [Hitachi] only 1 disk died in all these years, and ZFS [this was close to the Solaris 11 release] recovered the RAID like a champ without failing any service. This datacenter is expected to stay in production until 2015 <-- this is the common case for Solaris, and the same applies to AIX.
    The point being: in a properly certified Solaris/AIX system, something like an abrupt shutdown is a non-tolerable issue to start with, and it implies a grave mistake was made in some part of the certification process or by a very bad sysadmin. The resulting data corruption is mostly nil, since those datacenters are heavily redundant and clustered, which makes it very hard to trigger this kind of failure in a fatal fashion.
    I respect you and your comments, but the same applies to Linux. Linux was already very strong in the enterprise in 2004. Today it runs stock exchanges (you should agree the hardware costs a lot in that case), which demand the highest reliability and stability. We can read comments from a Gentoo developer who compares enterprise Solaris to some home-made Linux distribution, but a comparison like that isn't sane.

    This kind of error is less likely to happen in Linux, though, since its main target is not those uber-datacenters; you can expect power failures and many other issues, and the server has to recover properly every time, as far as possible.
    That's not true. Linux handled such workloads more than ten years ago. You can read about Linux usage in enterprise computing and critical workloads:

    http://itknowledgeexchange.techtarge...ng-system-yet/
    http://www.cnn.com/2001/TECH/computi...tml?_s=PM:TECH

  • DrYak
    replied
    Git was the problem.

    Originally posted by droidhacker View Post
    To start with, it was a git server. Mirroring over git does build in a degree of safety, because you can step back in time to before the corruption began.
    THAT WAS THE WHOLE problem to begin with. They managed to hit a corner case (i.e. a specific type of behaviour that happens when "--mirror" is in play) where corrupted objects were getting copied over, and the mirrors began making perfect copies of the corruption that was on the master.

    Originally posted by droidhacker View Post
    Now if the git servers were being mirrored at the filesystem level, that would just replicate corruption and become impossible to recover.
    That's how it ended up looking: as if the corrupted files got copied all over, because they hit a specific path where Git's usual sanity checks don't happen.
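
    As a rough illustration of the kind of safety net that closes this particular hole (the repository path, remote name and admin address below are invented for the example; this is not KDE's actual sync script): make git validate every object it receives on the mirror, and raise an alarm if a full consistency check fails.

        # Real git option: verify incoming objects instead of trusting the master blindly
        git config --global transfer.fsckObjects true

        # Hypothetical sync job for one mirrored repository
        cd /srv/mirror/example-project.git || exit 1
        git fetch --prune origin && git fsck --full ||
            echo "possible corruption on mirror, please investigate" |
            mail -s "git mirror check failed" admin@example.org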


    The people at KDE *are* competent. They know their way around administration. But they ran into two sets of problems:
    - Technical constraints: it's not easy to deploy a perfect backup solution with limited resources.
    - A corner case: they hit a situation where Git doesn't behave as they thought it would, and corruption got passed around.


    Speaking of file systems, that's also a problem they pointed out. One of the problems they had is that there is no way to make sure the files on the filesystem are consistent while git is live (well, that's normal in a way). The only way would be to force git to check its own consistency before making the tarball, rsync'ing, etc.
    But that is computationally expensive (and again, they have limited resources. Again, they know their way around administration. Any good sysadmin can in theory set up a perfect backup system; the problem is doing it at the scale of KDE on a shoestring budget).
    Either you freeze the whole Git service for a long time and wait until everything settles down in order to have consistent files to back up (but that blocks users).
    Or you need a dedicated machine which pulls an (imperfect) mirror of the repo, then does the consistency checks, and only then performs a classic filesystem backup (but in this case, that requires CPU cycles, which aren't cheap in a data center).
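
    As a rough sketch of that second option (all paths and names are invented here, and the real thing would loop over all 1,500+ repositories), the dedicated machine's job boils down to: pull, verify, and only then archive the now-idle copy.

        #!/bin/sh
        # Hypothetical nightly job on a dedicated backup host.
        set -e                                    # abort on the first failing step
        cd /srv/backup-staging/example-project.git

        git fetch --prune origin                  # refresh the local mirror
        git fsck --full --strict                  # the expensive part: walk every object

        # Only reached if the consistency check passed: plain file-level backup
        tar -czf /srv/backups/example-project-$(date +%F).tar.gz .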

    The alternative they have is to keep whole old git mirrors. Their problem is that, at ~25GB per complete mirror of the repositories, keeping a set of them quickly adds up to around 1TB of backup data. And although a simple 1TB drive is cheap in a parts shop, it's going to be quite expensive again in a data center.

    Originally posted by droidhacker View Post
    Backups are a second mechanism for ensuring data safety. Incremental backups using an incremental filesystem like btrfs, for example, or just brutal permanent backups to which you can roll back.
    Again, their difficulty is achieving this without needing to interrupt the git service, while still having files which are actually consistent and can be re-used in case of a restore. All within what they can afford.


    In the end they decided to:
    - update their current backup strategy so it doesn't hit the corner case that caused all this mess, and put a better strategy in place to detect when this kind of situation does manage to happen (intelligently monitor the repos list and trigger a warning if anything weird happens to it; see the sketch after this list).
    - keep a couple of older mirrors (within what they can afford: it will only be a couple of backups, not a few dozen spanning the whole range of daily/weekly/monthly backups).
    - set up a new dedicated server which does the mirror + consistency check + classic file-backup cycle.
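
    The repo-list monitoring can be almost trivial; a sketch (file locations and the alert address are made up), whose only job is to scream if repositories suddenly vanish from the master list:

        #!/bin/sh
        # Hypothetical watchdog: compare today's repository list with yesterday's.
        OLD=/var/lib/repo-watch/repos.yesterday
        NEW=/var/lib/repo-watch/repos.today
        [ -f "$OLD" ] || touch "$OLD"

        ls /srv/git > "$NEW"
        # Count repositories that were present yesterday but are gone today
        missing=$(comm -23 "$OLD" "$NEW" | wc -l)

        if [ "$missing" -gt 0 ]; then
            echo "$missing repositories disappeared from the repo list" |
                mail -s "git repo-list warning" sysadmin@example.org
        fi
        mv "$NEW" "$OLD"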

  • droidhacker
    replied
    The key take-home that everybody seems to be missing is the VIRTUAL MACHINE. It doesn't matter what filesystem they were using if the filesystem itself isn't the cause of its own corruption.

    Now as for data protection in this kind of problem, there are multiple kinds of protection that can and should be implemented here. To start with, it was a git server. Mirroring over git does build in a degree of safety, because you can step back in time to before the corruption began. Now if the git servers were being mirrored at the filesystem level, that would just replicate corruption and become impossible to recover.

    Backups are a second mechanism for ensuring data safety. Incremental backups using an incremental filesystem like btrfs, for example, or just brutal permanent backups to which you can roll back.

  • DrYak
    replied
    Git specific alternative

    Or, in their specific case, KDE could have a bunch of servers whose "git --mirror" copies lag behind on purpose, so they contain older versions of the git repositories.

    Compared to the other solutions (btrfs/zfs, rsync+hardlinks+cronjob, etc.) it is a bit more expensive, as it requires more resources (this setup doesn't use any form of copy-on-write to reduce the duplication).
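
    For instance (host names and schedules made up), the "lag" can be nothing fancier than staggered cron entries on separate boxes, so there is always a copy that is about a day old and one that is about a week old:

        # crontab on a "daily" lagging mirror: refresh once per day at 03:30
        30 3 * * *  cd /srv/lagging-mirror/example-project.git && git fetch --prune origin

        # crontab on a "weekly" lagging mirror: refresh only on Sundays at 04:30
        30 4 * * 0  cd /srv/lagging-mirror/example-project.git && git fetch --prune origin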

  • DrYak
    replied
    ZFS (and BTRFS) *are* solutions

    ZFS (and BTRFS) are a *possible backup* solution, but that has not much to do with checksumming, bit-perfect control, or being bullet-proof in any other way.

    Shit happens. Today it was a bad fsck.ext4, tomorrow it might be something completely different (hardware failure, a bug in git itself, whatever)...
    No matter how much you argue that problem XyZ would have been prevented by special feature #1322 of filesystem AbC, one day some unpredicted problem will fry your setup. It's not a matter of robustness, it's a matter of *TIME*. All things fail eventually, for one reason or another.

    A proper backup solution isn't a mirror. A proper backup solution is the one which can answer the case: "Oh shit! I don't have a proper copy of file Foo.Bar; I need to go back a few months, to when I still had the correct version around".

    ZFS and BTRFS also happen to be good solutions for that. Not because of robustness, but because they feature "copy-on-write", data deduplication and so on. Thus snapshotting comes for "free".

    If your precious data is periodically duplicated (simply mirrored) to a backup server which runs BTRFS and does periodic snapshots (daily for the last 2 weeks, weekly for the last 2 months, monthly for the last year, and yearly since the beginning of the project), you would be safe. Even if your data got silently corrupted, with nobody noticing, and the corrupted files got mirrored/duplicated along with everything else:
    should you realise that garbage has appeared in your files due to some broken fsck you ran last week, just grab the copy from the day prior to the problem out of the backup.
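
    A sketch of the daily part of such a rotation (subvolume paths are hypothetical): one read-only snapshot per day, and anything older than two weeks gets dropped (the weekly/monthly/yearly jobs would keep their own, longer-lived snapshots).

        #!/bin/bash
        # Hypothetical daily snapshot job on the BTRFS backup server.
        DATA=/backup/kde-mirror          # subvolume the precious data is mirrored into
        SNAPS=/backup/snapshots          # where the read-only snapshots accumulate

        btrfs subvolume snapshot -r "$DATA" "$SNAPS/daily-$(date +%F)"

        # Expire daily snapshots older than 14 days (ISO dates sort lexically)
        CUTOFF=$(date -d '14 days ago' +%F)
        for snap in "$SNAPS"/daily-*; do
            [[ -d "$snap" ]] || continue
            day=${snap##*daily-}
            [[ "$day" < "$CUTOFF" ]] && btrfs subvolume delete "$snap"
        done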

    BTRFS and ZFS are just more modern alternatives to what was previously achieved with regular POSIX-compliant filesystems and a combo of rsync + hardlinks + cron. rsync isn't a magical backup bullet because it can make bit-perfect copies by checksumming everything; it's a good backup solution because, when configured correctly, the hardlinks can keep older versions around without eating too much space (see the sketch below).
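
    A minimal sketch of that trio (directory names invented): run nightly from cron, and rsync hardlinks every file that hasn't changed since yesterday's copy, so each daily "full" backup only costs the space of what actually changed.

        #!/bin/bash
        # Hypothetical nightly rsync backup with hardlink-based deduplication.
        SRC=/srv/git/                            # live data (trailing slash: copy contents)
        DEST=/backup/kde
        TODAY="$DEST/$(date +%F)"
        YESTERDAY="$DEST/$(date -d yesterday +%F)"

        # Unchanged files become hardlinks into yesterday's copy instead of new data.
        rsync -a --delete --link-dest="$YESTERDAY" "$SRC" "$TODAY"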

    So yes, ZFS will be a good solution. Not because it has some magical immunity to problems that makes it a *substitute* for classic backups (yes, it is resilient to some types of problems; that comes as a plus), but because it can be configured to work as an *actual backup*: something which helps you move back in time to before the problem happened. (As does BTRFS, and as did the rsync/hardlink/cronjob trio before CoW filesystems became the latest craze.)

  • ryao
    replied
    Originally posted by Sidicas View Post
    I was pretty sure it was a git fsck that took it down? It's not to be confused with a fsck which checks the filesystem; it's a git sanity check. It's not clear that filesystem corruption was the cause of the problem. It could have been data corruption caused by a software glitch or the VM not terminating properly.

    In any case, ZFS is still not a substitute for backups. So a failure mode slightly different could have potentially been equally catastrophic.
    Your belief is wrong. fsck is supposed to leave repair to the system administrator. `git fsck` adheres to that concept, but fsck.ext4 violates it. What damaged the git repository was fsck.ext4 when it tried to do the system administrator's job for him. ZFS is designed to prevent the issues that fsck is meant to detect. Consequently, it cannot be broken in the manner that ext4 was.

    With that said, the people that claim ZFS is a substitute for backups are those trying to refute the idea, such as yourself. No one who uses ZFS makes such claims.
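
    The difference in philosophy is visible straight from the command line (the device and repository paths below are only examples): git fsck reports and touches nothing, while e2fsck has modes that rewrite the filesystem on their own.

        # git: report problems, change nothing; repair is left to the administrator
        git --git-dir=/srv/git/example-project.git fsck --full

        # ext4: -n only reports, while -p and -y let fsck modify the filesystem itself
        e2fsck -n /dev/sdb1     # read-only check, answer "no" to everything
        e2fsck -p /dev/sdb1     # "preen": silently fix whatever it deems safe
        e2fsck -y /dev/sdb1     # answer "yes" to every repair prompt
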
    Last edited by ryao; 03-25-2013, 10:55 PM.

  • smitty3268
    replied
    Originally posted by Sidicas View Post
    In any case, ZFS is still not a substitute for backups. So a failure mode slightly different could have potentially been equally catastrophic.
    What's the saying? When you've got a nice hammer, everything starts looking like a nail?

    ZFS is nice. It's not the answer to everything.

  • Sidicas
    replied
    Originally posted by ryao View Post
    The fsck that took down their mirror network would not have occurred had ZFS been used. Snapshots are irrelevant as far as that point is concerned.
    I was pretty sure it was a git fsck that took it down? It's not to be confused with a fsck which checks the filesystem; it's a git sanity check. It's not clear that filesystem corruption was the cause of the problem. It could have been data corruption caused by a software glitch or the VM not terminating properly.

    In any case, ZFS is still not a substitute for backups. So a failure mode slightly different could have potentially been equally catastrophic.
    Last edited by Sidicas; 03-25-2013, 09:47 PM.

  • ryao
    replied
    Originally posted by Sidicas View Post
    Read the KDE blogs... KDE claims there is too much churn in the data on a git server with 1500 repositories, so you wouldn't have been able to get a workable ZFS snapshot at any point without taking the server or one of the mirrors offline. Since everything was mirroring from the same server, they pretty much had a central point of failure, and they also didn't want to take it offline because they wanted 24/7 availability.
    The fsck that took down their mirror network would not have occurred had ZFS been used. Snapshots are irrelevant as far as that point is concerned.
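
    For what it's worth, ZFS's consistency check also runs while the pool stays in service: it verifies every block against its checksum, heals what it can from redundant copies, and reports whatever it could not fix (the pool name is just an example).

        zpool scrub tank        # walk the whole pool, verifying checksums in the background
        zpool status -v tank    # show scrub progress/results, including any damaged files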
