KDE Almost Lost All Of Their Git Repositories


  • #31
    Originally posted by ryao View Post
    The entire ZFS core team resigned months after Oracle killed OpenSolaris, along with plenty of other Solaris engineers. They now work on Illumos at various companies (with the ZFS developers mostly at Delphix). They have fixed plenty of bugs that were in the last release of OpenSolaris. These bugs affect Oracle's Solaris, but Oracle cannot merge the fixes without releasing its source code. The resignation of so many of Oracle's engineers has also led to a situation where Oracle no longer has the engineering talent needed to continue development of many of Solaris' core innovations, which include ZFS and DTrace. With that in mind, you would be better off with OmniOS:

    illumos based server OS with ZFS, DTrace, Crossbow, SMF, Bhyve, KVM and Linux zone support


    It is an Illumos distribution with commercial support. In many ways, they are to Illumos what Red Hat is to Linux.
    Yeah, I've heard many good things about it but haven't had a chance to test it myself. Still, many executives will go for Solaris or Oracle Linux just because Oracle says so, though it would be interesting for other projects outside the big-iron market.



    • #32
      Originally posted by ryao View Post
      This is the silent corruption that ZFS was created to prevent. Unfortunately, KDE's servers were not using it.
      Read the KDE blogs. KDE claims there is too much churn in the data on a git server with 1500 repositories, so you wouldn't have been able to get a workable ZFS snapshot at any point without taking the server or one of the mirrors offline. Since everything was mirroring from the same server, they pretty much had a single point of failure, and they also didn't want to take it offline because they wanted 24/7 availability.

      As for ZFS stopping the corruption altogether, you don't know that. The corruption could have been caused by a software glitch in the VM software, git, or the VM guest OS, and wouldn't be detected at all by ZFS since the data could have been bit-perfect but still invalid. It might have been data corruption, but perhaps not filesystem-level corruption.

      So while ZFS may or may not have helped here, it's still no substitute for backups.




      Originally posted by ryao View Post
      Unfortunately, proper backups would not have helped as much as one would think without a way to know whether or not the data was bad prior to doing them. It is not clear when the repositories became corrupted, although the blog post suggests that fsck was the trigger.
      Agreed. Although KDE cited their problem as being that backing up git servers isn't properly documented.

      There is too much churn in the data, which makes it difficult, if not impossible, to take a snapshot or backup of their main server at any time and be guaranteed it's consistent.

      Typically you would take the server offline for maintenance in these kinds of situations, but KDE believed they had a better solution for backups by using git --mirror.

      Unfortunately, it appears git --mirror is a lot like rsync --delete: if things get corrupted at the source, that corruption will be mirrored on the next sync. KDE claims this isn't properly documented behavior of --mirror, and they're probably right.
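
      To illustrate the analogy, here is a minimal sketch (the URL and paths are made up, not KDE's) of how a mirror clone is forced to match whatever state the master currently advertises, with no memory of what it used to look like, much like an rsync --delete run:

          # Hypothetical URL and paths, for illustration only.
          git clone --mirror git://anongit.example.org/project.git /srv/mirror/project.git
          cd /srv/mirror/project.git
          git remote update --prune    # force-updates every ref to the master's current state;
                                       # refs that vanished upstream are simply pruned here,
                                       # conceptually just like:
                                       #   rsync -a --delete master:/srv/git/ /srv/mirror/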

      Yes, KDE had tarballs, but they didn't have tarballs for backup purposes. The tarballs were individual tarballs of each repository, meant to make it easier to download repository contents; they did not contain everything needed to restore all the content that was on the git server. git bundle seems to suffer from the same problem in that it just doesn't copy everything. The only way to copy everything and keep the main server online 24/7, AFAIK, is with --mirror.


      So KDE was right that git --mirror was probably the best way to go if they didn't want to take the server down for maintenance. What they failed to do was the following (a rough sketch of steps 1 and 2 follows the list):
      1. Take a mirror server offline and run a git fsck to make sure that what it mirrored was sane.

      1a. AFAIK, all of KDE's mirrors were always online, and KDE claims that since the servers were all online, the data churn on them made it impossible to get a sane snapshot for tarballs, ZFS, or any other backup purpose. Thus, it should have been obvious to KDE, IMO, that they needed to take the main server or one of the mirrors offline to do proper backups. Since everything is mirroring off of the main server, and git fsck is known to take forever on 1500 repositories (as KDE claims), it makes sense to take one of the mirrors down and use it to do the backups.


      1b. KDE claims the corruption might have started months ago. If they had run git fsck on any of the mirrors after syncing it and taking it offline, they would have discovered the problem months ago.

      2. After running git fsck on the mirror, they should then take a snapshot/backup of it. They'd be guaranteed that all their snapshots/backups were sane, because the mirror is offline so people aren't pushing to it, and it's also git fsck clean.

      3. Keep backups that date back a solid year. I'm shocked that people still don't do this, but it should be obvious. Also, make sure those backups are stored at least a couple hundred miles away from your nearest online server; I've known plenty of companies that spent a lot of money doing backups to tape, only to have them wiped out in the same natural disaster.
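
      A rough sketch of what steps 1 and 2 could look like, assuming a hypothetical mirror path (/srv/git-mirror) and ZFS dataset name (tank/git-mirror), neither of which is KDE's actual setup:

          #!/bin/sh
          # Back up an offline mirror: sync, verify, then snapshot.
          for repo in /srv/git-mirror/*.git; do
              ( cd "$repo" &&
                git remote update --prune &&   # sync this mirror from the master one last time
                git fsck --full                # step 1: refuse to back up anything corrupt
              ) || exit 1
          done
          # Step 2: snapshot only once every repository has passed git fsck.
          zfs snapshot tank/git-mirror@"$(date +%F)"

      Running this on a mirror that has been taken out of rotation keeps the main server online while the slow git fsck pass runs.
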
      Last edited by Sidicas; 25 March 2013, 09:30 PM.



      • #33
        Originally posted by Sidicas View Post
        Read the KDE blogs. KDE claims there is too much churn in the data on a git server with 1500 repositories, so you wouldn't have been able to get a workable ZFS snapshot at any point without taking the server or one of the mirrors offline. Since everything was mirroring from the same server, they pretty much had a single point of failure, and they also didn't want to take it offline because they wanted 24/7 availability.
        The fsck that took down their mirror network would not have occurred had ZFS been used. Snapshots are irrelevant as far as that point is concerned.



        • #34
          Originally posted by ryao View Post
          The fsck that took down their mirror network would not have occurred had ZFS been used. Snapshots are irrelevant as far as that point is concerned.
          I was pretty sure it was a git fsck that took it down? It's not to be confused with the fsck that checks the filesystem; it's a git sanity check. It's not clear that filesystem corruption was the cause of the problem. It could have been data corruption caused by a software glitch or the VM not terminating properly.

          In any case, ZFS is still not a substitute for backups. So a slightly different failure mode could have potentially been equally catastrophic.
          Last edited by Sidicas; 25 March 2013, 09:47 PM.



          • #35
            Originally posted by Sidicas View Post
            In any case, ZFS is still not a substitute for backups. So a slightly different failure mode could have potentially been equally catastrophic.
            What's the saying? When you've got a nice hammer, everything starts looking like a nail?

            ZFS is nice. It's not the answer to everything.



            • #36
              Originally posted by Sidicas View Post
              I was pretty sure it was a git fsck that took it down? It's not to be confused with the fsck that checks the filesystem; it's a git sanity check. It's not clear that filesystem corruption was the cause of the problem. It could have been data corruption caused by a software glitch or the VM not terminating properly.

              In any case, ZFS is still not a substitute for backups. So a slightly different failure mode could have potentially been equally catastrophic.
              Your belief is wrong. fsck is supposed to leave repair to the system administrator. `git fsck` adheres to that concept, but fsck.ext4 violates it. What damaged the git repository was fsck.ext4 when it tried to do the system administrator's job for him. ZFS is designed to prevent the issues that fsck is meant to detect. Consequently, it cannot be broken in the manner that ext4 was.

              With that said, the people who claim ZFS is a substitute for backups are those trying to refute the idea, such as yourself. No one who uses ZFS makes such claims.
              Last edited by ryao; 25 March 2013, 10:55 PM.



              • #37
                ZFS (and BTRFS) *are* solutions

                ZFS (and BTRFS) are a *possible backup* solution, but that has not much to do with checksumming, bit-perfect control, or being bullet-proof in any other way.

                Shit happens. Today it was a bad fsck.ext4, tomorrow it might be something completely different (hardware failure, a bug in git itself, whatever)...
                No matter how much you argue that problem XyZ would have been prevented by that special feature #1322 of filesystem AbC, the day will come when some unpredicted problem fries your setup. It's not a matter of robustness, it's a matter of *TIME*. All things fail eventually, for one reason or another.

                A proper backup solution isn't a mirror. A proper backup solution is one which can answer the case: "Oh shit! I don't have a proper copy of file Foo.Bar; I need to go back a few months, to when I still had the correct version around."

                ZFS and BTRFS also happen to be good solutions for that. Not because of robustness, but because they feature copy-on-write, data deduplication, and so on, which means snapshotting comes for "free".

                If your precious data is periodically duplicated (simply mirrored) to a backup server which runs BTRFS and takes periodic snapshots (daily for the last 2 weeks, weekly for the last 2 months, monthly for the last year, and yearly since the beginning of the project), you would be safe even if the data got silently corrupted, with nobody noticing, and the corrupted files got mirrored along with everything else.
                Should you realise that garbage has appeared in your files due to some broken fsck you ran last week, just grab the copy from the day prior to the problem from the backup.
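
                As a concrete (entirely hypothetical) example of that kind of setup, a nightly cron job on the backup server might look roughly like this; the host name, paths, and btrfs subvolume layout are placeholders:

                    #!/bin/sh
                    # Refresh the plain mirror, then freeze it as a read-only btrfs snapshot.
                    # /backup/data must itself be a btrfs subvolume for the snapshot to work.
                    rsync -a --delete server:/srv/data/ /backup/data/
                    btrfs subvolume snapshot -r /backup/data /backup/snapshots/data-"$(date +%F)"
                    # Thinning the history (keep dailies for 2 weeks, weeklies for 2 months, ...)
                    # is then just a matter of deleting the snapshots you no longer want, e.g.:
                    #   btrfs subvolume delete /backup/snapshots/data-2013-01-07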

                BTRFS and ZFS are just more modern alternatives to what was previously achieved with regular POSIX-compliant filesystems and a combo of rsync + hardlinks + cron. rsync isn't a magical backup bullet because it can make a bit-perfect copy by checksumming everything; it's a good backup solution because, when configured correctly, the hardlinks can keep older versions around without eating too much space.
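
                For comparison, a minimal sketch of that older rsync + hardlinks approach (host and paths again invented): each daily directory looks like a full copy, but unchanged files are hardlinked against the previous run and cost almost no extra space.

                    #!/bin/sh
                    # Hypothetical daily cron job using rsync's --link-dest hardlink trick.
                    TODAY=$(date +%F)
                    rsync -a --delete --link-dest=/backup/latest \
                          server:/srv/data/ /backup/"$TODAY"/
                    ln -sfn /backup/"$TODAY" /backup/latest   # point "latest" at the newest copy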

                So yes, ZFS will be a good solution. Not because it has some magical immunity to problems making it a *substitute* for classic backups (yes, it is resilient to some types of problems; that comes as a plus), but because it can be configured to work as an *actual backup*: something which helps you move back in time to before the problem happened. (As does BTRFS, and as did the rsync/hardlink/cronjob trio before CoW filesystems became the latest craze.)



                • #38
                  Git-specific alternative

                  Or in their specific case, KDE could have a bunch of servers whose "git --mirror" lags behind on purpose, so they contain older versions of the git repositories.

                  Compared to the other solutions (btrfs/zfs, rsync+hardlinks+cronjob, etc.) it is a bit more expensive, as it requires more resources (this setup doesn't use any form of copy-on-write to reduce the duplication).
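
                  A deliberately lagging mirror can be as simple as a mirror clone whose sync job runs infrequently; something like this hypothetical crontab entry (path invented) would leave the copy up to a week behind the master:

                      # Sync the lagging mirror only on Sundays at 03:00.
                      0 3 * * 0  cd /srv/lagging-mirror/project.git && git remote update --prune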



                  • #39
                    The key take-home that everybody seems to be missing is the VIRTUAL MACHINE. It doesn't matter what filesystem they were using if the filesystem itself isn't the cause of its own corruption.

                    Now as for data protection against this kind of problem, there are multiple kinds of protection that can and should be implemented here. To start with, it was a git server. Mirroring over git does build in a degree of safety, because you can step back in time to before the corruption began. Now if the git servers were being mirrored at the filesystem level, that would just replicate the corruption and make recovery impossible.

                    Backups are a second mechanism for ensuring data safety. Incremental backups using an incremental filesystem like btrfs, for example, or just brutal permanent backups to which you can roll back.



                    • #40
                      Git was the problem.

                      Originally posted by droidhacker View Post
                      To start with, it was a git server. Mirroring over git does build in a degree of safety, because you can step back in time to before the corruption began.
                      THAT WAS THE WHOLE problem to begin with. They managed to hit a corner case (i.e. a specific type of behaviour that happens when "--mirror" is in play) where corrupted objects were getting copied over and the servers began making perfect copies of the corruption that was on the master.

                      Originally posted by droidhacker View Post
                      Now if the git servers were being mirrored at the filesystem level, that would just replicate the corruption and make recovery impossible.
                      That's how it ended up looking: as if the corrupted files got copied all over, because they hit a specific path where the usual sanity checks in git don't happen.


                      The people at KDE *are* competent. They know their way around administration. But they ran into two sets of problems:
                      - Technical constraints: it's not easy to deploy a perfect backup solution with limited resources.
                      - A corner case: they hit a situation where git doesn't behave as they thought it would, and corruption got passed around.


                      Speaking of filesystems, that's also a problem they pointed out. One of the problems they had is that there is no way to make sure that the files on the filesystem are consistent while git is live (well, that's normal in a way). The only way would be to force git to check its own consistency before making the tarball, rsync'ing, etc.
                      But that is computationally expensive (and again, they have limited resources. They do know their way around administration; any good sysadmin can in theory set up a perfect backup system. The problem is doing it at KDE's scale on a shoestring budget).
                      Either you freeze the whole git service for a long time and wait until everything settles down in order to have consistent files to back up (but that blocks users),
                      or you need a dedicated machine which pulls an (unverified) mirror of the repos, runs the consistency checks, and only then performs a classic filesystem backup (but in this case, that requires CPU cycles, which aren't cheap in a data center).
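
                      That second option is essentially a "pull, verify, then back up" cycle; a rough sketch of what such a dedicated machine could run (paths invented, not KDE's setup):

                          #!/bin/sh
                          # Hypothetical verify-then-backup cycle on a dedicated machine.
                          for repo in /srv/pull-mirror/*.git; do
                              ( cd "$repo" && git remote update --prune && git fsck --full ) || exit 1
                          done
                          # Only data that passed git fsck reaches the classic file-level backup.
                          tar czf /backup/git-"$(date +%F)".tar.gz -C /srv pull-mirror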

                      The alternative they have is to keep whole old git mirrors. Their problem is that, at ~25GB per full mirror, keeping a set of them quickly adds up to around 1TB of backup data. And although a simple 1TB drive is cheap in a parts shop, it's going to be quite expensive in a data center.

                      Originally posted by droidhacker View Post
                      Backups are a second mechanism for ensuring data safety. Incremental backups using an incremental filesystem like btrfs, for example, or just brutal permanent backups to which you can roll back.
                      Again, their difficulty is achieving this without interrupting the git service, while still having files which are actually consistent and can be reused in case of a restore. All within what they can afford.


                      In the end they decided to:
                      - update their current backup strategy so it doesn't hit the corner case that caused all this mess, and put a better strategy in place to detect when this kind of situation manages to happen anyway (intelligently monitor the repo list and trigger a warning if anything weird happens to it);
                      - keep a couple of older mirrors (within what they can afford: it will only be a couple of backups, not a few dozen spanning the whole range of daily/weekly/monthly backups);
                      - set up a new dedicated server which does the mirror + consistency-check + classic file-backup cycle.

