KDE Almost Lost All Of Their Git Repositories


  • #21
    Originally posted by Pawlerson View Post
    You've got to be kidding me. It's a dead cow. Tell me why nearly nobody is using it? Btw. what are you doing for Gentoo? Last time you were trolling for bsd and now you're trolling for slowlaris.
    Well, to be honest, hardly anybody uses Solaris because it is a lot more complicated than Linux itself, especially on SPARC hardware. But Solaris is still a hell of an OS for server-side operations, and ZFS is quite an awesome FS (not the fastest, but it has many great features).

    Sure, I prefer Linux hands down, but Solaris is not about speed and never was. Solaris is for fault-tolerant environments, and in that sense it has been king of the hill for two decades, alongside AIX and IBM's mainframe OS.

    So you are gravely mistaken if you see Solaris/AIX as Linux rivals; they are not. For example:

    1.) Solaris: fault-tolerant environments on certified hardware, normally tied to hugely massive databases, Java application servers, or massive SAP systems that require at least 99.99% uptime and a ridiculously long certified ABI-compatibility window. (In recent years Red Hat has taken a piece of the pie here.)
    2.) AIX: in many countries it is your bank's best friend, from workstations to System p data centers integrated with mainframe systems, and it is only considered for the most secure installations. (It is EXPENSIVE, but if you already pay $1 million in hardware alone, the rest won't hurt your budget too much.)
    3.) Linux: versatile enough, secure enough, and stable enough for everything else.

    Sure, Linux is faster and all the rest, but the systems in examples 1 and 2 are multi-million-dollar systems with strict warranties and very expensive experts handling them on site, so these people are going to go with the most recommended/tested/accepted platform available, the one that guarantees they get their money back. For now, AIX and Solaris are the kings there, with RHEL fighting its way up.



    • #22
      Originally posted by Pawlerson View Post
      I'm not mistaken at all. An operating system and a company that do things like this can't be taken seriously:

      http://blog.lastinfirstout.net/2010/...oss-still.html

      Those "massive" SAP systems are jokes compared to HPC or SGI big irons. I hope you'll agree.
      There is a saying in journalism: if a dog bites a man, it is not news, but if a man bites a dog, it is news. The fact that it is news when Solaris has such bugs is a much better situation than the one with Linux, where such bugs are so commonplace that no one cares.



      • #23
        Originally posted by Pawlerson View Post
        I'm not mistaken at all. An operating system and a company that do things like this can't be taken seriously:

        http://blog.lastinfirstout.net/2010/...oss-still.html

        Those "massive" SAP systems are jokes compared to HPC or SGI big irons. I hope you'll agree.
        Well, in most cases yes. Some SAP systems are massive enough to make those look like tiny laptops, but they aren't the common case, I agree.

        About your link: I agree it is quite a failure, but it is also true that this doesn't affect Sun/Oracle's main targets, since it is very unlikely you will ever hit it in a properly certified datacenter. It mostly applies to companies with cheap datacenters that have a couple of servers running Oracle/Solaris for some DBs.

        For example, I worked with Sun (as it was back then; not at Sun, but as a client-side analyst and certifier) in 1999, in a datacenter that cost 15 million USD, where the certification covered everything from the building structure to the software. I can say the last time any of those servers was shut down was in 2001, after a 90-hour power failure, and since then they have only been restarted for upgrades (Oracle handles it now). In all these years Oracle/SAP and Solaris have never failed once, the SPARC servers have never failed either, and in the SAN-ish storage that came with it (Hitachi) only one disk has died; ZFS (this was close to the Solaris 11 release) rebuilt the RAID like a champ without failing any service. This datacenter is expected to stay in production until 2015. That is the common case for Solaris, and the same applies to AIX.

        The point being: in a properly certified Solaris/AIX system, something like an abrupt shutdown is not a tolerable issue to begin with; it implies a grave mistake was made somewhere in the certification process, or by a very bad sysadmin. And data corruption is mostly nil, since those datacenters are heavily redundant and clustered, which makes it very hard to trigger this kind of failure in a fatal fashion.

        These kinds of errors are less likely to matter on Linux, though, since its main target is not those uber-datacenters: you can expect power failures and many other issues, and the server has to recover properly every time, as far as possible.

        Not saying you are wrong, though. I wouldn't use Solaris to host an important Oracle DB if I had a small, non-certified datacenter with a couple of servers; I would use RHEL or Oracle Linux hands down. But for an N-million-dollar certified datacenter I'd go with Solaris/AIX, eyes closed, without a shred of doubt. For HPC I'd go Linux hands down too.



        • #24
          Originally posted by jrch2k8 View Post
          I wouldn't use Solaris to host an important Oracle DB if I had a small, non-certified datacenter with a couple of servers; I would use RHEL or Oracle Linux hands down. But for an N-million-dollar certified datacenter I'd go with Solaris/AIX, eyes closed, without a shred of doubt. For HPC I'd go Linux hands down too.
          The entire ZFS core team resigned months after Oracle killed Open Solaris, along with plenty of other Solaris engineers. They now work on Illumos at various companies (with the ZFS developers mostly at Delphix). They have fixed plenty of bugs that were in the last release of Open Solaris. These bugs affect Oracle's Solaris, but Oracle cannot merge the fixes without releasing its source code. The resignation of so many of Oracle's engineers has also led to a situation where Oracle no longer has the engineering talent needed to continue development of many of Solaris' core innovations, which include ZFS and DTrace. With that in mind, you would be better off with OmniOS:

          http://omnios.omniti.com/

          It is an Illumos distribution with commercial support. In many ways, they are to Illumos what Red Hat is to Linux.



          • #25
            Originally posted by ryao View Post
            The entire ZFS core team resigned months after Oracle killed Open Solaris, along with plenty of other Solaris engineers. They now work on Illumos at various companies (with the ZFS developers mostly at Delphix). They have fixed plenty of bugs that were in the last release of Open Solaris. These bugs affect Oracle's Solaris, but Oracle cannot merge the fixes without releasing its source code. The resignation of so many of Oracle's engineers has also led to a situation where Oracle no longer has the engineering talent needed to continue development of many of Solaris' core innovations, which include ZFS and DTrace. With that in mind, you would be better off with OmniOS:

            http://omnios.omniti.com/

            It is an Illumos distribution with commercial support. In many ways, they are to Illumos what Red Hat is to Linux.
            Yeah, I've heard many good things about it, but I haven't had a chance to test it myself. Still, many executives will go with Solaris or Oracle Linux just because Oracle says so; it could be interesting for other projects outside the big-iron market, though.



            • #26
              Originally posted by ryao View Post
              This is the silent corruption that ZFS was created to prevent. Unfortunately, KDE's servers were not using it.
              Read the KDE blogs. KDE claims there is too much churn in the data on a git server with 1500 repositories, so you wouldn't have been able to get a workable ZFS snapshot at any point without taking the server or one of the mirrors offline. Since everything was mirroring from the same server, they pretty much had a single point of failure, and they didn't want to take it offline because they wanted 24/7 availability.

              As for ZFS stopping the corruption altogether, you don't know that. The corruption could have been caused by a software glitch in the VM software, git, or the VM guest OS, and it wouldn't have been detected at all by ZFS, since the data could have been bit-perfect but still invalid. It might have been data corruption, but perhaps not filesystem-level corruption.

              So while ZFS may or may not have helped here, it's still no substitute for backups.
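
              To make that distinction concrete, here is a rough sketch (the pool name tank/git and the repository path are made up, not KDE's setup): a ZFS scrub verifies that every block still matches its checksum, i.e. it catches bit rot, while only git's own consistency check can say whether the repository is valid at the application level.

                # Bit-level integrity: ZFS re-reads every block and compares it
                # against its stored checksum.
                zpool scrub tank
                zpool status -v tank    # lists any checksum errors it found

                # Application-level integrity: a repository can be bit-perfect
                # on disk and still contain garbage written by a buggy VM,
                # git, or guest OS.
                git --git-dir=/tank/git/kdelibs.git fsck --full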




              Originally posted by ryao View Post
              Unfortunately, proper backups would not have helped as much as one would think without a way to know whether or not the data was bad prior to doing them. It is not clear when the repositories became corrupted, although the blog post suggests that fsck was the trigger.
              Agreed, although KDE cited as part of their problem that backing up git servers isn't properly documented.

              There is too much churn in the data, which makes it difficult, if not impossible, to take a snapshot or backup of their main server at any given time and be guaranteed it's consistent.

              Typically you would take the server offline for maintenance in these kinds of situations, but KDE believed they had a better solution for backups by using git --mirror.

              Unfortunately, it appears git --mirror behaves a lot like rsync --delete: if things get corrupted at the source, that corruption will be mirrored on the next sync. KDE claims this isn't properly documented behavior of --mirror, and they're probably right.

              Yes, KDE had tarballs, but they weren't tarballs made for backup purposes. They were individual tarballs of each repository, meant to make it easier to download repository contents, and they did not contain everything needed to restore all the content that was on the git server. git bundle seems to suffer from the same problem, in that it just doesn't copy everything. The only way to copy everything and keep the main server online 24/7, AFAIK, is with --mirror.
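
              For context, this is roughly how such a mirror gets kept in sync (the hostname and repository name below are placeholders, not KDE's actual infrastructure). The fetch force-updates every ref to whatever the primary currently advertises, so history that was lost or corrupted upstream simply replaces the good copy downstream, much like rsync --delete.

                # One-time setup of a bare mirror of the primary server.
                git clone --mirror git://anongit.example.org/kdelibs.git kdelibs.git

                # Periodic sync (e.g. from cron): force-updates all refs and
                # prunes anything the primary no longer has -- including good
                # data that was lost or corrupted there.
                git --git-dir=kdelibs.git fetch --prune origin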


              So KDE was right that git --mirror was probably the best way to go if they didn't want to take the server down for maintenance. What they failed to do was the following (a rough sketch of the whole procedure follows the list):
              1. Take a mirror server offline and run a git fsck to make sure that what it mirrored was sane.

              1a. AFAIK, all of KDE's mirrors were always online, and KDE claims that since the servers were all online, the data churn on them made it impossible to get a sane snapshot for tarballs, ZFS, or any other backup purpose. Thus it should have been obvious to KDE, IMO, that they needed to take the main server or one of the mirrors offline to do proper backups. Since everything mirrors off the main server, and git fsck is known to take forever on 1500 repositories (as KDE claims), it makes sense to take one of the mirrors down and use it to do the backups.


              1b. KDE claims the corruption might have started months ago. If they had run git fsck on any of the mirrors after syncing it and taking it offline, they would have discovered the problem months ago.

              2. After running git fsck on the mirror, they should then take a snapshot/backup of it. They'd be guaranteed that all their snapshots/backups were sane, because the mirror is offline (so nobody is pushing to it) and it is also git fsck clean.

              3. Keep backups that go back a solid year. I'm shocked that people still don't do this, but it should be obvious. Also, make sure those backups are stored at least a couple hundred miles away from your nearest online server, because I've known plenty of companies that spent a lot of money doing backups to tape, only to have them wiped out in the same natural disaster as the servers.
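
              Putting steps 1-3 together, a backup run on the offline mirror could have looked something like this. This is only a sketch: the paths, the retention period, and the one-repo-per-directory layout are assumptions, not KDE's actual setup.

                #!/bin/sh
                # Run on a dedicated mirror that has been taken out of the sync
                # rotation, so nothing is fetched into or pushed to it meanwhile.

                MIRROR=/srv/git-mirror        # one bare repository per project
                BACKUPS=/srv/backups          # ideally replicated far off-site
                DATE=$(date +%Y-%m-%d)

                # Step 1: verify every repository before trusting it (slow on
                # 1500 repos, which is exactly why it runs on an offline mirror).
                for repo in "$MIRROR"/*.git; do
                    git --git-dir="$repo" fsck --full || { echo "CORRUPT: $repo"; exit 1; }
                done

                # Step 2: only after everything passes fsck, take the backup.
                tar -czf "$BACKUPS/git-mirror-$DATE.tar.gz" -C "$MIRROR" .

                # Step 3: keep about a year of history instead of overwriting
                # the only copy.
                find "$BACKUPS" -name 'git-mirror-*.tar.gz' -mtime +365 -delete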
              Last edited by Sidicas; 03-25-2013, 09:30 PM.



              • #27
                Originally posted by Sidicas View Post
                Read the KDE blogs. KDE claims there is too much churn in the data on a git server with 1500 repositories, so you wouldn't have been able to get a workable ZFS snapshot at any point without taking the server or one of the mirrors offline. Since everything was mirroring from the same server, they pretty much had a single point of failure, and they didn't want to take it offline because they wanted 24/7 availability.
                The fsck that took down their mirror network would not have occurred had ZFS been used. Snapshots are irrelevant as far as that point is concerned.



                • #28
                  Originally posted by ryao View Post
                  The fsck that took down their mirror network would not have occurred had ZFS been used. Snapshots are irrelevant as far as that point is concerned.
                  I was pretty sure it was a git fsck that took it down? It's not to be confused with an fsck that checks the filesystem; it's a git sanity check. It's not clear that filesystem corruption was the cause of the problem. It could have been data corruption caused by a software glitch or by the VM not terminating properly.

                  In any case, ZFS is still not a substitute for backups, so a slightly different failure mode could potentially have been equally catastrophic.
                  Last edited by Sidicas; 03-25-2013, 09:47 PM.



                  • #29
                    Originally posted by Sidicas View Post
                    In any case, ZFS is still not a substitute for backups, so a slightly different failure mode could potentially have been equally catastrophic.
                    What's the saying? When you've got a nice hammer, everything starts looking like a nail?

                    ZFS is nice. It's not the answer to everything.



                    • #30
                      Originally posted by Sidicas View Post
                      I was pretty sure it was a git fsck that took it down? It's not to be confused with an fsck that checks the filesystem; it's a git sanity check. It's not clear that filesystem corruption was the cause of the problem. It could have been data corruption caused by a software glitch or by the VM not terminating properly.

                      In any case, ZFS is still not a substitute for backups, so a slightly different failure mode could potentially have been equally catastrophic.
                      Your belief is wrong. fsck is supposed to leave repair to the system administrator. `git fsck` adheres to that concept, but fsck.ext4 violates it. What damaged the git repository was fsck.ext4 when it tried to do the system administrator's job for him. ZFS is designed to prevent the issues that fsck is meant to detect. Consequently, it cannot be broken in the manner that ext4 was.

                      With that said, the people that claim ZFS is a substitute for backups are those trying to refute the idea, such as yourself. No one who uses ZFS makes such claims.
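
                      To make that distinction concrete (invocations only; the device and repository paths are placeholders): git fsck only reports problems and leaves any repair to the administrator, whereas fsck.ext4 will, unless told otherwise, rewrite filesystem metadata to "fix" what it finds.

                        # Reports dangling or corrupt objects; never modifies the repository.
                        git --git-dir=/srv/git/kdelibs.git fsck --full

                        # Check only, change nothing (answers "no" to every repair prompt).
                        fsck.ext4 -n /dev/sdb1

                        # Preen mode: silently repairs "safe" problems on its own -- the
                        # kind of automatic fixing being criticized above.
                        fsck.ext4 -p /dev/sdb1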
                      Last edited by ryao; 03-25-2013, 10:55 PM.

