FreeDesktop.org GitLab Down Due To Drive Failures

  • #11
    According to the chat log, they have a Ceph problem. Seems to be the same as https://lists.ceph.io/hyperkitty/lis...5VTO5NPP6VFVX/ or its numerous duplicates from the last few months. Conclusion: never use Ceph, otherwise you'll eventually have to pay 1000000 USD to a data recovery specialist.



  • #12
    Well... two drives and it's down.
    Is it a RAID 5 system? With two drives gone, it's game over.
    Or maybe they had a RAID 0 for the OS (2 SSDs) and a different RAID for the *real* storage.

    For *real* storage I always go with HDDs. Life is too short to trust SSDs.



  • #13
    Originally posted by tuxd3v View Post
    Well... two drives and it's down.
    Is it a RAID 5 system? With two drives gone, it's game over.
    Or maybe they had a RAID 0 for the OS (2 SSDs) and a different RAID for the *real* storage.

    For *real* storage I always go with HDDs. Life is too short to trust SSDs.
    RAID 1 OS SSDs would definitely cause a server to fall over if two drives died, and that would be game over. However, if you do things "right", it's less of an issue: no OS drive should ever contain "real" data. All real data should live on non-root drives, with an acceptable redundancy and backup strategy in place for those drives. I haven't seen any mention of exactly what went wrong, so there's no telling yet whether it was the data drives or the OS drives that failed.
    All opinions are my own, not those of my employer, if you know who they are.
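
    As a rough illustration of that kind of layout check, the sketch below (with a hypothetical data path, and nothing specific to freedesktop.org's setup) walks lsblk's JSON output to confirm that a data directory is not backed by the same physical disk as the root filesystem:

    import json
    import subprocess

    def disks_backing(mountpoint):
        """Return the names of the physical disks backing a mount point."""
        # -J: JSON output; -s: inverse tree, so a mounted filesystem's underlying
        # devices appear as its children and we can walk down to the raw disks.
        tree = json.loads(subprocess.run(
            ["lsblk", "-J", "-s", "-o", "NAME,TYPE,MOUNTPOINT"],
            capture_output=True, text=True, check=True).stdout)

        disks = set()

        def walk(node, below_mount):
            here = below_mount or node.get("mountpoint") == mountpoint
            if here and node.get("type") == "disk":
                disks.add(node["name"])
            for child in node.get("children", []):
                walk(child, here)

        for dev in tree["blockdevices"]:
            walk(dev, False)
        return disks

    # Hypothetical data directory: warn if it shares a disk with "/".
    if disks_backing("/var/opt/gitlab") & disks_backing("/"):
        print("WARNING: data directory shares a physical disk with the OS drive")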



  • #14
    Originally posted by Quackdoc View Post
    Interesting that two drives were enough to take down the services. Seems like a misconfiguration if that was all it took.
    It depends... it could be a RAID 0 for the OS, with the real storage somewhere else. It's also a good idea to create images of the RAID 0 disks, so that when they fail you just dd the images onto new ones. But yeah, you need to store the images somewhere.
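
    A rough sketch of that image-and-restore idea (the device and image paths are made up, and the disk should be offline or quiesced while it is imaged):

    import subprocess

    OS_DISK = "/dev/sda"                  # hypothetical OS disk
    IMAGE = "/mnt/backup/os-disk.img"     # image kept on separate storage

    def image_disk(disk, image):
        """Capture a raw image of the OS disk with dd."""
        subprocess.run(
            ["dd", f"if={disk}", f"of={image}",
             "bs=4M", "conv=fsync", "status=progress"],
            check=True)

    def restore_disk(image, disk):
        """Write the saved image back onto a replacement disk."""
        subprocess.run(
            ["dd", f"if={image}", f"of={disk}",
             "bs=4M", "conv=fsync", "status=progress"],
            check=True)

    image_disk(OS_DISK, IMAGE)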



  • #15
    Originally posted by Ericg View Post

    RAID 1 OS SSDs would definitely cause a server to fall over if two drives died, and that would be game over. However, if you do things "right", it's less of an issue: no OS drive should ever contain "real" data. All real data should live on non-root drives, with an acceptable redundancy and backup strategy in place for those drives. I haven't seen any mention of exactly what went wrong, so there's no telling yet whether it was the data drives or the OS drives that failed.
    RAID 1 does not imply a two-disk setup.

    In fact, you can set up RAID 1 across 3 or even more disks.
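
    For example, a minimal sketch of creating a three-way mirror with mdadm (device names are hypothetical):

    import subprocess

    # Hypothetical members: three partitions in a single 3-way RAID 1 array,
    # which keeps working even after losing any two of them.
    MEMBERS = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]

    subprocess.run(
        ["mdadm", "--create", "/dev/md0", "--level=1",
         f"--raid-devices={len(MEMBERS)}", *MEMBERS],
        check=True)

    # A failed member shows up in /proc/mdstat, e.g. "[3/2] [UU_]".
    print(open("/proc/mdstat").read())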



  • #16
    Originally posted by Ericg View Post

    I'm not familiar with how their GitLab instance is configured at the OS level, so consider this pure speculation, but if it's a single server then it's pretty common for servers to be configured with RAID 1 on the OS drive. If both SSDs were from the same batch, it's very possible for them to suffer similar issues, age at the same rate, and then die at the same time. This would imply that the freedesktop.org GitLab isn't running in any form of high-availability or clustered mode (a quick check of the docs says it does support such a configuration), which is a major issue but likely a decision made for cost reasons.

    Again, pure conjecture, but it's a story I've seen play out all too often.
    Originally posted by tuxd3v View Post

    It depends... it could be a RAID 0 for the OS, with the real storage somewhere else. It's also a good idea to create images of the RAID 0 disks, so that when they fail you just dd the images onto new ones. But yeah, you need to store the images somewhere.
    In any case, if two drives are enough to bring you down for a prolonged period of time, then barring extenuating circumstances (which this could very well be), you need to hire someone new, lol, at least at the scale they operate at.



  • #17
    Originally posted by waxhead View Post
    "Unix systems are so reliable they never have to be rebooted"
    I heard a different one, which goes roughly: "Unix systems are really quick to boot, which is great, because you'll be doing that often."



  • #18
    I'm a firm believer that the backup server should use spinning disks. At least if the PCB gets fried you can swap it out; with an SSD you're just toast.

    Go ahead and use your SSDs or NVMe drives for daily live usage, but don't use them for backups or treat them as if they were as reliable as data written to a disk platter.

    I also think people underappreciate optical media.

    Sure, the probability of a nuclear EMP device going off is low, but you really need your critical data backed up to storage that is archival. One kind of EMP travels through the power grid and could theoretically toast anything wired in.



  • #19
    Originally posted by patrakov View Post
    According to the chat log, they have a Ceph problem. Seems to be the same as https://lists.ceph.io/hyperkitty/lis...5VTO5NPP6VFVX/ or its numerous duplicates from the last few months. Conclusion: never use Ceph, otherwise you'll eventually have to pay 1000000 USD to a data recovery specialist.
    In that bug the user overrode sane defaults and ended up setting an odd memory limit of 1 MB... How is that Ceph's fault? Also, if you actually read through to the bottom, the user got their cluster working again. Where was the expensive data recovery needed?



  • #20
    Originally posted by aufkrawall View Post
    I don't think the discussion should be SSDs vs. HDDs, but rather how it could happen in the first place, i.e. what strategy wasn't pursued that would have prevented this failure.
    In my experience it is best not to use brand-new drives from the same batch (or even from the same manufacturer) in a RAID system. They tend to fail at about the same time.
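
    One crude sanity check, sketched below with hypothetical device names: compare the model strings and serial-number prefixes that smartctl reports, which is only a rough proxy for "from the same batch":

    import subprocess

    # Hypothetical array members; shared model + serial prefix is only a hint.
    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]

    def identity(dev):
        """Return (model, serial) as reported by smartctl -i."""
        out = subprocess.run(["smartctl", "-i", dev],
                             capture_output=True, text=True, check=True).stdout
        model = serial = ""
        for line in out.splitlines():
            if line.startswith(("Device Model:", "Model Number:")):
                model = line.split(":", 1)[1].strip()
            elif line.startswith("Serial Number:"):
                serial = line.split(":", 1)[1].strip()
        return model, serial

    ids = {dev: identity(dev) for dev in DEVICES}
    models = {model for model, _ in ids.values()}
    serial_prefixes = {serial[:6] for _, serial in ids.values()}

    for dev, (model, serial) in ids.items():
        print(f"{dev}: {model} ({serial})")
    if len(models) == 1 and len(serial_prefixes) == 1:
        print("All members share a model and serial prefix; possibly the same batch.")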
