FreeDesktop.org GitLab Down Due To Drive Failures

  • #11
    According to the chat log, they have a Ceph problem. Seems to be the same as https://lists.ceph.io/hyperkitty/lis...5VTO5NPP6VFVX/ or its numerous duplicates from the last few months. Conclusion: never use Ceph, otherwise you'll eventually have to pay 1000000 USD to a data recovery specialist.



  • #12
    Well... two drives and it's down.
    Is it a RAID 5 system? With two drives gone, it's game over.
    Or maybe they had a RAID 0 for the OS (2 SSDs) and a different RAID for the *real* storage.

    For *real* storage I always go with HDDs. Life is too short to trust SSDs.



  • #13
    Originally posted by tuxd3v View Post
    Well... two drives and it's down.
    Is it a RAID 5 system? With two drives gone, it's game over.
    Or maybe they had a RAID 0 for the OS (2 SSDs) and a different RAID for the *real* storage.

    For *real* storage I always go with HDDs. Life is too short to trust SSDs.
    RAID 1 OS SSDs would definitely cause a server to fall over if two drives died, and that would be game over. However, if you do things "right", it's less of an issue: no OS drive should ever contain "real" data. All real data should live on non-root drives, with an acceptable redundancy and backup strategy in place for those drives. I haven't seen any mention of exactly what went wrong, so there's no telling yet whether it was the data drives or the OS drives that failed.
    All opinions are my own, not those of my employer, if you know who they are.
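
    As a rough illustration of that kind of layout check, the sketch below (with a hypothetical data path, and nothing specific to freedesktop.org's setup) walks lsblk's JSON output to confirm that a data directory is not backed by the same physical disk as the root filesystem:

    import json
    import subprocess

    def disks_backing(mountpoint):
        """Return the names of the physical disks backing a mount point."""
        # -J: JSON output; -s: inverse tree, so a mounted filesystem's underlying
        # devices appear as its children and we can walk down to the raw disks.
        tree = json.loads(subprocess.run(
            ["lsblk", "-J", "-s", "-o", "NAME,TYPE,MOUNTPOINT"],
            capture_output=True, text=True, check=True).stdout)

        disks = set()

        def walk(node, below_mount):
            here = below_mount or node.get("mountpoint") == mountpoint
            if here and node.get("type") == "disk":
                disks.add(node["name"])
            for child in node.get("children", []):
                walk(child, here)

        for dev in tree["blockdevices"]:
            walk(dev, False)
        return disks

    # Hypothetical data directory: warn if it shares a disk with "/".
    if disks_backing("/var/opt/gitlab") & disks_backing("/"):
        print("WARNING: data directory shares a physical disk with the OS drive")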



  • #14
    Originally posted by Quackdoc View Post
    Interesting that two drives were enough to take down the services. Seems like a misconfiguration if that was all it took.
    It depends... it could be a RAID 0 for the OS, with the real storage somewhere else. It's also a good idea to create images of the RAID 0 disks, so that when they fail you just dd the images onto new ones. But yeah, you need to store the images somewhere.
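
    A rough sketch of that image-and-restore idea (the device and image paths are made up, and the disk should be offline or quiesced while it is imaged):

    import subprocess

    OS_DISK = "/dev/sda"                  # hypothetical OS disk
    IMAGE = "/mnt/backup/os-disk.img"     # image kept on separate storage

    def image_disk(disk, image):
        """Capture a raw image of the OS disk with dd."""
        subprocess.run(
            ["dd", f"if={disk}", f"of={image}",
             "bs=4M", "conv=fsync", "status=progress"],
            check=True)

    def restore_disk(image, disk):
        """Write the saved image back onto a replacement disk."""
        subprocess.run(
            ["dd", f"if={image}", f"of={disk}",
             "bs=4M", "conv=fsync", "status=progress"],
            check=True)

    image_disk(OS_DISK, IMAGE)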



  • #15
    Originally posted by Ericg View Post

    RAID 1 OS SSDs would definitely cause a server to fall over if two drives died, and that would be game over. However, if you do things "right", it's less of an issue: no OS drive should ever contain "real" data. All real data should live on non-root drives, with an acceptable redundancy and backup strategy in place for those drives. I haven't seen any mention of exactly what went wrong, so there's no telling yet whether it was the data drives or the OS drives that failed.
    RAID 1 does not imply a two-disk setup.

    In fact, you can set up RAID 1 across 3 or even more disks.
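
    For example, a minimal sketch of creating a three-way mirror with mdadm (device names are hypothetical):

    import subprocess

    # Hypothetical members: three partitions in a single 3-way RAID 1 array,
    # which keeps working even after losing any two of them.
    MEMBERS = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]

    subprocess.run(
        ["mdadm", "--create", "/dev/md0", "--level=1",
         f"--raid-devices={len(MEMBERS)}", *MEMBERS],
        check=True)

    # A failed member shows up in /proc/mdstat, e.g. "[3/2] [UU_]".
    print(open("/proc/mdstat").read())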



  • #16
    Originally posted by Ericg View Post

    I'm not familiar with how their GitLab instance is configured at the OS level, so consider this pure speculation, but if it's a single server then it's pretty common for servers to be configured with RAID 1 on the OS drive. If both SSDs were from the same batch, it's very possible for them to suffer similar issues, age at the same rate, and then die at the same time. This would imply that the freedesktop.org GitLab isn't running in any form of high-availability or clustered mode (a quick check of the docs says it does support such a configuration), which is a major issue but likely a decision made for cost reasons.

    Again, pure conjecture, but it's a story I've seen play out all too often.
    Originally posted by tuxd3v View Post

    It depends... it could be a RAID 0 for the OS, with the real storage somewhere else. It's also a good idea to create images of the RAID 0 disks, so that when they fail you just dd the images onto new ones. But yeah, you need to store the images somewhere.
    In any case, if two drives are enough to bring you down for a prolonged period of time, then barring extenuating circumstances (which this could very well be), you need to hire someone new, lol, at least at the scale they operate at.



  • #17
    Originally posted by waxhead View Post
    "Unix systems are so reliable they never have to be rebooted"
    I heard a different one, which goes roughly: "Unix systems are really quick to boot, which is great, because you'll be doing that often."



  • #18
    I'm a firm believer that the backup server should use spinning disks. At least if the PCB gets fried you can swap it out; with an SSD you're just toast.

    Go ahead and use your SSDs or NVMe drives for daily live usage, but don't use them for backups or treat them as if they were as reliable as data written to a disk platter.

    I also think people underappreciate optical media.

    Sure, the probability of a nuclear EMP device going off is low, but you really need your critical data backed up to storage that is archival. One kind of EMP travels through the power grid and could theoretically toast anything wired in.



  • #19
    Originally posted by patrakov View Post
    According to the chat log, they have a Ceph problem. Seems to be the same as https://lists.ceph.io/hyperkitty/lis...5VTO5NPP6VFVX/ or its numerous duplicates from the last few months. Conclusion: never use Ceph, otherwise you'll eventually have to pay 1000000 USD to a data recovery specialist.
    In that bug the user overrode sane defaults and ended up setting an odd memory limit of 1 MB... How is that Ceph's fault? Also, if you actually read through to the bottom, the user got their cluster working again. Where was the expensive data recovery needed?



  • #20
    Originally posted by aufkrawall View Post
    I don't think the discussion should be SSDs vs. HDDs, but rather how it could happen in the first place, i.e. what strategy wasn't pursued that would have prevented this failure.
    In my experience it is best not to use brand-new drives from the same batch (or even from the same manufacturer) in a RAID system. They tend to fail at about the same time.
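
    One crude sanity check, sketched below with hypothetical device names: compare the model strings and serial-number prefixes that smartctl reports, which is only a rough proxy for "from the same batch":

    import subprocess

    # Hypothetical array members; shared model + serial prefix is only a hint.
    DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]

    def identity(dev):
        """Return (model, serial) as reported by smartctl -i."""
        out = subprocess.run(["smartctl", "-i", dev],
                             capture_output=True, text=True, check=True).stdout
        model = serial = ""
        for line in out.splitlines():
            if line.startswith(("Device Model:", "Model Number:")):
                model = line.split(":", 1)[1].strip()
            elif line.startswith("Serial Number:"):
                serial = line.split(":", 1)[1].strip()
        return model, serial

    ids = {dev: identity(dev) for dev in DEVICES}
    models = {model for model, _ in ids.values()}
    serial_prefixes = {serial[:6] for _, serial in ids.values()}

    for dev, (model, serial) in ids.items():
        print(f"{dev}: {model} ({serial})")
    if len(models) == 1 and len(serial_prefixes) == 1:
        print("All members share a model and serial prefix; possibly the same batch.")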
