Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC
-
Originally posted by mathew7 View Post
All permanent storage devices have error correction built in. When you receive a bad-sector report, their correction code has failed.
<granpa voice>Back in the day, when an HDD was about 3 times the size of a CD</granpa voice>, I prepared a folder to burn to a CD and copied it over the network to a friend (over NetBIOS, not TCP/IP), and everything seemed fine. I even compared the copied data on the PC that had written it.
But when I put the CD in my PC and compared it with my original data, almost everything was corrupted (no zips could be extracted and every .exe crashed). I don't know which part failed: the memory-to-memory copy in the network driver or the actual network transfer. But it failed silently. And this WAS a backup.
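A silent failure like this is the kind of thing an end-to-end checksum comparison is meant to catch. Below is a minimal sketch, assuming both the original folder and the copy are readable from the machine doing the check; `sha256_of` and `find_mismatches` are hypothetical helper names, not anything from the posts. It only helps if the copy is read back from the final medium, and it cannot detect data that was already corrupted in RAM before it was hashed.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(src_dir, dst_dir):
    """Return relative paths that are missing or differ in the copy."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    bad = []
    for src in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(src_dir)
        dst = dst_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            bad.append(rel.as_posix())
    return bad
```

Running `find_mismatches(original_folder, cd_mount_point)` on the receiving PC, after the disc is written, returns an empty list only if every file reads back identically.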
Also, getting back to the need for ECC: I had a RAM+MB combination that would always freeze while loading Windows on the first start of the day (after at least 8h powered off). It took me a month to remember to run memtest first, and when I did, I got two 1-bit errors on the first pass, followed by hours of stability. ECC would have allowed me to use that hardware reliably, instead of getting angry every time I had to reset my PC. Now think about someone who doesn't know anything about PCs and has to go through the seller's support channel.
Comment
-
Originally posted by schmidtbag View Post
That is only when the bad sector is known. It's not uncommon to see drives that steadily lose data integrity over time. For HDDs, sometimes the data is verified clean when written but gets corrupted later.
Originally posted by schmidtbag View Post
Well, if you didn't use TCP/IP, that's most likely the problem.
Originally posted by schmidtbag View Post
ECC is supposed to protect you from things like cosmic rays and neutrons, which are basically unpredictable accidents. When you have something as consistent as what you described, that's not a fluke of nature, that's faulty hardware. I'm not sure ECC would actually fix that, though it might have made it less of a problem. Simply swapping out the module for a different one (regardless of ECC) might have been enough to stop the problem.
As for my RAM problem, ECC would have hidden it and the machine would have continued to work. My case was an edge case, which is what made it problematic. If it had been a total failure (like a bad address on every pass), I would only have been delayed by 2 days for shipping/exchanging the RAM. But due to its nature, it bugged me for months (and I did run memtest with no errors... I just forgot to do it on the first start, when Windows froze).
Also, a BSOD from 2-bit-error detection gives more info (like a specific address that I could see every time) than a silent crash/freeze, which could come from any HW. Oh... and a faulty (or on its way out) PSU could also create sporadic memory errors where Windows would just freeze, when it could instead give you a BSOD that names the specific source (ECC... error checking and correction).
- Likes 1
Comment
-
Originally posted by Ivan Dimitrov View Post
AMD officially supports (validates) ECC RAM ONLY on its PRO processors. ECC is enabled on non-PRO processors, but validation and implementation are left to the motherboard vendor. So ECC RAM will most likely work with a non-PRO processor, but it is not clear whether it will run in ECC mode and how it will report errors.
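On Linux, one way to check whether errors are actually being reported is to read the kernel's EDAC counters from sysfs. The sketch below assumes a Linux system with an EDAC driver loaded for the memory controller; `read_edac_counts` is a hypothetical helper name, and the `ce_count` / `ue_count` files hold the correctable and uncorrectable error counts per memory controller. If the directory is missing, the function returns an empty dict, which usually means no EDAC driver is active rather than "no errors".

```python
from pathlib import Path

# Standard EDAC sysfs location on Linux (assumes an EDAC driver is loaded).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}."""
    root = Path(root)
    if not root.is_dir():
        return {}
    counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # file missing or unreadable
        counts[mc.name] = entry
    return counts
```

A non-empty result with a slowly growing `ce_count` would suggest that ECC is active and quietly correcting errors; an empty result on a board that claims ECC support is worth investigating.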
- Likes 1
Comment
-
Originally posted by carewolf View Post
Error _correction_ does correct errors; it is in the name. I think you are confusing parity with ECC. ECC can correct one-bit errors, and can detect two-bit errors but not correct them.
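The correct-one, detect-two behaviour described above can be shown with a toy extended Hamming code. This is only a sketch on 4 data bits (an 8-bit codeword); ECC DIMMs typically protect 64 data bits with 8 check bits, and the exact code a given memory controller uses is not something this example claims to reproduce. The function names `encode` and `decode` are illustrative.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8  # index 0 holds the overall parity bit
    bits[3], bits[5], bits[6], bits[7] = d  # data bits
    bits[1] = bits[3] ^ bits[5] ^ bits[7]   # parity over positions 1,3,5,7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]   # parity over positions 2,3,6,7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]   # parity over positions 4,5,6,7
    bits[0] = sum(bits[1:]) & 1             # makes the total weight even
    return sum(b << i for i, b in enumerate(bits))


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(bits) & 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: assume exactly one, at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two flips, location unknown.
        return "uncorrectable", None
    nibble = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return status, nibble
```

Flipping any single bit of `encode(n)` still decodes to `n` with status `'corrected'`, while flipping any two bits is reported as `'uncorrectable'` instead of returning wrong data. Plain parity, by contrast, would only tell you that an odd number of bits changed.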
But there's more. Let's say we have a faulty module:
ECC = crashes, and maybe an error log.
NO-ECC = crashes anyway.
So why do we even need ECC memory for simple usage? 🤷‍♂️
Comment
-
Originally posted by waxhead View Post
One way to work around the lack of ECC would be a background task, e.g. a kernel driver, that does the same job as Memtest86+. It would run on idle or reserve a very small amount of CPU time (2-3%) and slowly but surely walk the memory, continuously. If it finds a bad spot, it could blacklist the memory area or memory module, building a "bad block list" over time. That would be a poor man's ECC, i.e. software memory scrubbing. It would be interesting if something like this existed, as I am sure it would expose how big the problem probably is, and it might build a strong case for Torvalds's call for more ECC memory.
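The write-pattern, read-back, record-bad-block loop at the core of that idea can be sketched in a few lines. This is a user-space illustration only, with hypothetical names (`scrub_pass`, `PATTERNS`): it tests a scratch buffer the process owns and destroys its contents, it cannot reach arbitrary physical memory the way a kernel driver could, and CPU caches may satisfy the read-back without touching DRAM at all.

```python
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)  # all-zeros, all-ones, alternating bits


def scrub_pass(buf, block_size=4096, patterns=PATTERNS):
    """Write each pattern to each block and read it back.

    `buf` is a mutable bytes-like object (e.g. a bytearray). Its contents
    are overwritten. Returns the start offsets of blocks that mismatched.
    """
    bad_blocks = []
    for start in range(0, len(buf), block_size):
        end = min(start + block_size, len(buf))
        for pattern in patterns:
            expected = bytes([pattern]) * (end - start)
            buf[start:end] = expected       # write the pattern
            if buf[start:end] != expected:  # read it back and compare
                bad_blocks.append(start)
                break
    return bad_blocks
```

Calling `scrub_pass(bytearray(64 * 1024 * 1024))` repeatedly and merging the returned offsets would give a crude "bad block list" for that buffer. A real implementation would also have to preserve live data and map offsets to physical addresses, which this sketch does not attempt.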
- Likes 1
Comment
-
Originally posted by magallanes View Post
Hi everybody:
Also, memory is far less prone to failure than in the past. For example, if you have worked in a datacenter, you know that replacing a hard disk is routine and already included in the overall costs. By contrast, it is not common to replace memory modules or the CPU, mainly because those components sit well inside the motherboard and are protected by several layers of capacitors.
Don't expect much and seldom disappointed.
- Likes 1
Comment
-
Originally posted by mathew7 View Post
Doooh!... the point was that a backup could back up the corrupted data.
Well, yes, but look at the size of modules over the years... you have the same die shrinking as with CPUs. But the smaller the cell, the more susceptible it is to that radiation. Also, the voltage of the cell gets lower.
When it comes to servers and workstations, it's the exact opposite: most of the data is personal/critical and therefore the probability of a flipped bit causing a problem is significant. That's why you'd be a fool for not using ECC in such situations.
Oh... and a faulty (or on its way out) PSU could also create sporadic memory errors where Windows would just freeze, when it could instead give you a BSOD that names the specific source (ECC... error checking and correction).
Last edited by schmidtbag; 04 January 2021, 12:37 PM.
Comment
-
I am searching / waiting for an excellent AMD EPYC motherboard with an expanded selection of BIOS RAS features, specifically ECC-related ones. In that respect, Intel motherboards for some Intel processors are way ahead of AMD. AMD motherboard manufacturers seem to offer only "ECC On/Off" and "Patrol Scrub", while some of the better Intel server motherboards (for the upper-tier CPUs) offer an expanded range of features:
- Adaptive double device data correction (ADDDC)
- Single Device Data Correction (SDDC)
- Memory Address Range Mirroring (MARM)
- Memory error storm response and Auto self-healing (Analysis Engine)
- Post Package Repair to spare and replace defective portions of DRAMs.
- Not "ECC", but a useful addition: Enhanced Machine Check Architecture Gen 2 (eMCA2)
Each section of an Intel BIOS seems to offer more options than the AMD BIOS's rather sparse selection of knobs to twiddle.
Last edited by JustRob; 04 January 2021, 01:06 PM.
- Likes 1
Comment