Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC
-
Originally posted by zxy_thf: It would be great to have some memtesting kernel module mainlined, but running it in the background all the time is probably not that great. It will drain the battery on a laptop.
-
Originally posted by schmidtbag: ECC is supposed to protect you from things like cosmic rays and neutrinos, which are basically unpredictable accidents.
It does what it does, and it's as useful to protect against a failing DRAM cell or dirty DIMM contact as any other source of errors. In fact, the reporting/logging is principally useful for alerting you specifically when you have an actual hardware failure/degradation, so that you can do preventative maintenance (whether it be to simply block that memory page, or to actually replace the whole DIMM).
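For anyone who wants to see that reporting in practice: on Linux, the EDAC subsystem exposes per-memory-controller error counters under sysfs. Below is a minimal sketch that reads them, assuming the usual `/sys/devices/system/edac/mc/mcN/ce_count` and `ue_count` files are present (they only appear when an EDAC driver is loaded for your platform). The function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC counters.

    `root` is a parameter so the function can be pointed at a test directory;
    the default is the sysfs location assumed above.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        note = "  <-- worth checking this DIMM" if ce or ue else ""
        print(f"{name}: corrected={ce} uncorrected={ue}{note}")
```

A steadily climbing corrected-error count on one controller is the kind of early warning described above. An empty result most likely just means no EDAC driver is loaded.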
-
Originally posted by coder: You're citing info from Google that's more than a decade old and without regard to the fact that DRAM and DIMM quality varies widely (and you can bet Google never bought the cheapest stuff). Either you need to cite a recent survey of consumer DRAM reliability or stop pretending that you have relevant data.
As has already been noted (but is worth repeating): HDDs and SSDs have long had error-correction schemes far more sophisticated than ECC DRAM.
-
Originally posted by sandy8925:
Well, my RAM "overclocks" to 2666 MHz at 1.2 V, and it has been working perfectly fine for 5 years. I've seen how bad defective RAM is, and had it replaced with working RAM as well.
If the newer RAM modules were *that* defective, people would be ranting and up in arms about bad RAM. And tech reviewers would also be ranting about it.
I purchased two more modules when the old ones went out for RMA, but the new ones can only run at 2933 instead of the labelled 3200.
I'm not sure what to blame, because the motherboard manual says it supports up to 2933, but I've seen others posting results running up to 4000.
Anyway, I'm not an overclocker and am happy with the current result, but meanwhile I've lost all confidence in the quality of modern DRAM modules.
My guess at the situation: all major RAM manufacturers offer lifetime warranties, so within a week or a month customers end up with working modules in their systems.
In addition, overclockers know they are playing the silicon lottery and won't complain about bad luck unless something doesn't work at all. I've seen posts on reddit about the poor reliability of XMP; iirc, the OP stated there is only a ~90% chance that XMP works without any trouble.
Last edited by zxy_thf; 04 January 2021, 05:37 PM.
-
Originally posted by coder: According to whom?
Note I said "things like", meaning there are other possible causes.
It does what it does, and it's as useful to protect against a failing DRAM cell or dirty DIMM contact as any other source of errors. In fact, the reporting/logging is principally useful for alerting you specifically when you have an actual hardware failure/degradation, so that you can do preventative maintenance (whether it be to simply block that memory page, or to actually replace the whole DIMM).
-
Originally posted by schmidtbag: Indeed it could.
Originally posted by schmidtbag: Overclocking could trigger it.
Originally posted by schmidtbag: A lightning strike could trigger it. The list goes on and on.
-
Originally posted by JustRob: Intel motherboards for some Intel processors are way ahead of AMD. AMD motherboard manufacturers seem to only offer "ECC On/Off" and "Patrol Scrub", while some of the better Intel server motherboards (for the upper tier CPUs) offer an expanded range of features:
- Adaptive double device data correction (ADDDC)
- Single Device Data Correction (SDDC)
- Memory Address Range Mirroring (MARM)
- Memory error storm response and Auto self-healing (Analysis Engine)
- Post Package Repair to spare and replace defective portions of DRAMs.
- Not "ECC", but a useful addition: Enhanced Machine Check Architecture Gen 2 (eMCA2)
Each section of an Intel BIOS seems to offer more options than AMD BIOS's rather sparse selection of knobs to twiddle.
FWIW, I think at least some of that can be implemented at the OS level, with little/no penalty. Things like the storm response and auto self-healing.
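As a rough illustration of what an OS-level "storm response" could look like, here is a minimal sliding-window sketch: it flags a storm when more than a threshold number of corrected errors arrive within a time window. The class name, thresholds, and the idea of feeding it timestamps by hand are all assumptions for illustration, not how any particular BIOS or kernel does it.

```python
from collections import deque


class ErrorStormDetector:
    """Flag a 'storm' when too many corrected errors land in a short window."""

    def __init__(self, max_errors=10, window_seconds=60.0):
        self.max_errors = max_errors          # hypothetical threshold
        self.window_seconds = window_seconds  # hypothetical window length
        self._times = deque()

    def record(self, timestamp):
        """Record one corrected error; return True if a storm is in progress."""
        self._times.append(timestamp)
        # Drop events that have aged out of the window.
        while self._times and timestamp - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.max_errors
```

On a storm, the OS could then react in software, for example by throttling error reporting or retiring the affected page, which is the sort of thing that would not need firmware support.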
-
In my own experience, bad memory modules definitely happen too often. Sometimes it is self-inflicted through overclocking, but just enabling XMP can automatically overclock other parts as well, which does not help stability. The main case for ECC is when you run a file server: some filesystems die very quickly when the system is unstable. btrfs was not recoverable at all when I had RAM issues. In my job I saw lots of 32 GB ECC modules fail over time. The more modules you have, the more likely a failure is (the systems had 12 modules each). It is good when defective RAM can be found and replaced early.
I personally would always use ECC if my boards supported it. They don't, however, as I do not own AMD systems and the Intel boards with ECC support are too expensive (and only work with i3 or Xeon). I doubt, however, that Linus' rant will change anything. In the server market ECC is the default anyway, and many consumers buy cheap systems, where ECC would be too expensive even if it added just a few percent. If a system works for 2 years with normal RAM, then the revenue is higher, and a crash here and there looks "normal" even if you have bad RAM. If you build your own systems, you can (and should) select better combinations, of course.
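The point above about more modules meaning more failures can be put into numbers. This is a minimal sketch, assuming each module fails independently with the same probability over some period. The 2% figure in the usage note is purely illustrative, not measured data.

```python
def prob_any_module_fails(p_single, n_modules):
    """P(at least one of n independent modules fails) = 1 - (1 - p)^n."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if n_modules < 0:
        raise ValueError("n_modules must be non-negative")
    return 1.0 - (1.0 - p_single) ** n_modules
```

With a hypothetical 2% per-module failure chance, `prob_any_module_fails(0.02, 12)` comes out a little over 20% for a 12-module system, versus 2% for a single module.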