Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • Originally posted by Weasel View Post
    I've been using ECC for decades so I don't have to deal with silent data corruption.
    And for developers, the last thing we want to waste our time on is debugging what turns out to be memory errors.

    Originally posted by Weasel View Post
    Even VRAM now has ECC these days.
    Not in general, no. Maybe something specific around GDDR6X, akin to what's going into DDR5.

    Comment


    • Originally posted by zxy_thf View Post
      It would be great to have some memtesting kernel module mainlined, but running in background all the time is probably not that great. It will drain the battery on a laptop.
      Well, it'd be neat to run it only when on AC power. That would still provide basically all of the benefit, since you just need to be able to get through all your RAM once in a while. The main point of running it is to find and block unreliable pages - a list that hopefully won't grow very fast.
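
A minimal user-space sketch of the "only when on AC power" gating, not a kernel module. It assumes the Linux sysfs power-supply class under /sys/class/power_supply, where a mains adapter reports type "Mains" and online "1". The names on_ac_power, scan_while_plugged_in and test_chunk are hypothetical, and the actual memory test is left as a caller-supplied function.

```python
from pathlib import Path


def on_ac_power(root="/sys/class/power_supply"):
    """Return True if any mains-type supply under `root` reports online.

    Assumption: a supply with type "Mains" and online "1" means AC power.
    Machines with no entries here (many desktops) or chargers that report
    a different type will return False with this simple check.
    """
    base = Path(root)
    if not base.is_dir():
        return False
    for supply in base.iterdir():
        try:
            kind = (supply / "type").read_text().strip()
            online = (supply / "online").read_text().strip()
        except OSError:
            continue  # e.g. a battery entry has no "online" attribute
        if kind == "Mains" and online == "1":
            return True
    return False


def scan_while_plugged_in(chunks, test_chunk, is_on_ac=on_ac_power):
    """Test chunks in order, stopping as soon as AC power is lost.

    Returns (bad_chunks, next_index) so a later run can resume at
    next_index and the caller can block the chunks that failed.
    """
    bad = []
    for index, chunk in enumerate(chunks):
        if not is_on_ac():
            return bad, index
        if not test_chunk(chunk):
            bad.append(chunk)
    return bad, len(chunks)
```

Returning the resume index is what lets the scan get through all of RAM "once in a while" across several plugged-in sessions instead of restarting each time.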

      Comment


      • Originally posted by coder View Post
        Not in general, no. Maybe something specific around GDDR6X, akin to what's going into DDR5.
        Workstation/enterprise graphics normally comes with ECC afaik. You wouldn't want some bridge or building to collapse down the road.

        Comment


        • Originally posted by schmidtbag View Post
          ECC is supposed to protect you from things like cosmic rays and neutrinos, which are basically unpredictable accidents.
          According to whom?

          It does what it does, and it's as useful to protect against a failing DRAM cell or dirty DIMM contact as any other source of errors. In fact, the reporting/logging is principally useful for alerting you specifically when you have an actual hardware failure/degradation, so that you can do preventative maintenance (whether it be to simply block that memory page, or to actually replace the whole DIMM).
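
As a rough illustration of that reporting side, the sketch below reads the per-controller corrected (ce_count) and uncorrected (ue_count) error counters that the Linux EDAC subsystem exposes under /sys/devices/system/edac/mc. It assumes an EDAC driver is loaded for the memory controller. The function names are hypothetical, and what to do with a rising count (block a page, swap a DIMM) is left to the reader.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (corrected, uncorrected)} from EDAC sysfs.

    An empty dict means no EDAC driver is loaded or the platform has no
    ECC reporting; it does not mean the memory is error-free.
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (corrected, uncorrected)
    return counts


def new_errors(previous, current):
    """Return controllers whose counters rose since the `previous` snapshot."""
    delta = {}
    for name, (corrected, uncorrected) in current.items():
        old_ce, old_ue = previous.get(name, (0, 0))
        if corrected > old_ce or uncorrected > old_ue:
            delta[name] = (corrected - old_ce, uncorrected - old_ue)
    return delta
```

Comparing two snapshots taken some time apart shows whether corrected errors are a one-off or a steadily growing count, which is the pattern that points at a degrading cell or contact.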

          Comment


          • Originally posted by coder View Post
            You're citing info from Google that's more than a decade old and without regard to the fact that DRAM and DIMM quality varies widely (and you can bet Google never bought the cheapest stuff). Either you need to cite a recent survey of consumer DRAM reliability or stop pretending that you have relevant data.
            How about you cite the data? You're the one who is acting like bit flipping is some rampant issue, but the fact of the matter is, it's not. The vast majority of consumer devices don't use ECC and you don't see people plagued with memory corruption issues.
            As has already been noted (but is worth repeating): HDDs and SSDs have long had error-correction schemes far more sophisticated than ECC DRAM.
            Doesn't change the fact that data corruption errors on disks happen more often than ECC errors [for home users].

            Comment


            • Originally posted by sandy8925 View Post

              Well my RAM "overclocks" to 2666 MHz at 1.2 V. And it's been working perfectly fine for 5 years. I've seen how bad defective RAM is, and had it replaced with working RAM as well.

              If the newer RAM modules were *that* defective, people would be ranting and up in arms about bad RAM. And tech reviewers would also be ranting about it.
              Actually this is not the end of my story.
              I purchased two more modules when the old ones went out for RMA, but those new ones can only work under 2933 instead of the labelled 3200.
              Not sure which I should blame because the motherboard manual told me it supports up to 2933, but I've seen others posting results running up to 4000.
              Anyway, I'm not an overclocker and am happy with the current result, but meanwhile I've lost all confidence in the quality of modern DRAM modules.

              I'm guessing the situation is: all major RAM manufacturers have lifetime warranty policies, and at the end of the week/month customers will have working modules in their systems.

              In addition, OCers know they are playing the silicon lottery and won't complain about their bad luck unless something doesn't work at all -- I've seen posts about the poor reliability of XMP on reddit. iirc the OP stated there is only a ~90% chance that XMP works without any trouble.
              Last edited by zxy_thf; 04 January 2021, 05:37 PM.

              Comment


              • Originally posted by coder View Post
                According to whom?
                ...Do you not understand what causes bit flipping? Do you know anything about the basics of computer science? Or digital electronics? It is very well-established by many studies. Just look up "ecc cosmic rays" and within the first page you should be able to find scientific journals discussing the topic. Look up ECC on the Computerphile YouTube channel and it goes into an explanation of this sort of thing. I'm baffled why you even need to ask this. What next, are you going to ask who says evolution is a real thing?
                Note I said "things like" meaning, there are other possible causes.
                It does what it does, and it's as useful to protect against a failing DRAM cell or dirty DIMM contact as any other source of errors. In fact, the reporting/logging is principally useful for alerting you specifically when you have an actual hardware failure/degradation, so that you can do preventative maintenance (whether it be to simply block that memory page, or to actually replace the whole DIMM).
                That is one thing it can do, but a simple memory test can prove whether you have a failing cell or contact. ECC is primarily meant to protect from bit flipping.

                Comment


                • Originally posted by schmidtbag View Post
                  Indeed it could.
                  I've even seen the impact of a low-quality PSU on memory integrity quantified, but I can no longer find the article.

                  Originally posted by schmidtbag View Post
                  Overclocking could trigger it.
                  Overclocking is at the very top of the list of things not to do, if you care at all about data integrity.

                  Originally posted by schmidtbag View Post
                  A lightning strike could trigger it. The list goes on and on.
                  Lumping together frequent and infrequent causes is counter-productive.

                  Comment


                  • Originally posted by JustRob View Post
                    Intel motherboards for some Intel processors are way ahead of AMD. AMD motherboard manufacturers seem to only offer "ECC On/Off" and "Patrol Scrub",
                    Which AMD server vendors did you check?

                    Originally posted by JustRob View Post
                    while some of the better Intel server motherboards (for the upper tier CPUs) offer an expanded range of features:

                    - Adaptive double device data correction (ADDDC)
                    - Single Device Data Correction (SDDC)
                    - Memory Address Range Mirroring (MARM)
                    - Memory error storm response and Auto self-healing (Analysis Engine)
                    - Post Package Repair to spare and replace defective portions of DRAMs.

                    - Not "ECC", but a useful addition: Enhanced Machine Check Architecture Gen 2 (eMCA2)

                    Each section of an Intel BIOS seems to offer more options than AMD BIOS's rather sparse selection of knobs to twiddle.
                    Pay for what you're actually going to use - the rest is irrelevant, at best.

                    FWIW, I think at least some of that can be implemented at the OS level, with little/no penalty. Things like the storm response and auto self-healing.

                    Comment


                    • In my own experience bad memory modules definitely happen too often. Sometimes it is self-imposed by OC, but just enabling XMP can automatically enable OC of other parts as well - this does not lead to stability. The main point of using ECC is when you run a fileserver - some filesystems will be dead very soon in case of instabilities - btrfs was not recoverable at all at the point I had RAM issues. In my job I saw lots of 32 GB ECC RAM modules failing over time - the more modules you have, the more likely a failure is (the systems had 12 modules each). It is good when the defective RAM can be found and replaced early.
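
The "more modules, more likely a failure" point can be sketched numerically. This assumes modules fail independently and with the same probability over some period, which is a simplification. The per-module probability used in the loop is a made-up placeholder, not a measured failure rate.

```python
def p_any_failure(p_module, n_modules):
    """Probability that at least one of n independent modules fails.

    Uses 1 - (1 - p)^n, where p is the per-module failure probability
    over the period of interest.
    """
    if not 0.0 <= p_module <= 1.0:
        raise ValueError("p_module must be between 0 and 1")
    if n_modules < 0:
        raise ValueError("n_modules must not be negative")
    return 1.0 - (1.0 - p_module) ** n_modules


if __name__ == "__main__":
    hypothetical_p = 0.02  # placeholder value, purely for illustration
    for n in (1, 2, 4, 12):
        print(n, p_any_failure(hypothetical_p, n))
```

Whatever the real per-module figure is, the chance of seeing at least one bad module rises with the module count, which matches the experience with 12-module systems described above.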

                      I personally would always use ECC if my boards supported it - they don't, however, as I do not own AMD systems and the Intel boards with ECC support are too expensive (and only work with i3 or Xeon). I doubt, however, that Linus' rant will change anything. In the server market ECC is the default anyway, and many consumers buy cheap systems - there ECC would be too expensive, even if it were just a few % more. If a system works for 2 years with normal RAM then the revenue is higher - and a crash here and there looks "normal" - even if you have bad RAM. If you build your own systems you can (and should) select better combinations, of course.

                      Comment
