Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC

  • Originally posted by schmidtbag View Post
    That isn't the case for most desktop PC users.
    it is the case... the damage is real, but most people just do not know it.
    i know so many people who are simple desktop PC users who lost data...
    this is really a "disease", and if you calculate it, even the smallest damage costs much more than the 12% higher cost of producing ECC ram.

    compare this 3200mhz DDR4 ram with ECC and without ECC:

    Product comparison for Kingston Server Premier DIMM 16GB, DDR4-3200, CL22-22-22, ECC (KSM32ES8/16ME), Crucial DIMM 16GB, DDR4-3200, CL22-22-22 (CT16G4DFRA32A)

    the price difference is 22,5€ per 16GB...
    if someone buys a pc with 32gb ram, that is a 45€ price difference, which means a 400€ pc goes to about 445€ (let's assume a law makes it illegal to do anything anti-ECC like intel does)

    now... one person buys the ecc variant the other one the version without ecc...

    now the one without ECC has a simple problem with his pc, does not know what happened to it, and calls tech support to come out... in germany the call-out fee alone costs more than 45€, and any tech support costs you at minimum 60€ per hour...

    so it is insane to save 45€ to then spend at minimum 100€ on tech support...
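
A rough sketch of the arithmetic in this post, using only the figures quoted above. The one-hour visit length and the function names are assumptions made for illustration, and the call-out fee is taken at its stated lower bound.

```python
# Figures quoted in the post; treat them as the poster's numbers, not market data.
ECC_PREMIUM_PER_16GB_EUR = 22.5  # price gap between the two 16GB DIMMs compared
CALLOUT_FEE_EUR = 45.0           # lower bound: the visit alone "costs more than 45€"
HOURLY_RATE_EUR = 60.0           # "at minimum 60€ per hour"


def ecc_premium(total_gb: int, per_16gb: float = ECC_PREMIUM_PER_16GB_EUR) -> float:
    """Extra cost of ECC for a PC whose RAM is built from 16GB DIMMs."""
    dimms = total_gb // 16
    return dimms * per_16gb


def support_visit_cost(hours: float = 1.0) -> float:
    """Cost of one on-site visit; the default of one hour is an assumption."""
    return CALLOUT_FEE_EUR + hours * HOURLY_RATE_EUR


if __name__ == "__main__":
    premium = ecc_premium(32)
    visit = support_visit_cost(1.0)
    print(f"ECC premium for 32GB: {premium:.2f} EUR")
    print(f"400 EUR PC with ECC:  {400 + premium:.2f} EUR")
    print(f"One support visit:    {visit:.2f} EUR")
```

Under these assumptions a single visit already costs more than the ECC premium, which is the point the post is making. It says nothing about how often a non-ECC machine actually needs such a visit.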
    Phantom circuit Sequence Reducer Dyslexia

    Comment


    • Originally posted by coder View Post
      If you make a sloppy post, don't blame me. If someone posts bad or questionable claims, they are the instigator.
      It wasn't sloppy, but you kept taking things out of context. That's not how a thought process works, and that's why you're arguing with me.
      Anyway, what I'm blaming you for is our discussion. You chose to reply to me.
      The fact that you're not interested in having good data suggests your interests lie elsewhere than learning and sharing of accurate information. I think that's a loss for us all.
      I didn't say that... The fact you keep making up this bullshit is a loss for us all. I challenged you multiple times to provide better data than me. If you actually cared about sharing accurate information, you'd comply.
      However, I expect you to care whether your facts are solid. I ask questions and request sources to help you revisit some of your assumptions that I think might be off. Maybe I'm wrong, but then I stand a chance of learning something from your answer.
      Don't insert your opinions in place of mine. I feel my facts are solid. It's you who disagrees, hence why you have to prove me wrong. I'm not here to convince you to share my opinion. That would be arrogant.
      That presumes you have the same goals, however. If not, then it all breaks down, like we've seen, and you end up putting your energy into lashing out rather than possibly learning something you didn't know and possibly educating others, in the process.
      I would lash out less if you'd stop twisting my words and assuming the worst. It's getting old.
      The specific claim is needed, in order to see what they measured and how, so we can judge its relevance to the points under discussion.
      THERE WE GO!!! Just like I said would happen! You were going to shut down whatever source I provide! So predictable.
      Next, it's not talking about the rate of random memory errors, but rather failed DIMMs. A failed DIMM is one with bad cells or other faults that lead to reproducible errors. So, you can have a situation in which the rate of random errors actually increases (e.g. due to shrinking cell sizes, higher frequencies, and decreasing voltages), even while the number of reproducible defects decreases.
      In this context, we were discussing whether RAM was more reliable than drives. This is what happens when you mash up different discussions and think they're all one coherent topic. If I happen to be wrong about that, well, what I just said remains valid. You can't have a productive conversation when at least one side isn't on the same page.
      You're misunderstanding or misreading my statement. I made a claim and provided a source. I didn't know why you thought differently, but figured you must've read it somewhere and politely asked you to share it with us, if so. You don't have to, and if you could simply make a compelling argument that my source did not, in fact, support my claim, I would consider that as well. I see no hypocrisy in that.
      No, I'm not. You're pulling the same thing I asked of you.
      The point I was making was about the on-die ECC. And the way I read the statement, it doesn't sound like that part is optional. Again, if you have a good reason to believe otherwise, please educate us.
      To my understanding, the only approach to ECC for DDR5 is on-die. So for it to be optional means you either have it or you don't.
      I can't even follow this statement. First, ECC DIMMs don't do the ECC on-module, so I presume by "extra chips" you mean the extra DRAM chip. Second, for DIMM-wide ECC, you'd still need extra DRAM chips, which blows a hole in the idea that this is just a cost-optimization for servers. Third, the overhead they're adding is 25%, instead of the current 12.5%, so it's significantly more expensive and therefore not something you'd do if it weren't necessary. Fourth, it sounds like you're agreeing that the industry could produce DDR5 DRAM chips without on-die ECC, if it made sense to do so, in which case why wouldn't PC memory OEMs just use those same chips on their non-ECC DDR5 DIMMs? ...unless there aren't going to be any DDR5 DRAM chips without on-die ECC, because they'd be too unreliable!
      Yes, I was referring to the extra DRAM chip.
      The cost savings is to consolidate everything into a single chip (and presumably have fewer traces).
      Why would there still need to be an extra DRAM chip if the ECC is on-die? The die means the chip, not the PCB. Therefore, there shouldn't be extra overhead.
      What about that was worst-case? Sounds like a best-case scenario, to me.
      It was a worst-case scenario because Intel could charge whatever they wanted since AMD was no longer creating competitive consumer-grade hardware with ECC support.
      I already broke it down for you. There are two affordability problems. One is the minimum spend needed to get ECC, which is a lot more than the cost of the cheapest non-ECC PC. With Intel, you need to step up to at least an i3, and with AMD, you can't use any of their APUs, meaning you have to buy a separate GPU card. That's all before we even get to motherboards or the RAM, itself.
      ...why would someone with a CPU crappier than an i3 need ECC? If you have such a low-end system and still require ECC, cost should not be a factor anymore. These low-end products aren't built for critical tasks, and that's why pre-built PCs sold with them don't come with ECC without the masses complaining.
      For the second affordability issue, let's say you were already going for more than a minimum-spec PC and wanted to get a HEDT Intel CPU. The price difference between some of those chips and their Xeon counterparts is more substantial.
      Fair enough, you got a good point there. Intel does deliberately price gouge you for that.
      The fact that you consider this an insult is part of the problem. First of all, it's about your point and not you. Secondly, it's merely a claim that I back with an argument, which itself is open to counterargument. If the relevance of points cannot be disputed for fear of hurting someone's feelings, then we can't have proper arguments on this site.
      No, the reason we can't have proper arguments on this site is because people like you aren't accustomed to what counts as normal human interaction and take every little thing as literally as possible. I'm on several tech forums, and only on Phoronix do I encounter people like you on a regular basis. And I don't mean just for myself; I sometimes read other threads I'm not involved in and it's just relentless nitpicking over the most mundane things. My first post was only a paragraph, and look what you turned it into.
      Your own words. That's all I'm talking about. Stand behind what you say. That's a pretty basic standard.
      I do stand by my own words, though, I'll admit I phrased myself very poorly there. I meant to say "disk errors occur more often than problems faced caused by RAM errors". I can see why my original statement didn't make sense; even now I'm like "wtf, that's not true at all" haha.
      Am I supposed to argue with points I agree with?
      No, but when you ignore some words or statements, you can remove context behind something you're arguing against. So if you look at what Weasel has been doing, he sees my statement of "ECC isn't a necessity for home users" as "ECC isn't a necessity".
      If you never said it, then it should be a pretty quick and easy misunderstanding to clear up, no?
      Very much so, but that's not possible when you take every minute detail so literally.
      Again, it's a group discussion, by definition. Every post has a Reply button, for all users!
      That's fine, when you're not taking a bunch of random snippets out of context and debating them. You've misinterpreted me way too many times, hence my frustration.
      Not every statement requires or even deserves a citation. Some arguments can stand on the basis of their own internal logic, and facts which are commonly held to be true. You don't have to agree with a claim, but this issue of yours seems to have taken on a life of its own. Rather than a real discussion of the point in question, it seems like you're just using it as a means to distract and redirect.
      I don't really give a shit. The reason I push you for citations is because you kept asking me for them over something I'm not even passionate about, so I was trying to show you how irritating it is. It was of your opinion that my argument didn't stand on the basis of its own logic.
      It says a lot about their underlying motivations.
      My underlying motivation is for you to stop nitpicking people's quotes out of context and taking this stuff WAY too seriously.
      And if you don't read other posts, aren't you missing a chance to learn more about the subject?
      I made my first post when this thread was only 4 pages. The next time I stopped by, it was 9 pages, and a lot of it was debates. I already had 2 people who blew my thoughts way out of proportion by then and I'm not about to be treated like an idiot and let them walk away. Arguing in Phoronix is exhausting enough as-is and half the posts here are depressing. I'm not a masochist.
      So, you were just trying to take a drive-by shit on the topic? You should weigh the value of making a post about something so low-stakes for you. If you're not prepared to back up your claims, perhaps it's just not worth it simply to share an opinion.
      Do you know what an opinion is? Because when opinions can be backed up with enough facts, it's not an opinion anymore, and then there's no more discussion. You know what a forum is for, right?
      Regardless, I did back up my claims. You stated, in your opinion, that you don't support my sources. You still have yet to find better ones. Y'know, for the sake of sharing knowledge!
      So maybe it's not worth arguing your way into an A-B conversation you're not a part of when you should already know that it's some opinion "unworthy" to be shared.
      Last edited by schmidtbag; 10 January 2021, 02:24 AM.

      Comment


      • Originally posted by Jeff View Post
        And how will you be sure, especially without ECC, that it is stable in your particular setup?
        i expect it to work at xmp settings without issues. anyway, thanks for the reply, i'll keep it in mind during the upgrade

        Comment


        • Originally posted by coder View Post
          From what I can find, memory scrubbing only involves writing when a correctable error was found:

          It's probably configurable, but what I've read says that corrected data is written back, if a single-bit error is detected.

          You might offline a DIMM when a double-bit error is detected, but the only way I see that being workable is to simply prevent new allocations from using it. And even that would require you're not interleaving memory channels (or interleaving at page-granularity).
          I do agree that it seems sensible to only write if a corruption is detected.

          Offlining of memory happens after a certain number of single-bit failures, or immediately in the case of a double-bit failure. If the memory module stores only anonymous pages or cache pages, it can be offlined. For other pages, the kernel may have to wait until they are freed, because not every page can be migrated.
          Interleaved memory or not, it does not matter. If you migrate all pages of a certain (physical) address range, then that memory module should be "empty" regardless. The only difference is that you may have two or more memory modules "empty" at the same time.
          You can also manually offline a memory module by echo offline > /sys/devices/system/memory/memoryXXX/state according to kernel.org's documentation of memory hotplugging.
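
A minimal sketch of the sysfs interface mentioned above, assuming a Linux kernel built with memory hotplug support. The helper names are hypothetical. Note that each `memoryXXX` directory is a fixed-size memory block, which is not necessarily the same thing as one physical memory module.

```python
from pathlib import Path

# Location named in the post; only present when the kernel exposes memory hotplug.
SYSFS_MEMORY = Path("/sys/devices/system/memory")


def memory_block_states(root: Path = SYSFS_MEMORY) -> dict[str, str]:
    """Map each memoryXXX block to the contents of its state file."""
    states: dict[str, str] = {}
    if not root.is_dir():
        return states
    for block in sorted(root.glob("memory[0-9]*")):
        state_file = block / "state"
        if state_file.is_file():
            states[block.name] = state_file.read_text().strip()
    return states


def offline_block(name: str, root: Path = SYSFS_MEMORY) -> None:
    """Write "offline" to a block's state file, like the echo command above.

    On a real system this needs root and can raise OSError, for example
    when the block holds pages that the kernel cannot migrate.
    """
    (root / name / "state").write_text("offline\n")


if __name__ == "__main__":
    for name, state in memory_block_states().items():
        print(name, state)
```

Listing the states first is a cheap way to see which blocks exist before attempting to offline one.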

          http://www.dirtcellar.net

          Comment


          • Originally posted by Zan Lynx View Post

            According to Wikipedia (and its sources) SMT was invented by IBM and then first used commercially in the Alpha CPU by DEC. As I remember it, AMD bought most of DEC and their engineers, and DEC technology was a major part of the really great AMD64 design. Although it didn't use SMT.

            Edit: I guess I remembered that wrong. Compaq bought DEC and HP bought Compaq? And DEC technology went into Intel Itanium? But for some reason I was sure AMD64 used some Alpha tech. Hmm.

            So if anyone owes license fees for SMT it sounds like it would be Intel. Although the original IBM patents would have expired decades ago.
            I don't actually know but my guess is that AMD does pay license fees to Intel, not for SMT itself, but for its public-facing interface ("API", register interface, whatever). SMT as a technique is probably something that anybody can implement without paying fees, but if you want to implement it such that you are compatible with Intel's Hyperthreading and thus "emulate" it, you have to pay up. A bit like the Google vs Oracle case in the software world: anybody can write a Java interpreter, but if you want to keep the API compatible, Oracle demands licensing fees.

            Comment


            • Originally posted by schmidtbag View Post
              If you're actively using the PC, you find out quickly because applications often spontaneously crash or your whole OS locks up.
              It has to get pretty bad, for that to happen.

              You're sort of contradicting yourself -- on the one hand, you're making a statistical argument that memory errors are rare and unlikely to cause problems, and here you're making the opposite claim that if there are memory errors, then you'll know because stuff will break.

              Again, there's a blind spot that you have, between the point where the errors are sparse enough to go unnoticed and when things get bad enough to cause loss of time or data. ECC reporting is really your only way to know how long that span of time is, and how quickly the RAM degrades. Further, it's the only early warning you get of problems before they cause loss of time or data.
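
One way to see the ECC reporting described above on Linux is through the EDAC counters in sysfs. This is a sketch that assumes an EDAC driver is loaded for the memory controller; the function names are hypothetical, and `ce_count` / `ue_count` are the per-controller corrected and uncorrected error counters.

```python
from pathlib import Path

# EDAC memory-controller directory; absent when no EDAC driver is loaded.
EDAC_MC = Path("/sys/devices/system/edac/mc")


def read_error_counts(root: Path = EDAC_MC) -> dict[str, tuple[int, int]]:
    """Return {controller: (corrected, uncorrected)} from ce_count and ue_count."""
    counts: dict[str, tuple[int, int]] = {}
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (corrected, uncorrected)
    return counts


def new_corrected_errors(previous: dict, current: dict) -> dict[str, int]:
    """Corrected errors that appeared between two snapshots, per controller."""
    return {
        name: ce - previous.get(name, (0, 0))[0]
        for name, (ce, _ue) in current.items()
        if ce > previous.get(name, (0, 0))[0]
    }


if __name__ == "__main__":
    counts = read_error_counts()
    if not counts:
        print("no EDAC counters found")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Comparing snapshots over time is what gives the early warning: a rising corrected-error count shows up before anything crashes.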

              Comment


              • Originally posted by waxhead View Post
                Interleaved memory or not, it does not matter. If you migrate all pages of a certain (physical) address range, then that memory module should be "empty" regardless. The only difference is that you may have two or more memory modules "empty" at the same time.
                You can also manually offline a memory module by echo offline > /sys/devices/system/memory/memoryXXX/state according to kernel.org's documentation of memory hotplugging.
                Whether it's interleaved determines whether it can be offlined. And by interleaved, I mean that each 64-bits (assuming we're talking about DDR4 DIMMs) alternates which memory channel it goes to. In order to offline a DIMM (or pages of one), you need either a non-interleaved configuration or a memory setup that's at least partially-mirrored (e.g. a server CPU where half of the channels are mirrors of the others).

                In other words, you won't find memory hot-plugging supported on any fully-interleaved configuration.
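
A toy model of the interleaving argument above. It takes the 64-bit granularity from the post at face value and assumes a plain round-robin mapping across two channels; real memory controllers use their own address mapping, so the names and constants here are illustrative only.

```python
INTERLEAVE_BYTES = 8  # 64 bits per step, as described in the post
PAGE_BYTES = 4096     # assumed page size
CHANNELS = 2          # assumed dual-channel setup


def channel_of(addr: int, granularity: int = INTERLEAVE_BYTES,
               channels: int = CHANNELS) -> int:
    """Channel that serves a physical address under round-robin interleaving."""
    return (addr // granularity) % channels


def channels_touched_by_page(page_number: int, granularity: int,
                             channels: int = CHANNELS) -> set[int]:
    """Set of channels that hold at least part of the given page."""
    start = page_number * PAGE_BYTES
    step = min(granularity, PAGE_BYTES)
    return {
        channel_of(addr, granularity, channels)
        for addr in range(start, start + PAGE_BYTES, step)
    }


if __name__ == "__main__":
    # Fine-grained interleave: every page is spread over both channels.
    print(channels_touched_by_page(0, INTERLEAVE_BYTES))
    # Page-granularity interleave: each page lives on exactly one channel.
    print(channels_touched_by_page(0, PAGE_BYTES), channels_touched_by_page(1, PAGE_BYTES))
```

In this model, fine-grained interleaving puts part of every page on every channel, so no set of pages can be migrated away from just one DIMM. With page-granularity interleaving each page sits on a single channel, which is the case where offlining one DIMM's pages is workable.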

                Comment


                • Originally posted by schmidtbag View Post
                  ...why would someone with a CPU crappier than an i3 need ECC? If you have such a low-end system and still require ECC, cost should not be a factor anymore. These low-end products aren't built for critical tasks, and that's why pre-built PCs sold with them don't come with ECC without the masses complaining.
                  There are embedded and industrial use cases where cost or power/heat prevent the use of a faster CPU -- and that amount of compute power often isn't even necessary.

                  AMD has an embedded series of Ryzen APUs that support ECC: https://www.amd.com/en/products/embe...n-v1000-series

                  Intel now uses its Atom branding for this purpose: https://ark.intel.com/content/www/us...-2-20-ghz.html

                  Not all Atom-branded Intel chips support ECC memory, however. I was disappointed to see no Elkhart Lake CPUs so far announced seem to support it. However, I've seen some talk of them now supporting "in-band ECC", which might be the reason.

                  Comment


                  • Originally posted by coder View Post
                    It has to get pretty bad, for that to happen.
                    Not at all. All it takes is a few physically damaged bits here and there and a little time to notice.
                    You're sort of contradicting yourself -- on the one hand, you're making a statistical argument that memory errors are rare and unlikely to cause problems, and here you're making the opposite claim that if there are memory errors, then you'll know because stuff will break.
                    How is that a contradiction? That can be paraphrased as "in the very unlikely event you will face a problem from memory errors, you will know it".
                    Again, there's a blind spot that you have, between the point where the errors are sparse enough to go unnoticed and when things get bad enough to cause loss of time or data. ECC reporting is really your only way to know how long that span of time is, and how quickly the RAM degrades. Further, it's the only early warning you get of problems before they cause loss of time or data.
                    You act like RAM is known to regularly and steadily degrade over a relatively short period of time. This isn't the 90s anymore - RAM may be more sensitive to bit flips than it used to be (due to things like cosmic rays or "dirty" power delivery) but it's also much more robust, in the sense that it doesn't get cumulatively worse. Nowadays, the only reason for RAM degradation is overclocking (and in turn, overvolting).
                    There are embedded and industrial use cases where cost or power/heat prevent the use of a faster CPU -- and that amount of compute power often isn't even necessary.
                    I don't get it... these platforms can come with ECC, so, what exactly is it you're complaining about here if ECC is available in these low-end parts?
                    Most of such systems are so low-end with so little processing that they're not handling any especially critical data. They use slow RAM, which tends to be more stable. Of course, that doesn't mean it's failproof.
                    Last edited by schmidtbag; 10 February 2021, 01:01 PM.

                    Comment


                    • Originally posted by schmidtbag View Post
                      You act like RAM is known to regularly and steadily degrade over a relatively short period of time. This isn't the 90s anymore - RAM may be more sensitive to bit flips than it used to be (due to things like cosmic rays or "dirty" power delivery) but it's also much more robust, in the sense that it doesn't get cumulatively worse. Nowadays, the only reason for RAM degradation is overclocking (and in turn, overvolting).
                      That is exactly what RAM does in its most common failure modes. The big server farms like Google have published data on this and they say that the most common indicator of future errors is past errors.

                      In other words, as soon as a memory stick starts reporting errors, it often starts to get worse.

                      Comment
