Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • Originally posted by coder View Post
    My understanding of patrol scrub is that it slowly walks through RAM and verifies the checksums of the existing contents. While this is not as efficient at finding memory faults as writing carefully-patterned data, it's a lot simpler and you just hope that repeating the process on a continual basis will turn up any faults, eventually.

    That's how RAID scrubbing (sometimes called "consistency-checking") works, anyhow.
    Ok, but this is not quite how I have understood patrol scrub. Unlike what you correctly describe as a standard RAID consistency check, patrol scrub reads a block, verifies the checksums, writes the same block back, and verifies that there is no error, i.e. a READ/WRITE cycle. This is supposed to exercise all memory and catch single-bit errors early, so that one can offline a memory module before it goes completely bonkers.




    • Originally posted by sandy8925 View Post

      True, I do agree that ECC everywhere would be good to have. The only reason we don't use it is the higher cost.
      Actually, in terms of components, ECC is only marginally more expensive than non-ECC memory. You basically have an extra memory chip that stores the parity data, which is in the ballpark of 10-15% of the cost.
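
      To see where that ballpark comes from (a rough sketch; the 8+1 chip layout is the common ECC DIMM design, and module pricing adds a premium on top of raw silicon):

          # One extra check-bit chip per 8 data chips on a typical ECC DIMM rank:
          data_chips = 8
          ecc_chips = 1
          extra_silicon = ecc_chips / data_chips     # 0.125 -> 12.5% more DRAM
          print(f"extra DRAM: {extra_silicon:.1%}")  # in line with the 10-15% ballpark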

      The main issue (which is what Linus is actually complaining about) is that Intel artificially segments the market, so that the motherboards and CPUs that properly support ECC memory (typically server parts) are much more expensive than consumer ones.

      So in an ideal world where all motherboards/CPUs already came with ECC support, you probably wouldn't even be able to get non-ECC memory in midrange and higher systems (only your $200 Walmart laptop would ship memory without ECC).



      • Originally posted by coder View Post
        Save the post-modern relativistic BS for literature class essays, please.
        Keep your unproductive comments to yourself, then. You instigated this.
        Evidence of what? I'm not the one making the claim, here. You did, and you cited a source which doesn't support it. That's different than saying your claim is incorrect.
        Evidence that I'm wrong. I made a claim and cited a source. You don't like the source. I could just keep posting sources but you'll just reject them anyway, so, it's your move. I don't give a shit if you like my opinion, I owe you nothing. You're the one who intervened, so it's your problem now.
        I find it interesting that you didn't even link it. That's hardly the mark of an excellent source.
        I would think the name Puget alone would be enough, but fine, here's the source:
        At Puget Systems, one of the most important things we track in our workstations is the failure rates of individual components. Overall, 2018 was a very good year for hardware reliability with about half as many parts failing this year versus 2015, 2016, or 2017. But what models were the best of the best?

        According to Anandtech's latest DDR5 coverage, it is.
        If you have a better source to the contrary, please share it for our collective education.
        Oh, so now you agree that if someone doesn't like a source, they have to provide their own? Hypocrite.
        In either case, how about you read the whole article?
        To quote it directly:
        As we know from the official DDR5 specifications, each module will include on-die ECC for cell-to-cell data coherence (module-wide ECC is still optional).
        The default configuration appears to include ECC. That doesn't mean all will have it.
        It's evidence that single-bit error frequencies are indeed becoming too high, with newer cell sizes. Otherwise, why would they burn the overhead on it? In this case, it's yet worse than DDR4, with an overhead of 25%, instead of a mere 12.5%!
        Cite your sources about the evidence.
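
        (For reference, one way to arrive at those two figures, assuming the commonly cited side-band layouts of 64+8 bits per DDR4 channel and 32+8 bits per DDR5 subchannel; on-die ECC code rates are specified separately:)

            # Check-bit overhead under the assumed channel layouts:
            ddr4 = 8 / 64   # 8 check bits per 64 data bits      -> 12.5%
            ddr5 = 8 / 32   # 8 check bits per 32-bit subchannel -> 25%
            print(f"DDR4: {ddr4:.1%}, DDR5: {ddr5:.1%}")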
        That doesn't even make sense. The consumer market is large enough to support a different set of DRAM chips, to the extent it makes sense to do so. For each client CPU sold, there will be about a couple dozen DRAM chips accompanying it. Compared to CPUs, DRAM chips are tiny and simple. If Intel can justify at least 4 different CPU dies in each generation (or about 7, if you include laptops), then surely DRAM makers can afford to design a separate chip for servers vs. clients!
        Remember, we're talking DDR5 now, where ECC will be the norm. So yes, it does make sense. You want to reduce the cost of mass production, and integrating it is most likely cheaper than having separate chips. Those who don't care about ECC (like budget phones) will likely cut costs and go with chips that don't have it.
        Given that Intel's standard desktop CPUs do not support them (with a few exceptions), how are they not niche? Probably no more than 10-15% of the desktop board models out there support ECC. And those are mostly premium models that cost much more than average. I don't know what you consider niche, but I think that fits most people's definition.
        Let's clear things up a bit: you said niche to describe how there's a limited availability of desktop platforms with ECC support. If you account for existing desktop PCs sold, yes, that would be niche. However, the vast majority of people with such platforms don't care about ECC. Of the ones who do, it isn't at all difficult or expensive to get a computer with ECC support.
        The point isn't that it's cost-prohibitive for most, but that it's a nontrivial difference, for many.
        No, it isn't. Back in 2014 when Intel was basically a monopoly, I built a PC for someone with a 4c/8t Xeon and ECC. The Xeon was roughly the same price as the equivalent i7 (but lacked an iGPU, which he didn't need) and the motherboard was maybe $15 more expensive. That was a worst-case scenario and it yielded good results.
        With today's competition, it's really not a whole lot different. I already linked to a decent Xeon with ECC support for a reasonable price. If you find the motherboards are too expensive (which they're probably not but I'm too lazy to check), go with AMD.
        Getting ECC affordably is a non-issue. If it's really that important to you and you're on that tight of a budget, don't go with Intel.
        This is a community forum, where all users can read, post, and reply to all messages. If you think my behavior is out of line, you're free to take it up with the mods, but I'll point out that I'm not the one hurling insults.
        You responding to me isn't out of line. Making demands about a conversation you weren't a part of is.
        And yes, you are hurling insults. You call my points irrelevant, you question my credibility, and you [falsely] claim I'm strawmanning.
        As for not being the one who brought it up, it's your exact words that I quoted. Don't say things you can't back up, or at least be a decent person and admit when you've done so.
        No, actually, I didn't bring it up. Read back through the whole thread: I did not bring up hard drives.
        "be a decent person" - there goes another insult, hypocrite.
        If you consider expecting you to stand by your words obnoxious, then I guess so.
        I have been standing by my words, you're just picking the ones you feel like arguing with and ignoring the rest.
        All I expect is for people not to play fast and loose with the facts, to be accountable for their statements, and to maintain a basic level of decorum. You don't have to concede anything, but it's hard to have a productive discussion when one party is twisting and shifting their position and refusing to be pinned down. We can certainly agree that a point is irreconcilable and move on, but that at least takes some agreement on what point is in dispute.
        Then practice what you preach and provide counter-sources.
        I agree with your 2nd sentence. So how about you stop jumping to conclusions about things I never said when you insert yourself into a conversation you weren't a part of?
        I already established what my 2 points are. Anything else we're arguing about (which is most of it) is stuff I didn't bring up.
        For opinions to carry weight, the details matter. That's why I'm focusing on details. I trust you know the difference between an informed and an uninformed opinion? There can also be misinformed opinions and underinformed opinions. What I'm trying to do is help nail down the details, so that more people can hopefully hold more informed opinions (myself included).
        I agree. But you're not doing your job when you disagree/disapprove and don't provide counter-sources. I support my way of thinking the way I feel makes sense. It's not my problem if you don't share that view. Remember: you responded to me.
        Maybe if your interest extended beyond winning what you perceive as arguments, you'd see a different theme in my contributions to this thread. Maybe not. But, if you're only reading my replies to you, and if you view this as a zero-sum interaction, then it can't help but shade your perspective.
        How exactly are you doing anything different than me? You instigated an argument with me.
        I haven't been reading your contributions to this thread other than the ones you've sent to me, because I don't care, and I'm not interested in inserting myself in the middle of someone else's debate. Contrary to what you might believe, I don't care about this topic that much in general. I only came here to say that Linus is overblowing the severity of this problem. I'm not even saying he's wrong.
        I had not proposed to use it in the way you suggested. Nobody did. It looked to me like you set up that strawman and burned it down.
        For someone who acts holier-than-thou, you should understand the difference between a miscommunication/misinterpretation and pulling a logical fallacy. From what I can tell, we agree more than it seems, but you're pushing things to unproductive extremes.
        Like we keep telling you, the reactive approach puts your data at risk. I get that you're fine with that level of risk, but it's nonzero for sure. It's up to the individual to value their own time and data. I know that if my ECC RAM saves me from data loss, I'd indeed consider it a good return on investment.
        As I said, it's like an insurance policy. Insurance doesn't always get claimed. But, when it does, it's usually much appreciated.
        I agree with all of the above. Never suggested otherwise.
        I apologize that it wasn't clear. Not everyone needs Kevlar attire, but we're all at some risk of being shot (in this analogy). I think we agree that the level of risk and exposure varies.
        For once, we're circling back to my original point. Yes, the level of risk does vary. If you're walking around the ghetto in Baltimore, a bulletproof vest isn't a bad idea. If you're taking a hike in the woods, you don't need one. That doesn't mean a hunter won't mistake you for a bear, but there's a good chance you'll come home without holes in your body. So I agree: no matter where you are, you have some risk of being shot, but the hiker doesn't need to worry about it. That being said, a home PC without ECC isn't a major issue. Can a major failure happen? Absolutely, but is it something we need to make a fuss over? I would say no.



        • Originally posted by coder View Post
          This is perhaps too trivial an example, but if we take the case of a document, what if the error occurred on a different page than where the user is editing?
          As I said, if the document was encrypted then it would cause a cascading effect. It would likely affect most, if not all pages. But really, the more likely scenario is the whole program would crash. Or to be even more realistic: any program that is anal enough about security to encrypt the RAM for a text file likely has software-level error detection. If a bit flipped, it would be caught by the application.
          If the document wasn't encrypted, then you would probably only see 1 character change, which statistically is not going to cause a major issue. Statistically speaking, you would write more typos than your RAM would cause.
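
          (A toy Python experiment, not a claim about any real editor, makes the point: flip one random bit in a plain-text buffer and count what actually changes.)

              import random

              page = bytearray(b"The quick brown fox jumps over the lazy dog. " * 100)
              original = bytes(page)
              bit = random.randrange(len(page) * 8)   # pick one random bit
              page[bit // 8] ^= 1 << (bit % 8)        # simulate a single-bit memory error
              changed = sum(a != b for a, b in zip(page, original))
              print(f"bytes changed: {changed}")      # always exactly 1 byte/character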
          Also, a bit-error can have a disproportionate impact, such as shifting a memory address or array index by a large amount. Usually, such changes would result in a program crash, but they could instead have the effect that part of the document goes missing or is replaced by a copy of some other part.
          I agree, but that's still something you would most likely notice.
          Finally, the very process of saving a document usually involves a number of copies and transformations, during any of which a memory error could corrupt what's eventually persisted to nonvolatile storage.
          That is true, but if you're that concerned over such a thing, you can always try to re-open the document to see if it loads properly. Though if you're really that paranoid, just get a system with ECC.
          You seem to be contradicting your earlier position that "simple" computer users don't need ECC memory because errors are likely to occur in unimportant data, and therefore will go unnoticed.
          I'm not understanding how that contradicts anything... Memory errors aren't super common, and when they do happen on the average PC, they're not affecting important data. So if the probability of being affected is that rare, you don't need ECC.
          Usually, bad RAM gets noticed only when it's so bad that it leads to program or OS instability. However, even before that point, it quite plausibly could've corrupted some of a user's data. I see it as a continuum, rather than the sort of sudden cliff that you suggest.
          In my experience, when RAM gets faulty, it's pretty abrupt. That's not to say you're wrong, but once RAM has a physical defect, it doesn't take long to notice there's a problem.
          His basic scenario is as legitimate for them as anyone else. It's just a matter of how much content they're editing, how long it sits in RAM, and how susceptible it is to memory errors. Uncompressed image data is probably the most resilient to errors, while anything that's highly-structured is probably the least.
          It is a legitimate scenario, I didn't say it wasn't. But if you're doing something important enough to warrant encrypted memory, you should be investing in ECC, which you're not going to get from some crappy Dell you pulled from Walmart. The average home PC isn't doing anything that important, and the average person doesn't take such things too seriously.



          • Originally posted by mdedetrich View Post

            Actually, in terms of components, ECC is only marginally more expensive than non-ECC memory. You basically have an extra memory chip that stores the parity data, which is in the ballpark of 10-15% of the cost.

            The main issue (which is what Linus is actually complaining about) is that Intel artificially segments the market, so that the motherboards and CPUs that properly support ECC memory (typically server parts) are much more expensive than consumer ones.

            So in an ideal world where all motherboards/CPUs already came with ECC support, you probably wouldn't even be able to get non-ECC memory in midrange and higher systems (only your $200 Walmart laptop would ship memory without ECC).
            AMD also artificially segments the market. As Ian Cutress has stated, only the PRO business versions of AMD CPUs officially support it. On the normal versions, it may or may not work; it depends on your motherboard. Given AMD's usual state of quality even for supported features, I wouldn't trust that it works even decently.



            • Originally posted by waxhead View Post
              Ok, but this is not quite how I have understood patrol scrub. Unlike what you correctly describe as a standard RAID consistency check, patrol scrub reads a block, verifies the checksums, writes the same block back, and verifies that there is no error, i.e. a READ/WRITE cycle. This is supposed to exercise all memory and catch single-bit errors early, so that one can offline a memory module before it goes completely bonkers.
              My understanding of patrol scrubbing is that it cycles through every memory location, reads and checks against ECC bits, and writes back in the event of a correctable error. The primary purpose is to catch and fix single bit errors before they turn into uncorrectable double-bit errors, but it serves other useful purposes as well like giving a chance to catch problems in memory that isn't being used much at the current time.
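
              As a runnable toy (Python, with triple redundancy standing in for real SECDED codes, which live in the memory-controller hardware, not software), the scrub loop has the same read/check/correct/write-back shape:

                  import random

                  # Each "location" stores its value 3x; majority vote stands in for ECC check bits.
                  memory = [[42, 42, 42] for _ in range(1000)]
                  memory[random.randrange(1000)][1] ^= 0x10   # inject one "bit flip"

                  corrected = 0
                  for addr, copies in enumerate(memory):          # patrol every location
                      vote = max(set(copies), key=copies.count)   # majority vote ~= ECC check
                      if copies != [vote] * 3:
                          memory[addr] = [vote] * 3               # write back corrected value
                          corrected += 1                          # would be reported via MCA
                  print(f"scrub pass complete, corrected {corrected} location(s)")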

              One thing worth mentioning is that at least for AMD CPUs we have a lot of other ECC-ish reliability features running at all times on caches, data paths and other blocks. System memory and on-chip RAS features all report up through the Machine Check Architecture (MCA) subsystem.

              Someone mentioned that Puget Systems was reporting that RAM was becoming more reliable - this surprised me a bit. It's possible that RAM is becoming more reliable on a per-bit or per-byte basis, but since we are also using ever-increasing amounts of memory my impression was that aggregate reliability was going down rather than up, and that ECC was becoming more important rather than less.

              One interesting exercise is to put a bunch of CPUs and GPUs in a box and then figure out how long they will run before the first memory error. I was horrified when I did the math for one of our early supercomputer prototypes, and had to check with our RAS architect to make sure I was doing the math correctly. It was something like half a day.
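
              (For anyone who wants to redo that math: the usual form uses FIT rates, failures per 10^9 device-hours. The numbers below are illustrative assumptions, not our measured values.)

                  fit_per_gb = 100                    # assumed DRAM error rate per GB, in FIT
                  total_gb = 256 * 1024               # say, 256 TB of memory in the machine
                  total_fit = fit_per_gb * total_gb   # aggregate rate for the whole system
                  hours = 1e9 / total_fit             # expected time to first error
                  print(f"~{hours:.0f} hours to first memory error")
                  # ~38 hours here; a bigger machine or higher FIT rate lands at half a day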

              Last point is that systems used primarily for graphics tend to be more tolerant of memory errors than systems used primarily for compute or other processing, simply because the human eye and visual system are so good at dealing with minor errors. It doesn't help with crashing, but even there, the nature of graphics code is that a lot of the critical code lives in CPU cache (which is ECC'ed on our products, AFAIK) and so doesn't get read from RAM that often.



              • Originally posted by coder View Post
                Though I doubt you're being serious, I disagree. Videogame consoles and video streaming devices seem to do alright without it, and it's difficult for me to see what positive impact it can have for them that would offset the downsides of its added cost.

                The law is a very blunt instrument and undermines (potentially) intelligent decisions and tradeoffs made by designers and engineers. If anything, laws should focus on disclosure of a device's data integrity properties. It's not so crazy, if you consider we have energy-efficiency labeling requirements for automobiles and appliances, and we have nutrition labels on food. But, since politicians skew older, less technical, and tend to have non-engineering backgrounds, they'd probably manage to bollocks it up.
                I am 100% sure video game consoles and video streaming devices would profit from ECC too.
                Why make another stupid label no one follows when you can just ban non-ECC RAM?

                Most people ignore the labels anyway.



                • Originally posted by bridgman View Post
                  ECC was becoming more important rather than less.
                  Do you remember the yellow bug I reported to you years ago?
                  Many people have it now: https://www.phoronix.com/forums/foru...e5#post1229970
                  I already replaced 3 TVs/monitors with HDMI 2.0 devices to avoid this bug.
                  It looks like the AMD open-source driver is broken with HDMI 1.0-1.4.
                  Why not develop a driver GUI or options in the GNOME control center for users to set a fix for that problem?

                  I am very error-tolerant and will even replace a monitor if it has bugs, and I do not blame the AMD GPU if the bug is in the monitor. But other people are not as error-tolerant as I am.
                  That means a GUI to set such options would be nice for other customers.

                  Also, I have a question: could AMD produce GPU cards with ECC and 1-2 VM instances (instead of the full 16/32)? Social-engineering attacks using games with a trojan horse included show that we had better run untrusted games in a VM.

                  Now that the 6800/6900 is out, can we have the Radeon VII/Vega 20/AMD Radeon Pro VII with the full 4096 shaders? The Apple privilege of selling the 4096-shader version should be over now that the 6800/6900 is out...

                  And another question: can we have a GPU card that avoids HDMI patent payments by being DisplayPort-only, and also avoids GDDR6 patent payments by using DDR5 with Infinity Cache?
                  It would be good to have low-cost alternatives without any patent payments.



                  • Originally posted by Qaridarium View Post
                    Do you remember the yellow bug I reported to you years ago?
                    Many people have it now: https://www.phoronix.com/forums/foru...e5#post1229970
                    I already replaced 3 TVs/monitors with HDMI 2.0 devices to avoid this bug.
                    It looks like the AMD open-source driver is broken with HDMI 1.0-1.4.
                    Why not develop a driver GUI or options in the GNOME control center for users to set a fix for that problem?

                    I am very error-tolerant and will even replace a monitor if it has bugs, and I do not blame the AMD GPU if the bug is in the monitor. But other people are not as error-tolerant as I am. That means a GUI to set such options would be nice for other customers.
                    Only very vaguely - are you saying there is a setting today that can work around the issue, or just that it would be good if we had something like that?

                    Originally posted by Qaridarium View Post
                    Also, I have a question: could AMD produce GPU cards with ECC and 1-2 VM instances (instead of the full 16/32)? Social-engineering attacks using games with a trojan horse included show that we had better run untrusted games in a VM.
                    I did ask about this - turns out that quite a few of our datacenter customers only run a single instance, looking for isolation more than sharing AFAIK. That makes producing a consumer card with SR-IOV more problematic. In the short term our focus has been improving the pass-through experience although I do agree we're probably going to have to do something for consumer card sharing at some point.

                    SR-IOV adds a fair amount of hardware cost so I'm not sure it is the best approach for consumer solutions though.

                    Originally posted by Qaridarium View Post
                    Now that the 6800/6900 is out, can we have the Radeon VII/Vega 20/AMD Radeon Pro VII with the full 4096 shaders? The Apple privilege of selling the 4096-shader version should be over now that the 6800/6900 is out...

                    And another question: can we have a GPU card that avoids HDMI patent payments by being DisplayPort-only, and also avoids GDDR6 patent payments by using DDR5 with Infinity Cache? It would be good to have low-cost alternatives without any patent payments.
                    I'm not 100% sure, but since both MI and Pro cards are only shipping with 3840 shaders enabled, I suspect that is where the chips are yielding out. Process yields do improve over time, so a fully configured chip should presumably become more doable, but my impression was that there was not a big performance difference between 3840 and 4096 shaders on most workloads.

                    Dropping HDMI on some SKUs seems problematic - low end cards are the most likely to need HDMI, while the savings on anything but the least expensive cards seems too small to justify the costs of carrying another SKU. I don't *think* we would need to actually remove the logic from the chip, but if another chip was required that would be a non-starter for sure.

                    DDR5 + Infinity Cache is an interesting thought for low end products.



                    • Originally posted by piotrj3 View Post
                      It is not. Some companies like Gigabyte claim they support "ECC memory" on their motherboards, but that doesn't include validation or correction on consumer-grade boards, which fragments things further. As far as I know, only Asus supports proper validation/correction on motherboards, and only on some of them; everyone else is a "no". ASRock, I think, only claims ECC correction with those "PRO" processors.
                      So apparently you were able to get an answer by looking at the motherboard specs.

