Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • Originally posted by Zan Lynx View Post

    This has to be old information. I am running with ECC RAM on this Ryzen 5950 and its latency stats are right in line with any other 2,666 MHz CL19 RAM. In other words, it isn't great, but that has nothing to do with the presence of ECC. It's because the RAM manufacturers don't bother to produce high speed ECC. If they did, it would be exactly the same.

    AMD (and Intel, I assume) build the RAM controller with integrated features like ECC, AES encryption, PCIe, and cross-CPU routing (for EPYC). All of this runs inline at full RAM speed. At the transistor level this is not hard.
    *looks at his DDR4 3,600, CL16 kit*
    *decides to move on*



    • Originally posted by magallanes View Post
      About Linus and ECC: bollocks.
      Check your facts before you fact-check others.

      Originally posted by magallanes View Post
      What is the problem with ECC?
      * It costs more, because instead of 8 modules we need 9 modules plus a chip.
      * It adds latency. Why? Because the tiny chip has to validate every operation.
      This is wrong. Unbuffered ECC memory doesn't have an extra chip doing the checking, because the CPU's memory controller does the error checking and correction. This is superior not only because the memory controller is faster, but also because it can check and correct errors introduced between the DIMM and the CPU.

      The extra chip you sometimes see is for registered memory (which is often ECC memory, but needn't be). It's used to buffer memory operations to reduce the electrical load on the CPU, which in turn enables the system to accommodate more DIMMs per channel and run them at higher speeds.

      Originally posted by magallanes View Post
      For instance, ECC does not automatically correct memories.

      Usually, ECC takes the following steps:
      * The system reads some region of memory, including the parity bits.
      * If the parity check fails, it tries the read again; that is the "correction", and yes, it is flaky, usually doesn't solve the problem at all, and usually continues to the next step.
      * If the memory fails again, the system halts (or enters an interactive/maintenance mode). If you are using an expensive server, it may show some lights and even say which module failed.
      This is also incorrect. If there's a single bit error, it's corrected. Double-bit errors are logged. I don't know whether the CPU tries to re-read on double-bit errors, but I doubt it.
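
      To make that concrete, here's a toy SECDED (single-error-correct, double-error-detect) sketch in Python, using an extended Hamming code over 8 data bits. Real ECC DIMMs use a wider (72,64) code implemented in the memory controller, but the behavior is the same idea: one flipped bit is silently corrected, two are detected but not correctable.

      # Toy SECDED: extended Hamming code over 8 data bits (13 code bits).
      # Bit 0 is an overall parity bit; bits 1, 2, 4, 8 are Hamming parity bits.

      def encode(data):
          code = [0] * 13
          for i, pos in enumerate([3, 5, 6, 7, 9, 10, 11, 12]):
              code[pos] = (data >> i) & 1  # data bits fill the non-power-of-two slots
          for p in (1, 2, 4, 8):
              # Parity bit p makes the XOR over all positions containing bit p zero.
              for pos in range(1, 13):
                  if pos != p and (pos & p):
                      code[p] ^= code[pos]
          for pos in range(1, 13):
              code[0] ^= code[pos]  # overall parity is what enables double-error detection
          return code

      def decode(code):
          syndrome = 0
          for pos in range(1, 13):
              if code[pos]:
                  syndrome ^= pos  # XOR of set positions; equals the error position
          overall = 0
          for bit in code:
              overall ^= bit
          if syndrome == 0 and overall == 0:
              return "ok"
          if overall == 1:             # odd number of flips: assume one and fix it
              code[syndrome] ^= 1      # syndrome 0 means the parity bit itself flipped
              return "corrected"
          return "uncorrectable"       # two flips: detected and logged, not fixable

      word = encode(0b10110010)
      word[6] ^= 1                     # single bit flip
      print(decode(word))              # -> "corrected" (silently, like ECC hardware)
      word[6] ^= 1; word[9] ^= 1       # two bit flips
      print(decode(word))              # -> "uncorrectable"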

      Anyway, the rest of your claims depend on the OS and how it's configured. I've personally seen recent SuSE Linux systems log even double-bit errors and continue running. I suppose a system could drop into maintenance mode, which could make sense if you want to minimize the potential for eventual corruption of data that subsequently reaches disk, but I don't see the point of immediately halting.
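
      For reference, this is roughly how you'd inspect those logged counts on Linux: the kernel's EDAC subsystem exposes per-memory-controller corrected and uncorrected error counters in sysfs. A minimal Python sketch (assuming an ECC-capable platform with EDAC drivers loaded; paths may vary by kernel):

      import pathlib

      # Each mc* directory is one memory controller reported by EDAC.
      for mc in sorted(pathlib.Path("/sys/devices/system/edac/mc").glob("mc[0-9]*")):
          ce = (mc / "ce_count").read_text().strip()  # corrected (e.g. single-bit) errors
          ue = (mc / "ue_count").read_text().strip()  # uncorrected (e.g. double-bit) errors
          print(f"{mc.name}: corrected={ce} uncorrected={ue}")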

      Originally posted by magallanes View Post
      Also, memories are way less prone to fail than in the past.
      Do you have a source for this? I have personally seen RAM go bad over time. Memory that once tested fine can eventually start encountering errors with sustained use. IMO it's probably the second most common server component to fail, after HDDs.
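
      If you want a crude userspace check between proper memtest runs, a sketch like this parks a pattern in RAM and re-checks it later. It's nowhere near as rigorous as memtest (and the OS may swap the buffer out), but persistent mismatches are a red flag:

      import time

      buf = bytearray(b"\xA5" * (256 * 1024 * 1024))  # park a 256 MiB pattern in RAM
      time.sleep(3600)                                # let it sit for an hour
      bad = len(buf) - buf.count(0xA5)                # any byte that changed is suspect
      print(f"{bad} corrupted bytes")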



      • Originally posted by mathew7 View Post
        This is a topic about ECC for memory, not disk corruption. ECC is, for RAM, the equivalent of the checksums you keep promoting.
        Stop thinking of memory data as files. Your file is organized data, but in memory it will be spread all over. And guess what: your checksum routine will be executed from RAM, over contents in RAM. That's how CPUs work.
        If you keep data in RAM for a long enough time, it can get corrupted (a bit flipped). So you compute a checksum over corrupted data, and you write that to disk, make a backup, etc. All on corrupted data.
        I know what the topic is, but people keep bringing up corrupt data on a drive as though it is assuredly caused by a flipped bit. The reality of the matter is: there's a very slim (but not impossible) chance that's the case.
        I'm well aware the checksum is executed from RAM. Whether a bit flipped when you made your copy or whether a bit flipped during the checksum (or both), you're still going to get mismatched results. So what's your point?
        File transfers don't keep the entire file in memory, so it won't remain in RAM long enough...
        The difference between servers/workstations and regular PCs: regular PCs are meant to be restarted daily, and if not, Windows symptoms would make you restart at least monthly. Also, home/small-business PCs are used for small workloads, where a compounding error could be caught by a human and redone, or would just cause crashes; and they also wouldn't be in a high-EMF area.
        Yes, and those differences are pretty substantial, which is why ECC is considered an implied necessity for servers and workstations. Much of what you described is exactly why ECC isn't a necessity for home use.
        Actually I see you did not diagnose with memtest and did not read what I said.
        Memtest tells you what pattern it expects and what it found... and I had exactly 2 addresses with a 1-bit difference in each within 5 seconds of the test, and no other errors during the following hour (when I stopped it)... so ECC would have spared me the headache of working out why I had to press reset after each power-on of the day.
        You didn't describe the situation very clearly, but I understand better now (I think, unless you just further described the situation poorly). Regardless, you didn't think to just run the test again, rather than jump to the conclusion that your RAM was defective? I get it, ECC would have prevented the error, but what exactly is the point you're trying to make here? I'm well aware that a flipped bit is inevitable.
        The consumer cost of implementing ECC is just in the memory module: as I said, +12.5% capacity, for which you pay 100% more because of the low production volume of such modules (which could come down massively if it were widely implemented). The CPU already has it (but disabled in i5/i7/i9, which is Linus's main point), the memory slot has pins for it, and the motherboard cost is negligible for 16 more lines (dual channel) between the CPU and the RAM slots (+$1??). The memory controller in the CPU is the one that does all this, even for reads/writes coming from PCI(e), and it requires a single extra clock cycle to do any correction (for a complete burst access). Oh... and CPU caches already have ECC enabled.
        If ECC became the only type of memory you could get, it wouldn't be low-production anymore.
        ECC for a CPU cache is an absolute necessity, regardless of what workload you have.



        • Originally posted by sandy8925 View Post
          Pretty sure most people don't know what CPU their devices are using either, at least on mobile phones and tablets. If you ask Apple fans they'll say it's Apple's A12 CPU, and they'll believe that it's some entirely custom-made, from-scratch Apple CPU and not ARM.
          Uh, it's both: it is an entirely made-from-scratch Apple CPU, and it's also true that it implements the ARMv8-A ISA.

          Originally posted by sandy8925 View Post
          Only reason we know on desktop is because Intel and AMD hardware usually comes with their stickers plastered on.
          Windows shows it in the system info (or you can use 3rd party tools) and Linux has /proc/cpuinfo.
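
          For the Linux case, a trivial sketch of pulling the model string from /proc/cpuinfo (no third-party tools needed):

          # Print the CPU model as reported by the kernel.
          with open("/proc/cpuinfo") as f:
              for line in f:
                  if line.startswith("model name"):
                      print(line.split(":", 1)[1].strip())
                      break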

          You could make the same sort of claim about any product spec or feature, like the number of cylinders in a car engine or the type of display technology in a TV or monitor. Yes, it's not apparent without opening it up or turning it on what's in the guts, but that's not to say we can't know or that it's irrelevant.
          Last edited by coder; 04 January 2021, 03:50 PM.



          • Originally posted by schmidtbag View Post
            If ECC became the only type of memory you could get, it wouldn't be low-production anymore.
            The extra DRAM chip on DIMMs and extra motherboard traces will always add cost, but we just don't want end-user prices to be disproportionate. If there's enough demand, then the prices should converge towards the underlying cost differences.

            Originally posted by schmidtbag View Post
            ECC for a CPU cache is an absolute necessity, regardless of what workload you have.
            I'm pretty sure that a lot of CPUs don't have ECC for at least one level of their cache hierarchy, though I haven't been paying close enough attention to cite specifics. I've just seen it explicitly called out when a CPU cache does have it.

            Two reasons why it might be no more essential than for DRAM are that caches tend to be SRAM and that data tends not to sit in cache for long. The main argument for having it, even when not using ECC DRAM, is probably that if a CPU cacheline goes bad, you have to replace the entire CPU. However, if the CPU is relatively inexpensive and that's a rare enough event, then there could still be an economic argument (i.e. from computer OEMs' perspective) that you don't need it.
            Last edited by coder; 04 January 2021, 03:49 PM.



            • Originally posted by schmidtbag View Post
              That's why you checksum against the source data...
              You are beyond clueless. You literally have no idea how a computer works, so please stop wasting people's time.

              Source data gets cached into RAM (filesystem access cache), or is even created in RAM (e.g. you use software to edit it, depending on what kind of data it is, then save it). The bit flip happens in RAM, either in the data itself or in the code that deals with saving it.

              Originally posted by schmidtbag View Post
              Right... I'm the moron... Defects in drives are not insignificant. They're more likely to affect the average person than a flipped bit.
              Couldn't care less about the average person who has no backups, in that case bit flips or not are irrelevant to begin with.

              Originally posted by schmidtbag View Post
              Wow you are dense. A bad sector is also a silent problem. You might not know something was corrupt until after you try to open it.
              It's not silent because you know when you access it. You'll never know a silent ECC error until you actually verify the data manually (not with checksum!). i.e. Load it and see if it works or is fine. Play a video. And so on. But by that time it could have propagated to all of your backups!

              Originally posted by schmidtbag View Post
              No, you're just blowing things way out of proportion, like you always do. I use multiple computers for most of my waking hours. 3 of them are on 24/7, and none of those have ECC. My life has been like this for years. Never once have I faced a noteworthy issue caused by bit flipping. Has it happened? Yes, obviously it must have, but I'm not some special snowflake who gets in a hissy fit because the 700th selfie resulted in a crash, or the hardcore dungeon porn artifacted for one frame.
              You sound like those Karens who shared their oh-so-valuable opinions about COVID-19. Their lives have been like that for years, they never got the virus, it's insignificant or a conspiracy, they refuse to wear masks and vaccinate and all. Of course, you stop hearing from them after they get the virus, but the ones who still haven't will continue with their bullshit, making it seem like nothing changed at all.

              Even if bit flips affected just 0.1% of the population, that's already more than enough to consider it an absolute necessity. That's still millions of people. And nobody cares that it didn't happen to the other 99.9%, because for the 0.1% it's far too late already.

              If we used your logic, we wouldn't have any protections against airplane crash disasters, because they're even rarer. Let's save money, right?



              • Originally posted by sandy8925 View Post
                Lol, what? Almost all software has bugs.
                I'm trying to distinguish bugs that could plausibly be caused by memory errors from other sorts of bugs, and I'm not talking about whatever XYZ phone app, but about things like core OS functionality and web browsers.

                When someone makes the claim that you don't need ECC because software sucks worse, there are two flaws with that argument:
                1. It presumes memory errors are uniformly distributed (they're not -- some DIMMs have way more errors than others).
                2. You have to restrict yourself to the types of software bugs that could produce the same sorts of behaviors as memory errors. And while it's difficult to bound the behaviors produced by memory errors, any bug that's reproducible is likely not due to a memory error.



                • Originally posted by Citan View Post
                  EVERY file is important for EVERY user. I'll put aside the big, stupid condescension that comes from your post (like "those people are not professionals, so their data has no interest, hence no value in the first place"), which is impressively ridiculous.
                  People are students, independent workers, people working remotely from home, etc.
                  EVERYONE has "production" documents they care about, because they express some project or task.
                  But let's even consider a hypothetical world where people outside enterprises have only "non-productive" data like music, personal photos, or videos. The latter are memories, expressions of their lives. Why in the world would they not have a right to be guaranteed their integrity over time?
                  The reason I draw a distinction is:
                  1. The size of high-value data, relative to memory capacity.
                  2. The potential for errors to get persisted or for corruption to multiply.

                  If we look at point #1, a document someone is editing will occupy only a tiny amount of RAM. Your error rate would have to be extremely high for there to be a reasonable likelihood of it getting corrupted, and long before it gets that bad, the machine would be so unstable that it would be unusable and the user would replace it. While photos and videos are much larger, they're also fairly error-tolerant.

                  Compare that to a large CAD model that a workstation user might edit, and the math starts to look a lot different. Worse, you could have a scenario where an error creeps into the model at some point and gets persisted, contaminating all subsequent iterations of the model. If, at some point it's finally noticed, the user might have to revert many iterations to get rid of it, resulting in much lost work. And if it's not noticed until production (or later), the cost of the error could multiply further.

                  Regarding point #2: for servers, bad RAM can mean serving up the same (or different) incorrect data to multiple clients, on multiple occasions. Those clients might edit the data and resubmit it to be saved, resulting in the errors being persisted. Even worse would be if memory errors actually corrupt filesystem datastructures, which could result in much larger data corruption that you hopefully notice soon enough to revert to a backup, but maybe not before users have made additional changes that would also be lost by reverting.

                  Suffice it to say that the economics of data integrity look a lot different for "workstation" datasets and server usage patterns than for typical end-consumer client PC usage. That's not to say I don't wish everyone had reliable computing devices with excellent data integrity, but it merely serves to illuminate the conventional distinction. We cannot deny that the consumer world has been managing alright without ECC, so it deserves an honest examination of why that might be so.

                  Originally posted by Citan View Post
                  If manufacturers don't put their best into reliability, the minimum would be to be upfront about it. If they know the reliability of their hardware has been degrading to the point where failures may affect more than 0.5% of the full capacity (and I'm being very generous here) and yet have done nothing about it, that's damn close to a scam.
                  If a government or industry body wants to establish such a ratings scheme, I'd be in favor of that. In general, you cannot blame the industry for doing no more than what customers or regulations demand. Well, you can, but you're just wasting your breath.



                  • Originally posted by Weasel View Post
                    You are beyond clueless. You literally have no idea how a computer works, so please stop wasting people's time.
                    Ironic that you accuse me of wasting time, considering you just said nothing of any use.
                    Source data gets cached into RAM (filesystem access cache), or is even created in RAM (e.g. you use software to edit it, depending on what kind of data it is, then save it). The bit flip happens in RAM, either in the data itself or in the code that deals with saving it.
                    Yes, and so in the unlikely event a bit is flipped, whether that happened during the backup/copy process or the checksum process, an error will be found. If an error is found, re-run the checksum to determine whether the checksum process failed or the transfer did.
                    Not a hard concept to grasp, but considering you're clearly rabid with anger, you aren't thinking clearly.
                    Couldn't care less about the average person who has no backups, in that case bit flips or not are irrelevant to begin with.
                    The average person is the entire premise of my post. You're just too full of irrational rage to understand that.
                    It's not silent because you know when you access it. You'll never know a silent ECC error until you actually verify the data manually (not with checksum!). i.e. Load it and see if it works or is fine. Play a video. And so on. But by that time it could have propagated to all of your backups!
                    Do you realize how you're proving my point here?
                    If you do a checksum, you'll figure out immediately whether a flipped bit occurred (or whether your drive is faulty), in which case you run the backup again. Problem solved. The chances of an error occurring twice on a non-critical system (critical systems need ECC) are astronomically small.
                    If you don't do a checksum, then the problem is silent because you won't know something went wrong until you need it.
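                    As a concrete sketch of that verify-after-copy workflow (file names are placeholders; hashlib and shutil are standard library):

                    import hashlib, shutil

                    def sha256(path):
                        # Hash in 1 MiB chunks so large files don't need to fit in RAM.
                        h = hashlib.sha256()
                        with open(path, "rb") as f:
                            for chunk in iter(lambda: f.read(1 << 20), b""):
                                h.update(chunk)
                        return h.hexdigest()

                    def verified_copy(src, dst, retries=2):
                        # A bit flip during the copy or either hash shows up as a mismatch;
                        # re-running distinguishes a one-off error from bad hardware.
                        for _ in range(retries + 1):
                            shutil.copyfile(src, dst)
                            if sha256(src) == sha256(dst):
                                return True
                        return False  # persistent mismatch: suspect the hardware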
                    You sound like those Karens who shared their oh-so-valuable opinions about COVID-19. Their lives have been like that for years, they never got the virus, it's insignificant or a conspiracy, they refuse to wear masks and vaccinate and all. Of course, you stop hearing from them after they get the virus, but the ones who still haven't will continue with their bullshit, making it seem like nothing changed at all.
                    Uh... you're the one foaming at the mouth with petty insults, a misunderstanding of the situation, strawmanning, and poor reading comprehension. You sound much more like a Karen to me.
                    That being said, good job ignoring everything I just said and digging yourself into a deeper hole of things I never said.
                    Even if bit flips affected just 0.1% of the population, that's already more than enough to consider it an absolute necessity. That's still millions of people. And nobody cares that it didn't happen to the other 99.9%, because for the 0.1% it's far too late already.
                    Bit flips happen to 100% of people, dumbass. And you think I'm the one who doesn't know anything? The difference is that the probability of a bit flip happening every day to the average home PC is far less than 0.1%, and the probability of one of those bit flips causing a noteworthy problem to the average PC user is so small it isn't worth concerning over. I say "noteworthy" because in most cases, the application will recover or the effect of the flipped bit is nothing more than a glitch.
                    If we used your logic, we wouldn't have any protections against airplane crash disasters, because they're even rarer. Let's save money, right?
                    How the hell are you coming to that ridiculous conclusion? An airplane is the vehicular equivalent of a server, which I have already established many times must use ECC (at least if the data is worth concerning over).
                    Last edited by schmidtbag; 04 January 2021, 04:52 PM.



                    • Originally posted by schmidtbag View Post
                      Yes, you are being generous, in the opposite way you think. Way back in the 90s when bit flipping was a much greater threat, it was still about 4 bits per gigabyte. Back in 2009, Google showed that 1-5 bits would flip per hour per 8GB. In either case, that's far below 0.5% full capacity. You are heavily exaggerating the problem here.
                      You're citing info from Google that's more than a decade old and without regard to the fact that DRAM and DIMM quality varies widely (and you can bet Google never bought the cheapest stuff). Either you need to cite a recent survey of consumer DRAM reliability or stop pretending that you have relevant data.

                      Originally posted by schmidtbag View Post
                      As for your disks, are you not aware that a HDD can have a bad sector but is otherwise perfectly clean? Are you not aware that SSD cells ...
                      As has already been noted (but is worth repeating): HDDs and SSDs have long had error-correction schemes far more sophisticated than ECC DRAM.

                      There's a low (but nonzero) probability of getting an uncorrected error from a storage device. For HDDs, it's usually rated at around 1 bit per 10^15 bits read. However, I think long before that happens, you'll likely see the count of correctable errors tick up. Check smartctl (or gsmartcontrol), if you're interested.
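
                      If you want to script that check, a minimal sketch that shells out to the real smartctl tool (the device path is a placeholder; needs smartmontools and root):

                      import subprocess

                      # "-A" prints the vendor attribute table, including reallocated and
                      # pending sector counts, which tick up as a drive degrades.
                      out = subprocess.run(["smartctl", "-A", "/dev/sda"],
                                           capture_output=True, text=True).stdout
                      for line in out.splitlines():
                          if "Reallocated_Sector_Ct" in line or "Current_Pending_Sector" in line:
                              print(line)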

