Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • #81
    Originally posted by JS987
    AMD APUs don't support ECC according to various pages.
    Only the Pro APUs support ECC. And you really have to search around and read their specs, because some of their newest APUs don't. It's AMD's way of helping Intel fragment the market. I'm a ZFS user who wants to upgrade to an 8c16t APU for VM reasons, so I've been trying to keep up with AMD APU ECC support. That Ryzen 7 Pro on eBay -- according to AMD it doesn't support ECC. That's the best APU a person can buy outside of a Dell/HP/Lenovo workstation and it doesn't support ECC. Damn shame, AMD.

    IMHO, Intel and AMD are just as bad in regards to ECC fragmentation.

    So is Linus a ZoL user now? It's just, anecdotally, a lot of ECC talk comes from the ZFS crowd.



    • #82
      Originally posted by mdedetrich
      There are so many cases in human history where this is false.
      Yeah, because in those instances it is cost prohibitive or there are people who have an incentive to prevent a new standard. ECC isn't that much more expensive. Preventing people from having ECC doesn't magically improve the lives of corporate leaders. The fact of the matter is: ECC just isn't a necessity for the average person.


      Originally posted by Citan
      Absolutely wrong.
      EVERY file is important for EVERY user. I'll put aside the big, stupid condescension that comes from your post (like "those people are not professional so their data has no interest hence no value in the first place"), which is impressively ridiculous.
      People are students, independent workers, people working from home, etc...
      EVERYONE has "production" documents they care about because they represent some project or task.
      Uuuugh you're probably one of those people who freaks out over the idea of the government and companies spying on you. Get a grip on reality: chances are, you're not that special and neither is your work. If you want to burn extra money on ECC, go ahead, I don't care. But don't pretend that essay you put off until last minute is equivalent to a declaration of war or a multi-million dollar business deal.
      I've known people who lost hours of work because of something like Excel crashing or the internet disconnecting for an unknown period of time (using a cloud service). While a major frustration and waste of time, such events didn't ruin their lives.
      But let's even consider a hypothetical world where people outside enterprises have only "non productive" data like music, personal photos or videos. The latter are memories, an expression of their life. Why in the world would they not have a right to a guarantee of their integrity over time?
      Um... where did I suggest they don't have a right? But let's buy into your stupid hypothetical world anyway: ever heard of a backup?
      If manufacturers don't put their best into reliability, the minimum should be to be upfront about it. IF they know the reliability of their hardware has been degrading to the point where failures may represent more than 0.5% of the full capacity (and I'm very generous here) and yet have been doing nothing about it, that's damn close to a scam.
      Yes, you are being generous, in the opposite way you think. Way back in the 90s when bit flipping was a much greater threat, it was still about 4 bits per gigabyte. Back in 2009, Google showed that 1-5 bits would flip per hour per 8GB. In either case, that's far below 0.5% full capacity. You are heavily exaggerating the problem here.
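For a sense of scale, the comparison above can be sketched using nothing but the numbers quoted in this post (1 to 5 flips per hour per 8GB, against 0.5% of capacity). The helper name `flipped_fraction` and the reading of "8GB" as 8 GiB are assumptions made for illustration, and the flip rate is simply taken from the post, not independently checked.

```python
# Back-of-the-envelope check using only the figures quoted in the post above.
GIB = 2**30

def flipped_fraction(bits_flipped: float, capacity_bytes: int) -> float:
    """Fraction of all bits in a memory of the given size that were flipped."""
    return bits_flipped / (capacity_bytes * 8)

capacity = 8 * GIB      # the "per 8GB" figure, read here as 8 GiB (assumption)
threshold = 0.005       # the 0.5% of capacity mentioned in the quote

# Upper end of the quoted range: 5 flips in one hour.
worst_hourly = flipped_fraction(5, capacity)

# Deliberately pessimistic: every flip for a whole year accumulates uncorrected.
worst_yearly = flipped_fraction(5 * 24 * 365, capacity)

print(f"fraction flipped per hour: {worst_hourly:.2e}")
print(f"fraction flipped per year: {worst_yearly:.2e}")
print(f"threshold (0.5%):          {threshold:.2e}")
```

Even the pessimistic yearly figure comes out many orders of magnitude below the 0.5% threshold, which is the point being made in the reply.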
      If the needs of people were always met with adequate quality, or rather, if the products with the best quality to answer *actual* needs were always the ones winning, junk food would be a niche thing, Microsoft Windows (you know, the OS that still, in 2020, is not even able to update properly) would have died a long time ago, and we wouldn't have a lot of junk service machines taking care of chores that fall apart every few years.
      In each of your examples, you're showing the faults of human behavior (and yes, Windows failing to update is a failure upon the human programmers). So, you're basically pulling a strawman here.
      A better example is the automotive industry, where certain safety features are standardized and legally enforced because they're meant to prevent inadequacy in a way that would affect consumers. Consumers are forced to pay more for something that is safer and more reliable, because it was deemed necessary. The average PC isn't going to threaten anyone's life. ECC isn't that much more expensive. And yet, it's not enforced. Why? Because it's not a big deal.
      Case in point: I've been interested in IT for more than 20 years, and there are a few areas of hardware where I tried to get a bit knowledgeable, like graphics cards or SSDs... But I never realized all those problems with RAM; they were completely off my radar. Yet I have a strong interest in it because, apart from my work, I have several "personal" and "semi-professional" projects ongoing, including building and hosting some professional websites for friends.
      Knowing this sheds new light on some (admittedly rare) data transfer problems I've had in the last ten years from disks I knew were still healthy.
      Yeah, they're off the radar because the VAST majority of bit flips don't cause serious damage. Bit flipping happens to everyone on a near-daily basis but the probability of it causing a problem is insignificant. To clarify, insignificant doesn't mean 0%.
      As for your disks, are you not aware that a HDD can have a bad sector but is otherwise perfectly clean? Are you not aware that SSD cells have a lifespan, and if the workload isn't evenly distributed among the cells then some will prematurely fail before others? These are both highly probable situations compared to a major failure due to a flipped bit in RAM.
      Last edited by schmidtbag; 04 January 2021, 10:10 AM.



      • #83
        Originally posted by coder
        I had the same idea, and apparently so did others:
        I haven't personally tried either, always opting for ECC RAM in my hardware. I'm surprised it's not more popular.
        I really like the first link. It only runs once per startup, so it is far from perfect, but it should catch really broken stuff, and running once per startup is better than not running at all. I am running with this on my kernel command line now. Thanks!


        http://www.dirtcellar.net



        • #84
          Originally posted by rastersoft

          Yeah, I know what RAM is and how it is supposed to work, but the point is that row-hammer exists, and it can allow you to just hang the computer, or make it behave incorrectly, which is also an attack. And what's even worse: the fact that it exists means that it can be triggered accidentally. So I understand why Torvalds wants it. But having a bad cell in RAM... that isn't as common. Of course you need a tool to detect that, and ECC does that job too, but what I mean is that I find a bit flip due to an accidental row-hammer much more probable than a bad cell.
          I would argue that row-hammer proves that MANY cells in your RAM module are bad. If you know how RAM is supposed to work, you must agree that you should be able to access anything in memory, whenever you want, in whatever pattern you want, without suddenly having something changed as a result. Imagine reading a book, and the very act of reading one sentence changes another. This is NOT correct behavior.

          The problem is money. Manufacturers have found the problem to be rare enough, or not enough people have complained about corruption (perhaps because they are unaware of it). They sell corrupt memory to YOU, the customer, because the customer is not bright enough to understand that they have purchased unreliable crap. And if you complain, the standard response is: you need enterprise class hardware for such things (e.g. ECC memory).

          ANY RAM chip should NOT be vulnerable to row-hammer. If it is, it is BAD RAM!

          http://www.dirtcellar.net



          • #85
            Originally posted by schmidtbag
            Um... where did I suggest they don't have a right? But let's buy into your stupid hypothetical world anyway: ever heard of a backup?
            And then you realize the backup itself is silently corrupted, because the bit flip happened right when you made the backup.

            Originally posted by schmidtbag
            A better example is the automotive industry, where certain safety features are standardized and legally enforced because they're meant to prevent inadequacy in a way that would affect consumers. Consumers are forced to pay more for something that is safer and more reliable, because it was deemed necessary. The average PC isn't going to threaten anyone's life. ECC isn't that much more expensive. And yet, it's not enforced. Why? Because it's not a big deal.

            Yeah, they're off the radar because the VAST majority of bit flips don't cause serious damage. Bit flipping happens to everyone on a near-daily basis but the probability of it causing a problem is insignificant. To clarify, insignificant doesn't mean 0%.
            Maybe to morons like you who don't value their data. I've been using ECC for decades so I don't have to deal with silent data corruption.

            Originally posted by schmidtbag
            As for your disks, are you not aware that a HDD can have a bad sector but is otherwise perfectly clean? Are you not aware that SSD cells have a lifespan, and if the workload isn't evenly distributed among the cells then some will prematurely fail before others? These are both highly probable situations compared to a major failure due to a flipped bit in RAM.
            Those are completely insignificant because they're not in RAM. The problem with RAM bit flips is that the data written to the disk in the first place will be wrong. i.e. they're silent.

            There are ways to mitigate issues with bad disks, bad sectors, etc. Write a checksum, or a recovery record (some filesystems make this automatic). Since the checksum is calculated from the data in RAM, even if the disk write goes wrong immediately, you will know when the checksum fails, and you can even repair it or restore from a backup.

            But if the data itself is wrong, and the disk is perfectly fine, then the checksum will be calculated on the wrong data as well, so it will appear OK. You then archive it and copy it onto 100 different drives for extreme backup! Later you lose your work and think to yourself you're perfectly fine because you did backups!!! You check your backup: it crashes/fails to load/is wrong. You check the checksum: it says OK, because it was a silent bit flip. Every backup is the same, because the source of it (the RAM) was wrong to begin with. There's nothing wrong with any individual disk! It was told to write wrong data, and that's what it faithfully did.
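The two cases described above can be sketched in a few lines. This is only a simulation under stated assumptions: the "RAM flip" is faked by inverting one bit of a byte string, SHA-256 stands in for whatever checksum a filesystem or archiver would really use, and `flip_bit` and `sha256_hex` are hypothetical helper names.

```python
import hashlib

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted, simulating a bit flip."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

def sha256_hex(data: bytes) -> str:
    """Checksum stand-in: SHA-256 digest of the buffer as a hex string."""
    return hashlib.sha256(data).hexdigest()

intended = b"quarterly report: total = 1000"

# Case 1: the flip happens in RAM *before* the checksum is computed.
in_ram = flip_bit(intended, 3)          # buffer is already silently wrong
stored_checksum = sha256_hex(in_ram)    # checksum of the wrong data
on_disk = in_ram                        # disk faithfully writes what it was given
ram_flip_detected = sha256_hex(on_disk) != stored_checksum

# Case 2: RAM is fine; the disk corrupts the data after the checksum was taken.
good_checksum = sha256_hex(intended)
bad_disk_copy = flip_bit(intended, 3)
disk_flip_detected = sha256_hex(bad_disk_copy) != good_checksum

print("RAM flip caught by checksum: ", ram_flip_detected)
print("disk flip caught by checksum:", disk_flip_detected)
print("stored data matches intent:  ", on_disk == intended)
```

In case 1 the checksum verifies cleanly on every copy even though the stored bytes differ from what was intended, which is exactly why this failure mode is called silent.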

            You are beyond fucked.

            I don't know why I write an essay when you should just shut up and listen to Linus. Even VRAM now has ECC these days. You're beyond delusional if you think it doesn't matter for main system memory. Go play with your toys this is clearly out of your ballpark.



            • #86
              I started using ECC in my server from around 2012. My Celeron G530 pre-ECC corrupted all crypto-writes after 3 months of uptime (don't recall if it was truecrypt or luks). !!! I don't know if that was a kernel bug or ECC-correctable error !!!!, but I switched to i3-4130 on C222 (Asus P9D-I). No such problems again.

              Symptoms: nothing seemed wrong, but if a reset was needed after 3 months uptime, the newest-written data was corrupted.
              Debugging: the next time I planned a reset, I unmounted my encrypted volume, fscked it, no error; locked-then-unlocked it, fscked, no errors. Finally I locked all volumes and unloaded crypto modules, unlocked them again and fsck got errors. I'm thinking a single bit error in crypto key, but it's strange that it would read old and new data correctly, but, after a reboot, new data was incorrect.



              • #87
                Originally posted by Ivan Dimitrov
                Totally support Linus's sentiment although it is not fully correct. AMD supports (validates) ECC RAM ONLY on their PRO processors. ECC is enabled on non-PRO processors but validation and implementation is left to the motherboard vendor. So ECC will most likely work on a non-PRO processor, but it is not clear if it will work in ECC mode and how it will report the errors. You can watch Wendel for the current state of ECC on Ryzen - in short it is messy.
                You can argue that Ryzen's "support" of ECC can give a false sense of security to some users....
                Anyway, whoever needs ECC is most likely to buy the solution validated by the vendor, which means Ryzen PRO and Xeon CPUs - not a big difference here. So to be precise, instead of "AMD did it", I think the correct statement is more like "AMD raised a valid point, made some noise and scored some marketing points for ECC RAM support".
                Just because it is non-validated doesn't mean it doesn't work. ECC is EXTREMELY simple technology that is almost impossible to get wrong, and the logic is the same between the models; the only difference is in the marketing speak. One has paid for the privilege.



                • #88
                  Originally posted by magallanes
                  Hi everybody:

                  I subscribe just for this post.

                  About Linus and ECC: bollocks.

                  What is the problem of ECC?
                  * It costs more because instead of 8 modules, we need a 9th module plus a chip.
                  * It adds latency. Why? Because the tiny chip has to validate every operation.

                  And what is the advantage of it?

                  For instance, ECC does not automatically correct memories.

                  Usually, ECC does the following steps:
                  * The system reads some region of memory, including the parity check.
                  * If the parity fails, then it tries again; this is the "correction", and yes, it is flaky, usually doesn't solve the problem at all, and usually continues with the next step.
                  * If the memory fails again, then the system halts (or enters interactive/maintenance mode). If you are using an expensive server, then it could show some lights and even say which module failed.

                  Also, memories are way less prone to fail than in the past. For example, if you work in a datacenter, then replacing a hard disk is part of the overall costs. By contrast, it is not common to replace the memories or the CPU, mainly because those components sit well inside the motherboard, protected behind several layers of capacitors.



                  Error _correction_ does correct errors; it is in the name. I think you are confusing parity with ECC. ECC can correct 1-bit errors, and can detect 2-bit errors but not correct them.
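To make the correct-one, detect-two behaviour concrete, here is a toy sketch using an extended Hamming(8,4) code: 4 data bits, 3 Hamming check bits and 1 overall parity bit. It is an illustration only. Real memory controllers protect much wider words and may use different codes, and the names `encode` and `decode` are made up for this example.

```python
from typing import List, Optional, Tuple

def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the parity of all 8 bits even
    return code

def decode(received: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1    # True if an odd number of bits flipped
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code[syndrome] ^= 1             # single error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return "uncorrectable", None    # even parity but nonzero syndrome: 2 flips
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

word = encode(0b1011)
one_flip = list(word)
one_flip[6] ^= 1                        # single-bit error
two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1                       # double-bit error

print(decode(word))
print(decode(one_flip))
print(decode(two_flips))
```

A lone parity bit, by contrast, could only report that the single flip happened; it has no way to locate the flipped bit, and a double flip would pass unnoticed.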



                  • #89
                    Originally posted by Weasel
                    And then you realize the backup itself is silently corrupted, because the bit flip happened right when you made the backup.
                    That's why you checksum against the source data...
                    Maybe to morons like you who don't value their data. I've been using ECC for decades so I don't have to deal with silent data corruption.
                    Well la-dee-dah, aren't you special.
                    Those are completely insignificant because they're not in RAM. The problem with RAM bit flips is that the data written to the disk in the first place will be wrong. i.e. they're silent.
                    Right... I'm the moron... Defects in drives are not insignificant. They're more likely to affect the average person than a flipped bit.
                    There are ways to mitigate issues with bad disks, bad sectors and etc. Write a checksum, or recovery record (some filesystems make this automatic). Since the checksum is calculated based on data in RAM then even if the disk ends up written wrong instantly, you will know that when the checksum fails, and can even repair it or replace from a backup.
                    Wow you are dense. A bad sector is also a silent problem. You might not know something was corrupt until after you try to open it. It's the same kind of problem as a flipped bit, the only difference is drive failures are more common and are more likely to be noticed after the damage has been done. When a flipped bit is affecting something you're working on, you typically notice the problem while you're doing it. If you're not sure of that last statement, well, that shows how little flipped bits have actually impacted you.
                    You are beyond fucked.
                    No, you're just blowing things way out of proportion, like you always do. I use multiple computers for most of my waking hours. 3 of them are on 24/7, and none of those have ECC. My life has been like this for years. Never once have I faced a noteworthy issue caused by bit flipping. Has it happened? Yes, obviously it must have, but I'm not some special snowflake who gets in a hissy fit because the 700th selfie resulted in a crash, or the hardcore dungeon porn artifacted for one frame.
                    I don't know why I write an essay when you should just shut up and listen to Linus. Even VRAM now has ECC these days. You're beyond delusional if you think it doesn't matter for main system memory. Go play with your toys this is clearly out of your ballpark.
                    Or y'know, if you actually read and understood my point, you'd realize I didn't say I disagree with Linus, I'm just saying he's making this a bigger deal than it really is.
                    How about go back to school and learn how to understand the English language better?



                    • #90
                      Originally posted by schmidtbag
                      As for your disks, are you not aware that a HDD can have a bad sector but is otherwise perfectly clean? Are you not aware that SSD cells have a lifespan, and if the workload isn't evenly distributed among the cells then some will prematurely fail before others? These are both highly probable situations compared to a major failure due to a flipped bit in RAM.
                      All permanent storage devices have error correction built in. When you receive a bad-sector report, their correction code has failed. But that ECC (which is much more than the 1-bit-correct, 2-bit-detect Hamming code of RAM) is implemented by the manufacturer, except for optical disks, which have published, mandatory standards.
                      <grandpa voice>Back in the days when HDD size was about 3 times that of a CD</grandpa voice>, I prepared a folder to write to a CD and copied it over the network to a friend (over NetBIOS, not TCP/IP) and everything seemed fine. Even compared the copied data on the PC that had written it.
                      But when I put the CD in my PC and compared it with my original data, almost everything was corrupted (no zips could be extracted and all .exe files crashed). I don't know which part failed: the MEM-MEM copy in the network driver or the actual network transfer. But it failed silently. And this WAS a backup.

                      Also, getting back to the necessity of ECC, I've had a RAM+MB combination that would always freeze loading Windows on the 1st start of the day (after at least 8h powered off). It took me a month to remember to load memtest first, and when I did, I got 2x 1-bit errors on the 1st pass, followed by hours of stability. ECC would have allowed me to use that hardware reliably, instead of getting angry every time while resetting my PC. Now think about someone who doesn't know anything about PCs and tries to go through the seller's support channel.

                      PS: the cost-cutting part of ECC RAM is just related to motherboard tracing. CPU and memory sockets already have the pins. Intel just does not allow it in their i5 and i7 lines (many i3s have them) and consumer chipsets (which don't even have the memory controller). So the only really cheaper part is the memory, where you need 12.5% extra capacity (64-bit vs 72-bit). But UNREGISTERED ECC memory is even more expensive because it's produced in smaller quantities. DO NOT CONFUSE with really-expensive enterprise-grade REGISTERED memory, which is intended to be used in high capacity with more than 2 sticks/channel.
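The 64-bit vs 72-bit figure above lines up with the textbook bound for a single-error-correcting Hamming code plus one extra parity bit for double-error detection. A small sketch of that arithmetic follows; `secded_check_bits` is a hypothetical helper name, and actual DIMMs and memory controllers are not required to use exactly this code.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SEC code plus one overall parity bit (SECDED)."""
    r = 0
    # Hamming bound for single-error correction: 2^r must cover every
    # possible single-bit error position plus the "no error" case.
    while 2**r < data_bits + r + 1:
        r += 1
    return r + 1                        # one more bit for double-error detection

for width in (8, 16, 32, 64):
    check = secded_check_bits(width)
    total = width + check
    print(f"{width:2d} data bits + {check} check bits = {total} bits "
          f"({check / width:.1%} overhead)")
```

For 64 data bits this gives 8 check bits, so 72 bits in total and the 12.5% extra capacity mentioned in the post. Narrower words would pay proportionally more.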
                      Last edited by mathew7; 04 January 2021, 11:22 AM.

