Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • Originally posted by anarki2 View Post

    What "other popular OS" are you talking about? It's definitely not Windows, so macOS, maybe? Currently my RAM is 7.7GB used, 13.3GB cached and 11GB free on Windows. Or is it that the last time you saw Windows was in '98?
    Well, Windows might be one of those, yes. I remember when Vista came out, everyone was complaining that Windows suddenly used so much memory. In fact that memory was being used for cache, so Windows got a lot of unfair "bad press" there. What was rarely talked about, however, is that the Windows memory manager paged the cached memory out to swap. Not such a brilliant idea. From my (limited) experience, Windows 10, NetBSD and perhaps to some extent FreeBSD (not so sure about current versions) have generally left a bit too much free RAM in the system. This is RAM that could otherwise have been used for caching etc.

    http://www.dirtcellar.net

    Comment


    • Originally posted by coder View Post
      The chips are interleaved, on single-rank DIMMs. I'm not sure how dual-rank DIMMs are mapped, but it still won't be the case that one bad chip = 1/8th or 1/16 of the address range is unusable. It would either break the whole DIMM or maybe half of its address range, I think.


      Yes, one or more bad pages/rows/etc. can be blocked, at boot time. I've never done it, but I know it's possible.
      Sure, you may be very unlucky or very lucky. Regardless, having some working memory on a module is usually better than having nothing work at all. So let's pretend you have two nuked bytes on an 8GB module. Even if you need to map away as much as 4GB to conceal those two unreliable bytes, you would still have 4GB in good condition. Better than nothing.
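      On Linux, one way to do this kind of mapping-away is the `memmap` kernel parameter, which reserves a physical address range so the kernel never allocates from it. A hedged sketch (the address and size below are made up for the example):

```shell
# /etc/default/grub - hypothetical bad region: 64K starting at 0x12340000.
# memmap=nn$ss reserves nn bytes at physical address ss; the $ must be
# escaped so GRUB doesn't treat it as a variable reference.
GRUB_CMDLINE_LINUX_DEFAULT="quiet memmap=64K\$0x12340000"
# then regenerate the config, e.g.: sudo update-grub
```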

      http://www.dirtcellar.net

      Comment


      • Originally posted by strtj View Post

        This existed ~25 years ago in Tru64 (Digital Unix / OSF/1). I can't remember if it was a paid add-on or if it existed in the base OS, but it did pretty much exactly what you describe. You could set it to do a certain amount of memory in a certain period of time and CPU use would scale accordingly. I'm always surprised that I've never seen it anywhere else.
        Thanks for that, I was not aware of it. The reason it is not seen elsewhere might well be that memory chips are worse than we believe. Actually they are: row-hammer is proof of that concept, I think. I am still on DDR2 memory, which is apparently less vulnerable to wearing out through use than DDR3/4 memory.

        Anyway, if we had such a built-in background memory test running constantly in the idle thread(s), possibly rate-limited (e.g. test at most 1GB per 30 minutes), I am sure that within a couple of months manufacturers would see a not-insignificant number of memory modules returned to them. Hell, maybe it is a good idea to reimplement this for Linux; that way we can have reliable hardware again!
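        A userspace sketch of that idea (all names and the rate budget are illustrative; a real implementation would run in the kernel's idle path and walk physical pages, not a heap buffer):

```python
import time

# Illustrative rate-limited memory scrubber: write known bit patterns,
# read them back, and pace the work so at most ~1 GiB is tested per
# 30 minutes.

CHUNK_BYTES = 1 << 20                 # scrub 1 MiB per pass
BUDGET_BYTES = 1 << 30                # at most ~1 GiB ...
BUDGET_SECONDS = 30 * 60              # ... per 30 minutes
PATTERNS = (0x00, 0xFF, 0x55, 0xAA)   # classic alternating-bit patterns

def scrub_chunk(size=CHUNK_BYTES):
    """Write each pattern into a buffer, read it back, return mismatches."""
    buf = bytearray(size)
    errors = []
    for pattern in PATTERNS:
        buf[:] = bytes([pattern]) * size
        for offset, value in enumerate(buf):
            if value != pattern:
                errors.append((offset, pattern, value))
    return errors

def scrub_loop():
    """Scrub forever, sleeping between chunks to respect the rate budget."""
    delay = BUDGET_SECONDS / (BUDGET_BYTES // CHUNK_BYTES)  # ~1.8 s per chunk
    while True:
        errors = scrub_chunk()
        if errors:
            print("memory errors detected:", errors[:5])
        time.sleep(delay)
```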

        http://www.dirtcellar.net

        Comment


        • Originally posted by coder View Post
          Yes, and I explained why.
          Poorly.
          I agreed to no such protocol. That's straight out of your imagination.
          I'm well aware you disagreed. But when someone cites a source and you don't like it, that's your problem. Either prove why my source is bad or accept it. Again, Puget says that RAM is only getting more reliable, so the outdated sources I showed were a worst-case scenario.
          How can you possibly think that, after seeing how many of your posts I've replied to? I read every post in this thread.
          Because I'm finding myself repeating things to you I've already said to others. You aren't reading everything.
          If you feel I've mis-characterized your position, feel free to correct the record. If not, why do you say that?
          My position is twofold:
          1. ECC is not a necessity for the average consumer-grade PC, but is a necessity for workstations and servers.
          2. Getting ECC support on Intel isn't that expensive and therefore not worth complaining about.
          Yes, I'm serious. You made a very specific claim about the relative frequency of two types of errors. I very much doubt it. Let's see your data, or drop the point.
          I already cited a source, so wtf else do you want? You can always ask the average computer repair technician; they'll tell you the same thing.
          We're not talking about drive failure vs. RAM failure. You said bad sectors were more common than memory errors. Don't muddy the water.
          I mentioned more than just bad sectors; you're just reading what you want to read (or, more evidence that you haven't actually read everything I wrote). But let's assume for a moment I only mentioned bad sectors: you are getting annoyingly pedantic over insignificant details. A bad sector is a drive failure (which doesn't mean a wholly failed drive).
          Is it? I specifically said not. I'm more interested in the nuances of which platforms offer what level of ECC support and specifically when does it make sense to use ECC?
          If you understood what exactly it was you were arguing with, then yes, it is. Whatever nuances you're interested in is not what I'm discussing, so you're basically just arguing with me over nothing. You're the one who responded to me. I'm here to defend my point, not cater to your interests.
          This is what happens when you C your way into an A-B discussion.
          ECC can save your bacon and tell you that you're getting memory errors. Without it, you're just left to suffer the consequences and guess why your system is unstable (unless/until you reproduce the problem with a memtest, by which point you've already suffered sufficient loss of time or data to even justify running it).
          It can, but as I've said multiple times: ECC isn't a miracle worker and it won't save you from everything. If you have frequent memory errors, getting ECC RAM is not a sensible solution; you have to fix the underlying problem first. By your logic, that's like getting a bullet wound and deciding to fix it by taping a cotton ball over the hole and downing a bottle of painkillers. You still have a major problem. Depending on ECC to fix frequent memory errors is just plain irresponsible. If the problem was a defective DIMM, then sure, replacing it is a good option, regardless of whether it is ECC.
          If we're talking lost time in a manner where that time is critical, not going for ECC in the first place was also irresponsible.

          Comment


          • Originally posted by schmidtbag View Post
            Again, if that were true, it would be the standard. You wouldn't have an option.
            You don't need ECC, until you do. Just like backups: it's only a problem when it becomes one. If the risk isn't a concern for what you're doing, then you don't need ECC. If you're an Apple user and your RAM goes bad, just buy another Mac (this is their usual time-effective "positive UX" solution: throw money at it).

            Originally posted by darkoverlordofdata View Post
            I think most would agree that we want ecc on the server - but do we need it on the workstation? I have a 4 year old asus zen book that logs ecc errors; it's never had one.
            Originally posted by Sonadow View Post
            The fact remains that the world has been fine with using non-ECC memory on high-end professional workstations, desktop PCs, gaming PCs and work PCs for decades with no major complications or consequences.

            I have a 2990WX workstation with 128GB of standard DDR4 memory for use in compiling software (with the memory being used as a RAMdisk for faster compilations) and have not run into any problems since day one.
            Depends what you're doing. You're compiling some software; try processing tasks that run 24/7 for 2 weeks. I did this with a 1950X Threadripper back in 2017 with 128GB DDR4. Some errors may not matter; it depends on where that bit flip lands. Since we're processing a lot of numbers (3D meshes), a point in 3D space being a little inaccurate isn't the end of the world, but if it's significantly offset from where it's meant to be, that can be bad, though at least fixable (I'm referring to storage, not run-time, so you'd catch the issue by restarting the game engine or DCC tool).

            With other parts of the file it won't be as obvious that they're borked, but the biggest processing (one run takes several weeks) is all RAM shuffling for photogrammetry processing with proprietary software. I don't know what it's doing, other than using the CPU at 100%, all the RAM, a big disk buffer (400GB on NVMe), and stressing several GPUs with 8GB or more VRAM. That's a lot of data over a long enough period, and as a business, if something borks and crashes, that can be costly. It happened once, but that was related to a NIC driver causing a bluescreen.

            I've also witnessed some smaller workloads that take 4-12 hours crashing on non-ECC RAM. It could have been for other reasons, but it was otherwise the same workload the machine usually performs, and repeating it worked fine, so it wasn't reproducible. That was using about 80GB of RAM.




            Comment


            • ZFS solved this issue via software, not hardware.
              Why can't ECC be solved via software checksums?
              I know performance would drop, but seems like it should be an option you enable in the BIOS if you're willing to lose performance.

              ECC memory can still go bad, and it doesn't even address the worst errors in the first place.

              This isn't just about system memory. There's GPU memory; CPU L1, L2, and L3 cache; SSD and hard drive cache; RAID card cache; high-speed network card cache.

              Comment


              • Originally posted by waxhead View Post

                Thanks for that, I was not aware of it. The reason it is not seen elsewhere might well be that memory chips are worse than we believe. Actually they are: row-hammer is proof of that concept, I think. I am still on DDR2 memory, which is apparently less vulnerable to wearing out through use than DDR3/4 memory.

                Anyway, if we had such a built-in background memory test running constantly in the idle thread(s), possibly rate-limited (e.g. test at most 1GB per 30 minutes), I am sure that within a couple of months manufacturers would see a not-insignificant number of memory modules returned to them. Hell, maybe it is a good idea to reimplement this for Linux; that way we can have reliable hardware again!
                I went and looked up the details because I couldn't remember them (I never personally ran it). They called it Memory Trolling, which in and of itself is a great enough reason to use it ;-) Documentation here: http://www.polarhome.com/service/man...&of=Tru64&sf=5

                This certainly wouldn't be too difficult to re-implement on Linux, but I think you are right that manufacturers weren't scrambling to implement something like this. Especially in the days when the hardware vendor and the OS vendor were generally one and the same, there would have been very little motivation to increase the number of memory returns and/or service calls. If the customer doesn't notice some bad bits and never complains, who cares, right? :-P

                As far as I am aware, the only significant feature of Tru64 that was carried forward in any way was the AdvFS filesystem, which was open-sourced and ported to Linux, but I'm not aware of anyone actually using it. The only reason I can think of to have used it is if you had a huge investment in AdvFS storage and for some reason couldn't easily migrate it to something else. AdvFS wasn't a bad filesystem, but the management tools were awful, and by 2008 there were plenty of reasonable large/clustering filesystem options. As far as I can tell, HP's only interest in Tru64 was killing it and migrating its customers to HP-UX.

                Comment


                • Originally posted by sentry66 View Post
                  ZFS solved this issue via software, not hardware.
                  Why can't ECC be solved via software checksums?
                  I know performance would drop, but seems like it should be an option you enable in the BIOS if you're willing to lose performance.

                  ECC memory can still go bad, and it doesn't even address the worst errors in the first place.

                  This isn't just about system memory. There's GPU memory; CPU L1, L2, and L3 cache; SSD and hard drive cache; RAID card cache; high-speed network card cache.
                  It can, if you are willing to completely butcher your performance. Software error checking is easy for slow mechanical drive data, but as bandwidth increases, so does the CPU cost. It isn't really viable to second-guess the integrity of each and every byte of data: your CPU will end up doing far more work checking the data than actually using it.

                  There is not only CPU time overhead, but also memory overhead, and if you try to minimize one, you maximize the other. Do you store an extra bit for every byte, or an extra byte for every word? Every time you use a byte, do you check just that byte, or the entire word?
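                  A toy sketch of those two granularities (names invented; with an 8-byte word both schemes happen to cost 12.5% extra storage, but they differ in how much data a single check has to touch):

```python
# Toy illustration of the checksum-granularity trade-off: a parity bit
# per byte versus one checksum byte per 8-byte word.

def parity_bit(byte):
    """Even-parity bit for one byte: one extra bit per 8 data bits."""
    return bin(byte).count("1") & 1

def word_checksum(word):
    """One XOR checksum byte covering a whole word of bytes."""
    checksum = 0
    for b in word:
        checksum ^= b
    return checksum

data = bytes(range(8))
parities = [parity_bit(b) for b in data]  # per-byte scheme
checksum = word_checksum(data)            # per-word scheme

# Verifying a single byte: per-byte parity touches 1 byte, while the
# per-word checksum forces a re-scan of all 8 bytes on every access.
assert parity_bit(data[3]) == parities[3]
assert word_checksum(data) == checksum
```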

                  ECC has no CPU time overhead and its memory overhead is hidden, at a theoretical 12.5% cost increase (one extra bit for every 8 bits of data). You cannot improve on that in software, and you'd be foolish to try.

                  Last but not least, ZFS doesn't really implement ECC; it only implements error detection. If you don't have that block of data elsewhere, it is lost. ZFS blocks are optimally quite big, far too big to restore the data from the checksum, which is only used to verify the validity of the block.

                  This is quite easy to facilitate in software. Most of the data my software saves to disk has its checksum embedded as well; no one in their right mind would attempt to deserialize critical binary data without being sure of its integrity. That's a crash waiting to happen, or worse.
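                  A minimal sketch of that embed-the-checksum pattern (format and names invented for illustration): prepend a CRC32 to the serialized payload and refuse to deserialize on mismatch. Note that, as with ZFS block checksums, this only detects corruption; it cannot repair it.

```python
import json
import zlib

# Serialized blob layout (invented for this example):
#   4-byte big-endian CRC32 of the payload, followed by the JSON payload.

def serialize(obj):
    payload = json.dumps(obj).encode()
    return zlib.crc32(payload).to_bytes(4, "big") + payload

def deserialize(blob):
    stored = int.from_bytes(blob[:4], "big")
    payload = blob[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: refusing to deserialize")
    return json.loads(payload)

blob = serialize({"mesh": [1.0, 2.0]})
assert deserialize(blob) == {"mesh": [1.0, 2.0]}   # clean round-trip

corrupted = blob[:-1] + bytes([blob[-1] ^ 0x01])   # flip one payload bit
try:
    deserialize(corrupted)
    raise AssertionError("corruption went undetected")
except ValueError:
    pass  # the bit flip was caught before any deserialization happened
```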
                  Last edited by ddriver; 05 January 2021, 06:17 AM.

                  Comment


                  • Originally posted by schmidtbag View Post
                    Again, if that were true, it would be the standard. You wouldn't have an option.
                    Logic isn't your strongest point, I see. Nothing to worry about, but at least you could try not to insult others, especially when they only point out your faulty reasoning. Other than that, please keep commenting, because it's moderately funny, and rather more than moderately so when you get butthurt.

                    For instance, "necessity implies standard" is false. Companies want profit, so they do whatever gets them it. If a car is faulty and they are liable for an accident, they may have to pay for it: no profit. Companies will advertise whatever the consumer wants or thinks is cool (RGB?). Consumers don't know what ECC is (worse, they don't know they need it, because every time there is a failure they attribute it to something else), and if companies put an "ECC" sticker on a case, it would earn them no profit. Advertising RGB, by contrast, may in many cases make them good profit.
                    Fragmenting the market with a feature people don't know about and charging a lot for it is much more profitable than making it standard. Making it standard would, at some point, push the price down enough that you wouldn't even notice the difference any more.

                    As for the technical content of your comments, there are enough people here better suited than me who have already put you to shame, and besides, I am starting to think you are trolling, so I'm not going to waste any more time.

                    Have a good one!

                    Comment


                    • Originally posted by [email protected] View Post
                      Wonder if Intel will use their political tools, like the $300 million they gave to postmodern feminists, to "cancel" Torvalds if he keeps blasting Intel with the truth about their harmful anti-consumer practices, and let's not even get into the anti-competitive practices Intel has employed. I can totally see Intel looking at that comment by Torvalds and going "Oh no, looks like someone needs another re-education, I mean soft skills".
                      If that ever happens, I see Linux going down in quality really fast.

                      Linus has set too many people, including senior kernel devs, straight far too many times. He seems to be one of the very few people with the foresight to minimize potential technical debt and keep the big picture in mind, even if it means sometimes blocking inclusion of highly requested features in the kernel.

                      Comment
