Linus Torvalds On The Importance Of ECC RAM, Calls Out Intel's "Bad Policies" Over ECC


  • Originally posted by coder
    Do you mean it's not available in the rated speed that you desire, or that memory of the same rated speed performs worse?
    I mean ECC memory is not available at the speeds that non-ECC memory is sold at, and the difference is huge. I'd be willing to take one step lower, for example.



    • Originally posted by Weasel

      Here's a made up example just to illustrate. You open a text file in some encrypted archive format with a special editor. It's all in RAM now. The old one has checksum 0x12345678. You edit it a bit and save it. The bit flip happens during your edit in RAM. You save it. Since it's encrypted, it is totally different on-disk. The checksum is calculated. It is now 0x87654321. It's saved on disk along with it.
      ...
      I couldn't agree more. The same kind of thing happened to me a while ago, in the days when I had a computer without ECC. I ran memtest on my computer when I first built it, but somewhere along the way over the years, after a BIOS update, some settings that I had originally tested got changed (XMP may have been turned on with incorrect settings), and the system was no longer totally stable.

      The only thing is that I did not notice for a very long time (I think several years passed). I started to get strange crashes, then corrupted files (often save game files in my case), and ultimately Windows was crashing often (back in the days when I was still using Windows; never going back, BTW). I liked to blame Microsoft or the game developers for their bugs. I finally re-ran memtest and discovered that it now reported a lot of errors. I then found how to "fix" it in the BIOS, but all the data damage was already done. I considered all my hard drive data useless (mainly, the OS installation might have been corrupted after so many updates over time) and never tried to recover anything except some pictures and documents.

      Before that, I was also under the impression that file CRCs or checksums should protect data integrity in almost all cases, but as this example shows (and if you really think about it from a computer science point of view), in many cases they don't.
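
To make the checksum point concrete, here is a minimal Python sketch of the made-up example quoted above. It is only an illustration: the helper names (`flip_bit`, `save`, `verify`) are hypothetical, CRC32 stands in for whatever checksum a real archive format would use, and the "disk" is just a variable. The bit flip happens in RAM before the checksum is computed, so the checksum faithfully protects the already corrupted data.

```python
import zlib
from typing import Tuple


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted (simulated RAM error)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def save(data: bytes) -> Tuple[bytes, int]:
    """Pretend to write data to disk together with its CRC32."""
    return data, zlib.crc32(data)


def verify(stored: bytes, checksum: int) -> bool:
    """Check stored data against the checksum that was saved with it."""
    return zlib.crc32(stored) == checksum


original = b"my important notes"

# The bit flips while the document sits in RAM, before it is saved.
in_ram = flip_bit(original, 3)

# The checksum is computed over the corrupted bytes, so it matches them.
stored, checksum = save(in_ram)

print("checksum verifies:", verify(stored, checksum))
print("data is what I typed:", stored == original)
```

A checksum only catches corruption that happens after it was computed, for example a flipped bit on disk. It says nothing about whether the bytes were right at the moment they were summed.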

      Then last year I bought a Ryzen 3900X and put ECC memory in an ASUS X570 PRIME PRO (consumer motherboard), and I've tested that it works. Initially I had some discussions with kernel maintainers, AMD (and Asus) about updating EDAC to support my new AMD CPU, and things got resolved eventually. This is also why I hope ECC will one day be more officially supported.

      I'm never going back to non-ECC. What's interesting is that even as a gamer I find it useful for testing XMP or just memory stability in general (in fact I used a slight overclock to test ECC error reporting quickly).
      It makes me more confident that if something changes one day I'll know right away (I have machine check notifications sent whenever a recoverable or non-recoverable error happens).
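
The post does not say how those notifications are set up, so the following is only a rough sketch of one possible approach, not the poster's method. It assumes a Linux system where the EDAC driver exposes per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/`. The function names are hypothetical, and the actual notification step (mail, desktop popup, and so on) is left out.

```python
from pathlib import Path

# Standard EDAC sysfs location on Linux; only present when an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller_name: (corrected, uncorrected)} for every memory controller."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # controller without readable counters; skip it
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(previous, current):
    """Return the controllers whose counters grew, with the size of the increase."""
    changed = {}
    for name, (ce, ue) in current.items():
        old_ce, old_ue = previous.get(name, (0, 0))
        if ce > old_ce or ue > old_ue:
            changed[name] = (ce - old_ce, ue - old_ue)
    return changed


if __name__ == "__main__":
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

Run periodically (from cron or a systemd timer, say), saving the previous counts and passing both snapshots to `new_errors`, this would flag any newly corrected or uncorrected error.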

      Also, I'm not too sure why some people still argue that it is not needed and/or should only be enterprise grade (what is so special about enterprise anyway that makes it so much more valuable than your own wasted time?). As some have pointed out, error correction is already in so many storage and other technologies. If every memory module were made with ECC and every memory controller supported it, it wouldn't really cost more in the end; it would be mainstream, so there is no need to fight against that. Let's just try to raise the bar instead of being against progress. As memory quantities increase (and the amount and rate of data transfer increase), it is even more important, I think.

      I guess if someone has never had ECC in his personal computers and is still stuck with Intel "consumer" products, he may naturally try to defend that (because of jealousy perhaps, I don't know).



      • Originally posted by pal666
        I mean ECC memory is not available at the speeds that non-ECC memory is sold at, and the difference is huge. I'd be willing to take one step lower, for example.
        I also thought it would be a limitation. The ECC modules I found for my 3900X were Kingston, rated only at 2666MHz. However, they use memory dies that appear to be very recent and the same as those in non-ECC modules that are clocked way higher (I know binning and many other things are involved that I'm not really an expert in...).

        So the good thing with ECC modules is that you can (in my experience) overclock them and look for memory errors more easily. In my case I run at 3200MHz (which for a Ryzen 3900X is not an overclock from the point of view of the CPU and its memory controller) and have never had any recoverable (or unrecoverable) error.

        In fact, I needed to run prime95 with a large amount of memory at 3700MHz for a long time before I got any ECC error (always correctable). It is fun to watch prime95 run its self-test to check for result errors and report that everything is fine while, in the background, I see recovered errors happening every 5 minutes or so. If I turn off ECC in the BIOS, prime95 reports the errors and stops at the first one.

        With ECC on and this significant overclock (3700MHz compared to 2666MHz, and BTW at the JEDEC standard 1.2V, with no overvoltage at any speed), I could run prime95 for as long as I wanted: it never reported any error and everything worked fine, because errors were corrected automatically all the time. Without ECC, it would take only 5-15 minutes before it stopped with an error (and BTW, the OS doesn't crash most of the time (never, really, in my case), so without software that checks itself you may not even know you have errors...). But even though it looks OK at 3700MHz, I don't run at that speed. At 3200MHz, however, after a year without a single ECC error reported, I feel that it is stable, and if things change for any reason, I'll know right away.

        I have another computer that I built (a Linux server with a consumer ASRock mini-ITX board and a Ryzen 3600) with the same memory modules, and they are as good as the other batch I got 6 months earlier. One of my friends also has a different motherboard with the same memory modules. Both had great stability at 3200MHz with stock voltage.

        Now, if you need very high speed memory modules (4000MHz+), then just wait until the tech is there... (this is just how I see it personally). I mean, is it really necessary for the memory to be a little bit faster? And how will you be sure, especially without ECC, that it is stable in your particular setup?



        • Originally posted by bridgman
          DDR5 + Infinity Cache is an interesting thought for low end products.
          I have another good idea: AMD should sell the 5950X as a gaming edition with Hyperthreading disabled.
          https://www.pcgameshardware.de/Ryzen...adeon-1364596/
          This would improve gaming performance by up to 45%, and by 5-45% in general.

          About DDR5 + Infinity Cache: right now this would not be low-end. A 256-bit DDR5 interface gives you ~220GB/s, which is around the speed of a 570XT alone, but with Infinity Cache this would easily outperform a Vega 64 or 5700XT.
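
For anyone who wants to check the ~220GB/s figure, peak bandwidth is just bus width in bytes times transfers per second. The post does not say which DDR5 speed it assumes, so the DDR5-6400 rate below is only an assumption for illustration, and the function names are made up for this sketch.

```python
def peak_bandwidth_gb_s(bus_width_bits, transfers_mt_s):
    """Theoretical peak bandwidth in GB/s for a given bus width and transfer rate (MT/s)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_mt_s / 1000


def required_rate_mt_s(bus_width_bits, target_gb_s):
    """Transfer rate (MT/s) needed to reach a target bandwidth on a given bus width."""
    return target_gb_s * 1000 / (bus_width_bits / 8)


# Assumed example speed: DDR5-6400 on a 256-bit interface.
print(peak_bandwidth_gb_s(256, 6400))  # 204.8

# Rate a 256-bit interface would need to hit the ~220GB/s mentioned above.
print(required_rate_mt_s(256, 220))    # 6875.0
```

So ~220GB/s on a 256-bit bus corresponds to roughly 6900MT/s. These are theoretical peaks; real sustained bandwidth is lower.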

          Also, AMD could build the card with DIMM sockets, so it would be possible to upgrade it to 256GB of VRAM + ECC RAM.
          I am sure some professionals would like to use 256GB of VRAM for OpenCL/compute.
          Phantom circuit Sequence Reducer Dyslexia



          • Originally posted by Jeff

            I couldn't agree more. The same kind of thing happened to me a while ago, in the days when I had a computer without ECC. I ran memtest on my computer when I first built it, but somewhere along the way over the years, after a BIOS update, some settings that I had originally tested got changed (XMP may have been turned on with incorrect settings), and the system was no longer totally stable.

            The only thing is that I did not notice for a very long time (I think several years passed). I started to get strange crashes, then corrupted files (often save game files in my case), and ultimately Windows was crashing often (back in the days when I was still using Windows; never going back, BTW). I liked to blame Microsoft or the game developers for their bugs. I finally re-ran memtest and discovered that it now reported a lot of errors. I then found how to "fix" it in the BIOS, but all the data damage was already done. I considered all my hard drive data useless (mainly, the OS installation might have been corrupted after so many updates over time) and never tried to recover anything except some pictures and documents.

            Before that, I was also under the impression that file CRCs or checksums should protect data integrity in almost all cases, but as this example shows (and if you really think about it from a computer science point of view), in many cases they don't.

            Then last year I bought a Ryzen 3900X and put ECC memory in an ASUS X570 PRIME PRO (consumer motherboard), and I've tested that it works. Initially I had some discussions with kernel maintainers, AMD (and Asus) about updating EDAC to support my new AMD CPU, and things got resolved eventually. This is also why I hope ECC will one day be more officially supported.

            I'm never going back to non-ECC. What's interesting is that even as a gamer I find it useful for testing XMP or just memory stability in general (in fact I used a slight overclock to test ECC error reporting quickly).
            It makes me more confident that if something changes one day I'll know right away (I have machine check notifications sent whenever a recoverable or non-recoverable error happens).

            Also, I'm not too sure why some people still argue that it is not needed and/or should only be enterprise grade (what is so special about enterprise anyway that makes it so much more valuable than your own wasted time?). As some have pointed out, error correction is already in so many storage and other technologies. If every memory module were made with ECC and every memory controller supported it, it wouldn't really cost more in the end; it would be mainstream, so there is no need to fight against that. Let's just try to raise the bar instead of being against progress. As memory quantities increase (and the amount and rate of data transfer increase), it is even more important, I think.

            I guess if someone has never had ECC in his personal computers and is still stuck with Intel "consumer" products, he may naturally try to defend that (because of jealousy perhaps, I don't know).
            You are absolutely right, and as I said, we need a law that bans non-ECC RAM.
            The cost of the damage that non-ECC RAM does is much, much higher than the cost of ECC RAM,
            even on gaming PCs...



            • Originally posted by Qaridarium

              I have another good idea: AMD should sell the 5950X as a gaming edition with Hyperthreading disabled.
              https://www.pcgameshardware.de/Ryzen...adeon-1364596/
              This would improve gaming performance by up to 45%, and by 5-45% in general.
              I don't see the point. If a gamer wants to do that, they can either turn it off themselves, run software to set thread affinity masks, or the game itself can map its threads to avoid SMT.
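
As a rough illustration of the affinity-mask option, here is a minimal Linux-only Python sketch. It assumes the kernel exposes SMT siblings in `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`, and it uses `os.sched_setaffinity`, which only exists on Linux. The helper names are hypothetical. It pins the current process to one logical CPU per physical core, so none of its threads share a core through SMT.

```python
import os


def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,12' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def one_cpu_per_core(sibling_lists, allowed=None):
    """Keep the lowest-numbered (allowed) SMT sibling from every sibling list."""
    chosen = set()
    for text in sibling_lists:
        siblings = parse_cpu_list(text)
        if allowed is not None:
            siblings = [cpu for cpu in siblings if cpu in allowed]
        if siblings:
            chosen.add(min(siblings))
    return sorted(chosen)


def read_sibling_lists(cpus, root="/sys/devices/system/cpu"):
    """Read the SMT sibling list of each given logical CPU from sysfs."""
    lists = []
    for cpu in cpus:
        path = f"{root}/cpu{cpu}/topology/thread_siblings_list"
        try:
            with open(path) as f:
                lists.append(f.read())
        except OSError:
            pass  # topology not exposed for this CPU; skip it
    return lists


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)  # CPUs this process may currently use
    cpus = one_cpu_per_core(read_sibling_lists(allowed), allowed)
    if cpus:
        os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    print(sorted(os.sched_getaffinity(0)))
```

The same mask could be applied to an already running game with a tool like `taskset`, without touching the BIOS setting.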

              SMT is in general a good thing.

              There would only be a point in AMD selling disabled SMT if that was a common failure point in their chip manufacturing. I doubt it is. From what we see of AMD's strategy if there is any problem with a core the entire core is disabled.



              • Originally posted by Jeff
                The only thing is that I did not notice for a very long time (I think several years passed). I started to get strange crashes, then corrupted files (often save game files in my case), and ultimately Windows was crashing often (back in the days when I was still using Windows; never going back, BTW). I liked to blame Microsoft or the game developers for their bugs. I finally re-ran memtest and discovered that it now reported a lot of errors. I then found how to "fix" it in the BIOS, but all the data damage was already done. I considered all my hard drive data useless (mainly, the OS installation might have been corrupted after so many updates over time) and never tried to recover anything except some pictures and documents.
                I totally understand how you feel, because I went through a similar thing. I was naive before and thought the exact same thing, even though I had a computer science background.

                The first time my RAM failed, it happened during downloads (at first I suspected network errors), but it soon became very reproducible. That in itself was not the issue, because it at least opened my mind to the fact that RAM can indeed be broken. The real issue was all the silent errors that happened before that, one of which propagated to my backups. FWIW, one of the archives I had backed up was corrupted and could not be decompressed (on three different disks, because it was copied with the wrong data).

                I will never use non-ECC memory again, unless it's a throwaway PC or one that I have no intention of ever backing up. It's simply foolish to rely on luck, and people don't care until it happens to them.

                From my experience, ECC memory isn't even that much more expensive, and if you go with Intel, you can find Intel Xeon processors that are pretty much i7s under the hood and support ECC, but the downside is no overclocking available. What's truly the problem, IMO, is motherboards. To support the Intel Xeons and ECC, they tend to be on the higher end workstation motherboards, or server. So it's better to just go AMD, but with their recent policy changes it's kinda sad.



                • Originally posted by bridgman
                  Only very vaguely - are you saying there is a setting today that can work around the issue, or just that it would be good if we had something like that ?
                  I had 3 options. The first was to use an older kernel (I think 5.7RC1 was the last one that worked).
                  The second option was a command used at startup in GRUB2.
                  The third option was to buy a new monitor.

                  I did all 3 in the end. But AMD really should have a driver GUI (no matter how simple) to make it easy for people to, for example, add a startup command in GRUB2 to fix this problem.

                  For many people, and in the past for me too, AMD GPUs with the open source driver are broken with old HDMI standards... HDMI 2.0, on the other hand, works well.

                  Originally posted by bridgman
                  I did ask about this - turns out that quite a few of our datacenter customers only run a single instance, looking for isolation more than sharing AFAIK. That makes producing a consumer card with SR-IOV more problematic. In the short term our focus has been improving the pass-through experience although I do agree we're probably going to have to do something for consumer card sharing at some point.
                  SR-IOV adds a fair amount of hardware cost so I'm not sure it is the best approach for consumer solutions though.
                  I really think that we need SR-IOV for consumer cards. The reason is very simple: I had a case where a person managed to install a trojan horse on my PC by social engineering me into installing a game he wanted to play with me. For cases like this we really need VM features like SR-IOV, so that such games can be installed inside a VM and a trojan horse like this stays trapped inside the VM.
                  Maybe you can do something like this: build SR-IOV into the hardware but sell it as an option via serial key, so people who want to use it pay 50-100€ to AMD and then have the feature activated.
                  I even think AMD should put this feature in consumer hardware anyway; I am sure people would be willing to pay a good price for it.
                  And if datacenter customers buy this card instead of the more expensive one? Really, AMD should not care and should just add even more useful features to the datacenter hardware.

                  Originally posted by bridgman
                  I'm not 100% sure, but since both MI and Pro cards are only shipping with 3840 shaders enabled I suspect that is where the chips are yielding out. The process yields do improve over time so presumably a fully configured chip should become more do-able over time, but my impression was that there was not a big performance difference between 3840 and 4096 shaders on most workloads.
                  I am sure Apple has the exclusive right to sell the 4096 shader version,
                  but in the time of the 6900XT this makes no sense anymore.
                  This means AMD should sell the full 4096 shader version from now on.

                  Originally posted by bridgman
                  Dropping HDMI on some SKUs seems problematic - low end cards are the most likely to need HDMI, while the savings on anything but the least expensive cards seems too small to justify the costs of carrying another SKU. I don't *think* we would need to actually remove the logic from the chip, but if another chip was required that would be a non-starter for sure.
                  Well then, yes, you do not have to remove the logic from the chip.
                  If you sell a 6900XT without an HDMI port, do you still have to pay the HDMI license fee?



                  • Originally posted by Zan Lynx
                    I don't see the point. If a gamer wants to do that, they can either turn it off themselves, run software to set thread affinity masks, or the game itself can map its threads to avoid SMT.
                    SMT is in general a good thing.
                    There would only be a point in AMD selling disabled SMT if that was a common failure point in their chip manufacturing. I doubt it is. From what we see of AMD's strategy if there is any problem with a core the entire core is disabled.
                    Yes, you are right, but many customers do not even know they can disable it in the BIOS...

                    Also, AMD pays a licence fee to Intel for SMT... maybe if they disable it they do not have to pay Intel a licence fee...



                    • Originally posted by Qaridarium
                      Also, AMD pays a licence fee to Intel for SMT... maybe if they disable it they do not have to pay Intel a licence fee...
                      According to Wikipedia (and its sources) SMT was invented by IBM and then first used commercially in the Alpha CPU by DEC. As I remember it, AMD bought most of DEC and their engineers, and DEC technology was a major part of the really great AMD64 design. Although it didn't use SMT.

                      Edit: I guess I remembered that wrong. Compaq bought DEC and HP bought Compaq? And DEC technology went into Intel Itanium? But for some reason I was sure AMD64 used some Alpha tech. Hmm.

                      So if anyone owes license fees for SMT it sounds like it would be Intel. Although the original IBM patents would have expired decades ago.
                      Last edited by Zan Lynx; 08 January 2021, 04:24 PM.

