Is ECC RAM worth it for a desktop PC?


  • #16
    25,000 to 70,000 errors per billion device hours per Mbit seems very high.
    Say you use 2Gb ram, 4 hours a day, that's 70 to 200 errors per year.
    So, a hundred crashes or file / filesystem corruptions a year.

    Or do these numbers mean something different?
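A quick way to sanity-check the estimate in this post is to redo the arithmetic. The sketch below assumes errors accrue linearly with capacity and powered-on hours, which is a simplification (the function name and the 365-day year are my own choices, not from the study). It evaluates the quoted rate for both 2 gigabits and 2 gigabytes, since "2Gb" can be read either way.

```python
# Rough estimate only: assumes errors scale linearly with capacity and uptime.
# The rate figures are the ones quoted in the post (errors per billion
# device hours per Mbit); everything else is an illustrative assumption.

def errors_per_year(rate_per_billion_hours_per_mbit, capacity_mbit, hours_per_day):
    """Expected error count per year for a given capacity and daily usage."""
    hours_per_year = hours_per_day * 365
    return rate_per_billion_hours_per_mbit / 1e9 * capacity_mbit * hours_per_year

LOW_RATE, HIGH_RATE = 25_000, 70_000
TWO_GIGABIT_IN_MBIT = 2 * 1024        # 2 Gb  = 2048 Mbit
TWO_GIGABYTE_IN_MBIT = 2 * 1024 * 8   # 2 GB  = 16384 Mbit

for label, mbit in (("2 Gbit", TWO_GIGABIT_IN_MBIT), ("2 GByte", TWO_GIGABYTE_IN_MBIT)):
    low = errors_per_year(LOW_RATE, mbit, 4)
    high = errors_per_year(HIGH_RATE, mbit, 4)
    print(f"{label}: {low:.0f} to {high:.0f} errors per year")
```

Under these assumptions the 70 to 200 figure corresponds to 2 gigabits (roughly 75 to 209 per year); for 2 gigabytes the same rates give about eight times as many (roughly 598 to 1674).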



    • #17
      Do note that the study was published in 2009, and observed servers over a period of 2.5 years. So you are mostly talking about machines which were commissioned in 2006 or earlier. And the errors are clustered, with DIMMs observing one error being more likely to observe another error.



      • #18
        I've been looking at this issue, for either later this year or early next year. In my case, it's for a work-from-home machine. (There are other reasons for putting my own money into this, instead of my employer's, but that's a different issue.)

        For the things I wish to do, basically silicon CAD, I need as much memory as I can get. These days 8G doesn't really cut it for me, any more. As others have said, if a bit flips in code, you'll likely crash. If a bit flips in data, it might later cause a crash due to becoming invalid - the real problem is if that data becomes "reasonable, but wrong." In that case there will be a lurking, potentially undetected error. I want ECC.

        Given that it's my personal budget, that also means AMD. To get ECC with Intel you need a Xeon CPU, then a Xeon motherboard, then typically buffered ECC DIMMs. System cost more than doubles. With AMD, by careful motherboard selection you get ECC for "free". Then you just need to buy unbuffered ECC DIMMs, which are more expensive, but not as much as buffered ECC DIMMs. System cost goes up a few hundred or so.

        One caveat... That "32G" motherboard will only take 24G once you're using ECC DIMMs. The 32G capacity appears to be right at the drive capacity of the chipset/CPU. Once you add the extra 11% chips to support ECC you're over spec, and drop back to 24G.

        I had a motherboard picked out a few months ago, basically a 990-series chipset, but I presume things have changed since. My CAD needs are primarily 2D, so while I want decent graphics performance, I'm not talking SLI/Crossfire and such. I had also been thinking of a Piledriver/Vishera, like you. But I've waited long enough now, I'm kind of wondering what Steamroller will bring. My current workload is primarily integer-intensive, which shouldn't suffer from AMD's poorer floating point, as compared to Intel. But Piledriver is supposed to improve floating point, and there's always future workload.



        • #19
          Originally posted by DeiF
          But, is it really worth it?
          How common are bit-flip errors in modern RAM?
          Is the performance drop noticeable?
          1. I think it is, partly because I leave my computers running 24/7 and don't reboot for many months. If you power them off at night, or reboot frequently, it's less of an issue.

          2. Bit-flip errors are quite common in "modern" RAM. I'm not sure how you define "modern", but the likelihood of experiencing a bit-flip error increases as the number of bits increases. I.e. if you have 8 GB of RAM, you are eight times more likely to experience an error than if you only had 1 GB. As typical RAM capacities continue to grow, so does the likelihood of experiencing an error. And that assumes a constant error rate; some believe the smaller memory cells in modern RAM are more error prone than the larger cells of older fab processes.

          3. There is no performance drop. Zero. Benchmark the same system back to back with ECC enabled or disabled and any observed difference will be within the margin of error. Any "performance drop" is purely theoretical and not measurable in the real world, not with any standard desktop or server applications.

          Originally posted by PreferLinux
          I've got 12 GB of RAM, and never seen a problem.
          Exactly, and that's why it's dangerous. You frequently won't "see" a problem with bit-flip errors. It's called silent corruption. Read the Wikipedia page on silent corruption. It happens to hard drives and it happens in RAM. Your data becomes corrupt and you don't notice until it's too late. Ever opened a JPG file you had saved, a photo from your camera for example, and found it has weird corruption, colored stripes, etc? That's a bit-flip error. The bit-flip may have occurred on your hard drive (very common) or it may have occurred in RAM (also common).

          The only way to completely eliminate the effects of a bit-flip error is to use ECC memory, and to use RAID disks for your storage.



          • #20
            About error rates:
            In the Google study they found that as memory chips aged, the number of errors would rise. This somewhat offset the higher number of errors due to higher capacity, as newer modules would typically have higher capacity.

            About performance:
            On my AMD FX-8350 I measured kernel compile times with ECC enabled and disabled, and found no difference apart from the normal variation.
            Modern high-end GPUs since NVidia Fermi and AMD Southern Islands use a different method of ECC which uses "normal" memory and not extra address lines. There you can measure a performance impact.

            On data corruption:
            RAID does not typically detect bit flip errors on hard disks, unless it employs some kind of data integrity check. Some expensive hardware RAID controllers do that, and ZFS RAID-Z does it too.



            • #21
              Originally posted by chithanh
              About error rates:
              In the Google study they found that as memory chips aged, the number of errors would rise. This somewhat offset the higher number of errors due to higher capacity, as newer modules would typically have higher capacity.
              I remember that one, it was a good read. It's important to distinguish between error rate, and number of errors. Two very different things. Even if error rate stays the same, you will have double the number of errors, if you double the RAM capacity. The combination of a higher capacity plus a higher rate, equals a whole lot more errors. High capacity + age = double whammy.

              Originally posted by chithanh
              On data corruption:
              RAID does not typically detect bit flip errors on hard disks, unless it employs some kind of data integrity check. Some expensive hardware RAID controllers do that, and ZFS RAID-Z does it too.
              All but the cheapest consumer-grade RAID controllers do integrity checking. Even the Linux kernel software RAID runs regular consistency checks. If the kernel software RAID cannot read a block from one disk, it will remap that block to a new location, and reconstruct it from the other RAID members. You can also force a manual consistency check at any time, with "echo check > /sys/block/md0/md/sync_action" assuming md0 is your array.

              You're correct though in that it doesn't detect disk errors on the fly (unless it's accompanied by a read error), only during the regularly scheduled consistency check.
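For anyone who wants to script the manual check, here is a minimal Python sketch. It assumes the md attributes live under /sys/block/<array>/md/ (that is where sync_action and mismatch_cnt sit on the kernels I have seen) and that it runs as root on a machine with an md array. The function names and the sysfs_root parameter are my own, added only so the code can be pointed at a scratch directory for testing.

```python
from pathlib import Path

def md_dir(array="md0", sysfs_root="/sys/block"):
    """Directory holding the md attributes for one array (assumed layout)."""
    return Path(sysfs_root) / array / "md"

def request_check(array="md0", sysfs_root="/sys/block"):
    """Ask the kernel to start a consistency check, like 'echo check > ...'."""
    (md_dir(array, sysfs_root) / "sync_action").write_text("check\n")

def current_action(array="md0", sysfs_root="/sys/block"):
    """Return the current sync action, e.g. 'idle' or 'check'."""
    return (md_dir(array, sysfs_root) / "sync_action").read_text().strip()

def read_mismatch_count(array="md0", sysfs_root="/sys/block"):
    """Return the mismatch counter reported after the last check."""
    return int((md_dir(array, sysfs_root) / "mismatch_cnt").read_text().strip())
```

A non-zero mismatch count after a check only says that the members disagree somewhere; it does not say which member holds the correct data.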



              • #22
                If you do many things in parallel and/or develop and build software, 8 GB of RAM is not enough.
                Even when I do something as simple as building Firefox (with Arch it's quite easy), it takes so much RAM that ld gets killed while linking if I don't watch what else I have open.



                • #23
                  Originally posted by torsionbar28
                  You're correct though in that it doesn't detect disk errors on the fly (unless it's accompanied by a read error), only during the regularly scheduled consistency check.
                  Indeed, that helps you only if the disk reports an error. If the disk (or the controller, very nasty but it happens) returns bogus data pretending everything is a-ok, then you are still out of luck. And even during a consistency check, RAID can only detect the errors, not correct them.



                  • #24
                    Originally posted by DeiF
                    I'm planning a big upgrade for a desktop PC which is 6 years old already.
                    Where I've found it most limited is RAM quantity (only 2GB).
                    My most memory demanding tasks now require about 8GB of RAM (without the OS, programs, etc).
                    Since I want to be future-proof, I've decided to max out whatever is the maximum RAM in the new system (32GB).
                    I'm currently considering buying an Asus Crosshair V Formula Z and an AMD FX-8350 CPU.
                    Since both components support ECC RAM, and my budget seems to be enough to buy it, I'm considering that option.

                    But, is it really worth it?
                    How common are bit-flip errors in modern RAM?
                    Is the performance drop noticeable?


                    ECC RAM availability in my zone is pretty rare, but I've found this kit that may work:
                    KVR16E11K4/32 (4x8GB DDR3 ECC 1600MHz CAS 11)

                    It isn't listed in the motherboard docs though, so it's a bit risky. I haven't contacted Asus yet.
                    My opinion is that ECC isn't really needed for desktop usage, but all RAM should be buffered; unfortunately the CPU makers don't agree with me...



                    • #25
                      Originally posted by torsionbar28
                      All but the cheapest consumer-grade RAID controllers do integrity checking. Even the Linux kernel software RAID runs regular consistency checks. If the kernel software RAID cannot read a block from one disk, it will remap that block to a new location, and reconstruct it from the other RAID members. You can also force a manual consistency check at any time, with "echo check > /sys/block/md0/md/sync_action" assuming md0 is your array.

                      You're correct though in that it doesn't detect disk errors on the fly (unless it's accompanied by a read error), only during the regularly scheduled consistency check.
                      Indeed, that doesn't protect you against bit flips, which are not read/write errors but incorrect reads/writes.
                      Checksumming helps against that, e.g. btrfs uses checksums in metadata, so it can recover from a bit flip in a RAID configuration.
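To illustrate why a separately stored checksum lets you recover where a plain mirror cannot, here is a toy Python sketch. It is not how btrfs or any RAID implementation is actually laid out; the helper names are made up, and CRC-32 from zlib merely stands in for whatever checksum a real filesystem uses.

```python
import zlib

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

def pick_good_copy(copies, stored_crc):
    """Return the first copy whose CRC-32 matches the stored checksum, else None."""
    for copy in copies:
        if zlib.crc32(copy) == stored_crc:
            return copy
    return None

original = b"reasonable, but wrong"
stored_crc = zlib.crc32(original)   # kept separately, like checksum metadata

mirror_a = flip_bit(original, 3)    # silently corrupted, no read error reported
mirror_b = original                 # intact copy

# A plain mirror comparison only tells us the two copies differ,
# not which one is right.
mirrors_disagree = mirror_a != mirror_b

# With the stored checksum we can tell which copy to trust.
recovered = pick_good_copy([mirror_a, mirror_b], stored_crc)
```

With only one copy and a checksum, the flip would still be detected but could not be repaired, so `pick_good_copy([mirror_a], stored_crc)` returns `None`.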



                      • #26
                        It turns out that non-ECC RAM is actually a security risk, as bit flips can be exploited. "Bit-squatting" from Black Hat 2011:

                        http://www.youtube.com/watch?v=_si0FYl_IOA



                        • #27
                          Is your desktop computer that mission critical?
                          ECC is mainly used for high-availability, mission-critical systems.



                          • #28
                            If your motherboard supports ECC, absolutely go for it. I used to use it on my desktop, and it was rock solid...never had a single problem with it. It's a simple thing that makes your computer more robust and less likely to go haywire.

                            These days, I don't use ECC just because most systems don't seem to have support for it. If they did, I'd switch back in a heartbeat.
                            Free Software Developer .:. Mesa and Xorg
                            Opinions expressed in these forum posts are my own.



                            • #29
                              I seriously question the value of ECC memory. This is not because I believe bit-errors (and more importantly their detection) are unimportant but rather because there are simply so many other sources of error on a modern system.

                              Properly evaluating the utility or necessity of ECC DRAM requires one to adopt a systems approach. First, can you list all sources of DRAM in your system? Everything from the main memory to DRAM chips on hard disks and Ethernet controllers. Secondly, of those chips, how many of them are either ECC or utilise some kind of software parity checking? Chances are at this point you'll find that your GPU (unless it is a workstation class card) has non-ECC memory.

                              Next, can you tell me the error rates for all disks in your system? What about network connections? Checksums are not as strong as many like to make out. When SCP'ing large quantities of data over a local network it is not unusual (often because of a bad cable) for errors to occur. These errors -- in my experience -- are usually picked up at the application level by SCP, having first passed the Ethernet, IP and TCP checksums.

                              If you're happy with all of that then it is time to move onto internal interconnects. Take moving data across the PCI(-e) bus as an example. Or between a S-ATA drive and the host controller. What is the error rate here?

                              Finally, let's look at the CPU. What is the error rate (as in miscomputation of a result) here? Sure, it shouldn't happen, but still, for a meaningful evaluation of ECC memory to be made a rate is necessary.

                              If this is still not enough, let's go for the elephant in the room: hardware and software bugs. I suspect these are orders of magnitude more prevalent than any of the hardware issues outlined above.

                              Regards, Freddie.



                              • #30
                                I would buy a PC with ECC RAM, but the cheapest CPU (Xeon 1225v3) costs about 210 Euro and the cheapest motherboard (Asus P9D WS) costs 200 Euro.
                                That CPU becomes outdated after about 3 years, when the integrated GPU becomes outdated because of driver support.
                                Intel doesn't sell discrete GPUs. Open source drivers for AMD discrete GPUs aren't usable, and the selection of fanless 28nm discrete GPUs is too limited.
