The ECC DDR4 RAM Overclocking Potential With AMD Threadripper On Linux


  • #21
    Originally posted by schmidtbag View Post
    In general, overclocking ECC is pretty counterproductive. The whole point of having ECC is to improve stability and reliability. When you overclock, you're going beyond the manufacturer's specs, therefore pushing beyond the intended limits of the product. So basically the benefit you're paying extra for is negated.
    That's plain wrong. ECC isn't providing generic "stability and reliability". ECC isn't a certification or a warranty that can be voided by "going beyond the manufacturer's specs".

    ECC provides the ability to detect and correct single-bit errors and to detect multi-bit errors, and overclocking the RAM doesn't invalidate that ability.
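To make "correct single-bit, detect multi-bit" concrete, here is a toy sketch of a SECDED code in Python. It is an assumption-laden illustration: it protects 4 data bits with an extended Hamming (8,4) code, whereas ECC DIMMs typically protect 64 data bits with 8 check bits, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 = overall parity, 1..7 = Hamming(7,4) positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # make the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: treat as a single-bit error.
        # The syndrome is the flipped position (0 means the parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, cannot be fixed.
        return "uncorrectable", None

    d = [code[3], code[5], code[6], code[7]]
    return status, sum(bit << i for i, bit in enumerate(d))
```

Flipping any one bit of a codeword decodes back to the original data with status `corrected`; flipping any two bits is reported as `uncorrectable` rather than silently returning wrong data.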

    Overclocking increases your chances of getting memory errors, but ECC will deal with that until you go far overboard with your OC and start getting multi-bit errors. It also provides a very useful log of all RAM errors, so you know how far you can push the OC with much more information than just "the OS hasn't crashed yet", which is all you have on a normal consumer system. This allows a much safer and more scientific approach to the matter.
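On Linux, one place that error log can surface is the kernel's EDAC sysfs tree. The sketch below assumes an EDAC driver for your memory controller is loaded and exposes `ce_count` (corrected) and `ue_count` (uncorrected) under `/sys/devices/system/edac/mc/mc*`; the helper names and the wording of the verdicts are made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {'corrected': n, 'uncorrected': n}} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # counter missing or unreadable: skip this controller
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


def verdict(counts):
    """Turn raw counters into a rough hint about the current overclock."""
    if not counts:
        return "no EDAC data (driver not loaded, or no ECC reporting)"
    if any(c["uncorrected"] > 0 for c in counts.values()):
        return "uncorrected errors seen: back the overclock off"
    if any(c["corrected"] > 0 for c in counts.values()):
        return "corrected errors seen: the overclock is marginal"
    return "no errors logged so far"
```

Running `print(verdict(read_edac_counts()))` before and after a stress test gives a more informative answer than waiting for a crash.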

    Then of course what you do with your system depends on its purpose; not all servers are supposed to be 0-downtime-no-matter-the-cost.

    Overclocking your company's supercritical database server is probably NOT a good idea, while overclocking your compute nodes or webserver farm makes a lot more sense, and there ECC is going to be pretty useful, as I said above.



    • #22
      Originally posted by schmidtbag View Post
      the lack of ECC at 3GHz+ is either because:
      A. There's some industry regulation I'm not aware of that legally prevents them from doing so.
      B. There's no demand (which is BS).
      C. Because they can't guarantee performance/reliability - something ECC users care about since that's the sole purpose of ECC.
      Anyway - I have no doubt ECC is capable of going higher. But can!=should. This is probably why neither AMD nor Intel offer overclocking on their server platforms.
      They are all wrong, the correct answer is D (and I'm kind of amazed that I'm the only one here actually able to answer you)

      D. The main reason there are upper limits on RAM frequency (ECC or not) is the RAM controller, which is NOT inside the RAM itself but inside the CPU. RAM does not come with its own controller; at most you get a signal-repeating register/buffer chip on Buffered/Registered ECC modules.
      RAM manufacturers can easily make RAM that runs at higher speeds; RAM per se is a relatively simple, dumb, repetitive thing, like all types of memory. What they can't do is make the RAM controller in your CPU go faster. That is AMD's/Intel's job, and it is not easy to do.



      • #23
        Originally posted by starshipeleven View Post
        That's plain wrong. ECC isn't providing generic "stability and reliability". ECC isn't a certification or a warranty that can be voided by "going beyond the manufacturer's specs".
        I never said anything about certification or warranties.
        Overclocking increases your chances of getting memory errors, but ECC will deal with that until you go far overboard with your OC and start getting multi-bit errors. It also provides a very useful log of all RAM errors, so you know how far you can push the OC with much more information than just "the OS hasn't crashed yet", which is all you have on a normal consumer system. This allows a much safer and more scientific approach to the matter.
        I know. I agree. I don't understand why people think I don't agree with this.
        Overclocking your company's supercritical database server is probably NOT a good idea, while overclocking your compute nodes or webserver farm makes a lot more sense, and there ECC is going to be pretty useful, as I said above.
        Hence the entire premise of my point.
        Originally posted by starshipeleven View Post
        D. The main reason there are upper limits on RAM frequency (ECC or not) is the RAM controller, which is NOT inside the RAM itself but inside the CPU. RAM does not come with its own controller; at most you get a signal-repeating register/buffer chip on Buffered/Registered ECC modules.
        RAM manufacturers can easily make RAM that runs at higher speeds; RAM per se is a relatively simple, dumb, repetitive thing, like all types of memory. What they can't do is make the RAM controller in your CPU go faster. That is AMD's/Intel's job, and it is not easy to do.
        Good point - that had occurred to me, but many systems that don't allow CPU overclocking (whether that be because of the CPU itself or the chipset) have RAM overclocking. Granted, I don't know if any server motherboards allow for this, but some Xeon-based workstations do. So, although the potential market for higher-clocked ECC RAM is somewhat small, it would still be profitable. It's important to keep in mind that a single ECC compatible PC is most likely going to have at least 4x DIMMs.



        • #24
          applied a -81.25 mV vcore offset to help it clock higher
          Aren't you supposed to increase the voltage for higher clock frequencies? You can't just think about thermal throttling here and ignore voltage-related stability.



          • #25
            Originally posted by schmidtbag View Post
            I never said anything about certification or warranties.
            You treated it as such by implying that it somehow stops doing its job if you "go beyond the manufacturer specifications".

            Hence the entire premise of my point.
            Having the same premise does not make your point any less invalid.
            Your point was that ECC is useful only if you need a highly stable and reliable system, and that if you OC "the benefit you're paying extra for is negated."

            This is simply not true, no benefit is negated as ECC still works as advertised. ECC still increases the "stability and reliability" of the system and it is worth the money even if you OC (as it removes a lot of the guesswork to do a stable OC, and can stabilize what would be an unstable OC, if you feel like pushing it).

            Good point - that had occurred to me, but many systems that don't allow CPU overclocking (whether that be because of the CPU itself or the chipset) have RAM overclocking. So, although the potential market for higher-clocked ECC RAM is somewhat small, it would still be profitable.
            I'm not sure you understood D. CPU overclocking is irrelevant here.
            There is 0 market for RAM that can go higher than a certain high frequency because no RAM controller can reach those frequencies, period.

            When you "overclock the RAM" you are overclocking two components together: the RAM chips (on the DIMMs) and the RAM controller (in the CPU).

            The component that is actually holding frequencies back is the controller, which is integrated in the CPU; you cannot change it.

            Granted, I don't know if any server motherboards allow for this, but some Xeon-based workstations do.
            I've seen a fair share of serious server boards that do allow memory OC, it may not be prevalent but it's possible to source them if you need them.
            Random examples from google:
            https://www.hardwarecanucks.com/news...-motherboards/
            https://www.servethehome.com/overclo...memory-speeds/



            • #26
              Originally posted by stefantalpalaru View Post
              Aren't you supposed to increase the voltage for higher clock frequencies? You can't just think about thermal throttling here and ignore voltage-related stability.
              Keep in mind, I'm using precision boost overdrive, not a traditional manual OC. PBO automatically applies more voltage when it clocks the CPU up (the default setting in BIOS is to give it an extra 200 mV of leeway.) However, PBO is fairly conservative and applies more voltage than needed, so that someone who loses the silicon lottery won't end up with instability. Even with the offset, the CPU is still adding voltage as it goes into higher clocks, it's just adding less of it, which improves thermals and helps it maintain clocks.

              I found that a -100mV offset was actually rock solid under full synthetic load (mprime torture test on all cores), but it was unstable in some of the intermediate power states, where PBO doesn't pile on quite as much voltage. The -100mV offset was crashing while running the Phoronix Test Suite: specifically the PHP code of PTS itself, not any of the benchmarks! Yesterday I ran into the same problem with my -80mV offset and had to ease it back to -40mV. The problems at -80mV were showing up while running dpkg of all things, not under heavy load.

              I believe there's a way in BIOS to adjust voltage offsets for each pstate separately. I may write a script to torture-test each pstate individually by locking the CPU to a certain pstate and then running mprime, and then I can probably have my -100mV offset back under full load.
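A rough sketch of what such a per-pstate torture script could look like follows. It rests on several assumptions that may not hold on a given system: that the cpufreq driver exposes `scaling_available_frequencies` (acpi-cpufreq does; other drivers may not), that pinning `scaling_min_freq` and `scaling_max_freq` to the same value is a good enough stand-in for "locking a pstate", that the script runs as root, and that `mprime -t` starts the torture test. The function names are hypothetical.

```python
import subprocess
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def available_frequencies(cpu_root=CPU_ROOT):
    """Frequencies (kHz) advertised by the cpufreq driver for cpu0, highest first."""
    path = Path(cpu_root) / "cpu0" / "cpufreq" / "scaling_available_frequencies"
    return sorted({int(f) for f in path.read_text().split()}, reverse=True)


def pin_frequency(khz, cpu_root=CPU_ROOT):
    """Pin every core to one frequency by setting min == max (needs root)."""
    for policy in sorted(Path(cpu_root).glob("cpu[0-9]*/cpufreq")):
        min_file = policy / "scaling_min_freq"
        max_file = policy / "scaling_max_freq"
        current_max = int(max_file.read_text())
        # Write in an order that never asks for min > max along the way.
        order = (max_file, min_file) if khz >= current_max else (min_file, max_file)
        for f in order:
            f.write_text(str(khz))


def stress(seconds, command=("mprime", "-t")):
    """Run the stress tool; True if it was still running when time ran out."""
    try:
        subprocess.run(list(command), timeout=seconds, check=False)
    except subprocess.TimeoutExpired:
        return True   # survived the whole window at this frequency
    return False      # exited early: worth investigating


def sweep(seconds_per_state, cpu_root=CPU_ROOT, command=("mprime", "-t")):
    """Stress each advertised frequency in turn and record which ones survived."""
    results = {}
    for khz in available_frequencies(cpu_root):
        pin_frequency(khz, cpu_root)
        print(f"testing {khz} kHz", flush=True)  # last line printed shows where a hang hit
        results[khz] = stress(seconds_per_state, command)
    return results
```

A hard lockup will not produce a return value, so the printed progress line (ideally redirected to a file on disk) is what tells you which state was under test when the machine died.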



              • #27
                Originally posted by starshipeleven View Post
                You treated it as such by implying that it somehow stops doing its job if you "go beyond the manufacturer specifications".
                I really didn't, and I specified multiple times that wasn't the case.
                Having the same premise does not make your point any less invalid.
                Your point was that ECC is useful only if you need a highly stable and reliable system, and that if you OC "the benefit you're paying extra for is negated."
                No, my point is ECC is appealing as a product to buy because of stability, so it's counter-intuitive to overclock it. Think of it like this:
                Let's say you bought a non-ECC module rated at 3.0GHz and an ECC module rated at 2.4GHz, used with the same motherboard (not at the same time, obviously). Hypothetically, let's say you got the ECC module to reach 3.0GHz, maybe with some timing or voltage adjustments. Since you have no proof of silicon quality at purchase, the ECC module could very easily end up producing enough errors that they would cause an occasional system crash on a non-ECC module. Of course, the nice thing is now you've got a way to recover from those errors, but the fact of the matter is, your stability is already compromised. So, why get ECC at that point? If you care about squeezing in extra performance, you might as well just get a higher-clocked non-ECC module.
                Keep in mind, I do see a real market for higher-clocked ECC. I find that to be necessary, I just think it's counter-intuitive to OC it, when compared to higher-clocked non-ECC modules.
                This is simply not true, no benefit is negated as ECC still works as advertised. ECC still increases the "stability and reliability" of the system and it is worth the money even if you OC (as it removes a lot of the guesswork to do a stable OC, and can stabilize what would be an unstable OC, if you feel like pushing it).
                ECC is meant to be stable. Pushing clocks higher than the manufacturer's specs reduces stability. ECC will remain more stable than non-ECC when overclocked, but since ECC default clock speeds are so low, you've got such a long way to go to reach any real performance, in which case you're gambling a high risk with stability.
                I'm not sure you understood D. CPU overclocking is irrelevant here.
                I know it isn't..... my point is RAM can still often be overclocked regardless of whether the [rest of the] CPU can. I mentioned this because you said the RAM controller can't be made to go faster.



                • #28
                  Originally posted by schmidtbag View Post
                  Of course, the nice thing is now you've got a way to recover from those errors, but the fact of the matter is, your stability is already compromised.
                  It literally isn't. The errors are being recovered in this scenario. There are literally zero bad consequences and therefore no risks.

                  So, why get ECC at that point? If you care about squeezing in extra performance, you might as well just get a higher-clocked non-ECC.
                  Because people can have more than one goal.

                  ECC is meant to be stable. Pushing clocks higher than the manufacturer's specs reduces stability. ECC will remain more stable than non-ECC when overclocked, but since ECC default clock speeds are so low, you've got such a long way to go to reach any real performance, in which case you're gambling a high risk with stability.
                  Not really. I'm well within factory spec for Samsung B-Die. I might be exceeding spec on the PCB but that doesn't matter nearly as much.



                  • #29
                    Originally posted by MaxToTheMax View Post
                    It literally isn't. The errors are being recovered in this scenario. There are literally zero bad consequences and therefore no risks.
                    No, it isn't "literally zero bad consequences", because errors are being produced in situations they otherwise wouldn't be. ECC isn't magical; it doesn't guarantee perfect data 100% of the time. Boost the clocks a little bit higher and the error correction won't keep up.
                    Because people can have more than one goal.
                    Yes, and my point is that goal is stupid if the final outcome doesn't yield better results than a non-ECC module proven to work at a higher clock speed from factory. Of course, it can yield better results, but to me, the risk isn't worth it. I'd rather be able to buy ECC that comes with higher clocks out-of-the-box.
                    Not really. I'm well within factory spec for Samsung B-Die. I might be exceeding spec on the PCB but that doesn't matter nearly as much.
                    None of that means anything. What are the factory specs? What did you OC to? What are your voltages or timings? How are you testing for stability? It's not hard to OC to something and get it to run perfectly stable on everyday workloads, but that doesn't mean it won't crash once you put a heavy load on it. I've gone well over a year without crashes/freezing on an overclock that failed stress tests.
                    ECC can handle a few errors here and there no problem, but the fact that it is encountering errors at all where a non-ECC module (of the same clock) wouldn't is not something I would consider a desirable outcome.
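The "a little bit higher and the error correction won't keep up" effect can be put into rough numbers. The sketch below is a simplification under stated assumptions: every bit in a 72-bit ECC word (64 data + 8 check bits) flips independently with the same probability `p`, and any word with two or more flipped bits counts as beyond correction. Real error patterns are not independent, so treat the output as an illustration of the trend only.

```python
from math import comb


def word_error_probs(p, bits=72):
    """Return (P[no flips], P[exactly one flip], P[two or more flips]) per word."""
    none = (1 - p) ** bits
    single = bits * p * (1 - p) ** (bits - 1)
    # Sum the binomial tail directly; 1 - none - single loses precision for tiny p.
    multi = sum(
        comb(bits, k) * p**k * (1 - p) ** (bits - k) for k in range(2, bits + 1)
    )
    return none, single, multi


if __name__ == "__main__":
    for p in (1e-9, 1e-8, 1e-7, 1e-6):
        none, single, multi = word_error_probs(p)
        print(f"p={p:.0e}  correctable={single:.3e}  beyond correction={multi:.3e}")
```

For small `p`, the correctable single-bit rate grows roughly linearly with `p`, while the multi-bit rate grows roughly with its square, which is why a steadily rising corrected-error count is the warning sign to act on.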



                    • #30
                      Originally posted by schmidtbag View Post
                      No, it isn't "literally zero bad consequences", because errors are being produced in situations they otherwise wouldn't be. ECC isn't magical; it doesn't guarantee perfect data 100% of the time. Boost the clocks a little bit higher and the error correction won't keep up.
                      I'm not even close to that point though. The RAM is still rock solid.

                      Yes, and my point is that goal is stupid if the final outcome doesn't yield better results than a non-ECC module proven to work at a higher clock speed from factory. Of course, it can yield better results, but to me, the risk isn't worth it.
                      Not can. Will. Picking ECC in this scenario reduces risk. The primary downside is cost, and that's a perfectly legitimate reason not to buy ECC.

                      I'd rather be able to buy ECC that comes with higher clocks out-of-the-box.
                      And that product doesn't exist.

                      None of that means anything. What are the factory specs? What did you OC to? What are your voltages or timings? How are you testing for stability? It's not hard to OC to something and get it to run perfectly stable on everyday workloads, but that doesn't mean it won't crash once you put a heavy load on it. I've gone well over a year without crashes/freezing on an overclock that failed stress tests.
                      Overclocked to DDR4-2933, which is hardly "fast", at 1.2 volts with 16-15-15-36 timings. I stability-tested it using memtest86+, mprime, and all the memory benchmarks I ran; this testing regimen certainly caught the problems when I pushed too far for the stock timings.
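For anyone who wants to translate those settings into latency and bandwidth figures, here is a small arithmetic sketch. It assumes the usual DDR conventions (two transfers per clock, a 64-bit data path per channel); the function names are made up, and the channel count is a parameter because the thread does not state how many channels are populated.

```python
def cas_latency_ns(data_rate_mts, cas_cycles):
    """First-word CAS latency in nanoseconds for a DDR data rate in MT/s."""
    # DDR transfers twice per clock, so one clock period is 2000 / MT/s nanoseconds.
    return cas_cycles * 2000.0 / data_rate_mts


def peak_bandwidth_gbs(data_rate_mts, channels=1, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    return data_rate_mts * 1e6 * bus_bytes * channels / 1e9


if __name__ == "__main__":
    print(f"CAS latency at 2933 CL16: {cas_latency_ns(2933, 16):.2f} ns")
    print(f"Peak bandwidth per channel: {peak_bandwidth_gbs(2933):.3f} GB/s")
```

At 2933 MT/s with CL16 this works out to about 10.91 ns of CAS latency and 23.464 GB/s of theoretical peak bandwidth per channel; pass `channels=4` for a fully populated quad-channel board.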

                      ECC can handle a few errors here and there no problem, but the fact that it is encountering errors at all where a non-ECC module (of the same clock) wouldn't is not something I would consider a desirable outcome.
                      All RAM encounters some bit errors from time to time; ECC just corrects them and warns you.

                      I'm pleased to report that I haven't encountered a single bit error yet at these settings, including under load.

