Linux 6.9 Adding AMD MI300 Row Retirement Support For Problematic HBM Memory


  • #31
    Yes, I recall hearing various things about radiation effects on semiconductors over the years, but the one I mostly think about
    these days is just "bit flips" due to either cosmic rays or natural radioactivity in the materials of the computer itself, where occasionally a decay produces something energetic enough to flip a bit.

    I recall hearing that the ceramic packages formerly used for many chips (memories, CPUs, ...) were much more problematic in that respect than the more modern epoxy/plastic potting compounds are.

    But anyway, unless the radiation is enough to reprogram the non-volatile / OTP chip, it'll be a soft error that a reboot or ECC can fix.

    Old UV-EPROMs were erased by a dose of UV shining through a glass/quartz window in the package so they could be reprogrammed,
    though sunlight would eventually do the same thing after a few weeks, and so would X-rays. Those devices were basically
    floating-gate MOS transistors that had charge trapped on the gate by avalanche injection (a Zener-like breakdown) at a "high" programming voltage, and that stored charge
    would normally stay there for many years unless optically erased.



    So anything in NV/OTP chips of that nature (FLASH, EPROM, OTP) is actually vulnerable to "catastrophic" bit flipping by radiation, or just to slow decay with time (years) / temperature.

    That is actually something that has often caused the "death" of old electronics over decade time scales:
    the EPROM / OTP / FLASH chips that once held critical program code / data or calibration / configuration settings are fine for a long time, until
    the bits flip and corrupt the memory. At that point there's no fix unless one has a backup of the correct data and the means to
    reprogram the chip or replace it with a correctly programmed one, and that's often impractical since such chips have often been discontinued & unobtainable for 10+ years before the problem arises.



    Re: machines humming along for decades -- yes, that's actually one thing I'm nostalgic and upset about. We live in such a
    "throw away" society and "planned obsolescence" is actually engineered deeply into the "quality" and architecture of modern
    equipment. A lot of old electronics I've worked with from say the 1940s-1980s was actually incredibly well built and with just
    occasional maintenance could be kept alive for decades. Even computing equipment from the 1980s-1990s quite often was
    relatively robust beyond keeping the capacitors / corrosion / dust handled.

    Now one is unlikely to find things with warranties exceeding 1-3 years, QC is so bad that there are often "bought new and it doesn't work" failures, and very marginal designs are pretty likely to fail after a few years vs. decades.

    It is tragic environmentally, and also because we basically have personal supercomputers by any standard which would have prevailed in the 1990s, i.e.
    "A 1480-processor Cray T3E-1200 was the first supercomputer to achieve a performance of more than 1 teraflops running a computational science application, in 1998" -- A single GPU I bought in ~2008 had 1 TFLOPS performance and retailed for under $300.

    So now we've got all this 1, 2, 3, 4 generation old computing equipment that goes into landfills, mostly unbroken (though a tragic lot of it
    dies in the first handful of years in service) and still potentially very useful. If it were architected for better expansion / reuse, it would
    be a lot more so vs. just ending up trashed in 3-10 years.


    Originally posted by coder View Post
    The radiation one really threw me, though. I have no idea how big an issue it is, but the aspect of it I find troubling is that it doesn't require the device to be operating. I can understand use-related wear, like electromigration, but I find it unsettling that you can't even reliably preserve these devices in storage!

    Regarding wear-related use, one thing I've heard about Intel's semiconductor fabrication technology is that they maintain a standard that their chips should be able to withstand 10 years of continuous operation. I'm not sure if I'm quoting that exactly right, and for sure it doesn't apply to all of their CPU models (e.g. not "unlocked" desktop processors), but at least some of the models made on each process node should withstand that amount of usage.

    At my job, some of our oldest servers (mostly old test machines) have been humming away in a corner, for longer than that. All Intel CPUs, FWIW. Light-to-moderate continuous load.



    • #32
      Originally posted by duby229 View Post
      What's your view about Wikipedia then? I post links to provide general information all the time. I don't see much difference between AI-generated responses and Wikipedia content generated by any willing contributor. When posting links to general information, can I post the Wikipedia article or should I scour the sources quoted by the article?

      Honestly I'd trust GPT-4 over most of this forum's users any day, even over myself in most subjects.
      I would not trust ChatGPT; it is only a program, limited by the knowledge of the people programming it. Like any tool it can be useful, but that doesn't mean the results will always be right.

      I asked it to consider the following scenario: the only 3 celestial objects are the Sun, Jupiter and Pluto, and the two planets are positioned on either side of the Sun at a distance equal to the furthest reaches of our solar system. They are not in orbit; they are stationary.

      The Law of Universal Gravitation and Relativity both state that both planets should start falling towards the Sun.

      Which will reach the Sun first?

      This question does not seem to have an agreed-upon answer.

      I was watching an interview with a NASA physicist who said that in this scenario, Jupiter would reach the Sun first because it is so massive that not only does the Sun pull on it, but it pulls on the Sun significantly more than Pluto pulls on the Sun, so in effect the distance that Jupiter has to travel is shorter.

      This seems reasonable, up until you consider that because Jupiter has greater mass than Pluto it also has greater inertia, and the Sun by virtue of its mass has even more inertia, meaning that because it's easier to accelerate Pluto it should make up for any shortening of distance between Jupiter and the Sun.

      Physics professors that I have asked have sided with this second assumption for the most part, at times with slight variations in the reasoning.

      Some variations included that despite the substantial difference in mass between Jupiter and Pluto, it's insignificant compared to the mass of the Sun, so this doesn't come into play.

      ChatGPT gave me three different answers, each with physics and math justifications: that Jupiter would win, that Pluto would win, and that they would tie.
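
      For reference, here is a minimal Newtonian sketch of the simpler two-body version of this race, where each planet falls toward the Sun with the other planet ignored (an assumption; the scenario above has all three bodies present). Two point masses released at rest a distance d apart meet after t = (pi/2) * sqrt(d^3 / (2 G (m1 + m2))), so the pair with the larger combined mass closes slightly sooner. The starting distance of 200,000 AU is borrowed from a later post in this thread, the masses are rounded reference values, and `fall_time` is a hypothetical helper name.

```python
import math

# Rounded reference values (assumptions, not taken from the thread).
G = 6.674e-11          # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30       # kg
M_JUPITER = 1.898e27   # kg
M_PLUTO = 1.303e22     # kg
AU = 1.496e11          # m


def fall_time(m1, m2, d):
    """Seconds for two point masses, released at rest d metres apart, to meet.

    This is half the period of a degenerate (radial) Kepler orbit with
    semi-major axis d / 2 around the combined mass m1 + m2.
    """
    return (math.pi / 2) * math.sqrt(d**3 / (2 * G * (m1 + m2)))


d = 200_000 * AU  # assumed starting distance
t_jupiter = fall_time(M_SUN, M_JUPITER, d)
t_pluto = fall_time(M_SUN, M_PLUTO, d)

print(f"Jupiter-Sun fall time: {t_jupiter:.6e} s")
print(f"Pluto-Sun fall time:   {t_pluto:.6e} s")
print(f"Pluto / Jupiter ratio: {t_pluto / t_jupiter:.6f}")
```

      With these numbers Jupiter's lead is roughly 0.05% of the fall time, and that ratio does not depend on the starting distance. This says nothing about the mutual pull between the two planets, which the two-body formula leaves out.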



      Last edited by sophisticles; 21 February 2024, 12:41 AM.



      • #33
        Originally posted by sophisticles View Post
        consider the following scenario: the only 3 celestial objects are the Sun, Jupiter and Pluto, and the two planets are positioned on either side of the Sun at a distance equal to the furthest reaches of our solar system. They are not in orbit; they are stationary.

        The Law of Universal Gravitation and Relativity both state that both planets should start falling towards the Sun.

        Which will reach the Sun first?

        This question does not seem to have an agreed-upon answer.

        I was watching an interview with a NASA physicist who said that in this scenario, Jupiter would reach the Sun first because it is so massive that not only does the Sun pull on it, but it pulls on the Sun significantly more than Pluto pulls on the Sun, so in effect the distance that Jupiter has to travel is shorter.

        This seems reasonable, up until you consider that because Jupiter has greater mass than Pluto it also has greater inertia, and the Sun by virtue of its mass has even more inertia, meaning that because it's easier to accelerate Pluto it should make up for any shortening of distance between Jupiter and the Sun.
        WTF? Objects in a vacuum fall at the same rate, irrespective of mass. That's basic secondary school physics. The NASA person is right.

        Originally posted by sophisticles View Post
        Physics professors that I have asked have sided with this second assumption for the most part, at times with slight variations in the reasoning.

        Some variations included that despite the substantial difference in mass between Jupiter and Pluto, it's insignificant compared to the mass of the Sun, so this doesn't come into play.
        Insignificant? You just asked which is first. The problem doesn't say anything about significance.

        I'll throw in another factor: the diameter of Jupiter gives it an extra advantage, if you're counting "reaching the Sun" to mean the point of first "contact", for some definition of contact. If, when you say they're at the same distance, you mean their centers, this would give Jupiter a shorter distance to cover.

        I suppose we could consider the effect of the solar wind... somehow, I wouldn't expect that to outweigh the other factors in Jupiter's favor.
        Last edited by coder; 21 February 2024, 03:44 AM.



        • #34
          Originally posted by pong View Post
          Yes, I recall hearing various things about radiation effects on semiconductors over the years, but the one I mostly think about
          these days is just "bit flips" due to either cosmic rays or natural radioactivity in the materials of the computer itself, where occasionally a decay produces something energetic enough to flip a bit.
          I worry about cosmic rays a lot less than I worry about defective or failing memory. I've seen many ECC errors in the past couple decades, and they've pretty much all been specific to one or more DIMMs. I do not see just random, sporadic ECC errors, like you'd expect from cosmic rays.

          Originally posted by pong View Post
          But anyway, unless the radiation is enough to reprogram the non-volatile / OTP chip, it'll be a soft error that a reboot or ECC can fix.
          That's not what I read. They were talking about terrestrial radiation causing actual wear on the transistors. I have no idea about the rate...

          Originally posted by pong View Post
          That is actually something that has often caused the "death" of old electronics over decade time scales:
          the EPROM / OTP / FLASH chips that once held critical program code / data or calibration / configuration settings are fine for a long time, until
          the bits flip and corrupt the memory. At that point there's no fix unless one has a backup of the correct data and the means to
          reprogram the chip or replace it with a correctly programmed one, and that's often impractical since such chips have often been discontinued & unobtainable for 10+ years before the problem arises.
          The computer module responsible for traction control and anti-lock braking in my car died after 10+ years. I wondered if its flash memory simply suffered bit rot.

          BTW, the easy way to combat this is to have the device periodically refresh itself, which modern SSDs are continually doing in the background.
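
          As a toy illustration of why a periodic refresh helps, here is a minimal sketch using a Hamming(7,4) code: each 4-bit value is stored as 7 bits, a scrub pass recomputes the syndrome, and any single flipped bit is repaired and written back before a second flip can land in the same word. This is only an assumption-laden model; real SSD controllers use much stronger codes and firmware-specific refresh policies, and the names `encode`, `scrub` and `decode` are hypothetical.

```python
DATA_POS = (3, 5, 6, 7)    # 1-based positions holding the 4 data bits
PARITY_POS = (1, 2, 4)     # 1-based positions holding the parity bits


def syndrome(bits):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if bits[pos]:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value as a Hamming(7,4) codeword (index 0 is unused)."""
    bits = [0] * 8
    for k, pos in enumerate(DATA_POS):
        bits[pos] = (nibble >> k) & 1
    s = syndrome(bits)             # parity bits are still 0 at this point
    for p in PARITY_POS:
        bits[p] = 1 if s & p else 0
    return bits


def scrub(bits):
    """Repair a single flipped bit in place; return True if a fix was made."""
    s = syndrome(bits)
    if s:
        bits[s] ^= 1               # the syndrome is the position of the bad bit
    return s != 0


def decode(bits):
    """Extract the 4-bit value from a codeword."""
    return sum(bits[pos] << k for k, pos in enumerate(DATA_POS))


# Simulated storage: a few words, one of which suffers a single bit flip.
store = [encode(n) for n in (5, 9, 14)]
store[1][6] ^= 1                   # "bit rot" in word 1
repaired = [scrub(word) for word in store]
print(repaired, [decode(word) for word in store])
```

          The catch is that this code only handles one flip per word: if two bits in the same word decay between scrub passes, the scrub "repairs" the wrong bit. That is the reason for refreshing periodically rather than waiting until the data is needed.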

          Originally posted by pong View Post
          we basically have personal supercomputers by any standard which would have prevailed in the 1990s, i.e.
          "A 1480-processor Cray T3E-1200 was the first supercomputer to achieve a performance of more than 1 teraflops running a computational science application, in 1998" -- A single GPU I bought in ~2008 had 1 TFLOPS performance and retailed for under $300.
          Yeah, I had basically the same thought, when I bought my first 1+ TFLOPS GPU. Except, mine happened to come in a box with a scantily clad woman on it, which I found especially incongruous.

          Originally posted by pong View Post
          So now we've got all this 1, 2, 3, 4 generation old computing equipment that goes into landfills, mostly unbroken (though a tragic lot of it
          dies in the first handful of years in service) and still potentially very useful. If it were architected for better expansion / reuse, it would
          be a lot more so vs. just ending up trashed in 3-10 years.
          I keep most of my stuff for 5-10 years, but one machine I have has been in service for 12 years and another is 14 years old. However, I don't run them 24/7.

          One thing that particularly bugs me is that even obsolete graphics cards tend to have perfectly good coolers on them, yet the entire thing gets discarded. Too bad there's not a standard, so that you could reuse them on other cards.



          • #35
            WTF? Objects in a vacuum fall at the same rate, irrespective of mass. That's basic secondary school physics. The NASA person is right.
            The NASA physicist actually said the exact opposite: that Jupiter would reach the Sun first. To understand the reasoning, we need to remember that Newtonian mechanics cannot be used to analyze this problem, because Newton is a special case of Relativity that only works under conditions like those we find on Earth.

            The reason why objects on Earth (and on the Moon, when they tested using a hammer and feather) appear to fall at the same rate is that the difference in their masses is insignificant compared to the difference between each of them and a celestial body, for instance the Earth or Moon.

            If we consider a more extreme example of the problem, we see why the NASA scientist said what he said.

            Consider the scenario where we have our Sun at the center of our solar system, Earth's moon at the edge of our solar system at a distance of 200,000 AU from the Sun, and on the exact opposite side of the Sun a star just like our Sun, with the same mass and diameter.

            In this scenario it becomes clear why the NASA scientist said what he said, all objects pull on each other; even on Earth, it pulls on us and we pull on it and we pull on each other.

            In space, the gravitational pull of the planets is enough to actually cause the Sun to wobble on its orbit.

            With this in mind, in the scenario of the Sun, a star and the Moon, it's obvious that the two stars would be pulling on each other and moving closer together and consequently our Sun in the middle would be moving away from the Moon. Meaning in this hypothetical race the two stars would collide first and then the Moon would eventually join them.

            Of course, there is the wrinkle that the star on the other side of the Sun is also pulling on the Moon, meaning the Moon is being pulled by two stars linearly aligned, and so it would be expected to accelerate faster than if only the Sun were pulling on it.

            Would the acceleration caused by the pulling of the two stars be enough to make up for the longer distance it would have to travel?

            The question with the Sun, Jupiter, Pluto problem is twofold. Would the gravitational attraction between Jupiter and the Sun be great enough to move them closer together by enough to make a measurable difference in the time each planet takes to reach the Sun? And if so, would the gravitational pull of Jupiter on Pluto be enough to accelerate Pluto fast enough to make up for the longer distance it would have to travel?
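
            One way to poke at both questions is to integrate the collinear three-body fall numerically. The sketch below does that under a Newtonian approximation, which is an assumption here given the argument above about relativity. It works in scaled units (G = 1, both initial gaps = 1, central mass = 1), treats the bodies as points, and calls the race when either outer body comes within an arbitrary `contact` distance of the center body. The mass ratios are rounded reference values, and `race` is a hypothetical helper name.

```python
def race(m_left, m_center, m_right, contact=1e-3, eta=1e-3):
    """Collinear three-body fall from rest, in units where G = 1.

    Bodies start at x = -1, 0, +1. Returns "left" or "right" for whichever
    outer body first comes within `contact` of the center body.
    """
    m = [m_left, m_center, m_right]
    x = [-1.0, 0.0, 1.0]
    v = [0.0, 0.0, 0.0]

    def accel(pos):
        # Newtonian pairwise attraction along the line.
        a = [0.0, 0.0, 0.0]
        for i in range(3):
            for j in range(3):
                if i != j:
                    dx = pos[j] - pos[i]
                    a[i] += m[j] * dx / abs(dx) ** 3
        return a

    a = accel(x)
    while True:
        gap_left = x[1] - x[0]
        gap_right = x[2] - x[1]
        smallest = min(gap_left, gap_right)
        if smallest <= contact:
            return "left" if gap_left < gap_right else "right"
        # Shrink the time step as the bodies get close (velocity Verlet).
        dt = eta * smallest ** 1.5
        v = [vi + 0.5 * dt * ai for vi, ai in zip(v, a)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        a = accel(x)
        v = [vi + 0.5 * dt * ai for vi, ai in zip(v, a)]


# Approximate mass ratios relative to the Sun (assumed values).
JUPITER = 9.546e-4
PLUTO = 6.55e-9
MOON = 3.7e-8

print("Jupiter (left) vs Pluto (right):", race(JUPITER, 1.0, PLUTO))
print("Twin star (left) vs Moon (right):", race(1.0, 1.0, MOON))
```

            In this sketch the Jupiter side closes first, and in the twin-star setup the two stars meet well before the Moon arrives. It does not settle whether the difference would be measurable in practice, and it says nothing about relativistic corrections.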
            Last edited by sophisticles; 24 February 2024, 01:30 AM.
