Failing A PCIe 5.0 NVMe SSD In Less Than 3 Minutes Without Extra Cooling


  • #11
    Thanks, Michael, for looking into it further. I am going to stay away from PCIe 5.0 drives for now. By the way, did you try downgrading the link speed to PCIe 4.0 or even 3.0 as an alternative to adding a heatsink? https://www.alexforencich.com/wiki/en/pcie/set-speed



    • #12
      Originally posted by bezirg View Post
      Thanks Michael for further looking into it. I am going to stay away from pcie5.0 disks for now.
      Do you have any indication that this is because of PCIe5? Why is there no other PCIe5 drive with this error?



      • #13
        Originally posted by Anux View Post
        Do you have any indication that this is because of PCIe5? Why is there no other PCIe5 drive with this error?
        Read the review on TPU (and the thread that goes with it). It's definitely overheating during sustained operations (writes).



        • #14
          Sure I read it and there is no indication that it has anything to do with PCIe5, therefore my question.

          Originally posted by bug77 View Post
          It's definitely overheating during sustained operations (writes).
          Exactly, overheating of this specific drive, so how did you conclude that all PCIe5 drives will have this error?



          • #15
            Originally posted by Anux View Post
            Sure I read it and there is no indication that it has anything to do with PCIe5, therefore my question.



            Exactly, overheating of this specific drive, so how did you conclude that all PCIe5 drives will have this error?
            Who concluded that all PCIe 5 drives have this problem? It overheats because of the sequential speeds PCIe 5 enables; we've seen this before with PCIe 4 drives. If you read the comment section on TPU, one SM representative says that's not how the drive is supposed to behave and that they're looking into it. Most likely it's supposed to throttle, but for some reason it doesn't. If the problem is in the generic part of the controller firmware, it could affect other drives using the same controller.



            • #16
              As others have mentioned, it's likely that PCIe 5 drives will need more "guts" than before to handle thermal situations better, possibly driving the price too high (?)

              Anyway, PCIe 5 is still in its infancy deployment-wise, so I could see "something" else coming along that moves us away from traditional NVMe down the road.



              • #17
                Originally posted by Anux View Post
                That certainly is an error in the firmware or even the hardware? The device should start to throttle when temps get too high and eventually become slow as fuck, but r/w errors are a major flaw.
                Originally posted by Joe2021 View Post
                But this should not have gone unnoticed during intense testing before mass production. They do such tests, right? Right?
                Just because you run tests does not mean you will see this fault, because the test environment is critical. Let's say I am running these tests on the industry-standard open test bench in a room air-conditioned to 24C and kept in partial clean-room conditions (think increased room air turnover compared to normal). It is also common for the CPU to be sitting under a water cooling block. In that environment the drive may have functioned perfectly correctly, because compared to a normal environment it would be like pointing a fan constantly at it.

                Anux's question is what is overheating. A full shutdown suggests the controller overheating, possibly because of running at PCIe 5.0 speeds. Throttling is not going to work if the OS just ends up thrashing the drive with "where is my data" requests.

                The Corsair MP700 is the first PCI-Express 5.0 SSD that we're reviewing. With transfer rates of up to 10 GB/s this drive is crazy fast. Our review confirms, this is the fastest SSD we've ever tested. With up to 10 W, the MP700 is also the most power-hungry SSD, and it puts out a lot of heat, too.

                Let's be real: the end of the TechPowerUp results says that the drive is not that fast once you get it into real-world setups. It is 2% faster than a PCIe 4.0 SSD, 12% faster than a PCIe 3.0 SSD and 35% faster than a SATA SSD.

                Yes, just because you double the transfer speed does not mean you get double the performance. There have been diminishing returns for quite some time.

                Yes, a PCIe 4.0 SSD normally generates more heat than a PCIe 3.0 SSD, a PCIe 3.0 SSD normally generates more heat than a PCIe 2.0 SSD, and a PCIe 2.0 SSD normally generates more heat than a SATA SSD. Increased transfer speed has meant increased heat generation for many generations now, so a PCIe 5.0 SSD is going to run hotter than a PCIe 4.0 SSD. Something to consider: motherboard-provided SSD cooling would have been tested with PCIe 4.0 drives, because PCIe 5.0 drives did not exist when those boards were designed.

                Yes, this was always going to be the straw that broke the camel's back at some point. Putting the SSD between the GPU and the CPU, two very high heat generators, was eventually going to leave the SSD without enough cooling as increasing transfer speeds brought increased heat generation. I did not expect it to happen with PCIe 5.0; I was thinking PCIe 6.0/PCIe 7.0.

                On a complete overheat the SSD would not have any choice but to terminate the PCIe connection. Of course, in this case it would not be smart to restart the connection once the drive cools.

                Let's say the user goes and lowers the PCIe bus speed: why did you buy a PCIe 5.0 drive in the first place? The performance is then going to be no better than a good PCIe 4.0/3.0 drive, whichever speed you lower it to.
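To put rough numbers on what lowering the link speed costs, here is a minimal sketch of the theoretical one-direction bandwidth of an x4 link at each generation. It assumes the published per-lane signalling rates (8, 16 and 32 GT/s for PCIe 3.0, 4.0 and 5.0) and 128b/130b line encoding, and it ignores packet and protocol overhead, so real drives land below these figures. The function name `link_bandwidth_gb_per_s` is made up for this illustration.

```python
# Per-lane raw signalling rate in GT/s for each PCIe generation.
# PCIe 3.0, 4.0 and 5.0 all use 128b/130b line encoding.
GT_PER_S = {3: 8.0, 4: 16.0, 5: 32.0}


def link_bandwidth_gb_per_s(gen: int, lanes: int = 4) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    raw_bits_per_s = GT_PER_S[gen] * 1e9 * lanes
    payload_bits_per_s = raw_bits_per_s * 128 / 130  # strip encoding overhead
    return payload_bits_per_s / 8 / 1e9  # bits -> bytes, then to GB


if __name__ == "__main__":
    for gen in (3, 4, 5):
        print(f"PCIe {gen}.0 x4: {link_bandwidth_gb_per_s(gen):.2f} GB/s")
```

Each step down a generation halves the ceiling, which is why a PCIe 5.0 drive forced to a 4.0 link cannot outrun a good native PCIe 4.0 drive on sequential transfers.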

                Also, you can kind of see in those TechPowerUp tests that the drive, even with the cooling, was maybe not exactly happy: in the anti-virus test and a few other tests it is losing to PCIe 4.0 drives.

                Maybe PCIe 4.0 is the sweet spot for the current generation of motherboards. Maybe the next generation of motherboards will have something like a heat pipe above the M.2 SSD to give it the cooling needed to handle PCIe 5.0.



                Yes, warnings were put out last year that PCIe 5.0 SSDs would most likely need active cooling, and that passive cooling would not be good enough if the heat problems could not be solved. PCIe 4.0 SSDs may be the highest-performance SSDs you can passively cool.

                Yes, that blog even talked about the type of critical shutdown this PCIe 5.0 SSD just pulled.

                but the NAND isn’t, and the SSD will go into critical shutdown if it detects that the temperature of the NAND is above 80 degrees Celsius (176 degrees F) or so.”
                Yes, critical shutdown: once the NAND chips have crossed 80C, the controller is 100 percent shutting down, because you don't have enough cooling. The blog notes that the controller can run hotter than the NAND, but it is critical that there is enough cooling on the SSD that the heat of the controller does not spread to the NAND chips.

                Now think of heat bleeding out of the CPU and GPU into the M.2 drive in the middle... Yes, the heat generation of GPUs is going up as well.

                Things are basically not good on the heat front. There is an upper limit to how hot computer parts can run before they don't run at all.



                • #18
                  Originally posted by bug77 View Post
                  Who concluded all PCIe 5 drives have this problem? It overheats because of the sequential speeds PCIe 5 enables, we've seen this before with PCIe 4 drives.
                  Originally posted by cjcox View Post
                  Anyway, PCIe 5 is still an infant deployment wise. So, I could see "something" else even that moves us away from traditional NVMe down the road.

                  A lead SSD controller developer warned about this problem with PCIe 5 SSDs: unless something changed, they would require active cooling in most environments.

                  By the time an SSD appears for consumers, it has been in development for years. Durability testing of a new SSD design takes 2-3 years, in fact.

                  The Corsair MP700's PCIe 5.0 controller and NAND combination would have been in testing last year when the Phison developer was sounding the alarm. Maybe Corsair failed to pay attention to the memo that the device would need active cooling.

                  I remember one of the last GPUs with passive cooling before they all started growing fans. The heatsink on that thing got hot enough to burn you when handling it. Yes, that vendor also failed to get the memo that the new, more powerful GPUs from then on would require active cooling. That vendor ended up out of business after being sued for injuries.



                  • #19
                    Originally posted by oiaohm View Post




                    Lead SSD controller developer warned about this problem with PCIe 5 SSD drives and unless something changed they would be requiring active cooling in most environments.
                    Well, it makes sense to require cooling to reach top performance. But just like CPUs, when cooling doesn't cope, they're supposed to throttle back, not crap out. There is definitely something to fix in the firmware.
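As a rough illustration of "throttle back, not crap out", here is a toy sketch of the decision a drive's firmware would be expected to make. It is not the actual firmware logic of any drive. Only the roughly 80C NAND shutdown limit comes from the blog quote earlier in the thread; the 70C throttle point and all names (`Action`, `thermal_action`) are hypothetical.

```python
from enum import Enum


class Action(Enum):
    FULL_SPEED = "full speed"
    THROTTLE = "throttle"
    SHUTDOWN = "critical shutdown"


# Hypothetical throttle point; real firmware thresholds are not public here.
THROTTLE_AT_C = 70.0
# Approximate NAND limit taken from the blog quote in this thread.
SHUTDOWN_AT_C = 80.0


def thermal_action(nand_temp_c: float) -> Action:
    """Pick the expected response for a given NAND temperature in Celsius."""
    if nand_temp_c >= SHUTDOWN_AT_C:
        return Action.SHUTDOWN
    if nand_temp_c >= THROTTLE_AT_C:
        return Action.THROTTLE
    return Action.FULL_SPEED


if __name__ == "__main__":
    for temp in (45.0, 72.5, 81.0):
        print(f"{temp:.1f} C -> {thermal_action(temp).value}")
```

The point of the middle band is that the drive should spend time slowing down before it ever reaches the shutdown limit; a drive that goes straight from full speed to errors or shutdown is effectively missing that band.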



                    • #20
                      Originally posted by oiaohm View Post
                      Just because you run tests does not mean you will see this fault, because the test environment is critical. Let's say I am running these tests on the industry-standard open test bench in a room air-conditioned to 24C and kept in partial clean-room conditions (think increased room air turnover compared to normal). It is also common for the CPU to be sitting under a water cooling block. In that environment the drive may have functioned perfectly correctly, because compared to a normal environment it would be like pointing a fan constantly at it.
                      Sorry, but nope. This is by definition not a proper testing procedure, as it obviously lacks an important aspect. Complete testing involves testing under thermal stress and not in an ideal air-conditioned environment.
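For anyone who wants to watch drive temperatures during a sustained-write stress run of their own, here is a minimal sketch for Linux. It assumes a kernel with NVMe hwmon support, where each drive appears under `/sys/class/hwmon` with the name `nvme` and reports temperatures in millidegrees Celsius in `temp*_input` files. The helper name `nvme_temps_c` is made up for this example, and which sensor maps to the controller versus the NAND varies by drive.

```python
from pathlib import Path


def nvme_temps_c(hwmon_root: str = "/sys/class/hwmon") -> dict:
    """Return {"hwmonN/tempM_input": degrees_celsius} for NVMe hwmon devices."""
    temps = {}
    for dev in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = dev / "name"
        # Skip anything that is not an NVMe drive (CPU, GPU, board sensors).
        if not name_file.is_file() or name_file.read_text().strip() != "nvme":
            continue
        for sensor in sorted(dev.glob("temp*_input")):
            # hwmon reports temperatures in millidegrees Celsius.
            temps[f"{dev.name}/{sensor.name}"] = int(sensor.read_text()) / 1000.0
    return temps


if __name__ == "__main__":
    for sensor, temp in nvme_temps_c().items():
        print(f"{sensor}: {temp:.1f} C")
```

Polling this once a second while writing a large file without a heatsink would show how quickly the drive approaches its limits, and whether throughput drops before anything worse happens.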



