AMD Begins Prototyping CRIU Support For ROCm Compute

  • AMD Begins Prototyping CRIU Support For ROCm Compute

    Phoronix: AMD Begins Prototyping CRIU Support For ROCm Compute

    As part of AMD's growing HPC focus and the maturing of their Radeon Open eCosystem GPU compute stack, they ended this week by making public a prototype implementation of CRIU support for ROCm...

  • #2
    What is the use case for this? Genuinely curious.


    • #3
      I just hope this comes in handy for GPU resets....

      (by the way, CRIU sounds a lot like cryo)


      • #4
        Originally posted by bezirg:
        What is the use case for this? Genuinely curious.
        Containers.
        CRIU - Wikipedia


        • #5
          Originally posted by bezirg:
          What is the use case for this? Genuinely curious.
          One of the uses is HPC (High Performance Computing). You might be running a multi-day (or multi-week) simulation on a cluster of machines with GPUs and CPUs. If any of them fails or crashes during that period, your computation is basically scrapped, which is a huge waste of resources and money (researchers usually apply for a grant of HPC cluster time and are then allowed to use that time but no more, so if the system, the program, or one node crashes, they are screwed, possibly delaying their research by a year or two). Granted, not all HPC uses are like that, but a lot of them are. With CRIU, you periodically (e.g. every 2 hours) dump the state of the CPU, GPU, memory, network sockets, open files, etc. to storage, then continue. This usually takes just minutes. If the system crashes, you restore from the last good checkpoint and continue. You pay a little in efficiency, plus the time to test that the code can actually checkpoint and restore (which can be done on smaller jobs or even on a single workstation), but you improve reliability a lot and are protected from the risks mentioned above.
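To make the checkpoint-and-resume cycle described above concrete, here is a rough sketch in Python. It is only an illustration of the idea, not how CRIU works: CRIU snapshots a whole running process from the outside, whereas this toy saves its own state with `pickle`. The function name `run_with_checkpoints`, the `crash_at` parameter, and the "simulation" itself are all made up for the example.

```python
import os
import pickle
import tempfile


def run_with_checkpoints(total_steps, checkpoint_every, path, crash_at=None):
    """Advance a toy simulation, saving its state every `checkpoint_every` steps.

    Hypothetical helper for illustration only. If a checkpoint file already
    exists at `path`, the run resumes from it instead of starting over.
    """
    if os.path.exists(path):
        # Restore from the last good checkpoint.
        with open(path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "value": 0}

    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            # Stand-in for a node failure partway through the job.
            raise RuntimeError("simulated node failure")

        # One unit of "work".
        state["value"] += state["step"]
        state["step"] += 1

        if state["step"] % checkpoint_every == 0:
            # Write to a temp file, then rename, so a crash during the
            # dump cannot corrupt the previous good checkpoint.
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, path)

    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "state.ckpt")
        try:
            run_with_checkpoints(10, 3, ckpt, crash_at=7)
        except RuntimeError:
            pass  # the job died; only work since the last checkpoint is lost
        print(run_with_checkpoints(10, 3, ckpt))  # {'step': 10, 'value': 45}
```

In this run the crash at step 7 only costs the one step done since the checkpoint at step 6. The second call picks up from there and finishes with the same result an uninterrupted run would give. The checkpoint interval is the trade-off knob: shorter intervals lose less work on a crash but spend more time dumping state.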


          • #6
            Miner wants to switch his rig over to do his day job and then switch it back at the end of the day.
