Thread: Checkpoint-Restore Hits v1.0: Freeze Your Linux Apps

  1. #1
    Join Date
    Jan 2007
    Posts
    15,388

    Default Checkpoint-Restore Hits v1.0: Freeze Your Linux Apps

    Phoronix: Checkpoint-Restore Hits v1.0: Freeze Your Linux Apps

    The Checkpoint-Restore Tool has reached version 1.0 as part of the CRIU project. Checkpoint/Restore In Userspace lets users freeze a running application and checkpoint it to the hard drive as a file; that checkpoint can then be restored to a running process later on. CRIU differs from the Linux kernel's suspend-and-resume in that it handles individual programs and is implemented in user-space...

    http://www.phoronix.com/vr.php?view=MTUyNjE
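
    As a rough illustration of the freeze-to-disk and restore steps described above, here is a minimal Python sketch that only builds the two command lines. The helper names (criu_dump_cmd, criu_restore_cmd) and the example paths are made up for illustration. The long option names are the ones documented for current CRIU releases, so check `criu --help` on your version before relying on them.

```python
def criu_dump_cmd(pid: int, images_dir: str,
                  shell_job: bool = False,
                  leave_running: bool = False) -> list[str]:
    """Build a `criu dump` command that checkpoints the process tree rooted at pid."""
    cmd = ["criu", "dump", "--tree", str(pid), "--images-dir", images_dir]
    if shell_job:
        # Needed when the process was started from an interactive shell.
        cmd.append("--shell-job")
    if leave_running:
        # Keep the original process alive after the checkpoint is written.
        cmd.append("--leave-running")
    return cmd


def criu_restore_cmd(images_dir: str,
                     shell_job: bool = False,
                     detached: bool = False) -> list[str]:
    """Build a `criu restore` command that recreates the process from images_dir."""
    cmd = ["criu", "restore", "--images-dir", images_dir]
    if shell_job:
        cmd.append("--shell-job")
    if detached:
        # Let criu exit once the restored process is running.
        cmd.append("--restore-detached")
    return cmd
```

    The returned lists can be handed to subprocess.run(cmd, check=True). CRIU normally needs root privileges, and the images directory has to exist before the dump.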

  2. #2
    Join Date
    Jun 2013
    Posts
    38

    Default

    Looks really useful and universal, thanks for bringing it up.

  3. #3
    Join Date
    Oct 2009
    Posts
    2,137

    Default

    This actually sounds very interesting...

  4. #4
    Join Date
    Nov 2008
    Posts
    780

    Default

    I really doubt this can be done without violating at least some of the POSIX guarantees and a lot of assumptions that usually hold.
    Any form of network connection or IPC is disconnected (unless all IPC'ing tasks are frozen at the same time), syscalls can disappear when switching hosts, and anything with direct driver interaction is bound to fail (which includes hardware-accelerated graphics).

    Probably useful for HPC (sans OpenCL), contained virtual machines (without direct HW access) and debugging, but as soon as we talk about X apps, it'll fail.

    And then there are a lot of applications that weren't designed to gracefully handle most of the subtle errors that system APIs are in theory allowed to throw, but in practice only throw when someone mucks about with the task's structure. Oh, the joys of debugging those...

  5. #5
    Join Date
    Apr 2010
    Posts
    794

    Default

    Quote Originally Posted by rohcQaH View Post
    I really doubt this can be done without violating at least some of the POSIX guarantees and a lot of assumptions that usually hold.
    Exactly what I was thinking. Most of the applications they list as working are servers of one kind or another - but if you suspend a server, what happens to all the clients that are connected to it? And when it resumes, what happens to all the processes that have suddenly lost all the connections they were writing to? If you're going to mess with things that badly, surely it's easier and cleaner to just shut the process down, and restart later?

  6. #6
    Join Date
    Nov 2011
    Posts
    300

    Default

    Quote Originally Posted by Delgarde View Post
    Exactly what I was thinking. Most of the applications they list as working are servers of one kind or another - but if you suspend a server, what happens to all the clients that are connected to it? And when it resumes, what happens to all the processes that have suddenly lost all the connections they were writing to? If you're going to mess with things that badly, surely it's easier and cleaner to just shut the process down, and restart later?
    With servers, one use is to have a "quickstart" snapshot, so that rather than rerunning the full initialization you just reuse a pre-initialized snapshot.

    But really, they've done a lot of work on all these things you're bringing up--they had kernel patches for a reason, you know.
    There is already "TCP repair", so it can also be used for load balancing (freeze on server1, move to server2, restore).
    For migrating applications, they suggest using rsync or a networked filesystem to share files.
    The bigger issues are at http://criu.org/What_can_change_after_C/R


    Some of the other uses include freezing an application you need to debug later, moving programs into "screen", or having checkpoints for application state.
    There's more at http://criu.org/Usage_scenarios
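
    The "freeze on server1, move to server2, restore" flow can be sketched as three commands. This is only a sketch under assumptions: the function name migration_plan, the host name and the paths are hypothetical, the image directory is assumed to live at the same path on both hosts, and any files the process has open still have to be made available on the target separately (rsync or a networked filesystem, as mentioned above). The `--tcp-established` option is the CRIU switch for checkpointing live TCP connections.

```python
def migration_plan(pid: int, images_dir: str, target_host: str) -> list[list[str]]:
    """Return the dump, copy and restore commands for a simple migration."""
    base = images_dir.rstrip("/")

    # 1. Freeze the process tree on the source host, including open TCP sockets.
    dump = ["criu", "dump", "--tree", str(pid),
            "--images-dir", base, "--tcp-established"]

    # 2. Copy the checkpoint images to the same path on the target host.
    copy = ["rsync", "-a", base + "/", f"{target_host}:{base}/"]

    # 3. Restore on the target host over ssh.
    restore = ["ssh", target_host, "criu", "restore",
               "--images-dir", base, "--tcp-established"]

    return [dump, copy, restore]
```

    Each entry can be run in order with subprocess.run(cmd, check=True), stopping at the first failure. Nothing here moves the IP address to the target host, which the connections also depend on.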

  7. #7
    Join Date
    Nov 2008
    Posts
    780

    Default

    Quote Originally Posted by Ibidem View Post
    But really, they've done a lot of work on all these things you're bringing up--they had kernel patches for a reason, you know.
    There is already "TCP repair", so it can also be used for load balancing (freeze on server1, move to server2, restore).
    Which works only in a limited number of circumstances.
    a) The task must be restarted before the other side times out. The task must not be restarted more than once.
    b) Packets from server2 must be allowed to carry the same IP as packets from server1. Either the whole task must be inside a VM with a bridged network, or the migration must set up a tunnel between server1 and server2.

    Quick migration of running VMs between homogeneous hosts? Check. Which seems to be the primary use case it was developed for, anyway. But anything else? Expect your connections to time out or be dropped.

    Quote Originally Posted by Ibidem View Post
    The bigger issues are at http://criu.org/What_can_change_after_C/R
    I doubt that list is complete.


    What if I hold a write lock on a file on a remote share? Freeze/restore can try to reacquire the lock, but it cannot guarantee that the remote file is unchanged. Of course it could freeze a copy of the file, load it into memory and fake a file descriptor. But then a) the remote file may be left in a corrupt state if the program was frozen during a write operation, and b) any further changes to the file will not appear on the remote share.
    While I doubt the second option violates POSIX constraints from the POV of the program (after all, files may be unlinked), it certainly won't look like the program is working as intended.
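
    To make the "reacquire the lock, but the file may have changed" problem concrete, here is a small application-level sketch. It is not something CRIU does; the helper names are hypothetical, and comparing size and modification time is only a heuristic that misses same-size edits with a preserved timestamp and can be fooled by attribute caching on network filesystems.

```python
import fcntl
import os


def fingerprint(fd: int) -> tuple[int, int]:
    """Cheap change detector: file size and modification time in nanoseconds."""
    st = os.fstat(fd)
    return (st.st_size, st.st_mtime_ns)


def relock_and_verify(fd: int, before: tuple[int, int]) -> bool:
    """Take the exclusive lock again and report whether the file still looks
    the same as when `before` was recorded."""
    # Blocks until the lock is granted; fd must be open for writing.
    fcntl.lockf(fd, fcntl.LOCK_EX)
    return fingerprint(fd) == before
```

    A program would record fingerprint(fd) while it holds the lock, and after a restore call relock_and_verify(fd, saved) before trusting any state it had cached from the file.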


    What about syscalls going away due to a different kernel version or configuration?


    What about access to non-hotplug hardware, that may or may not exist on the target host?

  8. #8
    Join Date
    Jan 2011
    Posts
    102

    Default

    I wonder if this could be used to deterministically freeze applications in mobile environments. For instance, currently on Android I have no idea of what happens when I switch between "apps": sometimes they live on in the background, sometimes they're paused and don't lose state, sometimes they're killed and I have to start over whatever I was doing with them; and it's impossible to tell in advance. Which is a shame given the proper multitasking abilities of the underlying OS.

  9. #9
    Join Date
    Jan 2011
    Posts
    102

    Default

    Quote Originally Posted by rohcQaH View Post
    What if I hold a write lock on a file on a remote share?
    You lose it as you would if some network cable between you and the remote share was unplugged? Certainly both clients and servers are already expected to handle that.

    Quote Originally Posted by rohcQaH View Post
    What about syscalls going away due to a different kernel version or configuration?
    I'd expect them to fail with ENOSYS, as if the restored program had been run on the inferior kernel from the beginning. But then there's no possible success case in that scenario: a program can't be useful if it requires kernel functionality which is not there.
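
    For what it's worth, a program that wants to survive a missing syscall has to check for ENOSYS itself. A minimal sketch of that pattern, with a hypothetical helper name and caller-supplied callables standing in for the real operations:

```python
import errno
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run primary(); if the kernel reports ENOSYS, run fallback() instead."""
    try:
        return primary()
    except OSError as exc:
        if exc.errno != errno.ENOSYS:
            # Any other error is a real failure and is passed on unchanged.
            raise
        return fallback()
```

    This only helps if a fallback exists at all. When the missing functionality is essential, the program fails either way.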

    Quote Originally Posted by rohcQaH View Post
    What about access to non-hotplug hardware, that may or may not exist on the target host?
    This mechanism works at the file descriptor level, where you don't see the hardware directly. Of course, if you checkpoint/restore a program like fsck, you'll have problems.

  10. #10
    Join Date
    Sep 2009
    Posts
    60

    Default Huge for Containers (LXC)

    I'm not sure how the article missed it, but this could be the last piece needed for Linux Containers (LXC) to become acceptable to Linux cloud providers. With this, you can now live-migrate all the processes in a container's process group, which gives cloud providers the downtime reduction they need.

    This is the last piece. It just needs to be integrated as a script in LXC for that purpose. All the features like running snapshots would be possible too.
