Asim Shankar
Process checkpointing and restarting (using dumped core)
This page describes a system for checkpointing and restarting UNIX processes. It differs from some existing implementations in two ways:
(a) It does not require executables to be linked with a special library, so processes can be checkpointed unchanged; and, more interestingly,
(b) in the manner in which a checkpointed process is restarted. Other systems (such as ckpt and esky) need a complex mechanism to restore the stack and register state of the checkpointed process, because both are also used by the restoration code. This system is simpler because the restarted process and the restoration code live in independent address spaces. The system runs entirely at user level and requires no modifications to the kernel.
Updates
- March 1, 2005 - Well, I finally got to it and fixed the "Could not read name of note #1" error. The examples now seem to work on my Linux 2.6 kernel (and I suspect they should work in 2.4 as well). Let me know if there are any problems. Thanks.
- March 3, 2004 - An error (Could not read name of note #1) seems to appear on various distributions/kernels. This is because of a slight difference between the core format of the kernel I used to develop and test (the one that ships with Mandrake 9.1) and that of these other kernels. I am aware of the problem and the fix should not be complicated. Unfortunately, I haven't found the time to do this myself yet. If you do, please let me know. Thanks.
The core file contains a complete memory dump of the process, thus in theory it should be possible to restore the process to the same state it was in when the core was dumped.
However, there are many unanswered questions when it comes to restarting from this state. What happens to the open file descriptors? The files may have changed, and how do you handle sockets, pipes, file offsets and so on? Then there are issues with process ids: does the process id have to be the same as before? What about the parent-child relationship? What about signal handling state, such as which signals are blocked? How does the process see the time that has elapsed since the checkpoint?
The answers to these questions determine what exactly one means by "restarting" a process from the checkpointed state and how to go about it. However, for jobs that are essentially compute intensive, where inter-process communication and signal handling aren't the major concern, the process address space has all the information necessary. The point is that restarting from the address-space dump in the core can serve a worthwhile purpose.
The result so far is a system that can checkpoint and then restart any process along with file descriptors, with the following caveats:
- File descriptors of only regular files, directories and symbolic links can be checkpointed. No character/block devices, sockets or pipes
- Signal handlers are not restored (default ones are used)
- Processes that have used dlopen() to open a dynamic library are not restarted successfully
- Programs must be single threaded
- Only a single process will be checkpointed, thus programs that use fork(), exec() (or other things like system() and popen()) are in trouble
- Programs that use the mmap() call to map files to the process' address space cannot be restarted
However, of the above, the mmap() and dlopen() limitations are likely to be fairly easily overcome.
Given these limitations, which some other checkpointing systems share, I'm of the opinion that things are done much more simply here than in other systems. Details follow.
Here's an overview of the steps the restart utility takes in order to restart a process given the executable file and the core dump file:
- Open the executable and core files and read their ELF headers
- From the NOTES program header of the core file, get the PR_STATUS structure (which has register values) of the checkpointed process
- fork(), we now have a CHILD process (which will be the restarted image) and the PARENT process (that sets up the child)
- CHILD: ptrace(PTRACE_TRACEME,...) and then exec() the executable file
- PARENT: Set up a breakpoint in the child.
This is done as follows: Save the instruction in the child at the entry point of the executable and replace it with the INT3 instruction (opcode 0xCC). Then do a ptrace(PTRACE_CONT,...). This allows the child process to run until it reaches the entry point (normally the address of the _start function). Once there, it executes the INT3 instruction, which causes a SIGTRAP to be generated and returns control to the parent process. (Letting the child run up to the entry point allows the address space to be initialized and the code to be loaded.) In the case of statically compiled binaries (e.g. gcc with -static), we would want to break at the address of main() instead of the entry point.
- PARENT: With the help of the LOAD sections in the core file, restore the address space of the child.
(The program headers with type LOAD specify the virtual address and the location in the file where the contents of that address can be found)
- PARENT: Restore the registers of the child, read in from the NOTES section of the dumped core
- PARENT: Detach the child (ptrace(PTRACE_DETACH,...))
- THE CHILD PROCESS IS NOW READY TO EXECUTE FROM THE POINT IT WAS CHECKPOINTED
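The header-reading steps in the list above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code of the restart utility; it assumes a 32-bit little-endian (IA32) ELF image held in memory, and the function names are hypothetical. It walks the program headers and returns the (virtual address, contents) pairs described by the LOAD headers, which is the information the parent needs to rebuild the child's address space.

```python
import struct

PT_LOAD, PT_NOTE = 1, 4   # program header types from the ELF specification
ET_CORE = 4               # e_type value used by core files


def read_program_headers(image):
    """Return (e_type, list of program header dicts) for an ELF32 LSB image."""
    if image[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if image[4] != 1 or image[5] != 1:  # ELFCLASS32, ELFDATA2LSB
        raise ValueError("this sketch only handles 32-bit little-endian ELF")
    # Fields of Elf32_Ehdr that follow the 16-byte e_ident array.
    (e_type, _machine, _version, _entry, e_phoff, _shoff, _flags,
     _ehsize, e_phentsize, e_phnum, _shentsize, _shnum,
     _shstrndx) = struct.unpack_from("<HHIIIIIHHHHHH", image, 16)
    headers = []
    for index in range(e_phnum):
        (p_type, p_offset, p_vaddr, _paddr, p_filesz, p_memsz,
         p_flags, _align) = struct.unpack_from(
            "<8I", image, e_phoff + index * e_phentsize)
        headers.append({"type": p_type, "offset": p_offset,
                        "vaddr": p_vaddr, "filesz": p_filesz,
                        "memsz": p_memsz, "flags": p_flags})
    return e_type, headers


def load_segments(image):
    """Return [(virtual address, bytes)] for every LOAD header in the image."""
    _e_type, headers = read_program_headers(image)
    return [(h["vaddr"], image[h["offset"]:h["offset"] + h["filesz"]])
            for h in headers if h["type"] == PT_LOAD]
```

In a real restart, each returned pair would be written into the traced child at the given address (for example with PTRACE_POKEDATA); that part is left out here because it depends on the ptrace setup.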
The use of the exec() call and breaking at the entry point of the program handles the initialization of the process' address space and loading the executable code of the program and the used dynamic libraries (except those explicitly mapped by dlopen()). ckpt and esky (see Other Systems) handle the restart by making the restart process overwrite its own address space. This can be quite complicated as one must make sure that the code of the restart process remains intact and there are a host of related issues that must be carefully dealt with. The methodology above is much simpler as the address space of the restart process and the restarted process are completely independent.
File Descriptors -
File descriptors are handled with the help of a dynamic library that must be put into the LD_PRELOAD environment variable. This library installs a special signal handler for the SIGQUIT signal which dumps information on the open file descriptors to a text file. This text file is then read in during the restart process mentioned above after the fork() and before the exec() and file descriptors are restored with their offsets.
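As an illustration of the restore step, here is a minimal sketch in Python. The layout of the text file is not specified on this page, so the sketch assumes a hypothetical format of one line per descriptor with tab-separated fields: descriptor number, open flags, offset, path. The function name is also hypothetical. It is meant to run after the fork() and before the exec(), so that the exec()ed program inherits the descriptors.

```python
import os


def restore_descriptors(listing_path="filedescriptors"):
    """Reopen each recorded regular file on its original number and offset."""
    # Read the whole listing first so that the listing's own descriptor is
    # closed before any of the recorded numbers are claimed.
    with open(listing_path) as listing:
        entries = [line.rstrip("\n").split("\t", 3)
                   for line in listing if line.strip()]
    restored = []
    for fd_text, flags_text, offset_text, path in entries:
        fd, flags, offset = int(fd_text), int(flags_text), int(offset_text)
        # Drop creation-time flags: reopening must never truncate or create.
        flags &= ~(os.O_CREAT | os.O_EXCL | os.O_TRUNC)
        new_fd = os.open(path, flags)
        if new_fd != fd:
            os.dup2(new_fd, fd)           # move it to the recorded number
            os.close(new_fd)
        os.set_inheritable(fd, True)      # must survive the later exec()
        os.lseek(fd, offset, os.SEEK_SET)
        restored.append(fd)
    return restored
```

Note that os.dup2() silently replaces whatever was open on the target number, which is the desired behaviour in a freshly forked child but would be destructive elsewhere.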
Based on the methodology described above, a system was implemented. Some things regarding the implementation:
- The system works on Linux and requires kernel 2.4 or above
(The mmap2() system call is used to allocate pages to the process after the program was exec()ed. Kernel 2.2 doesn't seem to have this call implemented)
- The "checkpoint" file used is an ELF core file with type ET_CORE. This implementation works on the IA32 architecture (The architecture affects, among other things, the registers available etc.)
- In such a system, the stack starts at 0xbfffffff and "grows" to lower addresses
- The .text, .data and .bss segments of the executable are loaded starting at 0x08048000. Dynamic libraries are loaded by ld at 0x40000000 onwards
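To show how the architecture enters the picture, here is a hedged Python sketch of pulling the saved registers out of the NOTES segment of an IA32 core. The note record layout (name size, descriptor size, type, then the name and descriptor, each padded to 4 bytes) follows the ELF specification. The offset of the register block inside the process-status note (type NT_PRSTATUS) and the register order are assumptions based on the usual i386 Linux layout, and should be checked against the kernel headers of the target system. Function names are hypothetical.

```python
import struct

NT_PRSTATUS = 1
PR_REG_OFFSET = 72   # assumed offset of the register block on i386 Linux
I386_REGS = ("ebx", "ecx", "edx", "esi", "edi", "ebp", "eax", "xds", "xes",
             "xfs", "xgs", "orig_eax", "eip", "xcs", "eflags", "esp", "xss")


def iter_notes(segment):
    """Yield (name, type, descriptor) for each note in a PT_NOTE segment."""
    pos = 0
    while pos + 12 <= len(segment):
        namesz, descsz, ntype = struct.unpack_from("<III", segment, pos)
        pos += 12
        name = segment[pos:pos + namesz].rstrip(b"\0")
        pos += (namesz + 3) & ~3          # name is padded to 4 bytes
        desc = segment[pos:pos + descsz]
        pos += (descsz + 3) & ~3          # so is the descriptor
        yield name, ntype, desc


def prstatus_registers(segment):
    """Return a dict of register values from the first NT_PRSTATUS note."""
    for name, ntype, desc in iter_notes(segment):
        if name == b"CORE" and ntype == NT_PRSTATUS:
            values = struct.unpack_from("<17I", desc, PR_REG_OFFSET)
            return dict(zip(I386_REGS, values))
    raise ValueError("no NT_PRSTATUS note found")
```

The eip and esp entries of the returned dict are the values a restart would hand back to the child before detaching.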
Checkpointing in this system simply means generating a core dump. Here we describe ways to do that and the slightly different methodology used to checkpoint file descriptors (which are not checkpointed in the core dump).
Using a signal -
There are some signals (SIGSEGV and SIGQUIT, among others) whose default disposition is to make the process dump core and quit. Thus, one way of creating a checkpoint of a running process is to send it the SIGQUIT signal. There is a limit on the allowable size of the core dump, and often the default setting does not allow a core file to be created at all. To remedy this, before running the process type the following in the bash shell:
ulimit -c unlimited
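When the job is started from a launcher script rather than an interactive shell, the same limit can be adjusted programmatically. The following is a small sketch in Python using the standard resource module; the function name is hypothetical. Unlike the shell command above, it raises the soft limit only as far as the hard limit allows, which is the most an unprivileged process can do.

```python
import resource


def allow_core_dumps():
    """Raise the soft core-size limit to the hard limit and return both."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # An unprivileged process may raise its soft limit up to the hard limit.
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)
```

The limit is inherited by child processes, so calling this before starting the job has the same effect as setting it in the shell. If the hard limit is itself 0, core dumps remain disabled.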
Using the debugger (gdb) -
NOTE: For checkpointing with gdb, you require gdb version 5.2 or greater (which implements the gcore command)
A debugger can be attached to a running process and then used to manipulate it. gdb has a command, "gcore", that creates a core dump of the process. In fact, with the debugger you can bring a process to a "safe" state before dumping core. For example, if the process opens sockets, does some processing and then closes the sockets, you can use gdb to set a breakpoint at a point where all sockets are closed and then create the core dump. Thus, when the process is resumed from the core file, there are no open socket fds to worry about. To attach gdb to a running process use:
gdb <executable filename> <process id>
Checkpointing file descriptors -
The file descriptor table is maintained by the kernel and thus doesn't lie in the process' address space. Therefore, information on open file descriptors doesn't seem to be present in the core file. Furthermore, various issues arise when trying to restore them: what do you do with sockets and pipes? What happens if the file has been moved? This system, however, provides rudimentary support for regular files (regular meaning files/directories as opposed to sockets or pipes). On receipt of a SIGQUIT signal, we record for each open file descriptor its descriptor number, filename, offset and flags, and write all this information to another file. The information on open file descriptors is taken from /proc/self/fd. The default signal handler is then restored and the process is sent another SIGQUIT signal, which forces a core dump.
Use of this special signal handler does not require any relinking. We use the environment variable LD_PRELOAD to load our library (libsavefds.so), which installs the special signal handler.
In summary, to checkpoint a process with file descriptors, ensure that libsavefds.so is present in LD_PRELOAD before starting the process and then when you need to checkpoint it, send the process a SIGQUIT signal.
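The logic of that signal handler can be sketched as follows. This is an illustration in Python, not the source of libsavefds.so (which is a shared library loaded through LD_PRELOAD). The output format, tab-separated descriptor number, flags, offset and path, is an assumption made for this example, as are the function names.

```python
import fcntl
import os
import signal
import stat


def collect_descriptors():
    """Return (fd, flags, offset, path) for open regular files/directories."""
    records = []
    for entry in os.listdir("/proc/self/fd"):
        fd = int(entry)
        try:
            mode = os.fstat(fd).st_mode
        except OSError:
            continue  # the descriptor used for the listing is already closed
        if not (stat.S_ISREG(mode) or stat.S_ISDIR(mode)):
            continue  # skip sockets, pipes and character/block devices
        path = os.readlink("/proc/self/fd/%d" % fd)
        flags = fcntl.fcntl(fd, fcntl.F_GETFL)
        offset = os.lseek(fd, 0, os.SEEK_CUR)
        records.append((fd, flags, offset, path))
    return sorted(records)


def save_descriptors(listing_path="filedescriptors"):
    """Write one tab-separated line per descriptor to listing_path."""
    records = collect_descriptors()  # collect before opening the output file
    with open(listing_path, "w") as out:
        for record in records:
            out.write("%d\t%d\t%d\t%s\n" % record)
    return records


def on_sigquit(signum, frame):
    """Save descriptor state, then let the default action dump core."""
    save_descriptors()
    signal.signal(signal.SIGQUIT, signal.SIG_DFL)
    os.kill(os.getpid(), signal.SIGQUIT)


def install_handler():
    signal.signal(signal.SIGQUIT, on_sigquit)
```

Collecting the records before opening the output file keeps the listing's own descriptor out of the saved state.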
The core component of this system is the restart program, which essentially implements the methodology explained above. Not much had to be done for checkpointing, as we basically ask the kernel for a core dump to create the checkpoint (checkpointing file descriptors uses a special library, libsavefds.so, as mentioned earlier).
The usage of this utility is shown below:
Usage: restart [options] <executable filename> <core filename>
Options:
-b, --breakpoint=ADDRESS When execing the program to be restarted then run till given
instruction ADDRESS before restoring address space and registers
(Default is the entry point of the executable, which is generally
the address of the _start function, thus all dynamic libraries are
loaded by this time. Specifying this is useful for statically linked
executables (Compiled with the --static flag in gcc)).
-f, --filedes[=FILENAME] Restore file descriptors from FILENAME created by
libsavefds.so (Default FILENAME is "filedescriptors")
-n, --nostop Do not pause the restarted process
(By default the process must be sent a SIGCONT to continue)
-s, --select Make detailed selections while the address space is restored
-V, --verbose Be a bit verbose about what is being done while restarting
-w, --wait Wait for restarted process to finish execution
-h, --help Display this help and exit
-v, --version Display version information and exit
Special mention must go to the -b option, which is useful when it comes to statically linked executables. The -b option takes an address as its argument: the address at which the exec()ed process is paused and the state of the checkpointed process is restored. The system requires that, by the time all instructions up to this breakpoint have executed, the program code and the code of the required dynamic libraries have been loaded into the address space of the process (that is, ld has done its job). Most executables are dynamically linked to libc and ld, and the entry point of these executables (the _start function) has the characteristics required of the breakpoint address. However, in the case of statically linked executables, the entry point is often 0 and at this address even the program code has not been loaded. Hence, for such executables an acceptable breakpoint would be the address of the main() function. One could look up the symbol table to determine this address; however, in case the symbols have been stripped, the -b option can be used to specify it.
I'd appreciate it if you'd share any comments/suggestions/queries you may have by emailing me.
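The symbol-table lookup mentioned above could look roughly like the following. This is a sketch in Python, assuming an unstripped 32-bit little-endian ELF executable held in memory; the function name is hypothetical and it is not part of the restart utility.

```python
import struct

SHT_SYMTAB = 2   # section type of the full symbol table


def find_symbol(image, wanted=b"main"):
    """Return the address of `wanted` from the symbol table, or None."""
    if image[:4] != b"\x7fELF" or image[4] != 1 or image[5] != 1:
        raise ValueError("expected a 32-bit little-endian ELF image")
    e_shoff, = struct.unpack_from("<I", image, 32)
    e_shentsize, e_shnum = struct.unpack_from("<HH", image, 46)
    # Each Elf32_Shdr is ten 32-bit words.
    sections = [struct.unpack_from("<10I", image, e_shoff + i * e_shentsize)
                for i in range(e_shnum)]
    for section in sections:
        sh_type, sh_offset, sh_size = section[1], section[4], section[5]
        sh_link, sh_entsize = section[6], section[9]
        if sh_type != SHT_SYMTAB or sh_entsize == 0:
            continue
        strtab_offset = sections[sh_link][4]   # sh_link names the string table
        for pos in range(sh_offset, sh_offset + sh_size, sh_entsize):
            st_name, st_value, _size, _info, _other, _shndx = \
                struct.unpack_from("<IIIBBH", image, pos)
            start = strtab_offset + st_name
            end = image.index(b"\0", start)
            if image[start:end] == wanted:
                return st_value
    return None   # stripped binary or symbol not present
```

A result of None corresponds to the stripped case, where the address has to be found by other means and supplied by hand.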
Some other interesting things that I came across while figuring out the logistics of this system:
- Sandeep's articles on ptrace - A series of 3 articles on the "ptrace" system call and some interesting hacks. These articles appear in issues 81, 83 and 85 of Linux Gazette.
- core_restart.c - A system which reconstructs an executable file from its core dump.
- ELF Format Specifications - Search Google for "elf format specification"
- To play around with ELF files, you can use "readelf" and "objdump". These utilities (and a hex editor!) were what helped me figure out the nitty-gritty. They are part of the binutils package and should be installed on most systems.
- The Linux Kernel source code, specifically fs/binfmt_elf.c and fs/exec.c
- Information on system calls and how they take parameters : http://www.lxhp.in-berlin.de/lhpsyscal.html
Some other checkpointing systems:
- ckpt - A checkpointing system developed at the University of Wisconsin
- esky - Doesn't suffer from the mmap() and dlopen() limitations that this thing does
- libckpt - Transparent checkpointing under UNIX (1995)
- EPCKPT - Checkpoint/restart utility built into the Linux kernel
- checkpointing.org - The home of checkpointing packages
Last modified: Tue Mar 01 15:05:27 Central Standard Time 2005