GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling

GPU cluster scheduling would be much easier if a running GPU job behaved like an ordinary CPU process. Pause it. Move it. Resume it somewhere else. Reclaim the device when a higher-priority job arrives. Repair fragmentation without killing user work.

In practice, this is exactly where GPU scheduling gets stuck.

A CPU process can be checkpointed by saving its address space, file descriptors, and kernel-visible state. A GPU task has an extra half of its life outside the normal process abstraction: CUDA contexts, device allocations, streams, events, library handles, kernels in flight, and data resident in GPU memory. The operating system does not naturally know how to serialize that state. The scheduler can stop the host process, but that is not the same thing as having a correct, portable checkpoint of the GPU computation.

FlowGPU is about turning GPU checkpoint/restore into a system primitive. Before FlowGPU became a full system, I wrote cudaw as the first version of the codebase: a CUDA wrapper prototype for interposing on runtime calls, tracking GPU objects, translating application-visible addresses, and making pause/resume/migration possible above an unmodified CUDA application.

Why Schedulers Want This Primitive

Pause/resume and migration change what a scheduler can do.

Without GPU checkpoint/restore, preemption is often blunt. A scheduler can kill a job, ask the framework to checkpoint at a pre-defined training boundary, or wait until the user code cooperates. That is acceptable for some training loops, but it is poorly aligned with cluster events. A high-priority job may arrive now. A GPU may fail now. A fragmented placement may need repair now. Framework-level checkpoints are usually placed for application convenience, not scheduler control.

With a transparent GPU checkpoint, the scheduler gets stronger operations:

  • pause a job and release its GPU memory;
  • resume it later on the same GPU;
  • migrate it to another GPU or node;
  • checkpoint periodically for fault tolerance;
  • defragment the cluster by moving jobs away from awkward placements;
  • support elastic scaling and priority scheduling with less user code involvement.

This is the missing link between scheduling policy and GPU execution. A scheduler may know the right decision, but without a safe migration primitive, it cannot act on that decision cheaply.

The CUDA Wrapper View

The basic idea behind my cudaw prototype is to place a wrapper between the application and the CUDA runtime. Instead of letting the application talk directly to libcudart, the wrapper intercepts CUDA calls such as allocation, memory copy, and kernel launch. For the checkpoint layer, and through it the scheduler, this creates an execution log and a shadow view of GPU state.

This wrapper layer can record which device memory regions exist, what host-side pointers correspond to them, how data moves between CPU and GPU, and which kernels are launched with which arguments. It can also maintain virtual GPU addresses: the application sees stable logical addresses, while the wrapper maps them to real CUDA allocations underneath. That indirection is what makes restore and migration plausible, because the restored task may receive different physical GPU addresses on the target device.
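The address indirection described above can be modeled in a few lines of Python. This is a minimal sketch, not cudaw's actual data structure: the class name `VirtualAddressTable`, its methods, the base address, and the 256-byte alignment are all assumptions made for illustration.

```python
class VirtualAddressTable:
    """Hypothetical model: stable application-visible addresses mapped to
    whatever addresses the real allocator happened to return."""

    ALIGNMENT = 256  # assumed alignment, chosen only for the example

    def __init__(self, base=0x7000_0000_0000):
        self._next_virtual = base
        self._entries = {}  # virtual base -> (physical base, size)

    def register(self, physical, size):
        """Record a new allocation and return the address the application sees."""
        if size <= 0:
            raise ValueError("size must be positive")
        virtual = self._next_virtual
        self._entries[virtual] = (physical, size)
        # Advance by the size rounded up to the alignment.
        aligned = (size + self.ALIGNMENT - 1) // self.ALIGNMENT * self.ALIGNMENT
        self._next_virtual += aligned
        return virtual

    def translate(self, virtual_ptr):
        """Turn an application pointer (possibly interior) into a real one."""
        for base, (physical, size) in self._entries.items():
            if base <= virtual_ptr < base + size:
                return physical + (virtual_ptr - base)
        raise KeyError(f"unmapped virtual address {virtual_ptr:#x}")

    def rebind(self, virtual_base, new_physical):
        """After restore, point the same virtual range at a new allocation."""
        _, size = self._entries[virtual_base]
        self._entries[virtual_base] = (new_physical, size)
```

In this model, a pointer the application stored before the checkpoint stays valid afterwards: only `rebind` changes, and every intercepted call runs its pointer arguments through `translate` before reaching the device.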

In a simplified checkpoint flow, the wrapper reaches a safe point, synchronizes GPU work, copies live GPU memory into a checkpoint image, saves enough metadata to reconstruct CUDA state, and releases the device. Restore reverses the process: allocate memory on the target GPU, rebuild mappings, copy data back, replay necessary CUDA setup calls, and continue execution.
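The flow above can be sketched against a stand-in device. Everything here is hypothetical: `FakeDevice` replaces the real GPU so the example runs anywhere, and `checkpoint` and `restore` are illustrative names, not cudaw or FlowGPU functions. The sketch covers only memory contents and omits replay of CUDA setup calls.

```python
class FakeDevice:
    """Stand-in for a GPU: hands out addresses and stores bytes per allocation."""

    def __init__(self, first_address):
        self._next_address = first_address
        self.memory = {}  # device address -> bytearray

    def malloc(self, size):
        address = self._next_address
        self._next_address += max(size, 1)
        self.memory[address] = bytearray(size)
        return address

    def free(self, address):
        del self.memory[address]

    def synchronize(self):
        pass  # a real wrapper would wait here for in-flight GPU work


def checkpoint(device, mappings):
    """mappings: logical id -> device address. Returns an image, releases the device."""
    device.synchronize()                        # reach a quiescent point
    image = {logical: bytes(device.memory[address])
             for logical, address in mappings.items()}
    for address in mappings.values():           # give the memory back
        device.free(address)
    mappings.clear()
    return image


def restore(device, image):
    """Allocate on the target device, copy data back, rebuild the mappings."""
    mappings = {}
    for logical, data in image.items():
        address = device.malloc(len(data))
        device.memory[address][:] = data
        mappings[logical] = address
    return mappings
```

Note that the restored allocation lands at a different device address than the original one, which is exactly why the logical ids (or virtual addresses) have to be the only thing the application ever holds.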

This early prototype captured the central intuition that later shaped FlowGPU. GPU migration is not magic; it is state reconstruction. The hard part is making the reconstructed world indistinguishable from the original one.

Where Wrapper-Only Designs Struggle

The wrapper idea is powerful, but the edge cases are brutal.

First, CUDA state is larger than cudaMalloc and cudaMemcpy. Real applications use streams, events, cuBLAS, cuDNN, NCCL, memory pools, unified memory, graph execution, and framework allocators. Many of these objects are opaque: CUDA exposes handles, not serializable internals. A checkpoint system must record and replay the operations that created or mutated them.

Second, address identity matters. A pointer value may be stored inside application data structures, kernel arguments, framework metadata, or library state. If restore gives the program a different GPU virtual address, the application can become subtly wrong even if the bytes were copied correctly.

Third, deep learning frameworks hide memory behavior. PyTorch and TensorFlow often reserve large GPU memory blocks and keep them for reuse. Much of that reserved memory may be inactive at a given moment. A naive checkpoint that saves everything allocated by the runtime can produce enormous checkpoint images, even when the useful live state is much smaller.

Fourth, distributed training is a synchronization problem. A consistent checkpoint of a multi-GPU job requires pausing all participating ranks safely. With NCCL communication, pausing one side of a blocking send/receive pair while the other side waits can deadlock the checkpoint protocol itself.

These are the problems FlowGPU is designed to handle systematically.

FlowGPU’s Core Move

FlowGPU’s key insight is that prior system-level GPU checkpoint/restore (C/R) designs coupled C/R with API forwarding. In API forwarding, all GPU operations pass through a privileged central process. That makes interception and state separation easier, but it imposes runtime overhead, creates GPU address conflicts under sharing, and blocks some GPU features.

FlowGPU decouples checkpoint/restore from virtualization.

During normal execution, each task uses a per-task intercept library. GPU operations stay private to that task and go directly to the GPU, avoiding the IPC overhead of a central forwarding process. When checkpointing is needed, FlowGPU creates a ghost process. The ghost process temporarily takes over GPU state, while the original process becomes a conventional CPU process that can be checkpointed with CRIU. GPU state and CPU state are saved in parallel, then recombined during restore.

This design keeps the useful part of interception without forcing every GPU operation through a virtualization server during normal execution.

Making Checkpoints Small and Correct

FlowGPU adds several mechanisms that are especially important for deep learning workloads.

Active memory identification avoids saving the whole framework-reserved memory pool. FlowGPU inserts a memory stub at stable DL framework backend allocation/free interfaces, tracking the memory regions that are actually active. It can also wait briefly for active memory to reach a low point in the training iteration before checkpointing. This matters because active memory in training can fluctuate dramatically between the end of an iteration and the activation-heavy middle of forward/backward execution.
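A rough model of that bookkeeping is shown below. It is a sketch under assumptions, not FlowGPU's memory stub: the hook names (`on_reserve`, `on_alloc`, `on_free`), the threshold rule, and the bounded wait expressed as a number of polled steps are all invented for illustration.

```python
class ActiveMemoryTracker:
    """Hypothetical stub state: which framework blocks are live right now."""

    def __init__(self):
        self.active = {}         # address -> size of blocks currently in use
        self.reserved_bytes = 0  # total pool size held by the framework

    def on_reserve(self, size):
        self.reserved_bytes += size

    def on_alloc(self, address, size):
        self.active[address] = size

    def on_free(self, address):
        self.active.pop(address, None)

    def active_bytes(self):
        return sum(self.active.values())

    def regions_to_save(self):
        """Only live regions go into the checkpoint image."""
        return sorted(self.active.items())


def choose_checkpoint_step(active_bytes_per_step, threshold, max_steps):
    """Pick the first polled step whose active memory is at or below the
    threshold; if none qualifies within max_steps, settle for the last one polled."""
    last = None
    for step, active in enumerate(active_bytes_per_step):
        if step >= max_steps:
            break
        last = step
        if active <= threshold:
            return step
    return last
```

The gap between `reserved_bytes` and `active_bytes()` is the data a naive checkpoint would save for no benefit, and the bounded wait keeps the scheduler from stalling indefinitely on a job that never reaches a low point.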

Virtual memory management preserves GPU address identity. FlowGPU intercepts GPU allocations and uses CUDA VMM APIs such as cuMemAddressReserve, cuMemCreate, and cuMemMap to reserve and remap the same virtual addresses on restore. That removes a major source of correctness bugs for pointer-rich GPU applications.

Record/replay handles opaque runtime objects. Since CUDA streams, events, contexts, and library handles cannot simply be read out as bytes, FlowGPU records operations that create or modify them and replays those operations during recovery.
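The record/replay idea can be illustrated with a toy runtime. `FakeRuntime` and `Recorder` are hypothetical names, and the two operations (`create`, `set_attr`) stand in for the much larger set of real CUDA calls; this is a sketch of the technique, not FlowGPU's log format.

```python
class FakeRuntime:
    """Stand-in for a runtime that hands out opaque handles."""

    def __init__(self, first_handle):
        self._next_handle = first_handle
        self.objects = {}  # real handle -> internal state the app cannot read

    def create(self, kind):
        handle = self._next_handle
        self._next_handle += 1
        self.objects[handle] = {"kind": kind, "attrs": {}}
        return handle

    def set_attr(self, handle, key, value):
        self.objects[handle]["attrs"][key] = value


class Recorder:
    """Logs every creating or mutating call and gives the app stable handles."""

    def __init__(self, runtime):
        self.runtime = runtime
        self.log = []            # (operation, virtual handle, arguments)
        self.real = {}           # virtual handle -> current real handle
        self._next_virtual = 1

    def create(self, kind):
        virtual = self._next_virtual
        self._next_virtual += 1
        self.real[virtual] = self.runtime.create(kind)
        self.log.append(("create", virtual, (kind,)))
        return virtual

    def set_attr(self, virtual, key, value):
        self.runtime.set_attr(self.real[virtual], key, value)
        self.log.append(("set_attr", virtual, (key, value)))

    def replay_on(self, runtime):
        """Rebuild equivalent objects on a fresh runtime by re-running the log."""
        self.runtime = runtime
        self.real = {}
        for operation, virtual, args in self.log:
            if operation == "create":
                self.real[virtual] = runtime.create(*args)
            else:
                runtime.set_attr(self.real[virtual], *args)
```

The application keeps using the same virtual handle after recovery, even though the real handle behind it has changed.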

The pause mechanism is refined for distributed tasks. FlowGPU coordinates pausing across ranks, but avoids a known NCCL deadlock pattern by resuming all instances after a timeout if a complete pause cannot be achieved. This is a small detail with a large consequence: checkpointing must not introduce a failure mode worse than the one it tries to solve.
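The all-or-nothing behavior can be sketched as follows. This is a simplified model with invented names (`Rank`, `coordinated_pause`), and the timeout is expressed as a number of polling rounds rather than wall-clock time so the example stays deterministic; the real protocol is not described at this level in the text above.

```python
class Rank:
    """Hypothetical rank that reaches a safe point after some polls, or never."""

    def __init__(self, name, pausable_after):
        self.name = name
        # None models a rank stuck inside a blocking collective.
        self._pausable_after = pausable_after
        self.paused = False

    def try_pause(self, poll):
        if self._pausable_after is not None and poll >= self._pausable_after:
            self.paused = True
        return self.paused

    def resume(self):
        self.paused = False


def coordinated_pause(ranks, max_polls):
    """Return True only if every rank paused; otherwise resume all and back off."""
    for poll in range(max_polls):
        # Poll every rank each round (no short-circuit) so none is skipped.
        results = [rank.try_pause(poll) for rank in ranks]
        if all(results):
            return True
    # Timeout: a partial pause could deadlock peers, so undo it everywhere.
    for rank in ranks:
        rank.resume()
    return False
```

A caller that gets `False` back can simply retry later; the job has lost a little time but is still running, which is the property the timeout exists to protect.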

For multi-GPU tasks, FlowGPU also performs fine-grained deduplication. Replicated model parameters may appear on multiple GPUs, but runtime memory blocks rarely match exactly. FlowGPU deduplicates fixed-size regions, reducing checkpoint image size for distributed jobs.
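One common way to deduplicate fixed-size regions is to hash each chunk and store every distinct chunk once. The sketch below uses that approach as an assumption; the text above does not say how FlowGPU detects duplicates, and the function names and the choice of SHA-256 are mine.

```python
import hashlib


def deduplicate(buffers, chunk_size=4096):
    """buffers: name -> bytes. Returns (store, recipes) where store holds each
    distinct chunk once and recipes list the chunk digests per buffer."""
    store = {}    # digest -> chunk bytes
    recipes = {}  # buffer name -> ordered list of digests
    for name, data in buffers.items():
        digests = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)
            digests.append(digest)
        recipes[name] = digests
    return store, recipes


def rebuild(store, recipe):
    """Reassemble one buffer from its recipe."""
    return b"".join(store[digest] for digest in recipe)


# Two GPUs hold the same parameters at different offsets, so the buffers
# differ as a whole but share chunks.
params = b"A" * 8 + b"B" * 8
buffers = {"gpu0": params + b"X" * 8, "gpu1": b"Y" * 8 + params}
store, recipes = deduplicate(buffers, chunk_size=8)
```

Here six chunks across two buffers collapse to four stored chunks. The example also shows the limitation of fixed-size chunking: sharing is only found when the replicated data lines up on chunk boundaries.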

What This Means for Scheduling

Once GPU pause/resume becomes practical, several scheduling policies become more realistic.

Priority scheduling can preempt a low-priority GPU job without throwing away all its progress. Fairness scheduling can redistribute service over time with lower disruption. Fragmentation-aware schedulers can migrate jobs to rebuild contiguous placements for gang-scheduled workloads. Fault-tolerance systems can checkpoint at scheduler-controlled intervals instead of relying only on framework checkpoints. Elastic schedulers can shrink, expand, or relocate jobs with a clearer recovery path.

The primitive also changes the economics of GPU sharing. If a job can be paused and restored quickly, a cluster can take more aggressive actions under bursty demand. Online inference, training, and hyperparameter optimization (HPO) workloads no longer need to live in completely isolated resource islands; the scheduler has a better way to move work when priorities change.

FlowGPU’s evaluation shows why the details matter. It reports no runtime overhead during normal single-GPU execution because tasks can access the GPU directly without API forwarding. For DL tasks, it reduces checkpoint pause time by 6.2x to 15x over POS and up to 10.4x over Singularity. Restore time drops by 12x to 18x over POS and up to 4.1x over Singularity. For migration, FlowGPU outperforms Singularity by up to 2.1x and PyTorch framework-level checkpointing by 1.7x to 4.5x.

Those numbers are not only checkpointing results. They are scheduling-enablement results. A slow checkpoint is a mechanism that the scheduler cannot afford to invoke often. A fast, transparent checkpoint becomes a real control knob.

The Takeaway

GPU scheduling is often discussed in terms of algorithms: fairness metrics, placement heuristics, bin packing, elastic allocation, and priority queues. But the scheduler is only as powerful as the execution primitives beneath it.

cudaw was my first working cut at the wrapper-level intuition: interpose on CUDA, virtualize what the application sees, and reconstruct GPU state when needed. FlowGPU pushes that intuition into a more complete system design: per-task interception for low overhead, ghost processes for state separation, active-memory tracking for small images, VMM for address correctness, and distributed pause logic for multi-GPU workloads.

The result is a cleaner boundary between policy and mechanism. The scheduler decides when a job should pause, resume, or move. The checkpoint/restore layer makes that decision safe enough to execute.

Paper: FlowGPU: Transparent and Efficient GPU Checkpointing and Restore
Early codebase: yzs981130/cudaw

Zhisheng YE
Machine Learning Systems Researcher

My research interests include AI Infra for LLMs, algorithm–system co-design for machine learning systems and resource management.
