Deep Learning Systems | 木叶吟

GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling

Sun, 17 May 2026 15:00:00 +0800

GPU cluster scheduling would be much easier if a running GPU job behaved like an ordinary CPU process. Pause it. Move it. Resume it somewhere else. Reclaim the device when a higher-priority job arrives. Repair fragmentation without killing user work.

In practice, this is exactly where GPU scheduling gets stuck.

A CPU process can be checkpointed by saving its address space, file descriptors, and kernel-visible state. A GPU task has an extra half of its life outside the normal process abstraction: CUDA contexts, device allocations, streams, events, library handles, kernels in flight, and data resident in GPU memory. The operating system does not naturally know how to serialize that state. The scheduler can stop the host process, but that is not the same thing as having a correct, portable checkpoint of the GPU computation.

FlowGPU is about turning GPU checkpoint/restore into a system primitive. Before FlowGPU became a full system, I wrote cudaw as the first version of the codebase: a CUDA wrapper prototype for interposing on runtime calls, tracking GPU objects, translating application-visible addresses, and making pause/resume/migration possible above an unmodified CUDA application.

Why Schedulers Want This Primitive

Pause/resume and migration change what a scheduler can do.

Without GPU checkpoint/restore, preemption is often blunt. A scheduler can kill a job, ask the framework to checkpoint at a pre-defined training boundary, or wait until the user code cooperates. That is acceptable for some training loops, but it is poorly aligned with cluster events. A high-priority job may arrive now. A GPU may fail now. A fragmented placement may need repair now. Framework-level checkpoints are usually placed for application convenience, not scheduler control.

With a transparent GPU checkpoint, the scheduler gets stronger operations:

pause a job and release its GPU memory;
resume it later on the same GPU;
migrate it to another GPU or node;
checkpoint periodically for fault tolerance;
defragment the cluster by moving jobs away from awkward placements;
support elastic scaling and priority scheduling with less user code involvement.

This is the missing link between scheduling policy and GPU execution. A scheduler may know the right decision, but without a safe migration primitive, it cannot act on that decision cheaply.

The CUDA Wrapper View

The basic idea behind my cudaw prototype is to place a wrapper between the application and the CUDA runtime. Instead of letting the application talk directly to libcudart, the wrapper intercepts CUDA calls such as allocation, memory copy, and kernel launch. From the scheduler’s perspective, this yields an execution log and a shadow view of GPU state.

This wrapper layer can record which device memory regions exist, what host-side pointers correspond to them, how data moves between CPU and GPU, and which kernels are launched with which arguments. It can also maintain virtual GPU addresses: the application sees stable logical addresses, while the wrapper maps them to real CUDA allocations underneath. That indirection is what makes restore and migration plausible, because the restored task may receive different physical GPU addresses on the target device.

In a simplified checkpoint flow, the wrapper reaches a safe point, synchronizes GPU work, copies live GPU memory into a checkpoint image, saves enough metadata to reconstruct CUDA state, and releases the device. Restore reverses the process: allocate memory on the target GPU, rebuild mappings, copy data back, replay necessary CUDA setup calls, and continue execution.
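To make the indirection concrete, here is a toy model of that flow in plain Python. It is not cudaw's code: `FakeDevice` and `Wrapper` are hypothetical names, device memory is simulated with `bytearray`, and a real wrapper would also have to synchronize in-flight work and translate pointers passed to kernels. The only point is that the application keeps holding the same logical address while the physical allocation underneath changes.

```python
from dataclasses import dataclass, field


class FakeDevice:
    """Stand-in for a GPU: hands out 'physical' addresses and stores bytes."""

    def __init__(self, base: int):
        self._next = base
        self.memory: dict[int, bytearray] = {}

    def alloc(self, size: int) -> int:
        addr = self._next
        self._next += size
        self.memory[addr] = bytearray(size)
        return addr

    def free(self, addr: int) -> None:
        del self.memory[addr]


@dataclass
class Wrapper:
    """Shadow view of device state kept by a hypothetical interposer."""

    device: FakeDevice
    next_logical: int = 0x1000
    # logical address -> (physical address, size in bytes)
    table: dict[int, tuple[int, int]] = field(default_factory=dict)

    def malloc(self, size: int) -> int:
        logical = self.next_logical
        self.next_logical += size
        self.table[logical] = (self.device.alloc(size), size)
        return logical  # the application only ever sees this value

    def write(self, logical: int, data: bytes) -> None:
        physical, size = self.table[logical]
        assert len(data) <= size
        self.device.memory[physical][: len(data)] = data

    def read(self, logical: int) -> bytes:
        physical, _ = self.table[logical]
        return bytes(self.device.memory[physical])

    def checkpoint(self) -> dict[int, bytes]:
        """Copy live buffers into an image, then release the device."""
        image = {logical: self.read(logical) for logical in self.table}
        for physical, _ in self.table.values():
            self.device.free(physical)
        self.table.clear()
        return image

    def restore(self, image: dict[int, bytes], target: FakeDevice) -> None:
        """Reallocate on the target and rebind the same logical addresses."""
        self.device = target
        for logical, data in image.items():
            physical = target.alloc(len(data))
            target.memory[physical][:] = data
            self.table[logical] = (physical, len(data))


wrapper = Wrapper(FakeDevice(base=0x7000_0000))
ptr = wrapper.malloc(4)
wrapper.write(ptr, b"abcd")
image = wrapper.checkpoint()
wrapper.restore(image, FakeDevice(base=0x9000_0000))
restored = wrapper.read(ptr)  # same logical pointer, different physical address
```

The design choice worth noticing is that the checkpoint image is keyed by logical address, so nothing in it depends on where the source device happened to place the data.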

This early prototype captured the central intuition that later shaped FlowGPU. GPU migration is not magic; it is state reconstruction. The hard part is making the reconstructed world indistinguishable from the original one.

Where Wrapper-Only Designs Struggle

The wrapper idea is powerful, but the edge cases are brutal.

First, CUDA state is larger than cudaMalloc and cudaMemcpy. Real applications use streams, events, cuBLAS, cuDNN, NCCL, memory pools, unified memory, graph execution, and framework allocators. Many of these objects are opaque: CUDA exposes handles, not serializable internals. A checkpoint system must record and replay the operations that created or mutated them.

Second, address identity matters. A pointer value may be stored inside application data structures, kernel arguments, framework metadata, or library state. If restore gives the program a different GPU virtual address, the application can become subtly wrong even if the bytes were copied correctly.

Third, deep learning frameworks hide memory behavior. PyTorch and TensorFlow often reserve large GPU memory blocks and keep them for reuse. Much of that reserved memory may be inactive at a given moment. A naive checkpoint that saves everything allocated by the runtime can produce enormous checkpoint images, even when the useful live state is much smaller.

Fourth, distributed training is a synchronization problem. A consistent checkpoint of a multi-GPU job requires pausing all participating ranks safely. With NCCL communication, pausing one side of a blocking send/receive pair while the other side waits can deadlock the checkpoint protocol itself.

These are the problems FlowGPU is designed to handle systematically.

FlowGPU’s Core Move

FlowGPU’s key insight is that prior system-level GPU checkpoint/restore (C/R) designs coupled C/R with API forwarding. In API forwarding, all GPU operations pass through a privileged central process. That makes interception and state separation easier, but it imposes runtime overhead, creates GPU address conflicts under sharing, and blocks some GPU features.

FlowGPU decouples checkpoint/restore from virtualization.

During normal execution, each task uses a per-task intercept library. GPU operations stay private to that task and go directly to the GPU, avoiding the IPC overhead of a central forwarding process. When checkpointing is needed, FlowGPU creates a ghost process. The ghost process temporarily takes over GPU state, while the original process becomes a conventional CPU process that can be checkpointed with CRIU. GPU state and CPU state are saved in parallel, then recombined during restore.

This design keeps the useful part of interception without forcing every GPU operation through a virtualization server during normal execution.

Making Checkpoints Small and Correct

FlowGPU adds several mechanisms that are especially important for deep learning workloads.

Active memory identification avoids saving the whole framework-reserved memory pool. FlowGPU inserts a memory stub at stable DL framework backend allocation/free interfaces, tracking the memory regions that are actually active. It can also wait briefly for active memory to reach a low point in the training iteration before checkpointing. This matters because active memory in training can fluctuate dramatically between the end of an iteration and the activation-heavy middle of forward/backward execution.
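A minimal sketch of the bookkeeping this implies, assuming a stub that is called on every framework-level allocation and free. The class and function names, the low-watermark rule, and the maximum wait are all illustrative assumptions, not FlowGPU's actual interface.

```python
class ActiveMemoryTracker:
    """Tracks which byte ranges inside a reserved pool are currently in use."""

    def __init__(self, reserved_bytes: int):
        self.reserved_bytes = reserved_bytes
        self.active: dict[int, int] = {}  # offset in pool -> size in bytes

    def on_alloc(self, offset: int, size: int) -> None:
        self.active[offset] = size

    def on_free(self, offset: int) -> None:
        self.active.pop(offset, None)

    def active_bytes(self) -> int:
        return sum(self.active.values())

    def regions_to_save(self) -> list[tuple[int, int]]:
        """Only these (offset, size) ranges need to go into the image."""
        return sorted(self.active.items())


def should_checkpoint_now(
    tracker: ActiveMemoryTracker,
    low_watermark_bytes: int,
    waited_s: float,
    max_wait_s: float,
) -> bool:
    """Checkpoint when active memory is low, or once the wait budget is spent."""
    return tracker.active_bytes() <= low_watermark_bytes or waited_s >= max_wait_s
```

With a 1000-byte pool, allocating 100 and 600 bytes and then freeing the 600-byte block leaves 100 active bytes, so the image covers a tenth of what the framework has reserved.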

Virtual memory management preserves GPU address identity. FlowGPU intercepts GPU allocations and uses CUDA VMM APIs such as cuMemAddressReserve, cuMemCreate, and cuMemMap to reserve and remap the same virtual addresses on restore. That removes a major source of correctness bugs for pointer-rich GPU applications.

Record/replay handles opaque runtime objects. Since CUDA streams, events, contexts, and library handles cannot simply be read out as bytes, FlowGPU records operations that create or modify them and replays those operations during recovery.
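The record/replay idea can be sketched without any GPU at all. Below, `FakeRuntime` is a hypothetical stand-in for a runtime that returns opaque handles, and `Recorder` is an assumed interposer that gives the application stable virtual handles while logging every creating or mutating call. Real CUDA objects carry far more state than this toy does.

```python
import itertools


class FakeRuntime:
    """Stand-in for a GPU runtime that returns opaque, non-portable handles."""

    def __init__(self, first_handle: int):
        self._ids = itertools.count(first_handle)
        self.objects: dict[int, dict] = {}

    def create(self, kind: str, **attrs) -> int:
        handle = next(self._ids)
        self.objects[handle] = {"kind": kind, **attrs}
        return handle

    def set_attr(self, handle: int, key: str, value) -> None:
        self.objects[handle][key] = value


class Recorder:
    """Logs handle-creating and handle-mutating calls so they can be replayed."""

    def __init__(self, runtime: FakeRuntime):
        self.runtime = runtime
        self.log: list[tuple[str, int, dict]] = []  # (op, virtual handle, payload)
        self.real: dict[int, int] = {}              # virtual handle -> real handle
        self._next_virtual = 1

    def create(self, kind: str, **attrs) -> int:
        virtual = self._next_virtual
        self._next_virtual += 1
        self.real[virtual] = self.runtime.create(kind, **attrs)
        self.log.append(("create", virtual, {"kind": kind, **attrs}))
        return virtual  # the application keeps this value across restores

    def set_attr(self, virtual: int, key: str, value) -> None:
        self.runtime.set_attr(self.real[virtual], key, value)
        self.log.append(("set_attr", virtual, {"key": key, "value": value}))

    def replay(self, new_runtime: FakeRuntime) -> None:
        """Rebuild every object on a fresh runtime, in the original order."""
        self.runtime = new_runtime
        self.real = {}
        for op, virtual, payload in self.log:
            if op == "create":
                attrs = dict(payload)
                kind = attrs.pop("kind")
                self.real[virtual] = new_runtime.create(kind, **attrs)
            else:
                new_runtime.set_attr(
                    self.real[virtual], payload["key"], payload["value"]
                )
```

After `replay`, the real handle behind a virtual handle is different, but the object it names has the same kind and attributes, which is all the application can observe.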

The pause mechanism is refined for distributed tasks. FlowGPU coordinates pausing across ranks, but avoids a known NCCL deadlock pattern by resuming all instances after a timeout if a complete pause cannot be achieved. This is a small detail with a large consequence: checkpointing must not introduce a failure mode worse than the one it tries to solve.
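The timeout rule is simple enough to write down. This sketch assumes two hypothetical callbacks, `try_pause(rank)` and `resume(rank)`, and polls until every rank reports that it has paused. It illustrates the "give up and resume" behaviour only; how FlowGPU actually detects a safe pause point is not shown here.

```python
import time
from typing import Callable, Sequence


def coordinated_pause(
    ranks: Sequence[int],
    try_pause: Callable[[int], bool],
    resume: Callable[[int], None],
    timeout_s: float,
    poll_s: float = 0.01,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Pause all ranks, or pause none of them.

    Returns True if every rank paused before the deadline. Otherwise the
    ranks that did pause are resumed and False is returned, so a rank that
    is stuck inside a collective cannot hold the others hostage.
    """
    deadline = clock() + timeout_s
    paused: set[int] = set()
    while clock() < deadline:
        for rank in ranks:
            if rank not in paused and try_pause(rank):
                paused.add(rank)
        if len(paused) == len(ranks):
            return True
        sleep(poll_s)
    for rank in paused:
        resume(rank)
    return False
```

The `clock` and `sleep` parameters exist only so the function can be exercised with a fake clock. A caller that gets `False` back would typically retry later rather than force the checkpoint.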

For multi-GPU tasks, FlowGPU also performs fine-grained deduplication. Replicated model parameters may appear on multiple GPUs, but runtime memory blocks rarely match exactly. FlowGPU deduplicates fixed-size regions, reducing checkpoint image size for distributed jobs.
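One common way to deduplicate fixed-size regions is content hashing, sketched below. The use of SHA-256, the chunk size, and the function names are assumptions for illustration; the post does not say which hash or region size FlowGPU uses.

```python
import hashlib


def dedup_chunks(
    buffers: dict[str, bytes], chunk_size: int
) -> tuple[dict[str, bytes], dict[str, list[str]]]:
    """Split each buffer into fixed-size chunks and store each distinct chunk once."""
    store: dict[str, bytes] = {}         # digest -> chunk contents
    recipes: dict[str, list[str]] = {}   # buffer name -> ordered chunk digests
    for name, data in buffers.items():
        digests = []
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset : offset + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)
            digests.append(digest)
        recipes[name] = digests
    return store, recipes


def rebuild(store: dict[str, bytes], recipe: list[str]) -> bytes:
    """Reassemble one buffer from its chunk recipe."""
    return b"".join(store[digest] for digest in recipe)
```

Working at chunk granularity is what lets two GPUs share their replicated parameters in the image even when the surrounding memory blocks differ.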

What This Means for Scheduling

Once GPU pause/resume becomes practical, several scheduling policies become more realistic.

Priority scheduling can preempt a low-priority GPU job without throwing away all its progress. Fairness scheduling can redistribute service over time with lower disruption. Fragmentation-aware schedulers can migrate jobs to rebuild contiguous placements for gang-scheduled workloads. Fault-tolerance systems can checkpoint at scheduler-controlled intervals instead of relying only on framework checkpoints. Elastic schedulers can shrink, expand, or relocate jobs with a clearer recovery path.

The primitive also changes the economics of GPU sharing. If a job can be paused and restored quickly, a cluster can take more aggressive actions under bursty demand. Online inference, training, and HPO workloads no longer need to live in completely isolated resource islands; the scheduler has a better way to move work when priorities change.

FlowGPU’s evaluation shows why the details matter. The paper reports no runtime overhead during normal single-GPU execution, because tasks access the GPU directly instead of going through API forwarding. For DL tasks, it cuts checkpoint pause time by 6.2x to 15x relative to POS and by up to 10.4x relative to Singularity. Restore time drops by 12x to 18x relative to POS and by up to 4.1x relative to Singularity. For migration, FlowGPU outperforms Singularity by up to 2.1x and PyTorch framework-level checkpointing by 1.7x to 4.5x.

Those numbers are not only checkpointing results. They are scheduling-enablement results. A slow checkpoint is a mechanism the scheduler cannot afford to invoke often. A fast, transparent checkpoint becomes a real control knob.

The Takeaway

GPU scheduling is often discussed in terms of algorithms: fairness metrics, placement heuristics, bin packing, elastic allocation, and priority queues. But the scheduler is only as powerful as the execution primitives beneath it.

cudaw was my first working cut at the wrapper-level intuition: interpose on CUDA, virtualize what the application sees, and reconstruct GPU state when needed. FlowGPU pushes that intuition into a more complete system design: per-task interception for low overhead, ghost processes for state separation, active-memory tracking for small images, VMM for address correctness, and distributed pause logic for multi-GPU workloads.

The result is a cleaner boundary between policy and mechanism. The scheduler decides when a job should pause, resume, or move. The checkpoint/restore layer makes that decision safe enough to execute.

Paper: FlowGPU: Transparent and Efficient GPU Checkpointing and Restore
Early codebase: yzs981130/cudaw

GPU Cluster Scheduling: A Map for Deep Learning Workloads

Sun, 17 May 2026 14:30:00 +0800

GPU cluster scheduling is easy to underestimate. At first glance, it looks like a familiar resource allocation problem: jobs arrive, GPUs are free or busy, and the scheduler decides who runs next.

Deep learning breaks that simplicity.

Training jobs can run for days, need gangs of GPUs, and care deeply about placement topology. Inference services are online, latency-sensitive, and often underutilize a GPU unless requests are batched or colocated. Hyperparameter tuning launches many similar trials, most of which are meant to be discarded. LLM workloads add model parallelism, massive memory footprints, long contexts, and bursty development patterns.

Our survey, Deep Learning Workload Scheduling in GPU Datacenters, tries to organize this messy design space. The most useful way to read the field is not as a list of schedulers, but as a set of tensions: speed versus cost, utilization versus isolation, fairness versus efficiency, and online latency versus cluster-wide throughput.

Why DL Scheduling Is Different

Traditional HPC and big-data schedulers provide useful starting points, but DL workloads have their own physics.

Training jobs are often gang-scheduled. A distributed job needs all requested GPUs at the same time, so GPUs are not easily divisible like CPU slots. Placement matters because communication-heavy jobs may run much faster when GPUs are packed within a node or connected by NVLink rather than scattered across weaker links. Preemption is expensive because model and optimizer states are large. At the same time, training is iterative, so a few profiled iterations can often reveal throughput, memory behavior, and placement sensitivity.

Inference has nearly opposite pressure. Each request is small compared with a training job, but the service has latency SLOs. Batching improves GPU utilization, yet waiting too long to form a batch hurts latency. Colocation improves throughput, yet interference can violate tail latency. The scheduler has to trade average efficiency against worst-case user experience.

This is why GPU cluster scheduling is not one problem. It is a family of related problems whose correct answer depends on the workload.

Training: Efficiency, Fairness, Deadlines

For training workloads, the survey groups scheduling objectives into three broad categories.

The first is efficiency. Some schedulers reduce job completion time through priority rules, such as least attained service or progress-aware variants. Others use profiling or learning-based methods to predict job duration, speed, placement sensitivity, or future resource needs. Placement is a core part of efficiency: a scheduler can have enough GPUs in aggregate but still produce poor performance if it fragments the cluster and cannot satisfy locality.
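The gap between "enough GPUs in aggregate" and "enough GPUs in one place" fits in a few lines. This is a deliberately simplified model with hypothetical function names: each node is reduced to a count of free GPUs, and locality means fitting the whole gang on a single node.

```python
from typing import Optional


def fits_in_aggregate(free_per_node: list[int], gpus_needed: int) -> bool:
    """True if the cluster has enough free GPUs, ignoring where they are."""
    return sum(free_per_node) >= gpus_needed


def best_fit_node(free_per_node: list[int], gpus_needed: int) -> Optional[int]:
    """Index of the node that fits the whole gang with the least leftover.

    Returns None when no single node can host the job, which is the
    fragmentation case: capacity exists but locality cannot be satisfied.
    """
    candidates = [
        (free, index)
        for index, free in enumerate(free_per_node)
        if free >= gpus_needed
    ]
    return min(candidates)[1] if candidates else None
```

With free GPUs spread as `[2, 2, 2, 2]`, a 4-GPU job fits in aggregate but has no packed placement. Consolidating the same eight free GPUs into `[4, 4, 0, 0]` makes the placement possible, which is the case migration is meant to enable.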

The second is fairness. Fairness is subtle because GPUs are indivisible in common gang-scheduling settings, and heterogeneous GPUs do not provide equal value to every job. Finish-time fairness, long-term GPU-time fairness, and heterogeneity-aware fairness all try to answer a version of the same question: how much service did this job or tenant deserve, and how much did it actually receive?

The third is deadline guarantee. Deadline-aware training is less explored, but important for production workflows. A best-effort job can tolerate delay; an SLO job cannot. Systems in this direction need to predict whether a job can finish before its deadline under different placements and resource allocations, then decide how to mix deadline jobs with normal jobs.

Training: How GPUs Are Used

Objectives are only half the taxonomy. The other half is how a scheduler uses resources.

Heterogeneous resource scheduling recognizes that “a GPU” is not a uniform unit. Different model architectures benefit differently from newer GPU generations, CPU allocation, memory, network bandwidth, and storage. A cost-effective scheduler should place jobs where their bottlenecks match the available hardware, not blindly send every job to the newest device.

GPU sharing attacks the underutilization problem. Many training jobs cannot saturate a modern GPU. Packing multiple jobs onto one device through MPS, MIG, virtualization, time sharing, or framework-level co-execution can improve utilization. The risk is interference: the scheduler must know when sharing helps and when it silently slows everything down.

Elastic training changes the number of GPUs assigned to a job over time. This can reduce queueing and improve utilization, especially when demand fluctuates. But elasticity is not free. Resource changes may require checkpointing, reinitialization, or batch-size adaptation. If batch size changes affect convergence, a scheduler may improve system throughput while quietly changing model behavior.
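One way to keep a resize from changing model behavior is to hold the global batch size fixed and absorb the change with gradient accumulation. That technique is an assumption brought in for illustration, not something the survey prescribes, and the function name is hypothetical.

```python
def accumulation_steps(global_batch: int, per_gpu_batch: int, num_gpus: int) -> int:
    """Micro-steps per optimizer step needed to keep the global batch constant."""
    samples_per_micro_step = per_gpu_batch * num_gpus
    if global_batch % samples_per_micro_step != 0:
        raise ValueError(
            "global batch is not a multiple of per_gpu_batch * num_gpus; "
            "the scheduler would have to change the batch size to use this allocation"
        )
    return global_batch // samples_per_micro_step
```

A job with a global batch of 512 and 32 samples per GPU needs 2 accumulation steps on 8 GPUs and 4 on 4 GPUs. An allocation of 3 GPUs raises, which is exactly the kind of constraint a scheduler cannot see without help from the framework.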

The broad lesson is that training schedulers increasingly need to be co-designed with training frameworks. The scheduler wants fine-grained control, but the framework knows whether a job can safely pause, resize, share, or change batch size.

Inference: Latency, Cost, Throughput

Inference scheduling is shaped by a different triangle: latency, cost, and accuracy.

Latency is usually the first-class constraint. A model serving system can improve throughput by batching requests, but a request waiting in a queue is still user-visible latency. A practical scheduler often uses dynamic batching: increase batch size when the service is healthy, shrink it when latency approaches the SLO.
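A minimal controller in that spirit, assuming an additive-increase, multiplicative-decrease rule. The rule, the 80% headroom threshold, and the bounds are illustrative assumptions rather than the policy of any particular serving system.

```python
def next_batch_size(
    current: int,
    observed_p99_ms: float,
    slo_ms: float,
    min_batch: int = 1,
    max_batch: int = 64,
    headroom: float = 0.8,
) -> int:
    """Grow the batch slowly while latency is healthy, halve it near the SLO."""
    if observed_p99_ms > slo_ms * headroom:
        return max(min_batch, current // 2)
    return min(max_batch, current + 1)
```

Backing off faster than growing reflects the asymmetry in the text: a slightly undersized batch costs some throughput, while an oversized one is immediately visible to users.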

Cost enters through cloud instance choice, autoscaling, and heterogeneous hardware. Some workloads are cheaper on CPU, some need GPU, and some become cost-efficient only when batching is large enough. The scheduler has to decide not only where to run a model, but how many replicas and which instance types are worth paying for.

Accuracy adds another axis. Some systems choose among model variants, ensembles, or modalities. A smaller model may be cheap and fast but less accurate; a larger model may be slower but better. This turns inference scheduling into a policy problem: what accuracy loss is acceptable for a given latency or cost budget?
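As a sketch of that policy question, the function below picks the most accurate variant that fits a latency budget. The variant names and numbers in the usage note are made up for illustration and do not describe real models.

```python
from typing import Optional


def pick_variant(variants: list[dict], latency_budget_ms: float) -> Optional[dict]:
    """Most accurate variant whose latency fits the budget, or None if none fits."""
    feasible = [v for v in variants if v["latency_ms"] <= latency_budget_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda v: v["accuracy"])


# Hypothetical variants; the figures are placeholders, not measurements.
example_variants = [
    {"name": "small", "latency_ms": 5.0, "accuracy": 0.70},
    {"name": "medium", "latency_ms": 20.0, "accuracy": 0.78},
    {"name": "large", "latency_ms": 80.0, "accuracy": 0.82},
]
```

With a 25 ms budget the function returns the `medium` entry. Tightening the budget to 10 ms falls back to `small`, and a 1 ms budget returns `None`, which forces the caller to decide whether to violate the budget or reject the request.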

Throughput techniques include batching, caching, model residency, and colocation. But inference colocation is more dangerous than training colocation because SLO violations are immediate. A scheduler needs interference models, isolation mechanisms, or hardware partitioning to make sharing safe.

Beyond Training and Inference

Some workloads deserve their own category.

Hyperparameter optimization is technically training, but operationally different. It launches many similar trials, prunes weak ones, and shifts resources toward promising configurations. This structure creates opportunities for early stopping, elastic trial allocation, trial packing, model fusion, and surrogate-based tuning. Our Hydro work is one example: it uses model scaling, trial fusion, and cluster-level interleaving to make HPO less brute-force.

Mixed training and inference workloads are another frontier. Inference clusters are often overprovisioned for bursts, leaving idle GPUs during low-traffic periods. Training jobs can sometimes borrow that capacity if the system can preempt or resize them quickly when inference demand returns. The challenge is respecting online SLOs while reclaiming otherwise wasted capacity.

These cases point to a larger trend: future schedulers will be more workload-aware. A generic GPU queue is too blunt for the diversity of DL development.

Where the Field Is Going

The survey ends with three research directions that still feel current.

First, emerging workloads will keep changing scheduler design. LLM pretraining, fine-tuning, serving, agentic inference, and HPO all expose different bottlenecks. The scheduler must understand more than GPU count; it must understand memory pressure, communication structure, context length, trial similarity, and elasticity.

Second, scheduling decisions need better intelligence. Heuristics are robust and deployable, mathematical optimization can be principled but slow, and ML/RL-based schedulers can capture complex patterns but are hard to trust and benchmark. A practical scheduler may combine all three: heuristics for the fast path, profiling for calibration, and optimization or learning for difficult decisions.

Third, hardware heterogeneity is becoming unavoidable. A production cluster may contain multiple GPU generations, specialized interconnects, CPUs, storage tiers, and accelerators. Heterogeneity creates opportunities for better cost-performance, but it also complicates fairness. Allocating an old GPU and a new GPU for the same amount of wall-clock time is rarely equal service.

The simplest summary is this: GPU scheduling is no longer just about filling empty slots. It is about matching workload structure to hardware structure under user-visible objectives.

That is what makes the area interesting. The best scheduler is not merely the one with the shortest queue. It is the one that understands what kind of deep learning work is in front of it, what resources it truly needs, and what trade-off the cluster is willing to make.

Paper: Deep Learning Workload Scheduling in GPU Datacenters: A Survey
Project: Awesome DL Scheduling Papers

ASTRAEA: Fairness Is More Than Counting GPUs

Sun, 17 May 2026 13:00:00 +0800

Fairness sounds simple until a GPU cluster starts running real deep learning workloads.

In a shared research or production cluster, different tenants submit jobs with very different shapes. Some jobs need one GPU for a quick debugging run. Others need many GPUs and run for days. A scheduler that only optimizes utilization may let long jobs dominate the cluster. A scheduler that aggressively favors short jobs may make large training jobs wait forever. In either case, the users who lose out can reasonably say the system is unfair.

ASTRAEA was built around this problem: how should a multi-tenant GPU cluster enforce fairness without wasting expensive accelerators?

Why Existing Fairness Breaks

Traditional cluster schedulers often think in terms of instantaneous resource fairness. If two users share a cluster, each should receive a fair share of resources at the current moment. This works well for many big-data workloads, where tasks are easier to split, migrate, and rebalance.

Deep learning training is less flexible. Jobs usually require gang scheduling: all requested GPUs must be allocated together. Communication-heavy jobs are sensitive to GPU topology. Preemption is also costly because model state must be checkpointed, moved, and restored. If a scheduler tries to enforce fairness by frequently reshuffling GPUs, it can destroy the performance it was meant to protect.

Another approach is finish-time fairness, where the scheduler asks whether a job would finish no later than it would in a private fair-share cluster. That is useful, but incomplete. It focuses on time and can miss the spatial side of fairness: a job that asks for more GPUs consumes more cluster capacity per unit time. Treating a 1-GPU job and an 8-GPU job only through finish time can create incentives to overclaim resources.
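
To see the blind spot, consider a minimal sketch that treats finish-time fairness as a ratio of shared finish time to the finish time in a private fair-share cluster. The ratio form, the function names, and the numbers are assumptions for illustration, not a definition taken from any particular scheduler.

```python
def finish_time_ratio(shared_hours, isolated_hours):
    """Assumed form: values at or below 1.0 mean the job finished no later than in isolation."""
    return shared_hours / isolated_hours


def gpu_hours(gpus, hours):
    """Cluster capacity actually consumed: GPUs held times hours held."""
    return gpus * hours


# Two hypothetical jobs with identical timing but very different footprints.
small = {"gpus": 1, "shared_hours": 12.0, "isolated_hours": 10.0}
large = {"gpus": 8, "shared_hours": 12.0, "isolated_hours": 10.0}

ratio_small = finish_time_ratio(small["shared_hours"], small["isolated_hours"])
ratio_large = finish_time_ratio(large["shared_hours"], large["isolated_hours"])

cost_small = gpu_hours(small["gpus"], small["shared_hours"])
cost_large = gpu_hours(large["gpus"], large["shared_hours"])
```

The two jobs look equally well treated by the time-only ratio, while the 8-GPU job consumed eight times the capacity. A metric that cannot tell them apart gives users no reason to request fewer GPUs.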

ASTRAEA’s core idea is to measure what the cluster is actually spending: GPU-time.

Long-Term GPU-Time Fairness

ASTRAEA introduces Long-Term GPU-Time Fairness, or LTGF. Instead of asking only “how many GPUs does a tenant have right now?” or “when will this job finish?”, LTGF asks how much GPU service a tenant or job has received over a period of time compared with how much it deserves.

This captures both dimensions of allocation:

temporal impact: how long the job runs;
spatial impact: how many GPUs it occupies while running.

At the tenant level, LTGF distributes GPU-time according to tenant weights, such as budget or quota. At the job level, it distributes GPU-time fairly among concurrent jobs inside a tenant. This two-level view is important because a fair cluster should protect both the sharing contract between organizations and the individual jobs waiting inside each tenant’s queue.

The metric also avoids relying on fragile remaining-time prediction. In real clusters, users cancel jobs, jobs fail, and training throughput changes with placement. ASTRAEA can evaluate fairness from past allocation history, then use that signal to decide who should receive service next.
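
As a rough illustration of a history-based metric, the sketch below computes a tenant-level index as received GPU-time divided by deserved GPU-time, where the deserved share is proportional to tenant weight. This ratio form is my simplification for this post, not the exact formula from the paper, and the tenant names and numbers are hypothetical.

```python
def gpu_time(allocations):
    """Sum GPU-hours over past (gpus, hours) allocation intervals."""
    return sum(gpus * hours for gpus, hours in allocations)


def tenant_fairness_index(history, weights):
    """Assumed index: received / deserved GPU-time; below 1.0 means under-served.

    history -- tenant name -> list of (gpus, hours) intervals already served
    weights -- tenant name -> weight (for example budget or quota)
    """
    received = {tenant: gpu_time(allocs) for tenant, allocs in history.items()}
    total = sum(received.values())
    if total == 0:
        # Nothing has been served yet, so nobody is ahead or behind.
        return {tenant: 1.0 for tenant in history}
    weight_sum = sum(weights[tenant] for tenant in history)
    return {
        tenant: received[tenant] / (total * weights[tenant] / weight_sum)
        for tenant in history
    }


# Hypothetical history: tenant_a ran an 8-GPU job for 10 h and a 1-GPU job for 4 h.
history = {"tenant_a": [(8, 10.0), (1, 4.0)], "tenant_b": [(2, 8.0)]}
weights = {"tenant_a": 1.0, "tenant_b": 1.0}
index = tenant_fairness_index(history, weights)
```

With equal weights, tenant_a has received 84 of 100 GPU-hours and tenant_b only 16, so tenant_b's index is far below 1.0. Note that nothing here needs a prediction of remaining time: the inputs are only what has already been allocated.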

How ASTRAEA Schedules

ASTRAEA uses a two-phase scheduling algorithm.

First, it selects the tenant with the lowest tenant-level fairness index. In plain language: the scheduler finds the tenant that has received the least GPU-time relative to what it should have received. If that tenant has pending jobs and the cluster can place one of them, ASTRAEA grants resources to it.

Second, ASTRAEA selects a job within that tenant using the job-level fairness index. This keeps one tenant’s internal queue from becoming unfair even when the tenant as a whole is being treated fairly. Job-level policies can still incorporate practical priorities, but they are constrained by the fairness signal.
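
The two phases can be sketched in a few lines. This is a simplified illustration, not ASTRAEA's implementation: the data layout and the name `pick_next_job` are hypothetical, jobs are placed only if all requested GPUs fit (gang scheduling), and falling through to the next tenant when the most under-served one has nothing placeable is my assumption.

```python
def pick_next_job(tenants, free_gpus):
    """Return (tenant, job) to serve next, or None if nothing fits.

    tenants -- name -> {"index": tenant fairness index,
                        "pending": [{"name", "gpus", "index"}, ...]}
    """
    # Phase 1: visit tenants from most under-served (lowest index) upward.
    for name in sorted(tenants, key=lambda t: tenants[t]["index"]):
        # Gang scheduling: a job is placeable only if all its GPUs are free.
        placeable = [job for job in tenants[name]["pending"] if job["gpus"] <= free_gpus]
        if placeable:
            # Phase 2: inside the tenant, pick the most under-served job.
            job = min(placeable, key=lambda j: j["index"])
            return name, job["name"]
    return None


# Hypothetical cluster state.
tenants = {
    "tenant_a": {"index": 1.68, "pending": [{"name": "a1", "gpus": 4, "index": 0.5}]},
    "tenant_b": {
        "index": 0.32,
        "pending": [
            {"name": "b1", "gpus": 8, "index": 0.2},
            {"name": "b2", "gpus": 2, "index": 0.9},
        ],
    },
}

choice = pick_next_job(tenants, free_gpus=4)
```

With four free GPUs, tenant_b is chosen first because its index is lowest, but its most under-served job b1 needs eight GPUs and cannot be placed, so the sketch falls back to b2. That interaction between fairness order and placement feasibility is where practical job-level policies come in.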

The scheduler is lease-based. Instead of preempting whenever fairness changes, ASTRAEA gives a running job a lease term. At lease boundaries, the scheduler can rearrange execution order to repair fairness. This is a practical compromise: short leases improve fairness response, but too-short leases increase preemption overhead and hurt job completion time. ASTRAEA chooses a lease length that balances those forces for deep learning training.
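
A back-of-the-envelope model makes the lease trade-off visible. It assumes the worst case, where a job is preempted at every lease boundary and pays a fixed checkpoint-and-restore cost each time. The two-minute cost and the lease lengths are made-up numbers, and this toy model is not the analysis used in the paper.

```python
def lease_overhead_fraction(lease_minutes, preempt_cost_minutes):
    """Worst-case share of each lease lost to checkpoint and restore."""
    if lease_minutes <= 0:
        raise ValueError("lease_minutes must be positive")
    return preempt_cost_minutes / lease_minutes


def worst_case_repair_delay(lease_minutes):
    """An under-served job may wait up to one full lease before order can change."""
    return lease_minutes


PREEMPT_COST_MINUTES = 2.0  # hypothetical checkpoint + restore cost

table = [
    (lease, lease_overhead_fraction(lease, PREEMPT_COST_MINUTES), worst_case_repair_delay(lease))
    for lease in (5.0, 30.0, 120.0)
]
```

Under these assumptions a 5-minute lease can burn 40% of its time on preemption but repairs unfairness within 5 minutes, while a 120-minute lease wastes under 2% but lets unfairness persist for up to two hours. The right point in between depends on how expensive preemption is for the workloads in the cluster.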

What It Buys

ASTRAEA was evaluated with large-scale simulations on real GPU cluster traces, including SenseTime’s Venus trace and Microsoft’s Philly trace. The paper reports that ASTRAEA improves tenant-level fairness by up to 9.42x and job-level fairness by up to 10.3x compared with state-of-the-art schedulers, without sacrificing average job completion time.

The important lesson is that fairness in GPU clusters is not just a policy preference. It is a measurement problem. If the metric ignores GPU count, users can overclaim. If it ignores time, long-running jobs can be starved. If it ignores tenants, the cluster violates sharing agreements. If it ignores jobs, individual users still experience unfairness.

ASTRAEA’s contribution is to make fairness measurable in the unit that matters most for deep learning clusters: long-term GPU-time.

Paper: ASTRAEA: A Fair Deep Learning Scheduler for Multi-tenant GPU Clusters
Code: Astraea Artifacts