木叶吟
Deep Learning Systems
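GPU Cluster Scheduling: How Deep Learning Jobs Should Be Queued, Placed, and Shared (Chinese edition)
Based on our ACM Computing Surveys paper, a walkthrough of training, inference, HPO, mixed workloads, and future scheduler design in GPU datacenters.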
Zhisheng YE
May 17, 2026
ASTRAEA: Fairness Is More Than Counting GPUs
A technical note on ASTRAEA, a multi-tenant GPU scheduler that measures fairness by long-term GPU-time instead of instantaneous allocation or finish time alone.
Zhisheng YE
May 17, 2026
5 min read
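ASTRAEA: Fairness in a GPU Cluster Is More Than How Many Cards You Get (Chinese edition)
A technical note on ASTRAEA: it targets multi-tenant GPU clusters and measures fairness by long-term GPU-time, rather than looking only at instantaneous allocation or job completion time.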
Zhisheng YE
May 17, 2026
GPU Cluster Scheduling: A Map for Deep Learning Workloads
A technical guide to GPU datacenter scheduling based on our ACM Computing Surveys paper, covering training, inference, HPO, mixed workloads, and future scheduler design.
Zhisheng YE
May 16, 2026
7 min read
GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling
A technical note on GPU checkpoint/restore for schedulers, using FlowGPU as the main reference and my cudaw prototype as the first version of the codebase.
Zhisheng YE
May 15, 2026
8 min read