木叶吟
GPU Scheduling
FlowGPU: Transparent and Efficient GPU Checkpointing and Restore
GPU checkpointing and restore promises to enable emerging tasks, such as deep learning, to benefit from functionalities like task …
Zehua Yang, Xiao Zheng, Yonghao Zou, Junyang Zhang, Zhisheng YE, Feng Xie, Xiaolin Wang, Yingwei Luo, Zhenlin Wang, Diyu Zhou
Characterization of Large Language Model Development in the Datacenter
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to …
Qinghao Hu, Zhisheng YE, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang
UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands
We present UniSched, a unified scheduler to optimize different types of scheduling objectives (e.g., guaranteeing the deadlines of SLO jobs, minimizing the latency of best-effort jobs).
Wei Gao, Zhisheng YE, Peng Sun, Tianwei Zhang, Yonggang Wen
Deep Learning Workload Scheduling in GPU Datacenters: A Survey
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and …
Zhisheng YE, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen
Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
Hyperparameter tuning is an essential step in deep learning model development that provides better model performance at the cost of …
Qinghao Hu, Zhisheng YE, Meng Zhang, Qiaoling Chen, Peng Sun, Yonggang Wen, Tianwei Zhang
Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster
With the proliferation of deep learning, there exists a strong need to efficiently operate GPU clusters for deep learning production in …
Zehua Yang, Zhisheng YE, Tianhao Fu, Jing Luo, Xiong Wei, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Tianwei Zhang
ASTRAEA: A Fair Deep Learning Scheduler for Multi-tenant GPU Clusters
We design a new and practical GPU scheduler, ASTRAEA, to enforce the desired fairness among tenants and jobs for deep learning training clusters.
Zhisheng YE, Peng Sun, Wei Gao, Tianwei Zhang, Xiaolin Wang, Shengen Yan, Yingwei Luo
Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
We present Chronus, an end-to-end scheduling system for deep learning training that provides deadline guarantees for SLO jobs and maximizes the performance of best-effort jobs.
Wei Gao, Zhisheng YE, Peng Sun, Yonggang Wen, Tianwei Zhang