木叶吟
deep learning training
Characterization of Large Language Model Development in the Datacenter
Large Language Models (LLMs) have presented impressive performance across several transformative tasks. However, it is non-trivial to …
Qinghao Hu, Zhisheng YE, Zerui Wang, Guoteng Wang, Meng Zhang, Qiaoling Chen, Peng Sun, Dahua Lin, Xiaolin Wang, Yingwei Luo, Yonggang Wen, Tianwei Zhang
UniSched: A Unified Scheduler for Deep Learning Training Jobs with Different User Demands
We present UniSched, a unified scheduler to optimize different types of scheduling objectives (e.g., guaranteeing the deadlines of SLO jobs, minimizing the latency of best-effort jobs).
Wei Gao, Zhisheng YE, Peng Sun, Tianwei Zhang, Yonggang Wen
Deep Learning Workload Scheduling in GPU Datacenters: A Survey
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and …
Zhisheng YE, Wei Gao, Qinghao Hu, Peng Sun, Xiaolin Wang, Yingwei Luo, Tianwei Zhang, Yonggang Wen
Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters
Hyperparameter tuning is an essential step in deep learning model development that provides better model performance at the cost of …
Qinghao Hu, Zhisheng YE, Meng Zhang, Qiaoling Chen, Peng Sun, Yonggang Wen, Tianwei Zhang
Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster
With the proliferation of deep learning, there exists a strong need to efficiently operate GPU clusters for deep learning production in …
Zehua Yang, Zhisheng YE, Tianhao Fu, Jing Luo, Xiong Wei, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Tianwei Zhang
ASTRAEA: A Fair Deep Learning Scheduler for Multi-tenant GPU Clusters
We design a new and practical GPU scheduler, ASTRAEA, to enforce the desired fairness among tenants and jobs for deep learning training clusters.
Zhisheng YE, Peng Sun, Wei Gao, Tianwei Zhang, Xiaolin Wang, Shengen Yan, Yingwei Luo
Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs
We present Chronus, an end-to-end scheduling system that provides deadline guarantees for SLO jobs and maximizes the performance of best-effort jobs in deep learning training.
Wei Gao, Zhisheng YE, Peng Sun, Yonggang Wen, Tianwei Zhang