Biography

Hi there! I am Zhisheng Ye, a final-year Ph.D. student in Computer Science. I received a B.S. degree in computer science and technology from the School of Electronics Engineering and Computer Science, Peking University, China, in 2019. I am currently working towards a Ph.D. degree in the Institute of Networking and Energy-efficient Computing (NEEC) at Peking University, co-advised by Prof. Yingwei Luo, the director of NEEC, and Prof. Xiaolin Wang.

My research interests include distributed systems, machine learning systems, and resource management. As a former member of the Peking University Cluster Competition Team, I am also interested in high-performance computing and GPU systems. I also receive mentorship from Prof. Tianwei Zhang of NTU and collaborate closely with his students, including Wei Gao, Dr. Qinghao Hu, Meng Zhang, and Qiaoling Chen, as well as with Dr. Peng Sun of Shanghai AI Lab.

Download my CV.

Interests
  • Distributed Systems
  • Machine Learning Systems
  • Resource Management
Education
  • Ph.D. Student in Computer Science, since 2019

    Peking University

  • B.S. in Computer Science and Technology, 2019

    Peking University

Experience

Shanghai AI Laboratory
Research Intern
Jul 2022 – Present
Beijing, China
  • Large-scale model (e.g., LLM, MoE) training infrastructure optimization.
  • Deeply involved in the development of InternLM.
 
SenseTime Research
Research Intern
Sep 2019 – Jun 2022
Beijing, China
  • Supercomputing cluster scheduling and optimization for deep learning training (DLT) workloads at SenseTime Research (now SenseCore).
  • Design and implementation of a fair scheduler for DLT jobs, as first author.
 
Peng Cheng Laboratory
Research Intern
Jul 2018 – Sep 2021
Shenzhen, China
  • Contributed to the development of OpenI-Octopus, an open-source scheduler for deep learning training workloads based on Kubernetes.
  • Safe GPU sharing and efficient migration mechanisms on Kubernetes.
  • Monitoring and logging systems.
 
Peking University Cluster Competition Team
Team member
Sep 2018 – Jun 2019
Beijing, China
  • Participated in analyzing, compiling, profiling, and optimizing general HPC tasks and improving their parallelizability.
  • First Prize (Team), ASC19 Student Supercomputer Challenge

Recent Publications

(2024). Characterization of Large Language Model Development in the Datacenter. In NSDI.

(2023). Deep Learning Workload Scheduling in GPU Datacenters: A Survey. In CSUR.

(2023). Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters. In OSDI.

(2022). Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster. In ICCD.

(2021). ASTRAEA: A Fair Deep Learning Scheduler for Multi-tenant GPU Clusters. In TPDS.
