Characterization of Large Language Model Development in the Datacenter

System design


Large Language Models (LLMs) have demonstrated impressive performance across several transformative tasks. However, efficiently utilizing large-scale cluster resources to develop LLMs is non-trivial and is often riddled with challenges such as frequent hardware failures, intricate parallelization strategies, and imbalanced resource utilization. In this paper, we present an in-depth characterization study of a six-month LLM development workload trace collected from our GPU datacenter Acme. Specifically, we investigate discrepancies between LLMs and prior task-specific Deep Learning (DL) workloads, explore resource utilization patterns, and identify the impact of various job failures.

In USENIX Symposium on Networked Systems Design and Implementation
Zhisheng YE
CS Ph.D. student

My research interests include distributed systems, machine learning systems, and resource management.