Publications

(2026). ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism. Preprint.

(2026). CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control. arXiv.

(2025). LEMUR: Large Scale End-to-End Multimodal Recommendation. arXiv.

(2025). Memory Offloading for Large Language Model Inference with Latency SLO Guarantees. arXiv.

(2024). Characterization of Large Language Model Development in the Datacenter. In NSDI.

(2023). Deep Learning Workload Scheduling in GPU Datacenters: A Survey. In CSUR.

(2023). AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning. arXiv.

(2023). Hydro: Surrogate-Based Hyperparameter Tuning Service in Datacenters. In OSDI.

(2022). Tear Up the Bubble Boom: Lessons Learned From a Deep Learning Research and Development Cluster. In ICCD.

(2021). ASTRAEA: A Fair Deep Learning Scheduler for Multi-tenant GPU Clusters. In TPDS.
