木叶吟
Publications
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual …
Tenghui Ma, Jihu Guo, Wei Gao, Sitian Lu, Zhisheng YE, Dahua Lin, Hanjing Wang
CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control
Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe …
Qiaoling Chen, Zhisheng YE, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, Tianwei Zhang
Preprint
LEMUR: Large Scale End-to-End Multimodal Recommendation
Traditional ID-based recommender systems often struggle with cold-start and generalization challenges. Multimodal recommendation …
Xintian Han, Honggang Chen, Quan Lin, Jingyue Gao, Xiangyuan Ren, Lifei Zhu, Zhisheng YE, Shikang Wu, XiongHang Xie, Xiaochu Gan, Bingzheng Wei, Peng Xu, Zhe Wang, Yuchao Zheng, Jingjian Lin, Di Wu, Junfeng Ge
Preprint
AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning
Large Language Models (LLMs) have demonstrated impressive performance across various downstream tasks. When training these models, …
Qiaoling Chen, Qinghao Hu, Zhisheng YE, Guoteng Wang, Peng Sun, Yonggang Wen, Tianwei Zhang
Preprint