木叶吟
ASYRA: Automating Graph Scheduling for Communication-Computation Overlap in Efficient Model Parallelism
Scaling large models requires complex multi-dimensional (n-D) parallelism, yet this paradigm suffers from severe communication bubbles …
Lei Zhang, Zhisheng YE
CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control
Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe …
Qiaoling Chen, Zhisheng YE, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, Tianwei Zhang
Preprint
LEMUR: Large Scale End-to-End Multimodal Recommendation
Traditional ID-based recommender systems often struggle with cold-start and generalization challenges. Multimodal recommendation …
Xintian Han, Honggang Chen, Quan Lin, Jingyue Gao, Xiangyuan Ren, Lifei Zhu, Zhisheng YE, Shikang Wu, XiongHang Xie, Xiaochu Gan, Bingzheng Wei, Peng Xu, Zhe Wang, Yuchao Zheng, Jingjian Lin, Di Wu, Junfeng Ge
Preprint
Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
Offloading large language models (LLMs) state to host memory during inference promises to reduce operational costs by supporting larger …
Chenxiang Ma, Zhisheng YE, Hanyu Zhao, Zehua Yang, Tianhao Fu, Jiaxun Han, Jie Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Yong Li, Diyu Zhou
Preprint
AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning
Large Language Models (LLMs) have demonstrated impressive performance across various downstream tasks. When training these models, …
Qiaoling Chen, Qinghao Hu, Zhisheng YE, Guoteng Wang, Peng Sun, Yonggang Wen, Tianwei Zhang
Preprint