木叶吟
LLM Inference
CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control
Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe …
Qiaoling Chen, Zhisheng YE, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, Tianwei Zhang
Preprint
Latency-SLO-Aware Memory Offloading for Large Language Model Inference
Offloading large language model (LLM) state to host memory during inference promises to reduce operational costs by supporting larger …
Chenxiang Ma, Hanyu Zhao, Zhisheng YE, Zehua Yang, Tianhao Fu, Jiaxun Han, Jie Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Yong Li, Diyu Zhou
Preprint