木叶吟
LLM Inference
CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control
Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe …
Qiaoling Chen, Zhisheng YE, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, Tianwei Zhang
Preprint
Latency-SLO-Aware Memory Offloading for Large Language Model Inference
Offloading large language model (LLM) state to host memory during inference promises to reduce operational costs by supporting larger …
Chenxiang Ma, Hanyu Zhao, Zhisheng YE, Zehua Yang, Tianhao Fu, Jiaxun Han, Jie Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Yong Li, Diyu Zhou
Preprint