LLM Serving

CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control

Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe …

Qiaoling Chen, Zhisheng Ye, Tian Tang, Peng Sun, Boyu Tian, Guoteng Wang, Shenggui Li, Yonggang Wen, Zhenhua Han, Tianwei Zhang