木叶吟
木叶吟
Home
Experience
Posts
Publications
Services
CV
Light
Dark
Automatic
English
中文 (简体)
LLM Inference
CONCUR: Controlling Mid-Phase Thrashing in Agentic Batch Inference
A technical note on CONCUR, an agent-level admission control layer that prevents KV cache collapse during long-running agentic LLM inference.
Zhisheng YE
May 17, 2026
4 min read
CONCUR:让 Agent 批量推理避开中期拥塞
一篇关于 CONCUR 的技术笔记:它在 agent 层做准入控制,避免长时间运行的 LLM agent 推理把 KV cache 推入失控区间。
Zhisheng YE
May 17, 2026
CONCUR: High-Throughput Agentic Batch Inference of LLM via Congestion-Based Concurrency Control
Batch inference for agentic workloads stresses the GPU key-value (KV) cache in a sustained and cumulative manner, often causing severe …
Qiaoling Chen
,
Zhisheng YE
,
Tian Tang
,
Peng Sun
,
Boyu Tian
,
Guoteng Wang
,
Shenggui Li
,
Yonggang Wen
,
Zhenhua Han
,
Tianwei Zhang
Preprint
PDF
Cite
Latency-SLO-Aware Memory Offloading for Large Language Model Inference
Offloading large language models (LLMs) state to host memory during inference promises to reduce operational costs by supporting larger …
Chenxiang Ma
,
Hanyu Zhao
,
Zhisheng YE
,
Zehua Yang
,
Tianhao Fu
,
Jiaxun Han
,
Jie Zhang
,
Yingwei Luo
,
Xiaolin Wang
,
Zhenlin Wang
,
Yong Li
,
Diyu Zhou
Preprint
Cite
Cite
×