木叶吟
Memory Offloading for Large Language Model Inference with Latency SLO Guarantees
Offloading large language model (LLM) state to host memory during inference promises to reduce operational costs by supporting larger …
Chenxiang Ma, Zhisheng Ye, Hanyu Zhao, Zehua Yang, Tianhao Fu, Jiaxun Han, Jie Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Yong Li, Diyu Zhou
Preprint