木叶吟
LLM Inference
Latency-SLO-Aware Memory Offloading for Large Language Model Inference
Offloading large language model (LLM) state to host memory during inference promises to reduce operational costs by supporting larger …
Chenxiang Ma, Hanyu Zhao, Zhisheng Ye, Zehua Yang, Tianhao Fu, Jiaxun Han, Jie Zhang, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Yong Li, Diyu Zhou
Preprint
Cite