木叶吟
Hybrid Parallelism
ResiHP: Surviving LLM Training Failures with Dynamic Hybrid Parallelism
A technical report on ResiHP, a resilient training system that detects fail-slow devices under noisy sequence-length variation and dynamically adapts 3D parallelism.
Zhisheng YE
May 17, 2026
3 min read