木叶吟
Fault Tolerance
ResiHP: Surviving LLM Training Failures with Dynamic Hybrid Parallelism
A technical report on ResiHP, a resilient training system that detects fail-slow devices under noisy sequence-length variation and dynamically adapts 3D parallelism.
Zhisheng YE
May 17, 2026
3 min read
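ResiHP: Dynamic Hybrid Parallelism under LLM Training Failures
A technical report on ResiHP: it identifies fail-slow devices amid the noise introduced by variable-length sequences and dynamically adjusts 3D parallelism to make large-model training more resilient.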
Zhisheng YE
May 17, 2026
GPU Pause, Resume, and Migration: The Missing Primitive in Cluster Scheduling
A technical note on GPU checkpoint/restore for schedulers, using FlowGPU as the main reference and my cudaw prototype as the first version of the codebase.
Zhisheng YE
May 15, 2026
8 min read
FlowGPU: Transparent and Efficient GPU Checkpointing and Restore
GPU checkpointing and restore promises to enable emerging tasks, such as deep learning, to benefit from functionalities like task …
Zehua Yang, Xiao Zheng, Yonghao Zou, Junyang Zhang, Zhisheng YE, Feng Xie, Xiaolin Wang, Yingwei Luo, Zhenlin Wang, Diyu Zhou
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism
Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual …
Tenghui Ma, Jihu Guo, Wei Gao, Sitian Lu, Zhisheng YE, Dahua Lin, Hanjing Wang