Distributed Training

Helix: Automating Communication-Computation Overlap with Graph Scheduling

A technical note on Helix, a compiler-based graph scheduling system that overlaps communication and computation for n-D model parallel training and inference.

Zhisheng YE

May 18, 2026 · 9 min read

ResiHP: Surviving LLM Training Failures with Dynamic Hybrid Parallelism

A technical report on ResiHP, a resilient training system that detects fail-slow devices under noisy sequence-length variation and dynamically adapts 3D parallelism.

Zhisheng YE

May 17, 2026 · 3 min read

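ResiHP: Dynamic Hybrid Parallelism under LLM Training Failures (in Chinese)

A technical report on ResiHP: it identifies fail-slow devices amid the noise introduced by variable-length sequences and dynamically adjusts 3D parallelism to make large-model training more resilient.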

Zhisheng YE

May 17, 2026

ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism

Hybrid parallelism underpins large-scale LLM training across tens of thousands of GPUs. At such scale, hardware failures on individual …

Tenghui Ma, Jihu Guo, Wei Gao, Sitian Lu, Zhisheng YE, Dahua Lin, Hanjing Wang

AMSP: Super-Scaling LLM Training via Advanced Model States Partitioning

Large Language Models (LLMs) have demonstrated impressive performance across various downstream tasks. When training these models, …

Qiaoling Chen, Qinghao Hu, Zhisheng YE, Guoteng Wang, Peng Sun, Yonggang Wen, Tianwei Zhang