Helix: Automating Communication-Computation Overlap with Graph Scheduling

Mon, 18 May 2026 12:00:00 +0800

Large models are rarely trained or served with one clean parallelism strategy. Tensor parallelism splits matrix operations. Pipeline parallelism splits layers. Sequence parallelism stretches context length across devices. Expert parallelism routes tokens through distributed experts. Real deployments increasingly compose several of these dimensions at once.

That composition is powerful, but it creates a familiar systems tax: communication bubbles. When an AllReduce, AllGather, ReduceScatter, or All-to-All sits on the critical path, GPU compute units wait. In a dense tensor-parallel block, the waiting may come from sharded matrix results. In a long-context sequence-parallel block, it may come from exchanging sequence chunks. In a MoE layer, it may come from routing tokens across experts. The parallel strategy changes, but the shape of the problem is the same: computation and communication are both present, yet the execution graph does not expose enough safe overlap.

Helix is built around one idea: communication-computation overlap should be a graph scheduling problem, not a hand-written kernel trick for each new parallel pattern. Helix is currently a work-in-progress research project that started in early 2026. Please check back for updates.

Why Manual Overlap Does Not Scale

The fastest overlap techniques often go deep into kernels. They split an operation into small pieces, launch communication early, and fuse enough computation around it to hide latency. This can work extremely well for one pattern. Ring-style attention can overlap sequence exchange with local attention blocks. Tensor-parallel kernels can pipeline collectives with partial matrix multiplications. MoE systems can schedule expert computation around token dispatch.

The problem is that each of these optimizations tends to encode assumptions about the model, the collective, the tiling shape, and the synchronization protocol. Once the model architecture changes, or a deployment combines tensor, sequence, and expert parallelism, the optimization becomes harder to reuse. A local trick may also miss a global opportunity: a communication operation produced by one parallel dimension might be hidden under computation from another dimension, but a pattern-specific optimizer will not necessarily see that.

Helix moves the optimization boundary up to the compiled execution graph. After torch.compile captures the model-parallel program, Helix sees compute operators, communication operators, waits, and dependency edges in one intermediate representation. That unified graph is the key abstraction. The compiler no longer needs a separate overlap recipe for every parallel strategy; it can schedule visible compute and communication nodes under the same correctness rules.

The Scheduling Objective

At a high level, Helix treats the model-parallel program as a directed graph. Nodes are compute or communication operators. Edges are precedence constraints: an operator can run only after the values it depends on are ready.
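To make the graph view concrete, here is a minimal sketch of such a graph and a makespan estimate. It is not Helix's implementation: the `Node` class, the `makespan` function, the two-stream execution model (one in-order compute stream, one in-order communication stream), and all costs are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    kind: str            # "compute" or "comm"
    cost: float          # estimated duration, arbitrary time units
    deps: tuple = ()     # names of nodes that must finish first


def makespan(order):
    """Estimate the finish time of a schedule on two in-order streams.

    Assumes one compute stream and one communication stream. Nodes are
    issued in `order`; each starts when its stream is free and all of
    its dependencies have finished.
    """
    finish = {}
    stream_free = {"compute": 0.0, "comm": 0.0}
    for node in order:
        missing = [d for d in node.deps if d not in finish]
        if missing:
            raise ValueError(f"{node.name} issued before {missing}")
        start = max([stream_free[node.kind]] + [finish[d] for d in node.deps])
        finish[node.name] = start + node.cost
        stream_free[node.kind] = finish[node.name]
    return max(finish.values(), default=0.0)


def chain(tag):
    """A toy matmul -> all-reduce -> add chain (hypothetical costs)."""
    mm = Node(f"matmul_{tag}", "compute", 4.0)
    ar = Node(f"allreduce_{tag}", "comm", 3.0, (mm.name,))
    add = Node(f"add_{tag}", "compute", 1.0, (ar.name,))
    return mm, ar, add


mm_a, ar_a, add_a = chain("a")
mm_b, ar_b, add_b = chain("b")

serial = [mm_a, ar_a, add_a, mm_b, ar_b, add_b]
interleaved = [mm_a, ar_a, mm_b, ar_b, add_a, add_b]

print(makespan(serial))       # 16.0
print(makespan(interleaved))  # 12.0
```

Both orders respect every dependency edge. The interleaved one is shorter only because the all-reduce of chain `a` runs while the matmul of chain `b` occupies the compute stream.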

The optimization goal is straightforward but constrained:

reduce the graph makespan by hiding communication under independent computation;
preserve every data dependency in the original graph;
keep peak memory below the available device budget.

That last constraint matters. Aggressive overlap is not free. If the compiler launches too much work early, intermediate activations and communication buffers live longer. A schedule that looks faster on the timeline can become unusable because it inflates peak memory. Helix therefore optimizes both time and memory, guided by a lightweight graph simulator.
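One way to write the three goals above as a single constrained problem is sketched below. The notation is an assumption introduced here, not taken from the paper: $\sigma$ is a schedule, $V$ and $E$ are the nodes and dependency edges of the graph, $s_v$ and $f_v$ are the start and finish times of node $v$ under $\sigma$, $M_\sigma(t)$ is the live memory at time $t$, and $M_{\text{budget}}$ is the device memory budget.

```latex
\begin{aligned}
\min_{\sigma}\quad & T(\sigma) = \max_{v \in V} f_v(\sigma) \\
\text{s.t.}\quad & f_u(\sigma) \le s_v(\sigma) \quad \forall (u, v) \in E, \\
& \max_{t} M_\sigma(t) \le M_{\text{budget}}.
\end{aligned}
```

The first line is the makespan, the second preserves every data dependency, and the third is the peak-memory constraint.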

The system uses three compiler passes: tiling, reordering, and bucketing.

Tiling: Create Overlap Opportunities

The original execution graph is often too coarse. A large compute operator may wait for a large communication operator, even though smaller chunks of the work could have been interleaved. Helix first applies graph tiling: it partitions operators into multiple tile streams while preserving the dependency structure inside each stream.

By default, Helix tiles along the batch dimension, a choice the paper justifies as broadly applicable and easy to reason about. Other dimensions, such as sequence length, can also be used when correctness is guaranteed for that region of the graph.

Tiling has two benefits. First, it exposes overlap. Communication from one tile stream can be launched while computation from another tile stream is still running. This turns one rigid graph into several smaller streams that can be woven together. Second, it can reduce activation memory. Smaller tiles mean smaller live inputs and intermediate tensors, so the peak memory footprint can drop when lifetimes are well controlled.
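The tiling step can be sketched on a toy graph as follows. This is an illustrative sketch, not Helix's pass: the tuple layout, the `tile_graph` name, the `@tile` naming scheme, and the assumption that cost divides evenly by the tiling factor are all hypothetical. It also assumes every operator is independent along the tiled dimension, which is the correctness condition a real pass would have to check.

```python
def tile_graph(nodes, k=2):
    """Split a graph into k tile streams.

    `nodes` is a list of (name, kind, cost, deps) tuples in topological
    order. Each tile stream is a renamed copy whose dependencies point
    only at nodes of the same tile, so the dependency structure inside
    each stream matches the original graph.

    Assumption: cost scales as cost / k, which ignores the efficiency
    loss that small tiles can suffer in practice.
    """
    if k < 1:
        raise ValueError("tiling factor must be at least 1")
    streams = []
    for t in range(k):
        streams.append([
            (f"{name}@{t}", kind, cost / k, tuple(f"{d}@{t}" for d in deps))
            for name, kind, cost, deps in nodes
        ])
    return streams


graph = [
    ("matmul", "compute", 8.0, ()),
    ("allreduce", "comm", 6.0, ("matmul",)),
    ("add", "compute", 2.0, ("allreduce",)),
]

streams = tile_graph(graph, k=2)
for stream in streams:
    print(stream)
```

Because no edge crosses from one tile stream to another, a scheduler is free to interleave the streams in any order that keeps each stream's internal order intact.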

But tiling also has a cost. Smaller compute kernels can lose efficiency, especially for memory-bound operators such as normalization, softmax, and pointwise functions. Smaller communication messages can also lose effective bandwidth. The paper’s profiling shows that a small tiling factor is usually the practical choice; Helix uses K = 2 by default because it exposes useful overlap without creating excessive fragmentation.
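The communication side of this cost can be illustrated with the standard latency-bandwidth model, in which every message pays a fixed startup latency plus a size-dependent transfer time. This model and the numbers below are assumptions for illustration, not the paper's profiling data.

```python
def message_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency-bandwidth estimate for sending one message."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


def tiled_time(num_bytes, k, latency_s, bandwidth_bytes_per_s):
    """Total time to send the same payload as k equal messages, back to back."""
    per_tile = message_time(num_bytes / k, latency_s, bandwidth_bytes_per_s)
    return k * per_tile


PAYLOAD = 100e6      # bytes, hypothetical
LATENCY = 20e-6      # seconds per message, hypothetical
BANDWIDTH = 100e9    # bytes per second, hypothetical

for k in (1, 2, 16):
    total_ms = tiled_time(PAYLOAD, k, LATENCY, BANDWIDTH) * 1e3
    print(f"K={k:2d}: {total_ms:.2f} ms")
```

Under this model the transfer term is unchanged by tiling, while the latency term grows linearly with K. A small factor adds little overhead; a large one lets fixed per-message costs dominate.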

Reordering: Make Overlap Safe

After tiling, the compiler has several independent tile streams. The next question is launch order.

A naive schedule would simply execute the streams one by one. That preserves correctness, but it leaves communication bubbles exposed. An overly aggressive schedule would launch many asynchronous operations early, which may improve overlap but keep too many tensors alive and push peak memory upward.

Helix uses Segmented Round-Robin Reordering to sit between those extremes. The key observation is that explicit wait operators are natural segment boundaries. Within a tile stream, Helix groups contiguous non-blocking compute and communication operators into a segment until it reaches a wait. It then schedules segments across streams in a round-robin style. Communication from one stream can be injected into the compute-heavy region of another stream, but waits still force the graph to respect the original data dependencies.

This segment-level granularity is important. It is coarse enough to avoid the memory explosion of operator-by-operator eager scheduling, because segments are flushed at synchronization boundaries and their intermediates can be released. It is also fine enough to move communication earlier than the original graph would allow under strict serial execution.
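A minimal sketch of the segment-and-interleave idea follows. The function names and the string-based operator encoding are hypothetical. One detail is an assumption made here because the description does not pin it down: the segment boundary is placed just before each wait, so a wait opens the next segment rather than closing the current one.

```python
def split_segments(stream, is_wait):
    """Cut one tile stream into segments, starting a new one at each wait."""
    segments, current = [], []
    for op in stream:
        if is_wait(op) and current:
            segments.append(current)
            current = []
        current.append(op)
    if current:
        segments.append(current)
    return segments


def segmented_round_robin(streams, is_wait):
    """Emit segment i of every stream before segment i + 1 of any stream."""
    per_stream = [split_segments(s, is_wait) for s in streams]
    rounds = max((len(segs) for segs in per_stream), default=0)
    order = []
    for i in range(rounds):
        for segs in per_stream:
            if i < len(segs):
                order.extend(segs[i])
    return order


streams = [
    ["matmul@0", "allreduce_start@0", "wait@0", "add@0"],
    ["matmul@1", "allreduce_start@1", "wait@1", "add@1"],
]

order = segmented_round_robin(streams, is_wait=lambda op: op.startswith("wait"))
print(order)
```

In the resulting order, `allreduce_start@0` is issued before `matmul@1`, so the first tile's communication can run under the second tile's compute. Each `wait` still precedes the `add` that consumes its result, so the original dependencies within every stream are untouched.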

In practice, this is where Helix gets much of its generality. The scheduler does not need to know that a node came from tensor parallelism, sequence parallelism, or expert parallelism. If the node is visible in the graph and its dependencies are explicit, the reordering pass can reason about it.

Bucketing: Recover Kernel Efficiency

Tiling creates flexibility, but too much fragmentation hurts hardware efficiency. The bucketing pass repairs that damage selectively.

The idea is to merge compatible operators across tile streams back into larger buckets when doing so improves end-to-end performance. This sounds simple, but it creates a trade-off. Bucketing can reduce kernel-launch overhead and improve compute or communication efficiency. At the same time, it may reintroduce synchronization, reduce scheduling freedom, and extend tensor lifetimes by moving some work earlier.

Helix treats bucketing as a constrained search. For a candidate merge, the graph simulator estimates two quantities: the new makespan and the new peak memory. A merge is useful only if the saved time is worth the additional memory cost and does not destroy the overlap created by tiling and reordering. The implementation uses dynamic programming over candidate buckets, choosing the set of merges that gives the best schedule under the memory budget.
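The flavor of a dynamic program over candidate buckets can be shown with a deliberately simplified model. This is a generic contiguous-partition DP, not Helix's algorithm: the cost function (a fixed launch overhead per bucket plus time proportional to size), the memory rule (a bucket's buffers must fit in the budget together), and all names are assumptions. In Helix, the time and memory of a candidate merge come from the graph simulator instead.

```python
def plan_buckets(sizes, launch_overhead, time_per_unit, budget):
    """Partition ops 0..n-1 into contiguous buckets with minimum estimated time.

    A bucket covering ops j..i-1 is assumed to cost
    launch_overhead + time_per_unit * sum(sizes[j:i]) and to need
    sum(sizes[j:i]) units of buffer memory, which must not exceed budget.
    Returns (total_time, [(start, end), ...]) with end exclusive.
    """
    n = len(sizes)
    inf = float("inf")
    best = [0.0] + [inf] * n     # best[i]: min time to cover the first i ops
    cut = [0] * (n + 1)          # cut[i]: start index of the last bucket
    for i in range(1, n + 1):
        total = 0
        for j in range(i - 1, -1, -1):
            total += sizes[j]
            if total > budget:
                break            # growing the bucket further only adds memory
            candidate = best[j] + launch_overhead + time_per_unit * total
            if candidate < best[i]:
                best[i], cut[i] = candidate, j
    if best[n] == inf:
        raise ValueError("an operator does not fit in the memory budget")
    buckets, i = [], n
    while i > 0:
        buckets.append((cut[i], i))
        i = cut[i]
    return best[n], buckets[::-1]


total_time, buckets = plan_buckets(
    sizes=[4, 4, 4, 4], launch_overhead=1.0, time_per_unit=0.5, budget=8
)
print(total_time, buckets)  # 10.0 [(0, 2), (2, 4)]
```

With a budget of 8, the plan merges pairs of operators: two launches instead of four, without ever holding more than two tiles' buffers at once. A looser budget would allow one large bucket, and a tighter one would leave every operator unmerged.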

This pass is the reason Helix is not just “split everything and hope.” It deliberately creates overlap granularity, then fuses back the pieces that should not remain separate.

The Simulator Is the Control Loop

The graph simulator is small but central. It runs at compile time and estimates both runtime and peak memory for candidate schedules. For compute and communication cost, it combines graph-visible operator semantics, tensor shapes, analytical modeling, and automated benchmarking. For memory, it simulates execution order and tracks the lifetimes of tensors and communication buffers.
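The memory half of such a simulator can be sketched by walking the execution order and tracking tensor lifetimes. This is a simplified illustration under stated assumptions, not Helix's simulator: a tensor is allocated when its producer runs and freed right after its last consumer, and the operator names, sizes, and the `peak_memory` function are hypothetical.

```python
def peak_memory(order, produces, consumes):
    """Estimate peak live memory for a given execution order.

    order:    list of operator names in execution order
    produces: {op: {tensor_name: size}}
    consumes: {op: [tensor_name, ...]}
    Tensors that are never consumed stay live until the end.
    """
    last_use = {}
    for step, op in enumerate(order):
        for tensor in consumes.get(op, ()):
            last_use[tensor] = step

    live, current, peak = {}, 0, 0
    for step, op in enumerate(order):
        for tensor, size in produces.get(op, {}).items():
            live[tensor] = size
            current += size
        peak = max(peak, current)          # inputs and outputs coexist here
        for tensor in consumes.get(op, ()):
            if last_use[tensor] == step and tensor in live:
                current -= live.pop(tensor)
    return peak


produces = {
    "load0": {"x0": 4}, "f0": {"y0": 2},
    "load1": {"x1": 4}, "f1": {"y1": 2},
}
consumes = {"f0": ["x0"], "f1": ["x1"], "out": ["y0", "y1"]}

stream_by_stream = ["load0", "f0", "load1", "f1", "out"]
eager = ["load0", "load1", "f0", "f1", "out"]

print(peak_memory(stream_by_stream, produces, consumes))  # 8
print(peak_memory(eager, produces, consumes))             # 10
```

Both orders compute the same result, but issuing `load1` early keeps `x1` alive while `f0` runs, which raises the peak from 8 to 10 units. This is the lifetime effect that an overlap-seeking scheduler has to account for.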

The simulator does not have to be perfect to be useful. It needs to rank scheduling choices well enough that the compiler avoids obviously bad trade-offs. The paper reports close agreement between estimated and real traces across GPT-3, LLaMA3, and Qwen3-MoE configurations. For example, on a GPT-3 Curie setup with TP=2 and SP=4, the estimated runtime is 6.80 seconds versus 6.41 seconds measured, and the estimated peak memory is 65.9 GiB versus 66.0 GiB measured.
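For reference, the gap in that example can be expressed as a relative error against the measured value. The figures are the ones quoted above; the helper function is only a convenience for the arithmetic.

```python
def relative_error(estimated, measured):
    """Absolute difference as a fraction of the measured value."""
    return abs(estimated - measured) / measured


runtime_err = relative_error(6.80, 6.41)   # seconds
memory_err = relative_error(65.9, 66.0)    # GiB

print(f"runtime error: {runtime_err:.1%}")      # 6.1%
print(f"peak memory error: {memory_err:.2%}")   # 0.15%
```

The memory estimate is much tighter than the runtime estimate, but both are close enough for ranking candidate schedules rather than predicting exact timings.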

That fidelity matters because the optimizer is making decisions before the real run. Without a simulator, the compiler would either need expensive trial execution or rely on brittle heuristics.

What It Buys

Across GPT-3, LLaMA3, and Qwen3-MoE workloads, Helix shows the same pattern: once communication is exposed to graph scheduling, bubbles shrink and useful GPU work rises. End-to-end training throughput improves by 4% to 9% within a node, and by 12% to 30% when communication crosses nodes. At the layer level, communication bubbles are often reduced by more than 60%, which is the direct evidence that the scheduler is hiding communication rather than merely shifting overhead around.

The memory result is also important. In long-context inference, Helix reduces activation memory by up to 30%; in the measured trace, peak memory falls from 23.5 GiB to 21.4 GiB, about a 9% reduction in the overall peak. This comes from the same design principle as the performance gain: the graph scheduler controls when tiles become live and when their intermediates can be released, instead of letting overlap inflate memory lifetime accidentally.

Helix also compares favorably with hand-tuned tensor-parallel overlap. On large GPT and LLaMA training runs, it reaches 17% and 16% speedups over the baseline, while AsyncTP reports 12% and 13% in the same comparison. The point is not that compiler scheduling makes specialized kernels obsolete. The point is that a graph-level optimizer can find cross-dimensional overlap while keeping correctness, synchronization, kernel efficiency, and memory lifetime in one place.

That is the technical core of Helix: make communication visible, make dependencies explicit, and let the compiler schedule the overlap that manual implementations would otherwise have to rediscover for each workload.
