Max Ryabinin · 13:09 "Training models with large context lengths is a fun and challenging goal, but the bottlenecks show up in unexpected places."
Together AI is an AI-native cloud that provides GPU clusters, fine-tuning, and an inference platform hosting over 200 models. This talk is about the training side — how to stack up memory optimizations so that a standard transformer can be trained at a context length of 5 million tokens on a single node (8 H100s).
Why long context
Max Ryabinin gives two reasons why interest in long-context training is rising. One is the spread of agents — there are more tokens you want to put into the context, and you want the model to actually make use of them. The other is applications like video generation — you need to track multiple frames (sometimes multiple frames per second), which quickly eats up a large number of tokens. On top of that, you need temporal consistency, the ability to see "what happened a few seconds ago, ideally a few minutes ago." For these to work, the model has to be able to process that long context correctly during training. And even if you're not at the multi-million-token scale, if you understand where the memory goes, you can reinvest what you free up into something else and speed up training.
Two bottlenecks
When you try to extend the context of a standard transformer, you run into two bottlenecks. The first is quadratic compute: a transformer's attention computes the pairwise interactions between all elements in the sequence, so the amount of compute scales with the square of the sequence length. Just as the number of handshakes when N people each shake hands with everyone grows on the order of N², doubling the sequence length roughly quadruples the compute. The second is trickier: memory keeps growing linearly as the context gets longer. Linear isn't as bad as quadratic, but it's hard to handle unless you apply a specific set of techniques. An example from Hugging Face's training blog also shows that growing the sequence length puts considerable pressure on the memory ceiling.
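The two scaling laws above can be made concrete with a few lines of arithmetic. This is a minimal sketch: the function names and the hidden size of 4096 are illustrative assumptions, not figures from the talk, and the counts ignore constants such as the number of layers or heads.

```python
def attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full (non-causal) attention."""
    return seq_len * seq_len


def activation_elements(seq_len: int, hidden_size: int) -> int:
    """Elements in one [seq_len, hidden_size] activation tensor."""
    return seq_len * hidden_size


HIDDEN_SIZE = 4096  # illustrative assumption, not a figure from the talk

for n in (1_000_000, 2_000_000, 4_000_000):
    print(f"{n:>9,} tokens: {attention_pairs(n):.2e} pairs, "
          f"{activation_elements(n, HIDDEN_SIZE):.2e} activation elements")
```

Each doubling of the sequence length multiplies the pair count by four and the activation element count by two, which is why the two bottlenecks call for different remedies.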
Stacking up memory savings
Assuming the scenario of fitting Llama3-8B at 3 million tokens on a node of 8 H100s, Ryabinin shows how known techniques are stacked one after another.
At first, just placing the model's parameters overflows memory. Next, FSDP (Fully Sharded Data Parallelism), PyTorch's fully sharded data parallelism, splits (shards) the model's parameters, gradients, and optimizer states across the 8 GPUs, so each GPU holds only a piece. Think of 8 GPUs each carrying 1/8 of a single huge book. The model's memory drops sharply, but the attention activations still overflow. So context parallelism, in the form of DeepSpeed Ulysses, is introduced: it splits a long sequence across multiple GPUs. Rather than having every GPU compute the multi-head attention for the full sequence, Microsoft's DeepSpeed Ulysses computes different heads on different GPUs and communicates the necessary activations among them. Each GPU handles its own heads while computing attention across the whole sequence, which lets you keep using the best implementations such as FlashAttention. This drops the memory used to about one-eighth, but it's still far from the goal of fitting on a single node.
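The exchange that turns a sequence-sharded layout into a head-sharded one can be simulated in a single process. The sketch below is not DeepSpeed's API: the function names, the nested-list layout, and the toy sizes are all assumptions made for illustration, and the "all-to-all" is plain list slicing rather than GPU communication.

```python
def all_to_all_seq_to_head(seq_sharded, num_heads):
    """Simulate the sequence-to-head exchange in one process.

    Input:  seq_sharded[rank][token][head], each rank holding a contiguous
            slice of tokens for ALL heads.
    Output: head_sharded[rank][token][local_head], each rank holding ALL
            tokens for its own slice of heads.
    """
    world = len(seq_sharded)
    assert num_heads % world == 0, "heads must divide evenly across ranks"
    heads_per_rank = num_heads // world
    head_sharded = []
    for rank in range(world):
        first = rank * heads_per_rank
        rows = []
        # Concatenate token slices from every source rank, in rank order.
        for src in range(world):
            for token_row in seq_sharded[src]:
                rows.append(token_row[first:first + heads_per_rank])
        head_sharded.append(rows)
    return head_sharded


def all_to_all_head_to_seq(head_sharded, tokens_per_rank):
    """Inverse exchange: back to sequence shards that hold all heads."""
    world = len(head_sharded)
    seq_sharded = []
    for rank in range(world):
        start = rank * tokens_per_rank
        rows = []
        for t in range(start, start + tokens_per_rank):
            full_row = []
            for src in range(world):
                full_row.extend(head_sharded[src][t])
            rows.append(full_row)
        seq_sharded.append(rows)
    return seq_sharded


WORLD, TOKENS_PER_RANK, NUM_HEADS = 4, 2, 8  # toy sizes (assumed)
# Label each entry (global_token, head) so the layout can be inspected.
seq_sharded = [
    [[(rank * TOKENS_PER_RANK + t, h) for h in range(NUM_HEADS)]
     for t in range(TOKENS_PER_RANK)]
    for rank in range(WORLD)
]
head_sharded = all_to_all_seq_to_head(seq_sharded, NUM_HEADS)
print(head_sharded[1][0])  # [(0, 2), (0, 3)]: rank 1 sees token 0, heads 2-3
```

After the exchange every rank sees the whole sequence but only its own heads, so an unmodified single-GPU attention kernel can run on each rank; the inverse exchange restores the sequence-sharded layout for the rest of the block.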
Next, activation checkpointing is applied: rather than keeping all the intermediate activations computed in the forward pass, they are recomputed at the point they are needed during the backward pass. Think of not saving every draft but cheaply rewriting one when needed. It cuts activation memory by roughly another factor of eight, but it incurs the compute cost of the recomputation, so it is enabled in a way that doesn't become an excessive burden. Furthermore, CPU offloading moves part of each transformer block's input off the GPU to CPU memory and prefetches it just before backpropagation reaches the relevant layer (offloading about 37GB). Think of stowing documents you aren't using right now in a drawer (the CPU) and pulling them out just before you need them. Because of the prefetching, the performance impact is small. The talk credits the fine-tuning optimization library Unsloth as the first to implement this. Finally, element-wise computations (loss and MLP) are tiled along the sequence-length dimension, which avoids creating a huge buffer of sequence length 3 million. Stacking all of this up, 3 million tokens finally becomes possible.
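The tiling step can be illustrated with the loss computation. The sketch below, in plain Python with hypothetical names and toy sizes, computes a cross-entropy loss one sequence tile at a time, so the largest temporary is a [tile, vocab] block of logits rather than a [sequence, vocab] one. It is an illustration of the idea under those assumptions, not the implementation used in the talk.

```python
import math


def tile_loss_sum(hidden_tile, target_tile, weight):
    """Summed cross-entropy for one tile of positions.

    The [tile, vocab] logits list built here is the only large temporary.
    """
    logits = [[sum(w_i * h_i for w_i, h_i in zip(w, h)) for w in weight]
              for h in hidden_tile]
    total = 0.0
    for row, target in zip(logits, target_tile):
        peak = max(row)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[target]
    return total, len(logits) * len(weight)


def tiled_mean_loss(hidden, targets, weight, tile_size):
    """Mean loss over the sequence, computed one tile at a time."""
    total, peak_logits = 0.0, 0
    for start in range(0, len(hidden), tile_size):
        stop = start + tile_size
        tile_total, n_logits = tile_loss_sum(
            hidden[start:stop], targets[start:stop], weight)
        total += tile_total
        peak_logits = max(peak_logits, n_logits)
    return total / len(hidden), peak_logits


# Deterministic toy data (hypothetical sizes, chosen only for illustration).
SEQ, HIDDEN, VOCAB = 12, 4, 6
hidden = [[math.sin(t + j) for j in range(HIDDEN)] for t in range(SEQ)]
weight = [[math.cos(v * j + 1.0) for j in range(HIDDEN)] for v in range(VOCAB)]
targets = [t % VOCAB for t in range(SEQ)]

full_loss, full_peak = tiled_mean_loss(hidden, targets, weight, tile_size=SEQ)
tiled_loss, tiled_peak = tiled_mean_loss(hidden, targets, weight, tile_size=3)
print(full_peak, tiled_peak)  # 72 18
```

The loss value is the same either way; only the size of the largest temporary changes, and it scales with the tile size instead of the sequence length.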
Untied Ulysses (UPipe) — head-wise chunking
So, how do you go even further? Here the study's main optimization appears: Untied Ulysses (UPipe), Together AI's development of context parallelism (arXiv 2602.21196). When multiple heads are assigned to a single GPU, they are split into small chunks and processed one after another in time. According to the paper, it cuts the memory of attention's intermediate tensors by up to 87.5% on a 32B model. Ryabinin positions it as a deeper analysis and extension of context parallelism.
The idea goes like this: computing just one set of heads already saturates the GPU's compute capacity within a single iteration. So if multiple heads are assigned to a single GPU, you can split them into chunks and iterate over them in time. Recompute one group of heads → compute their attention → save the partial result → the next stage reuses the buffer allocated by the previous stage. Instead of allocating one giant buffer as before, a small buffer is reused across two or more iterations. At small scale, you can save still more activation memory without losing much throughput. It's close to the idea of preparing a single small workbench and reusing it for each group of heads.
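The buffer-reuse pattern can be sketched in plain Python. This is a toy, single-process illustration and not Together AI's implementation: the function names are hypothetical, and the per-head score matrix stands in for whatever intermediate tensors the real method keeps per chunk (production kernels such as FlashAttention do not materialize scores this way).

```python
import math


def attention_one_head(q, k, v, scores):
    """Softmax attention for one head, writing scores into a reused buffer."""
    seq, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(seq):
        for j in range(seq):
            scores[i][j] = scale * sum(a * b for a, b in zip(q[i], k[j]))
        peak = max(scores[i])
        weights = [math.exp(s - peak) for s in scores[i]]
        norm = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / norm
                    for d in range(dim)])
    return out


def chunked_head_attention(q, k, v, chunk_size):
    """Process heads chunk_size at a time, reusing the same buffers."""
    num_heads, seq = len(q), len(q[0])
    # Allocated once: chunk_size buffers instead of num_heads buffers.
    buffers = [[[0.0] * seq for _ in range(seq)] for _ in range(chunk_size)]
    outputs = []
    for start in range(0, num_heads, chunk_size):
        stop = min(start + chunk_size, num_heads)
        for slot, head in enumerate(range(start, stop)):
            outputs.append(
                attention_one_head(q[head], k[head], v[head], buffers[slot]))
    return outputs, chunk_size * seq * seq


# Deterministic toy tensors shaped [heads][seq][dim] (sizes are assumptions).
HEADS, SEQ, DIM = 4, 5, 3
q = [[[math.sin(h + i + d) for d in range(DIM)] for i in range(SEQ)] for h in range(HEADS)]
k = [[[math.cos(h + 2 * i + d) for d in range(DIM)] for i in range(SEQ)] for h in range(HEADS)]
v = [[[math.sin(h * i + d) for d in range(DIM)] for i in range(SEQ)] for h in range(HEADS)]

all_at_once, big_buffer = chunked_head_attention(q, k, v, chunk_size=HEADS)
one_by_one, small_buffer = chunked_head_attention(q, k, v, chunk_size=1)
print(big_buffer, small_buffer)  # 100 25
```

Both calls produce the same attention outputs; the chunked call simply trades one pass over all heads for several passes over a smaller, reused buffer.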
Results and lessons
In measurements on both 8B (Llama3-8B) and 32B (Qwen3-32B) models, it stays close in performance to the most memory-efficient transformer training implementation while scaling up to 5 million tokens, and in some cases it is even faster at shorter sequences. According to the paper, Llama3-8B can be trained at up to 5 million tokens on a single node (8×H100), a 25% extension over the 4 million of the prior state of the art, FPDT, and at up to 8 million tokens on two nodes (16×H100); on the 32B model it cuts the memory of attention's intermediate tensors by up to 87.5%. The relationship between chunk size (the number of heads computed at once) and throughput is straightforward: the larger the chunk, the higher the memory usage and the slightly faster the run. With everything stacked and UPipe on top, you can either reinvest the freed memory elsewhere (such as pipeline stages) or make training at the 5-million-token scale feasible.
The lesson Ryabinin offers is that "the bottlenecks show up in unexpected places." He closes by saying that tools like the PyTorch profiler (detailed in the paper) are a great help. In the Q&A, he touched on QKV (a transformer layer's query / key / value matrices) and added that at a sequence length of 3 million you end up allocating huge tensors for the all-pairs interactions, so it is only by combining multiple techniques, not UPipe alone, that you can avoid running out of memory.
Editorial notes
The value of this talk is that it shows not a "magic single move" but a "stacking up." It stacks known stages in order (FSDP → DeepSpeed Ulysses → activation checkpointing → CPU offloading (Unsloth) → tiling) and, rather than stopping there, adds one final stage: "head-wise chunking (UPipe)." By showing, one slide at a time, which part of memory each stage cuts, the talk conveys that long-context training advances not through a single invention but through layered engineering. As a counterpart to Tanishq Kumar, who argued "speed = capability" on the inference side, this talk demonstrates on the training side that "if you understand the accounting of memory, you can go farther on the same hardware." The fact that the primary sources (published paper + code) are all available also raises its archive value.
Points of focus
The observation "the GPU is already saturated" is the key
The core of UPipe is not new mathematics but the observation that "just computing one set of heads saturates the GPU's compute capacity within a single iteration." If the GPU is already saturated, there's little point in processing the remaining heads simultaneously; working through them in small batches over time and reusing the buffer saves more memory. The method itself embodies Ryabinin's closing lesson: find room for optimization from actual measurements of compute resources (the profiler).
Video outline
- (00:00) Self-introduction (Together AI, VP of R&D), overview of Together AI (cloud / fine-tuning / inference)
- (01:53) Main topic — training, customization, fine-tuning, interest in long context
- (02:15) Why long context — agents and video generation, temporal consistency
- (03:42) Two bottlenecks — quadratic compute and linearly growing memory (Hugging Face blog example)
- (04:46) The scenario — Llama3-8B / 3 million tokens / 8×H100, OOM first
- (05:43) Split parameters with FSDP, still overflows on activations
- (06:16) DeepSpeed Ulysses (context parallelism), about 8x reduction
- (07:48) Activation checkpointing for about another 8x
- (08:38) CPU offloading (Unsloth), offload about 37GB
- (09:30) Tiling of element-wise computations, 3 million tokens becomes possible
- (10:16) Untied Ulysses (UPipe) — head-wise chunking
- (11:43) Results — 8B / 32B, 5 million tokens, chunk size and speed
- (13:09) Lesson — bottlenecks show up in unexpected places, PyTorch profiler
- (13:58) Q&A — QKV and the allocation of huge tensors
Related links
- Paper "Untied Ulysses" (arXiv 2602.21196, Together AI, 2026-02)
- Code (GitHub: togethercomputer/Untied-Ulysses)
- Together AI official
- AI Engineer talk video (YouTube)
Glossary
- context parallelism
- Parallelization that splits a long sequence across multiple GPUs for processing. Instead of every GPU computing the attention for the entire sequence, Microsoft's DeepSpeed Ulysses computes different heads on different GPUs and communicates the necessary activations among them. Because each GPU handles its own heads while computing the attention over the full sequence, it can also keep using the best implementations like FlashAttention.
- Untied Ulysses (UPipe)
- Together AI's development of context parallelism (arXiv 2602.21196). When multiple heads are assigned to a single GPU, it splits them into small chunks and iterates over the time dimension, reusing small buffers. Compared to allocating one giant buffer as before, it saves activation memory and cuts attention's intermediate tensors by up to 87.5% on a 32B model.
- FSDP (Fully Sharded Data Parallelism)
- PyTorch's fully sharded data parallelism. The model's parameters, gradients, and optimizer states are split across multiple GPUs and held in pieces. Picture 8 GPUs each carrying 1/8 of a single huge book. The model's memory drops sharply, but it can still overflow on the attention activations.
- activation checkpointing / CPU offloading
- Activation checkpointing is a technique that, rather than keeping the intermediate activations of the forward pass, recomputes them during the backward pass to cut memory. CPU offloading is a technique that moves part of the activations off the GPU to the CPU and retrieves them just before they are needed (this talk noted Unsloth implemented it first). Both buy memory, but in exchange for the cost of recomputation or transfer.
- quadratic compute and linear memory
- Because a transformer's attention computes the all-pairs interactions among every element in the sequence, the compute grows with the square of the sequence length. Memory, on the other hand, grows linearly with the sequence length. In long-context training, you need to keep both of these in check with various techniques (parallelization, recomputation, offloading, tiling).
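The recomputation trade-off named in the activation checkpointing entry above can be sketched on a toy chain of scalar layers. Everything here is an assumption made for illustration (the tanh layers, the function names, the segment length); real frameworks checkpoint whole transformer blocks, but the bookkeeping follows the same pattern: keep only segment-boundary inputs, then recompute each segment's inputs during the backward pass.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def backward_through(weights, inputs, grad_out):
    """Backprop through layers given each layer's input.

    Returns (gradients w.r.t. weights, gradient w.r.t. the first input).
    """
    grads = [0.0] * len(weights)
    g = grad_out
    for i in reversed(range(len(weights))):
        y = layer(weights[i], inputs[i])
        local = 1.0 - y * y  # derivative of tanh(z) at z = w * x
        grads[i] = g * local * inputs[i]
        g = g * local * weights[i]
    return grads, g


def grads_full(weights, x):
    """Baseline: keep every layer input alive until the backward pass."""
    inputs = []
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grads, _ = backward_through(weights, inputs, 1.0)
    return grads, len(inputs)


def grads_checkpointed(weights, x, segment):
    """Keep only segment-boundary inputs; recompute the rest in backward."""
    checkpoints = []
    for start in range(0, len(weights), segment):
        checkpoints.append(x)
        for w in weights[start:start + segment]:
            x = layer(w, x)
    grads = [0.0] * len(weights)
    g, peak_stored = 1.0, len(checkpoints)
    for idx in reversed(range(len(checkpoints))):
        start = idx * segment
        seg_w = weights[start:start + segment]
        inputs, h = [], checkpoints[idx]
        for w in seg_w:  # recompute this segment's inputs from its checkpoint
            inputs.append(h)
            h = layer(w, h)
        peak_stored = max(peak_stored, len(checkpoints) + len(inputs))
        seg_grads, g = backward_through(seg_w, inputs, g)
        grads[start:start + segment] = seg_grads
    return grads, peak_stored


weights = [0.5 + 0.05 * i for i in range(16)]  # 16 toy layers (assumed)
full, full_stored = grads_full(weights, 0.3)
ckpt, ckpt_stored = grads_checkpointed(weights, 0.3, segment=4)
print(full_stored, ckpt_stored)  # 16 8
```

The gradients match, the number of stored activations drops from 16 to 8, and the price is one extra forward pass through each segment.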