Long-Context & Agent Training: The sequence length was pushed to 128K, and the model was trained on long documents and large-scale synthetic agent trajectories.
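As a rough illustration of what such a stage involves, the configuration sketch below is entirely hypothetical; the field names and mixture ratios are invented for illustration and are not taken from the report.

```python
# Hypothetical configuration for the long-context / agent training stage.
# Every field name and value here is an illustrative assumption, not the report's recipe.
long_context_stage = {
    "max_sequence_length": 128 * 1024,          # context window extended to 128K tokens
    "data_mixture": {
        "long_documents": 0.5,                  # naturally long texts (books, papers, code repos)
        "synthetic_agent_trajectories": 0.3,    # large-scale tool-use and browsing rollouts
        "short_context_replay": 0.2,            # assumed replay so short-context skills don't regress
    },
}
print(long_context_stage["max_sequence_length"])  # 131072
```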
The post-training process was designed to refine the model's abilities in two main stages: Expert Training (creating specialists in Reasoning, Agent, and General chat) and Unified Training (integrating these experts into one comprehensive model with self-distillation).
SFT was used to give the expert models a "cold start" with basic chat and reasoning skills. In the unified training stage, SFT served to distill the capabilities from the different expert models into the final hybrid reasoning model.
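The sketch below illustrates this two-stage flow under assumed, simplified interfaces; the `Model` class and the `sft`/`rl` helpers are stand-ins, not the authors' code. Experts are cold-started with SFT and refined with RL, and their responses then become the SFT targets that distill all capabilities into one unified model.

```python
# Hypothetical sketch of the two-stage post-training pipeline described above.
# The Model class and the sft/rl helpers are toy stand-ins, not the real training code.

class Model:
    def __init__(self, name): self.name = name
    def generate(self, prompt): return f"[{self.name}] response to: {prompt}"

def sft(model, data, tag):
    # Supervised fine-tuning stand-in: "trains" on (prompt, target) pairs.
    return Model(f"{model.name}+sft({tag}, n={len(data)})")

def rl(model, tag):
    # Reinforcement-learning stand-in: refines a model on task-specific rewards.
    return Model(f"{model.name}+rl({tag})")

def post_train(base, cold_start, prompts):
    # Stage 1: expert training, one specialist per domain, each cold-started with SFT then RL.
    experts = {d: rl(sft(base, pairs, d), d) for d, pairs in cold_start.items()}
    # Stage 2: unified training, expert outputs become SFT targets (self-distillation),
    # followed by a final general RL pass over the merged model.
    distill = [(p, experts[d].generate(p)) for d, ps in prompts.items() for p in ps]
    return rl(sft(base, distill, "distill"), "general")

unified = post_train(
    Model("base"),
    cold_start={
        "reasoning": [("What is 2+2?", "4")],
        "agent": [("Find the weather in Paris", "<tool_call>web_search(...)</tool_call>")],
        "general": [("Hi!", "Hello! How can I help?")],
    },
    prompts={"reasoning": ["Prove ..."], "agent": ["Browse ..."], "general": ["Chat ..."]},
)
print(unified.name)
```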
Slime is an open-source, SGLang-native post-training framework designed to scale RL training with flexibility and efficiency. A key architectural feature is its support for a flexible hybrid training architecture that can operate in either a "colocated, synchronous mode" or a "disaggregated, asynchronous mode", and the choice of mode is tied directly to the nature of the task. The synchronous mode, in which the training and inference engines reside on the same workers to maximize GPU utilization, proved more effective for reasoning tasks like math and code generation. The researchers explain that the asynchronous mode was better suited to agentic tasks, where data generation can be slow. This disaggregated mode decouples the training and rollout processes, allowing agent environments to continuously generate data without being stalled by the training cycle.
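To make the difference concrete, here is a toy sketch, not slime's actual API, of the control flow in each mode: the synchronous version alternates rollout and training on the same worker, while the asynchronous version lets a rollout thread keep filling a queue that the trainer drains.

```python
# Toy illustration of the two training modes; this is NOT slime's real interface,
# just the control-flow difference under assumed rollout/train stand-ins.
import queue, threading, time

def rollout(step):                 # stand-in for generating trajectories with the inference engine
    time.sleep(0.01); return f"batch-{step}"

def train(batch):                  # stand-in for a gradient update on the training engine
    time.sleep(0.01); print("trained on", batch)

def colocated_synchronous(steps):
    # Training and inference share the same workers: generate a batch, then immediately
    # train on it, so the GPUs run one phase at a time at full utilization.
    for step in range(steps):
        train(rollout(step))

def disaggregated_asynchronous(steps):
    # Rollout workers stream trajectories into a queue while the trainer consumes them,
    # so slow agent environments never stall the training loop.
    q = queue.Queue(maxsize=4)
    producer = threading.Thread(
        target=lambda: [q.put(rollout(s)) for s in range(steps)], daemon=True)
    producer.start()
    for _ in range(steps):
        train(q.get())

colocated_synchronous(3)
disaggregated_asynchronous(3)
```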
The team developed a suite of specialized RL techniques to effectively train the models.
Reasoning RL: To avoid getting stuck with reward signals that were all 0s or 1s, the team used a two-stage difficulty-based curriculum, moving from moderate to extremely difficult problems as the model improved. They also found that a single-stage RL process at the maximum 64K output length was more effective than progressively increasing the length, as shorter stages could cause the model to "unlearn" its long-context abilities.
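A minimal sketch of the curriculum idea follows, assuming problems can be screened by an estimated pass rate; the thresholds and the simulated solver are illustrative, not the team's exact recipe.

```python
# Hedged sketch of a difficulty-based curriculum for reasoning RL.
# The screening rule and two-stage split are assumptions made for illustration.
import random

def pass_rate(problem, attempts=8):
    # Estimate difficulty by attempting the problem several times and counting successes;
    # here success is simulated from a known per-problem difficulty value.
    return sum(random.random() > problem["difficulty"] for _ in range(attempts)) / attempts

def build_curriculum(problems):
    rates = {p["id"]: pass_rate(p) for p in problems}
    # Drop problems whose rewards would be uninformative (always 1 or always 0).
    useful = [p for p in problems if 0.0 < rates[p["id"]] < 1.0]
    # Stage 1 trains on the easier (moderate) half, stage 2 on the hardest half.
    useful.sort(key=lambda p: rates[p["id"]], reverse=True)
    mid = len(useful) // 2
    return useful[:mid], useful[mid:]

problems = [{"id": i, "difficulty": random.random()} for i in range(100)]
stage1, stage2 = build_curriculum(problems)
print(f"{len(stage1)} moderate problems first, then {len(stage2)} hard problems")
```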
Agentic RL: This focused on web-search and code-generation agents where actions could be automatically verified, providing dense and reliable reward signals. The training involved an iterative self-distillation approach, where an RL-trained model was used to generate better data for a new SFT model, which was then further trained with RL.
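For the code-generation side, a verifiable reward can be as simple as executing the generated program against its unit tests. The sketch below assumes such a test-based check and is not the team's actual harness; a real one would sandbox execution far more carefully.

```python
# Minimal sketch of a verifiable reward for a code-generation agent: reward is 1.0
# only if the generated code passes its unit tests.  Execution here is unsandboxed,
# which is a simplification for illustration only.
import subprocess, sys, tempfile

def code_reward(generated_code: str, test_code: str, timeout: float = 5.0) -> float:
    program = generated_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0

solution = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(code_reward(solution, tests))   # 1.0 if the tests pass
```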
General RL: To improve overall performance, a multi-source feedback system combined rule-based feedback, human feedback (RLHF), and model-based feedback (RLAIF). This included targeted training to improve instruction following and fix pathological behaviors like repetition or formatting mistakes.
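Below is a hedged sketch of how such a multi-source signal might be combined, with invented weights and rules standing in for the real pipeline.

```python
# Hedged sketch of a multi-source reward: a rule-based check, a human-preference reward
# model (RLHF-style), and an LLM judge (RLAIF-style) mixed into one scalar.
# The weights and rules are illustrative assumptions, not values from the report.

def rule_reward(response: str) -> float:
    # Rule-based checks for pathological behaviors such as heavy repetition or a truncated ending.
    words = response.split()
    repetition = 1.0 - len(set(words)) / max(len(words), 1)
    ends_cleanly = response.rstrip().endswith((".", "!", "?"))
    score = 1.0
    if repetition > 0.5:
        score -= 0.5
    if not ends_cleanly:
        score -= 0.5
    return max(score, 0.0)

def combined_reward(prompt, response, reward_model, ai_judge, weights=(0.3, 0.4, 0.3)):
    # reward_model: scorer trained on human preference data (the RLHF signal).
    # ai_judge: a strong LLM grading the response against a rubric (the RLAIF signal).
    w_rule, w_human, w_ai = weights
    return (w_rule * rule_reward(response)
            + w_human * reward_model(prompt, response)
            + w_ai * ai_judge(prompt, response))

# Toy stand-ins for the learned scorers, just to make the sketch runnable.
fake_reward_model = lambda p, r: 0.8
fake_ai_judge = lambda p, r: 0.7
print(combined_reward("Explain RL.", "RL optimizes expected reward over a policy.",
                      fake_reward_model, fake_ai_judge))
```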