The Rise of Agentic Data Generation

Community Article Published July 15, 2024


With the consolidation of LLM architectures, the quality of training data has become the most important factor in creating state-of-the-art models. This is true for both pre-training and post-training, where instruction datasets have a major impact on the final model. Two innovative approaches have recently emerged to address the challenge of generating high-quality instruction datasets for post-training LLMs: AgentInstruct and Arena Learning. Both frameworks come from Microsoft Research and leverage multiple LLMs to create and refine samples.

In this article, I want to explore both methods, analyze their similarities and differences, and see how we could combine them in a single end-to-end framework.

🤖 AgentInstruct: A Multi-Agent Approach

AgentInstruct is an agentic framework by Mitra et al. (2024), designed to generate large-scale, diverse, and high-quality synthetic data. The framework uses a sophisticated pipeline that transforms raw text into refined instructions through multiple stages of processing. In the paper, the agents seem to be based on GPT-4, which is also used to evaluate data quality and hallucinations in some contexts.


Figure from the AgentInstruct paper.

The AgentInstruct pipeline consists of four main steps:

  • Seed Collection: Assemble a diverse collection of raw seeds, such as textbook chapters, web articles, and code snippets. These seeds serve as the foundation for generating new instructions.
  • Content Transformation: One or more specialized agents modify each seed into an intermediate representation that simplifies instruction creation. These agents are designed to perform tasks like generating argument passages, debates, conversations, meeting transcripts, poems, satirical content, etc.
  • Seed Instruction Generation: Multiple agents take the transformed seed and generate diverse instructions based on a pre-defined taxonomy of instruction types. For example, in the domain of reading comprehension, the taxonomy includes 43 question types, ranging from literal comprehension to critical analysis and inference.
  • Instruction Refinement: The final stage involves iteratively enhancing the complexity and quality of the generated instructions. This is achieved through suggester-editor agent pairs. Suggester agents propose ways to increase instruction complexity, while editor agents modify the instructions accordingly.

To get a better idea of what each stage produces, I recommend reading the examples provided in the paper.

Each flow in the AgentInstruct pipeline consists of multiple agents powered by LLMs. These agents can be equipped with tools like search APIs or code interpreters to enhance their capabilities. The roles of these agents are carefully defined in their system messages to ensure they perform their specific tasks effectively.
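To make this concrete, here is a minimal sketch of what a single flow could look like in Python. It assumes an OpenAI-compatible client and a GPT-4-class model, as in the paper, but the system messages, helper names, and the toy taxonomy are illustrative placeholders rather than the authors' actual prompts.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint
MODEL = "gpt-4o"   # placeholder for the GPT-4-class model used in the paper

def chat(system: str, user: str) -> str:
    """Single-turn call to the LLM backing an agent."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

# Content Transformation: turn a raw seed into an intermediate representation.
def transform_seed(seed_text: str) -> str:
    system = "Rewrite the provided text as a short debate between two experts."
    return chat(system, seed_text)

# Seed Instruction Generation: create instructions following a (toy) taxonomy.
TAXONOMY = ["literal comprehension", "critical analysis", "inference"]  # illustrative subset

def generate_instructions(passage: str) -> list[str]:
    instructions = []
    for question_type in TAXONOMY:
        system = f"Write one {question_type} question grounded in the provided passage."
        instructions.append(chat(system, passage))
    return instructions

# Instruction Refinement: a suggester agent proposes edits, an editor agent applies them.
def refine(instruction: str, rounds: int = 2) -> str:
    for _ in range(rounds):
        suggestions = chat(
            "You are a suggester agent. Propose concrete ways to make the instruction more complex.",
            instruction,
        )
        instruction = chat(
            "You are an editor agent. Apply the suggestions and return only the revised instruction.",
            f"Instruction:\n{instruction}\n\nSuggestions:\n{suggestions}",
        )
    return instruction

seed = "Photosynthesis converts light energy into chemical energy stored in glucose."
dataset = [refine(ins) for ins in generate_instructions(transform_seed(seed))]
```

Each agent here is just an LLM call with a dedicated system message; equipping an agent with a search API or code interpreter would mean wrapping these calls in a tool-use loop.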

The authors of AgentInstruct implemented flows for 17 different skills, each with multiple subcategories. These skills cover a wide range of areas, including reading comprehension, question answering, coding, retrieval augmented generation, creative writing, tool use, and web control.

Using this comprehensive pipeline, the researchers generated approximately 22 million instructions. They combined this synthetic data with 3.8 million instructions from other sources to create a dataset of 25.8 million paired instructions. This dataset was then used to fine-tune the Mistral-7b model, resulting in the creation of the Orca-3 model.

⚔️ Arena Learning: A Competitive Refinement Approach

Arena Learning by Luo, Sun, et al. (2024) takes a different approach to generating high-quality instruction data. Instead of creating instructions from scratch, it focuses on refining existing instruction datasets through a simulated competitive environment. It is not an agentic framework, since the models are not given tools, but it could easily be transformed into one.


Figure from the Arena Learning paper.

The key components of the Arena Learning pipeline are:

  • Offline Pairwise LLM Arena: Arena Learning creates a simulated arena where multiple LLMs compete against each other on a large set of instruction data. A judge LLM (meta-llama/Meta-Llama-3-70B-Instruct) evaluates the responses from competing models for each instruction, providing rankings, scores, and explanations. This process effectively simulates human evaluation but at a much larger scale and lower cost (a minimal sketch of this battle loop follows the list).

  • Data Collection and Preprocessing: The framework starts with a large corpus of conversational data collected from various open sources. This data goes through filtering, cleaning, and deduplication. Instructions that are too short, illegal/toxic, or too similar to benchmark test sets are removed. The refined dataset is then split into multiple parts for iterative training.

  • Iterative Battle and Model Evolution: The process involves multiple rounds of battles and training:

    1. An initial model (WizardLM-β-SFT-I0) is trained on a subset of data.
    2. This model competes against other state-of-the-art LLMs on another data subset.
    3. Instances where WizardLM-β loses are collected, with the winning model's response used as the target for fine-tuning.
    4. The process repeats for multiple iterations, with each iteration potentially using different training strategies (SFT, DPO, PPO).
  • Training Strategies: Arena Learning employs multiple training strategies to improve the model:

    • Supervised Fine-Tuning (SFT): Uses battle results to fine-tune the model on instances where it performed poorly.
    • Direct Preference Optimization (DPO): Treats win/loss responses as choice/reject pairs for training.
    • Proximal Policy Optimization (PPO): Uses battle results to train both a reward model and the language model.
  • WizardArena Evaluation: The authors create an offline test set (WizardArena) with diverse and hard subsets. This is used to evaluate models through pairwise battles, with results used to compute Elo rankings. The evaluation closely aligns with human-based arenas but is much faster and cheaper.

  • Data Selection: The pipeline uses various strategies to select high-quality training data, such as threshold-based filtering to control data size and quality, focusing on instances where the model underperforms, and gradually shifting towards more complex data in later iterations.
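Here is a minimal sketch of the battle loop described above. It reuses the chat() helper from the previous sketch; the judge prompt, preprocessing rules, and scoring format are simplified stand-ins for what the paper actually uses, and target_model_answer / rival_answer stand for arbitrary answer-generation functions.

```python
import random

def preprocess(instructions: list[str], min_words: int = 5) -> list[str]:
    """Toy filtering/deduplication: drop very short prompts and exact duplicates."""
    seen, kept = set(), []
    for ins in instructions:
        if len(ins.split()) >= min_words and ins not in seen:
            seen.add(ins)
            kept.append(ins)
    return kept

def judge(instruction: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge LLM which answer is better; returns 'A' or 'B'."""
    system = (
        "You are an impartial judge. Compare the two answers to the instruction "
        "and reply with a single letter: A or B."
    )
    swapped = random.random() < 0.5  # randomize positions to reduce position bias
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = chat(system, f"Instruction:\n{instruction}\n\n[A]\n{first}\n\n[B]\n{second}")
    winner = "A" if verdict.strip().upper().startswith("A") else "B"
    if swapped:  # map the verdict back to the original order
        winner = "B" if winner == "A" else "A"
    return winner

def battle(instructions, target_model_answer, rival_answer):
    """Collect SFT targets and DPO pairs from the battles the target model loses."""
    sft_data, dpo_data = [], []
    for ins in preprocess(instructions):
        ours, theirs = target_model_answer(ins), rival_answer(ins)
        if judge(ins, ours, theirs) == "B":  # the target model lost this battle
            sft_data.append({"prompt": ins, "completion": theirs})
            dpo_data.append({"prompt": ins, "chosen": theirs, "rejected": ours})
    return sft_data, dpo_data
```

In the actual pipeline, this loop runs against several rival models and over multiple iterations, with the collected data feeding the SFT, DPO, and PPO stages described above.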


Figure from the Arena Learning paper.

This framework allows for multiple iterations of battles and training, as illustrated with WizardLM-β. The model's capabilities are progressively strengthened, particularly in complex tasks. The process results in significant gains in Elo rankings, MT-bench scores, and other evaluation metrics.
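The Elo rankings mentioned above can be computed directly from these pairwise outcomes. The snippet below is the textbook Elo update rather than the exact procedure from the paper, with illustrative model names.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str, score_a: float, k: float = 32) -> None:
    """score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a tie."""
    exp_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (score_a - exp_a)
    ratings[model_b] += k * ((1.0 - score_a) - (1.0 - exp_a))

ratings = {"WizardLM-beta": 1000.0, "rival-model": 1000.0}
update_elo(ratings, "WizardLM-beta", "rival-model", score_a=1.0)  # one win for WizardLM-beta
```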

Arena Learning focuses on improving the areas where the model under training is currently weakest. A nice feature is that it doesn't require particularly powerful models like Claude 3.5 Sonnet or GPT-4o: models of a similar level can outperform each other on specific tasks and domains, and can be better suited to certain prompt styles. This means the entire pipeline can be deployed using open-weight models, which is a big advantage if you already have a high-quality infrastructure.

🪄 ArenaInstruct: Combining AgentInstruct and Arena Learning

While both AgentInstruct and Arena Learning aim to generate high-quality data for post-training language models, they take fundamentally different approaches to achieve this goal. Understanding how they differ, as well as their strengths and weaknesses, is a good first step toward seeing how we could combine them. I selected four points I want to focus on:

  • Data Generation: AgentInstruct starts from raw text, generating instructions from scratch through a multi-stage pipeline. This allows for the creation of entirely new content, potentially leading to greater diversity and novelty in the generated instructions. On the other hand, Arena Learning refines existing instruction datasets through simulated battles between models. This method leverages the quality of existing datasets while improving upon them through competitive evaluation.

  • Data Quality: AgentInstruct relies on suggester-editor agent pairs for iterative refinement of instructions. This approach allows for fine-grained control over the complexity and quality of generated instructions. Arena Learning, in contrast, uses an LLM-as-a-judge to evaluate responses in simulated battles. It means that the entire data quality process is handled by a single model.

  • Diversity and Complexity: AgentInstruct explicitly (i.e., manually) designs for diversity through a taxonomy of instruction types and multiple transformation agents. This structured approach ensures coverage across a wide range of skills and instruction types. Arena Learning's diversity comes from the variety of competing models and initial instruction datasets. While this may lead to less structured diversity, it could potentially capture more natural variations in instruction styles.

  • Flexibility: AgentInstruct's pipeline allows for easy addition of new seed types and instruction categories, making it highly adaptable to new domains and tasks. Arena Learning's iterative battle process enables continuous improvement of the target model, potentially allowing it to adapt more quickly to new challenges and competing models.

Based on this comparison, it's not too difficult to see how we could leverage the advantages of each framework. For instance, taxonomy-based data generation is more steerable and could be improved upon by Arena Learning. But we could also use feedback signals to improve this first step over multiple iterations.


Here's how such a hybrid approach might work (a minimal sketch follows the list):

  1. AgentInstruct Instruction Generation: Use AgentInstruct to create a broad and diverse base of instructions (no answers!) from raw text. This would ensure wide coverage of tasks and domains that are relevant for our use cases.
  2. Arena Learning Answer Generation: Apply Arena Learning's competitive battle approach to refine and select the highest quality answers from a pool of models. This would combine AgentInstruct's ability to generate novel content with Arena Learning's robust quality control mechanism.
  3. Data Quality Evaluation: Instead of relying on a single LLM-as-a-judge, we can use reward models or an LLM-as-a-jury to improve the data selection process.
  4. Diversity Feedback: Use insights from Arena Learning battles to dynamically update AgentInstruct's instruction taxonomy. This would focus the generation process on producing more of the instruction types that prove most challenging or useful in real-world scenarios.
  5. Complexity Feedback: Leverage Arena Learning's performance metrics to identify areas where instructions are too easy or too difficult. Use this information to guide AgentInstruct's complexity refinement process, ensuring a well-balanced dataset that challenges the model appropriately over several iterations.
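To illustrate the feedback loop, here is a rough sketch of how the pieces could fit together. It is hypothetical glue code: it reuses transform_seed, refine, chat, TAXONOMY, and judge from the earlier sketches, and assumes a student model plus a pool of rival answer-generation functions.

```python
from collections import Counter

def jury_vote(instruction: str, answer_a: str, answer_b: str, n_judges: int = 3) -> str:
    """LLM-as-a-jury: majority vote over several judge calls instead of a single judge."""
    votes = Counter(judge(instruction, answer_a, answer_b) for _ in range(n_judges))
    return votes.most_common(1)[0][0]

def hybrid_iteration(seeds, student_answer, rival_answers, taxonomy_weights):
    """One iteration of the hybrid loop: generate instructions, battle answers, feed back."""
    dataset = []
    losses_per_type = Counter()

    for seed in seeds:
        passage = transform_seed(seed)                   # AgentInstruct: content transformation
        for question_type in TAXONOMY:                   # AgentInstruct: taxonomy-driven generation
            system = f"Write one {question_type} question grounded in the provided passage."
            instruction = refine(chat(system, passage))  # suggester-editor refinement

            # Arena Learning: the student's answer battles a pool of rivals, judged by a jury.
            best = student_answer(instruction)
            for rival in rival_answers:
                challenger = rival(instruction)
                if jury_vote(instruction, best, challenger) == "B":
                    best = challenger
                    losses_per_type[question_type] += 1  # rough difficulty signal
            dataset.append({"prompt": instruction, "completion": best})

    # Diversity/complexity feedback: instruction types that lose most often get a higher
    # weight, so the next iteration can sample them more (the sampling itself is omitted here).
    for question_type, losses in losses_per_type.items():
        taxonomy_weights[question_type] = taxonomy_weights.get(question_type, 1.0) + losses
    return dataset, taxonomy_weights
```

The feedback step is deliberately simplistic: in practice, the weights would steer which question types and complexity levels AgentInstruct generates in the next round, as described in points 4 and 5 above.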

By combining these approaches, we can create a powerful feedback loop between instruction generation and evaluation. This hybrid framework would benefit from AgentInstruct's ability to generate novel, diverse content and Arena Learning's competitive quality control and model improvement process. The result would be a more robust, effective, and continuously improving post-training dataset for LLMs.

Conclusion

This article explored two recent approaches to synthetic data generation: AgentInstruct and Arena Learning. We proposed a hybrid solution that combines AgentInstruct's structured, taxonomy-based methodology with Arena Learning's iterative refinement using multiple LLMs. This combination leverages the strengths of both frameworks, allowing for the systematic generation of diverse data while enabling continuous improvement of the underlying taxonomy through feedback from the LLM pool. I feel like we might lose some quality by removing the suggester-editor agent pairs. Let me know if you have better ideas.

Still, data quality evaluation remains a significant challenge for perfecting this approach. The current reliance on models like GPT-4 or Llama 3 70B Instruct as judges is imperfect and has known limitations (see my quick review here). Improving the quality assessment stage could lead to more efficient datasets, achieving better performance with fewer samples. To learn more about how to create high-quality datasets, check out my GitHub repo 💾 LLM Datasets.