Synthetic Data Generation - a mapama247 Collection

mapama247 's Collections

Instruction Tuning Datasets

Conversational Datasets

RLHF

Synthetic Data Generation

Synthetic Data Generation

updated about 11 hours ago

SDG papers

Textbooks Are All You Need

Paper • 2306.11644 • Published Jun 20, 2023 • 142
Textbooks Are All You Need II: phi-1.5 technical report

Paper • 2309.05463 • Published Sep 11, 2023 • 87
Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28 • 94
Instruction Pre-Training: Language Models are Supervised Multitask Learners

Paper • 2406.14491 • Published Jun 20 • 85
Improving Text Embeddings with Large Language Models

Paper • 2401.00368 • Published Dec 31, 2023 • 79
Adapting Large Language Models via Reading Comprehension

Paper • 2309.09530 • Published Sep 18, 2023 • 75
Magicoder: Source Code Is All You Need

Paper • 2312.02120 • Published Dec 4, 2023 • 79
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Paper • 2401.01335 • Published Jan 2 • 64
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Paper • 2406.08464 • Published Jun 12 • 62
WaveCoder: Widespread And Versatile Enhanced Instruction Tuning with Refined Data Generation

Paper • 2312.14187 • Published Dec 20, 2023 • 49
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Paper • 2402.13064 • Published Feb 20 • 46
Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Paper • 2401.16380 • Published Jan 29 • 47
AgentInstruct: Toward Generative Teaching with Agentic Flows

Paper • 2407.03502 • Published Jul 3 • 43
Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 40
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation

Paper • 2410.09584 • Published 4 days ago • 40
OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Paper • 2402.10176 • Published Feb 15 • 34
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Paper • 2305.07759 • Published May 12, 2023 • 31
DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

Paper • 2402.10379 • Published Feb 16 • 29
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

Paper • 2312.06585 • Published Dec 11, 2023 • 28
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning

Paper • 2307.03692 • Published Jul 5, 2023 • 24
AlpaGasus: Training A Better Alpaca with Fewer Data

Paper • 2307.08701 • Published Jul 17, 2023 • 22
Simple synthetic data reduces sycophancy in large language models

Paper • 2308.03958 • Published Aug 7, 2023 • 21
CodecLM: Aligning Language Models with Tailored Synthetic Data

Paper • 2404.05875 • Published Apr 8 • 16
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Paper • 2409.08239 • Published Sep 12 • 15
WizardLM: Empowering Large Language Models to Follow Complex Instructions

Paper • 2304.12244 • Published Apr 24, 2023 • 13
Learning to Generate Instruction Tuning Datasets for Zero-Shot Task Adaptation

Paper • 2402.18334 • Published Feb 28 • 12
Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Paper • 2408.03256 • Published Aug 6 • 10
Self-Instruct: Aligning Language Model with Self Generated Instructions

Paper • 2212.10560 • Published Dec 20, 2022 • 7
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Paper • 2305.14233 • Published May 23, 2023 • 6
M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models

Paper • 2406.16783 • Published Jun 24 • 4
Better Synthetic Data by Retrieving and Transforming Existing Datasets

Paper • 2404.14361 • Published Apr 22 • 1
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing

Paper • 2305.16635 • Published May 26, 2023 • 1
Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena

Paper • 2407.10627 • Published Jul 15 • 1
ZeroGen: Efficient Zero-shot Learning via Dataset Generation

Paper • 2202.07922 • Published Feb 16, 2022 • 1
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future

Paper • 2403.04190 • Published Mar 7
On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey

Paper • 2406.15126 • Published Jun 14
Large Language Models for Data Annotation: A Survey

Paper • 2402.13446 • Published Feb 21
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

Paper • 2306.15895 • Published Jun 28, 2023
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Paper • 2404.14445 • Published Apr 20
TarGEN: Targeted Data Generation with Large Language Models

Paper • 2310.17876 • Published Oct 27, 2023