Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation 15 days ago • 11
Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 36
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 14
Medieval HTR Collection This is a collection of HTR data and models • 2 items • Updated 1 day ago • 2
Medieval NER Collection This is a collection of Medieval NER datasets and models. • 7 items • Updated 1 day ago • 2
Probably oasst Style Datasets Collection Datasets in the OpenAssistant format {"INSTRUCTION": "...", "RESPONSE": "..."} • 46 items • Updated 3 days ago • 1
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives Paper • 2407.01490 • Published 4 days ago • 1
Probably function calling datasets Collection Created using the https://huggingface.co/spaces/librarian-bots/dataset-column-search-api Space. • 38 items • Updated 3 days ago • 7
Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER Paper • 2407.01272 • Published 4 days ago • 6
APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets Paper • 2406.18518 • Published 9 days ago • 20
Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity Paper • 2406.17720 • Published 10 days ago • 7
Probably Alpaca Style Datasets Collection Datasets probably matching the alpaca format ({"instruction": "...", "input": "...", "output": "..."}) • 1944 items • Updated 4 days ago • 1
LiveBench: A Challenging, Contamination-Free LLM Benchmark Paper • 2406.19314 • Published 8 days ago • 12
Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models Paper • 2406.14848 • Published 15 days ago • 2
Probably DPO datasets Collection A collection of datasets that probably support DPO • 146 items • Updated 9 days ago • 8
DataComp-LM: In search of the next generation of training sets for language models Paper • 2406.11794 • Published 18 days ago • 39
PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models Paper • 2406.15513 • Published 15 days ago • 1
TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings Paper • 2406.15586 • Published 14 days ago • 2
GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks Paper • 2406.12925 • Published 21 days ago • 17
synthetic-data-generation-demos Collection A collection of demos for various approaches to synthetic data generation • 4 items • Updated 11 days ago • 10
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models Paper • 2406.13542 • Published 16 days ago • 15
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published 15 days ago • 76
Large Scale Transfer Learning for Tabular Data via Language Modeling Paper • 2406.12031 • Published 18 days ago • 6
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools Paper • 2406.12793 • Published 17 days ago • 27
TabuLa-8B Collection Training, eval suite, and model from the paper "Large Scale Transfer Learning for Tabular Data via Language Modeling" https://arxiv.org/abs/2406.12031 • 4 items • Updated 17 days ago • 8
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens Paper • 2406.11271 • Published 19 days ago • 10
ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures Paper • 2406.09818 • Published 22 days ago • 2
Article The CVPR Survival Guide: Discovering Research That's Interesting to YOU! By harpreetsahota • 21 days ago • 9
Article Introducing Idefics2: A Powerful 8B Vision-Language Model for the community Apr 15 • 146
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature Paper • 2406.07835 • Published 25 days ago • 1
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Paper • 2406.08464 • Published 23 days ago • 48
A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding Paper • 2406.05540 • Published 27 days ago • 2
An Open and Large-Scale Dataset for Multi-Modal Climate Change-aware Crop Yield Predictions Paper • 2406.06081 • Published 26 days ago • 1
Qwen2 Collection Qwen2 language models: pretrained and instruction-tuned models in 5 sizes (0.5B, 1.5B, 7B, 57B-A14B, and 72B). • 29 items • Updated 29 days ago • 231
FiftyOne-Compatible VQA Datasets Collection Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 6 items • Updated Jun 3 • 2
FiftyOne-Compatible Image Captioning Datasets Collection Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 6 items • Updated Jun 3 • 2
FiftyOne-Compatible Image Segmentation Datasets Collection Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 3 items • Updated Jun 3 • 2
FiftyOne-Compatible Object Detection Datasets Collection Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 7 items • Updated 4 days ago • 2
FiftyOne-Compatible Image Classification Datasets Collection Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 14 items • Updated Jun 3 • 2
Article Wikipedia's Treasure Trove: Advancing Machine Learning with Diverse Data By frimelle • Jun 3 • 12
Article Training and Finetuning Embedding Models with Sentence Transformers v3 May 28 • 115
Arabic NoRobots DPO Datasets Collection Our synthetic DPO datasets for Arabic NoRobots. • 4 items • Updated May 29 • 4
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers Paper • 2403.02839 • Published Mar 5 • 1
Article ⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2 By burtenshaw • Jun 3 • 21
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 4 items • Updated 14 days ago • 20
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework Paper • 2405.11143 • Published May 20 • 33
Phi-3 Collection Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 22 items • Updated May 31 • 360
MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels Paper • 2405.07526 • Published May 13 • 16
Arabic Aya DPO Datasets Collection Our synthetic DPO datasets for Arabic Aya. • 5 items • Updated Jun 4 • 3
ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata Paper • 2405.09496 • Published May 15 • 3
Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning Paper • 2307.03692 • Published Jul 5, 2023 • 24
Optimizing Language Model's Reasoning Abilities with Weak Supervision Paper • 2405.04086 • Published May 7 • 1