🇩🇪 German SFT and DPO datasets Collection Datasets that can be used for LLM training with axolotl, trl or llama_factory. • 30 items • Updated May 27 • 8
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published 14 days ago • 73
Article Fine-tuning Florence-2 - Microsoft's Cutting-edge Vision Language Models 15 days ago • 131
Article BM25 for Python: Achieving high performance while simplifying dependencies with *BM25S*⚡ By xhluca • about 4 hours ago • 25
GenQA: Generating Millions of Instructions from a Handful of Prompts Paper • 2406.10323 • Published 25 days ago • 5
Embedding Model Datasets Collection A curated subset of the datasets that work out of the box with Sentence Transformers: https://huggingface.co/datasets?other=sentence-transformers • 67 items • Updated 6 days ago • 46
4M Models Collection Multimodal models from https://4m.epfl.ch/ • 14 items • Updated 24 days ago • 29
GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks Paper • 2406.12925 • Published 25 days ago • 18
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published 19 days ago • 76
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models Paper • 2404.18796 • Published Apr 29 • 67
TabuLa-8B Collection Training, eval suite, and model from the paper "Large Scale Transfer Learning for Tabular Data via Language Modeling" https://arxiv.org/abs/2406.12031 • 4 items • Updated 20 days ago • 8
Depth Anything v2 Release Collection A comprehensive collection on DAv2 • 5 items • Updated 21 days ago • 9
FP8 LLMs for vLLM Collection Accurate FP8 quantized models by Neural Magic, ready for use with vLLM! • 16 items • Updated about 17 hours ago • 18
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing Paper • 2406.08464 • Published 27 days ago • 48
codestral-text2cypher Collection codestral finetuned for text2cypher • 3 items • Updated 29 days ago • 2
Local Function Calling Gems Collection These are the best function-calling LLMs one can run on less than 64GB VRAM/Unified Memory. I use these on an M1 Max MacBook 64GB. • 6 items • Updated 8 days ago • 3
Qwen2 Collection Qwen2 language models, including pretrained and instruction-tuned models in 5 sizes: 0.5B, 1.5B, 7B, 57B-A14B, and 72B. • 29 items • Updated Jun 6 • 237
DeTikZify Collection Synthesizing Graphics Programs for Scientific Figures and Sketches with TikZ • 9 items • Updated Jun 3 • 2
Article Releasing Common Corpus: the largest public domain dataset for training LLMs By Pclanglais • Mar 20 • 12
Article How to directly access 150k+ Hugging Face Datasets with DuckDB and query using GPT-4o By chilijung • May 31 • 10
Article ⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2 By burtenshaw • Jun 3 • 21
sentence-transformers-from-synthetic-data Collection Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 4 items • Updated 18 days ago • 20
Article Training and Finetuning Embedding Models with Sentence Transformers v3 May 28 • 116
Granite Code Models: A Family of Open Foundation Models for Code Intelligence Paper • 2405.04324 • Published May 7 • 14
Article GPU Poor Savior: Revolutionizing Low-Bit Open Source LLMs and Cost-Effective Edge Computing By NicoNico • May 25 • 9
DiscoLeo 8B: Llama3 for German Collection Continued Pretraining on Llama3 8B to improve German linguistic capabilities. A collection of base and fine-tuned models and variants. • 5 items • Updated May 25 • 14
DiscoLeo 8B quants Collection A collection of different quantizations of the DiscoLeo models. • 3 items • Updated May 25 • 3
C4AI Aya 23 Collection Aya 23 is an open weights research release of an instruction fine-tuned model with highly advanced multilingual capabilities. • 3 items • Updated May 23 • 40
Article ⚗️ 🧑🏼‍🌾 Let's grow some Domain Specific Datasets together By burtenshaw • Apr 29 • 27
C4AI Command R Plus Collection C4AI Command R+ is an open weights research release of a 104 billion parameter model with highly advanced capabilities. • 3 items • Updated May 23 • 23
Phi-3 Collection Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 22 items • Updated May 31 • 362
CommonCatalog Collection Common Catalog, a dataset with Creative Commons licensed images and machine-generated caption pairs • 8 items • Updated May 16 • 13
M2-BERT Embeddings Collection Models and Datasets for M2-BERT and LoCoV1 • 10 items • Updated May 22 • 2
Granite Code Models Collection A series of code models trained by IBM and released under the Apache 2.0 license. We release both the base pretrained and instruct models. • 20 items • Updated 10 days ago • 145
Article Saving Memory Using Padding-Free Transformer Layers during Finetuning By mayank-mishra • 28 days ago • 8
Article Introducing Idefics2: A Powerful 8B Vision-Language Model for the community Apr 15 • 146
Article 🧑‍⚖️ "Replacing Judges with Juries" using distilabel By alvarobartt • May 3 • 17
llama 3 self-align experiments Collection Replicating the pipeline for StarCoder-2 Instruct on Llama-3-8B with some tweaks https://huggingface.co/blog/sc2-instruct • 4 items • Updated May 9 • 6
Article StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation Apr 29 • 70
Article Post-OCR-Correction: 1 billion words dataset of automated OCR correction by LLM By Pclanglais • Apr 26 • 12
Article LLM Comparison/Test: Llama 3 Instruct 70B + 8B HF/GGUF/EXL2 (20 versions tested and compared!) By wolfram • Apr 24 • 53