view article Article Synthetic dataset generation techniques: generating custom sentence similarity data By davanstrien • May 23 • 13
Legal datasets Collection Large corpus of legal datasets from all around the 🌎 • 5 items • Updated 11 days ago • 2
Legal models Collection An extensive collection containing models for legal practice • 3 items • Updated 11 days ago • 1
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published 14 days ago • 73
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions Paper • 2406.15877 • Published 17 days ago • 42
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs Paper • 2406.15319 • Published 17 days ago • 56
mDPO: Conditional Preference Optimization for Multimodal Large Language Models Paper • 2406.11839 • Published 21 days ago • 36
🎭 Avatars Collection The latest AI-powered technologies usher in a new era of realistic avatars! 🚀 • 47 items • Updated 4 days ago • 61
The best SDXL models Collection The best diffusion models (checkpoints) based on SDXL • 7 items • Updated 28 days ago • 4
Phi-3 Collection Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 22 items • Updated May 31 • 362
Improve Mathematical Reasoning in Language Models by Automated Process Supervision Paper • 2406.06592 • Published Jun 5 • 17
Leaderboards and benchmarks ✨ Collection Cool leaderboard spaces collection for models across modalities! Text, vision, audio, ... • 64 items • Updated 28 days ago • 69
view article Article ⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2 By burtenshaw • Jun 3 • 21
Arctic-Embed: Scalable, Efficient, and Accurate Text Embedding Models Paper • 2405.05374 • Published May 8 • 1
view article Article StarCoder2-Instruct: Fully Transparent and Permissive Self-Alignment for Code Generation Apr 29 • 70
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22 • 241
view article Article Releasing Youtube-Commons: a massive open corpus for conversational and multimodal data By Pclanglais • Apr 18 • 20
HF-curated models available on Workers AI Collection A collection of models curated with Hugging Face that can be run on Cloudflare's Workers AI serverless inference platform. • 15 items • Updated Apr 2 • 50
PDF Document / OCR Datasets Collection Document datasets with .pdf files that are usable with pixparse libraries and tools. • 2 items • Updated Mar 30 • 38
LLM Leaderboard best models ❤️🔥 Collection A daily uploaded list of models with best evaluations on the LLM leaderboard: • 264 items • Updated 16 days ago • 347
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding Paper • 2403.12895 • Published Mar 19 • 28
⛔️🔦 Provenance, Watermarking & Deepfake Detection Collection Technical tools for more control over non-consensual synthetic content • 14 items • Updated Apr 1 • 37
Improving Text Embeddings with Large Language Models Paper • 2401.00368 • Published Dec 31, 2023 • 77
Gemini: A Family of Highly Capable Multimodal Models Paper • 2312.11805 • Published Dec 19, 2023 • 44
LLM in a flash: Efficient Large Language Model Inference with Limited Memory Paper • 2312.11514 • Published Dec 12, 2023 • 255
Weight subcloning: direct initialization of transformers using larger pretrained ones Paper • 2312.09299 • Published Dec 14, 2023 • 17
Self-Evaluation Improves Selective Generation in Large Language Models Paper • 2312.09300 • Published Dec 14, 2023 • 13
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent Paper • 2312.10003 • Published Dec 15, 2023 • 32
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision Paper • 2312.09390 • Published Dec 14, 2023 • 32
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit Paper • 2312.09911 • Published Dec 15, 2023 • 52
WizardCoder: Empowering Code Large Language Models with Evol-Instruct Paper • 2306.08568 • Published Jun 14, 2023 • 28