Daniel van Strien

davanstrien

https://danielvanstrien.xyz/

vanstriendaniel

davanstrien

AI & ML interests

Machine Learning Librarian

Articles

Introducing Synthetic Data Workshop: Your Gateway to Easy Synthetic Dataset Creation

15 days ago

• 11

Data Is Better Together: A Look Back and Forward

16 days ago

• 14

Synthetic dataset generation techniques: generating custom sentence similarity data

May 23

• 13

Synthetic dataset generation techniques: Self-Instruct

May 15

• 5

Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

May 7

• 7

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Mar 20

• 36

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 14

Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub

Aug 2, 2023

The Hugging Face Hub for Galleries, Libraries, Archives and Museums

Jun 12, 2023

• 1

Introducing BERTopic Integration with Hugging Face Hub

May 31, 2023

• 2

Jupyter X Hugging Face

Mar 23, 2023

• 2

Image search with 🤗 datasets

Mar 16, 2022

• 5

Organizations

davanstrien's activity

upvoted an article 1 day ago

Article

Image search with 🤗 datasets

Mar 16, 2022

• 5

upvoted 2 collections 1 day ago

Medieval HTR

Collection

This is a collection of HTR data and models • 2 items • Updated 1 day ago • 2

Medieval NER

Collection

This is a collection of Medieval NER datasets and models. • 7 items • Updated 1 day ago • 2

upvoted a collection 3 days ago

Probably oasst Style Datasets

Collection

Datasets in the OpenAssistant format {"INSTRUCTION": "...", "RESPONSE": "..."} • 46 items • Updated 3 days ago • 1

upvoted a paper 4 days ago

LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives

Paper • 2407.01490 • Published 4 days ago • 1

upvoted a collection 4 days ago

Probably function calling datasets

Collection

Created using the https://huggingface.co/spaces/librarian-bots/dataset-column-search-api Space. • 38 items • Updated 3 days ago • 7

upvoted a paper 4 days ago

Show Less, Instruct More: Enriching Prompts with Definitions and Guidelines for Zero-Shot NER

Paper • 2407.01272 • Published 4 days ago • 6

upvoted 2 papers 5 days ago

APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets

Paper • 2406.18518 • Published 9 days ago • 20

Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Paper • 2406.17720 • Published 10 days ago • 7

upvoted a collection 7 days ago

Probably Alpaca Style Datasets

Collection

Datasets probably matching the alpaca format ({"instruction": "...", "input": "...", "output": "..."}) • 1944 items • Updated 4 days ago • 1

upvoted 2 papers 8 days ago

LiveBench: A Challenging, Contamination-Free LLM Benchmark

Paper • 2406.19314 • Published 8 days ago • 12

Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models

Paper • 2406.14848 • Published 15 days ago • 2

upvoted a collection 9 days ago

Probably DPO datasets

Collection

A collection of datasets that probably support DPO • 146 items • Updated 9 days ago • 8

upvoted a paper 10 days ago

DataComp-LM: In search of the next generation of training sets for language models

Paper • 2406.11794 • Published 18 days ago • 39

upvoted 2 papers 11 days ago

PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

Paper • 2406.15513 • Published 15 days ago • 1

TinyStyler: Efficient Few-Shot Text Style Transfer with Authorship Embeddings

Paper • 2406.15586 • Published 14 days ago • 2

upvoted a paper 14 days ago

GLiNER multi-task: Generalist Lightweight Model for Various Information Extraction Tasks

Paper • 2406.12925 • Published 21 days ago • 17

upvoted 2 collections 14 days ago

synthetic-data-generation-demos

Collection

A collection of demos for various approaches to synthetic data generation • 4 items • Updated 11 days ago • 10

Instruction Pre-Training

Collection

8 items • Updated 14 days ago • 24

upvoted a paper 14 days ago

Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

Paper • 2406.13542 • Published 16 days ago • 15

upvoted a paper 15 days ago

Instruction Pre-Training: Language Models are Supervised Multitask Learners

Paper • 2406.14491 • Published 15 days ago • 76

upvoted 2 papers 17 days ago

Large Scale Transfer Learning for Tabular Data via Language Modeling

Paper • 2406.12031 • Published 18 days ago • 6

ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

Paper • 2406.12793 • Published 17 days ago • 27

upvoted a collection 17 days ago

TabuLa-8B

Collection

Training, eval suite, and model from the paper "Large Scale Transfer Learning for Tabular Data via Language Modeling" https://arxiv.org/abs/2406.12031 • 4 items • Updated 17 days ago • 8

upvoted a paper 18 days ago

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

Paper • 2406.11271 • Published 19 days ago • 10

upvoted 2 papers 19 days ago

GEB-1.3B: Open Lightweight Large Language Model

Paper • 2406.09900 • Published 22 days ago • 18

ClimRetrieve: A Benchmarking Dataset for Information Retrieval from Corporate Climate Disclosures

Paper • 2406.09818 • Published 22 days ago • 2

upvoted 2 articles 22 days ago

Article

The CVPR Survival Guide: Discovering Research That's Interesting to YOU!

21 days ago

• 9

Article

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Apr 15

• 146

upvoted 2 papers 23 days ago

SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature

Paper • 2406.07835 • Published 25 days ago • 1

Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Paper • 2406.08464 • Published 23 days ago • 48

upvoted 2 papers 25 days ago

A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding

Paper • 2406.05540 • Published 27 days ago • 2

An Open and Large-Scale Dataset for Multi-Modal Climate Change-aware Crop Yield Predictions

Paper • 2406.06081 • Published 26 days ago • 1

upvoted a paper 26 days ago

Zyda: A 1.3T Dataset for Open Language Modeling

Paper • 2406.01981 • Published Jun 4 • 2

upvoted a collection 29 days ago

Qwen2

Collection

Qwen2 language models, including pretrained and instruction-tuned models of 5 sizes, including 0.5B, 1.5B, 7B, 57B-A14B, and 72B. • 29 items • Updated 29 days ago • 231

upvoted 5 collections about 1 month ago

FiftyOne-Compatible VQA Datasets

Collection

Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 6 items • Updated Jun 3 • 2

FiftyOne-Compatible Image Captioning Datasets

Collection

Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 6 items • Updated Jun 3 • 2

FiftyOne-Compatible Image Segmentation Datasets

Collection

Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 3 items • Updated Jun 3 • 2

FiftyOne-Compatible Object Detection Datasets

Collection

Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 7 items • Updated 4 days ago • 2

FiftyOne-Compatible Image Classification Datasets

Collection

Parquet formatted datasets loadable into FiftyOne with 1 line: https://docs.voxel51.com/integrations/huggingface.html#loading-datasets-from-the-hub • 14 items • Updated Jun 3 • 2

upvoted 2 articles about 1 month ago

Article

Wikipedia's Treasure Trove: Advancing Machine Learning with Diverse Data

Jun 3

• 12

Article

Training and Finetuning Embedding Models with Sentence Transformers v3

May 28

• 115

upvoted a collection about 1 month ago

Arabic NoRobots DPO Datasets

Collection

Our synthetic DPO datasets for Arabic NoRobots. • 4 items • Updated May 29 • 4

upvoted a paper about 1 month ago

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Models are Task-specific Classifiers

Paper • 2403.02839 • Published Mar 5 • 1

upvoted an article about 1 month ago

Article

⚗️ 🔥 Building High-Quality Datasets with distilabel and Prometheus 2

Jun 3

• 21

upvoted a collection about 1 month ago

sentence-transformers-from-synthetic-data

Collection

Example of using distilabel to generate synthetic triplets data for fine-tuning a Sentence Transformer model • 4 items • Updated 14 days ago • 20

upvoted a paper about 1 month ago

Retrieving Texts based on Abstract Descriptions

Paper • 2305.12517 • Published May 21, 2023 • 2

upvoted an article about 1 month ago

Article

Synthetic data: save money, time and carbon with open source

Feb 16

• 35

upvoted a paper about 2 months ago

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Paper • 2405.11143 • Published May 20 • 33

upvoted a collection about 2 months ago

Phi-3

Collection

Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 22 items • Updated May 31 • 360

upvoted 2 papers about 2 months ago

MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels

Paper • 2405.07526 • Published May 13 • 16

LoRA Learns Less and Forgets Less

Paper • 2405.09673 • Published May 15 • 81

upvoted a collection about 2 months ago

Arabic Aya DPO Datasets

Collection

Our synthetic DPO datasets for Arabic Aya. • 5 items • Updated Jun 4 • 3

upvoted 4 papers about 2 months ago

ParaNames 1.0: Creating an Entity Name Corpus for 400+ Languages using Wikidata

Paper • 2405.09496 • Published May 15 • 3

RLHF Workflow: From Reward Modeling to Online RLHF

Paper • 2405.07863 • Published May 13 • 62

Becoming self-instruct: introducing early stopping criteria for minimal instruct tuning

Paper • 2307.03692 • Published Jul 5, 2023 • 24

Self-Alignment with Instruction Backtranslation

Paper • 2308.06259 • Published Aug 11, 2023 • 38

upvoted an article about 2 months ago

Article

Introducing the Open Arabic LLM Leaderboard

May 14

• 53

upvoted 2 papers about 2 months ago

Typhoon: Thai Large Language Models

Paper • 2312.13951 • Published Dec 21, 2023 • 4

Optimizing Language Model's Reasoning Abilities with Weak Supervision

Paper • 2405.04086 • Published May 7 • 1

Data is better together

Extracting Insights from Model Cards Using Open Large Language Models

Creating open machine learning datasets? Share them on the Hugging Face Hub!