Taylor658 posted an update 19 days ago
πŸ” A recently published technical report introduces MINT-1T, a dataset that will considerably expand open-source multimodal data. It features one trillion text tokens and three billion images and is scheduled for release in July 2024.

Researcher Affiliation:

University of Washington
Salesforce Research
Stanford University
University of Texas at Austin
University of California, Berkeley

Paper:
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
https://arxiv.org/pdf/2406.11271v1.pdf

GitHub:
https://github.com/mlfoundations/MINT-1T

Highlights:

MINT-1T Dataset: The largest open-source multimodal interleaved dataset to date, with 1 trillion text tokens & 3 billion images. πŸ“ŠπŸ–ΌοΈ
Diverse Sources: Incorporates data from HTML, PDFs, and ArXiv documents. πŸ“„πŸ“š
Open Source: Dataset and code will be released at https://github.com/mlfoundations/MINT-1T; a loading sketch follows this list. πŸŒπŸ”“
Broader Domain Representation: The mix of HTML, PDF, and ArXiv sources gives more balanced domain coverage than HTML-only corpora. πŸŒπŸ“š
Performance in Multimodal Tasks: The report finds that models trained on MINT-1T rival those trained on the previous leading open dataset, OBELICS, at 10x the scale. πŸ€–πŸ’‘
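
Quick illustration of what access could look like once the data lands on the Hugging Face Hub. This is a hedged sketch: the repo id "mlfoundations/MINT-1T-HTML" and the per-document schema are assumptions modelled on similar interleaved datasets, not confirmed details of the MINT-1T release.

```python
# Hedged sketch: stream a (hypothetical) MINT-1T subset from the Hugging Face Hub.
# The repo id and the document schema are assumptions, not confirmed release details.
from datasets import load_dataset

# Streaming avoids downloading the trillion-token corpus in full.
ds = load_dataset("mlfoundations/MINT-1T-HTML", split="train", streaming=True)

for doc in ds.take(2):
    # Documents are expected to interleave text segments with image references;
    # inspect the real field names before building a training pipeline on top.
    print({key: type(value).__name__ for key, value in doc.items()})
```

Streaming keeps the memory footprint bounded, which matters at this scale.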

Datasheet Information:

Motivation: Addresses the gap in large-scale open-source multimodal datasets. πŸŒπŸ“Š
Composition: 927.6 million documents, including HTML, PDF, and ArXiv sources. πŸ“„πŸ“š
Collection Process: Gathered from CommonCrawl WARC and WAT dumps, with rigorous filtering; a reading/filtering sketch follows this list. πŸ—‚οΈπŸ”
Preprocessing/Cleaning: Removal of low-quality text and duplicates, plus anonymization of sensitive information; a cleaning sketch also follows this list. πŸ§ΉπŸ”’
Ethical Considerations: Measures to ensure privacy and avoid bias. βš–οΈπŸ”
Uses: Training multimodal models, generating interleaved image-text sequences, and building retrieval systems. πŸ€–πŸ“–
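
To make the collection step concrete, here is a rough sketch that reads HTML responses out of a locally downloaded CommonCrawl WARC file with the warcio library; the length threshold is only a stand-in for the much more rigorous filtering the paper describes, and none of this is the authors' actual pipeline.

```python
# Rough sketch of the collection step: extract HTML responses from a CommonCrawl WARC file.
# The length threshold is a placeholder for MINT-1T's real quality filters.
from warcio.archiveiterator import ArchiveIterator

MIN_HTML_BYTES = 500  # arbitrary cutoff for this sketch


def iter_html_documents(warc_path):
    """Yield (url, html) pairs for sufficiently large HTML responses in a WARC file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            headers = record.http_headers
            content_type = headers.get_header("Content-Type") if headers else None
            if not content_type or "text/html" not in content_type:
                continue
            html = record.content_stream().read()
            if len(html) >= MIN_HTML_BYTES:
                yield record.rec_headers.get_header("WARC-Target-URI"), html
```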
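
And a similarly minimal sketch of the cleaning step mentioned in the datasheet: exact-duplicate removal via content hashing plus a simple email scrubber standing in for anonymization. The real preprocessing is more involved; these helpers are illustrative only.

```python
# Minimal sketch of the cleaning step: drop exact duplicates and mask email addresses.
# A stand-in for MINT-1T's actual deduplication and anonymization, not the real pipeline.
import hashlib
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def clean_documents(texts):
    """Return texts with exact duplicates removed and email addresses masked."""
    seen = set()
    cleaned = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:  # exact duplicate of an earlier document
            continue
        seen.add(digest)
        cleaned.append(EMAIL_RE.sub("<EMAIL>", text))
    return cleaned
```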