@merve on Hugging Face: "Forget about all the captioning datasets you've tried before! PixelProse is a…"

Post

3226

Forget about all the captioning datasets you've tried before!

PixelProse is a captioning dataset of 16M image-caption pairs, with less toxicity and higher details ✨
tomg-group-umd/pixelprose

The existing suite of captioning datasets consists of web scrapes that have alt text that is either irrelevant or not descriptive. The authors of this paper have taken those datasets, filtered for CSAM, passed it with a prompt to Gemini Vision Pro. They also removed PII and detoxified the resulting dataset.

Join the conversation