Introducing the Sailor-14B Models and the Sailor2 Project 🚢
We're thrilled to announce the release of the Sailor-14B models, including the Base and the Chat versions!
✅ Built upon the Qwen1.5-14B model, the Base version follows a procedure similar to that of our Sailor-7B model.
✅ The Chat version is optimized using DPO on our in-house human preference dataset, yielding a better experience than our previous Chat models.
We're also excited to introduce the Sailor2 project, ✨ an open collaboration opportunity for the entire community! ✨
🌐 The Sailor2 project aims to build an LLM with ~30B parameters, optimized for multiple South-East Asian (SEA) languages, including Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese.
🎯 The model will undergo continual pre-training from a base model proficient in both Chinese and English, using nearly 800B SEA tokens, and is expected to perform comparably to the most advanced commercial models on the SEA languages above.
🤝 Contribute your data, expertise, and ideas to shape the future of open-source LLMs for the SEA region.
🌍 Everyone passionate about the SEA region is welcome aboard! Join the party and get involved by scanning the QR code! 🔍
✨ Today, we're excited to share the full data processing script used in developing our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. 🚀
The pipeline consists of 4 stages 🧹:
1️⃣ Initial data cleaning
2️⃣ Near deduplication
3️⃣ Exact deduplication
4️⃣ Second round of data cleaning
We paid special attention to data cleaning for South-East Asian (SEA) languages. 🌍
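To make the order of the four stages concrete, here is a minimal Python sketch. It is not the code from the repo: the function names, the word-shingle Jaccard similarity used for near deduplication, the 0.7 threshold, and the toy cleaning rules are all assumptions chosen to keep the example short. A real pipeline would likely use a scalable method for near deduplication and far richer cleaning rules.

```python
import hashlib
import re


def clean(text: str) -> str:
    """Toy cleaning rule (assumption): collapse whitespace runs and trim."""
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, n: int = 3) -> set:
    """Set of lowercase word n-grams used to compare two documents."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_dedup(docs, threshold=0.7):
    """Keep a document only if it is not too similar to one already kept.

    Quadratic in the number of documents, so suitable for a sketch only.
    """
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def exact_dedup(docs):
    """Drop documents whose full text has already been seen."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def run_pipeline(docs, min_words=3):
    """Run the four stages in order and return the surviving documents."""
    stage1 = [c for c in (clean(d) for d in docs) if c]   # initial cleaning
    stage2 = near_dedup(stage1)                           # near deduplication
    stage3 = exact_dedup(stage2)                          # exact deduplication
    stage4 = [d for d in stage3 if len(d.split()) >= min_words]  # second cleaning
    return stage4
```

In this toy version, the exact deduplication stage removes little that the near deduplication stage has not already caught, because both work on whole documents. In a production pipeline the two stages would normally use different algorithms, which is why they are kept separate here.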
# Use Case ✨
With this codebase, you can clean your own dataset and:
✅ Get filtered data counts after each processing stage
✅ Easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, with optimized rules for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay)
✅ Investigate what data was removed at each processing stage
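As a rough illustration of these three capabilities, the sketch below runs one cleaning stage with per-language rules, counts what was kept, and records every removed document with a reason. Everything in it is hypothetical: `StageReport`, `LANGUAGE_RULES`, `clean_stage`, the rule names, and the thresholds are made up for this example and are not the repo's actual configuration format or API.

```python
from dataclasses import dataclass, field


@dataclass
class StageReport:
    """Result of one stage: what survived and what was dropped, with reasons."""
    name: str
    kept: list = field(default_factory=list)
    removed: list = field(default_factory=list)  # (document, reason) pairs


# Hypothetical per-language rules; keys and thresholds are illustrative only.
LANGUAGE_RULES = {
    "en": {"min_chars": 20, "banned_substrings": ["click here"]},
    "th": {"min_chars": 10, "banned_substrings": []},
}
DEFAULT_RULES = {"min_chars": 15, "banned_substrings": []}


def clean_stage(docs, lang, name="initial_cleaning"):
    """Apply the rules for `lang` and log every removal with its reason."""
    rules = LANGUAGE_RULES.get(lang, DEFAULT_RULES)
    report = StageReport(name)
    for doc in docs:
        text = doc.strip()
        if len(text) < rules["min_chars"]:
            report.removed.append((doc, "too_short"))
        elif any(b in text.lower() for b in rules["banned_substrings"]):
            report.removed.append((doc, "banned_substring"))
        else:
            report.kept.append(text)
    return report


sample = [
    "Click here to win a prize right now!",
    "ok",
    "A normal sentence about Southeast Asia.",
]
report = clean_stage(sample, lang="en")
print(report.name, "kept:", len(report.kept), "removed:", len(report.removed))
for doc, reason in report.removed:
    print(reason, repr(doc))
```

Keeping the removed documents alongside a reason string is one simple way to audit a filter: if a rule drops too much good text for a given language, the log shows exactly which rule fired.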
# Acknowledgement 🙏
The main credit goes to @dreamerdeo , the first author of our Sailor paper ❤️! He put in tremendous effort on the data processing pipeline, enabling the model's great performance. We believe the mini repo will be a valuable resource for researchers working on dataset curation for large language models. 🎉
Sharing the recipe openly aligns with our commitment to open language model development. 💪 And this repo would not have been possible without the contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao , and the deduplication project from Google. 🧠
# What's Next 🚀
Share your thoughts or leave a comment on what you'd like the Sailor models to do! We also have some exciting news coming soon, so please stay tuned. 🚄