ArabicWeb24: Creating a High Quality Arabic Web-only Pre-training Dataset

Community Article · Published August 8, 2024


TL;DR

We discuss how we preprocess a large scrape of Arabic web content and turn it into ArabicWeb24, a clean Arabic web pre-training dataset that we release. We document the evaluation we performed by training different ablation models, and we show how different filtering pipelines affect the models' outputs. We openly release the code used to perform the data processing.

The data can be found here.

You can check the full blog post on the website here.

Table of Contents

  1. Data Pre-processing Pipeline
  2. Ablation Models Training
  3. Evaluation
  4. Additional Information
  5. Citation

This project was done jointly by May Farhat and Said Taghadouini.

1. Data Pre-processing Pipeline

ArabicWeb24 is the result of extracting as much valuable information as possible from a 6.5 TB crawl of compressed WARC files. This specialized web crawl aimed to collect comprehensive Arabic content from untapped sources, addressing Common Crawl's shortcomings in representing Arabic-language data.

Given the massive volume of data we needed to process, we chose the open-source library datatrove, built to perform large-scale processing, filtering, and deduplication of text data by parallelizing workloads.

In this section, we discuss the main steps in constructing ArabicWeb24.

We began with basic filtering: URL and Gopher quality filters, a language labeler and filter, and text extraction from HTML pages via the Trafilatura module. Since the dataset is very large, we could not deduplicate it in a single pass. For computational reasons, we divided it into 4 parts and applied MinHash deduplication to each part separately; this approach performs a similarity search across all documents, marks those considered duplicates, and removes documents that are at least 75% similar to another. This was followed by a sentence deduplication module to eliminate duplicate sentences across the entire dataset.
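The post does not publish the exact pipeline configuration, but datatrove ships built-in blocks for each of these steps. As a hedged sketch (paths, parallelism, and thresholds below are placeholders, not the ArabicWeb24 settings), the base extraction-and-filtering stage might look like this:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.extractors import Trafilatura
from datatrove.pipeline.filters import GopherQualityFilter, LanguageFilter, URLFilter
from datatrove.pipeline.readers import WarcReader
from datatrove.pipeline.writers import JsonlWriter

base_stage = LocalPipelineExecutor(
    pipeline=[
        WarcReader("s3://my-bucket/arabic-crawl/warc/"),  # placeholder path to the WARC files
        URLFilter(),                                      # drop blocklisted/adult URLs
        Trafilatura(favour_precision=True),               # extract main text from HTML
        LanguageFilter(languages=["ar"]),                 # keep documents labeled as Arabic
        GopherQualityFilter(),                            # Gopher quality heuristics
        JsonlWriter("s3://my-bucket/arabicweb24/base/"),  # placeholder output path
    ],
    tasks=512,   # placeholder parallelism
    workers=32,
)
base_stage.run()
```

MinHash deduplication itself runs in datatrove as four chained stages (signatures, buckets, clusters, filtering), each with its own executor; we omit them here for brevity.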

For further consistency, we applied additional filtering pipelines: the C4 bad words filter to remove offensive words (our list includes both Arabic and English banned words), and the FineWeb filter to remove short lines, lists, navigation bars, and other unnecessary web page elements, together with formatters that strip symbol-only lines and image placeholders. The result is a high-quality, cleaned pre-training dataset of 28 billion tokens (AraGPT2 tokenizer tokens).
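These extra filters again map onto datatrove built-ins. A sketch of this final stage, with the caveat that the block names and defaults below come from the library and may differ from the exact ArabicWeb24 configuration (in particular, the extended Arabic bad-words list is not reproduced here):

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import C4BadWordsFilter, FineWebQualityFilter
from datatrove.pipeline.formatters import SymbolLinesFormatter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers import JsonlWriter

quality_stage = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("s3://my-bucket/arabicweb24/deduped/"),  # output of the deduplication stages
        C4BadWordsFilter(),      # offensive-word filter; ArabicWeb24 extends it with Arabic terms
        FineWebQualityFilter(),  # drops short lines, lists, navigation bars, etc.
        SymbolLinesFormatter(),  # strips symbol-only lines and similar placeholder noise
        JsonlWriter("s3://my-bucket/arabicweb24/final/"),
    ],
    tasks=256,  # placeholder parallelism
)
quality_stage.run()
```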


For more details about how we adapted and designed these filters to fit the requirements of the Arabic language, you can check the blog post here.

2. Ablation Models Training

For the ablation models, we used the Mamba2 architecture with a sequence length of 1024, given that the mean document length was around 750 tokens. We set the global batch size to 1040, the model dimension (d_model) to 2304, and the depth to 18 layers. This wider, shallower shape is motivated by training-efficiency considerations, at the cost of minimal performance degradation. The vocabulary size was set to 64k, using the AraGPT2 tokenizer. All our ablation models have 900 million parameters. We used a cosine decay learning rate schedule with 10% warmup.
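The post does not name its training codebase. As one concrete, hedged reading of this configuration, here is a sketch using the reference mamba_ssm package; untying the embeddings is our assumption, made so the parameter count lands near the stated 900M:

```python
import math

from mamba_ssm.models.config_mamba import MambaConfig
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

config = MambaConfig(
    d_model=2304,                 # wide model dimension, as described above
    n_layer=18,                   # shallow depth, for training efficiency
    vocab_size=64000,             # AraGPT2 tokenizer vocabulary
    ssm_cfg={"layer": "Mamba2"},  # select Mamba2 blocks instead of Mamba1
    tie_embeddings=False,         # assumption: untied embeddings to reach ~0.9B params
)
model = MambaLMHeadModel(config, device="cuda")
print(f"{sum(p.numel() for p in model.parameters()) / 1e9:.2f}B parameters")

# Cosine decay with 10% linear warmup, as a learning-rate multiplier
# usable with torch.optim.lr_scheduler.LambdaLR:
def lr_multiplier(step: int, total_steps: int, warmup_frac: float = 0.10) -> float:
    warmup = int(warmup_frac * total_steps)
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```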

Unsure about the effect of each filter and the optimal filtering strategy to follow, we created four dataset variants with different pipelines, summarized in the table below. Note that all ablations start after the MinHash deduplication step, which we consider necessary in every configuration.

| Dataset version | Dataset V1 | Dataset V2 | Dataset V4 | Dataset V5 |
|---|:---:|:---:|:---:|:---:|
| Sentence deduplication | ✅ | ✅ | 🚫 | 🚫 |
| C4 bad words filter | ✅ | ✅ | ✅ | ✅ |
| FineWeb filter | ✅ | 🚫 | 🚫 | ✅ |
| Formatters | ✅ | ✅ | ✅ | ✅ |
| Num. of tokens (GT) | 28 | 45 | 78 | 39 |

To compare these versions empirically, we trained ablation models with identical configurations; the only difference was the data they were trained on.

We also trained a baseline model, which we call V3, on an open-source dataset: the 101 Billion Arabic Words Dataset, which is built on Common Crawl and intended for pre-training large language models that generate better Arabic content.

3. Evaluation

The evaluation was based on two aspects:

  • Qualitative evaluation
  • Zero-shot evaluation benchmarks

3.1 Qualitative evaluation

To judge the quality of the different models trained on the 5 versions of datasets, we established three qualitative metrics and evaluated the output of different prompts accordingly:

  • Fluency: How grammatically correct and natural the generated text is.
  • Coherence: How logically consistent and contextually appropriate the generated text is.
  • Relevance: How well the generated text addresses the prompt or task.

Results:

  • The model trained on V1 data, which had extensive filtering, produced the best outputs, demonstrating fluency, relevance, and appropriateness without any adult or spam content.

  • Models trained on V2 and V4, which lacked the FineWeb filter, generated noisy outputs with irrelevant characters and list-like content, emphasizing the importance of the FineWeb filter in maintaining output quality.

  • Sentence deduplication reduces the generation of memorized content and lowers computational costs. However, in our experiments, it had minimal impact on output quality.

3.2 Zero-shot evaluation benchmarks

  • Top-3 and Top-10 accuracy

We evaluated the models using the open-access ultimate_arabic_news dataset, from which we selected 61 million tokens of Arabic news texts from sources such as Al-Arabiya, Al-Youm Al-Sabea (Youm7), and Google News.
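For reference, the dataset can be pulled from the Hugging Face Hub; the repository id and configuration name below are our assumptions (check the dataset page), not something stated in the post:

```python
from datasets import load_dataset

# Hub id and config name are assumptions; verify them on the dataset page.
news = load_dataset("khalidalt/ultimate_arabic_news", "UltimateArabic", split="train")
```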

The evaluation focused on top-3 and top-10 token accuracy: these metrics capture not only whether a model generates coherent output but also how often the correct token appears among its top 3 or top 10 predictions.
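The evaluation script is not included in the post; a minimal PyTorch sketch of the metric itself (batching and model-specific details omitted):

```python
import torch

def topk_token_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 3) -> float:
    """Fraction of positions where the true next token is among the top-k predictions.

    logits: (batch, seq_len, vocab_size) next-token logits
    labels: (batch, seq_len) ground-truth token ids
    """
    topk_ids = logits.topk(k, dim=-1).indices              # (batch, seq_len, k)
    hits = (topk_ids == labels.unsqueeze(-1)).any(dim=-1)  # (batch, seq_len)
    return hits.float().mean().item()

# top3 = topk_token_accuracy(logits, labels, k=3)
# top10 = topk_token_accuracy(logits, labels, k=10)
```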

The results are visualized in the plots below:

*(Plot 1: top-3 accuracy. Plot 2: top-10 accuracy.)*

Results:

  • The model trained on dataset V1 achieved the highest top 3 and top 10 accuracies, followed by the model trained on V5.

  • Filtering and deduplication improved data quality, though sentence deduplication had a minor effect.

  • Dropping sentence deduplication could yield more training tokens and further improvements.

  • Perplexity

Our perplexity study (figure below) shows that the V1 model outperforms the others, including the baseline trained on the 101B words dataset. Models trained on our data achieve perplexity scores that are both lower than the baseline's and close to one another. These absolute values are not optimal, however, likely due to the model's limited size (900M parameters) and the mismatch between the evaluation and training datasets.
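For reference, perplexity is the exponentiated average negative log-likelihood the model assigns to the evaluation tokens, so lower is better:

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta\left(x_i \mid x_{<i}\right)\right)$$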

The 101B words dataset underwent specific preprocessing, including diacritic removal and elimination of non-Arabic characters, which may contribute to these differences. This suggests that classical NLP-style preprocessing is ill-suited to training on web data and can hurt performance. Despite this, the V1 model's superior performance in predicting subsequent words suggests more coherent and contextually accurate outputs, aligning with our qualitative analysis findings.

*(Figure: perplexity of the ablation models.)*

  • Zero-shot evaluation benchmarks

We evaluated the ablation models on zero-shot Arabic benchmarks using the lm-evaluation-harness library. The selected benchmarks include COPA-ar, HellaSwag-ar, and PIQA-ar. These were chosen for their focus on the Arabic language, ease of use through the lm-evaluation-harness library, and ability to provide meaningful signals even for models trained on a few billion tokens. This approach allowed us to effectively assess the performance of the small-scale ablation models.
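As a hedged sketch, the harness can be driven from Python roughly as follows; the Arabic task identifiers below are assumptions, since the exact ids depend on the harness version and task definitions used:

```python
import lm_eval

# Task ids are placeholders: list the Arabic variants available in your
# lm-evaluation-harness version before running.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/ablation-checkpoint",  # placeholder path
    tasks=["copa_ar", "hellaswag_ar", "piqa_ar"],
    batch_size=8,
)
print(results["results"])
```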

*(Figure: zero-shot benchmark scores of the ablation models.)*

The zero-shot benchmarks reveal a notable performance improvement with models trained on the ArabicWeb24 datasets compared to the baseline model. This indicates that the ArabicWeb24 data is of higher quality than the 101B Arabic words dataset. We observe minimal differences among the various ArabicWeb24 versions, which we attribute to the model size and the data scale.

Given the challenges of extracting significant signals at such a small scale, qualitative evaluation emerged as our primary metric alongside validation metrics. To achieve more significant results, extended training on larger datasets and the use of additional benchmark sets could be considered, but such enhancements are beyond the scope of this project.

4. Additional Information

In this section, we will detail the distribution of data sources included in the final dataset, referred to as V1, along with the computational resources used in the creation of ArabicWeb24.

  • Data distribution

Our experiments revealed a pattern in the models' outputs: a tendency to generate text resembling news articles. To investigate this trend, we extracted the 150 most frequently occurring URLs from the documents. These URLs were then annotated using a two-step approach of Llama 3 8B annotation followed by manual verification: we provide the model with an excerpt from each URL and ask it to classify it into one of the following classes (a sketch of this step follows below).

1. News
2. Encyclopedia
3. Art and Entertainment
4. Sports
5. Society and Religion
6. Spam
7. Marketplace

*(Figure: category distribution of the top 150 URLs.)*

This distribution shows the variety of content sources among these most frequently occurring URLs, with a clear emphasis on news-related material (about 76%), which explains why the models tend to generate news-like text. While this analysis focuses on the top 150 URLs and may not represent the entire dataset, it suggests a strong presence of current events and factual information, along with a range of other topics that add diversity to the content.
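The exact annotation prompt is not given in the post. A hypothetical sketch of the automatic first step, using the transformers text-generation pipeline (the model id is the public Llama 3 8B Instruct checkpoint; the prompt wording is illustrative):

```python
from transformers import pipeline

CATEGORIES = [
    "News", "Encyclopedia", "Art and Entertainment", "Sports",
    "Society and Religion", "Spam", "Marketplace",
]

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

def classify_url_excerpt(excerpt: str) -> str:
    # Illustrative prompt: ask the model for exactly one category label.
    prompt = (
        "Classify the following website excerpt into exactly one of these "
        f"categories: {', '.join(CATEGORIES)}.\n\n"
        f"Excerpt:\n{excerpt}\n\nCategory:"
    )
    out = generator(prompt, max_new_tokens=8, do_sample=False)
    return out[0]["generated_text"][len(prompt):].strip()

# The second step in the post is manual verification of these labels.
```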

  • Computational Resources
  1. Ablation Studies:

    • Platform: HPE Cray node
    • Hardware: 8 NVIDIA H100 GPUs
    • Cloud Provider: Orange Cloud Avenue
  2. MinHash Deduplication:

    • Infrastructure: MeluXina HPC cluster nodes
  3. Data Pre-processing:

    • Cloud Provider: Amazon Web Services (AWS)
    • Text Extraction and Base Filtering:
      • Instance Type: c7a.8xlarge
    • Advanced Processing (sentence deduplication, tokenization, FineWeb processing, URL filtering, and formatting):
      • Instance Type: r7a.12xlarge

Conclusion

By releasing a substantial amount of this cleaned data from previously unexplored sources and documenting the process in this blog post, we hope to encourage others to contribute to the AI community and to provide a valuable resource for researchers and developers working on natively Arabic language models.

  • Discussion of Biases

Efforts were made to minimize NSFW and toxic content in the dataset by filtering at the URL level. Despite these efforts, a significant number of documents in the final dataset may still be considered toxic or harmful. Since ArabicWeb24 was sourced from the web as a whole, it may reproduce harmful biases commonly found online.

5. Citation

@misc{ArabicWeb24,
  title={ArabicWeb24: Creating a High Quality Arabic Web-only Pre-training Dataset},
  author={Farhat, May and Taghadouini, Said},
  url={www.lighton.ai/lighton-blogs/arabicweb24},
  year={2024}
}

Contact us

We believe Arabic should be a first-class citizen. If you want to source data, train, finetune, or work with LLMs in Arabic, then get in touch. Interested in custom data curation and processing for your language? Contact LightOn for tailored solutions, with Arabic as a prime example.