kenhktsui/llm-data-textbook-quality-classifier-v1

kenhktsui committed on Apr 27

Commit 76345ce (1 parent: e9f5cc1)

Update README.md

Files changed (1)

README.md +18 -0


@@ -22,6 +22,24 @@ should probably proofread and complete it, then remove this comment. -->
 # llm-data-textbook-quality-classifer-v1
 This model classifies whether a text is of textbook quality. It can be used as a filter for data curation when training an LLM.
+Please note that textbook-quality text is a subset of high-quality text.
+## Benchmark
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/US04uiMXJpFLmoG-q7mvZ.png)
+|Dataset | Sampling | Average Quality Score |
+|--------------------------------------|---|-------------------|
+|[nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) |First 10,000| 0.8618|
+|[nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) |First 10,000| 0.8544|
+|[SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) |First 10,000| 0.8109|
+|[pszemraj/simple_wikipedia_LM](https://huggingface.co/datasets/pszemraj/simple_wikipedia_LM) | Full| 0.5386|
+|[mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m)| Full| 0.2951|
+|[JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile)| Full | 0.2618|
+The scores match expectations. The textbook datasets score the highest (0.81 to 0.86), which suggests the model separates textbook-style text from other sources. Simple Wikipedia scores lower (0.54) because it is not a textbook after all. The web datasets score the lowest (0.26 to 0.30).
 This model is a fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on an unknown dataset.
 It achieves the following results on the evaluation set:
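
The benchmark added in this commit reports an average quality score per dataset, and the model card describes using the classifier as a data-curation filter. The sketch below shows one way both steps could be reproduced; it is not the author's evaluation script. The helper names (`average_quality_score`, `filter_by_quality`, `make_score_fn`), the positive label name `"LABEL_1"`, and the 0.5 threshold are assumptions for illustration; check the model's `config.json` for the actual label names.

```python
from itertools import islice
from statistics import mean
from typing import Callable, Iterable, List, Optional


def average_quality_score(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    limit: Optional[int] = None,
) -> float:
    """Average score_fn over the first `limit` texts (all texts if limit is None)."""
    scores = [score_fn(t) for t in islice(texts, limit)]
    if not scores:
        raise ValueError("no texts to score")
    return mean(scores)


def filter_by_quality(
    texts: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the model card
) -> List[str]:
    """Keep only the texts whose score is at or above the threshold."""
    return [t for t in texts if score_fn(t) >= threshold]


def make_score_fn(
    model_id: str = "kenhktsui/llm-data-textbook-quality-classifier-v1",
    positive_label: str = "LABEL_1",  # assumption: verify against the model config
) -> Callable[[str], float]:
    """Build a scoring function backed by a transformers text-classification pipeline."""
    from transformers import pipeline  # imported lazily; only needed for real scoring

    clf = pipeline("text-classification", model=model_id)

    def score(text: str) -> float:
        # top_k=None returns the scores of all labels for each input text.
        results = clf([text], top_k=None, truncation=True)[0]
        return next(r["score"] for r in results if r["label"] == positive_label)

    return score
```

With `score_fn = make_score_fn()`, calling `average_quality_score(texts, score_fn, limit=10_000)` would correspond to the "First 10,000" sampling rows of the table, and `limit=None` to the "Full" rows, assuming `texts` iterates over the text column of the dataset.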