kenhktsui committed on
Commit
76345ce
1 Parent(s): e9f5cc1

Update README.md

  1. README.md +18 -0
README.md CHANGED
@@ -22,6 +22,24 @@ should probably proofread and complete it, then remove this comment. -->
 
  # llm-data-textbook-quality-classifer-v1
  This model can classify if a text is of textbook quality data. It can be used as a filter for data curation when training a LLM.
+ Please note that textbook quality is a subset of high quality.
+
+
+ ## Benchmark
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/60e50ce5350d181892d5a636/US04uiMXJpFLmoG-q7mvZ.png)
+
+ | Dataset | Sampling | Average Quality Score |
+ |---|---|---|
+ | [nampdn-ai/tiny-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-textbooks) | First 10,000 | 0.8618 |
+ | [nampdn-ai/tiny-orca-textbooks](https://huggingface.co/datasets/nampdn-ai/tiny-orca-textbooks) | First 10,000 | 0.8544 |
+ | [SciPhi/textbooks-are-all-you-need-lite](https://huggingface.co/datasets/SciPhi/textbooks-are-all-you-need-lite) | First 10,000 | 0.8109 |
+ | [pszemraj/simple_wikipedia_LM](https://huggingface.co/datasets/pszemraj/simple_wikipedia_LM) | Full | 0.5386 |
+ | [mattymchen/refinedweb-3m](https://huggingface.co/datasets/mattymchen/refinedweb-3m) | Full | 0.2951 |
+ | [JeanKaddour/minipile](https://huggingface.co/datasets/JeanKaddour/minipile) | Full | 0.2618 |
+
+ The classifier's scores align with expectations. The textbook datasets score the highest, which suggests the model is effective at recognizing textbook-quality text. Wikipedia scores lower because it is not written as a textbook. The web datasets score the lowest.
+
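The "Average Quality Score" column can be read as the mean of a per-text score over the sampled rows. The README does not say how that score is derived from the model output, so the following is only a minimal sketch under assumptions: the score is taken to be the probability of the positive class, and `average_quality_score`, `make_hf_score_fn`, `score_fn`, `model_id`, and `positive_label` are hypothetical names that do not appear in the README.

```python
from statistics import mean
from typing import Callable, Iterable, List, Optional

ScoreFn = Callable[[List[str]], List[float]]


def average_quality_score(
    texts: Iterable[str],
    score_fn: ScoreFn,
    limit: Optional[int] = 10_000,
    batch_size: int = 32,
) -> float:
    """Mean score over the first `limit` texts (all texts if limit is None)."""
    scores: List[float] = []
    batch: List[str] = []
    for i, text in enumerate(texts):
        if limit is not None and i >= limit:
            break
        batch.append(text)
        if len(batch) == batch_size:
            scores.extend(score_fn(batch))
            batch = []
    if batch:  # score the final, partially filled batch
        scores.extend(score_fn(batch))
    if not scores:
        raise ValueError("no texts to score")
    return mean(scores)


def make_hf_score_fn(model_id: str, positive_label: str) -> ScoreFn:
    """Wrap a Hugging Face text-classification pipeline as a score function.

    Assumption: the quality score is the probability the model assigns to
    `positive_label`. Both arguments must be taken from the model repository;
    neither value is given in the README.
    """
    from transformers import pipeline  # imported lazily; needs `transformers`

    clf = pipeline("text-classification", model=model_id, top_k=None)

    def score_fn(batch: List[str]) -> List[float]:
        outputs = clf(batch, truncation=True)  # one list of label dicts per text
        return [
            next(d["score"] for d in out if d["label"] == positive_label)
            for out in outputs
        ]

    return score_fn
```

With a real model, `average_quality_score(dataset_texts, make_hf_score_fn(model_id, positive_label))` would give one number per dataset, comparable in spirit to a table row. Passing the scorer in as a function keeps the averaging logic testable without downloading a model.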
 
  This model is a fine-tuned version of [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) on an unknown dataset.
  It achieves the following results on the evaluation set: