# Fake News Classifier - Finetuned: 'distilbert-base-uncased'

#### **LIAR Dataset**
***
- This model is finetuned on the LIAR dataset, a large collection of hand-labeled short statements gathered from politifact.com's API.
- The data went through a series of text-cleaning stages:
  1. Lower-case standardization for better 'uncased' model performance.
  2. Removal of words that mix letters and digits.
  3. Stopword removal.
  4. Trimming of extra spaces.
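The four cleaning stages above can be sketched in plain Python. The stopword list below is a small illustrative set, not the one actually used, and `clean_text` is a hypothetical helper name:

```python
import re

# Minimal illustrative stopword set; the README does not specify the list used.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "for"}

def clean_text(text: str) -> str:
    """Sketch of the four cleaning stages described above."""
    # 1. Lower-case standardization for the 'uncased' model.
    text = text.lower()
    # 2. Remove words that mix letters and digits (e.g. 'taxes3x').
    tokens = [t for t in text.split()
              if not (re.search(r"[a-z]", t) and re.search(r"\d", t))]
    # 3. Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Re-join with single spaces, trimming extra whitespace.
    return " ".join(tokens)

print(clean_text("The  senator RAISED  taxes3x in   2019"))
# → senator raised 2019
```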

#### **DistilBERT Uncased Tokenizer**
***
- The text is tokenized with the 'distilbert-base-uncased' HuggingFace tokenizer.
- For training, inputs are truncated to a block size of 200 tokens.
- Max-length padding is used to keep the input shape consistent across examples.
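With the `transformers` library, the truncation and padding described above amount to a single tokenizer call (the input sentence here is just an illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Truncate to the block size of 200 and pad to max length so every
# example has the same input shape.
encoded = tokenizer(
    "Some short political statement.",
    truncation=True,
    max_length=200,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # → 200
```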

#### **DistilBERT Uncased Model**
***
- The finetuned model is DistilBERT, 'distilbert-base-uncased'.
- It is a small, fast text classifier, well suited to real-time inference.
- It has 40% fewer parameters than the base BERT model.
- It runs 60% faster while preserving 95% of base BERT's performance.
- This model outperforms the finetuned 'distilbert-base-cased' by over 5% in average F1-score.
- The improvement comes mainly from the slower learning rate and improved data preprocessing.
- These modifications allow for a smoother training curve and better convergence.
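A finetuning configuration along these lines might look as follows. The README mentions a "slower learning rate" but not the exact value, so `2e-5` (below the common `5e-5` default), the epoch count, and the batch size are all illustrative assumptions:

```python
from transformers import TrainingArguments

# Hypothetical hyperparameters for the finetuning run described above.
training_args = TrainingArguments(
    output_dir="distilbert-fake-news",
    learning_rate=2e-5,              # slower than the common 5e-5 default
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,
)
```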