|
--- |
|
datasets: |
|
- liar |
|
metrics: |
|
- accuracy |
|
- f1 |
|
- precision |
|
- recall |
|
--- |
|
# Fake News Classifier - Finetuned: 'distilbert-base-uncased' |
|
|
|
#### **LIAR Dataset** |
|
*** |
|
- This model is finetuned on LIAR, a dataset of roughly 12.8K hand-labeled short statements collected from politifact.com's API. |
|
- The data went through a series of text-cleaning stages, including: |
|
1. Lower-case standardization for improved 'uncased' model performance. |
|
2. Mixed letter/digit word removal. |
|
3. Stopword removal. |
|
4. Extra space trimming. |
|
|
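The four cleaning stages listed above could look roughly like the sketch below. This is only an illustration: the function name `clean_statement`, the regular expression, and the tiny `STOPWORDS` set are assumptions, since the exact stopword list and rules used for this model are not stated here.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# the list actually used in preprocessing is not specified.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "or", "is", "are", "was", "that"}

# A "mixed" word contains at least one letter AND at least one digit, e.g. "hb1234".
# Pure numbers such as "2015" are left alone.
MIXED_WORD = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b")


def clean_statement(text: str) -> str:
    """Apply the four cleaning stages to a single statement."""
    text = text.lower()                      # 1. lower-case standardization
    text = MIXED_WORD.sub(" ", text)         # 2. mixed letter/digit word removal
    words = [w for w in text.split() if w not in STOPWORDS]  # 3. stopword removal
    return " ".join(words)                   # 4. extra space trimming


print(clean_statement("The  Senate passed HB1234 in 2015"))
```

Splitting on whitespace and re-joining with single spaces handles the trimming step without a separate pass.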
|
#### **DistilBERT Uncased Tokenizer** |
|
*** |
|
- The text is tokenized using the **'distilbert-base-uncased'** HuggingFace tokenizer. |
|
- For training, the text is truncated to a block size of 200 tokens. |
|
- Max length padding is used to maintain consistent input data shape. |
|
|
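As a rough sketch of the tokenization settings described above (truncation to a block size of 200 with max-length padding), the HuggingFace tokenizer can be called as follows. The helper names `load_tokenizer` and `encode_batch` are hypothetical, and the `transformers` package is assumed to be installed.

```python
def load_tokenizer(name: str = "distilbert-base-uncased"):
    """Load the pretrained tokenizer (requires `transformers` and network or cache access)."""
    from transformers import AutoTokenizer  # imported lazily so this module loads without it
    return AutoTokenizer.from_pretrained(name)


def encode_batch(tokenizer, texts, block_size: int = 200):
    """Tokenize a list of statements to a fixed length of `block_size` tokens."""
    return tokenizer(
        texts,
        truncation=True,        # cut sequences longer than block_size
        padding="max_length",   # pad shorter sequences up to block_size
        max_length=block_size,
    )
```

Calling `encode_batch(load_tokenizer(), ["some cleaned statement"])` returns `input_ids` and `attention_mask` lists in which every sequence has exactly 200 entries, which keeps the input shape consistent across batches.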
|
#### **DistilBERT Uncased Model** |
|
*** |
|
- The model that is finetuned is the DistilBERT model, **'distilbert-base-uncased'**. |
|
- This is a small and fast text classifier, perfect for real-time inference! |
|
- 40% fewer parameters than the base BERT model. |
|
- 60% faster while preserving over 95% of the base BERT model's performance on the GLUE benchmark. |
|
- This model outperforms the finetuned 'distilbert-base-cased' by over 5% in average F1-score. |
|
- This improvement comes mainly from the lower learning rate and improved data preprocessing. |
|
- These modifications allow for a smoother training curve and convergence. |
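
The comparison above is stated in terms of average F1-score. As an illustration of how the four metrics listed for this model (accuracy, F1, precision, recall) could be computed, here is a minimal sketch. It assumes macro averaging, meaning an unweighted mean over classes; the averaging scheme actually used for the reported numbers is not stated, and the function name `macro_metrics` is hypothetical.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


print(macro_metrics([0, 0, 1, 1], [0, 1, 1, 1]))
```

Macro averaging weights every class equally regardless of how many examples it has, which matters when the label distribution is imbalanced.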