---
datasets:
- liar
metrics:
- accuracy
- f1
- precision
- recall
---
# Fake News Classifier - Finetuned: 'distilbert-base-uncased'
#### **LIAR Dataset**
***
- This model is finetuned on the LIAR dataset, a collection of hand-labeled short statements gathered from PolitiFact's API.
- The data went through a series of text cleaning stages (illustrated in the sketch after this list):
1. Lower-case standardization for improved 'uncased' model performance.
2. Mixed letter/digit word removal.
3. Stopword removal.
4. Extra space trimming.
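- A minimal sketch of this preprocessing, assuming a regex for mixed letter/digit words and a small illustrative stopword list (the exact filters used in training are not specified here):

```python
import re

# Illustrative stopword list; the list used in the actual training pipeline is not specified here.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def clean_text(text: str) -> str:
    text = text.lower()                                                 # 1. lower-case standardization
    text = re.sub(r"\b\w*(?:\d\w*[a-z]|[a-z]\w*\d)\w*\b", " ", text)    # 2. remove mixed letter/digit words
    tokens = [t for t in text.split() if t not in STOPWORDS]            # 3. stopword removal
    return " ".join(tokens)                                             # 4. split/join trims extra spaces

print(clean_text("The senator's 2020abc claim  was   false"))
```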
#### **DistilBERT Uncased Tokenizer**
***
- The text is tokenized using the **'distilbert-base-uncased'** HuggingFace tokenizer.
- For training, the text is truncated to a block size of 200 tokens.
- Max-length padding is used to keep the input shape consistent (see the example below).
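- For example, the tokenization step could look like the sketch below (the input sentence is only an example statement):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Truncate to the 200-token block size and pad to max length for a consistent input shape.
encoded = tokenizer(
    "building a wall on the border will take literally years",
    truncation=True,
    max_length=200,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # 200
```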
#### **DistilBERT Uncased Model**
***
- The finetuned base model is DistilBERT, **'distilbert-base-uncased'**.
- This is a small and fast text classifier, perfect for real-time inference (see the sketch at the end of this section)!
    - 40% fewer parameters than the base BERT model.
    - Runs 60% faster while preserving 95% of the base BERT model's performance.
- This model outperforms the finetuned 'distilbert-base-cased' variant by over 5% in average F1-score.
    - The improvement comes mainly from a lower learning rate and improved data preprocessing.
    - These modifications yield a smoother training curve and more stable convergence.
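- A quick way to try the model for real-time inference; this is a sketch, and the model ID below is a placeholder, not necessarily this repository's actual Hub name:

```python
from transformers import pipeline

# Placeholder repository ID; replace with this model's actual ID on the Hub.
model_id = "caballeroch/fake-news-classifier"

classifier = pipeline("text-classification", model=model_id, tokenizer=model_id)
print(classifier("the unemployment rate has doubled over the last four years"))
# e.g. [{'label': '...', 'score': 0.87}]
```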