# Fake News Classifier - Finetuned: 'distilbert-base-uncased'

#### **LIAR Dataset**
***
- This model is finetuned on the LIAR dataset, a large collection of hand-labeled short statements gathered from politifact.com's API.
- The data went through a series of text-cleaning stages:
  1. Lower-case standardization for better 'uncased' model performance.
  2. Removal of words that mix letters and digits.
  3. Stopword removal.
  4. Trimming of extra spaces.
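The four cleaning stages above can be sketched in plain Python. The stopword list below is a small illustrative set, not the one actually used, and `clean_text` is a hypothetical helper name:

```python
import re

# Minimal illustrative stopword set; the README does not specify the list used.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "on", "for"}

def clean_text(text: str) -> str:
    """Sketch of the four cleaning stages described above."""
    # 1. Lower-case standardization for the 'uncased' model.
    text = text.lower()
    # 2. Remove words that mix letters and digits (e.g. 'taxes3x').
    tokens = [t for t in text.split()
              if not (re.search(r"[a-z]", t) and re.search(r"\d", t))]
    # 3. Stopword removal.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Re-join with single spaces, trimming extra whitespace.
    return " ".join(tokens)

print(clean_text("The  senator RAISED  taxes3x in   2019"))
# → senator raised 2019
```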

#### **DistilBERT Uncased Tokenizer**
***
- The text is tokenized with the 'distilbert-base-uncased' HuggingFace tokenizer.
- For training, inputs are truncated to a block size of 200 tokens.
- Max-length padding is used to keep the input shape consistent across examples.
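With the `transformers` library, the truncation and padding described above amount to a single tokenizer call (the input sentence here is just an illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Truncate to the block size of 200 and pad to max length so every
# example has the same input shape.
encoded = tokenizer(
    "Some short political statement.",
    truncation=True,
    max_length=200,
    padding="max_length",
)
print(len(encoded["input_ids"]))  # → 200
```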

#### **DistilBERT Uncased Model**
***
- The finetuned model is DistilBERT, 'distilbert-base-uncased'.
- It is a small, fast text classifier, well suited to real-time inference.
- It has 40% fewer parameters than the base BERT model.
- It runs 60% faster while preserving 95% of base BERT's performance.
- This model outperforms the finetuned 'distilbert-base-cased' by over 5% in average F1-score.
- The improvement comes mainly from the slower learning rate and improved data preprocessing.
- These modifications allow for a smoother training curve and better convergence.
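A finetuning configuration along these lines might look as follows. The README mentions a "slower learning rate" but not the exact value, so `2e-5` (below the common `5e-5` default), the epoch count, and the batch size are all illustrative assumptions:

```python
from transformers import TrainingArguments

# Hypothetical hyperparameters for the finetuning run described above.
training_args = TrainingArguments(
    output_dir="distilbert-fake-news",
    learning_rate=2e-5,              # slower than the common 5e-5 default
    num_train_epochs=3,
    per_device_train_batch_size=16,
    weight_decay=0.01,
)
```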