caballeroch/FakeNewsClassifierDistilBert-uncased

	---
	datasets:
	- liar
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	---
	# Fake News Classifier - Finetuned: 'distilbert-base-uncased'

	#### LIAR Dataset
	***
	- This model is finetuned on a large dataset of hand-labeled short statements from politifact.com's API.
	- The data went through a series of text-cleaning stages, including:
	1. Lower-case standardization for improved 'uncased' model performance.
	2. Mixed letter/digit word removal.
	3. Stopword removal.
	4. Extra space trimming.
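The four cleaning stages above could be sketched roughly as follows. This is a minimal illustration, not the code used to train this model: the `clean_statement` helper, the tiny `STOPWORDS` set, and the exact definition of a "mixed letter/digit word" are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real pipeline would likely
# use a fuller list (for example from NLTK or spaCy).
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "in", "on", "to", "and", "that"}

# Assumed definition: a "mixed" word contains at least one ASCII letter
# and at least one digit (e.g. "hr2024"), while pure numbers are kept.
MIXED_WORD = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b")


def clean_statement(text: str) -> str:
    """Apply the four cleaning stages to a single statement."""
    text = text.lower()                                       # 1. lower-case standardization
    text = MIXED_WORD.sub(" ", text)                          # 2. mixed letter/digit word removal
    words = [w for w in text.split() if w not in STOPWORDS]   # 3. stopword removal
    return " ".join(words)                                    # 4. extra space trimming


print(clean_statement("The  Senator voted on Bill HR2024 in 2019 "))
# -> senator voted bill 2019
```

Splitting on whitespace and re-joining with single spaces handles the trimming step without a separate regular expression.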

	#### DistilBERT Uncased Tokenizer
	***
	- The text is tokenized using the 'distilbert-base-uncased' HuggingFace tokenizer.
	- For training, the text is truncated to a block size of 200 tokens.
	- Max length padding is used to maintain consistent input data shape.
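A tokenization step matching this description might look like the sketch below. The helper names `encode_statements` and `load_tokenizer` are hypothetical, and the training script may have configured the tokenizer differently; only the tokenizer name, the block size of 200, and the use of max-length padding come from this card.

```python
BLOCK_SIZE = 200  # block size stated in this card


def encode_statements(texts, tokenizer, block_size=BLOCK_SIZE):
    """Tokenize a batch so that every sequence has exactly `block_size` token ids."""
    return tokenizer(
        texts,
        truncation=True,        # cut longer inputs down to block_size
        padding="max_length",   # pad shorter inputs up to block_size
        max_length=block_size,
    )


def load_tokenizer(name="distilbert-base-uncased"):
    """Load the Hugging Face tokenizer (downloads files on first use)."""
    # Imported lazily so encode_statements can be used with any compatible callable.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name)
```

Calling `encode_statements(["some statement"], load_tokenizer())` returns `input_ids` and `attention_mask` entries that are each 200 items long, which keeps the input shape consistent across batches.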

	#### DistilBERT Uncased Model
	***
	- The base model that is finetuned is DistilBERT, 'distilbert-base-uncased'.
	- It is small and fast, which makes the resulting text classifier well suited to real-time inference.
	- It has 40% fewer parameters than the base BERT model.
	- It runs 60% faster while preserving over 95% of the base BERT model's performance, as measured on the GLUE benchmark.
	- This model outperforms the finetuned 'distilbert-base-cased' by over 5% in average F1-score.
	- This improvement comes mainly from the lower learning rate and improved data preprocessing.
	- These modifications allow for a smoother training curve and convergence.
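For real-time inference, the model could be loaded through the `transformers` text-classification pipeline, as in the sketch below. It assumes this repository ships the config and tokenizer files the pipeline needs; the helper names are hypothetical, and the label names returned depend on the model's config, which this card does not document.

```python
MODEL_ID = "caballeroch/FakeNewsClassifierDistilBert-uncased"


def load_classifier(model_id=MODEL_ID):
    """Build a text-classification pipeline for this model (downloads weights on first use)."""
    # Imported lazily so classify() can be used with any compatible callable.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify(statements, classifier):
    """Return (statement, label, score) triples for a list of statements."""
    statements = list(statements)
    results = classifier(statements)  # one {"label": ..., "score": ...} dict per input
    return [(s, r["label"], r["score"]) for s, r in zip(statements, results)]
```

A typical call would be `classify(["some short political statement"], load_classifier())`. Because the training data was cleaned before tokenization, applying the same cleaning to inputs at inference time is likely to give more representative predictions.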