
	---
	license: apache-2.0
	base_model: sentence-transformers/LaBSE
	tags:
	- generated_from_trainer
	- news
	- russian
	- media
	- text-classification
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	model-index:
	- name: frozen_news_classifier_ft
	  results: []
	datasets:
	- data-silence/rus_news_classifier
	pipeline_tag: text-classification
	language:
	- ru
	library_name: transformers
	---


	# Model description

	This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE), trained on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier).
	The goal was to build a general-purpose classifier for Russian-language news that preserves the base LaBSE model's ability to generate multilingual text embeddings in a single vector space.
	The training dataset is a well-balanced sample of news from the last five years.

	It achieves the following results on the evaluation set:
	- Loss: 0.7314
	- Accuracy: 0.7793
	- F1: 0.7753
	- Precision: 0.7785
	- Recall: 0.7793
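
The card does not say how the multi-class precision, recall, and F1 were averaged. Accuracy and recall are identical in every reported row, which is consistent with support-weighted averaging, since weighted recall always equals accuracy. The sketch below illustrates that relationship in plain Python; the function name `weighted_metrics` and the toy labels are illustrative assumptions, not the evaluation code used for this model.

```python
from collections import Counter


def weighted_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Compute accuracy plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    support = Counter(y_true)      # number of true examples per class
    predicted = Counter(y_pred)    # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / count
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = count / n         # class weight = share of true examples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (not taken from the evaluation set)
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

With weighted averaging, each class's recall is multiplied by its share of the true labels, so the sum collapses to the fraction of correct predictions overall.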

	## How to use

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForSequenceClassification

	universal_model_name = "data-silence/frozen_news_classifier_ft"
	universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
	universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)

	# Move the model to the target device and switch it to evaluation mode
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	universal_model = universal_model.to(device)
	universal_model.eval()

	id2label = {
	    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
	    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
	}


	def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
	    """Return L2-normalized embeddings for a list of texts."""
	    # Tokenize the input texts and move the tensors to the model's device
	    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True, truncation=True).to(device)
	    with torch.no_grad():
	        outputs = universal_model.base_model(**inputs)
	        embeddings = outputs.pooler_output
	        embeddings = torch.nn.functional.normalize(embeddings, dim=1)
	    return embeddings.tolist()


	def predict_category(news: list[str]) -> list[str]:
	    """Predict the category of each news text in the list."""
	    # Tokenize with padding and truncation, then move the tensors to the model's device
	    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True, padding=True).to(device)
	    # Get the model logits
	    with torch.no_grad():
	        outputs = universal_model(**inputs)
	        logits = outputs.logits

	    # Get the indices of the predicted labels
	    predicted_labels = torch.argmax(logits, dim=-1).tolist()
	    # Map the indices to category names
	    predicted_categories = [id2label[label] for label in predicted_labels]
	    return predicted_categories
	```
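
Because `create_sentence_or_batch_embeddings` returns L2-normalized vectors, the cosine similarity between two texts reduces to a plain dot product. The following is a minimal sketch of ranking candidates against a query on that basis. The helper names `dot_similarity` and `rank_by_similarity` are hypothetical, and the short toy vectors stand in for real embeddings so the snippet runs without downloading the model.

```python
def dot_similarity(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query: list[float], candidates: list[list[float]]) -> list[int]:
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [dot_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical 3-dimensional unit vectors standing in for real embeddings
query = [1.0, 0.0, 0.0]
candidates = [
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
    [1.0, 0.0, 0.0],
]
ranking = rank_by_similarity(query, candidates)
print(ranking)
```

In practice, `query` and `candidates` would come from `create_sentence_or_batch_embeddings(...)`, for example one call on a query headline and one call on a list of candidate headlines.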



	## Intended uses & limitations

	This model scores noticeably lower on news classification than my specialized model [any-news-classifier](https://huggingface.co/data-silence/any-news-classifier), which is built specifically for that task. It is intended for cases where LaBSE-style multilingual embeddings are needed alongside the predicted category.


	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 1e-05
	- train_batch_size: 16
	- eval_batch_size: 16
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- lr_scheduler_warmup_steps: 500
	- num_epochs: 10
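
To make the `linear` scheduler with 500 warmup steps concrete, here is a rough sketch of the learning rate as a function of the optimizer step: a linear ramp from 0 to the base rate during warmup, then a linear decay to 0. The function name `linear_warmup_lr` is hypothetical, and the total of 35,960 steps is an assumption (3596 steps per epoch, as reported in the results table, times the 10 configured epochs). The results table only logs 4 epochs, so the schedule actually traversed may have been shorter.

```python
def linear_warmup_lr(
    step: int,
    base_lr: float = 1e-5,       # learning_rate from the list above
    warmup_steps: int = 500,     # lr_scheduler_warmup_steps from the list above
    total_steps: int = 35_960,   # ASSUMPTION: 3596 steps/epoch * 10 epochs
) -> float:
    """Learning rate at a given step for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: scale linearly from 0 up to base_lr
        return base_lr * (step / warmup_steps)
    # Decay: scale linearly from base_lr down to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / (total_steps - warmup_steps))


# Inspect the schedule at a few points: start, mid-warmup, peak, end of epoch 4, end
for step in (0, 250, 500, 14_384, 35_960):
    print(step, linear_warmup_lr(step))
```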

	### Training results

	| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1     | Precision | Recall |
	|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
	| 0.8422        | 1.0   | 3596  | 0.8104          | 0.7681   | 0.7632 | 0.7669    | 0.7681 |
	| 0.7923        | 2.0   | 7192  | 0.7738          | 0.7711   | 0.7666 | 0.7700    | 0.7711 |
	| 0.7597        | 3.0   | 10788 | 0.7485          | 0.7754   | 0.7716 | 0.7741    | 0.7754 |
	| 0.7564        | 4.0   | 14384 | 0.7314          | 0.7793   | 0.7753 | 0.7785    | 0.7793 |


	### Framework versions

	- Transformers 4.42.4
	- Pytorch 2.4.0+cu121
	- Datasets 2.21.0
	- Tokenizers 0.19.1