---
license: apache-2.0
base_model: sentence-transformers/LaBSE
tags:
- generated_from_trainer
- news
- russian
- media
- text-classification
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: frozen_news_classifier_ft
results: []
datasets:
- data-silence/rus_news_classifier
pipeline_tag: text-classification
language:
- ru
library_name: transformers
---
# Model description
This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE), trained on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier).
The goal was a universal classifier for Russian-language news that preserves the base LaBSE model's ability to generate multilingual text embeddings in a single vector space.
The training dataset is a well-balanced sample of news from the last five years.
It achieves the following results on the evaluation set:
- Loss: 0.7314
- Accuracy: 0.7793
- F1: 0.7753
- Precision: 0.7785
- Recall: 0.7793
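
The reported recall equals the accuracy (0.7793). That is what support-weighted averaging over classes produces, although the averaging mode is not stated in this card, so treat it as an assumption. A minimal pure-Python sketch on made-up labels shows why the two numbers coincide under that assumption:

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_recall(y_true, y_pred):
    """Per-class recall averaged with class support as the weights."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, count in support.items():
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        score += (count / total) * (hits / count)
    return score


# Toy labels invented for illustration only (not taken from the evaluation set).
y_true = ["sports", "sports", "economy", "politics", "politics", "politics"]
y_pred = ["sports", "economy", "economy", "politics", "politics", "society"]

print(accuracy(y_true, y_pred), weighted_recall(y_true, y_pred))
```

Each class contributes `support/total * hits/support = hits/total`, so the weighted sum is the overall share of correct predictions.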
## How to use
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

universal_model_name = "data-silence/frozen_news_classifier_ft"
universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)

# Put the model in evaluation mode and move it to the available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
universal_model = universal_model.to(device)
universal_model.eval()

id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
}


def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
    """Return L2-normalized embeddings for a list of texts."""
    # Tokenize the input texts
    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = universal_model.base_model(**inputs)
        embeddings = outputs.pooler_output
        embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    return embeddings.tolist()


def predict_category(news: list[str]) -> list[str]:
    """Predict the category of one or more news texts."""
    # Tokenize with padding and truncation, then move the tensors to the model's device
    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True, padding=True).to(device)
    # Get the model logits
    with torch.no_grad():
        outputs = universal_model(**inputs)
        logits = outputs.logits
    # Indices of the predicted labels
    predicted_labels = torch.argmax(logits, dim=-1).tolist()
    # Map the indices to category names
    predicted_categories = [id2label[label] for label in predicted_labels]
    return predicted_categories
```
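
Because `create_sentence_or_batch_embeddings` returns L2-normalized vectors, the cosine similarity between two texts reduces to a dot product. The helper below is a hypothetical addition, not part of the model's API, and the toy vectors stand in for real embeddings so the sketch runs without downloading the model:

```python
def cosine_similarity_matrix(embeddings: list[list[float]]) -> list[list[float]]:
    """Pairwise cosine similarity for L2-normalized vectors (plain dot products)."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in embeddings]
        for u in embeddings
    ]


# Hypothetical stand-in vectors; in practice pass the output of
# create_sentence_or_batch_embeddings([...]) instead.
toy_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
similarities = cosine_similarity_matrix(toy_embeddings)
print(similarities[0][1])
```

With real embeddings, a Russian sentence and its translation into another language would be expected to score close to each other if the shared vector space has been preserved as intended.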
## Intended uses & limitations
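Compared to my specialized model [any-news-classifier](https://huggingface.co/data-silence/any-news-classifier), which is built solely for news classification, this model shows noticeably worse classification metrics. It is intended for cases where a single model has to provide both a news category and LaBSE-style multilingual embeddings.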
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10
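
To make the `linear` scheduler with 500 warmup steps concrete, here is a minimal sketch of the learning rate at a given optimizer step. The total of 35960 steps is an assumption: 3596 steps per epoch (from the results table) times the configured 10 epochs. The table logs only four epochs, so the actual run may have ended earlier.

```python
def linear_lr(step: int, base_lr: float = 1e-5, warmup_steps: int = 500,
              total_steps: int = 35960) -> float:
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# total_steps is assumed (3596 steps/epoch * 10 epochs), not confirmed by the card.
for step in (0, 250, 500, 3596, 14384, 35960):
    print(step, linear_lr(step))
```

The rate climbs from zero to 1e-05 over the first 500 steps and then falls linearly, so by the end of epoch 4 (step 14384) it would still be above half of the peak value under the assumed schedule length.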
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.8422 | 1.0 | 3596 | 0.8104 | 0.7681 | 0.7632 | 0.7669 | 0.7681 |
| 0.7923 | 2.0 | 7192 | 0.7738 | 0.7711 | 0.7666 | 0.7700 | 0.7711 |
| 0.7597 | 3.0 | 10788 | 0.7485 | 0.7754 | 0.7716 | 0.7741 | 0.7754 |
| 0.7564 | 4.0 | 14384 | 0.7314 | 0.7793 | 0.7753 | 0.7785 | 0.7793 |
### Framework versions
- Transformers 4.42.4
- PyTorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1