---
license: apache-2.0
base_model: sentence-transformers/LaBSE
tags:
- generated_from_trainer
- news
- russian
- media
- text-classification
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: frozen_news_classifier_ft
  results: []
datasets:
- data-silence/rus_news_classifier
pipeline_tag: text-classification
language:
- ru
library_name: transformers
---

# frozen_news_classifier_ft

This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier).
The training dataset is a well-balanced sample of news from the last five years.

It achieves the following results on the evaluation set (a sketch of the metric computation follows the list):
- Loss: 0.7314
- Accuracy: 0.7793
- F1: 0.7753
- Precision: 0.7785
- Recall: 0.7793
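
These numbers are consistent with a standard `compute_metrics` callback for the `Trainer`. A minimal sketch; the original training script is not published, so the `weighted` averaging is an assumption:

```python
# Hypothetical reconstruction of the metric computation; "weighted" averaging
# is an assumption, not confirmed by the model card.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```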

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

universal_model_name = "data-silence/frozen_news_classifier_ft"
universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)

# Put the model in evaluation mode and move it to the available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
universal_model = universal_model.to(device)
universal_model.eval()

id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
}


def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
    """Returns embeddings for a list of texts."""
    # Tokenize the input texts
    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = universal_model.base_model(**inputs)
    embeddings = outputs.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    return embeddings.tolist()


def predict_category(news: list[str]) -> list[str]:
    """Predicts the category for one or more news texts."""
    # Tokenize with padding and truncation, and move tensors to the model's device
    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True, padding=True).to(device)
    # Get the model logits
    with torch.no_grad():
        outputs = universal_model(**inputs)
        logits = outputs.logits

    # Take the indices of the predicted labels
    predicted_labels = torch.argmax(logits, dim=-1).tolist()
    # Map label indices to category names
    predicted_categories = [id2label[label] for label in predicted_labels]
    return predicted_categories
```
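
A quick usage example (the headlines below are made-up illustrations, and the printed categories depend on the actual model output):

```python
news = [
    "Центробанк повысил ключевую ставку",   # "The central bank raised the key rate"
    "Сборная выиграла чемпионат мира",      # "The national team won the world championship"
]

print(predict_category(news))
# e.g. ['economy', 'sports']

embeddings = create_sentence_or_batch_embeddings(news)
print(len(embeddings), len(embeddings[0]))  # 2 texts, 768-dimensional LaBSE vectors
```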


## Model description

The goal of this model is to provide a universal classifier for Russian-language news that preserves the base LaBSE model's ability to generate multilingual text embeddings in a single vector space.
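
Because the embeddings stay in LaBSE's shared multilingual space, semantically similar news in different languages should map to nearby vectors. A minimal check, reusing `create_sentence_or_batch_embeddings` from the snippet above (the example texts are illustrative):

```python
import numpy as np

ru_text = "Учёные обнаружили новую экзопланету"           # Russian
en_text = "Scientists have discovered a new exoplanet"    # English

ru_vec, en_vec = create_sentence_or_batch_embeddings([ru_text, en_text])
# The embeddings are already L2-normalized, so a dot product is the cosine similarity
similarity = float(np.dot(ru_vec, en_vec))
print(f"cross-lingual cosine similarity: {similarity:.3f}")
```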



## Intended uses & limitations

Compared to my specialized model [any-news-classifier](https://huggingface.co/data-silence/any-news-classifier), which is designed solely for news classification, this model shows noticeably worse classification metrics; in return, it retains LaBSE's multilingual embedding capabilities.


## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10
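
For reference, a sketch of equivalent `TrainingArguments`; the output directory and the per-epoch evaluation strategy are assumptions inferred from the results table below:

```python
# Sketch only: reconstructs the listed hyperparameters with the standard
# transformers Trainer API; output_dir and eval_strategy are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="frozen_news_classifier_ft",  # assumed
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=500,
    num_train_epochs=10,
    eval_strategy="epoch",  # assumed; the table below reports per-epoch validation
)
```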

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1     | Precision | Recall |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.8422        | 1.0   | 3596  | 0.8104          | 0.7681   | 0.7632 | 0.7669    | 0.7681 |
| 0.7923        | 2.0   | 7192  | 0.7738          | 0.7711   | 0.7666 | 0.7700    | 0.7711 |
| 0.7597        | 3.0   | 10788 | 0.7485          | 0.7754   | 0.7716 | 0.7741    | 0.7754 |
| 0.7564        | 4.0   | 14384 | 0.7314          | 0.7793   | 0.7753 | 0.7785    | 0.7793 |


### Framework versions

- Transformers 4.42.4
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1