---
license: apache-2.0
base_model: sentence-transformers/LaBSE
tags:
- generated_from_trainer
- news
- russian
- media
- text-classification
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: frozen_news_classifier_ft
  results: []
datasets:
- data-silence/rus_news_classifier
pipeline_tag: text-classification
language:
- ru
library_name: transformers
---

# frozen_news_classifier_ft

This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier). The training dataset is a well-balanced sample of news from the last five years.

It achieves the following results on the evaluation set:
- Loss: 0.7314
- Accuracy: 0.7793
- F1: 0.7753
- Precision: 0.7785
- Recall: 0.7793

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

universal_model_name = "data-silence/frozen_news_classifier_ft"
universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)

# Put the model in evaluation mode and move it to the right device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
universal_model = universal_model.to(device)
universal_model.eval()

id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy',
    4: 'gloss', 5: 'health', 6: 'politics', 7: 'science',
    8: 'society', 9: 'sports', 10: 'travel'
}


def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
    """Returns the embeddings for a list of texts."""
    # Tokenize the input texts
    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True,
                                 truncation=True).to(device)
    with torch.no_grad():
        outputs = universal_model.base_model(**inputs)
    # L2-normalize the pooled sentence embeddings
    embeddings = outputs.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    return embeddings.tolist()


def predict_category(news: list[str]) -> list[str]:
    """Predicts the category for one or more news texts."""
    # Tokenize with padding and truncation enabled; move tensors to the model's device
    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True,
                                 padding=True).to(device)
    # Get the model's logits
    with torch.no_grad():
        outputs = universal_model(**inputs)
    logits = outputs.logits
    # Take the index of the highest-scoring label for each text
    predicted_labels = torch.argmax(logits, dim=-1).tolist()
    # Map label indices to category names
    predicted_categories = [id2label[label] for label in predicted_labels]
    return predicted_categories
```

## Model description

The goal of this model is to provide a universal classifier for Russian-language news while preserving the ability of the base LaBSE model to generate multilingual text embeddings in a single vector space.

## Intended uses & limitations

Compared to my specialized model [any-news-classifier](https://huggingface.co/data-silence/any-news-classifier), which is designed purely for news classification, this model shows meaningfully worse metrics.
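As a quick sanity check, the helpers from the **How to use** section can be exercised as below. The example sentences and the expected labels are illustrative assumptions, not verified outputs of the model:

```python
# Assumes the model, tokenizer, and helpers from "How to use" are in scope.
news = [
    "Сборная выиграла чемпионат мира по хоккею",  # "The national team won the ice hockey world championship"
    "Центробанк повысил ключевую ставку",         # "The central bank raised the key rate"
]
print(predict_category(news))  # plausibly ['sports', 'economy']

# The embeddings are L2-normalized, so cosine similarity reduces to a dot product.
# If the multilingual embedding space is preserved as intended, parallel sentences
# in different languages should land close to each other.
ru, en = create_sentence_or_batch_embeddings([
    "Сборная выиграла чемпионат мира по хоккею",
    "The national team won the ice hockey world championship",
])
print(sum(a * b for a, b in zip(ru, en)))  # expected to be close to 1.0
```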
### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 10

A hypothetical sketch of how this configuration maps onto the `Trainer` API is given after the framework versions below.

### Training results

| Training Loss | Epoch | Step  | Validation Loss | Accuracy | F1     | Precision | Recall |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:------:|:---------:|:------:|
| 0.8422        | 1.0   | 3596  | 0.8104          | 0.7681   | 0.7632 | 0.7669    | 0.7681 |
| 0.7923        | 2.0   | 7192  | 0.7738          | 0.7711   | 0.7666 | 0.7700    | 0.7711 |
| 0.7597        | 3.0   | 10788 | 0.7485          | 0.7754   | 0.7716 | 0.7741    | 0.7754 |
| 0.7564        | 4.0   | 14384 | 0.7314          | 0.7793   | 0.7753 | 0.7785    | 0.7793 |

### Framework versions

- Transformers 4.42.4
- Pytorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
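The sketch below shows how the hyperparameters above might map onto the Hugging Face `Trainer`. The dataset column names (`text`, `labels`), the split names, and the frozen-encoder step (suggested by the "frozen" model name) are assumptions, not confirmed details of the original training script:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("data-silence/rus_news_classifier")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModelForSequenceClassification.from_pretrained(
    "sentence-transformers/LaBSE", num_labels=11)

# Freeze the LaBSE encoder so that only the classification head is trained,
# preserving the multilingual embedding space (an assumption suggested by
# the "frozen" model name).
for param in model.base_model.parameters():
    param.requires_grad = False


def tokenize(batch):
    # The "text" column name is an assumption about the dataset schema;
    # the integer class is assumed to live in a "labels" column.
    return tokenizer(batch["text"], truncation=True)


tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="frozen_news_classifier_ft",
    learning_rate=1e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    warmup_steps=500,
    seed=42,
    eval_strategy="epoch",
)

# With a tokenizer passed, Trainer pads batches dynamically via DataCollatorWithPadding.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],  # split name is an assumption
    tokenizer=tokenizer,
)
trainer.train()
```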