data-silence committed
Commit 92be7cb
1 parent: 88b135d

Update README.md

Files changed (1):
  1. README.md (+68 -7)

README.md CHANGED
 
---
license: apache-2.0
base_model: sentence-transformers/LaBSE
tags:
- generated_from_trainer
- news
- russian
- media
- text-classification
metrics:
- accuracy
- f1
model-index:
- name: frozen_news_classifier_ft
  results: []
datasets:
- data-silence/rus_news_classifier
pipeline_tag: text-classification
language:
- ru
library_name: transformers
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

  # frozen_news_classifier_ft

This model is a fine-tuned version of [sentence-transformers/LaBSE](https://huggingface.co/sentence-transformers/LaBSE) on my [news dataset](https://huggingface.co/datasets/data-silence/rus_news_classifier).
The training dataset is a well-balanced sample of news from the last five years.
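
If you want to look at the training data yourself, the dataset can be pulled from the Hub with the `datasets` library. This is a minimal sketch; the `train` split name and the column layout are assumptions, so check the dataset card for the actual structure.

```python
from datasets import load_dataset

# Load the fine-tuning dataset from the Hub
dataset = load_dataset("data-silence/rus_news_classifier")
print(dataset)              # available splits and columns
print(dataset["train"][0])  # first example; assumes a "train" split exists
```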

It achieves the following results on the evaluation set:
- Loss: 0.7314
- Accuracy: 0.7793
- Precision: 0.7785
- Recall: 0.7793

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

universal_model_name = "data-silence/frozen_news_classifier_ft"
universal_tokenizer = AutoTokenizer.from_pretrained(universal_model_name)
universal_model = AutoModelForSequenceClassification.from_pretrained(universal_model_name)

# Put the model in evaluation mode and move it to the available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
universal_model = universal_model.to(device)
universal_model.eval()

id2label = {
    0: 'climate', 1: 'conflicts', 2: 'culture', 3: 'economy', 4: 'gloss',
    5: 'health', 6: 'politics', 7: 'science', 8: 'society', 9: 'sports', 10: 'travel'
}


def create_sentence_or_batch_embeddings(sent: list[str]) -> list[list[float]]:
    """Returns the embeddings for a list of texts."""
    # Tokenize the input texts
    inputs = universal_tokenizer(sent, return_tensors="pt", padding=True, truncation=True).to(device)
    with torch.no_grad():
        outputs = universal_model.base_model(**inputs)
    # Take the pooled output and L2-normalize it
    embeddings = outputs.pooler_output
    embeddings = torch.nn.functional.normalize(embeddings, dim=1)
    return embeddings.tolist()


def predict_category(news: list[str]) -> list[str]:
    """Predicts the category for one or more news texts."""
    # Tokenize with padding and truncation enabled, and move the batch to the model's device
    inputs = universal_tokenizer(news, return_tensors="pt", truncation=True, padding=True).to(device)
    # Get the model logits
    with torch.no_grad():
        outputs = universal_model(**inputs)
        logits = outputs.logits

    # Indices of the predicted labels
    predicted_labels = torch.argmax(logits, dim=-1).tolist()
    # Map label indices to category names
    predicted_categories = [id2label[label] for label in predicted_labels]
    return predicted_categories
```
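
For example, the helpers above can be called like this. The news texts are made-up placeholders and the printed labels are only illustrative, so the actual output depends on the model.

```python
news = [
    "Сборная выиграла чемпионат мира по хоккею.",      # "The national team won the ice hockey world championship."
    "Центральный банк повысил ключевую ставку.",        # "The central bank raised its key interest rate."
]

print(predict_category(news))
# e.g. ['sports', 'economy'] -- actual labels depend on the model's predictions

embeddings = create_sentence_or_batch_embeddings(news)
print(len(embeddings), len(embeddings[0]))  # batch size and embedding dimension
```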

## Model description

The goal of this model was to create a universal classifier for Russian-language news that preserves the ability of the base LaBSE model to generate multilingual text embeddings in a single vector space.
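
A quick way to check that the shared embedding space survives the fine-tuning is to compare the embeddings of a sentence and its translation. This is an illustrative sketch reusing `create_sentence_or_batch_embeddings` from the snippet above; the example sentences and the expected similarity are assumptions, not measured results.

```python
import torch

# A Russian sentence and its English translation should land close together
# if the multilingual embedding space is preserved.
ru = "Учёные обнаружили новую экзопланету."            # "Scientists discovered a new exoplanet."
en = "Scientists have discovered a new exoplanet."

emb = torch.tensor(create_sentence_or_batch_embeddings([ru, en]))
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")  # expected to be close to 1
```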
 
## Intended uses & limitations

Compared to my specialized model [any-news-classifier](https://huggingface.co/data-silence/any-news-classifier), which is designed specifically for news classification, this model shows meaningfully worse metrics.

### Training hyperparameters