gonzalez-agirre committed on
Commit d933abc
1 Parent(s): 33f9d6c

Update README.md

Files changed (1)
  1. README.md +59 -4
README.md CHANGED

# Catalan BERTa-v2 (roberta-base-ca-v2) fine-tuned for Part-of-speech tagging (POS)

## Table of Contents
- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Use](#how-to-use)
- [Training](#training)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Variables and Metrics](#variables-and-metrics)
  - [Evaluation Results](#evaluation-results)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
- [Funding](#funding)
- [Contributions](#contributions)

## Model Description

The **roberta-base-ca-v2-cased-pos** is a Part-of-speech tagging (POS) model for the Catalan language, fine-tuned from the [roberta-base-ca-v2](https://huggingface.co/projecte-aina/roberta-base-ca-v2) model, a [RoBERTa](https://arxiv.org/abs/1907.11692) base model pre-trained on a medium-size corpus collected from publicly available corpora and crawlers (see the roberta-base-ca-v2 model card for more details).

## Intended Uses and Limitations

The **roberta-base-ca-v2-cased-pos** model can be used for Part-of-speech tagging (POS) of Catalan text. The model is limited by its training dataset and may not generalize well to all use cases.

## How to Use

Here is how to use this model:

```python
from transformers import pipeline
from pprint import pprint

# load the fine-tuned POS model into a token-classification pipeline
nlp = pipeline("token-classification", model="projecte-aina/roberta-base-ca-v2-cased-pos")
example = "Em dic Lluïsa i visc a Santa Maria del Camí."

# predict a POS tag for every token in the example
pos_results = nlp(example)
pprint(pos_results)
```
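
Each entry in `pos_results` follows the standard `transformers` token-classification output format: `word` holds the token and `entity` holds its predicted tag. As a small usage sketch (not part of the original card), the tag sequence can be viewed compactly with:

```python
# pair each token with its predicted POS tag
print([(token["word"], token["entity"]) for token in pos_results])
```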

## Training

### Training Data

We used the Catalan POS data from the [Universal Dependencies Treebank](https://huggingface.co/datasets/universal_dependencies), which we refer to as _Ancora-ca-pos_, for training and evaluation.
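
For reference, this data can be loaded from the Hugging Face Hub as sketched below; the `ca_ancora` configuration name comes from the dataset card, and the project's actual preprocessing may differ (recent versions of `datasets` may also require `trust_remote_code=True` for this script-based dataset):

```python
from datasets import load_dataset

# Catalan-AnCora configuration of the Universal Dependencies dataset
ancora = load_dataset("universal_dependencies", "ca_ancora")

# each example carries the tokens and their universal POS tags
print(ancora["train"][0]["tokens"])
print(ancora["train"][0]["upos"])
```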

### Training Procedure

The model was trained with a batch size of 16 and a learning rate of 5e-5 for 5 epochs. We then selected the best checkpoint using the downstream task metric on the corresponding development set, and finally evaluated it on the test set.
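
A minimal sketch of how such a run could be configured with the `transformers` `Trainer` follows; only the batch size, learning rate, number of epochs, and best-checkpoint selection come from the paragraph above, while everything else (including the 17-label UPOS head) is an illustrative assumption, not the project's actual fine-tuning script:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("projecte-aina/roberta-base-ca-v2")
model = AutoModelForTokenClassification.from_pretrained(
    "projecte-aina/roberta-base-ca-v2",
    num_labels=17,  # assumption: one label per universal POS tag
)

args = TrainingArguments(
    output_dir="roberta-base-ca-v2-cased-pos",
    per_device_train_batch_size=16,  # batch size 16
    learning_rate=5e-5,              # learning rate 5e-5
    num_train_epochs=5,              # 5 epochs
    evaluation_strategy="epoch",     # score the dev set every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the best checkpoint
    metric_for_best_model="f1",      # select on the downstream task metric
)

# trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
#                   train_dataset=..., eval_dataset=..., compute_metrics=...)
```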

## Evaluation

### Variables and Metrics

This model was fine-tuned by maximizing the F1 score.
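
For a single-label tagging task such as POS, the micro-averaged F1 over tokens reduces to token-level accuracy. A small illustration (whether the project's evaluation scripts compute the metric exactly this way is an assumption):

```python
from sklearn.metrics import f1_score

# gold and predicted tags for three tokens; two of the three match
references = ["PRON", "VERB", "PROPN"]
predictions = ["PRON", "VERB", "NOUN"]
print(f1_score(references, predictions, average="micro"))  # 0.666...
```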

## Evaluation Results

We evaluated the _roberta-base-ca-v2-cased-pos_ on the Ancora-ca-pos test set against standard multilingual and monolingual baselines:

| Model | Ancora-ca-pos (F1) |
| ----- | ------------------ |

For more details, check the fine-tuning and evaluation scripts in the official [GitHub repository](https://github.com/projecte-aina/club).

## Licensing Information

[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Citation Information

If you use any of these resources (datasets or models) in your work, please cite our latest paper:

```bibtex
@inproceedings{armengol-estape-etal-2021-multilingual,
```

## Funding

This work was funded by the [Departament de la Vicepresidència i de Polítiques Digitals i Territori de la Generalitat de Catalunya](https://politiquesdigitals.gencat.cat/en/inici/index.html) within the framework of [Projecte AINA](https://politiquesdigitals.gencat.cat/ca/economia/catalonia-ai/aina).

## Contributions

[N/A]