---
license: apache-2.0
library_name: generic
tags:
- text2text-generation
- punctuation
- sentence-boundary-detection
- truecasing
language:
- af
- am
- ar
- bg
- bn
- de
- el
- en
- es
- et
- fa
- fi
- fr
- gu
- hi
- hr
- hu
- id
- is
- it
- ja
- kk
- kn
- ko
- ky
- lt
- lv
- mk
- ml
- mr
- nl
- or
- pa
- pl
- ps
- pt
- ro
- ru
- rw
- so
- sr
- sw
- ta
- te
- tr
- uk
- zh
---

# Model Overview
This is a fine-tuned `xlm-roberta` model that restores punctuation, true-cases (capitalizes), and detects sentence boundaries (full stops) in 47 languages.

## Post-Punctuation Tokens
This model predicts the following set of punctuation tokens after each subword:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| \<ACRONYM\> | Every character in this subword is followed by a period | Primarily English, some European |
| . | Latin full stop | Many |
| , | Latin comma | Many |
| ? | Latin question mark | Many |
| ？ | Full-width question mark | Chinese, Japanese |
| ， | Full-width comma | Chinese, Japanese |
| 。 | Full-width full stop | Chinese, Japanese |
| 、 | Ideographic comma | Chinese, Japanese |
| ・ | Middle dot | Japanese |
| । | Danda | Hindi, Bengali, Oriya |
| ؟ | Arabic question mark | Arabic |
| ; | Greek question mark | Greek |
| ። | Ethiopic full stop | Amharic |
| ፣ | Ethiopic comma | Amharic |
| ፧ | Ethiopic question mark | Amharic |

## Pre-Punctuation Tokens
This model predicts the following set of punctuation tokens before each subword:

| Token | Description | Relevant Languages |
| ---: | :---------- | :----------- |
| \<NULL\> | No punctuation | All |
| ¿ | Inverted question mark | Spanish |

# Training Details
This model was trained in the NeMo framework.

## Training Data
This model was trained with News Crawl data from WMT. 1M lines of text were used for each language, except for a few low-resource languages, which may have used less. Languages were chosen based on whether the News Crawl corpus contained enough reliable-quality data, as judged by the author.

# Limitations
This model was trained on news data, and may not perform well on conversational or informal data. Further, this model is unlikely to be of production quality. It was trained with "only" 1M lines per language, and the dev sets may have been noisy due to the nature of web-scraped news data. This is also a base-sized model with many languages and many tasks, so capacity may be limited.

# Evaluation
In these metrics, keep in mind that

1. The data is noisy.
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and is sometimes incorrect. When conditioned on reference punctuation, true-casing and sentence boundary detection are practically 100% for most languages.
3. Punctuation can be subjective, e.g., both `Hola mundo, ¿cómo estás?` and `Hola mundo. ¿Cómo estás?` are acceptable. When the sentences are longer and more practical, these ambiguities abound and affect all three metrics.

## Test Data and Example Generation
Each test example was generated using the following procedure (see the sketch below):

1. Concatenate 10 random sentences.
2. Lower-case the concatenated sentence.
3. Remove all punctuation.

The data is a held-out portion of News Crawl, which has been deduplicated. 3,000 lines of data per language were used, generating 3,000 unique examples of 10 sentences each. The last 4 sentences of each example were randomly sampled from the 3,000 and may be duplicated. Examples longer than the model's maximum length were truncated.
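
The procedure above can be expressed as a short script. This is a minimal sketch, not the actual evaluation code: the `sentences` list, the sampling details, and the exact punctuation set to strip are assumptions.

```python
import random
import string

# Punctuation to strip: ASCII marks plus the non-Latin marks this model
# predicts (full-width, danda, Arabic/Greek question marks, Ethiopic).
# Assumed for illustration; the evaluation may have used a different set.
PUNCT = string.punctuation + "？，。、・।؟;።፣፧¿"

def make_test_example(sentences, n=10):
    """Build one test input: concatenate n random sentences,
    lower-case the result, and remove all punctuation."""
    reference = " ".join(random.sample(sentences, n))
    stripped = reference.lower().translate(str.maketrans("", "", PUNCT))
    return stripped, reference
```

Running the model on `stripped` and comparing against `reference` yields the punctuation, true-casing, and full-stop targets for each example.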
The number of sentences affected by truncation can be estimated from the "full stop" support: with 3,000 examples of 10 sentences each, we expect 30,000 full-stop targets per language in total.

## Selected Language Evaluation Reports