Model Overview

This model performs sentence boundary detection (SBD) on text in 49 common languages.

This model segments a long, punctuated text into one or more constituent sentences.

A key feature is that the model is multi-lingual and language-agnostic at inference time. Therefore, language tags do not need to be used and a single batch can contain multiple languages.

Model Inputs and Outputs

The model inputs should be punctuated texts.

For each input subword t, this model predicts the probability that t is the final token of a sentence (i.e., a sentence boundary).
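As a rough illustration of how per-subword probabilities can be turned into sentences, the sketch below closes a sentence after every subword whose boundary probability meets a threshold. The function name, the 0.5 threshold, and the toy tokens and probabilities are assumptions made for this example; the punctuators package may decode differently.

```python
from typing import List


def split_at_boundaries(
    subwords: List[str], probs: List[float], threshold: float = 0.5
) -> List[List[str]]:
    """Group subwords into sentences, ending a sentence after each
    subword whose boundary probability is >= threshold (assumed value)."""
    if len(subwords) != len(probs):
        raise ValueError("subwords and probs must have the same length")
    sentences: List[List[str]] = []
    current: List[str] = []
    for subword, prob in zip(subwords, probs):
        current.append(subword)
        if prob >= threshold:
            sentences.append(current)
            current = []
    # Keep any trailing subwords that were not closed by a boundary
    if current:
        sentences.append(current)
    return sentences


# Made-up subwords and probabilities, for illustration only
tokens = ["▁hello", "▁world", ".", "▁how", "▁are", "▁you", "?"]
probs = [0.01, 0.02, 0.97, 0.01, 0.03, 0.02, 0.99]
sentences = split_at_boundaries(tokens, probs)
print(sentences)
```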

Example Usage

The easiest way to use this model is to install punctuators:

$ pip install punctuators


from typing import List

from punctuators.models import SBDModelONNX

# Instantiate this model
# This will download the ONNX and SPE models. To clean up, delete this model from your HF cache directory.
m = SBDModelONNX.from_pretrained("sbd_multi_lang")

input_texts: List[str] = [
    # English (with a lot of acronyms)
    "the new d.n.a. sample has been multiplexed, and the gametes are already dividing. let's get the c.p.d. over there. dinner's at 630 p.m. see that piece on you in the l.a. times? chicago p.d. will eat him alive.",
    # Chinese
    "魔鬼兵團都死了？但是如果这让你不快乐就别做了。您就不能发个电报吗。我們都準備好了。",
    # Spanish
    "él es uno de aquellos. ¿tiene algo de beber? cómo el aislamiento no vale la pena.",
    # Thai
    "พวกเขาต้องโกรธมากเลยใช่ไหม โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน",
    # Ukrainian (first sentence) and Russian (remaining sentences)
    "розігни і зігни, будь ласка. я знаю, ваши люди храбры. было приятно, правда? для начала, тебе нужен собственный свой самолет.",
    # Polish
    "szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka. ćwiczę już od dwóch tygodni a byłem zabity tylko raz.",
]

# Run inference
results: List[List[str]] = m.infer(input_texts)

# Print each input and its segmented outputs
for input_text, output_texts in zip(input_texts, results):
    print(f"Input: {input_text}")
    print("Outputs:")
    for text in output_texts:
        print(f"\t{text}")
    print()

Expected outputs

Input: the new d.n.a. sample has been multiplexed, and the gametes are already dividing. let's get the c.p.d. over there. dinner's at 630 p.m. see that piece on you in the l.a. times? chicago p.d. will eat him alive.
Outputs:
    the new d.n.a. sample has been multiplexed, and the gametes are already dividing.
    let's get the c.p.d. over there.
    dinner's at 630 p.m.
    see that piece on you in the l.a. times?
    chicago p.d. will eat him alive.

Input: 魔鬼兵團都死了？但是如果这让你不快乐就别做了。您就不能发个电报吗。我們都準備好了。
Outputs:
    魔鬼兵團都死了？
    但是如果这让你不快乐就别做了。
    您就不能发个电报吗。
    我們都準備好了。

Input: él es uno de aquellos. ¿tiene algo de beber? cómo el aislamiento no vale la pena.
Outputs:
    él es uno de aquellos.
    ¿tiene algo de beber?
    cómo el aislamiento no vale la pena.

Input: พวกเขาต้องโกรธมากเลยใช่ไหม โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน
Outputs:
    พวกเขาต้องโกรธมากเลยใช่ไหม
    โทษทีนะลูกของเราไม่เป็นอะไรใช่ไหม
    ถึงเจ้าจะลากข้าไปเจ้าก็ไม่ได้อะไรอยู่ดี
    ผมคิดว่าจะดีกว่านะถ้าคุณไม่ออกไปไหน

Input: розігни і зігни, будь ласка. я знаю, ваши люди храбры. было приятно, правда? для начала, тебе нужен собственный свой самолет.
Outputs:
    розігни і зігни, будь ласка.
    я знаю, ваши люди храбры.
    было приятно, правда?
    для начала, тебе нужен собственный свой самолет.

Input: szedłem tylko do. pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka. ćwiczę już od dwóch tygodni a byłem zabity tylko raz.
Outputs:
    szedłem tylko do.
    pamiętaj, nigdy się nie obawiaj żyć na krawędzi ryzyka.
    ćwiczę już od dwóch tygodni a byłem zabity tylko raz.

Model Architecture

This is a data-driven approach to SBD. The model uses a SentencePiece tokenizer, a BERT-style encoder, and a linear classifier to predict which subwords are sentence boundaries.

Given that this is a relatively easy NLP task, the model contains only ~9M parameters (~8.2M of which are embeddings). This makes the model very fast and cheap at inference time, as SBD should be.

The BERT encoder is based on the following configuration:

8 heads
4 layers
128 hidden dim
512 intermediate/ff dim
64000 embeddings/vocab tokens
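The configuration above can be checked against the stated parameter counts with some quick arithmetic. This sketch assumes a standard BERT layer layout (four attention projection matrices and two feed-forward matrices per layer) and ignores biases, layer norms, position embeddings, and the classifier head, so the total is only approximate.

```python
vocab_size = 64000
hidden_dim = 128
ff_dim = 512
num_layers = 4

# Token embedding table: one hidden_dim vector per vocabulary entry
embedding_params = vocab_size * hidden_dim

# Per-layer weight matrices (assumed standard BERT layout, biases omitted):
# query, key, value and output projections, plus two feed-forward matrices
attention_weights = 4 * hidden_dim * hidden_dim
ff_weights = 2 * hidden_dim * ff_dim
encoder_weights = num_layers * (attention_weights + ff_weights)

approx_total = embedding_params + encoder_weights

print(f"embeddings: {embedding_params:,}")   # 8,192,000 (~8.2M)
print(f"encoder:    {encoder_weights:,}")    # 786,432
print(f"total:      {approx_total:,}")       # 8,978,432 (~9M)
```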

Model Training

This model was trained on a personal fork of NeMo, specifically this sbd branch.

The model was trained for several hundred thousand steps on ~1M lines of text per language (~49M lines total), with a global batch size of 256 examples. Batches were multilingual and generated by randomly sampling each language.

Training Data

This model was trained on OpenSubtitles data.

Although this corpus is very noisy, it is one of the few large-scale text corpora that have been manually segmented.

Automatically-segmented corpora are undesirable for at least two reasons:

The data-driven model would simply learn to mimic the system used to segment the corpus, acquiring no more knowledge than the original system (probably a simple rules-based system).
Rules-based systems fail catastrophically for some languages, which can be hard to detect for a non-speaker of that language (e.g., me).

Heuristics were used to attempt to clean the data before training. Some examples of the cleaning are:

Drop sentences which start with a lower-case letter. Assume these lines are erroneous.
For inputs that do not end with a full stop, append the default full stop for that language. Assume that for single-sentence declarative sentences, full stops are not important for subtitles.
Drop inputs that have more than 20 words (or 32 chars, for continuous-script languages). Assume these lines contain more than one sentence, and therefore we cannot create reliable targets.
Drop objectively junk lines: all punctuation/special characters, empty lines, etc.
Normalize punctuation: no more than one consecutive punctuation token (except Spanish, where inverted punctuation can appear after non-inverted punctuation).
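The heuristics above might look roughly like the following. This is a minimal sketch, not the actual cleaning code: the function name, the punctuation set, and the choice to keep the first mark in a run of punctuation are all assumptions, and the Spanish exception is not handled.

```python
import re
from typing import Optional

# Assumed punctuation set for illustration; the real list is not documented
PUNCT = ".,!?;:。？，！।؟"
_PUNCT_RUN = re.compile(f"([{re.escape(PUNCT)}])[{re.escape(PUNCT)}]+")

MAX_WORDS = 20  # limit for space-delimited languages
MAX_CHARS = 32  # limit for continuous-script languages


def clean_line(
    line: str, full_stop: str = ".", continuous_script: bool = False
) -> Optional[str]:
    """Return a cleaned line, or None if the line should be dropped."""
    text = line.strip()
    # Drop junk: empty lines or lines with no letters or digits
    if not any(ch.isalnum() for ch in text):
        return None
    # Drop lines that start with a lower-case letter
    if text[0].islower():
        return None
    # Drop lines likely to contain more than one sentence
    if continuous_script:
        too_long = len(text) > MAX_CHARS
    else:
        too_long = len(text.split()) > MAX_WORDS
    if too_long:
        return None
    # Collapse runs of punctuation, keeping the first mark (assumption)
    text = _PUNCT_RUN.sub(r"\1", text)
    # Append the language's default full stop if no punctuation ends the line
    if text[-1] not in PUNCT:
        text += full_stop
    return text


print(clean_line("Hello there"))
print(clean_line("hello there."))
```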

Training Example Generation

To create examples for the model, we:

Assume each input line is exactly one sentence
Concatenate sentences together, with the concatenation points becoming the sentence boundary targets

For this particular model, each example consisted of between 1 and 9 sentences concatenated together, which shows the model between 0 and 8 positive targets (sentence boundaries). The number of sentences was chosen uniformly at random, so each example had, on average, 4 sentence boundaries.

This model uses a maximum sequence length of 256, which for OpenSubtitles is relatively long. If, after concatenating sentences, an example contains more than 256 tokens, the sequence is simply truncated to the first 256 subwords.
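The example-generation procedure described above can be sketched as follows. This is an illustration rather than the training code: the function name is hypothetical, the pre-split tokens stand in for SentencePiece subwords, and every sentence in the pool is assumed to be non-empty.

```python
import random
from typing import List, Tuple

MAX_SENTENCES = 9
MAX_LENGTH = 256


def make_example(
    pool: List[List[str]], rng: random.Random
) -> Tuple[List[str], List[int]]:
    """Concatenate 1 to 9 tokenized sentences and mark each concatenation
    point (the last token before the next sentence) as a positive target."""
    num_sentences = rng.randint(1, MAX_SENTENCES)  # uniform, inclusive
    tokens: List[str] = []
    targets: List[int] = []
    for i in range(num_sentences):
        sentence = rng.choice(pool)
        tokens.extend(sentence)
        targets.extend([0] * len(sentence))
        # Only concatenation points are targets, so the final sentence
        # contributes no boundary (giving 0 to 8 positives per example)
        if i < num_sentences - 1:
            targets[-1] = 1
    # Truncate over-long examples to the first MAX_LENGTH tokens
    return tokens[:MAX_LENGTH], targets[:MAX_LENGTH]


# Toy pool of tokenized one-sentence lines
pool = [
    ["how", "are", "you", "?"],
    ["i", "am", "fine", "."],
    ["ok", "."],
]
tokens, targets = make_example(pool, random.Random(0))
print(list(zip(tokens, targets)))
```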

50% of input texts were lower-cased for both the tokenizer and classification models. This provides some augmentation, but more importantly allows this model to be inserted into an NLP pipeline either before or after true-casing. Using this model before true-casing would allow the true-casing model to exploit the conditional probability of sentence boundaries w.r.t. capitalization.
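A minimal sketch of this kind of lower-casing augmentation is shown below. The function name and the per-example coin flip are assumptions; the source only states that 50% of input texts were lower-cased.

```python
import random


def maybe_lowercase(text: str, rng: random.Random, p: float = 0.5) -> str:
    """Lower-case the text with probability p, otherwise return it unchanged."""
    return text.lower() if rng.random() < p else text


rng = random.Random(0)
print(maybe_lowercase("Chicago P.D. will eat him alive.", rng))
```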

Language Specific Rules

The training data was pre-processed for language-specific punctuation and spacing rules.

The following guidelines were used during training. If inference inputs differ, the model may perform poorly.

All spaces were removed from continuous-script languages (Chinese, Japanese).
Chinese/Japanese: These languages use full-width periods "。", question marks "？", and commas "，".
Hindi/Bengali: These languages use the danda "।" as a full-stop, not ".".
Arabic: Uses reverse question marks "؟", not "?".
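If inference inputs do not follow these conventions and the language of each input is known upstream, a pre-processing step along these lines could bring them closer to the training format. This is a naive sketch: the function name and language codes are hypothetical, and blind character replacement can corrupt decimals or abbreviations, so treat it as a starting point only. The model itself still needs no language tag.

```python
import re

# ASCII punctuation mapped to the full-width forms listed above
_FULL_WIDTH = str.maketrans({".": "。", "?": "？", ",": "，"})


def normalize_for_sbd(text: str, lang: str) -> str:
    """Naively rewrite punctuation and spacing to match the training
    conventions for a few languages (language codes are assumed)."""
    if lang in ("zh", "ja"):
        # Continuous-script languages: remove all whitespace,
        # then switch to full-width punctuation
        text = re.sub(r"\s+", "", text)
        return text.translate(_FULL_WIDTH)
    if lang in ("hi", "bn"):
        # Hindi/Bengali use the danda as a full stop
        return text.replace(".", "।")
    if lang == "ar":
        # Arabic uses the reversed question mark
        return text.replace("?", "؟")
    return text


print(normalize_for_sbd("我們 都 準備好了.", "zh"))
```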

Limitations and known issues

Noisy training data

This model was trained on OpenSubtitles, a corpus that is notoriously noisy. The model may have learned some bad habits from this data.

An assumption made during training is that every input line is exactly one sentence. However, that's not always the case. So the model might have some false negatives which are explained by the training data having multiple sentences on some lines.

Language-specific expectations

As discussed in a previous section, each language should be formatted and punctuated per that language's rules.

E.g., Chinese text should contain full-width periods, not Latin periods, and contain no spaces.

In practice, data often does not adhere to these rules, but the model has not been augmented to deal with this potential issue.

Metrics

It's difficult to properly evaluate this model, since we rely on the proposition that the input data contains exactly one sentence per line. In reality, the data sets used thus far are noisy and often contain more than one sentence per line.

Metrics are not published for now, and evaluation is limited to manual spot-checking.

Suitable test sets for this evaluation are being looked into.