1-800-BAD-CODE committed
Commit e7a5edc
1 Parent(s): e2feff2

Update README.md

Files changed (1)
  1. README.md +8 -2
README.md CHANGED
@@ -15,6 +15,7 @@ This model accepts as input lower-cased, unpunctuated English text and performs

In contrast to many similar models, this model can predict punctuated acronyms (e.g., "U.S.") via a special "acronym" class, as well as arbitrarily-capitalized words (NATO, McDonald's, etc.) via multi-label true-casing predictions.

+
# Usage
The easy way to use this model is to install [punctuators](https://github.com/1-800-BAD-CODE/punctuators):

@@ -22,6 +23,7 @@ The easy way to use this model is to install [punctuators](https://github.com/1-
pip install punctuators
```

+
Running the following script should load this model and run some texts:
<details open>
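
The full example script is collapsed inside the `<details>` block above and is not reproduced in this diff. As a rough orientation, loading and running the model through punctuators looks something like the sketch below; the `PunctCapSegModelONNX` class and its `infer()` method come from that package, while the `"pcs_en"` checkpoint identifier is an assumption and may not match this model's actual name:

```python
# Hedged sketch: class and method names are from the punctuators package;
# the "pcs_en" checkpoint identifier below is an assumption.
from typing import List

from punctuators.models import PunctCapSegModelONNX

# Load the ONNX-exported punctuation / true-casing / sentence-boundary model.
m: PunctCapSegModelONNX = PunctCapSegModelONNX.from_pretrained("pcs_en")

# Inputs should be lower-cased, unpunctuated English text.
input_texts: List[str] = [
    "hello friend how's it going it's been a while since we spoke in the us things are fine",
]

# Each input yields a list of punctuated, true-cased sentences.
results: List[List[str]] = m.infer(input_texts)
for text, sentences in zip(input_texts, results):
    print(f"Input: {text}")
    for sentence in sentences:
        print(f"    {sentence}")
```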
 
@@ -99,6 +101,10 @@ Since true-casing should be done on a per-character basis, the classification ne
This scheme captures acronyms, e.g., "NATO", as well as bi-capitalized words, e.g., "MacDonald".


+ The model's maximum length is 256 subtokens. However, the [punctuators](https://github.com/1-800-BAD-CODE/punctuators) package
+ as described above will transparently predict on overlapping subsegments of longer input texts and fuse the results before returning output,
+ allowing inputs to be arbitrarily long.
+
## Punctuation Tokens
This model predicts the following set of punctuation tokens:
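
To make the per-character true-casing scheme above concrete, here is a small illustrative sketch (not the model's actual training or inference code) that derives one binary "upper-case this character" target per character of a cased reference token:

```python
# Illustrative only: per-character true-casing targets, one binary label per
# character, which is what lets the scheme represent "NATO" and "MacDonald".
from typing import List


def truecase_targets(cased_reference: str) -> List[int]:
    """Return 1 for characters that should be upper-cased, else 0."""
    return [1 if ch.isupper() else 0 for ch in cased_reference]


print(truecase_targets("NATO"))       # [1, 1, 1, 1]
print(truecase_targets("MacDonald"))  # [1, 0, 0, 1, 0, 0, 0, 0, 0]
```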
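
The overlap-and-fuse behavior described in the added lines above happens inside punctuators; the sketch below only illustrates the general idea (fixed-size windows with overlap, keeping each position's prediction from the window where it sits closest to the center) and is not the package's actual implementation:

```python
# Assumed illustration, not the punctuators implementation: split a long token
# sequence into overlapping windows, run a per-window predictor, and fuse by
# preferring the window in which a position is most central (i.e., has the
# most context on both sides).
from typing import Callable, List, Sequence


def predict_long(
    tokens: Sequence[str],
    predict_window: Callable[[Sequence[str]], List[str]],
    max_len: int = 256,
    overlap: int = 64,
) -> List[str]:
    stride = max_len - overlap
    fused: List[str] = [""] * len(tokens)
    best_dist: List[float] = [float("inf")] * len(tokens)
    for start in range(0, max(1, len(tokens) - overlap), stride):
        window = tokens[start:start + max_len]
        preds = predict_window(window)
        center = start + len(window) / 2.0
        for offset, pred in enumerate(preds):
            pos = start + offset
            dist = abs(pos - center)
            if dist < best_dist[pos]:
                best_dist[pos] = dist
                fused[pos] = pred
    return fused


def toy_predictor(window: Sequence[str]) -> List[str]:
    """Stand-in for the model: just capitalize each token."""
    return [w.capitalize() for w in window]


# 400 tokens fused from overlapping 16-token windows.
print(predict_long(["hello", "world"] * 200, toy_predictor, max_len=16, overlap=4)[:4])
```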
 
@@ -133,7 +139,7 @@ The training data was noisy, and no manual cleaning was utilized.
Acronyms and abbreviations are especially noisy; the table below shows how many variations of each token appear in the training data.

| Token | Count |
- | ---: | :---------- |
+ | -: | :- |
| Mr | 115232 |
| Mr. | 108212 |

@@ -153,7 +159,7 @@ Thus, the model's acronym and abbreviation predictions may be a bit unpredictabl
In these metrics, keep in mind that
1. The data is noisy
2. Sentence boundaries and true-casing are conditioned on predicted punctuation, which is the most difficult task and sometimes incorrect.
- When conditioning on reference punctuation, true-casing and SBD is practically 100% for most languages.
+ When conditioning on reference punctuation, true-casing and SBD metrics are much higher w.r.t. the reference targets.
4. Punctuation can be subjective. E.g.,

`Hello Frank, how's it going?`
 