latest tokenization_rwkv5.py has a bug with Chinese text

#5
by Cloud-Strife

Hi,

I found an issue with your latest tokenization_rwkv5.py.

In your Python code you use whitespace_tokenize to clean and split the input text. That works well for English, since English words are always separated by whitespace.

But Chinese text usually has no whitespace, so with a Chinese dataset the function below returns only one token, whose length is the same as the whole input text.

def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text.
    The separators are kept.
    """
    text = text.strip()
    if not text:
        return []
    tokens = re.split(b"(?= )", text)
    return tokens
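
A quick check with the function above shows what happens for Chinese input (the sample sentence is my own):

text = "但是中文通常没有空格".encode("utf-8")
tokens = whitespace_tokenize(text)
print(len(tokens))     # 1 -- the whole sentence comes back as a single token
print(len(tokens[0]))  # 30 -- same as the byte length of the input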

This then causes a problem in class WordpieceTokenizer(object) -> def tokenize(self, text):

1> For Chinese text this single token is as long as the whole input, for example 1024 bytes or more even for a short input...
2> The code then does a simple greedy search: end starts at the full length of chars, the slice is checked against the vocab, and on every miss end is decremented by 1.
So when the token is that long, the tokenizer hangs here for a very long time, as the sketch after the code below shows...

for token in whitespace_tokenize(text):
    ...
    while start < len(chars):
        end = len(chars)
        cur_substr = None
        # greedy longest match: shrink the candidate slice from the right,
        # one byte at a time, until it is found in the vocab
        while start < end:
            substr = bytes(chars[start:end])
            if substr in self.vocab:
                cur_substr = substr
                break
            end -= 1
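
To make the cost concrete, here is a small self-contained sketch of that backward search. The input string and the single-byte-only vocab are my own worst-case assumptions, not the real RWKV vocab:

# Worst case for the greedy backward search: one long token with no spaces,
# and only single-byte entries match, so every longer slice is tried first.
text = ("你好" * 128).encode("utf-8")      # one 768-byte "token"
chars = list(text)
vocab = {bytes([b]) for b in range(256)}   # hypothetical vocab with only 1-byte entries

lookups = 0
start = 0
while start < len(chars):
    end = len(chars)
    while start < end:
        lookups += 1
        substr = bytes(chars[start:end])   # a fresh slice is built on every attempt
        if substr in vocab:
            break
        end -= 1
    start = end                            # longest match found, move past it

print(lookups)   # 295296, roughly n*(n+1)/2 lookups for a 768-byte token

Doubling the token length roughly quadruples the number of lookups, and every lookup also copies a slice, so a 1024+ byte Chinese sentence makes the tokenizer look completely stuck.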

I think this code needs to be optimized for such use cases.

Thanks,
Qing

I have a simple way to fix this hang issue for Chinese text.

You only need to add a rule in the function below to manually insert a space after each Chinese character:
def _tokenize(self, text, split_special_tokens=False):
    # insert a space after every character in the CJK Unified Ideographs range
    # so that whitespace_tokenize can split them into separate tokens
    text = re.sub(r'([\u4e00-\u9fa5])', r'\1 ', text)
    return self.wordpiece_tokenizer.tokenize(text.encode("utf-8"))
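
A quick check of what the substitution does (the sample string is just an illustration):

import re

text = "RWKV模型支持中文"
print(re.sub(r'([\u4e00-\u9fa5])', r'\1 ', text))
# prints "RWKV模 型 支 持 中 文 " (note the trailing space)

Note that [\u4e00-\u9fa5] only covers the main CJK Unified Ideographs block, and since whitespace_tokenize keeps the separators, the inserted spaces also end up in the bytes being encoded; even so, it is enough to stop the hang.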

Thanks,
Qing
