latest tokenization_rwkv5.py has a bug with Chinese text

#5
by Cloud-Strife

Hi,

I found an issue with your latest tokenization_rwkv5.py.

In your Python code you use whitespace_tokenize to clean and split the input text. That works well for English, since English words are always separated by whitespace.

But Chinese text usually has no whitespace, so with a Chinese dataset the function below returns only one token, whose length is the same as the whole input text.

def whitespace_tokenize(text):
    """Runs basic whitespace cleaning and splitting on a piece of text.
    The separators are kept.
    """
    text = text.strip()
    if not text:
        return []
    tokens = re.split(b"(?= )", text)
    return tokens
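
A quick check with the function above shows what happens for Chinese input (the sample sentence is my own):

text = "但是中文通常没有空格".encode("utf-8")
tokens = whitespace_tokenize(text)
print(len(tokens))     # 1 -- the whole sentence comes back as a single token
print(len(tokens[0]))  # 30 -- same as the byte length of the input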

This then causes a problem in class WordpieceTokenizer(object) -> def tokenize(self, text):

1> For Chinese text this single token is as long as the whole input, for example 1024 bytes or more even for a short input...
2> The code then does a simple greedy search: end starts at the full length of chars, the slice is checked against the vocab, and on every miss end is decremented by 1.
So when the token is that long, the tokenizer hangs here for a very long time, as the sketch after the code below shows...

for token in whitespace_tokenize(text):
    ...
    while start < len(chars):
        end = len(chars)
        cur_substr = None
        # greedy longest match: shrink the candidate slice from the right,
        # one byte at a time, until it is found in the vocab
        while start < end:
            substr = bytes(chars[start:end])
            if substr in self.vocab:
                cur_substr = substr
                break
            end -= 1
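
To make the cost concrete, here is a small self-contained sketch of that backward search. The input string and the single-byte-only vocab are my own worst-case assumptions, not the real RWKV vocab:

# Worst case for the greedy backward search: one long token with no spaces,
# and only single-byte entries match, so every longer slice is tried first.
text = ("你好" * 128).encode("utf-8")      # one 768-byte "token"
chars = list(text)
vocab = {bytes([b]) for b in range(256)}   # hypothetical vocab with only 1-byte entries

lookups = 0
start = 0
while start < len(chars):
    end = len(chars)
    while start < end:
        lookups += 1
        substr = bytes(chars[start:end])   # a fresh slice is built on every attempt
        if substr in vocab:
            break
        end -= 1
    start = end                            # longest match found, move past it

print(lookups)   # 295296, roughly n*(n+1)/2 lookups for a 768-byte token

Doubling the token length roughly quadruples the number of lookups, and every lookup also copies a slice, so a 1024+ byte Chinese sentence makes the tokenizer look completely stuck.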

I think this code needs to be optimized for such use cases.

Thanks,
Qing

I have a simple way to fix this hang issue for Chinese text.

You only need to add a rule in the function below to manually insert a space after each Chinese character:
def _tokenize(self, text, split_special_tokens=False):
    # insert a space after every character in the CJK Unified Ideographs range
    # so that whitespace_tokenize can split them into separate tokens
    text = re.sub(r'([\u4e00-\u9fa5])', r'\1 ', text)
    return self.wordpiece_tokenizer.tokenize(text.encode("utf-8"))
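
A quick check of what the substitution does (the sample string is just an illustration):

import re

text = "RWKV模型支持中文"
print(re.sub(r'([\u4e00-\u9fa5])', r'\1 ', text))
# prints "RWKV模 型 支 持 中 文 " (note the trailing space)

Note that [\u4e00-\u9fa5] only covers the main CJK Unified Ideographs block, and since whitespace_tokenize keeps the separators, the inserted spaces also end up in the bytes being encoded; even so, it is enough to stop the hang.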

Thanks,
Qing
