---
tags:
- tokenizer
- nlp
---

# NLTK Tokenizer for Transformers 🤗

## 📖 Overview

The NLTK Tokenizer is a custom tokenizer class designed for use with the Hugging Face Transformers library. The `NlktTokenizer` class extends `PreTrainedTokenizer` from Hugging Face's Transformers library to create an [NLTK](https://www.nltk.org/index.html)-based tokenizer. This approach combines the robust pre-training and easy integration features of `PreTrainedTokenizer` with the linguistic processing strengths of NLTK's `word_tokenize`. The result is a tokenizer that is both powerful in handling diverse language patterns and compatible with advanced NLP modeling techniques.

## 🛠️ Installation

To use the NLTK Tokenizer, ensure you have both the `transformers` and `nltk` libraries installed. You can install them using:

- ### With pip

  ```bash
  pip install transformers nltk
  ```

- ### With Conda

  ```bash
  conda install -c huggingface transformers nltk
  ```

## 🚴‍♂️ Getting Started

### Initializing the Tokenizer

- Clone this repo
- Go to the directory where you cloned this repo
- Initialize the NLTK Tokenizer with a vocabulary file. Note that your vocab file should list one token per line:

  ```python
  from tokenization_nltk import NlktTokenizer

  tokenizer = NlktTokenizer(vocab_file='path/to/your/vocabulary.txt')  # e.g. vocab.txt
  ```

- Enjoy 🤗

## 🔬 Basic Usage Examples

1. **Simple Tokenization:**

   ```python
   text = "Hello Shirin, How are you?"
   tokens = tokenizer.tokenize(text)
   print("Tokens:", tokens)
   # output: Tokens: ['Hello', 'Shirin', ',', 'How', 'are', 'you', '?']
   ```

2. **Including Special Tokens:**

   ```python
   text = "Hello, world!"
   tokens = tokenizer.tokenize(text, add_special_tokens=True)
   print(tokens)
   # output: ['', 'Hello', ',', 'world', '!', '']
   ```

3. **Token-ID Conversion:**

   ```python
   tokens = ['the', 'weather', 'IS', 'Sunny', '!']
   # lower() because our vocab.txt is entirely lowercase
   token_ids = [tokenizer.convert_tokens_to_ids(token.lower()) for token in tokens]
   print(token_ids)
   # output: [1997, 4634, 2004, 11560, 1000]
   ```

4. **ID-Token Conversion:**

   ```python
   ids = [1, 24707, 4634, 19238, 1000, 31000]
   tokens = [tokenizer.convert_ids_to_tokens(id) for id in ids]
   print(tokens)
   # output: ['[PAD]', 'cloudy', 'weather', 'sucks', '!', '[UNK]']
   ```

5. **Tokenizing a Long Text:**

   ```python
   long_text = "In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell, nor yet a dry, bare, sandy hole with nothing in it to sit down on or to eat: it was a hobbit-hole, and that means comfort."
   long_tokens = tokenizer.tokenize(long_text)
   print("Tokens:", long_tokens)
   # output: ['In', 'a', 'hole', 'in', 'the', 'ground', 'there', 'lived', 'a', 'hobbit', '.', 'Not', 'a', 'nasty', ',', 'dirty', ',', 'wet', 'hole', ',', 'filled', 'with', 'the', 'ends', 'of', 'worms', 'and', 'an', 'oozy', 'smell', ',', 'nor', 'yet', 'a', 'dry', ',', 'bare', ',', 'sandy', 'hole', 'with', 'nothing', 'in', 'it', 'to', 'sit', 'down', 'on', 'or', 'to', 'eat', ':', 'it', 'was', 'a', 'hobbit-hole', ',', 'and', 'that', 'means', 'comfort', '.']
   ```

6. **Tokenizing Sentences with Emojis:**

   ```python
   text_with_emoji = "I love pizza 🍕! Do you like it too?"
   tokens_with_emoji = tokenizer.tokenize(text_with_emoji)
   print("Tokens:", tokens_with_emoji)
   # output: ['I', 'love', 'pizza', '🍕', '!', 'Do', 'you', 'like', 'it', 'too', '?']
   ```

7. **Saving the Tokenizer:** Save the tokenizer's state, including its vocabulary:

   ```python
   tokenizer.save_vocabulary(save_directory='path/to/save')
   ```
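8. **Encoding and Decoding:** Since `NlktTokenizer` extends `PreTrainedTokenizer`, the inherited `encode`/`decode` helpers should also work on top of the methods shown above. The snippet below is a minimal sketch under that assumption; the exact IDs and the decoded string depend on your `vocab.txt`:

   ```python
   text = "the weather is sunny !"

   # encode() tokenizes and maps tokens to IDs in one step
   # (inherited from PreTrainedTokenizer; assumes the custom class implements the required hooks)
   input_ids = tokenizer.encode(text, add_special_tokens=False)
   print("IDs:", input_ids)

   # decode() maps the IDs back to a string
   print("Round trip:", tokenizer.decode(input_ids))
   ```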
## 🧪 Evaluation using `pytest`

We have comprehensively tested our tokenizer by implementing various test cases with `pytest`, ensuring its robustness and functionality across different input scenarios. Try it yourself by running:

```bash
pytest test_tokenization_nltk.py
```

## ⚠️ Limitations

- **Contextual Understanding:** The biggest concern with NLTK's tokenization is that it operates mainly at the word level. This means it does not capture the nuanced tokenization decisions needed for NLP tasks that require sub-word or character-level understanding.
- **Language Complexity:** NLTK might struggle to tokenize languages with complex morphologies or those requiring specialized tokenization rules. For instance, it may have trouble with languages that rely heavily on context, such as some forms of Chinese or Japanese.
- **Out-of-Vocabulary Words:** If the tokenizer encounters words not present in its vocabulary (like ID 31000 in the ID-Token conversion example above), it falls back to the `[UNK]` (unknown) token or handles them poorly, which can hurt performance on downstream tasks.
- **Limited Preprocessing Performance:** It does not fully support emojis.

## 🤗 Hub Integration

Make sure you have your vocabulary file (`vocab.txt`) in the same directory as the project.

1. **Simple Tokenization:**

   ```python
   import torch
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("ShirinYamani/task", trust_remote_code=True)

   text = "Example sentence for tokenization."
   # Tokenize the text
   tokens = tokenizer.tokenize(text)
   print("Tokens:", tokens)
   ```

2. **Including Special Tokens:**

   ```python
   text = "Hello, world!"
   tokens = tokenizer.tokenize(text, add_special_tokens=True)
   print(tokens)
   # output: ['', 'Hello', ',', 'world', '!', '']
   ```

3. **Token-ID Conversion:**

   ```python
   tokens = ['the', 'weather', 'IS', 'Sunny', '!']
   # lower() because our vocab.txt is entirely lowercase
   token_ids = [tokenizer.convert_tokens_to_ids(token.lower()) for token in tokens]
   print(token_ids)
   # output: [1997, 4634, 2004, 11560, 1000]
   ```
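4. **Preparing Model Inputs:** The `torch` import above suggests feeding the tokenizer's output to a PyTorch model. This is a minimal sketch assuming the custom class supports the standard `__call__` batching interface inherited from `PreTrainedTokenizer` and has a pad token configured (the example vocabulary maps ID 1 to `[PAD]`):

   ```python
   from transformers import AutoTokenizer

   tokenizer = AutoTokenizer.from_pretrained("ShirinYamani/task", trust_remote_code=True)

   sentences = ["Hello, world!", "The weather is sunny!"]
   # Build a padded batch of PyTorch tensors (requires torch to be installed)
   batch = tokenizer(sentences, padding=True, return_tensors="pt")

   print(batch["input_ids"].shape)   # (batch_size, max_sequence_length)
   print(batch["attention_mask"])    # 1 for real tokens, 0 for padding
   ```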