TypeError: 'NoneType' object cannot be interpreted as an integer

#3
by tanliboy - opened

Hi Qwen2 team,

I am trying to run the Zephyr DPO recipe (https://github.com/huggingface/alignment-handbook/tree/main/recipes/zephyr-7b-beta) to fine-tune this model, but I consistently run into this error (SFT training works fine). Does this model use a special checkpoint configuration that I need to set? Any thoughts on the potential cause?

" [rank6]: TypeError: 'NoneType' object cannot be interpreted as an integer
[rank5]: Traceback (most recent call last):
[rank5]: File "/home/litan/alignment-handbook/scripts/run_dpo.py", line 261, in <module>
[rank5]: main()
[rank5]: File "/home/litan/alignment-handbook/scripts/run_dpo.py", line 214, in main
[rank5]: train_result = trainer.train(resume_from_checkpoint=checkpoint)
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/transformers/trainer.py", line 1850, in train
[rank5]: return inner_training_loop(
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/transformers/trainer.py", line 2165, in _inner_training_loop
[rank5]: for step, inputs in enumerate(epoch_iterator):
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/accelerate/data_loader.py", line 454, in __iter__
[rank5]: current_batch = next(dataloader_iter)
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
[rank5]: data = self._next_data()
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
[rank5]: data = self._dataset_fetcher.fetch(index) # may raise StopIteration
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
[rank5]: return self.collate_fn(data)
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/trl/trainer/utils.py", line 338, in __call__
[rank5]: to_pad = [torch.LongTensor(ex[k]) for ex in features]
[rank5]: File "/opt/conda/envs/handbook/lib/python3.10/site-packages/trl/trainer/utils.py", line 338, in <listcomp>
[rank5]: to_pad = [torch.LongTensor(ex[k]) for ex in features]
[rank5]: TypeError: 'NoneType' object cannot be interpreted as an integer
[2024-06-15 02:51:57,401] [INFO] [utils.py:802:see_memory_usage] After initializing ZeRO optimizer"
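
The failing line in the traceback is torch.LongTensor(ex[k]) inside the collator, which is consistent with a None ending up among the token ids of at least one example. A minimal sketch for locating such entries before they reach the collator is shown here. The helper name find_none_ids and the feature key used in the usage note are hypothetical and only for illustration; they are not part of TRL or the alignment-handbook.

```python
def find_none_ids(features):
    """Return (example_index, key, position) for every None found in the features.

    `features` is assumed to be a list of dicts mapping a feature name to a
    list of token ids, which is the shape the collator in the traceback iterates over.
    A position of None means the whole value for that key was None.
    """
    problems = []
    for i, ex in enumerate(features):
        for key, value in ex.items():
            if value is None:
                # The entire feature is missing for this example.
                problems.append((i, key, None))
            elif isinstance(value, (list, tuple)):
                # Look for individual None ids inside the sequence.
                for pos, tok in enumerate(value):
                    if tok is None:
                        problems.append((i, key, pos))
    return problems
```

For example, find_none_ids([{"prompt_input_ids": [None, 1, 2]}]) reports position 0 of the (hypothetical) prompt_input_ids key in example 0, which would point at a missing id at the start of the sequence.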

In case anyone runs into the same problem: I figured out that it is related to the inconsistency between bos_token_id and bos_token.
I worked around it by changing
"bos_token": null to "bos_token": "<|endoftext|>" in the tokenizer_config.json file.
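
The same workaround can be applied with a small script instead of editing the file by hand. This is only a sketch using the standard json module: the function name patch_bos_token is hypothetical, and it assumes tokenizer_config.json is a plain JSON object in a local copy of the model directory.

```python
import json
from pathlib import Path


def patch_bos_token(config_path, bos_token="<|endoftext|>"):
    """Set "bos_token" in a tokenizer_config.json if it is null or missing.

    Returns True if the file was changed, False if a bos_token was already set.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))

    # Leave the file alone when a bos_token is already configured.
    if config.get("bos_token") is not None:
        return False

    config["bos_token"] = bos_token
    path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return True
```

Calling patch_bos_token("path/to/model/tokenizer_config.json") once before launching the DPO run should be enough (the path here is a placeholder); a second call returns False and leaves the file untouched.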
