placeholder tokens are zero initialized

#89
by xdseunghyun - opened

Hi guys, thank you for sharing an awesome model!
I'm trying to fine-tune Phi-3 Medium, but the gradients return NaN right after the first optimization step.
I found that some placeholder tokens are zero-initialized, and this is the cause.
When I add another special token with a 0.02-std init (the conventional 1/d_model-scale std) and train with that token instead, everything works fine.

So here's my question:
can a zero embedding vector cause NaN? Why are the placeholder tokens initialized to zero?
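For reference, here is a minimal sketch of the workaround I described: find all-zero rows in an embedding table and re-initialize them with a normal distribution (std=0.02). This uses a plain `torch.nn.Embedding` for illustration; the function name `reinit_zero_rows` is my own, and with `transformers` you would apply the same idea to `model.get_input_embeddings().weight` after resizing.

```python
import torch
import torch.nn as nn

def reinit_zero_rows(embedding: nn.Embedding, std: float = 0.02) -> int:
    """Re-initialize all-zero rows of an embedding table in place.

    Returns the number of rows that were re-initialized.
    """
    with torch.no_grad():
        # A row is "zero-initialized" if every entry is exactly 0.
        zero_rows = embedding.weight.abs().sum(dim=1) == 0
        n = int(zero_rows.sum())
        if n:
            embedding.weight[zero_rows] = (
                torch.randn(n, embedding.embedding_dim) * std
            )
    return n

# Demo: simulate one zero-initialized placeholder token.
emb = nn.Embedding(10, 8)
with torch.no_grad():
    emb.weight[3].zero_()

fixed = reinit_zero_rows(emb)
print(fixed)  # 1
```

After this, training on the formerly zero row proceeds with a normally scaled embedding instead of an exact zero vector.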
