Fine-tuning the model on a custom dataset of text-video pairs.

#59
by Snarky36 - opened

Hello everyone. I am trying to find out whether it is possible to fine-tune a T2V model using a custom dataset with multiple pairs of text as input and a video as output. Is something like this possible, and if so, is there any example of fine-tuning text-to-video-ms-1.7b?
I saw that there is a Tune-A-Video repo that fine-tunes a model using only a single video, and I was wondering if I could make it work with multiple prompts and videos.
My dataset has approximately 1,300 text-video pairs of sign language, and I would like the model to translate from natural language into a video of a man speaking in sign language.
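For a multi-pair setup like this, the first step is usually to index the caption/video pairs so a trainer can iterate over them. Below is a minimal sketch, assuming a flat directory layout where each `clipNNNN.mp4` sits next to a `clipNNNN.txt` caption file; the class name and layout are illustrative, not something required by text-to-video-ms-1.7b or any particular training script:

```python
import os
import tempfile

class TextVideoPairs:
    """Minimal index of (caption, video path) pairs for fine-tuning.

    Assumes an illustrative flat layout: clip0001.mp4 + clip0001.txt, etc.
    """

    def __init__(self, root):
        self.samples = []
        for name in sorted(os.listdir(root)):
            stem, ext = os.path.splitext(name)
            if ext != ".mp4":
                continue
            caption_path = os.path.join(root, stem + ".txt")
            if not os.path.exists(caption_path):
                continue  # skip clips that have no matching caption
            with open(caption_path, encoding="utf-8") as f:
                caption = f.read().strip()
            self.samples.append((caption, os.path.join(root, name)))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        # A real pipeline would decode video frames here (e.g. with
        # torchvision or decord); this sketch only returns the pair.
        return self.samples[idx]

# Example: build a tiny fake dataset on disk and index it.
root = tempfile.mkdtemp()
for i, text in enumerate(["hello in sign language",
                          "thank you in sign language"]):
    open(os.path.join(root, f"clip{i}.mp4"), "wb").close()  # placeholder video
    with open(os.path.join(root, f"clip{i}.txt"), "w", encoding="utf-8") as f:
        f.write(text)

ds = TextVideoPairs(root)
print(len(ds))   # 2
print(ds[0][0])  # hello in sign language
```

Once the pairs are indexed like this, the `__getitem__` can be extended to decode frames and wrapped in whatever loader the chosen fine-tuning codebase expects.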

Hi @Snarky36 did you find anything regarding this?

Hello @h-pal, unfortunately nothing from this one, but I found something that is really well explained, and you can train text-to-video with their approach. You can find their repo and their response regarding fine-tuning here: https://github.com/ali-vilab/VGen/issues/112
Wish you best of luck!

Thanks so much @Snarky36!
