microsoft/Florence-2-large · Fine-tuning strategy for multiple tasks

I would like to fine-tune this model on a specific set of images, combining 2 different tasks that are used in cascade.

The idea is that, once it receives the input image, the model first performs the image captioning task (MORE_DETAILED_CAPTION) to describe the image, and then runs CAPTION_TO_PHRASE_GROUNDING on that caption, so that I get a 'visual perspective' of what the model has described (a sort of Grad-CAM for the text).
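To make the cascade concrete, here is a minimal sketch of the inference flow I have in mind. It assumes the usage pattern from the model card (task token as prompt, `post_process_generation` to parse the output). The names `caption_then_ground`, `make_florence_runner`, and `TaskRunner` are hypothetical helpers I made up for illustration, and the generation settings are just the ones I would try first, not tuned values.

```python
from typing import Any, Callable, Dict

CAPTION_TASK = "<MORE_DETAILED_CAPTION>"
GROUNDING_TASK = "<CAPTION_TO_PHRASE_GROUNDING>"

# Hypothetical signature: run_task(task_prompt, image, text_input) -> parsed result dict
TaskRunner = Callable[[str, Any, str], Dict[str, Any]]


def caption_then_ground(run_task: TaskRunner, image: Any) -> Dict[str, Any]:
    """Step 1: caption the image. Step 2: ground that caption on the same image."""
    caption = run_task(CAPTION_TASK, image, "")[CAPTION_TASK]
    grounding = run_task(GROUNDING_TASK, image, caption)[GROUNDING_TASK]
    return {
        "caption": caption,
        "bboxes": grounding["bboxes"],  # one box per grounded phrase
        "labels": grounding["labels"],  # the phrase each box refers to
    }


def make_florence_runner(model_id: str = "microsoft/Florence-2-large") -> TaskRunner:
    """Wrap the model in a TaskRunner (assumes the model card usage pattern)."""
    # Imported lazily so the cascade logic above can be tested without the model.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    def run_task(task_prompt: str, image: Any, text_input: str = "") -> Dict[str, Any]:
        # For grounding, the caption is appended to the task token.
        inputs = processor(
            text=task_prompt + text_input, images=image, return_tensors="pt"
        )
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,
            num_beams=3,
        )
        text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
        return processor.post_process_generation(
            text, task=task_prompt, image_size=(image.width, image.height)
        )

    return run_task
```

With a PIL image loaded as `image`, I would call it as `caption_then_ground(make_florence_runner(), image)`. Both steps go through the same model, so whatever fine-tuning strategy is used has to keep both task prompts working.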

What should I do in this case? Should I fine-tune the model twice, first on the image captioning task and then use the resulting checkpoint as the starting point for fine-tuning on the second task?