Florence-2 Fine-Tuning with PathVQA

This document describes fine-tuning the Florence-2 model on the PathVQA dataset. Florence-2 is a sequence-to-sequence vision-language model that handles a range of computer vision tasks, largely thanks to its extensive pre-training dataset.

Florence-2 Model Overview

Florence-2 formulates computer vision problems as sequence-to-sequence tasks: the model takes an image and a text prompt as input and generates text as output. The sections below summarize its components and pre-training data.

Model Architecture

DaViT Vision Encoder: Converts images into visual embeddings.
BERT Text Encoder: Converts text prompts into text and location embeddings.
Transformer Architecture: A standard encoder-decoder transformer processes the embeddings to generate text and location tokens.
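
To make "location tokens" concrete, the sketch below shows one way a bounding box could be turned into discrete tokens that a text decoder can emit. It assumes 1,000 coordinate bins and a `<loc_N>` token spelling, which is how the released Florence-2 checkpoints are commonly described. The function names and the exact rounding rule are illustrative assumptions, not the model's actual implementation.

```python
def quantize_box(box, image_width, image_height, bins=1000):
    """Map a pixel-space box (x0, y0, x1, y1) to integer bins in [0, bins - 1].

    Assumption: coordinates are split into `bins` equal-width buckets per axis.
    """
    x0, y0, x1, y1 = box

    def to_bin(value, size):
        # Scale the coordinate to the bin range, truncate, then clamp to valid bins.
        index = int(value * bins / size)
        return max(0, min(bins - 1, index))

    return (
        to_bin(x0, image_width),
        to_bin(y0, image_height),
        to_bin(x1, image_width),
        to_bin(y1, image_height),
    )


def box_to_location_tokens(box, image_width, image_height, bins=1000):
    """Render a quantized box as a string of hypothetical <loc_N> tokens."""
    quantized = quantize_box(box, image_width, image_height, bins)
    return "".join(f"<loc_{index}>" for index in quantized)


print(box_to_location_tokens((64, 128, 320, 480), image_width=640, image_height=512))
# <loc_100><loc_250><loc_500><loc_937>
```

Because boxes become ordinary vocabulary items, the same decoder can produce captions, answers, and coordinates without a task-specific head.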

Strength of Florence-2

The model's strength lies less in its architecture than in the extensive dataset it was pre-trained on. The authors created the FLD-5B dataset to address the limitations of existing large-scale datasets, which each cover a narrow range of annotation types: WIT provides only image-caption pairs, and SA-1B provides only images with segmentation masks.

FLD-5B Dataset

Content: Over 5 billion annotations for 126 million images, including boxes, masks, captions, and grounding.
Creation Process: Largely automated using off-the-shelf task-specific models and a set of heuristics and quality checks to clean the obtained results.

PathVQA Dataset

PathVQA is a dataset of question-answer pairs on pathology images, intended for training and testing Medical Visual Question Answering (VQA) systems. It includes both open-ended questions and binary "yes/no" questions.

Source

The dataset is built from two publicly available pathology textbooks:

"Textbook of Pathology"
"Basic Pathology"

Additionally, images were sourced from the "Pathology Education Informational Resource" (PEIR) digital library. The copyrights of images and captions belong to the publishers and authors of these two books and the owners of the PEIR digital library.

Dataset Summary

Total Images: 5,004
- Referenced Images (used by at least one question-answer pair): 4,289
- Unused Images: 715
Total Question-Answer Pairs: 32,795

After removing duplicate image-question-answer triplets, the dataset contains 32,632 question-answer pairs on 4,289 images.
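
The counts above are mutually consistent, which a few lines of arithmetic can confirm. The variable names are illustrative. The numbers are the ones quoted in this section.

```python
# Counts quoted in the dataset summary.
total_images = 5004
unused_images = 715
raw_qa_pairs = 32795
deduplicated_qa_pairs = 32632

# Images that appear in at least one question-answer pair.
referenced_images = total_images - unused_images

# Duplicate image-question-answer triplets dropped during cleaning.
duplicates_removed = raw_qa_pairs - deduplicated_qa_pairs

# Average number of question-answer pairs per referenced image.
pairs_per_image = deduplicated_qa_pairs / referenced_images

print(referenced_images)          # 4289
print(duplicates_removed)         # 163
print(round(pairs_per_image, 1))  # 7.6
```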

Supported Tasks and Leaderboards

The PathVQA dataset has an active leaderboard on Papers with Code, ranking models based on:

Yes/No Accuracy: Accuracy of generated answers for binary "yes/no" questions.
Free-form Accuracy: Accuracy of generated answers for open-ended questions.
Overall Accuracy: Accuracy of generated answers across all questions.
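
As a rough illustration of how these three numbers relate, the sketch below computes them from (prediction, reference) pairs. It assumes exact-match scoring after lowercasing and whitespace normalization, and it treats a question as binary when its reference answer is "yes" or "no". The leaderboard's actual scoring rules are not stated here, so both choices are assumptions, and the function names are hypothetical.

```python
def normalize(answer):
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join(answer.lower().split())


def pathvqa_accuracies(examples):
    """Compute yes/no, free-form, and overall accuracy.

    `examples` is an iterable of (prediction, reference) string pairs.
    Assumption: a prediction is correct only on an exact normalized match.
    """
    counts = {"yes_no": [0, 0], "free_form": [0, 0]}  # [correct, total] per bucket
    for prediction, reference in examples:
        ref = normalize(reference)
        bucket = "yes_no" if ref in {"yes", "no"} else "free_form"
        counts[bucket][1] += 1
        counts[bucket][0] += int(normalize(prediction) == ref)

    def ratio(correct, total):
        # Return None for an empty bucket instead of dividing by zero.
        return correct / total if total else None

    all_correct = sum(correct for correct, _ in counts.values())
    all_total = sum(total for _, total in counts.values())
    return {
        "yes_no": ratio(*counts["yes_no"]),
        "free_form": ratio(*counts["free_form"]),
        "overall": ratio(all_correct, all_total),
    }


sample = [
    ("yes", "yes"),
    ("No", "no"),
    ("yes", "no"),
    ("liver", "liver"),
    ("kidney", "spleen"),
]
print(pathvqa_accuracies(sample))
```

In this toy sample, two of three binary answers and one of two open-ended answers match, so overall accuracy is 3 out of 5. Overall accuracy is weighted by how many questions fall in each group, so it is not the plain average of the other two.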

Fine-Tuning Florence-2 with PathVQA

The Florence-2 model was fine-tuned using the PathVQA dataset to adapt it for medical visual question answering tasks.

Methodology

Dataset Preparation: The PathVQA dataset was obtained from the updated Google Drive link shared by the authors on February 15, 2023.
Data Cleaning: Duplicate image-question-answer triplets were removed.
Fine-Tuning: The model was fine-tuned for seven epochs on the training set.
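
The data cleaning step can be sketched as follows. This is a minimal illustration, not the code used for this model: the record layout (`image`, `question`, `answer` keys) is hypothetical, and it assumes duplicates are detected by exact string match, since the card does not say whether any normalization was applied.

```python
def deduplicate_triplets(records):
    """Drop repeated image-question-answer triplets, keeping the first occurrence.

    Assumption: each record is a dict with hypothetical keys
    "image" (an image identifier), "question", and "answer".
    """
    seen = set()
    unique = []
    for record in records:
        key = (record["image"], record["question"], record["answer"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    {"image": "img_001", "question": "is this liver tissue?", "answer": "yes"},
    {"image": "img_001", "question": "is this liver tissue?", "answer": "yes"},
    {"image": "img_001", "question": "what organ is shown?", "answer": "liver"},
    {"image": "img_002", "question": "is this liver tissue?", "answer": "no"},
]
cleaned = deduplicate_triplets(records)
print(len(records), "->", len(cleaned))  # 4 -> 3
```

Keying on the full triplet keeps legitimate cases where the same question is asked about different images, or where one image has several distinct questions.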

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

Developed by: Mohammed Ali Abbas
Model type: VQA
License: [More Information Needed]
Finetuned from model [optional]: [More Information Needed]