cmarkea/paligemma-3b-ft-docvqa-896-lora

Visual Question Answering

PEFT

Safetensors

French

English

SOKOUDJOU committed 29 days ago

Commit 3992e44 • 1 parent: e2183ff

Update README.md

Files changed (1)

README.md +7 -1

@@ -15,7 +15,7 @@ pipeline_tag: visual-question-answering
 paligemma-3b-ft-docvqa-896-lora is a fine-tuned version of the [google/paligemma-3b-ft-docvqa-896](https://huggingface.co/google/paligemma-3b-ft-docvqa-896/edit/main/README.md) model, specifically trained on the [doc-vqa](https://huggingface.co/datasets/cmarkea/doc-vqa) dataset published by cmarkea. Optimized using the LoRA (Low-Rank Adaptation) method, this model was designed to enhance performance while reducing the complexity of fine-tuning.
-During training, particular attention was given to linguistic balance, with a focus on French. The model was exposed to a predominantly French context, with a 70% likelihood of interacting with French questions/answers for a given image. It operates exclusively in bfloat16 precision, optimizing computational resources.
+During training, particular attention was given to linguistic balance, with a focus on French. The model was exposed to a predominantly French context, with a 70% likelihood of interacting with French questions/answers for a given image. It operates exclusively in bfloat16 precision, optimizing computational resources. The entire training process took 3 weeks on a single A100 40GB.
 Thanks to its multilingual specialization and emphasis on French, this model excels in francophone environments, while also performing well in English. It is especially suited for tasks that require the analysis and understanding of complex documents, such as extracting information from forms, invoices, reports, and other text-based documents in a visual question-answering context.
@@ -52,6 +52,9 @@ image = Image.open(requests.get(url, stream=True).raw)
 model = PaliGemmaForConditionalGeneration.from_pretrained(
     model_id,
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f478333545cc30503e3fcd/8O49whlhlgRR8377NjkAl.png)
     torch_dtype=torch.bfloat16,
     device_map=device,
 ).eval()
@@ -75,6 +78,9 @@ with torch.inference_mode():
 [More Information Needed]
+By following the LLM-as-Juries evaluation method, the following results were obtained using three judge models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet). These models were evaluated based on a well-defined scoring rubric specifically designed for the VQA (Visual Question Answering) context, with clear criteria for each score to ensure the highest possible precision in meeting expectations.
+![constellation](https://i.postimg.cc/XNdfdg49/constellation-0.png)
 ## Citation
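
The jury-based evaluation mentioned in the added paragraph can be illustrated with a minimal sketch. It assumes each judge returns one integer rubric score per answer and that the jury score is the plain mean across the three judges. The commit does not state the rubric scale or the aggregation rule, so the 0 to 5 range, the judge keys, and the function name `jury_score` are all hypothetical.

```python
# Hypothetical judge identifiers: the commit names the three judge models
# but does not say how they are keyed or queried.
JUDGES = ("gpt-4o", "gemini-1.5-pro", "claude-3.5-sonnet")


def jury_score(scores: dict[str, int], min_score: int = 0, max_score: int = 5) -> float:
    """Average one answer's rubric scores across all judges.

    Assumption: plain mean over judges on a 0-5 scale. Neither the scale
    nor the aggregation rule is specified in the source.
    """
    # Every judge must have scored the answer, otherwise the mean is biased.
    missing = [judge for judge in JUDGES if judge not in scores]
    if missing:
        raise ValueError(f"missing scores from judges: {missing}")
    # Reject scores that fall outside the (assumed) rubric range.
    for judge in JUDGES:
        if not min_score <= scores[judge] <= max_score:
            raise ValueError(f"score from {judge} is outside the rubric range")
    return sum(scores[judge] for judge in JUDGES) / len(JUDGES)


example = {"gpt-4o": 4, "gemini-1.5-pro": 5, "claude-3.5-sonnet": 3}
print(jury_score(example))
```

Averaging is only one possible choice; a median or a majority vote over discretized scores would fit the same interface if the rubric calls for it.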