SOKOUDJOU committed on
Commit
1fa5428
1 Parent(s): 1d0c5e1

Upload README.md

Files changed (1)
  1. README.md +96 -0
README.md ADDED
---
base_model: google/paligemma-3b-ft-docvqa-896
library_name: peft
license: apache-2.0
datasets:
- cmarkea/doc-vqa
language:
- fr
- en
pipeline_tag: visual-question-answering
---

# paligemma-3b-ft-docvqa-896-lora

**paligemma-3b-ft-docvqa-896-lora** is a fine-tuned version of the **[google/paligemma-3b-ft-docvqa-896](https://huggingface.co/google/paligemma-3b-ft-docvqa-896)** model, trained specifically on the **[doc-vqa](https://huggingface.co/datasets/cmarkea/doc-vqa)** dataset published by cmarkea. Optimized with the **LoRA** (Low-Rank Adaptation) method, the model was designed to enhance performance while reducing the complexity of fine-tuning.

During training, particular attention was given to linguistic balance, with a focus on French: for a given image, the model had a 70% probability of seeing French questions/answers. It operates exclusively in bfloat16 precision, optimizing computational resources. The entire training process took 3 weeks on a single A100 40GB.

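
As a rough illustration of the setup described above, here is a minimal `peft` sketch of a LoRA fine-tune on top of the base checkpoint in bfloat16. The rank, alpha, dropout, and target modules below are illustrative assumptions; the exact hyperparameters used for this adapter are not published in this card.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import PaliGemmaForConditionalGeneration

# Base checkpoint, loaded in bfloat16 as for this fine-tune.
base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-ft-docvqa-896", torch_dtype=torch.bfloat16
)

# Illustrative LoRA configuration (rank/alpha/targets are assumptions,
# not the values actually used for paligemma-3b-ft-docvqa-896-lora).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```
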
Thanks to its multilingual specialization and emphasis on French, this model excels in francophone environments, while also performing well in English. It is especially suited for tasks that require the analysis and understanding of complex documents, such as extracting information from forms, invoices, reports, and other text-based documents in a visual question-answering context.

## Model Details

### Model Description

- **Developed by:** Loïc SOKOUDJOU SONAGU and Yoann SOLA
- **Model type:** Multi-modal model (image + text)
- **Language(s) (NLP):** French, English
- **License:** Apache 2.0
- **Finetuned from model:** [google/paligemma-3b-ft-docvqa-896](https://huggingface.co/google/paligemma-3b-ft-docvqa-896)

## Usage

Model usage is simple via the `transformers` API:

```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "cmarkea/paligemma-3b-ft-docvqa-896-lora"

# Example image to query.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# Load the fine-tuned model in bfloat16 on the available device.
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

# The processor comes from the base checkpoint.
processor = AutoProcessor.from_pretrained("google/paligemma-3b-ft-docvqa-896")

# Instruct the model to create a caption in French.
prompt = "caption fr"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65f478333545cc30503e3fcd/8O49whlhlgRR8377NjkAl.png)
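
Because the repository ships a PEFT adapter (see `library_name: peft`), you can also load the adapter explicitly on top of the base checkpoint and, if desired, merge the LoRA weights before deployment. A minimal sketch, assuming a standard PEFT adapter layout:

```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

base_id = "google/paligemma-3b-ft-docvqa-896"
adapter_id = "cmarkea/paligemma-3b-ft-docvqa-896-lora"

# Load the frozen base model in bfloat16, then attach the LoRA adapter.
base = PaliGemmaForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)

# Optionally fold the low-rank updates into the base weights for faster inference.
model = model.merge_and_unload()

processor = AutoProcessor.from_pretrained(base_id)
```
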

### Results

Following the **LLM-as-Juries** evaluation method, the results below were obtained with three judge models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet). The judges scored the model's answers against a well-defined rubric designed specifically for the VQA context, with clear criteria for each score to make the assessment as precise as possible.

![constellation](https://i.postimg.cc/kMRmcBpQ/constellation-0.png)
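
For context, one common way to aggregate this kind of jury evaluation is to average each judge's scores over the questions and then average across judges. The sketch below uses placeholder scores and a placeholder scale; the exact rubric, scale, and aggregation behind the figure above are not specified in this card.

```python
from statistics import mean

# Placeholder per-question scores from the three judges (the scale is an assumption).
jury_scores = {
    "gpt-4o": [4, 5, 3, 4],
    "gemini-1.5-pro": [4, 4, 3, 5],
    "claude-3.5-sonnet": [5, 4, 4, 4],
}

# Average within each judge first, then across judges, so no single jury dominates.
per_judge = {judge: mean(scores) for judge, scores in jury_scores.items()}
overall = mean(per_judge.values())

print(per_judge)
print(f"overall jury score: {overall:.2f}")
```
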

## Citation

```bibtex
@online{Depaligemma,
  AUTHOR = {Loïc SOKOUDJOU SONAGU and Yoann SOLA},
  URL = {https://huggingface.co/cmarkea/paligemma-3b-ft-docvqa-896-lora},
  YEAR = {2024},
  KEYWORDS = {Multimodal ; VQA},
}
```

Find the base model paper [here](https://arxiv.org/abs/2407.07726).

### Framework versions

- PEFT 0.11.1