---
base_model: google/paligemma-3b-ft-docvqa-896
library_name: peft
license: apache-2.0
datasets:
- cmarkea/doc-vqa
language:
- fr
- en
pipeline_tag: visual-question-answering
---

# paligemma-3b-ft-docvqa-896-lora


**paligemma-3b-ft-docvqa-896-lora** is a fine-tuned version of the **[google/paligemma-3b-ft-docvqa-896](https://huggingface.co/google/paligemma-3b-ft-docvqa-896)** model, trained on the **[doc-vqa](https://huggingface.co/datasets/cmarkea/doc-vqa)** dataset published by Crédit Mutuel Arkéa. Fine-tuned with the **LoRA** (Low-Rank Adaptation) method, the model improves task performance while keeping the cost and complexity of fine-tuning low.

During training, particular attention was paid to linguistic balance, with an emphasis on French: for a given image, the model had a 70% probability of seeing its questions and answers in French. Training was performed entirely in bfloat16 precision to optimize computational resources, and the full run took three weeks on a single A100 40GB.

Thanks to this multilingual specialization and its emphasis on French, the model excels in francophone environments while also performing well in English. It is especially suited to tasks that require analyzing and understanding complex documents, such as extracting information from forms, invoices, reports, and other text-based documents in a visual question-answering context.
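
As an illustration of what this kind of LoRA setup looks like with `peft`, here is a minimal sketch. The rank, scaling factor, and target modules shown are assumptions for illustration, not the exact hyperparameters used for this checkpoint.

```python
# Illustrative only: the actual LoRA hyperparameters used for this checkpoint
# are not published; the values below are typical assumptions.
import torch
from transformers import PaliGemmaForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-ft-docvqa-896",
    torch_dtype=torch.bfloat16,  # training ran entirely in bfloat16
)

lora_config = LoraConfig(
    r=16,                                                      # assumed rank
    lora_alpha=32,                                             # assumed scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumed target modules
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```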

## Model Details

### Model Description

- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** French, English
- **License:** Apache 2.0
- **Finetuned from model:** [google/paligemma-3b-ft-docvqa-896](https://huggingface.co/google/paligemma-3b-ft-docvqa-896)


## Usage

The model can be used straightforwardly through the `transformers` API:

```python
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "cmarkea/paligemma-3b-ft-docvqa-896-lora"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

# The processor is the one from the base model
processor = AutoProcessor.from_pretrained("google/paligemma-3b-ft-docvqa-896")

# Ask the model to caption the image in French
prompt = "caption fr"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
```
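
Since this repository contains a PEFT adapter, it can also be attached explicitly to the base model with the `peft` library. The snippet below is a sketch of this loading path together with a document-QA style prompt; the file path and question are hypothetical, and the `answer fr <question>` prefix follows the usual PaliGemma task convention rather than anything specific to this checkpoint.

```python
# Alternative loading path: attach the LoRA adapter explicitly with peft.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from peft import PeftModel

base = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-ft-docvqa-896",
    torch_dtype=torch.bfloat16,
    device_map="auto",
).eval()
model = PeftModel.from_pretrained(base, "cmarkea/paligemma-3b-ft-docvqa-896-lora")

processor = AutoProcessor.from_pretrained("google/paligemma-3b-ft-docvqa-896")

image = Image.open("document.png")  # any local document image (hypothetical path)

# Assumed document-QA prompt, using the usual "answer <lang> <question>" convention
prompt = "answer fr Quel est le montant total de la facture ?"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(base.device)
input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=50)
    print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```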



## Results

Following the **[LLM-as-Juries](https://arxiv.org/abs/2404.18796)** evaluation method, the results below were obtained using three judge models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet). The scoring rubric was adapted to the VQA context, with explicit criteria for each grade from 0 to 5 so that scores closely reflect how well an answer meets expectations.

![constellation](https://i.postimg.cc/t4tjhy6b/constellation-0.png)

**idefics2-8b-ft-docvqa-lora** and **paligemma-3b-ft-docvqa-896-lora** achieve comparable performance despite their different model sizes.
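
As a rough illustration of how jury grades can be combined, the snippet below averages the 0-5 grades returned by the three judges for each answer. The grades and the averaging rule are assumptions for illustration only, not the exact procedure used to produce the figure above.

```python
# Illustrative aggregation of LLM-as-Juries grades (0-5) per answer.
# Dummy grades; not actual evaluation results.
from statistics import mean

jury_grades = {
    "gpt-4o": [4, 5, 3],
    "gemini-1.5-pro": [4, 4, 3],
    "claude-3.5-sonnet": [5, 4, 4],
}

# Average the three judges' grades per answer, then over the whole set.
per_answer = [mean(grades) for grades in zip(*jury_grades.values())]
overall = mean(per_answer)
print(f"per-answer scores: {per_answer}, overall: {overall:.2f}")
```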

## Citation

```bibtex
@online{SoSoPaligemma,
  AUTHOR = {Loïc SOKOUDJOU SONAGU and Yoann SOLA},
  URL = {https://huggingface.co/cmarkea/paligemma-3b-ft-docvqa-896-lora},
  YEAR = {2024},
  KEYWORDS = {Multimodal ; VQA},
}
```