---
base_model: HuggingFaceM4/idefics2-8b
library_name: peft
license: apache-2.0
datasets:
- cmarkea/doc-vqa
language:
- fr
- en
pipeline_tag: visual-question-answering
---

# idefics2-8b-ft-docvqa-lora


**idefics2-8b-ft-docvqa-lora** is a fine-tuned version of the **[HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b)** model, trained on the **[doc-vqa](https://huggingface.co/datasets/cmarkea/doc-vqa)** dataset published by Crédit Mutuel Arkéa. Fine-tuning was performed with the **LoRA** (Low-Rank Adaptation) method, which improves task performance while keeping the number of trainable parameters, and therefore the complexity of fine-tuning, low.

During training, particular attention was paid to linguistic balance, with an emphasis on French: for a given image, questions and answers were drawn in French with a 70% probability. The model operates exclusively in bfloat16 precision, reducing memory and compute requirements. The entire training process took 3 days on a single A100 40GB.
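
The exact adapter hyperparameters are not published in this card. As a rough illustration only, a LoRA fine-tuning setup for Idefics2 with `peft` might look like the sketch below; the `r`, `lora_alpha`, `lora_dropout`, and `target_modules` values are assumptions, not the settings actually used.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

# Illustrative LoRA configuration (hyperparameter values are assumptions)
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    init_lora_weights="gaussian",
)

base_model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,  # training was done in bfloat16
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```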

Thanks to its bilingual training with an emphasis on French, the model excels in francophone environments while also performing well in English. It is especially suited for tasks that require analyzing and understanding complex documents, such as extracting information from forms, invoices, reports, and other text-based documents, in a visual question-answering context.

## Model Details

### Model Description

- **Developed by:** Loïc SOKOUDJOU SONAGU and Yoann SOLA
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** French, English
- **License:** Apache 2.0
- **Finetuned from model:** [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b)


## Usage

The model can be used directly through the `transformers` API:

```python
import requests
import torch
from PIL import Image
from io import BytesIO

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image


device = "cuda" if torch.cuda.is_available() else "cpu"

# Note that passing the image URLs (instead of the actual PIL images) to the processor is also possible
image = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")

model_id = "cmarkea/idefics2-8b-ft-docvqa-lora"
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
).to(device)

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ]
    }       
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}


# Generate
with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=500)
    generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
    print(generated_texts)

```
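
Because this repository contains a PEFT adapter rather than full model weights, the adapter can also be attached explicitly to the base model. A minimal sketch, assuming the standard `peft` loading API, equivalent to the call above:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForVision2Seq

# Load the frozen base model in bfloat16
base = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.bfloat16,
)

# Attach the LoRA adapter weights from this repository
model = PeftModel.from_pretrained(base, "cmarkea/idefics2-8b-ft-docvqa-lora")

# Optionally merge the adapter into the base weights for faster inference
model = model.merge_and_unload()
```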


### Results

Following the **[LLM-as-Juries](https://arxiv.org/abs/2404.18796)** evaluation method, the results below were obtained with three judge models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet). The metric was adapted to the VQA context, with explicit criteria for each score from 0 to 5, so that the judgments reflect expectations as precisely as possible.

![constellation](https://i.postimg.cc/t4tjhy6b/constellation-0.png)

**idefics2-8b-ft-docvqa-lora** and **paligemma-3b-ft-docvqa-896-lora** demonstrate equivalent performance despite having different model sizes.
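
The exact aggregation of the jury scores is not detailed in this card; as a minimal sketch, assuming each judge assigns an integer score from 0 to 5 per answer and the scores are simply averaged over judges and samples, the aggregation could look like this (the values shown are placeholders, not actual evaluation data):

```python
from statistics import mean

# Hypothetical per-sample scores (0-5) from the three judge models
jury_scores = [
    {"gpt-4o": 4, "gemini-1.5-pro": 5, "claude-3.5-sonnet": 4},
    {"gpt-4o": 3, "gemini-1.5-pro": 4, "claude-3.5-sonnet": 4},
]

# Average over judges for each sample, then over the whole evaluation set
per_sample = [mean(scores.values()) for scores in jury_scores]
overall_score = mean(per_sample)
print(f"Overall jury score: {overall_score:.2f} / 5")
```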

## Citation

```bibtex
@online{SOSoIdefics2,
  AUTHOR = {Loïc SOKOUDJOU SONAGU and Yoann SOLA},
  URL = {https://huggingface.co/cmarkea/idefics2-8b-ft-docvqa-lora},
  YEAR = {2024},
  KEYWORDS = {Multimodal ; VQA},
}
```