moranyanuka committed
Commit 4eaa10c
1 Parent(s): 51ae538

Update README.md

README.md CHANGED

---
license: apache-2.0
---
# MOCHa Checkpoint for the BLIP-Large Model

The official checkpoint of the BLIP-Large model, finetuned on MS-COCO with the MOCHa RL framework, introduced in [MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations](https://arxiv.org/pdf/2312.03631.pdf).

[Project Page](https://assafbk.github.io/mocha/)

## Usage

You can use this model for both conditional and unconditional image captioning.
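
As a quick sanity check, the checkpoint should also load through the generic `transformers` image-to-text pipeline. This is a minimal sketch with default decoding settings; the pipeline returns a list of dicts with a `generated_text` field:

```python
# Minimal pipeline sketch: unconditional captioning with default generation settings
from transformers import pipeline

captioner = pipeline("image-to-text", model="moranyanuka/blip-image-captioning-large-mocha")
print(captioner("https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"))
```

The sections below use the explicit processor/model API instead, which also supports conditional captioning with a text prompt.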

### Using the PyTorch model

#### Running the model on CPU

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>
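
The snippets in this card use the model's default generation settings; `generate` accepts the usual decoding arguments if you want longer or beam-searched captions. The values below are purely illustrative (not taken from the MOCHa paper) and continue the snippet above:

```python
# Illustrative decoding settings (not from the paper); `model`, `processor`,
# and `inputs` are reused from the snippet above.
out = model.generate(**inputs, max_new_tokens=40, num_beams=5)
print(processor.decode(out[0], skip_special_tokens=True))
```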

#### Running the model on GPU

##### In full precision

<details>
<summary> Click to expand </summary>

```python
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha").to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda")

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
</details>
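
If you want the same script to also run on machines without a GPU, a common pattern is to pick the device dynamically rather than hard-coding `"cuda"`. A small sketch (not part of the original example), reusing `model`, `processor`, and `raw_image` from above:

```python
import torch

# Fall back to CPU when CUDA is unavailable
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = processor(raw_image, return_tensors="pt").to(device)
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```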

##### In half precision (`float16`)

<details>
<summary> Click to expand </summary>

```python
import torch
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("moranyanuka/blip-image-captioning-large-mocha")
model = BlipForConditionalGeneration.from_pretrained("moranyanuka/blip-image-captioning-large-mocha", torch_dtype=torch.float16).to("cuda")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# conditional image captioning
text = "a photography of"
inputs = processor(raw_image, text, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a photography of a woman and her dog

# unconditional image captioning
inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
# >>> a woman sitting on the beach with her dog
```
</details>
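
To caption several images at once, the processor and `generate` also accept batched inputs. A minimal sketch that simply reuses the demo image twice and continues from the half-precision snippet above:

```python
# Batched captioning sketch: the demo image is repeated purely for illustration;
# `processor`, `model`, and `raw_image` come from the snippet above.
images = [raw_image, raw_image]
inputs = processor(images=images, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs)
print(processor.batch_decode(out, skip_special_tokens=True))
```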

BibTeX:
```
@misc{benkish2023mocha,
  title={MOCHa: Multi-Objective Reinforcement Mitigating Caption Hallucinations},
  author={Assaf Ben-Kish and Moran Yanuka and Morris Alper and Raja Giryes and Hadar Averbuch-Elor},
  year={2023},
  eprint={2312.03631},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```