FeynModel V 0.1

Welcome to the FeynModel repository. FeynModel is a vision-language model with the reasoning capabilities of an LLM (Large Language Model), built to explore the combined power of vision and language for scientific reasoning tasks. The model is fine-tuned with LoRA (Low-Rank Adaptation), optimizing it for a variety of vision and language tasks.

Version 0.1 uses pretrained layers from the DaViT vision tower of Microsoft's Florence2-base and from Google's Gemma2-2B, and was fine-tuned on the M3IT, COCO, and ScienceQA datasets. It employs an S6 block to integrate context memory for Q*TS (experimental).
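For context on the fine-tuning method, here is a minimal LoRA sketch using the peft library. The rank, scaling factor, and target modules are illustrative assumptions, not the actual training configuration.

from peft import LoraConfig, get_peft_model

# illustrative LoRA setup -- rank, alpha, and target modules are assumptions
lora_config = LoraConfig(
    r=16,                                 # assumed low-rank dimension
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base_model, lora_config)  # base_model: a loaded causal LM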

How to use

# make sure the torch, transformers, pillow, einops, and timm libraries are installed
import torch
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = 'Imagroune/feynmodel'
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# if you have a CUDA device
model.to('cuda')

# otherwise, to run on CPU, load the model like this instead:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map='cpu',            # make sure the model is loaded on the CPU
    torch_dtype=torch.bfloat16,  # load the model in half precision
)
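All of the prompts below use Gemma's chat-turn format. A small hypothetical helper (not part of the original card) to build them:

def build_prompt(user_message: str) -> str:
    # hypothetical helper: wraps a message in the Gemma chat-turn format used below
    return f"<start_of_turn>user\n{user_message}<end_of_turn>\n<start_of_turn>model\n"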

LLM Inference


input_text = "<start_of_turn>user\nCombien d'helicoptère un humain adulte peut manger en un seul repas?.<end_of_turn> <start_of_turn>model\n"
input_ids = processor.tokenizer(input_text, return_tensors="pt").to("cuda")

# Génération du texte en mode streaming
max_length = input_ids.input_ids.shape[1] + 1024  # Longueur maximale totale
stream_output = []  # Liste pour stocker le flux de sortie

# Génération et affichage en mode streaming
for output in model.generate(input_ids=input_ids.input_ids,max_length=max_length, do_sample=True, temperature=0.7):
    decoded_output = processor.tokenizer.decode(output, skip_special_tokens=True)
    stream_output.append(decoded_output)
    print(decoded_output, end="", flush=True)

It will output something like:

This is a trick question!  Here's why:

* **Helicopters don't have food to eat.** Helicopters are machines that fly. They don't have mouths or stomachs!
* **Humans don't fly through food.** We eat food to give our bodies energy. But we don't eat food that we can fly through! 

Let me know if you'd like to learn about how people eat different foods. 
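Note that model.generate returns complete sequences, so the loop above prints the decoded result once per returned sequence rather than token by token. For true token-by-token streaming, transformers also provides TextStreamer; a minimal sketch:

from transformers import TextStreamer

# prints tokens to stdout as they are generated; skip_prompt hides the input text
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(input_ids=input_ids.input_ids, max_length=max_length, do_sample=True, temperature=0.7, streamer=streamer)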

Vision Inference

from transformers import StoppingCriteria, StoppingCriteriaList

class PrintTokensStoppingCriteria(StoppingCriteria):
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, input_ids, scores, **kwargs):
        # decode the last generated token and print it
        last_token_id = input_ids[0, -1].item()
        token = self.tokenizer.decode([last_token_id], skip_special_tokens=True)
        print(token, end='', flush=True)

        # return True to stop generation, False to continue
        return False

stopping_criteria = PrintTokensStoppingCriteria(processor.tokenizer)
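As a hypothetical variant (not part of the original card), the criteria could instead return True once Gemma's end-of-turn token appears, stopping generation at the end of the model's reply:

class StopOnEndOfTurn(StoppingCriteria):
    # hypothetical variant: stop once the model emits Gemma's <end_of_turn> token
    def __init__(self, tokenizer):
        self.end_id = tokenizer.convert_tokens_to_ids("<end_of_turn>")

    def __call__(self, input_ids, scores, **kwargs):
        return input_ids[0, -1].item() == self.end_id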

from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

# example prompt asking for a caption (a simpler prompt such as "what is this ?" also works)
input_text = """<start_of_turn>user
Create a concise caption that accurately describes the main elements in the image provided
<end_of_turn>
<start_of_turn>model

"""
inputs = processor(text=input_text, images=image, return_tensors="pt")
inputs = {key: value.cuda() for key, value in inputs.items()}
# NB: if you loaded the model in bfloat16, cast the float inputs to the model's dtype
inputs = {key: value.to(dtype=model.dtype) if value.dtype == torch.float32 else value for key, value in inputs.items()}

max_length = inputs['input_ids'].shape[1] + 1024  # total maximum length

# generate; the stopping criteria prints each token as it is produced
ret = model.generate(
    inputs['input_ids'],
    pixel_values=inputs['pixel_values'],
    stopping_criteria=StoppingCriteriaList([stopping_criteria]),
    max_length=max_length,
    do_sample=True,
    temperature=0.7,
)
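To recover the caption as a string afterwards, slice off the prompt tokens and decode (assuming transformers' standard behavior where generate returns the prompt followed by the new tokens):

caption = processor.tokenizer.decode(ret[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(caption)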

It will output something like:

An older, green car sits parked on the curb in front of a building.