Method for loading with transformers without dequantizing

#2
by ddphys - opened

Loading the model with transformers via `tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)` and `model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)` automatically dequantizes the model. Using transformers rather than llama.cpp, what is a good method for running the model for inference? Right now I am trying BitsAndBytesConfig, loading the model with `quantization_config = BitsAndBytesConfig(load_in_4bit=True)`. Any thoughts? Thanks in advance.
