Method for loading with transformers without dequantizing

#2
by ddphys - opened

Loading the model with transformers via `tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)` and `model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)` automatically dequantizes the model. Using transformers rather than llama.cpp, what is a good method for running the model for inference? Right now I am trying BitsAndBytesConfig, loading the model with `quantization_config = BitsAndBytesConfig(load_in_4bit=True)`. Any thoughts? Thanks in advance.
