Error loading model

#1 opened by sm54

Hello,

I've tried loading the Q8_0 quant with text-generation-webui on Windows, and I get this error:

llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'deepseek2'
llama_load_model_from_file: failed to load model
19:48:02-513121 ERROR Failed to load the model.

text-generation-webui's llama-cpp needs an update: the "unknown model architecture" error means the bundled llama.cpp predates deepseek2 support.
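If you want to confirm which llama-cpp-python build text-generation-webui is actually using, you can check from the same Python environment. A minimal sketch; the exact minimum version for deepseek2 support depends on when llama.cpp PR #7519 landed in a release:

import llama_cpp

# Print the installed llama-cpp-python version. Builds that predate
# llama.cpp's deepseek2 support fail with the
# "unknown model architecture: 'deepseek2'" error above.
print(llama_cpp.__version__)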

Turn off flash attention. This seems to be a known bug.

I would think that's a different error than "unknown model architecture", but I may be wrong.

Loading some layers onto the GPU (-ngl) with the latest llama.cpp returned "llama_init_from_gpt_params: error: failed to load model".
Using only the CPU solved this for me (as mentioned here: https://github.com/ggerganov/llama.cpp/pull/7519).
Using flash attention (-fa) gave the error: "GGML_ASSERT: ggml.c:5716: ggml_nelements(a) == ne0*ne1".

@wrtn2 you have to disable flash attention to use the GPU with this model

@bartowski Thanks, good to know! In my case the card lacks sufficient VRAM, so I had set llama.cpp to load only a subset of the layers on the GPU, which works with a number of models but apparently not with this one.
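For anyone hitting the same combination, here is a minimal llama-cpp-python sketch of partial GPU offload with flash attention explicitly disabled. The model path and layer count are placeholders to adjust for your setup, and the flash_attn flag assumes a recent llama-cpp-python release:

from llama_cpp import Llama

# Offload only some layers when VRAM is tight, and keep flash
# attention off, which this model currently requires.
llm = Llama(
      model_path="./DeepSeek-Coder-V2-Lite-Instruct-Q8_0.gguf",  # placeholder path
      n_gpu_layers=20,   # tune to your VRAM; -1 would offload everything
      flash_attn=False,  # equivalent of running llama.cpp without -fa
      n_ctx=4096,
)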

Hi all,
could you tell me how you got it to run?

Right now I am using this cumbersome notebook:

from llama_cpp import Llama

llm = Llama(
      model_path="/DeepSeek-Coder-V2-Lite-Instruct-GGUF/DeepSeek-Coder-V2-Lite-Instruct-Q8_1.gguf",
      n_gpu_layers=-1,  # -1 offloads all layers to the GPU
      # seed=1337,      # uncomment to set a specific seed
      n_ctx=8*2048,     # 16k context window
)

response = llm.create_chat_completion(
      messages=[
          {"role": "system", "content": "You are a helpful coding assistant."},
          {"role": "user", "content": "give me quick sort in c++."},
      ]
)
print(response["choices"][0]["message"]["content"])

Is there a more convenient way, using Hugging Face or anything else?

Thank you in advance!

(FYI: I updated the name to Q8_0_L from Q8_1 just now.)

That looks like a fine implementation; is there an issue you're running into, or are you just trying to find a better way?
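If you'd rather not manage the file path yourself, llama-cpp-python can also pull a GGUF straight from the Hugging Face Hub with Llama.from_pretrained (it needs huggingface_hub installed). A sketch; the repo_id and filename pattern here are assumptions, so match them to the actual repo and quant you want:

from llama_cpp import Llama

# Downloads the GGUF from the Hub into the local cache, then loads it.
llm = Llama.from_pretrained(
      repo_id="bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF",  # assumed repo id
      filename="*Q8_0*.gguf",  # wildcard matched against files in the repo
      n_gpu_layers=-1,
      n_ctx=8*2048,
)

The result is the same Llama object as before, so create_chat_completion works exactly as in your notebook.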

Alright, great, thank you very much!

I am just used to the transformers lingo and thought maybe there was a better way.

And thanks for the fast reply!


Hi. I wanted to test a model up to 8 GB, so I downloaded the IQ3 quant. It doesn't work in GPT4All or LM Studio. I'd appreciate it if you could help me get it up and running.

Getting this in LM Studio with flash attention off; tried both GPU offload and CPU only, same message. Not sure what to do :/ The preset is DeepSeek Coder; maybe it needs a DeepSeek Coder Instruct preset?

error:
"llama.cpp error: 'error loading model architecture: unknown model architecture: 'deepseek2''"

Update to 0.2.25 from the website, or ignore this if you're already on it.

I'm running LM Studio 0.2.26, and it fails. I tried GPT4All, Jan, and Ollama with ChatOllama; nothing will load this model. Tried Q4 and Q8, and flash attention is disabled. How do I use this model?

Ok, I figured it out. If you are running Ollama in Docker, pull the latest image so you get a build with deepseek2 support:

docker pull ollama/ollama:latest

Sign up or log in to comment