Testing experimental quants

#2
by bartowski - opened

@ZeroWw try to compare these ones if you can

I'm going to be testing Meta-Llama-3-8B-Instruct-f16-q4_K_S.gguf against Meta-Llama-3-8B-Instruct-q4_K_S.gguf, I'll share any findings in this thread

excellent, I appreciate it!!

Here is a repo with some results: ddh0/UnquantizedEmbeddingTesting

There are a couple files in the repo that are not detailed in the README, but there is some information there that may be interesting. Let me know if there are any specific models or tests that you'd like done.

TLDR: there is a measurable difference between models with unquantized vs quantized embedding/output tensors, but exactly how important the difference is should be investigated more

cc @ZeroWw

Explaining Newton's laws of motion using examples and analogies

Q8_0: it has enough contextual understanding of the prompt in order to properly adhere to the instructions; it gives the definition of each law of motion, an example, and an analogy.
f16.Q2_K: it has enough contextual understanding of the prompt in order to properly adhere to the instructions; it gives the definition of each law of motion, an example, and an analogy.

Q4_K_S: it does not have enough contextual understanding of the prompt in order to properly adhere to the instructions; it only gives an example and analogy.
f16.Q4_K_S: it has enough contextual understanding of the prompt in order to properly adhere to the instructions; it gives the definition of each law of motion, an example, and an analogy.

Even on something as basic as this, where giving the definition is heavily implied, Q4_K_S fails to follow the prompt, yet f16.Q2_K succeeds while being slightly smaller.

Create an algorithm in Python code to generate a random password between 8 and 15 characters containing lowercase letters, uppercase letters, and numbers

Q8_0: in-depth code explanation; gave a step-by-step explanation of what the code does, identified a potential shortcoming, and offered a suggestion for modifying the code.
f16.Q2_K: basic code explanation; all it did was state that the code fit the criteria.

Q4_K_S: surface level code explanation; made very obvious observations of the code such as, "random generates random characters" and "generate_password generates the password."
f16.Q4_K_S: in-depth code explanation; gave a step-by-step explanation of what the code does, identified a potential shortcoming, and offered a suggestion for modifying the code.

I ran the code for all 4 of them and they all did what was asked, and the code for all of them was nearly identical, except for Q4_K_S, which took a very different approach from the rest. The difference between f16 and non-f16 embeddings and output tensors is very clear with the Q4_K_S and f16.Q4_K_S comparison: Q4_K_S gave an extremely obvious code explanation that had no depth, while f16.Q4_K_S understood step-by-step what the code was doing.

Conclusion

F16.Q2_K has just as much contextual understanding as Q8_0; Q4_K_S, on the other hand, had significantly worse contextual understanding than f16.Q2_K despite being slightly larger than it. I only highlighted two, but most of the other side-by-side comparisons had the same conclusion, just to varying degrees. In my own personal tests, I have seen a difference in contextual understanding between Q8_0 and f16.Q8_0, but it was nothing as comprehensive as ddh0's.

At a bare minimum, Q4 and below should use f16 embedding and output tensors, or at the very least an f16 variant should be offered as an additional option, since it increases the file size. Some 1:1 comparisons between Q8_0 and f16.Q8_0 and between Q6_K and f16.Q6_K would be good, to see if this should be implemented across the board. I would be particularly interested in a comparison between Q6_K and f16.Q4_K_M since they're nearly identical in file size.

@HiroseKoichi okay, running f16-q6 vs q6 and f16-q8 vs q8 soon

Test results for f16-q6_K vs q6_K and f16-q8_0 vs q8_0 are available in the repo (still need to update the README)

My feedback for q8_0 vs. q8_1, based on a 4200-token, 21-question survey. Client= LM Studio, temp=0, topP=0.95, system prompt: Perform the task to the best of your ability.

The first shot for each was basically the same; after regenerating more than 3 times, there were some differences: 1. q8_1 followed the instructions better, while q8_0 stopped responding after a summarization task in the middle. 2. The quality of the answered tasks was similar.

I suspect the q8_0 file is broken. I also downloaded and tried bartowski/tabula-8b-GGUF q8_0 and q8_0_L. I don't know what's wrong, but neither works with LM Studio v.0.2.25, with presets Llama3 or ChatML.

I completely spaced out the fact that what I looked at was a comparison of f16 vs. f16.Q4_K_S and not f16.Q4_K_S vs. Q4_K_S. Most of my previous conclusion should still be correct; however, I have to go back and redo some things. I'll update when I finish comparing all of the variants with each other, but at a quick glance, f16, f16.Q8_0, and f16.Q6_K all seem to be nearly identical and preferable over Q8_0.

EDIT: Actually, there's a mismatch on the README vs. file; the README says that it was f16.Q4_K_S vs. Q4_K_S, but the file says it was f16 vs. f16.Q4_K_S. @ddh0 could you clarify which of the two it was? Also, I hate to ask this since the comparisons were already run, but would you be able to do another run where each model has its own separate file for the responses? It would make it much easier to do the comparisons since I can just highlight the differences in a text editor.

Actually, there's a mismatch on the README vs. file; the README says that it was f16.Q4_K_S vs. Q4_K_S, but the file says it was f16 vs. f16.Q4_K_S. @ddh0 could you clarify which of the two it was?

@HiroseKoichi It's f16-q4_K_S vs. regular q4_K_S

Also, which models would you like me to compare?

Not comparisons this time; I want each model individually run on the 40 prompts so that they each have their own text file. The current output is good for automatic evaluation but very hard for manual evaluation. If I want to compare the output of one model against another, I have to copy the first half back into both responses and then into their own separate text files if I want to see them visually side-by-side. If each model's responses are in its own text file, then I can just select two files and run a diff check in a text editor to highlight all the differences.
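For anyone who prefers a script to a text editor, here is a minimal sketch of the same diff check using Python's standard `difflib`. The function names and the file names in the usage note are hypothetical, not taken from the repo.

```python
import difflib
from pathlib import Path


def diff_texts(text_a, text_b, name_a="a", name_b="b"):
    """Return unified-diff lines between two blocks of model output."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    return list(
        difflib.unified_diff(
            lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm=""
        )
    )


def diff_files(path_a, path_b):
    """Read two per-model result files and diff their contents."""
    text_a = Path(path_a).read_text(encoding="utf-8")
    text_b = Path(path_b).read_text(encoding="utf-8")
    return diff_texts(text_a, text_b, str(path_a), str(path_b))
```

Usage would look something like `print("\n".join(diff_files("results_q4_K_S.txt", "results_f16-q4_K_S.txt")))`, with the placeholder names swapped for the real result files. An empty list means the two files are line-for-line identical.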

Ah okay. I'll set that up

I want each model individually run on the 40 prompts so that they each have their own text file

@HiroseKoichi sorry for the delay, this is done now. Each model has its results in a separate file in the repo: ddh0/UnquantizedEmbeddingTesting

All 20 different quantizations are included, from q2_K to q8_0 to f16-q2_K to f16-q8_0. I'm very interested to see what differences you find

CC @bartowski @ZeroWw @helloAI333

All 20 different quantizations are included, from q2_K to q8_0 to f16-q2_K to f16-q8_0. I'm very interested to see what differences you find

too many because you used random seeds.
in a comparison like this the seed should be fixed and you should include also some questions that include reasoning and some that include creative writing.
That's because the output tensor affects the "way" it expresses itself, while the embed tensor affects more its understanding.
Also, add one test of the pure f16 (convert the hf model to f16) like:
python llama.cpp/convert-hf-to-gguf.py --outtype f16 model_name --outfile ${model_name}.f16.gguf

that's because f16 above will be the "baseline".

here you can find a bunch of models with the f16 and f16.q5, f16.q6 and f16.q8: https://huggingface.co/RobertSinclair

CC @ddh0 , @bartowski @helloAI333

too many because you used random seeds.

Don't think seeds are relevant in this case as I'm not doing any sampling

too many because you used random seeds.

Don't think seeds are relevant in this case as I'm not doing any sampling

@ddh0
in general, no... but asking the same questions gives different results depending on the seed.. and it's more difficult to determine how much a model is degraded if the seeds are random.

he's got temperature = 0.0 which means that seed doesn't play a role
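A toy illustration of that point, not taken from llama.cpp or any real sampler: assuming the runtime treats temperature 0 as greedy decoding (always picking the highest-scoring token), the seed is never consulted. The `pick_token` helper and the logits below are made up for the example.

```python
import math
import random


def softmax(logits, temperature):
    """Convert logits to probabilities at the given temperature (> 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_token(logits, temperature, seed):
    """Pick a token index; temperature 0 is treated as greedy (argmax)."""
    if temperature == 0.0:
        # Greedy path: no random number generator is involved at all.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Sampling path: the seed decides which token gets drawn.
    rng = random.Random(seed)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


logits = [1.0, 3.5, 0.2, 3.4]  # hypothetical scores for four tokens
greedy = {pick_token(logits, 0.0, seed) for seed in range(100)}
sampled = {pick_token(logits, 1.0, seed) for seed in range(100)}
print(greedy)   # a single index, whatever the seed
print(sampled)  # typically more than one index
```

With sampling enabled, the two near-tied tokens swap places from seed to seed, which is why a fixed seed matters only once temperature is above 0.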

@ddh0 I created a pull request to fix the formatting of the files. The current ones have the escape sequences written in plain text instead of rendered.

Can you also drop an additional text file that has the file sizes of all the models? Thanks again for running all of this.

Results for pure bf16 test are up: Results_Meta-Llama-3-8B-Instruct-bf16.gguf.txt

I created a pull request to fix the formatting of the files. The current ones have the escape sequences written in plain text instead of rendered.

Thank you, but this is intentional and I don't think it's a problem

Can you also drop an additional text file that has the file sizes of all the models? Thanks again for running all of this.

Will do now

Here is a text file with the sizes of each model in bytes (as outputted from ls -al on my machine): sizes.txt

Here is a text file with the sizes of each model in bytes (as outputted from ls -al on my machine): sizes.txt

weird.. in your "sizes" I read:
7835472160 Jun 16 18:30 Meta-Llama-3-8B-Instruct-f16-q6_K.gguf

while in my quantization is:
7.84 GB

can you check if the file is the same?
https://huggingface.co/ZeroWw/Meta-Llama-3-8B-Instruct-GGUF/blob/main/Meta-Llama-3-8B-Instruct.q6_k.gguf

I ask because I am not sure what makes my quantization better.. it might be anything.

I would suggest doing tests comparing the F16 in my repository to the q5, q6 and q8 in the same directory.
Those are surely the right files.

to obtain them I run a colab notebook whose main part is this:

import os
import subprocess

repo_model_name = 'gradientai/Llama-3-8B-Instruct-Gradient-1048k' #@param ["mistralai/Mistral-7B-Instruct-v0.3", "lucyknada/microsoft_WizardLM-2-7B", "meta-llama/Meta-Llama-3-8B-Instruct", "BarraHome/Mistroll-7B-v2.2","Qwen/Qwen1.5-7B-Chat","microsoft/Phi-3-mini-128k-instruct","microsoft/Phi-3-medium-128k-instruct","google/gemma-7b",'zhengr/MixTAO-7Bx2-MoE-v8.1','CohereForAI/aya-23-8B','01-ai/Yi-1.5-9B-32K','deepseek-ai/DeepSeek-Coder-V2-Lite-Base','01-ai/Yi-1.5-6B-Chat','ZeusLabs/L3-Aethora-15B-V2','Nitral-AI/Hathor_Stable-v0.2-L3-8B'] {allow-input: true}
model_name = os.path.basename(repo_model_name)

# Download Model
print(f'Downloading {repo_model_name}')
subprocess.run(['huggingface-cli', 'download', repo_model_name, '--local-dir', model_name], stdout=subprocess.DEVNULL)

# Convert Model
print('Converting model to f16.')
subprocess.run(['python', 'llama.cpp/convert-hf-to-gguf.py', '--outtype', 'f16', model_name, '--outfile', f'{model_name}.f16.gguf'], stdout=subprocess.DEVNULL)

# Remove the original model directory
os.system(f'rm -rf {model_name}')

# Quantize Model
quantization_types = ['q5_k', 'q6_k', 'q8_0']
for q_type in quantization_types:
    print(f'Quantizing {q_type}')
    subprocess.run(['./build/bin/llama-quantize', '--allow-requantize', '--output-tensor-type', 'f16', '--token-embedding-type', 'f16', f'{model_name}.f16.gguf', f'{model_name}.{q_type}.gguf', q_type, str(os.cpu_count())], stdout=subprocess.DEVNULL)

7835472160 bytes is equal to 7.835 GB, which rounds up to 7.84GB

7835472160 bytes is equal to 7.835 GB, which rounds up to 7.84GB

7835472160/1024/1024/1024 = 7.29 GB

No, that's 7.29 Gibibytes (GiB), not gigabytes (GB). See here
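The two conversions side by side, as a quick sketch (the helper names are just for illustration):

```python
def to_gb(num_bytes):
    """Decimal gigabytes: 1 GB = 10**9 bytes."""
    return num_bytes / 10**9


def to_gib(num_bytes):
    """Binary gibibytes: 1 GiB = 2**30 bytes."""
    return num_bytes / 2**30


size = 7_835_472_160  # byte count quoted above for the f16-q6_K file
print(f"{to_gb(size):.2f} GB")    # 7.84 GB
print(f"{to_gib(size):.2f} GiB")  # 7.30 GiB
```

The same byte count gives both figures. The 7.29 quoted above is the GiB value truncated rather than rounded.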

No, that's 7.29 Gibibytes (GiB), not gigabytes (GB). See here

so you confirm your file has the same size in bytes?

No, I do not confirm that. If you want to confirm that on your own, go ahead

Edit: I don't think that the exact file size in bytes is going to help you figure anything out, for what it's worth

This is how the sizes should be:

-rw-r--r-- 1 root root 16068890912 Jun 28 05:55 Meta-Llama-3-8B-Instruct.f16.gguf
-rw-r--r-- 1 root root  7042224416 Jun 28 06:07 Meta-Llama-3-8B-Instruct.q5_k.gguf
-rw-r--r-- 1 root root  7835472160 Jun 28 06:15 Meta-Llama-3-8B-Instruct.q6_k.gguf
-rw-r--r-- 1 root root  9525776672 Jun 28 06:17 Meta-Llama-3-8B-Instruct.q8_0.gguf

What is your point, exactly? I don't think my file needs to be the exact same size in bytes as yours. What are you getting at?

What is your point, exactly? I don't think my file needs to be the exact same size in bytes as yours. What are you getting at?

No need to be snippy, but if the size is not the same it means the quantization process was different than the one I proposed. That's all.
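If anyone wants to check whether two local copies really are the same file, one possible approach is to compare the byte count and a SHA-256 digest. This is only a sketch: the function names are hypothetical, and a matching size on its own does not prove identical contents, which is why the hash is included.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size in bytes, SHA-256 hex digest) for a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


def same_file(path_a, path_b):
    """True only if both the size and the digest match."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)
```

Reading in chunks keeps memory use flat even for multi-gigabyte GGUF files.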
