---
base_model:
  - yuvraj17/EvolCodeLlama-3.1-8B-Instruct
  - yzhuang/Meta-Llama-3-8B-Instruct_fictional_gsm8k_English_v1
tags:
  - merge
  - mergekit
  - lazymergekit
  - yuvraj17/EvolCodeLlama-3.1-8B-Instruct
  - yzhuang/Meta-Llama-3-8B-Instruct_fictional_gsm8k_English_v1
---

# Llama3-8B-Instruct-Slerp

Llama3-8B-Instruct-Slerp is a merge of the following models using LazyMergekit:

* [yuvraj17/EvolCodeLlama-3.1-8B-Instruct](https://huggingface.co/yuvraj17/EvolCodeLlama-3.1-8B-Instruct)
* [yzhuang/Meta-Llama-3-8B-Instruct_fictional_gsm8k_English_v1](https://huggingface.co/yzhuang/Meta-Llama-3-8B-Instruct_fictional_gsm8k_English_v1)

## 🧩 Configuration

```yaml
slices:
  - sources:
      - model: yuvraj17/EvolCodeLlama-3.1-8B-Instruct
        layer_range: [0, 32]
      - model: yzhuang/Meta-Llama-3-8B-Instruct_fictional_gsm8k_English_v1
        layer_range: [0, 32]
merge_method: slerp
base_model: yuvraj17/EvolCodeLlama-3.1-8B-Instruct
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: float16
```
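For intuition, the `t` schedule above controls the spherical interpolation between the two models' weights: `t = 0` keeps the base model (`yuvraj17/EvolCodeLlama-3.1-8B-Instruct`), `t = 1` keeps the other model, and the self-attention and MLP layers are blended on opposite schedules across the depth of the network. The snippet below is a minimal, standalone sketch of the SLERP formula on plain tensors, meant only as an illustration of the method, not mergekit's actual implementation:

```python
import torch

def slerp(t: float, w0: torch.Tensor, w1: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors (illustration only)."""
    v0 = w0.flatten().float()
    v1 = w1.flatten().float()
    # Angle between the two flattened weight vectors
    cos_omega = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.acos(cos_omega.clamp(-1.0, 1.0))
    if omega.abs() < eps:
        # Nearly parallel vectors: fall back to linear interpolation
        return (1 - t) * w0 + t * w1
    so = torch.sin(omega)
    out = (torch.sin((1 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
    return out.reshape(w0.shape).to(w0.dtype)

# t=0 returns the first tensor, t=1 the second; intermediate values
# follow the great-circle path between them instead of a straight line.
a, b = torch.randn(4, 4), torch.randn(4, 4)
print(slerp(0.5, a, b))
```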

## 💻 Usage

```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "yuvraj17/Llama3-8B-Instruct-Slerp"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Build the prompt with the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Text-generation pipeline in float16, spread across available devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
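If you prefer handling the model object directly instead of going through a pipeline, a minimal sketch using `AutoModelForCausalLM` (assuming a GPU with enough memory for the fp16 weights) could look like this:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yuvraj17/Llama3-8B-Instruct-Slerp"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

messages = [{"role": "user", "content": "What is a large language model?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sample a reply with the same settings as the pipeline example above
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```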

## 🏆 Evaluation Scores

### Nous

| Model | AGIEval | TruthfulQA | Bigbench |
|---|---:|---:|---:|
| yuvraj17/Llama3-8B-Instruct-Slerp | 38.32 | 57.15 | 43.91 |

### AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---|
| agieval_aqua_rat | 0 | acc | 23.62 | ± 2.67 |
| | | acc_norm | 22.05 | ± 2.61 |
| agieval_logiqa_en | 0 | acc | 27.50 | ± 1.75 |
| | | acc_norm | 31.80 | ± 1.83 |
| agieval_lsat_ar | 0 | acc | 21.30 | ± 2.71 |
| | | acc_norm | 20.87 | ± 2.69 |
| agieval_lsat_lr | 0 | acc | 35.29 | ± 2.12 |
| | | acc_norm | 37.65 | ± 2.15 |
| agieval_lsat_rc | 0 | acc | 42.01 | ± 3.01 |
| | | acc_norm | 39.78 | ± 2.99 |
| agieval_sat_en | 0 | acc | 55.83 | ± 3.47 |
| | | acc_norm | 50.49 | ± 3.49 |
| agieval_sat_en_without_passage | 0 | acc | 36.89 | ± 3.37 |
| | | acc_norm | 34.95 | ± 3.33 |
| agieval_sat_math | 0 | acc | 29.55 | ± 3.08 |
| | | acc_norm | 28.64 | ± 3.05 |

Average score: 33.28%

### TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---|
| truthfulqa_mc | 1 | mc1 | 33.54 | ± 1.65 |
| | | mc2 | 49.78 | ± 1.53 |

Average score: 49.78%

### BigBench

| Task | Version | Metric | Value | Stderr |
|---|---:|---|---:|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 47.89 | ± 3.63 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 39.02 | ± 2.54 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 33.72 | ± 2.95 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 20.61 | ± 2.14 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 31.40 | ± 2.08 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.71 | ± 1.61 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 47.00 | ± 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 27.40 | ± 1.99 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.10 | ± 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 38.40 | ± 1.09 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 27.23 | ± 2.11 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 25.45 | ± 1.38 |
| bigbench_snarks | 0 | multiple_choice_grade | 46.41 | ± 3.72 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 50.30 | ± 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 37.30 | ± 1.53 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 21.36 | ± 1.16 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 17.14 | ± 0.90 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 47.00 | ± 2.89 |

Average score: 35.38%
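These results follow the Nous-style benchmark suite (AGIEval, TruthfulQA, BigBench) as typically run with EleutherAI's lm-evaluation-harness. The sketch below shows how a comparable run could look with the harness's Python API; the task names in the tables above come from the evaluation fork used for Nous-style scores and may need to be mapped to the names available in your harness version, so treat this as an illustrative starting point rather than the exact command used here:

```python
# !pip install -qU lm-eval  # EleutherAI lm-evaluation-harness
import lm_eval

# Task names below are illustrative; map them to the ones your harness version exposes.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=yuvraj17/Llama3-8B-Instruct-Slerp,dtype=float16",
    tasks=["agieval_aqua_rat", "truthfulqa_mc1", "truthfulqa_mc2"],
    batch_size=8,
)
print(results["results"])
```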