---
language:
  - pt
license: llama2
library_name: transformers
tags:
  - code
  - analytics
  - analise-dados
  - portugues-BR
datasets:
  - semantixai/Test-Dataset-Lloro
base_model: codellama/CodeLlama-7b-Instruct-hf
model-index:
  - name: LloroV2
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: ENEM Challenge (No Images)
          type: eduagarcia/enem_challenge
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 26.03
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: BLUEX (No Images)
          type: eduagarcia-temp/BLUEX_without_images
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 29.07
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: OAB Exams
          type: eduagarcia/oab_exams
          split: train
          args:
            num_few_shot: 3
        metrics:
          - type: acc
            value: 32.53
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Assin2 RTE
          type: assin2
          split: test
          args:
            num_few_shot: 15
        metrics:
          - type: f1_macro
            value: 57.19
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Assin2 STS
          type: eduagarcia/portuguese_benchmark
          split: test
          args:
            num_few_shot: 15
        metrics:
          - type: pearson
            value: 26.81
            name: pearson
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: FaQuAD NLI
          type: ruanchaves/faquad-nli
          split: test
          args:
            num_few_shot: 15
        metrics:
          - type: f1_macro
            value: 43.77
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HateBR Binary
          type: ruanchaves/hatebr
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: f1_macro
            value: 68.02
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: PT Hate Speech Binary
          type: hate_speech_portuguese
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: f1_macro
            value: 38.53
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: tweetSentBR
          type: eduagarcia-temp/tweetsentbr
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: f1_macro
            value: 35.21
            name: f1-macro
        source:
          url: >-
            https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2
          name: Open Portuguese LLM Leaderboard
---

# Lloro 7B

*(Lloro-7b logo)*

Lloro, developed by Semantix Research Labs, is a language model trained to perform data analysis in Python for Portuguese-language requests. It is a fine-tuned version of codellama/CodeLlama-7b-Instruct-hf, trained on synthetic datasets. The fine-tuning was performed with the QLoRA methodology on a V100 GPU with 16 GB of RAM.

## Model description

- **Model type:** A 7B-parameter model fine-tuned on synthetic datasets.
- **Language(s) (NLP):** Primarily Portuguese, but the model is capable of understanding English as well.
- **Finetuned from model:** codellama/CodeLlama-7b-Instruct-hf

## What are Lloro's intended uses?

Lloro is built for data analysis in Portuguese contexts.

- **Input:** Text
- **Output:** Text (code)

## Usage

### Using Transformers

```python
# Import required libraries
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)

# Load the model
model_name = "semantixai/LloroV2"
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Define the prompt (in English: "Develop a Python algorithm to compute the
# mean and median of sales prices by product material type.")
user_prompt = "Desenvolva um algoritmo em Python para calcular a média e a mediana dos preços de vendas por tipo de material do produto."
system = "Provide answers in Python without explanations, only the code"
prompt_template = f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_prompt}[/INST]"

# Tokenize and move the inputs to the model's device
input_ids = tokenizer([prompt_template], return_tensors="pt")["input_ids"].to(base_model.device)

# Call the model
outputs = base_model.generate(
    input_ids,
    do_sample=True,
    top_p=0.95,
    max_new_tokens=1024,
    temperature=0.1,
)

# Decode and print the output
output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(output_text[0])
```
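If your transformers version includes chat-template support, the same Llama-2 prompt can be built from the tokenizer instead of by hand. A minimal sketch, assuming the tokenizer ships a Llama-2-style chat template (reusing `system`, `user_prompt`, and `base_model` from above):

```python
# Build the [INST] <<SYS>> ... prompt from the tokenizer's chat template
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user_prompt},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(base_model.device)
outputs = base_model.generate(input_ids, do_sample=True, top_p=0.95, max_new_tokens=1024, temperature=0.1)
```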

### Using an OpenAI-compatible inference server (such as vLLM)

```python
from openai import OpenAI

# Point the client at a local OpenAI-compatible endpoint
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

user_prompt = "Desenvolva um algoritmo em Python para calcular a média e a mediana dos preços de vendas por tipo de material do produto."

completion = client.chat.completions.create(
    model="semantixai/LloroV2",
    temperature=0.1,
    frequency_penalty=0.1,
    messages=[
        {"role": "system", "content": "Provide answers in Python without explanations, only the code"},
        {"role": "user", "content": user_prompt},
    ],
)
```
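This assumes a server is already running and serving the model; with vLLM it is typically started with something like `python -m vllm.entrypoints.openai.api_server --model semantixai/LloroV2` (the exact command depends on your vLLM version). The generated code is then available on the completion object:

```python
# Print the generated code returned by the server
print(completion.choices[0].message.content)
```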

## Training parameters

| Params | Training Data | Examples | Tokens | LR |
|--------|---------------|----------|--------|----|
| 7B | Synthetic instruction/code pairs | 28,907 | 3,031,188 | 1e-5 |

## Model Sources

- **Test Dataset Repository:** https://huggingface.co/datasets/semantixai/Test-Dataset-Lloro
- **Model dates:** Lloro was trained between November 2023 and January 2024.

## Performance

| Model | LLM as Judge | CodeBLEU | ROUGE-L | CodeBERT-Precision | CodeBERT-Recall | CodeBERT-F1 | CodeBERT-F3 |
|-------|--------------|----------|---------|--------------------|-----------------|-------------|-------------|
| GPT-3.5 | 91.22% | 0.2745 | 0.2189 | 0.7502 | 0.7146 | 0.7303 | 0.7175 |
| Instruct-Base | 97.40% | 0.2487 | 0.1146 | 0.6997 | 0.6473 | 0.6713 | 0.6518 |
| Instruct-FT | 97.76% | 0.3264 | 0.3602 | 0.7942 | 0.8178 | 0.8042 | 0.8147 |
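The text-overlap metrics above (ROUGE-L and the like) can be recomputed with the `evaluate` library. A minimal sketch, assuming `predictions` and `references` are parallel lists of generated and reference code strings from the test dataset (the strings below are illustrative, not taken from the dataset):

```python
# pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")

predictions = ["df.groupby('material')['preco'].mean()"]  # model outputs (illustrative)
references = ["df.groupby('material')['preco'].mean()"]   # gold answers (illustrative)

scores = rouge.compute(predictions=predictions, references=references)
print(scores["rougeL"])  # LCS-based F-measure, as reported in the table
```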

**Training details:** The following hyperparameters were used during training:

| Parameter | Value |
|-----------|-------|
| learning_rate | 1e-5 |
| weight_decay | 0.0001 |
| train_batch_size | 1 |
| eval_batch_size | 1 |
| seed | 42 |
| optimizer | Adam (paged_adamw_32bit) |
| lr_scheduler_type | cosine |
| lr_scheduler_warmup_ratio | 0.03 |
| num_epochs | 5.0 |

**QLoRA hyperparameters:** The following parameters related to Quantized Low-Rank Adaptation and quantization were used during training:

| Parameter | Value |
|-----------|-------|
| lora_r | 16 |
| lora_alpha | 64 |
| lora_dropout | 0.1 |
| storage_dtype | "nf4" |
| compute_dtype | "float16" |
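Together, the two tables map onto a standard PEFT + bitsandbytes setup. A minimal sketch of how these values could be wired up; the target modules and any argument not listed above are assumptions, not taken from the actual training code:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 storage with float16 compute, per the QLoRA table above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapter with r=16, alpha=64, dropout=0.1, per the table above
lora_config = LoraConfig(
    r=16,
    lora_alpha=64,
    lora_dropout=0.1,
    bias="none",            # assumption: not listed in the table
    task_type="CAUSAL_LM",  # target_modules left to peft's defaults (assumption)
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The paged AdamW optimizer from the training table corresponds to `optim="paged_adamw_32bit"` when using the `Trainer` API.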

## Experiments

| Model | Epochs | Overfitting | Final Epochs | Training Hours | CO2 Emission (kg) |
|-------|--------|-------------|--------------|----------------|-------------------|
| Code Llama Instruct | 1 | No | 1 | 8.1 | 1.337 |
| Code Llama Instruct | 5 | Yes | 3 | 45.6 | 9.12 |

## Framework versions

| Library | Version |
|---------|---------|
| bitsandbytes | 0.40.2 |
| Datasets | 2.14.3 |
| PyTorch | 2.0.1 |
| Tokenizers | 0.14.1 |
| Transformers | 4.34.0 |

## Open Portuguese LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/eduagarcia/open_pt_llm_leaderboard?query=semantixai/LloroV2).

| Metric | Value |
|--------|-------|
| **Average** | **39.68** |
| ENEM Challenge (No Images) | 26.03 |
| BLUEX (No Images) | 29.07 |
| OAB Exams | 32.53 |
| Assin2 RTE | 57.19 |
| Assin2 STS | 26.81 |
| FaQuAD NLI | 43.77 |
| HateBR Binary | 68.02 |
| PT Hate Speech Binary | 38.53 |
| tweetSentBR | 35.21 |