Creating a Training Dataset for NER and RE

#173
by sferro - opened

Dear all,

I am using meta-llama/Meta-Llama-3-8B-Instruct to produce a training dataset, intended not only for Llama 3 but also for spaCy models, which require a specifically structured dataset for training.
The problem I am encountering is that when I print/save the data to a JSON/text file, some of the trigger words I inserted to mark the start and end of the entity-recognition and relation-extraction output for a given query/sentence are sometimes placed in the wrong position in the file.

I am using the following messages to prompt the model:

messages = [
    {"role": "system", "content": "You extract the Named Entities in the form: entity text (start_idx, end_idx, label)"},
    {"role": "system", "content": "Also, you extract the relations in the form: (subject, predicate, object)"},
    {"role": "user", "content": ""},
]
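
For context, the third dictionary is the user turn that gets overwritten on every iteration. A minimal sketch (plain Python, no model involved; fill_user_turn and the example sentence are mine, not from my pipeline) of filling it without mutating the shared list:

```python
messages = [
    {"role": "system", "content": "You extract the Named Entities in the form: entity text (start_idx, end_idx, label)"},
    {"role": "system", "content": "Also, you extract the relations in the form: (subject, predicate, object)"},
    {"role": "user", "content": ""},
]

def fill_user_turn(messages, text):
    # Shallow-copy each message dict so a stale user turn from a
    # previous iteration can never leak into the next prompt.
    msgs = [dict(m) for m in messages]
    msgs[2]["content"] = text
    return msgs

filled = fill_user_turn(messages, "Mr. John Doe purchased a painting in 1985.")
```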

Then, I pass the sentence to be processed via the third dictionary, as follows:

with open('llama3_output_new.txt', 'a+') as f:
    for d in tqdm(list_data):
        text = d['text']
        if text != '':
            messages[2]["content"] = text

        prompt = pipeline.tokenizer.apply_chat_template(
            messages, 
            tokenize=False, 
            add_generation_prompt=True      
        )
        outputs = pipeline(
            prompt,
            max_new_tokens=256,
            eos_token_id=terminators,
            do_sample=True,
            temperature=0.6,
            top_p=0.9,
        )
        generated_text = outputs[0]["generated_text"][len(prompt):]
        f.write('Original text: ' + messages[2]["content"])
        f.write(generated_text)
        f.write('------END------')
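
One thing I notice about the three f.write calls above: none of them appends a newline, so the original text, the generation, and the END marker all land on one physical line unless the model itself emits trailing newlines. A standalone sketch (hypothetical strings, no model; write_record is my own helper, not part of my pipeline) of a delimiter-safe write:

```python
import io

def write_record(f, original_text, generated_text, end_marker="------END------"):
    # Explicit newlines keep each field of a record on its own line,
    # so the END marker can never fuse with the generated text.
    f.write("Original text: " + original_text + "\n")
    f.write(generated_text.rstrip() + "\n")
    f.write(end_marker + "\n")

# Exercise the helper against an in-memory buffer instead of a real file.
buf = io.StringIO()
write_record(buf, "Mr. John Doe purchased a painting.", "(Mr. John Doe, purchased, a painting)")
```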

The first line written to the file is the following:
Original text: You extract the Named Entities in the form: entity text (start_idx, end_idx, label),
whereas I expected the original input text to appear right after 'Original text: '.
Also, the strings I introduced as "triggers" to delimit the output for a query sometimes end up in the wrong place:

1. (Mr. John Doe, purchased, object)
2. (Mr. John Doe, acquired, object)
3. (Mr. John Doe, died, 1990)
4. (Mr. John Doe, possibly acquired, object from David Red in 1985)
5. (Mrs.------END------
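
Truncated records like line 5 above can at least be detected when reading the dump back. A minimal sketch (assuming the END marker string from my script; split_records is a hypothetical helper of mine) that splits the dump into complete records and flags anything cut off after the last marker:

```python
END_MARKER = "------END------"

def split_records(dump):
    # Every complete record ends with the END marker; whatever follows
    # the last marker is a partial (e.g. truncated) record.
    parts = dump.split(END_MARKER)
    complete, trailing = parts[:-1], parts[-1]
    return [p.strip() for p in complete], trailing.strip()

records, leftover = split_records(
    "Original text: a\n(a, b, c)" + END_MARKER + "Original text: d\n(d, e"
)
```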

At first, I thought the problem was that I was using more than one GPU, but the issue persists after restricting the model to a single GPU. The results reported above were produced with one GPU.

Moreover, I would expect the model's output to be serialised correctly by the model itself, even when it runs on more than one GPU.

Do you have any suggestions or insight into this issue?
