CooperW committed
Commit 6a67fae
Parent: 7c157e7

Upload 4 files

Files changed (4)
  1. Readme.md +138 -0
  2. TESTER.ipynb +0 -0
  3. TRAINER.ipynb +0 -0
  4. dataFormat.ipynb +127 -0
Readme.md ADDED
@@ -0,0 +1,138 @@
# Project SecureAi Labs

This project is designed for fine-tuning language models using the Unsloth library with LoRA adapters. It provides utilities for training, testing, and formatting data for models such as Phi-3, Gemma, and Meta-Llama.

## Table of Contents
1. [Prerequisites](#prerequisites)
2. [File Descriptions](#file-descriptions)
   - [TRAINER.ipynb](#traineripynb)
   - [TESTER.ipynb](#testeripynb)
   - [dataFormat.ipynb](#dataformatipynb)
3. [Usage](#usage)
   - [Environment Setup](#environment-setup)
   - [Training a Model](#training-a-model)
   - [Testing the Model](#testing-the-model)
   - [Formatting Data](#formatting-data)
4. [Additional Resources](#additional-resources)

---

## Prerequisites

Before running the project, ensure you have the following:
- A [Hugging Face](https://huggingface.co) account and token.
- Google Colab or a local environment with Python 3.x and CUDA support.
- Packages such as `unsloth`, `huggingface_hub`, `peft`, `trl`, and others (installed automatically in the notebooks).

**Note on GPU requirements:**

```python
models = [
    'Phi-3.5-mini-instruct-bnb-4bit',      # Min training GPU: T4   | Min testing GPU: T4 | Max model size: 14.748 GB
    'gemma-2-27b-it-bnb-4bit',             # Min training GPU: A100 | Min testing GPU: L4 | Max model size: 39.564 GB
    'Meta-Llama-3.1-8B-Instruct-bnb-4bit'  # Min training GPU: T4   | Min testing GPU: T4 | Max model size: 22.168 GB
]
```

Refer to the [Unsloth Documentation](https://unsloth.ai/) for more details.
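To confirm that the current runtime meets these requirements, a quick check such as the following (a generic PyTorch snippet, not part of the notebooks) can be run first:

```python
# Quick sanity check of the GPU visible to the runtime (Colab or local).
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"VRAM: {vram_gb:.1f} GB")
```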

## File Descriptions

### 1. `TRAINER.ipynb`

This notebook trains a language model with LoRA adapters using the Unsloth library. The core functionality includes:
- Loading a pre-trained model from Hugging Face using `FastLanguageModel`.
- Attaching LoRA adapters for efficient fine-tuning of large models.
- Setting training configurations (e.g., learning rate, number of epochs, batch size) using the `SFTTrainer` from the `trl` library.
- Optionally resuming training from the last checkpoint.
- Uploading checkpoints and models to Hugging Face during or after training.

#### How to Use:
1. Open this notebook in Google Colab or a similar environment.
2. Ensure you have set up your Hugging Face token (refer to the [Environment Setup](#environment-setup) section below).
3. Customize the training parameters if needed.
4. Run the notebook cells to train the model. A minimal sketch of the overall workflow follows.

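The sketch below is illustrative rather than the notebook's exact code: the model name, LoRA settings, dataset file, and training arguments are assumed placeholder values, and the `text` field is assumed to come from the data-formatting step.

```python
# Minimal TRAINER-style sketch (assumed values; see the notebook for the real settings).
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a 4-bit base model and its tokenizer from Hugging Face.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-3.5-mini-instruct-bnb-4bit",  # one of the models listed above
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
)

# Formatted dataset produced by dataFormat.ipynb / the chat-template step.
dataset = load_dataset("json", data_files="train_network.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumes a "text" column created by the formatting step
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="outputs",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
    ),
)
trainer.train()  # pass resume_from_checkpoint=True to continue from the last checkpoint
```
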
### 2. `TESTER.ipynb`

This notebook handles the evaluation of a fine-tuned model. It tests the model's accuracy and efficiency on a test dataset using predefined metrics (accuracy, precision, recall, and F1 score) and provides the following functionality:
- Loads the fine-tuned model with its LoRA adapters.
- Defines a function to evaluate the model's predictions on a test dataset.
- Outputs accuracy and other classification metrics.
- Displays confusion matrices for better insight into model performance.

#### How to Use:
1. Load this notebook in your environment.
2. Specify the test dataset and model details.
3. Run the evaluation loop to get accuracy, predictions, and metric visualizations. A rough sketch of such a loop is shown below.

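The following is a rough sketch of an evaluation loop of this kind, assuming the binary normal/attack labels produced by `dataFormat.ipynb`, a hypothetical checkpoint path, and scikit-learn for the metrics; the notebook's actual prompting and label parsing may differ.

```python
# Rough evaluation sketch (assumed paths, prompting, and label parsing; see TESTER.ipynb for the real version).
from unsloth import FastLanguageModel
from datasets import load_dataset
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical path/repo of the fine-tuned LoRA adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/checkpoint-60",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # switch Unsloth into inference mode

test_dataset = load_dataset("json", data_files="test_network.jsonl", split="train")

y_true, y_pred = [], []
for example in test_dataset:
    prompt = example["conversations"][0]["value"]   # network-flow description
    label = example["conversations"][1]["value"]    # "normal" or "attack"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=8)
    answer = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True).strip().lower()
    y_true.append(label)
    y_pred.append("normal" if "normal" in answer else "attack")

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```
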
### 3. `dataFormat.ipynb`

This notebook formats datasets into the correct structure for training and testing models. It maps raw text data into a format suitable for language-model training, particularly for multi-turn conversations:
- Formats conversations into a chat-based template using Unsloth's `chat_templates`.
- Maps data fields like "role", "content", and user/assistant conversations.
- Prepares the dataset for tokenization and input to the model.

#### How to Use:
1. Open the notebook and specify the dataset you wish to format.
2. Adjust any template settings based on the model you're using.
3. Run the notebook to output the formatted dataset. A minimal sketch of the chat-template mapping follows.

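A minimal sketch of that mapping, assuming Unsloth's `get_chat_template` helper, the ShareGPT-style `from`/`value` fields written by this notebook, and a placeholder template name (choose the one matching your model):

```python
# Sketch of the chat-template mapping (assumed template name and field mapping).
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Phi-3.5-mini-instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Map the ShareGPT-style "from"/"value" fields onto the model's chat roles.
tokenizer = get_chat_template(
    tokenizer,
    chat_template="phi-3",  # pick the template matching the chosen model
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"},
)

def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
             for convo in convos]
    return {"text": texts}

dataset = load_dataset("json", data_files="train_network.jsonl", split="train")
dataset = dataset.map(formatting_prompts_func, batched=True)  # adds a "text" column for training
```
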
---

## Usage

### Environment Setup

1. **Install Unsloth**:
   The following command is included in the notebooks to install Unsloth:
   ```bash
   !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
   ```

2. **Install Additional Dependencies**:
   These dependencies are also required:
   ```bash
   !pip install --no-deps xformers==0.0.27 trl peft accelerate bitsandbytes triton
   ```

3. **Hugging Face Token Setup**:
   - Add your Hugging Face token as an environment variable in Google Colab or in your local environment.
   - Use the Hugging Face token to download models and upload checkpoints:
     ```python
     from google.colab import userdata
     from huggingface_hub import login

     login(userdata.get('TOKEN'))
     ```
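Outside Colab, `google.colab.userdata` is not available; a common alternative, assuming you export the token yourself (here under a hypothetical `HF_TOKEN` variable name), is:

```python
# Local-environment alternative: read the token from an environment variable.
import os
from huggingface_hub import login

login(os.environ["HF_TOKEN"])  # assumes HF_TOKEN was exported beforehand
```
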

### Training a Model

1. Open `TRAINER.ipynb`.
2. Customize the model, template, and LoRA settings in the notebook.
3. Set training configurations (e.g., epochs, learning rate).
4. Run the notebook to start the training process.

The model will automatically be saved at checkpoints and uploaded to Hugging Face.

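For reference, and continuing the training sketch under `TRAINER.ipynb` (so `trainer`, `model`, and `tokenizer` are assumed to already exist), resuming and uploading typically reduce to calls like these; the repository id is a placeholder:

```python
# Continuation of the earlier training sketch: `trainer`, `model`, `tokenizer` already defined.
trainer.train(resume_from_checkpoint=True)   # pick up from the latest checkpoint in output_dir

# Hypothetical repo id; replace with your own Hugging Face namespace.
model.push_to_hub("your-username/secureai-lora-adapter")
tokenizer.push_to_hub("your-username/secureai-lora-adapter")
```
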
### Testing the Model

1. Load `TESTER.ipynb` in your environment.
2. Load the fine-tuned model with LoRA adapters.
3. Specify a test dataset in the appropriate format.
4. Run the evaluation function to get predictions, accuracy, and other metrics (see the evaluation sketch under `TESTER.ipynb` above).

### Formatting Data

1. Use `dataFormat.ipynb` to format raw data into a training-friendly structure.
2. Map the conversation fields using the `formatting_prompts_func` (see the chat-template sketch under `dataFormat.ipynb` above).
3. Output the formatted data and use it in the training or testing notebooks.

---

## Additional Resources

- Unsloth Documentation: [Unsloth.ai](https://unsloth.ai/)
- Hugging Face Security Tokens: [Hugging Face Tokens](https://huggingface.co/docs/hub/en/security-tokens)
- For issues, please refer to each library's official documentation or GitHub pages.

---
TESTER.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
TRAINER.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
dataFormat.ipynb ADDED
@@ -0,0 +1,127 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fe805f91",
   "metadata": {},
   "source": [
    "This script converts the TON_IoT datasets into JSONL that can be consumed by the tokenizer in the training and testing scripts, so the data is in the correct format for training the models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "4dc6bc25",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import json\n",
    "\n",
    "# Path to the CSV file to be converted\n",
    "csv_file = 'train_test_network.csv'\n",
    "\n",
    "# Name of the JSONL file to be written\n",
    "jsonl_file = 'train_test_network.jsonl'\n",
    "\n",
    "with open(csv_file, 'r') as f, open(jsonl_file, 'w') as jsonl_f:\n",
    "    reader = csv.DictReader(f)\n",
    "\n",
    "    for row in reader:\n",
    "        # Normalize the label so the comparison is case-insensitive\n",
    "        label = row['type'].strip().lower()\n",
    "\n",
    "        # Convert the labels to binary; comment out this line for multi-class\n",
    "        label = 'normal' if label == 'normal' else 'attack'\n",
    "\n",
    "        # Construct the user message by concatenating all relevant fields\n",
    "        user_message = \", \".join([f\"{key}: {value}\" for key, value in row.items() if key != 'label' and key != 'type'])\n",
    "\n",
    "        # Create the conversation in the desired format\n",
    "        conversation = {\n",
    "            \"conversations\": [\n",
    "                {\"from\": \"human\", \"value\": user_message},\n",
    "                {\"from\": \"gpt\", \"value\": label}\n",
    "            ]\n",
    "        }\n",
    "\n",
    "        # Write each conversation as a JSON object on its own line\n",
    "        jsonl_f.write(json.dumps(conversation) + \"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4444baf3",
   "metadata": {},
   "source": [
    "Output will look like:\n",
    "\n",
    "```\n",
    "{\"conversations\": [{\"from\": \"human\", \"value\": \"\\ufeffsrc_ip: 192.168.1.192, src_port: 5353, dst_ip: 224.0.0.251, dst_port: 5353, proto: udp, service: dns, duration: 0, src_bytes: 0, dst_bytes: 0, conn_state: S0, missed_bytes: 0, src_pkts: 1, src_ip_bytes: 73, dst_pkts: 0, dst_ip_bytes: 0, dns_query: _ipps._tcp.local, dns_qclass: 1, dns_qtype: 12, dns_rcode: 0, dns_AA: F, dns_RD: F, dns_RA: F, dns_rejected: F, ssl_version: -, ssl_cipher: -, ssl_resumed: -, ssl_established: -, ssl_subject: -, ssl_issuer: -, http_trans_depth: -, http_method: -, http_uri: -, http_version: -, http_request_body_len: 0, http_response_body_len: 0, http_status_code: 0, http_user_agent: -, http_orig_mime_types: -, http_resp_mime_types: -, weird_name: -, weird_addl: -, weird_notice: -\"}, {\"from\": \"gpt\", \"value\": \"normal\"}]}\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0a36087b",
   "metadata": {},
   "source": [
    "If you want to split the dataset and save the splits separately, for example to upload them to Hugging Face, use the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19944860",
   "metadata": {},
   "outputs": [],
   "source": [
    "from datasets import load_dataset\n",
    "\n",
    "# Load dataset from the JSONL file\n",
    "dataset = load_dataset('json', data_files=jsonl_file, split='train')\n",
    "\n",
    "# Split the dataset\n",
    "split_dataset = dataset.train_test_split(test_size=0.2)  # 20% test data\n",
    "\n",
    "# Access the training and testing splits\n",
    "train_dataset = split_dataset['train']\n",
    "test_dataset = split_dataset['test']\n",
    "\n",
    "# Define paths for saving the datasets\n",
    "jsonl_train = 'train_network.jsonl'\n",
    "jsonl_test = 'test_network.jsonl'\n",
    "\n",
    "# Save train_dataset to a JSONL file\n",
    "with open(jsonl_train, 'w') as train_f:\n",
    "    for example in train_dataset:\n",
    "        train_f.write(json.dumps(example) + \"\\n\")\n",
    "\n",
    "# Save test_dataset to a JSONL file\n",
    "with open(jsonl_test, 'w') as test_f:\n",
    "    for example in test_dataset:\n",
    "        test_f.write(json.dumps(example) + \"\\n\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}