---
license: mit
language:
- ko
- en
base_model:
- openchat/openchat_3.5
pipeline_tag: text-generation
metrics:
- accuracy
library_name: adapter-transformers
tags:
- finance
- biology
- legal
- art
- text-generation-inference
---

### β›± ktdsbaseLM v0.11 was developed on openchat3.5 as its Foundation model so that it can be
### applied to the Korean language and to Korea's diverse culture. It was built with self-produced
### Korean data covering 53 domains, and it is a model that understands Korean social values and culture. ✌

                

# ❢ Model Description
- Model name and key features:
  KTDSbaseLM v0.11 is based on the OpenChat 3.5 model (itself built on Mistral 7B) and was fine-tuned with supervised fine-tuning (SFT).
  It is designed to understand Korean and the many cultural contexts of Korea, and it reflects the values and culture of Korean society
  by drawing on self-produced Korean data covering 135 domains.
  Its key capabilities include text generation, conversational inference, document summarization, question answering, sentiment analysis,
  and a range of other NLP tasks, with applications in law, finance, science, education, business, cultural research, and many other fields.
- Model architecture: KTDSBaseLM v0.11 is a high-performance language model with 7 billion parameters (7B), built on the Mistral 7B architecture.
  It uses OpenChat 3.5 as its foundation model and was trained with SFT (supervised fine-tuning) to specialize in the Korean language and Korean culture.
  Mistral 7B's lightweight design provides fast inference and memory efficiency, and the model is optimized for a wide range of natural language processing tasks.
  This architecture performs strongly on tasks such as text generation, question answering, document summarization, and sentiment analysis.

# ❷ Training Data
- ktdsbaseLM v0.11 was trained on 3.6GB of self-produced data comprising 2.33 million examples in total, covering QnA, summarization, and classification tasks.
  Of these, 1.33 million are multiple-choice questions spanning 53 domains, including Korean history, social studies, finance, law, tax, mathematics, biology, physics, and chemistry,
  and were trained with a Chain of Thought approach. A further 1.3 million short-answer questions cover 38 domains such as Korean history, finance, law, tax, and mathematics.
  The training data also includes examples that teach the model Korean social values, human emotions, and how to follow the instructions it is given.
- Training Instruction Datasets Format (a loading sketch follows below):
  <pre><code>{"prompt": "prompt text", "completion": "ideal generated text"}</code></pre>
    
# ❸ Use Cases
  ktdsbaseLM v0.11 can be applied in a wide range of fields. For example:
- Education: question answering and explanation generation for study material in history, mathematics, science, and other subjects.
- Business: answers to legal, financial, and tax-related questions, along with document summarization.
- Research and culture: NLP tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.
- Customer service: conversation generation and personalized responses for users.

This model offers high utility across a wide variety of natural language processing tasks.

# ❹ Limitations β›ˆβ›ˆ
- ktdsBaseLM v0.11 is specialized for the Korean language and Korean culture.
  However, because of limited data in certain areas (for example, the latest international material or highly specialized fields),
  the accuracy of its responses about other languages or cultures may drop.
  It may also show limited reasoning ability on problems that require complex logical thinking,
  and if biased data is included in training, there is a possibility that biased responses will be generated.

# ❺ How to Use
  <pre><code>
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

  device = 'auto' #@param {type: "string"}
  model_name = '' #@param {type: "string"}  # local path or Hub ID of the ktdsbaseLM v0.11 checkpoint

  # Optional 4-bit quantization; drop quantization_config below to load in full precision
  bnb_config = BitsAndBytesConfig(
      load_in_4bit=True,
      bnb_4bit_compute_dtype=torch.float16,
  )

  # Load the fine-tuned model and its tokenizer
  model = AutoModelForCausalLM.from_pretrained(
      model_name,
      quantization_config=bnb_config,
      device_map=device)

  tokenizer = AutoTokenizer.from_pretrained(model_name)

  # Run a simple generation request
  input_text = "μ•ˆλ…•ν•˜μ„Έμš”."  # "Hello."
  inputs = tokenizer(input_text, return_tensors="pt")
  inputs = inputs.to("cuda:0")

  with torch.no_grad():
      outputs = model.generate(**inputs, max_length=1024)

  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(result)
</code></pre>
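OpenChat 3.5 checkpoints normally ship a chat template; if the fine-tuned tokenizer inherits it, a conversational prompt can be built with `apply_chat_template` as in the sketch below. This is a hedged example, not an official recipe: it reuses the `model` and `tokenizer` loaded above, and the question is arbitrary.

<pre><code>
  # Optional: conversational prompting via the tokenizer's chat template,
  # assuming the fine-tuned tokenizer keeps OpenChat 3.5's template
  messages = [{"role": "user", "content": "ν•œκ΅­μ˜ μ „ν†΅ λͺ…μ ˆμ— λŒ€ν•΄ μ„€λͺ…ν•΄ μ£Όμ„Έμš”."}]  # "Please explain Korea's traditional holidays."
  prompt_ids = tokenizer.apply_chat_template(
      messages, add_generation_prompt=True, return_tensors="pt"
  ).to(model.device)

  with torch.no_grad():
      outputs = model.generate(prompt_ids, max_new_tokens=512)

  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
</code></pre>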

## βœ… In addition to openchat-based models, ktds plans to release LLMs fine-tuned on Korean culture and knowledge across many domains, built on representative LLMs such as LLaMA, Polyglot, and EEVE. These models will be tailored to better understand and generate content specific to Korean contexts.