# EmoLLM fine-tuning data generation tutorial
## **I. Objectives and Background**
To improve the performance of our mental-health large language model, we need high-quality datasets. To that end, we use four powerful large language models, **Wenxin Yiyan**, **Tongyi Qianwen**, **iFlytek Spark**, and **Zhipu GLM**, to generate conversation data. In addition, we add a small self-cognition dataset to deepen the dataset's cognitive coverage and improve the model's generalization ability.
## **II. Dataset generation method**
1. **Model selection and data preparation**
Select four large language models, namely Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu GLM, obtain API access for each, and prepare to call the corresponding interfaces to generate dialogue data.
2. **Single-turn and multi-turn dialogue data generation**
Using these four models, we generated 10,000 single-turn and multi-turn conversations, aiming for diversity, complexity, and validity in the data.

Because mental activity is often complex, we selected a total of 16 × 28 = 448 scenarios for dataset generation to keep the data diverse. For the specific scenario names, refer to the two parameters `emotions_list` and `areas_of_life` in `config.yml`.
3. **Inclusion of self-cognition datasets**
To enhance the model's cognitive ability, we also added a self-cognition dataset. This data helps the model better understand context and improves the naturalness and coherence of its conversations.
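
The scenario grid described in this section can be sketched as a Cartesian product of the two configuration lists. This is only an illustration: the placeholder values and the `build_scenarios` helper are assumptions, and the real lists are the `emotions_list` and `areas_of_life` parameters in `config.yml`.

```python
from itertools import product


def build_scenarios(emotions_list, areas_of_life):
    """Pair every emotion with every area of life (hypothetical helper)."""
    return [
        {"emotion": emotion, "area": area}
        for emotion, area in product(emotions_list, areas_of_life)
    ]


# Placeholder values; the actual names are defined in config.yml.
emotions_list = [f"emotion_{i}" for i in range(16)]
areas_of_life = [f"area_{j}" for j in range(28)]

scenarios = build_scenarios(emotions_list, areas_of_life)
print(len(scenarios))  # 448
```

Each scenario can then be used to seed one or more generation prompts, which is one way to spread the generated conversations across all 448 combinations.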
## **III. Practical steps**
1. **Initialize**
* Install the required software and libraries
```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
* Prepare input data and configuration parameters
See the annotations in `config.yml` for what each parameter does.
2. **Model selection and configuration**
* Select the right model for your needs
To make large models accessible to everyone, we chose InternLM2-7B as our baseline model; it can be fine-tuned and deployed even on consumer graphics cards.
* Make necessary configurations and adjustments to the model
Use XTuner for fine-tuning based on our dataset and configuration strategy.
3. **Data generation**
* Data generation using Tongyi Qianwen
```bash
# Terminal operation
bash run_qwen.bash
# Or just use python without bash
python qwen_gen_data_NoBash.py
```
* Data generation using Wenxin Yiyan
```bash
# Terminal operation
python ernie_gen_data.py
```
* Data generation using Zhipu GLM
```bash
# Terminal operation
python zhipuai_gen_data.py
```
* Data generation using iFlytek Spark
```bash
# Terminal operation
python ./xinghuo/gen_data.py
```
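
The scripts above are the project's entry points. As a rough picture of what one generation step involves, the following sketch builds a prompt for a scenario, asks a model for a dialogue, and wraps the reply in the `conversation` format used by this tutorial. It is not the code of those scripts: `build_prompt`, `generate_record`, the prompt wording, and the offline `fake_model` stand-in are all hypothetical, and a real run would replace `fake_model` with a call to one of the four vendor APIs.

```python
import json


def build_prompt(emotion, area, turns):
    """Compose a generation prompt for one scenario (wording is hypothetical)."""
    return (
        f"Write a {turns}-turn counseling dialogue in which the user feels "
        f"{emotion} about {area}. Reply with a JSON list of objects, each "
        'with an "input" and an "output" string.'
    )


def generate_record(call_model, emotion, area, turns=1):
    """Ask a model for one dialogue and wrap it in the conversation format.

    `call_model` is any function that takes a prompt string and returns the
    reply as a string; it stands in for a real API client.
    Returns None when the reply cannot be parsed into input/output pairs.
    """
    reply = call_model(build_prompt(emotion, area, turns))
    try:
        pairs = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(pairs, list) or not pairs:
        return None
    conversation = []
    for pair in pairs:
        if not isinstance(pair, dict):
            return None
        user, bot = pair.get("input"), pair.get("output")
        if not isinstance(user, str) or not isinstance(bot, str):
            return None
        conversation.append({"input": user, "output": bot})
    return {"conversation": conversation}


def fake_model(prompt):
    """Offline stand-in for an API call, so the sketch runs without keys."""
    return json.dumps([{"input": "I feel anxious.", "output": "Tell me more."}])


record = generate_record(fake_model, "anxiety", "work", turns=1)
print(json.dumps(record, ensure_ascii=False))
```

Returning `None` for unparseable replies keeps malformed model output out of the dataset at generation time, before the later checking step.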
4. **Integration of self-cognition datasets**
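* The self-cognition dataset needs to be written manually in the required format. The following format can be used; the sample `input` values are Chinese for "Please introduce yourself", and the `output` introduces the model as an emo assistant that can help with psychological problems.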
```json
[
{
"conversation": [
{
"input": "请介绍一下你自己",
"output": "我是大佬的emo小助手可以帮助你解决心理上的问题哦"
}
]
},
{
"conversation": [
{
"input": "请做一下自我介绍",
"output": "我是大佬的emo小助手可以帮助你解决心理上的问题哦"
}
]
}
]
```
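
Because every entry above shares the same answer and differs only in how the question is phrased, the file can also be produced with a few lines of code instead of being typed by hand. The sketch below assumes a hypothetical helper, `build_self_cognition`, and reuses the two questions and the answer from the sample.

```python
import json


def build_self_cognition(questions, answer):
    """Wrap each question variant in a one-turn conversation (hypothetical helper)."""
    return [
        {"conversation": [{"input": question, "output": answer}]}
        for question in questions
    ]


questions = ["请介绍一下你自己", "请做一下自我介绍"]
answer = "我是大佬的emo小助手可以帮助你解决心理上的问题哦"

dataset = build_self_cognition(questions, answer)
# ensure_ascii=False keeps the Chinese text readable in the output.
print(json.dumps(dataset, ensure_ascii=False, indent=4))
```

Adding more phrasings to `questions` extends the dataset without changing its structure.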
5. **Dataset integration**
Before integrating the datasets, use `check.py` to check the generated data for formatting errors, type mismatches, and similar problems. Then use `merge_json.py` to combine all the JSON files into one overall JSON file.
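
The check-then-merge idea can be illustrated with the sketch below. It is not the content of `check.py` or `merge_json.py`; the function names and the exact validation rules are assumptions based on the `conversation` format shown earlier.

```python
import json


def is_valid_record(record):
    """Return True if a record matches the conversation format (assumed rules)."""
    if not isinstance(record, dict):
        return False
    conversation = record.get("conversation")
    if not isinstance(conversation, list) or not conversation:
        return False
    return all(
        isinstance(turn, dict)
        and isinstance(turn.get("input"), str)
        and isinstance(turn.get("output"), str)
        for turn in conversation
    )


def merge_json_files(paths, out_path):
    """Merge the valid records from several JSON files into one file.

    Each input file is expected to hold a JSON list of records.
    Returns the number of records written.
    """
    merged = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            records = json.load(f)
        merged.extend(r for r in records if is_valid_record(r))
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=4)
    return len(merged)
```

A call such as `merge_json_files(["qwen.json", "ernie.json"], "merged.json")` (file names are placeholders) would write only the well-formed conversations and report how many were kept.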
6. **Evaluation and optimization**
* Evaluate the generated dataset using appropriate evaluation metrics
* Make necessary optimizations and adjustments based on the evaluation results
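
This tutorial does not name specific evaluation metrics. As one possible starting point, the sketch below computes a few simple statistics over a merged dataset: the number of conversations, how many are multi-turn, the average number of turns, and how many opening inputs are repeated. The `summarize_dataset` helper and the choice of statistics are assumptions, not the project's evaluation method.

```python
def summarize_dataset(records):
    """Compute basic statistics for records in the conversation format."""
    total = len(records)
    turns = [len(r["conversation"]) for r in records]
    first_inputs = [
        r["conversation"][0]["input"] for r in records if r["conversation"]
    ]
    return {
        "conversations": total,
        "multi_turn": sum(1 for t in turns if t > 1),
        "avg_turns": sum(turns) / total if total else 0.0,
        # Repeated opening inputs hint at low diversity in the generated data.
        "duplicate_first_inputs": len(first_inputs) - len(set(first_inputs)),
    }


sample = [
    {"conversation": [{"input": "a", "output": "b"}]},
    {"conversation": [{"input": "a", "output": "c"},
                      {"input": "d", "output": "e"}]},
]
stats = summarize_dataset(sample)
print(stats)
```

A high duplicate count or a very low share of multi-turn conversations would be a signal to adjust the prompts or scenario coverage before fine-tuning.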
7. **Testing and deployment**
* Evaluate the trained model using an independent test set
* Make necessary adjustments and optimizations based on test results
* Deploy the final model into a real application