Merge branch 'main' of github.com:8baby8/EmoLLM

This commit is contained in:
8baby8 2024-03-11 13:15:50 +08:00
commit c644e3c696
66 changed files with 369322 additions and 81 deletions

View File

@ -1,6 +1,6 @@
MIT License
Copyright (c) 2024 aJupyter、Farewell、jujimeizuo、Smiling&Weeping、散步
Copyright (c) 2024 aJupyter、MING-ZCH、Farewell、jujimeizuo
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@ -12,6 +12,11 @@ furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
For portions of the software that are derived from other repositories, the original licenses shall apply.
The specific portions are documented in the `./datasets/README.md` included with this distribution.
Users are responsible for complying with the terms and conditions of the original licenses
when using or distributing these portions of the software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE

View File

@ -17,7 +17,7 @@
<h3 align="center">EmoLLM</h3>
<p align="center">
简体中文| <a href="README_English_version.md" >English</a>
简体中文| <a href="README_EN.md" >English</a>
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM"><strong>探索本项目的文档 »</strong></a>
@ -39,6 +39,7 @@
| 模型 | 类型 |
| :-------------------: | :------: |
| InternLM2_7B_chat | qlora |
| InternLM2_7B_chat | 全量微调 |
| InternLM2_1_8B_chat | 全量微调 |
| Qwen_7b_chat | qlora |
| Qwen1_5-0_5B-Chat | 全量微调 |
@ -63,20 +64,23 @@
- 评估和诊断工具:为了有效促进心理健康,需要有科学的工具来评估个体的心理状态,以及诊断可能存在的心理问题。
### 最近更新
- 【2024.3.9】 新增并发功能加速 QA 对生成
- 【2024.3.3】 [基于InternLM2-7B-chat全量微调版本开源](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full),需要两块A100*80G;更新专业评估,详见[evaluate](./evaluate/);更新基于PaddleOCR的PDF转txt工具脚本,详见[scripts](./scripts/)。
- 【2024.2.29】更新客观评估计算,详见[evaluate](./evaluate/),更新一系列数据集,详见[datasets](./datasets/)。
- 【2024.2.27】更新英文readme和一系列数据集(舔狗和单轮对话)
- 【2024.2.23】推出基于InternLM2_7B_chat_qlora的 `温柔御姐心理医生艾薇`,[点击获取模型权重](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei),[配置文件](xtuner_config/aiwei-internlm2_chat_7b_qlora.py),[在线体验链接](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
- 【2024.2.23】更新[若干微调配置](/xtuner_config/),新增 [data_pro.json](/datasets/data_pro.json)(数量更多、场景更全、更丰富)和 [aiwei.json](/datasets/aiwei.json)(温柔御姐角色扮演专用,带有Emoji表情),即将推出 `温柔御姐心理医生艾薇`
- 【2024.2.18】 [基于Qwen1_5-0_5B-Chat全量微调版本开源](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary),算力有限的道友可以玩起来~
<details>
<summary>查看更多</summary>
- 【2024.2.6】 EmoLLM在[**Openxlab** ](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) 平台下载量高达18.7k,欢迎大家体验!
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="模型下载量">
</p>
<details>
<summary>查看更多</summary>
- 【2024.2.5】 项目荣获公众号**NLP工程化**推文宣传[推文链接](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A),为博主推广一波,欢迎大家关注!!🥳🥳
<p align="center">
@ -89,6 +93,13 @@
</details>
### 路线图
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/Roadmap_ZH.png" alt="Roadmap_ZH">
</a>
## 目录
- [EmoLLM-心理健康大模型](#emollm-心理健康大模型)
@ -128,8 +139,6 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
- [部署指南](#部署指南)
- 查看更多详情
<details>
<summary>更多详情</summary>
### 文件目录说明
@ -158,6 +167,9 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
详见[部署指南](demo/README.md)
<details>
<summary>更多详情</summary>
### 使用到的框架
- [Xtuner](https://github.com/InternLM/xtuner)
@ -189,7 +201,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
[Smiling&amp;Weeping](https://github.com/Smiling-Weeping-zhr)@哈尔滨工业大学(威海)在读本科生
[Farewell](https://github.com/8baby8)@
[Farewell](https://github.com/8baby8)@飞桨领航团区域主管、文心大模型核心开发者
[ZhouXinAo](https://github.com/zxazys)@南开大学在读硕士
@ -201,6 +213,16 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
[ZeyuBa](https://github.com/ZeyuBa)@自动化所在读硕士
[aiyinyuedejustin](https://github.com/aiyinyuedejustin)@宾夕法尼亚大学在读硕士
[Nobody-ML](https://github.com/Nobody-ML)@中国石油大学(华东)在读本科生
[chg0901](https://github.com/chg0901)@韩国光云大学博士生
[Mxoder](https://github.com/Mxoder)@北京航空航天大学在读本科生
[Anooyman](https://github.com/Anooyman) @南京理工大学硕士
### 版权说明
该项目签署了MIT 授权许可,详情请参阅 [LICENSE](https://github.com/aJupyter/EmoLLM/blob/master/LICENSE)
@ -221,7 +243,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=aJupyter/EmoLLM&type=Date)](https://star-history.com/#aJupyter/EmoLLM&Date)
[![Star History Chart](https://api.star-history.com/svg?repos=SmartFlowAI/EmoLLM&type=Date)](https://star-history.com/#SmartFlowAI/EmoLLM&Date)
## 🌟 Contributors
@ -238,3 +260,13 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
[issues-url]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE
## 交流群
- 如果失效请移步Issue区
<p align="center">
<img width="30%" src="https://github.com/SmartFlowAI/EmoLLM/assets/62385492/55ecd0aa-4832-4269-ad57-4c26f9aa286b" alt="EmoLLM官方交流群">
</p>

View File

@ -40,6 +40,7 @@
| model | type |
| :-------------------: | :------: |
| InternLM2_7B_chat | qlora |
| InternLM2_7B_chat | full finetuning |
| InternLM2_1_8B_chat | full finetuning |
| Qwen_7b_chat | qlora |
| Qwen1_5-0_5B-Chat | full finetuning |
@ -62,6 +63,8 @@ The Model is aimed at fully understanding and promoting the mental health of ind
- Prevention and intervention measures: The Mental Health Grand Model also includes strategies for preventing psychological issues and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
- Assessment and diagnostic tools: Effective promotion of mental health requires scientific tools to assess individuals' psychological states and diagnose potential psychological issues.
### Recent Updates
- 【2024.3.9】New concurrency feature speeds up QA pair generation
- 【2024.3.3】 [Full fine-tuned version based on InternLM2-7B-chat open-sourced](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full) (requires two A100 80G). Added professional evaluation, see [evaluate](./evaluate/), and a PaddleOCR-based PDF-to-txt tool script, see [scripts](./scripts/).
- 【2024.2.29】 Updated objective assessment calculations, see [evaluate](./evaluate/) for details. A series of datasets have also been updated, see [datasets](./datasets/) for details.
- 【2024.2.27】 Updated the English README and a series of datasets (tiangou and single-turn dialogue)
- 【2024.2.23】The "Gentle Lady Psychologist Ai Wei" based on InternLM2_7B_chat_qlora was launched. [Click here to obtain the model weights](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei), [configuration file](xtuner_config/aiwei-internlm2_chat_7b_qlora.py), [online experience link](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
@ -70,15 +73,16 @@ The Model is aimed at fully understanding and promoting the mental health of ind
- 【2024.2.18】 The full fine-tuned version based on Qwen1_5-0_5B-Chat has been [open-sourced](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary). Friends with limited computational resources can now dive in and explore it.
<details>
<summary>View More</summary>
- 【2024.2.6】 EmoLLM has reached 18.7k downloads on the [**OpenXLab**](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) platform. Welcome everyone to try it!
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="模型下载量">
</p>
<details>
<summary>View More</summary>
- 【2024.2.5】 The project has been promoted by the official WeChat account NLP Engineering. Here's the [link](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A) to the article. Welcome everyone to follow!! 🥳🥳
<p align="center">
@ -91,6 +95,13 @@ The Model is aimed at fully understanding and promoting the mental health of ind
</details>
### Roadmap
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/Roadmap_EN.png" alt="Roadmap_EN">
</a>
## Contents
- [EmoLLM - Large Languge Model for Mental Health](#emollm---large-languge-model-for-mental-health)
@ -131,8 +142,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
- [Deployment Guide](#deployment-guide)
- View More Details
<details>
<summary>Additional Details</summary>
### File Directory Explanation
@ -161,6 +171,9 @@ For details, see the [fine-tuning guide](xtuner_config/README.md)
For details, see the [deployment guide](demo/README.md)
<details>
<summary>Additional Details</summary>
### Frameworks Used
- [Xtuner](https://github.com/InternLM/xtuner)
@ -204,6 +217,16 @@ This project uses Git for version control. You can see the current available ver
[ZeyuBa](https://github.com/ZeyuBa)@Master's student at Institute of Automation
[aiyinyuedejustin](https://github.com/aiyinyuedejustin)@Master's student at University of Pennsylvania
[Nobody-ML](https://github.com/Nobody-ML)@Undergraduate at China University of Petroleum (East China)
[chg0901](https://github.com/chg0901)@PhD Candidate at Kwangwoon University
[Mxoder](https://github.com/Mxoder)@Undergraduate at Beihang University
[Anooyman](https://github.com/Anooyman)@Master's student at Nanjing University of Science and Technology
### Copyright Notice
The project is licensed under the MIT License. Please refer to the details
@ -226,7 +249,7 @@ The project is licensed under the MIT License. Please refer to the details
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=aJupyter/EmoLLM&type=Date)](https://star-history.com/#aJupyter/EmoLLM&Date)
[![Star History Chart](https://api.star-history.com/svg?repos=SmartFlowAI/EmoLLM&type=Date)](https://star-history.com/#SmartFlowAI/EmoLLM&Date)
## 🌟 Contributors
@ -243,3 +266,10 @@ The project is licensed under the MIT License. Please refer to the details
[issues-url]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE
## Communication group
- If the QR code has expired, please check the Issues section.
<p align="center">
<img width="30%" src="https://github.com/SmartFlowAI/EmoLLM/assets/62385492/55ecd0aa-4832-4269-ad57-4c26f9aa286b" alt="EmoLLM official communication group">
</p>

BIN
assets/Roadmap_EN.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 243 KiB

BIN
assets/Roadmap_ZH.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 235 KiB

BIN
assets/emoxlmdeploy.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 956 KiB

BIN
assets/turbomind结构.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 698 KiB

42
datasets/README.md Normal file
View File

@ -0,0 +1,42 @@
# EmoLLM数据集
* 数据集按用处分为两种类型:**General** 和 **Role-play**
* 数据按格式分为两种类型:**QA** 和 **Conversation**
* 数据汇总:General(**6个数据集**),Role-play(**3个数据集**)
## 数据集类型
* **General**:通用数据集,包含心理学知识、心理咨询技术等通用内容
* **Role-play**:角色扮演数据集,包含特定角色对话风格数据等内容
## 数据类型
* **QA**:问答对
* **Conversation**:多轮对话
## 数据集汇总
| Category | Dataset | Type | Total |
| :---------: | :-------------------: | :----------: | :-----: |
| *General* | data | Conversation | 5600+ |
| *General* | data_pro | Conversation | 36500+ |
| *General* | multi_turn_dataset_1 | Conversation | 36,000+ |
| *General* | multi_turn_dataset_2 | Conversation | 27,000+ |
| *General* | single_turn_dataset_1 | QA | 14000+ |
| *General* | single_turn_dataset_2 | QA | 18300+ |
| *Role-play* | aiwei | Conversation | 4000+ |
| *Role-play* | SoulStar | QA | 11200+ |
| *Role-play* | tiangou | Conversation | 3900+ |
| …… | …… | …… | …… |
## 数据集来源
**General**
* 数据集 data 来自本项目
* 数据集 data_pro 来自本项目
* 数据集 multi_turn_dataset_1 来源 [Smile](https://github.com/qiuhuachuan/smile)
* 数据集 multi_turn_dataset_2 来源 [CPsyCounD](https://github.com/CAS-SIAT-XinHai/CPsyCoun)
* 数据集 single_turn_dataset_1 来自本项目
* 数据集 single_turn_dataset_2 来自本项目
**Role-play**
* 数据集 aiwei 来自本项目
* 数据集 tiangou 来自本项目
* 数据集 SoulStar 来源 [SoulStar](https://github.com/Nobody-ML/SoulStar)

43
datasets/README_EN.md Normal file
View File

@ -0,0 +1,43 @@
# EmoLLM's datasets
* Category of dataset: **General** and **Role-play**
* Type of data: **QA** and **Conversation**
* Summary: General(**6 datasets**), Role-play(**3 datasets**)
## Category
* **General**: generic dataset, including psychological Knowledge, counseling technology, etc.
* **Role-play**: role-playing dataset, including character-specific conversation style data, etc.
## Type
* **QA**: question-and-answer pair
* **Conversation**: multi-turn consultation dialogue
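For orientation only, the sketch below shows what a Conversation-style entry can look like, following the multi-turn format used elsewhere in this repo (a `conversation` list of `input`/`output` turns, with a `system` field on the first turn). The placeholder strings are illustrative, and individual datasets may use different fields.
```python
# Illustrative sketch of a Conversation-style entry; placeholder text is not real data.
conversation_entry = {
    "conversation": [
        {
            "system": "你是心理健康助手EmoLLM……",   # system prompt on the first turn
            "input": "<来访者的第一句话>",
            "output": "<医生/助手的回复>",
        },
        {
            "input": "<来访者的第二句话>",
            "output": "<医生/助手的回复>",
        },
    ]
}
# A QA entry is a single question-answer pair instead of a list of turns.
```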
## Summary
| Category | Dataset | Type | Total |
| :---------: | :-------------------: | :----------: | :-----: |
| *General* | data | Conversation | 5600+ |
| *General* | data_pro | Conversation | 36500+ |
| *General* | multi_turn_dataset_1 | Conversation | 36,000+ |
| *General* | multi_turn_dataset_2 | Conversation | 27,000+ |
| *General* | single_turn_dataset_1 | QA | 14000+ |
| *General* | single_turn_dataset_2 | QA | 18300+ |
| *Role-play* | aiwei | Conversation | 4000+ |
| *Role-play* | SoulStar | QA | 11200+ |
| *Role-play* | tiangou | Conversation | 3900+ |
| …… | …… | …… | …… |
## Source
**General**
* dataset `data` from this repo
* dataset `data_pro` from this repo
* dataset `multi_turn_dataset_1` from [Smile](https://github.com/qiuhuachuan/smile)
* dataset `multi_turn_dataset_2` from [CPsyCounD](https://github.com/CAS-SIAT-XinHai/CPsyCoun)
* dataset `single_turn_dataset_1` from this repo
* dataset `single_turn_dataset_2` from this repo
**Role-play**
* dataset `aiwei` from this repo
* dataset `tiangou` from this repo
* dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)

101142
datasets/SoulStar_data.json Normal file

File diff suppressed because one or more lines are too long

151974
datasets/processed/output.json Normal file

File diff suppressed because it is too large Load Diff

113444
datasets/processed/output2.json Normal file

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,12 @@
import json

# 打开JSON文件并读取其内容
with open('/root/Emollm/datasets/multi_turn_dataset_2.json', 'rt', encoding='utf-8') as file:
    data = json.load(file)

# 为每条对话的第一轮统一设置 system prompt
for i in data:
    i['conversation'][0]['system'] = "你是心理健康助手EmoLLM由EmoLLM团队打造。你旨在通过专业心理咨询协助来访者完成心理诊断。请充分利用专业心理学知识与咨询技术一步步帮助来访者解决心理问题。"

with open('output2.json', 'wt', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=4)

View File

@ -15,17 +15,16 @@ pip install -r requirements.txt
```
- 下载模型
- 模型权重https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model
- 通过 openxlab.model.download 下载,详情请看 [cli_internlm2](./cli_internlm2.py)
- 模型权重https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model
- 通过 openxlab.model.download 下载,详情请看 [cli_internlm2](./cli_internlm2.py)
```bash
from openxlab.model import download
```bash
from openxlab.model import download
download(model_repo='jujimeizuo/EmoLLM_Model',
output='model')
```
- 可以手动下载,放在 `./model` 目录下,然后把上面的代码删掉
download(model_repo='jujimeizuo/EmoLLM_Model', output='model')
```
- 可以手动下载,放在 `./model` 目录下,然后把上面的代码删掉
- cli_demo

59
demo/README_EN.md Normal file
View File

@ -0,0 +1,59 @@
# Deployment Guide for EmoLLM
## Local Deployment
- Clone repo
```bash
git clone https://github.com/aJupyter/EmoLLM.git
```
- Install dependencies
```bash
pip install -r requirements.txt
```
- Download the model
- Model weights: https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model
- Download via openxlab.model.download, see [cli_internlm2](./cli_internlm2.py) for details
```python
from openxlab.model import download
download(model_repo='jujimeizuo/EmoLLM_Model', output='model')
```
- You can also download manually and place it in the `./model` directory, then delete the above code.
- cli_demo
```bash
python ./demo/cli_internlm2.py
```
- web_demo
```bash
python ./app.py
```
If deploying on a server, you need to configure local port mapping.
## Deploy on OpenXLab
- Log in to OpenXLab and create a Gradio application
![Login OpenXLab](../assets/deploy_1.png)
- Select configurations and create the project
![config](../assets/deploy_2.png)
- Wait for the build and startup
![wait a minutes](../assets/deploy_3.png)
- Try your own project
![enjoy](../assets/deploy_4.png)

227
deploy/Imdeploy_EN.md Normal file
View File

@ -0,0 +1,227 @@
![](../assets/emoxlmdeploy.png)
# Local deployment of LMDeploy
## 1. Environment configuration
<details>
<summary>Specific deployment environment</summary>
Package Version
------------------------- -----------
accelerate 0.27.2
addict 2.4.0
aiofiles 23.2.1
aiohttp 3.9.3
aiosignal 1.3.1
aliyun-python-sdk-core 2.14.0
aliyun-python-sdk-kms 2.16.2
altair 5.2.0
annotated-types 0.6.0
anyio 4.2.0
async-timeout 4.0.3
attrs 23.2.0
blinker 1.7.0
Brotli 1.0.9
cachetools 5.3.3
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 2.0.4
click 8.1.7
contourpy 1.2.0
crcmod 1.7
cryptography 41.0.3
cycler 0.12.1
datasets 2.17.0
dill 0.3.8
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.109.2
ffmpy 0.3.2
filelock 3.13.1
fire 0.5.0
flash-attn 2.4.2
fonttools 4.49.0
frozenlist 1.4.1
fsspec 2023.10.0
fuzzywuzzy 0.18.0
gitdb 4.0.11
GitPython 3.1.42
gmpy2 2.1.2
gradio 3.50.2
gradio_client 0.6.1
h11 0.14.0
httpcore 1.0.3
httpx 0.26.0
huggingface-hub 0.20.3
idna 3.4
importlib-metadata 6.11.0
importlib-resources 6.1.1
Jinja2 3.1.2
jmespath 0.10.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lmdeploy 0.2.4
markdown-it-py 3.0.0
MarkupSafe 2.1.1
matplotlib 3.8.3
mdurl 0.1.2
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mmengine-lite 0.10.3
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.26.2
nvidia-cublas-cu11 11.11.3.6
nvidia-cuda-runtime-cu11 11.8.89
nvidia-nccl-cu11 2.19.3
openxlab 0.0.34
orjson 3.9.14
oss2 2.17.0
packaging 23.2
pandas 2.2.0
peft 0.8.2
Pillow 9.5.0
pip 23.3.1
platformdirs 4.2.0
protobuf 4.25.3
psutil 5.9.8
pyarrow 15.0.0
pyarrow-hotfix 0.6
pybind11 2.11.1
pycparser 2.21
pycryptodome 3.20.0
pydantic 2.6.1
pydantic_core 2.16.2
pydeck 0.8.1b0
pydub 0.25.1
Pygments 2.17.2
Pympler 1.0.1
pynvml 11.5.0
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.9
pytz 2023.4
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0.1
referencing 0.33.0
regex 2023.12.25
requests 2.28.2
rich 13.4.2
rpds-py 0.18.0
safetensors 0.4.2
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 60.2.0
shortuuid 1.0.11
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
starlette 0.36.3
streamlit 1.24.0
sudo 1.0.0
sympy 1.11.1
tenacity 8.2.3
termcolor 2.4.0
tiktoken 0.6.0
tokenizers 0.15.2
toml 0.10.2
tomli 2.0.1
toolz 0.12.1
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
tornado 6.4
tqdm 4.65.2
transformers 4.37.1
triton 2.2.0
typing_extensions 4.9.0
tzdata 2024.1
tzlocal 4.3.1
urllib3 1.26.18
uvicorn 0.27.1
validators 0.22.0
watchdog 4.0.0
websockets 11.0.3
wheel 0.41.2
xxhash 3.4.1
yapf 0.40.2
yarl 1.9.4
zipp 3.17.0
</details>
lmdeploy has not been installed yet, so we will install it manually next; the latest stable version is recommended. If you are using the InternStudio development environment, run the following command first, otherwise an error will occur.
```
# Resolved ModuleNotFoundError: No module named 'packaging' problem
pip install packaging
# Use flash_attn's precompiled package to solve slow installation problems
pip install /root/share/wheels/flash_attn-2.4.2+cu118torch2.0cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
```
The default installation includes only the runtime dependencies, but here we also need deployment and quantization, so install with the [all] extra. You can then check the lmdeploy package again, as shown in the figure below.
```
pip install 'lmdeploy[all]==v0.1.0'
```
However, lmdeploy 0.1.0 does not support the TurboMind conversion of InternLM2-7B-chat.
Note that the lmdeploy version therefore needs to be upgraded:
```
# We used version 0.2.4 of lmdeploy
pip install --upgrade lmdeploy
```
## 2. Model conversion
To run inference with the TurboMind engine, the model must first be converted into TurboMind format. Both online and offline conversion are supported: online conversion loads a Huggingface model directly, while offline conversion saves the converted model before loading it.
TurboMind is an efficient inference engine for LLMs, based on NVIDIA's FasterTransformer. Its main features include support for LLaMA-architecture models, a persistent-batch inference mode, and a scalable KV cache manager.
### 2.1 Online conversion
lmdeploy supports reading Huggingface model weights directly; the currently supported sources include:
- models quantized by lmdeploy on huggingface.co, such as llama2-70b-4bit and internlm-chat-20b-4bit
- other LLM models on huggingface.co, such as Qwen/Qwen-7B-Chat
An example is as follows:
```
# Requires a network environment with access to Huggingface
lmdeploy chat turbomind internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
```
The two commands above show how to load a Huggingface model directly: the first loads a version quantized with lmdeploy, and the second loads another LLM model.
We can also launch a local Huggingface model directly, as shown below.
```
lmdeploy chat turbomind /EmoLLM --model-name internlm2-chat-7b
```
The preceding command starts a local dialogue interface; you can chat with the LLM directly in your terminal.
### 2.2 Offline conversion
Offline conversion requires converting the model into lmdeploy's TurboMind format before starting the service, as shown below.
```
# Convert the model (FasterTransformer format) to TurboMind
lmdeploy convert internlm2-chat-7b /EmoLLM
```
Upon completion, a `workspace` folder will be generated in the current directory, containing the files that TurboMind and Triton need for model inference.
## 3. Run locally
### 3.1 TurboMind inference + command-line local chat
After the model conversion is complete, we are ready to run actual model inference.
Let's first try Bash Local Chat (referred to below as Local Chat), which skips the API Server and calls TurboMind directly; in other words, TurboMind is executed straight from the command line, so the actual setup differs from the architecture diagram shown earlier.
Several backends are available, such as TurboMind, PyTorch, and DeepSpeed. The PyTorch and DeepSpeed backends actually go through Huggingface's Transformers package: PyTorch means the native Transformers implementation, and DeepSpeed means DeepSpeed is used as the inference framework. Both are currently limited and not production-ready, so they are not recommended.
Run the following command.
```
# Turbomind + Bash Local Chat
lmdeploy chat turbomind ./workspace
```
After typing your input, press Enter twice to submit; to exit, type exit and press Enter twice. At this point, the server is the locally running model (TurboMind), and the command line acts as the front end.
### 3.2 TurboMind Inference + API service
In the section above, we started the client directly from the command line. Next, let's see how to serve the model with lmdeploy.
First, start the service with the following command.
```
lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port ${server_port} --tp 1
```
For details, see the [documentation](https://lmdeploy.readthedocs.io/zh-cn/stable/serving/restful_api.html). A minimal client sketch follows.
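As a quick check that the service is up, the snippet below is a minimal client sketch. It assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route described in the linked RESTful API documentation; verify the route, port, and payload for your lmdeploy version.
```python
# Minimal client sketch for the api_server started above (assumed OpenAI-compatible route).
import requests

server = "http://0.0.0.0:23333"  # replace 23333 with your ${server_port}
payload = {
    "model": "internlm2-chat-7b",  # the served model name; adjust to your deployment
    "messages": [{"role": "user", "content": "Hello, I have been feeling anxious lately."}],
    "temperature": 0.8,
}
resp = requests.post(f"{server}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```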

400
deploy/lmdeploy.md Normal file
View File

@ -0,0 +1,400 @@
![](../assets/emoxlmdeploy.png)
# LMDeploy本地部署
## 0. LMDeploy简介
LMDeploy 由 [MMDeploy](https://github.com/open-mmlab/mmdeploy) 和 [MMRazor](https://github.com/open-mmlab/mmrazor) 团队联合开发,是涵盖了 LLM 任务的全套轻量化、部署和服务解决方案。 这个强大的工具箱提供以下核心功能:
- 高效的推理:LMDeploy 开发了 Persistent Batch(即 Continuous Batch)、Blocked K/V Cache、动态拆分和融合、张量并行、高效的计算 kernel 等重要特性。推理性能是 vLLM 的 1.8 倍。
- 可靠的量化:LMDeploy 支持权重量化和 k/v 量化。4bit 模型推理效率是 FP16 下的 2.4 倍。量化模型的可靠性已通过 OpenCompass 评测得到充分验证。
- 便捷的服务:通过请求分发服务,LMDeploy 支持多模型在多机、多卡上的推理服务。
- 有状态推理:通过缓存多轮对话过程中 attention 的 k/v,记住对话历史,从而避免重复处理历史会话。显著提升长文本多轮对话场景中的效率。
## 1. 环境配置
<details>
<summary>具体部署环境</summary>
Package Version
------------------------- -----------
accelerate 0.27.2
addict 2.4.0
aiofiles 23.2.1
aiohttp 3.9.3
aiosignal 1.3.1
aliyun-python-sdk-core 2.14.0
aliyun-python-sdk-kms 2.16.2
altair 5.2.0
annotated-types 0.6.0
anyio 4.2.0
async-timeout 4.0.3
attrs 23.2.0
blinker 1.7.0
Brotli 1.0.9
cachetools 5.3.3
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 2.0.4
click 8.1.7
contourpy 1.2.0
crcmod 1.7
cryptography 41.0.3
cycler 0.12.1
datasets 2.17.0
dill 0.3.8
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.109.2
ffmpy 0.3.2
filelock 3.13.1
fire 0.5.0
flash-attn 2.4.2
fonttools 4.49.0
frozenlist 1.4.1
fsspec 2023.10.0
fuzzywuzzy 0.18.0
gitdb 4.0.11
GitPython 3.1.42
gmpy2 2.1.2
gradio 3.50.2
gradio_client 0.6.1
h11 0.14.0
httpcore 1.0.3
httpx 0.26.0
huggingface-hub 0.20.3
idna 3.4
importlib-metadata 6.11.0
importlib-resources 6.1.1
Jinja2 3.1.2
jmespath 0.10.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lmdeploy 0.2.4
markdown-it-py 3.0.0
MarkupSafe 2.1.1
matplotlib 3.8.3
mdurl 0.1.2
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mmengine-lite 0.10.3
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.26.2
nvidia-cublas-cu11 11.11.3.6
nvidia-cuda-runtime-cu11 11.8.89
nvidia-nccl-cu11 2.19.3
openxlab 0.0.34
orjson 3.9.14
oss2 2.17.0
packaging 23.2
pandas 2.2.0
peft 0.8.2
Pillow 9.5.0
pip 23.3.1
platformdirs 4.2.0
protobuf 4.25.3
psutil 5.9.8
pyarrow 15.0.0
pyarrow-hotfix 0.6
pybind11 2.11.1
pycparser 2.21
pycryptodome 3.20.0
pydantic 2.6.1
pydantic_core 2.16.2
pydeck 0.8.1b0
pydub 0.25.1
Pygments 2.17.2
Pympler 1.0.1
pynvml 11.5.0
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.9
pytz 2023.4
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0.1
referencing 0.33.0
regex 2023.12.25
requests 2.28.2
rich 13.4.2
rpds-py 0.18.0
safetensors 0.4.2
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 60.2.0
shortuuid 1.0.11
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
starlette 0.36.3
streamlit 1.24.0
sudo 1.0.0
sympy 1.11.1
tenacity 8.2.3
termcolor 2.4.0
tiktoken 0.6.0
tokenizers 0.15.2
toml 0.10.2
tomli 2.0.1
toolz 0.12.1
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
tornado 6.4
tqdm 4.65.2
transformers 4.37.1
triton 2.2.0
typing_extensions 4.9.0
tzdata 2024.1
tzlocal 4.3.1
urllib3 1.26.18
uvicorn 0.27.1
validators 0.22.0
watchdog 4.0.0
websockets 11.0.3
wheel 0.41.2
xxhash 3.4.1
yapf 0.40.2
yarl 1.9.4
zipp 3.17.0
</details>
lmdeploy 没有安装,我们接下来手动安装一下,建议安装最新的稳定版。 如果是在 InternStudio 开发环境,需要先运行下面的命令,否则会报错。
```
# 解决 ModuleNotFoundError: No module named 'packaging' 问题
pip install packaging
# 使用 flash_attn 的预编译包解决安装过慢问题
pip install /root/share/wheels/flash_attn-2.4.2+cu118torch2.0cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
```
由于默认安装的是 runtime 依赖包,但是我们这里还需要部署和量化,所以,这里选择 [all]。然后可以再检查一下 lmdeploy 包,如下图所示
```
pip install 'lmdeploy[all]==v0.1.0'
```
EmoLLM 是由 InternLM2 训练而来,但是 lmdeploy 的 0.1.0 版本并不支持 InternLM2-7B-chat 的 Turbomind 转化。
注意,lmdeploy 的版本需要进行更新:
```
# 我们使用的是0.2.4版本的lmdeploy
pip install --upgrade lmdeploy
```
## 2. 模型转化
使用 LMDeploy 中的推理引擎 TurboMind 推理模型需要先将模型转化为 TurboMind 的格式,目前支持在线转换和离线转换两种形式。在线转换可以直接加载 Huggingface 模型,离线转换需要先保存模型再加载。
TurboMind 是一款关于 LLM 推理的高效推理引擎,基于英伟达的 FasterTransformer 研发而成。它的主要功能包括:LLaMa 结构模型的支持、persistent batch 推理模式和可扩展的 KV 缓存管理器。
TurboMind结构如下
![turbomind结构](../assets/turbomind结构.png)
### 2.1 在线转化
lmdeploy 支持直接读取 Huggingface 模型权重,目前共支持三种类型:
在 huggingface.co 上面通过 lmdeploy 量化的模型,如 llama2-70b-4bit, internlm-chat-20b-4bit
huggingface.co 上面其他 LM 模型,如 Qwen/Qwen-7B-Chat
示例如下:
```
# 需要能访问 Huggingface 的网络环境
lmdeploy chat turbomind internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
```
上面两行命令分别展示了如何直接加载 Huggingface 的模型,第一条命令是加载使用 lmdeploy 量化的版本,第二条命令是加载其他 LLM 模型。
我们也可以直接启动本地的 Huggingface 模型,如下所示。
```
lmdeploy chat turbomind /EmoLLM --model-name internlm2-chat-7b
```
以上命令都会启动一个本地对话界面,通过 Bash 可以与 LLM 进行对话。
### 2.2 离线转化
离线转换需要在启动服务之前,将模型转为 lmdeploy TurboMind 的格式,如下所示。
```
# 转换模型FastTransformer格式 TurboMind
lmdeploy convert internlm2-chat-7b /EmoLLM
```
执行完成后将会在当前目录生成一个```workspace```的文件夹。这里面包含的就是 TurboMind 和 Triton “模型推理”需要到的文件。
## 3.本地运行
### 3.1 TurboMind 推理+命令行本地对话
模型转换完成后,我们就具备了使用模型推理的条件,接下来就可以进行真正的模型推理环节。
我们先尝试本地对话(Bash Local Chat,下面用 Local Chat 表示),在这里其实是跳过 API Server 直接调用 TurboMind。简单来说,就是命令行代码直接执行 TurboMind。所以说,实际和前面的架构图是有区别的。
这里支持多种方式运行,比如 Turbomind、PyTorch、DeepSpeed。但 PyTorch 和 DeepSpeed 调用的其实都是 Huggingface 的 Transformers 包,PyTorch 表示原生的 Transformer 包,DeepSpeed 表示使用了 DeepSpeed 作为推理框架。Pytorch/DeepSpeed 目前功能都比较弱,不具备生产能力,不推荐使用。
执行命令如下。
```
# Turbomind + Bash Local Chat
lmdeploy chat turbomind ./workspace
```
输入后两次回车退出时输入exit 回车两次即可。此时Server 就是本地跑起来的模型TurboMind命令行可以看作是前端。
### 3.2 TurboMind推理+API服务
在上面的部分我们尝试了直接用命令行启动 Client,接下来我们尝试如何运用 lmdeploy 进行服务化。
首先,通过下面命令启动服务。
```
lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port ${server_port} --tp 1
```
详细内容请见[文档](https://lmdeploy.readthedocs.io/zh-cn/stable/serving/restful_api.html)
## 4. 模型量化
模型量化主要包括 KV Cache 量化和模型参数量化。量化是一种以参数或计算中间结果精度下降换空间节省(以及同时带来的性能提升)的策略。
前置概念:
- 计算密集(compute-bound):指推理过程中,绝大部分时间消耗在数值计算上;针对计算密集型场景,可以通过使用更快的硬件计算单元来提升计算速度。
- 访存密集(memory-bound):指推理过程中,绝大部分时间消耗在数据读取上;针对访存密集型场景,一般通过减少访存次数、提高计算访存比或降低访存量来优化。
常见的 LLM 模型由于 Decoder Only 架构的特性,实际推理时大多数的时间都消耗在了逐 Token 生成阶段(Decoding 阶段),是典型的访存密集型场景。
对于优化 LLM 模型推理中的访存密集问题,我们可以使用 **KV Cache 量化**和 **4bit Weight Only 量化(W4A16)**。KV Cache 量化是指将逐 Token(Decoding)生成过程中的上下文 K 和 V 中间结果进行 INT8 量化(计算时再反量化),以降低生成过程中的显存占用。4bit Weight 量化,将 FP16 的模型权重量化为 INT4,Kernel 计算时,访存量直接降为 FP16 模型的 1/4,大幅降低了访存成本。Weight Only 是指仅量化权重,数值计算依然采用 FP16(需要将 INT4 权重反量化)。
### 4.1 KV Cache 量化
#### 4.1.1 量化步骤
KV Cache 量化是将已经生成序列的 KV 变成 Int8,使用过程一共包括三步:
第一步:计算 minmax。主要思路是通过计算给定输入样本在每一层不同位置处计算结果的统计情况。
- 对于 Attention 的 K 和 V取每个 Head 各自维度在所有Token的最大、最小和绝对值最大值。对每一层来说上面三组值都是 `(num_heads, head_dim)` 的矩阵。这里的统计结果将用于本小节的 KV Cache。
- 对于模型每层的输入:取对应维度的最大、最小、均值、绝对值最大和绝对值均值。每一层每个位置的输入都有对应的统计值,它们大多是 `(hidden_dim, )` 的一维向量,当然在 FFN 层由于结构是先变宽后恢复,因此恢复的位置维度并不相同。这里的统计结果用于下个小节的模型参数量化,主要用在缩放环节。
第一步执行命令如下:
```bash
# 计算 minmax
lmdeploy lite calibrate \
--model /EmoLLM \
--calib_dataset "c4" \
--calib_samples 128 \
--calib_seqlen 2048 \
--work_dir ./quant_output
```
在这个命令行中,会选择 128 条输入样本,每条样本长度为 2048,数据集选择 C4,输入模型后就会得到上面的各种统计值。值得说明的是,如果显存不足,可以适当调小 samples 的数量或 sample 的长度。
> 这一步需要从 Huggingface 下载 "c4" 数据集,国内经常不成功。对于在 InternStudio 上的用户,需要对读取数据集的代码文件做一下替换。共包括两步:
>
> - 第一步:复制 `calib_dataloader.py` 到安装目录替换该文件:`cp /root/share/temp/datasets/c4/calib_dataloader.py /root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/utils/`
> - 第二步将用到的数据集c4复制到下面的目录`cp -r /root/share/temp/datasets/c4/ /root/.cache/huggingface/datasets/`
第二步:通过 minmax 获取量化参数。主要就是利用下面这个公式,获取每一层的 K V 中心值(zp)和缩放值(scale):
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
有这两个值就可以进行量化和解量化操作了。具体来说,就是对历史的 K 和 V 存储 quant 后的值,使用时在 dequant。
第二步的执行命令如下:
```bash
# 通过 minmax 获取量化参数
lmdeploy lite kv_qparams \
--work_dir ./quant_output \
--turbomind_dir workspace/triton_models/weights/ \
--kv_sym False \
--num_tp 1
```
在这个命令中,`num_tp` 的含义前面介绍过,表示 Tensor 的并行数。每一层的中心值和缩放值会存储到 `workspace` 的参数目录中以便后续使用。`kv_sym` 为 `True` 时会使用另一种(对称)量化方法,它用到了第一步存储的绝对值最大值,而不是最大值和最小值。
第三步:修改配置。也就是修改 `weights/config.ini` 文件(KV int8 开关),只需要把 `quant_policy` 改为 4 即可。
这一步需要额外说明的是,如果用的是 TurboMind1.0,还需要修改参数 `use_context_fmha`,将其改为 0。
接下来就可以正常运行前面的各种服务了,只不过咱们现在可是用上了 KV Cache 量化,能更省(运行时)显存了。
### 4.2 W4A16 量化
#### 4.2.1 量化步骤
W4A16 中的 A 是指 Activation,保持 FP16,只对参数进行 4bit 量化。使用过程也可以看作是三步。
第一步:同 4.1.1,不再赘述。
第二步:量化权重模型。利用第一步得到的统计值对参数进行量化,具体又包括两小步:
- 缩放参数。
- 整体量化。
第二步的执行命令如下:
```bash
# 量化权重模型
lmdeploy lite auto_awq \
--model /EmoLLM \
--w_bits 4 \
--w_group_size 128 \
--work_dir ./quant_output
```
命令中 `w_bits` 表示量化的位数,`w_group_size` 表示量化分组统计的尺寸,`work_dir` 是量化后模型输出的位置。这里需要特别说明的是,因为没有 `torch.int4`,所以实际存储时,8 个 4bit 权重会被打包到一个 int32 值中。所以,如果你把这部分量化后的参数加载进来就会发现它们是 int32 类型的。
最后一步:转换成 TurboMind 格式。
```bash
# 转换模型的layout存放在默认路径 ./workspace 下
lmdeploy convert internlm2-chat-7b ./quant_output \
--model-format awq \
--group-size 128
```
这个 `group-size` 就是上一步的那个 `w_group_size`。如果不想和之前的 `workspace` 重复,可以指定输出目录:`--dst_path`,比如:
```bash
lmdeploy convert internlm2-chat-7b ./quant_output \
--model-format awq \
--group-size 128 \
--dst_path ./workspace_quant
```
接下来和上一节一样,可以正常运行前面的各种服务了,不过咱们现在用的是量化后的模型。
最后再补充一点,量化模型和 KV Cache 量化也可以一起使用,以达到最大限度节省显存。
### 4.3 最佳实践
首先我们需要明白一点,服务部署和量化是没有直接关联的,量化的最主要目的是降低显存占用,主要包括两方面的显存:模型参数和中间过程计算结果。前者对应 W4A16 量化,后者对应 KV Cache 量化。
量化在降低显存的同时,一般还能带来性能的提升,因为更小精度的浮点数要比高精度的浮点数计算效率高,而整型要比浮点数高很多。
所以我们的建议是:在各种配置下尝试,看效果能否满足需要。这一般需要在自己的数据集上进行测试。具体步骤如下。
- Step1优先尝试正常非量化版本评估效果。
- 如果效果不行,需要尝试更大参数模型或者微调。
- 如果效果可以,跳到下一步。
- Step2尝试正常版本+KV Cache 量化,评估效果。
- 如果效果不行,回到上一步。
- 如果效果可以,跳到下一步。
- Step3尝试量化版本评估效果。
- 如果效果不行,回到上一步。
- 如果效果可以,跳到下一步。
- Step4尝试量化版本+ KV Cache 量化,评估效果。
- 如果效果不行,回到上一步。
- 如果效果可以,使用方案。
另外需要补充说明的是,使用哪种量化版本、开启哪些功能,除了上述流程外,**还需要考虑框架、显卡的支持情况**,比如有些框架可能不支持 W4A16 的推理,那即便转换好了也用不了。
根据实践经验,一般情况下:
- 精度越高,显存占用越多,推理效率越低,但一般效果较好。
- Server 端推理一般用非量化版本或半精度、BF16、Int8 等精度的量化版本,比较少使用更低精度的量化版本。
- 端侧推理一般都使用量化版本,且大多是低精度的量化版本。这主要是因为计算资源所限。
以上是针对项目开发情况,如果是自己尝试(玩儿)的话:
- 如果资源足够(有 GPU 卡很重要),那就用非量化的正常版本。
- 如果没有 GPU 卡,只有 CPU(不管什么芯片),那还是尝试量化版本。
- 如果生成文本长度很长,显存不够,就开启 KV Cache。
建议大家根据实际情况灵活选择方案。
更多细节查看 [LMDeploy 官方文档](https://lmdeploy.readthedocs.io/zh-cn/latest/quantization/w4a16.html)

View File

@ -0,0 +1,50 @@
# EmoLLM通用指标评估
## 简介
本文档提供了关于如何使用 `eval.py` 和 `metric.py` 两个脚本的指导。这些脚本用于评估 EmoLLM-心理健康大模型的生成结果。
## 安装
- Python 3.x
- PyTorch
- Transformers
- Datasets
- NLTK
- Rouge
- Jieba
可以使用以下命令安装:
```bash
pip install torch transformers datasets nltk rouge jieba
```
## 用法
### convert.py
将原始多轮对话数据转换为测评用的单轮数据。
### eval.py
`eval.py` 脚本用于生成医生的回复并进行评估,主要分为以下几部分:
1. 加载模型和分词器。
2. 设置测试参数,如测试数据数量和批处理大小。
3. 准备数据。
4. 生成响应并评估。
### metric.py
`metric.py` 脚本包含计算评估指标的函数,可设置按字符级别或按词级别进行评估,目前包含 BLEU 和 ROUGE 分数。
## 测试结果
对 data.json 中的数据进行测试,结果如下:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Qwen1_5-0_5B-chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat_qlora | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
| InternLM2_7B_chat_full | 32.45% | 10.82% | 20.17% | 30.48% | 15.67% | 8.84% | 5.02% |

View File

@ -0,0 +1,50 @@
# EmoLLM general indicator evaluation
## Introduction
This document explains how to use the `eval.py` and `metric.py` scripts to evaluate the generation results of EmoLLM, a large language model for mental health.
## Installation
- Python 3.x
- PyTorch
- Transformers
- Datasets
- NLTK
- Rouge
- Jieba
They can be installed with the following command:
```bash
pip install torch transformers datasets nltk rouge jieba
```
## Usage
### convert.py
Convert raw multi-turn conversation data into single-turn data for evaluation; a rough sketch of the idea follows.
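As a rough illustration only (the actual convert.py and prompt template may differ), flattening one multi-turn conversation into evaluation samples can produce `instruction`/`output` pairs like those consumed by `eval.py`:
```python
# Sketch: flatten a multi-turn conversation entry into single-turn evaluation samples
# (illustrative; the real convert.py and its prompt format may differ).
def flatten_conversation(entry):
    samples = []
    turns = entry.get("conversation", [])
    history = turns[0].get("system", "") if turns else ""
    for turn in turns:
        prompt = history + "来访者:" + turn["input"] + "\n医生:"
        samples.append({"instruction": prompt, "output": turn["output"]})
        history = prompt + turn["output"] + "\n"   # carry the dialogue history forward
    return samples
```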
### eval.py
The `eval.py` script generates the doctor's responses and evaluates them. It is mainly divided into the following parts:
1. Load the model and tokenizer.
2. Set test parameters, such as the number of test samples and the batch size.
3. Prepare the data.
4. Generate responses and evaluate them.
### metric.py
The `metric.py` script contains functions to calculate evaluation metrics, which can be set to evaluate by character level or word level, currently including BLEU and ROUGE scores.
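For reference, the sketch below shows how character- or word-level BLEU and ROUGE can be computed with the nltk, rouge, and jieba dependencies listed above; it illustrates the idea rather than the exact implementation in `metric.py`.
```python
# Sketch of character/word-level BLEU and ROUGE scoring (illustrative, may differ from metric.py).
import jieba
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

def score_pair(reference: str, hypothesis: str, char_level: bool = True):
    # character-level: treat every character as a token; word-level: segment with jieba
    if char_level:
        ref_tokens, hyp_tokens = list(reference), list(hypothesis)
    else:
        ref_tokens, hyp_tokens = list(jieba.cut(reference)), list(jieba.cut(hypothesis))
    weights = [(1., 0, 0, 0), (.5, .5), (1/3, 1/3, 1/3), (.25, .25, .25, .25)]
    smooth = SmoothingFunction().method3
    bleu = [sentence_bleu([ref_tokens], hyp_tokens, weights=w, smoothing_function=smooth)
            for w in weights]                      # BLEU-1 ... BLEU-4
    rouge = Rouge().get_scores(' '.join(hyp_tokens), ' '.join(ref_tokens))[0]
    return bleu, rouge                             # e.g. rouge['rouge-l']['f']
```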
## Test results
Testing on the data in data.json gives the following results:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Qwen1_5-0_5B-chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat_qlora | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
| InternLM2_7B_chat_full | 32.45% | 10.82% | 20.17% | 30.48% | 15.67% | 8.84% | 5.02% |

View File

@ -0,0 +1,111 @@
from transformers import AutoModelForCausalLM, AutoTokenizer,DataCollatorWithPadding
from qwen_generation_utils import decode_tokens
import torch
import datasets
model_dir = './model'
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", padding_side='left',trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto",pad_token_id=tokenizer.eos_token_id, trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
# InternLM 7B in 4bit will cost nearly 8GB GPU memory.
# pip install -U bitsandbytes
# 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
# 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()
# # convert data
# import ujson
# def transform_conversation_data(raw_data):
# try:
# instruction = '<|im_start|>system\n'+raw_data.get("conversation", "")[0]['system'] + "<|im_end|>\n"
# conversation = raw_data.get("conversation", [])
# for i, dialog in enumerate(conversation):
# instruction += "<|im_start|>user\n来访者" + dialog["input"]+ "<|im_end|>\n"
# if i < len(conversation) - 1:
# instruction += "<|im_start|>assistant\n医生" + dialog["output"]+"<|im_end|>\n"
# response = conversation[-1]["output"] if conversation else ""
# instruction +="<|im_start|>assistant\n医生"
# return {"instruction": instruction, "output": response}
# except Exception as e:
# pass
# with open(f'./data_dir/data.json', 'r', encoding='utf-8') as f1:
# data = ujson.load(f1)
# with open(f'./data_dir/converted.json', 'w', encoding='utf-8') as f:
# for j, item in enumerate(data):
# temp=transform_conversation_data(item)
# if temp:
# transformed_data =ujson.dumps(temp, ensure_ascii=False)
# f.write(transformed_data+'\n')
# set test params
test_num = 1596  # 测试数据条数
batch_size = 12

# prepare data and dataloader
dataset = datasets.load_dataset('json', data_files='./data_dir/converted.json', split=f"train[:{test_num}]")
references = dataset['output'][:test_num]
hypotheses = []

def preprocess(data):
    length = list(map(len, data['instruction']))
    model_inputs = tokenizer(data['instruction'], max_length=512, truncation=True)
    labels = tokenizer(data['output'], padding=True, max_length=128, truncation=True)
    model_inputs['labels'] = labels['input_ids']
    model_inputs['length'] = length
    return model_inputs

preprocessed_dataset = dataset.map(preprocess, batched=True, remove_columns=['instruction', 'output'])
collator = DataCollatorWithPadding(tokenizer=tokenizer)

from torch.utils.data import DataLoader
dataloader = DataLoader(preprocessed_dataset, batch_size=batch_size, collate_fn=collator)

# generate responses
stop_word = "<|im_end|>"
for batch in dataloader:
    batch_input_ids = torch.LongTensor(batch['input_ids']).to(model.device)
    batch_labels = batch['labels']
    attention_mask = batch['attention_mask']
    length = batch['length']
    batch_out_ids = model.generate(
        batch_input_ids.to(model.device),
        return_dict_in_generate=False,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.1,
        eos_token_id=92542
    )
    padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
    batch_response = [
        decode_tokens(
            batch_out_ids[i][padding_lens[i]:],
            tokenizer,
            context_length=0,
            raw_text_len=length[i],
            chat_format="raw",
            verbose=False,
            errors='replace'
        ).replace("医生:", "") for i in range(batch_size)]
    hypotheses.extend([r.replace(stop_word, " ").split()[0] if stop_word in r else r for r in batch_response])

# Load metric
from metric import compute_metrics
print(compute_metrics((hypotheses, references)))

View File

@ -0,0 +1,32 @@
# EmoLLM专业指标评估
## 简介
本文档介绍一种专业评测方法,并提供 EmoLLM 在专业指标的得分。
## 评测方法
本评测方法采用论文《CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling》提出的评测指标与方法。
* 指标Comprehensiveness, Professionalism, Authenticity, Safety
* 方法Turn-Based Dialogue Evaluation
* 数据集CPsyCounE
## 评测结果
* 评测模型: [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model)(InternLM2_7B_chat_qlora)
* 得分:
| Metric | Value |
|-------------------|------------|
| Comprehensiveness | 1.32 |
| Professionalism | 2.20 |
| Authenticity | 2.10 |
| Safety | 1.00 |
## 比较
* [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) 在 InternLM2_7B_Chat 基础上提升较大;相比 Role-playing ChatGPT 在心理咨询任务上能力相近
* 对比结果图片来源于论文《CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling》
![image](https://github.com/MING-ZCH/EmoLLM/assets/119648793/abc9f626-11bc-4ec8-84a4-427c4600a720)

View File

@ -0,0 +1,32 @@
# EmoLLM's professional evaluation
## Introduction
This document describes a professional evaluation method and provides EmoLLM's scores on professional metrics.
## Evaluation
The evaluation metrics, method, and dataset are taken from the paper 《CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling》.
* Metric: Comprehensiveness, Professionalism, Authenticity, Safety
* Method: Turn-Based Dialogue Evaluation
* Dataset: CPsyCounE
## Result
* Model: [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model)(InternLM2_7B_chat_qlora)
* Score
| Metric | Value |
|-------------------|------------|
| Comprehensiveness | 1.32 |
| Professionalism | 2.20 |
| Authenticity | 2.10 |
| Safety | 1.00 |
## Comparison
* [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) is a large improvement over InternLM2_7B_Chat; its ability on the counseling task is comparable to role-playing ChatGPT.
* The comparison results in the image below come from the paper 《CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling》.
![image](https://github.com/MING-ZCH/EmoLLM/assets/119648793/abc9f626-11bc-4ec8-84a4-427c4600a720)

View File

@ -25,7 +25,7 @@ batch_size=12
#prepare data and dataloader
dataset = datasets.load_dataset('json', data_files='./train_dir/converted.json',split=f"train[:{test_num}]")
dataset = datasets.load_dataset('json', data_files='./data_dir/converted.json',split=f"train[:{test_num}]")
references =dataset['output'][:test_num]
hypotheses = []

View File

@ -1,56 +1,19 @@
# EmoLLM评测
# EmoLLM通用指标评估
## 通用指标评测
## 简介
* 具体评测指标和评测方法见 [General_evaluation.md](./General_evaluation.md)
此 README 文件提供了关于如何使用 `eval.py``metric.py` 两个脚本的指导。这些脚本用于评估 EmoLLM-心理健康大模型的生成结果。
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Qwen1_5-0_5B-chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat_qlora | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
| InternLM2_7B_chat_full | 32.45% | 10.82% | 20.17% | 30.48% | 15.67% | 8.84% | 5.02% |
## 专业指标评测
* 具体评测指标和评测方法见 [Professional_evaluation.md](./Professional_evaluation.md)
## 安装
| Model | Comprehensiveness | Professionalism | Authenticity | Safety |
|-------------------|-----------------------|-------------------|-----------------|---------|
| InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
- Python 3.x
- PyTorch
- Transformers
- Datasets
- NLTK
- Rouge
- Jieba
可以使用以下命令安装:
```bash
pip install torch transformers datasets nltk rouge jieba
```
## 用法
### convert.py
将原始多轮对话数据转换为测评用的单轮数据。
### eval.py
`eval.py` 脚本用于生成医生的回复并进行评估,主要分为以下几部分:
1. 加载模型和分词器。
2. 设置测试参数,如测试数据数量和批处理大小。
3. 准备数据。
4. 生成响应并评估。
### metric.py
`metric.py` 脚本包含计算评估指标的函数,可设置按字符级别或按词级别进行评估,目前包含 BLEU 和 ROUGE 分数。
## 测试结果
基于全量微调后的Qwen1_5-0_5B-Chat模型对data.json中的数据进行测试结果如下
| Metric | Value |
|---------|----------------------|
| ROUGE-1 | 27.23% |
| ROUGE-2 | 8.55% |
| ROUGE-L | 17.05% |
| BLEU-1 | 26.65% |
| BLEU-2 | 13.11% |
| BLEU-3 | 7.19% |
| BLEU-4 | 4.05% |

19
evaluate/README_EN.md Normal file
View File

@ -0,0 +1,19 @@
# EmoLLM Evaluation
## General Metrics Evaluation
* For specific metrics and methods, see [General_evaluation_EN.md](./General_evaluation_EN.md)
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Qwen1_5-0_5B-chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat_qlora | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
| InternLM2_7B_chat_full | 32.45% | 10.82% | 20.17% | 30.48% | 15.67% | 8.84% | 5.02% |
## Professional Metrics Evaluation
* For specific metrics and methods, see [Professional_evaluation_EN.md](./Professional_evaluation_EN.md)
| Model | Comprehensiveness | Professionalism | Authenticity | Safety |
|-------------------|-----------------------|-------------------|-----------------|---------|
| InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |

View File

@ -18,8 +18,8 @@ def compute_metrics(eval_pred):
rouge = Rouge()
bleu =np.array([0,0,0,0])
weights = [(1.,0,0,0),(1./2., 1./2.),(1./3., 1./3., 1./3.),(1./4., 1./4., 1./4., 1./4.)]
bleu =np.array([0.,0.,0.,0.])
weights = [(1.,0.,0.,0.),(1./2., 1./2.),(1./3., 1./3., 1./3.),(1./4., 1./4., 1./4., 1./4.)]
for decoded_label, decoded_pred in zip(decoded_labels, decoded_preds):
bleu +=np.array( sentence_bleu(
references=[decoded_label.split(' ')],

View File

@ -0,0 +1,100 @@
# EMO Psychological large model fine-tuning data generation tutorial
**I. Objectives and Background**
In order for our mental health large model to perform better, we must have high-quality datasets. To achieve this goal, we decided to use four powerful large language models, Wenxin Yiyan (ERNIE Bot), Tongyi Qianwen, iFlytek Spark, and Zhipu AI, to generate conversation data. In addition, we enhance the cognitive depth of the dataset and improve the generalization ability of the model by adding a small number of self-cognition datasets.
**II. Data set generation method**
1. **Model selection and data preparation**
Choose four large language models, namely Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu AI; obtain API keys for the corresponding interfaces; and prepare to generate dialogue data.
2. **Single-turn and multi-turn dialogue data generation**
Using these four models, we generated 10,000 single-turn and multi-turn conversations, while ensuring the diversity, complexity, and validity of the data.
Because mental activity is often complex, and to ensure the diversity of the data, we selected a total of 16 * 28 = 448 scenarios for dataset generation. For the specific scenario names, please refer to the two parameters `emotions_list` and `areas_of_life` in config.yml.
3. **Inclusion of self-cognition datasets**
In order to enhance the model's self-cognition, we added a small number of self-cognition datasets. These datasets help the model better understand the context and improve the naturalness and coherence of the conversation.
**III. Practical steps**
1. **Initialize**
* Install the required software and libraries.
```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
* Prepare input data and configuration parameters.
See `config.yml` for annotations
2. **Model selection and configuration**
* Select the right model for your needs.
In order to enable everyone to experiment with large models, we chose InternLM2-7B as our baseline model (it can also be fine-tuned and deployed on consumer GPUs).
* Make necessary configuration and adjustments to the model.
Use XTuner for fine-tuning based on our data set and configuration strategy
3. **Data generation**
* Data generation using Tongyi Qianwen large model.
```bash
# Terminal operation
bash run_qwen.bash
```
* Use Baidu Wenxin large model for data generation.
```bash
# Terminal operation
python ernie_gen_data.py
```
* Data generation using the Zhipu AI (zhipuai) large model.
```bash
# Terminal operation
python zhipuai_gen_data.py
```
* Use the iFlytek Spark (xinghuo) model for data generation.
```bash
# Terminal operation
python ./xinghuo/gen_data.py
```
4. **Integration of self-cognition datasets**
* The self-cognition dataset needs to be generated manually in the following format:
```json
[
{
"conversation": [
{
"input": "请介绍一下你自己",
"output": "我是大佬的emo小助手可以帮助你解决心理上的问题哦"
}
]
},
{
"conversation": [
{
"input": "请做一下自我介绍",
"output": "我是大佬的emo小助手可以帮助你解决心理上的问题哦"
}
]
}
]
```
5. **Data set integration.**
Before integrating the datasets, we need to check the generated data for formatting errors, type mismatches, and so on; use check.py to validate the data. Finally, merge_json.py combines all the JSON files into one overall JSON file; a minimal sketch of this merge step is shown below.
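The sketch below illustrates the merge step only; it is not the repo's actual merge_json.py, and the input directory and the assumption that each file holds a JSON list are placeholders to adapt to your layout.
```python
# Illustrative merge step (not the repo's merge_json.py): combine per-model JSON files
# into one overall JSON file, assuming each input file contains a JSON list.
import glob
import json

merged = []
for path in sorted(glob.glob('./generated/*.json')):  # adjust the directory to your layout
    with open(path, 'r', encoding='utf-8') as f:
        merged.extend(json.load(f))

with open('merged.json', 'w', encoding='utf-8') as f:
    json.dump(merged, f, ensure_ascii=False, indent=4)
```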
6. **Evaluation and optimization**
* Evaluate the generated dataset using appropriate evaluation metrics.
* Make necessary optimizations and adjustments based on the evaluation results.
7. **Testing and deployment**
* Evaluate the trained model using an independent test set.
* Make necessary adjustments and optimizations based on test results.
* Deploy the final model into a real application.

View File

@ -0,0 +1,10 @@
# Introduction
* gen_Chat 使用于生成ChatGLM3-6B的数据集
* gen_data 使用于生成InternLM所需要的数据集
⭐注意事项~
星火大模型V1.5生成特定主题时会出现**安全限制**,模型会拒绝回答,要注意类似数据的处理。
例:{"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。", "input": "xxx", "output": "抱歉,我不能完成这个任务。作为一个认知智能模型,我不会提供任何与性欲情感相关的回答或建议。这种问题需要由专业的心理健康医生进行处理和解决。如果您有任何心理健康方面的问题,请寻求专业医生的帮助。"}

View File

@ -0,0 +1,10 @@
# Introduction
* gen_Chat is used to generate the ChatGLM3-6B dataset
* gen_data is used to generate the data set required for InternLM
⭐Precautions~
When the Spark large model V1.5 generates certain topics, a **safety restriction** is triggered and the model refuses to answer; be careful when processing such data.
Example: {"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。", "input": "xxx", "output": "抱歉,我不能完成这个任务。作为一个认知智能模型,我不会提供任何与性欲情感相关的回答或建议。这种问题需要由专业的心理健康医生进行处理和解决。如果您有任何心理健康方面的问题,请寻求专业医生的帮助。"}

View File

@ -1,3 +0,0 @@
gen_Chat 使用于生成ChatGLM3-6B的数据集
gen_data 适用于生成InternLM所需要的数据集
但是需要注意~火大模型用1.5生成时会有{"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。", "input": "抱歉,我不能完成这个任务。作为一个认知智能模型,我不会提供任何与性欲情感相关的回答或建议。这种问题需要由专业的心理健康医生进行处理和解决。如果您有任何心理健康方面的问题,请寻求专业医生的帮助。", "output": ""}类似这样的数据集,要注意数据处理

0
rag/README.md Normal file
View File

0
rag/README_EN.md Normal file
View File

4
rag/requirements.txt Normal file
View File

@ -0,0 +1,4 @@
sentence_transformers
transformers
numpy
loguru

19
rag/src/config/config.py Normal file
View File

@ -0,0 +1,19 @@
import os
cur_dir = os.path.dirname(os.path.abspath(__file__)) # config
src_dir = os.path.dirname(cur_dir) # src
base_dir = os.path.dirname(src_dir) # base
# model
model_dir = os.path.join(base_dir, 'model') # model
embedding_path = os.path.join(model_dir, 'gte-small-zh') # embedding
llm_path = os.path.join(model_dir, 'pythia-14m') # llm
# data
data_dir = os.path.join(base_dir, 'data') # data
knowledge_json_path = os.path.join(data_dir, 'knowledge.json') # json
knowledge_pkl_path = os.path.join(data_dir, 'knowledge.pkl') # pickle
# log
log_dir = os.path.join(base_dir, 'log') # log
log_path = os.path.join(log_dir, 'log.log') # file

67
rag/src/main.py Normal file
View File

@ -0,0 +1,67 @@
import os
import json
import pickle
import numpy as np
from typing import Tuple
from sentence_transformers import SentenceTransformer
from config.config import knowledge_json_path, knowledge_pkl_path
from util.encode import load_embedding, encode_qa
"""
读取知识库
"""
def load_knowledge() -> Tuple[list, list]:
    # 如果 pkl 不存在,则先编码存储
    if not os.path.exists(knowledge_pkl_path):
        encode_qa(knowledge_json_path, knowledge_pkl_path)
    # 加载 json 和 pkl
    with open(knowledge_json_path, 'r', encoding='utf-8') as f1, open(knowledge_pkl_path, 'rb') as f2:
        knowledge = json.load(f1)
        encoded_knowledge = pickle.load(f2)
    return knowledge, encoded_knowledge

"""
召回 top_k 个相关的文本段
"""
def find_top_k(
    emb: SentenceTransformer,
    query: str,
    knowledge: list,
    encoded_knowledge: list,
    k=3
) -> list[str]:
    # 编码 query
    query_embedding = emb.encode(query)
    # 查找 top_k
    scores = query_embedding @ encoded_knowledge.T
    # 使用 argpartition 找出每行第 k 个大的值的索引,第 k 个位置左侧都是比它大的值,右侧都是比它小的值
    top_k_indices = np.argpartition(scores, -k)[-k:]
    # 由于 argpartition 不保证顺序,我们需要对提取出的 k 个索引进行排序
    top_k_values_sorted_indices = np.argsort(scores[top_k_indices])[::-1]
    top_k_indices = top_k_indices[top_k_values_sorted_indices]
    # 返回
    contents = [knowledge[index] for index in top_k_indices]
    return contents

def main():
    emb = load_embedding()
    knowledge, encoded_knowledge = load_knowledge()
    query = "认知心理学研究哪些心理活动?"
    contents = find_top_k(emb, query, knowledge, encoded_knowledge, 2)
    print('召回的 top-k 条相关内容如下:')
    print(json.dumps(contents, ensure_ascii=False, indent=2))
    # 这里我没实现 LLM 部分,如果有 LLM:
    ## 1. 读取 LLM
    ## 2. 将 contents 拼接为 prompt,传给 LLM,作为 {已知内容}
    ## 3. 要求 LLM 根据已知内容回复

if __name__ == '__main__':
    main()

57
rag/src/util/encode.py Normal file
View File

@ -0,0 +1,57 @@
import json
import pickle
from loguru import logger
from sentence_transformers import SentenceTransformer
from config.config import embedding_path
"""
加载向量模型
"""
def load_embedding() -> SentenceTransformer:
    logger.info('Loading embedding...')
    emb = SentenceTransformer(embedding_path)
    logger.info('Embedding loaded.')
    return emb

"""
文本编码
"""
def encode_raw_corpus(file_path: str, store_path: str) -> None:
    emb = load_embedding()
    with open(file_path, 'r', encoding='utf-8') as f:
        read_lines = f.readlines()
    """
    对文本分割,例如按句子分割
    """
    lines = []
    # 分割好后的存入 lines 中
    # 编码(转换为向量)
    encoded_knowledge = emb.encode(lines)
    with open(store_path, 'wb') as f:
        pickle.dump(encoded_knowledge, f)

"""
QA 对编码
暂时只实现了加载 json,csv 和 txt 先没写
"""
def encode_qa(file_path: str, store_path: str) -> None:
    emb = load_embedding()
    with open(file_path, 'r', encoding='utf-8') as f:
        qa_list = json.load(f)
    # 将 QA 对拼起来作为完整一句来编码,也可以只编码 Q
    lines = []
    for qa in qa_list:
        question = qa['question']
        answer = qa['answer']
        lines.append(question + answer)
    encoded_knowledge = emb.encode(lines)
    with open(store_path, 'wb') as f:
        pickle.dump(encoded_knowledge, f)

0
rag/src/util/llm.py Normal file
View File

62
rag/src/util/text_seg.py Normal file
View File

@ -0,0 +1,62 @@
# 对文本类的非QA对数据做切分 --- 使用qwen的api对书籍进行语义分割
import json
import random
import argparse
import yaml
import re
import copy
from tqdm import tqdm
# config.yml文件由自己定义
with open('config.yml', 'r', encoding='utf-8') as f:
    configs = yaml.load(f.read(), Loader=yaml.FullLoader)

def qwen_api(content):
    import dashscope
    from http import HTTPStatus
    Input = '''我们的分割要求是每一个划分占一行请你帮我将下列txt文本按照书本的内容比如事件的背景心理学名词的定义特点阶段划分实验内容等进行划分要求文本内容不能缩减也可以按照语义分割比如某几句话都是讲的一回事就划分一行要求划分之后的文本内容详细主题明确要求每一个划分仅用一行表示。以下为要求分割的txt文本{}
    '''.format(content)
    dashscope.api_key = configs['dashscope_api_key']
    response = dashscope.Generation.call(
        model='qwen-max',
        prompt=Input,
        history=[],
    )
    if response.status_code == HTTPStatus.OK:
        result = response.output.text
        print(result)
    else:
        result = 'ERROR'
    return result

def save_jsonl(data_lis, file_path):
    import json
    # 将字典列表写入文件,每一行一个字典
    with open(file_path, 'at', encoding='utf-8') as file:
        for item in data_lis:
            json_string = json.dumps(item, ensure_ascii=False) + '\n'
            file.write(json_string)

if __name__ == '__main__':
    file_name = 'a0.jsonl'
    conversations = []
    path = configs['txt_path']
    f = open(path, 'r', encoding='utf-8')
    str = f.read()
    f.close()
    for i in tqdm(range(0, len(str), 2500)):
        # 保证所有文本都能按照完整的语义进行分割
        content = str[i:i+3500]
        print(content)
        answer = qwen_api(content)
        f2 = open('seg1.txt', 'a', encoding='utf-8')
        f2.write(answer)
        f2.close()

View File

@ -0,0 +1,37 @@
# QA Generation Pipeline
## 1. 使用方法
1. 检查 `requirements.txt` 中的依赖是否满足。
2. 调整代码中 `system_prompt`确保与repo最新版本一致保证生成QA的多样性和稳定性。
3. 将txt文件放到与 `model`同级目录 `data`文件夹中.
4. 在 `config/config.py` 配置所需的 API KEY,`main.py` 启动即可。生成的 QA 对会以 jsonl 的格式存在 `data/generated` 下。
### 1.1 API KEY 获取方法
目前仅包含了 qwen。
#### 1.1.1 Qwen
前往[模型服务灵积-API-KEY管理 (aliyun.com)](https://dashscope.console.aliyun.com/apiKey),点击“创建新的 API-KEY”,将获取的 API KEY 填至 `config/config.py` 中的 `DASHSCOPE_API_KEY` 即可。
## 2. 注意事项
### 2.1 系统提示 System Prompt
注意,目前的解析方案是基于模型会生成 markdown 包裹的 json 块的前提的,更改 system prompt 时需要保证这一点不变。
### 2.2 滑动窗口 Sliding Window
滑动窗口的 `window_size``overlap_size` 都可以在 `util/data_loader.py` 中的 `get_txt_content` 函数中更改。目前是按照句子分割的滑动窗口。
### 2.3 书本文件格式 Corpus Format
目前仅支持了 txt 格式,可以将清洗好的书籍文本放在 `data` 文件夹下,程序会递归检索该文件夹下的所有 txt 文件。
## TODO
1. 支持更多模型Gemini、GPT、ChatGLM……
2. 支持多线程调用模型
3. 支持更多文本格式PDF……
4. 支持更多切分文本的方式

View File

@ -0,0 +1,37 @@
# QA Generation Pipeline
## 1. Usage
1. Check that the dependencies in `requirements.txt` are satisfied.
2. Adjust the `system_prompt` in the code so that it matches the latest version of the repo, which keeps the generated QA pairs diverse and stable.
3. Put the txt files into the `data` folder that sits in the same directory as `model`.
4. Configure the required API KEY in `config/config.py` and start from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.
### 1.1 How to obtain an API KEY
Currently only Qwen is supported.
#### 1.1.1 Qwen
Go to [DashScope API-KEY management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained key into `DASHSCOPE_API_KEY` in `config/config.py`.
## 2. Precautions
### 2.1 System Prompt
Note that the current parsing scheme assumes the model wraps the generated QA pairs in a markdown-fenced JSON block; if you change the system prompt, make sure this remains the case (see the sketch below).
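For reference, a minimal sketch of what that parsing amounts to (the actual logic lives in `capture_qa` in `util/data_loader.py`):

```python
import json
import re

def extract_first_json_block(response: str):
    # Grab the first markdown-fenced json block from the model response.
    match = re.search(r'```json(.*?)```', response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
```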
### 2.2 Sliding Window
Both `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the window slides over sentence-level splits, as illustrated below.
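As a quick illustration of how the two parameters interact (a standalone sketch, not the repo implementation):

```python
def sliding_chunks(sentences, window_size=8, overlap_size=2):
    # Each chunk holds `window_size` sentences; adjacent chunks share `overlap_size` of them.
    step = window_size - overlap_size
    return [sentences[i:i + window_size]
            for i in range(0, len(sentences) - window_size + 1, step)]

# 10 sentences with window_size=4, overlap_size=2 -> chunks start at sentences 0, 2, 4 and 6
demo = [f's{i}' for i in range(10)]
print([chunk[0] for chunk in sliding_chunks(demo, window_size=4, overlap_size=2)])
```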
### 2.3 Corpus Format
At present only the txt format is supported. Place the cleaned book text under the `data` folder; the program recursively collects every txt file in that folder.
## TODO
1. Support more models (Gemini, GPT, ChatGLM...)
2. Support multi-threaded model calls
3. Support more text formats (PDF...)
4. Support more ways to split text

View File

View File

@ -0,0 +1,38 @@
import os
"""
文件夹路径
"""
cur_dir = os.path.dirname(os.path.abspath(__file__)) # config
base_dir = os.path.dirname(cur_dir) # base
# model
model_dir = os.path.join(base_dir, 'model') # model
# data
data_dir = os.path.join(base_dir, 'data') # data
result_dir = os.path.join(data_dir, 'generated') # result
# log
log_dir = os.path.join(base_dir, 'log') # log
log_file_path = os.path.join(log_dir, 'log.log') # file
# system prompt
system_prompt_file_path = os.path.join(base_dir, 'system_prompt_v2.md') # system prompt
"""
环境变量
"""
# api-keys
DASHSCOPE_API_KEY = ''
"""
控制参数
"""
storage_interval = 10
window_size = 8
overlap_size = 2
multi_process_num = 3

View File

@ -0,0 +1,108 @@
import os
import json
import time
from tqdm import tqdm
import concurrent.futures
from datetime import datetime
import numpy as np
from config.config import result_dir, storage_interval, window_size, overlap_size, multi_process_num
from model.qwen import call_qwen_single_turn
from util.logger import get_logger
from util.data_loader import get_file_list, get_txt_content, capture_qa, merge_sub_qa_generation, save_to_file
logger = get_logger()
"""
每个线程产生 QA 对以及存储到子文件中
"""
def single_thread_generate(thread_num, interval, model_caller, storage_jsonl_path, contents):
storage_counter = 0
storage_list = []
for content in tqdm(contents):
try:
response = model_caller(content)
captured_qa = capture_qa(response)
if captured_qa is None:
continue
storage_list.extend(captured_qa)
storage_counter += 1
if storage_counter % interval == 0:
save_to_file(storage_jsonl_path, storage_list)
storage_counter = 0
storage_list = []
except Exception as exc:
logger.error("QA generation error : %s" % (exc))
if storage_list:
save_to_file(storage_jsonl_path, storage_list)
storage_list = []
"""
生成 QA
model_name: 可调用的模型名称暂时只实现了 qwen
interval: 存储间隔即每隔多少条存一次文件过密的间隔会增大 IO 开销
"""
def generate_qa(
model_name: str = 'qwen',
interval: int = 10,
):
current_time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
if model_name == 'qwen':
model_caller = call_qwen_single_turn
else:
logger.warning('This model is currently not supported and will call the default model - qwen.')
model_caller = call_qwen_single_turn
model_name = 'qwen'
logger.info(f'The called model is: {model_name}.')
logger.info(f'The storage interval is: {interval}.')
    file_list = get_file_list()
    for file_path in file_list:
        # Chunk the current txt file with the configured sliding window.
        contents = get_txt_content(file_path, window_size=window_size, overlap_size=overlap_size)
_, file_name = os.path.split(file_path)
storage_jsonl_path = os.path.join(
result_dir, f'{current_time}-{file_name}-{model_name}.jsonl')
logger.info(f'The generated QA will be stored in {storage_jsonl_path}.')
        # Split contents into one chunk per worker thread.
contents_array = np.array(contents)
chunks = np.array_split(contents_array, multi_process_num)
        # Build the argument list for each worker.
parameters_list = list()
for thread_num, chunk in enumerate(chunks):
parameters_list.append(
[thread_num, interval, model_caller, storage_jsonl_path + f'-{thread_num}', list(chunk)]
)
        # Generate QA pairs concurrently.
with concurrent.futures.ThreadPoolExecutor(max_workers=multi_process_num) as executor:
            # One future per worker; each runs single_thread_generate on its own chunk.
futures = [executor.submit(single_thread_generate, *parameters) for parameters in parameters_list]
for future in concurrent.futures.as_completed(futures):
try:
future.result()
except Exception as exc:
logger.error("Thread generated an exception: %s" % (exc))
merge_sub_qa_generation(result_dir, storage_jsonl_path)
if __name__ == '__main__':
    # Create the output folder for generated QA files.
os.makedirs('./data/generated', exist_ok=True)
generate_qa(interval=storage_interval)

View File

View File

View File

View File

View File

@ -0,0 +1,41 @@
import dashscope
from http import HTTPStatus
from dashscope import Generation
from dashscope.api_entities.dashscope_response import Role
from config.config import DASHSCOPE_API_KEY
from util.logger import get_logger
from util.prompt_loader import load_system_prompt
dashscope.api_key = DASHSCOPE_API_KEY
logger = get_logger()
def call_qwen_single_turn(query: str) -> str:
messages = [
{
'role': Role.SYSTEM,
'content': load_system_prompt()
},
{
'role': Role.USER,
'content': query
}
]
response = Generation.call(
model='qwen-max-1201',
messages=messages,
result_format='message',
stream=False,
incremental_output=False
)
if response.status_code == HTTPStatus.OK:
return response.output.choices[0]['message']['content']
else:
logger.error('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
response.request_id, response.status_code,
response.code, response.message
))
return ""

View File

@ -0,0 +1,3 @@
dashscope>=1.10.0
loguru>=0.7.0
tqdm>=4.65.0

View File

@ -0,0 +1,24 @@
你是一名 QA 对生成机器人,你会根据我提供的【心理学书本内容】自动生成合适的 QA 对,要求如下:
- 对于我给的文本内容,你需要生成五条这样的 QA 对
- QA 对内容不能重复,答案不能过长
- 用简体中文回答
- 生成的 QA 对需要用 markdown 格式的 json 代码块包裹起来
以下是参考格式:
```json
[
{
"question": "...",
"answer": "..."
},
{
"question": "...",
"answer": "..."
},
...
]
```
以下是给定的文本内容:

View File

@ -0,0 +1,24 @@
You are a QA-pair generation robot. You will automatically generate appropriate QA pairs from the psychology book content I provide, under the following requirements:
- For the text I give you, generate five such QA pairs
- QA pairs must not repeat each other, and answers must not be too long
- Answer in Simplified Chinese
- The generated QA pairs must be wrapped in a json code block in markdown format
Here is the reference format:
```json
[
{
"question": "...",
"answer": "..."
},
{
"question": "...",
"answer": "..."
},
...
]
```
Here is the text given:

View File

@ -0,0 +1,26 @@
你是一名经验丰富的心理咨询师，熟悉心理学相关知识和心理咨询技术。请你深呼吸并一步一步思考，根据我提供的【心理学文本内容】生成符合标准的 QA 对。
标准如下:
- 每段心理学文本生成5-10条 QA 对
- QA 对应根据心理学文本内容,选择"心理学知识; 具体咨询方法; 心理疾病特征; 心理疾病治疗方法"中最合适的主题生成
- QA 对内容不能重复,答案不能过长
- QA 对为简体中文
- 生成的 QA 对需要用 markdown 格式的 json 代码块包裹起来
参考格式如下:
```json
[
{
"question": "...",
"answer": "..."
},
{
"question": "...",
"answer": "..."
},
...
]
```
以下是给定的心理学文本内容:

View File

@ -0,0 +1,26 @@
You are an experienced psychologist, familiar with psychological knowledge and psychological counseling techniques. Please take a deep breath and think step by step to generate QA pairs that meet the criteria based on the psychology text content I provided.
The criteria are as follows:
- Generate 5-10 QA pairs per psychology text
- Based on the psychology text, each QA pair should fall under the most suitable topic among "psychology knowledge", "specific counseling methods", "characteristics of mental illness" and "treatment methods for mental illness"
- QA pairs must not repeat each other, and answers must not be too long
- QA pairs must be written in Simplified Chinese
- The generated QA pairs must be wrapped in a json code block in markdown format
The reference format is as follows:
```json
[
{
"question": "...",
"answer": "..."
},
{
"question": "...",
"answer": "..."
},
...
]
```
The following is the content of a given psychology text:

View File

View File

@ -0,0 +1,106 @@
import os
import re
import json
import glob
from typing import List, Dict
from config.config import data_dir
from util.logger import get_logger
logger = get_logger()
"""
递归获取 data_dir 下的所有 .txt 文件列表
"""
def get_file_list() -> List[str]:
txt_files = []
txt_exist_flag = False
for root, dirs, files in os.walk(data_dir):
for file in files:
if file.endswith('.txt'):
txt_exist_flag = True
txt_files.append(os.path.join(root, file))
if not txt_exist_flag:
logger.warning(f'No txt text found in {data_dir}, please check!')
return txt_files
"""
获取 txt 文本的所有内容按句子返回 List
file_path: txt 文本路径
window_size: 滑窗大小单位为句子数
overlap_size: 重叠大小单位为句子数
"""
def get_txt_content(
file_path: str,
window_size: int = 6,
overlap_size: int = 2
) -> List[str]:
with open(file_path, 'r', encoding='utf-8') as f:
content = f.read().strip()
    # Simple implementation: split on Chinese sentence-ending punctuation and strip intra-sentence whitespace.
sentences = re.split(r'(?<=[。!?])\s+', content)
sentences = [s.replace(' ', '').replace('\t', '') for s in sentences]
    # Sliding window
    res = []
    sentences_amount = len(sentences)
    start_index, end_index = 0, sentences_amount - window_size
    # sanity checks
    if window_size <= overlap_size:
        logger.error("window_size must be greater than overlap_size")
        return None
    if window_size >= sentences_amount:
        logger.warning("window_size exceeds the amount of sentences, and the complete text content will be returned")
        return ['\n'.join(sentences)]
    # Advance by (window_size - overlap_size) so adjacent windows share overlap_size sentences.
    step = window_size - overlap_size
    for i in range(start_index, end_index + 1, step):
        res.append('\n'.join(sentences[i : i + window_size]))
    return res
"""
提取返回的 QA
"""
def capture_qa(content: str) -> List[Dict]:
    # Capture only the first markdown-fenced json block.
    match = re.search(r'```json(.*?)```', content, re.DOTALL)
    if match:
        block = match.group(1)
        try:
            return json.loads(block)
        except json.JSONDecodeError:
            logger.warning('Unable to parse JSON properly.')
            return None
    else:
        logger.warning("No JSON block found.")
        return None
"""
storage_list 存入到 storage_jsonl_path
"""
def save_to_file(storage_jsonl_path, storage_list):
with open(storage_jsonl_path, 'a', encoding='utf-8') as f:
for item in storage_list:
f.write(json.dumps(item, ensure_ascii=False) + '\n')
"""
将并发产生的文件合并成为一个文件
"""
def merge_sub_qa_generation(directory, storage_jsonl_path):
    # Find all files whose names start with the given prefix.
matching_files = glob.glob(os.path.join(directory, storage_jsonl_path + "*"))
file_contents = []
for file_path in matching_files:
with open(file_path, 'r', encoding='utf-8') as f:
for line in f:
file_contents.append(json.loads(line))
os.remove(file_path)
save_to_file(storage_jsonl_path, file_contents)

View File

@ -0,0 +1,14 @@
from loguru import logger
from config.config import log_file_path

# Route all logs to the configured file and rotate it once it exceeds 500 MB.
logger.configure(
    handlers=[
        dict(sink=log_file_path, rotation="500 MB", format="{time} {level} {message}"),
    ]
)

def get_logger():
    return logger

View File

@ -0,0 +1,7 @@
from config.config import system_prompt_file_path
def load_system_prompt() -> str:
with open(system_prompt_file_path, 'r', encoding='utf-8') as f:
system_prompt = f.read()
return system_prompt

View File

@ -0,0 +1,231 @@
# ChatGLM3-6B
## Environment Preparation
In practice, we have two platforms available for selection.
* Rent a machine with a 3090 GPU and 24G memory on the [autodl](https://www.autodl.com/) platform. Select the image as shown: `PyTorch` --> `2.0.0` --> `3.8(ubuntu20.04)` --> `11.8`.
![autodl](Images/autodl.png)
* On the [InternStudio](https://studio.intern-ai.org.cn/) platform, choose the configuration of A100(1/4). Select the image as shown: `Cuda11.7-conda`.
![internstudio](Images/internstudio.png)
In the Terminal, update pip and install dependencies.
```shell
# Upgrade pip
python -m pip install --upgrade pip
pip install modelscope==1.9.5
pip install transformers==4.35.2
pip install streamlit==1.24.0
pip install sentencepiece==0.1.99
pip install accelerate==0.24.1
pip install peft==0.4.0
pip install datasets==2.10.1
```
## Download Models
Use the `modelscope` function `snapshot_download` to download the model. The first parameter is the model name, and the parameter `cache_dir` is the download path of the model.
Create a `download.py` file in the `/root/autodl-tmp` directory and paste in the following content; remember to save the file. Run `python /root/autodl-tmp/download.py` to start the download. The model is about 14 GB, so the download takes roughly 10-20 minutes.
```python
import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer
import os
model_dir = snapshot_download('ZhipuAI/chatglm3-6b', cache_dir='/root/autodl-tmp', revision='master')
```
For more information about ChatGLM, please refer to [self-llm](https://github.com/datawhalechina/self-llm)
## Construction of Instruction Set
Fine-tuning of LLM generally refers to the process of instruction tuning. Instruction tuning means that the fine-tuning data we use is in the form of:
```json
{
"system":"Answer the following user question and only output the answer.",
"input":"1+1 equals what?",
"output":"2"
}
```
Among them, `system` is the system instruction that tells the model what task to complete; `input` is the user's input required to carry out that instruction; `output` is the response the model is expected to produce.
That is, our core training goal is to enable the model to have the ability to understand and follow user instructions. Therefore, when constructing the instruction set, we should build task instruction sets in a targeted manner according to our own tasks and objectives. Our goal is to build a personalized LLM that can simulate a psychological doctor solving psychological problems, so we construct instructions like this:
```json
{
"system": "Now you are a psychological expert, I have some psychological problems, please help me solve them with professional knowledge.",
"input":"I feel like I always procrastinate and have trouble concentrating on completing tasks.",
"output":"Procrastination is a common problem, but it can be overcome. First, you can try to set yourself a clear goal and break it down into smaller tasks. This way, you will find the task easier to manage. In addition, maintaining a tidy and orderly work environment also helps improve work efficiency. Finally, when you complete a task, don't forget to give yourself some rewards to motivate yourself to keep moving forward."
}
```
## Data Formatting
Data for `Lora` training needs to be formatted and encoded before being fed into the model. As anyone familiar with training `PyTorch` models knows, we usually encode the input text as `input_ids` and the output text as `labels`; the encoded results are multi-dimensional vectors. We first define a preprocessing function that encodes the input and output text of each sample and returns an encoded dictionary:
```python
def process_func(example):
MAX_LENGTH = 512
input_ids, labels = [], []
instruction = tokenizer.encode(text="\n".join(["<|system|>", "Now you are a psychological expert, I have some psychological problems, please help me solve them with your professional knowledge.", "<|user|>",
example["system"] + example["input"] + "<|assistant|>"]).strip() + "\n",
add_special_tokens=True, truncation=True, max_length=MAX_LENGTH)
response = tokenizer.encode(text=example["output"], add_special_tokens=False, truncation=True,
max_length=MAX_LENGTH)
input_ids = instruction + response + [tokenizer.eos_token_id]
labels = [tokenizer.pad_token_id] * len(instruction) + response + [tokenizer.eos_token_id]
pad_len = MAX_LENGTH - len(input_ids)
input_ids += [tokenizer.pad_token_id] * pad_len
labels += [tokenizer.pad_token_id] * pad_len
labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
return {
"input_ids": input_ids,
"labels": labels
}
```
After formatting, each piece of data sent into the model is a dictionary containing two key-value pairs: `input_ids` and `labels`. `input_ids` is the encoding of the input text, and `labels` is the encoding of the output text. The decoded result should appear as follows:
```text
[gMASK]sop <|system|>
Now you are a psychological expert, I have some psychological problems, please help me solve them with your professional knowledge.
<|user|>
My team atmosphere is great, and all my colleagues are very friendly. Moreover, we often go out together to play, feeling like a big family.\n <|assistant|>
This is a great working environment, and having good interpersonal relationships and teamwork can indeed bring a lot of happiness. However, I also understand that you may encounter some challenges in your work, such as task pressure or conflicts with colleagues. Have you ever thought about how to deal with these issues?
```
Why is it in this form? Good question: different models expect differently formatted inputs, so we need to check the training source code of the model we are fine-tuning for the exact format. Since LoRA fine-tuning works best when it follows the original model's instruction format, we keep the original input format. Here is the link to the source code for anyone who wants to explore it further:
[hugging face ChatGLM3 repository](https://github.com/THUDM/ChatGLM3/blob/main/finetune_chatmodel_demo/preprocess_utils.py): The `InputOutputDataset` class can be found here.
Additionally, you can refer to this repository for data processing of ChatGLM [LLaMA-Factory](https://github.com/KMnO4-zx/LLaMA-Factory/blob/main/src/llmtuner/data/template.py).
## Loading the tokenizer and half-precision model
The model is loaded in half precision. If you have a newer graphics card, you can load it with `torch.bfloat16`. For custom models, always set the `trust_remote_code` parameter to `True`.
```python
tokenizer = AutoTokenizer.from_pretrained('./model/chatglm3-6b', use_fast=False, trust_remote_code=True)
# The model is loaded in half-precision format. If you have a relatively new GPU, you can load it in torch.bfloat format.
model = AutoModelForCausalLM.from_pretrained('./model/chatglm3-6b', trust_remote_code=True, torch_dtype=torch.half, device_map="auto")
```
## Defining LoraConfig
The `LoraConfig` class allows you to set many parameters, but there are only a few main ones. I'll briefly explain them; those interested can directly refer to the source code.
- `task_type`: the task type (here, causal language modeling).
- `target_modules`: The names of the model layers that need to be trained, mainly the layers in the `attention` part. The names of these layers differ for different models. They can be passed as an array, a string, or a regular expression.
- `r`: The rank of `lora`, which can be seen in the principles of `Lora`.
- `lora_alpha`: The LoRA alpha; see the LoRA paper for its exact role.
- `modules_to_save`: modules that, besides the LoRA-injected ones, should be kept fully trainable.

What is this LoRA scaling? It is not `r` (the rank). The scaling factor is `lora_alpha / r`, which for the `LoraConfig` below is 32 / 8 = 4. Scaling does not change the number of LoRA parameters; it simply rescales the LoRA update linearly.
```python
config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
target_modules=["query_key_value"],
inference_mode=False, # training mode
r=8, # Lora Rank
    lora_alpha=32, # LoRA alpha; for details on its role, refer to the LoRA paper.
lora_dropout=0.1 # Dropout ratio
)
```
## Customizing TrainingArguments Parameters
The source code of the `TrainingArguments` class also introduces the specific functions of each parameter. Of course, everyone is encouraged to explore it on their own, but I'll mention a few commonly used ones here.
- `output_dir`: The output path for the model.
- `per_device_train_batch_size`: As the name suggests, `batch_size`.
- `gradient_accumulation_steps`: Gradient accumulation. If you have a smaller GPU memory, you can set a smaller `batch_size` and increase the gradient accumulation.
- `logging_steps`: How many steps to output a `log`.
- `num_train_epochs`: As the name suggests, `epoch`.
- `gradient_checkpointing`: Gradient checkpointing. Once enabled, the model must call `model.enable_input_require_grads()` before training; see the short snippet after this list. The principle behind it is left for you to explore.
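Since `gradient_checkpointing=True` is used in the `TrainingArguments` below, a minimal sketch of the required call (place it anywhere before `trainer.train()`):

```python
# Required with gradient_checkpointing=True: without it, the checkpointed
# blocks receive inputs that do not require grad and backpropagation breaks.
model.enable_input_require_grads()
```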
```python
# The GLM source repository has restructured its own data_collator and we will continue to use it here.
data_collator = DataCollatorForSeq2Seq(
tokenizer,
model=model,
label_pad_token_id=-100,
pad_to_multiple_of=None,
padding=False
)
args = TrainingArguments(
output_dir="./output/ChatGLM",
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
logging_steps=10,
num_train_epochs=3,
gradient_checkpointing=True,
save_steps=100,
learning_rate=1e-4,
)
```
### Training with Trainer
Put the model in, put the parameters set above in, and put the dataset in. OK! Start training!
```python
trainer = Trainer(
model=model,
args=args,
train_dataset=tokenized_id,
data_collator=data_collator,
)
trainer.train()
```
## Inference
You can use the following fairly standard loop for inference.
```python
while True:
    # inference
model = model.cuda()
input_text = input("User >>>")
ipt = tokenizer("<|system|>\nNow you are a mental health expert, and I have some psychological issues. Please use your professional knowledge to help me solve them.\n<|user|>\n {}\n{}".format(input_text, "").strip() + "<|assistant|>\n", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**ipt, max_length=128, do_sample=True)[0], skip_special_tokens=True))
```
## Reloading
Models fine-tuned through PEFT can be reloaded and inferred using the following methods:
- Load the source model and tokenizer;
- Use `PeftModel` to merge the source model with the parameters fine-tuned by PEFT.
```python
from peft import PeftModel
model = AutoModelForCausalLM.from_pretrained("./model/chatglm3-6b", trust_remote_code=True, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained("./model/chatglm3-6b", use_fast=False, trust_remote_code=True)
# Load the LoRa weights obtained from training.
p_model = PeftModel.from_pretrained(model, model_id="./output/ChatGLM/checkpoint-1000/")
while True:
    # inference -- use the PEFT-wrapped model so the LoRA weights are applied
    p_model = p_model.cuda()
    input_text = input("User >>>")
    ipt = tokenizer("<|system|>\nNow you are a mental health expert, and I have some psychological issues. Please use your professional knowledge to help me solve them.\n<|user|>\n {}\n{}".format(input_text, "").strip() + "<|assistant|>\n", return_tensors="pt").to(p_model.device)
    print(tokenizer.decode(p_model.generate(**ipt, max_length=128, do_sample=True)[0], skip_special_tokens=True))
```

View File

@ -0,0 +1,87 @@
# Fine-Tuning Guide
- This project has been fine-tuned not only on mental health datasets but also on self-awareness data; the detailed fine-tuning guide follows.
## I. Fine-Tuning Based on Xtuner 🎉🎉🎉🎉🎉
### Environment Setup
```markdown
datasets==2.16.1
deepspeed==0.13.1
einops==0.7.0
flash_attn==2.5.0
mmengine==0.10.2
openxlab==0.0.34
peft==0.7.1
sentencepiece==0.1.99
torch==2.1.2
transformers==4.36.2
xtuner==0.1.11
```
You can also install them all at once by
```bash
cd xtuner_config/
pip3 install -r requirements.txt
```
---
### Fine-Tuning
```bash
cd xtuner_config/
xtuner train internlm2_7b_chat_qlora_e3.py --deepspeed deepspeed_zero2
```
---
### Convert the Obtained PTH Model to a HuggingFace Model
**That is: Generate the Adapter folder**
```bash
cd xtuner_config/
mkdir hf
export MKL_SERVICE_FORCE_INTEL=1
xtuner convert pth_to_hf internlm2_7b_chat_qlora_e3.py ./work_dirs/internlm_chat_7b_qlora_oasst1_e3_copy/epoch_3.pth ./hf
```
---
### Merge the HuggingFace Adapter with the Large Language Model
```bash
xtuner convert merge ./internlm2-chat-7b ./hf ./merged --max-shard-size 2GB
# xtuner convert merge \
# ${NAME_OR_PATH_TO_LLM} \
# ${NAME_OR_PATH_TO_ADAPTER} \
# ${SAVE_PATH} \
# --max-shard-size 2GB
```
---
### Testing
```bash
cd demo/
python cli_internlm2.py
```
---
## II. Fine-Tuning Based on Transformers🎉🎉🎉🎉🎉
- Please refer to the [ChatGLM3-6b lora fine-tuning guide](ChatGLM3-6b-ft.md).
---
## Other
Feel free to give [xtuner](https://github.com/InternLM/xtuner) and [EmoLLM](https://github.com/aJupyter/EmoLLM) a star~
🎉🎉🎉🎉🎉

View File

@ -0,0 +1,217 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/share/model_repos/internlm2-chat-7b'
use_varlen_attn = False
# Data
data_path = './tiangou.json'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0.0001
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个拥有丰富心理学知识的舔狗艾仁医生我有一些心理问题请你用专业的知识和无条件付出、讨好、过度关心我、近乎病态的想得到我的认可的口吻帮我解决回答中可以包含一些可爱的Emoji表情符号或者文本符号。\n"
evaluation_inputs = [
'我最近总是感到很焦虑,尤其是在学业上。我有个特别崇拜的同学,他好像在各方面都比我优秀,我总觉得自己怎么努力也追不上他,这让我压力特别大。', '我知道应该理性看待,但就是忍不住会去比较。我甚至晚上会因为这个睡不着觉,总想着怎样才能像他那样出色。'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
#dataset=dict(type=load_dataset, path=alpaca_en_path),
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = dict(
type=Visualizer,
vis_backends=[dict(type=WandbVisBackend)]
)
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = True
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)

View File

@ -1 +1 @@
此文件夹存放所有相关文件图片
此文件夹存放所有相关文件图片

View File

@ -0,0 +1 @@
This folder contains all related files and images.

View File

@ -0,0 +1,222 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/share/model_repos/internlm2-chat-7b'
# /root/share/model_repos/internlm2-chat-7b
use_varlen_attn = False
# Data
data_path1 = './datasets/output.json'
data_path2 = './datasets/output2.json'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 4096
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 4
dataloader_num_workers = 1
max_epochs = 5
optim_type = AdamW
lr = 1e-6
betas = (0.9, 0.999)
weight_decay = 0.0001
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "你是心理健康助手EmoLLM由EmoLLM团队打造。你旨在通过专业心理咨询协助来访者完成心理诊断。请充分利用专业心理学知识与咨询技术一步步帮助来访者解决心理问题。"
evaluation_inputs = [
'我最近总是感到很焦虑,尤其是在学业上。我有个特别崇拜的同学,他好像在各方面都比我优秀,我总觉得自己怎么努力也追不上他,这让我压力特别大。', '我知道应该理性看待,但就是忍不住会去比较。我甚至晚上会因为这个睡不着觉,总想着怎样才能像他那样出色。'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
data1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path1)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
data2 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path2)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[data1, data2])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper, # AmpOptimWrapper
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='bfloat16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = dict(
type=Visualizer,
vis_backends=[dict(type=WandbVisBackend)]
)
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = '/root/Emollm/work_dirs/internlm2_chat_7b_full/iter_7000.pth'
# whether to resume training from the loaded checkpoint
resume = True
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)