Merge Main code (#2)

* Add files via upload

* Add English markdown docs

* Update README.md

* Update README_EN.md

* Update LICENSE

* [docs] update lmdeploy file

* add ocr.md

* Update tutorial.md

* Update tutorial_EN.md

* Update General_evaluation_EN.md

* Update General_evaluation_EN.md

* Update README.md

Add InternLM2_7B_chat_full's professional evaluation results

* Update Professional_evaluation.md

* Update Professional_evaluation.md

* Update Professional_evaluation.md

* Update Professional_evaluation.md

* Update Professional_evaluation_EN.md

* Update README.md

* Update README.md

* Update README_EN.md

* Update README_EN.md

* Update README_EN.md

* [DOC] update readme

* Update LICENSE

* Update LICENSE

* update personal info and small format optimizations

* update personal info and translations for contents in a table

* Update RAG README

* Update demo link in README.md

* Update xlab app link

* Update xlab link

* add xlab model

* Update web_demo-aiwei.py

* add bibtex

---------

Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com>
Co-authored-by: এ許我辞忧࿐♡ <127636623+Smiling-Weeping-zhr@users.noreply.github.com>
Co-authored-by: Vicky <vicky_3021@163.com>
Co-authored-by: MING_X <119648793+MING-ZCH@users.noreply.github.com>
Co-authored-by: Nobody-ML <1755309985@qq.com>
Co-authored-by: 8baby8 <3345710651@qq.com>
Co-authored-by: chaoke <101492509+8baby8@users.noreply.github.com>
Co-authored-by: aJupyter <ajupyter@163.com>
Co-authored-by: HongCheng <kwchenghong@gmail.com>
Co-authored-by: santiagoTOP <1537211712top@gmail.com>
This commit is contained in:
Anooyman 2024-03-15 19:51:04 +08:00 committed by GitHub
parent c01c67c33f
commit 000491f1be
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
23 changed files with 1243 additions and 215 deletions


@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2024 aJupyter、Farewell、jujimeizuo、Smiling&Weeping、散步
Copyright (c) 2024 aJupyter、MING-ZCH、Farewell、jujimeizuo
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
@@ -12,6 +12,11 @@ furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
For portions of the software that are derived from other repositories, the original licenses shall apply.
The specific portions are documented in the `./datasets/README.md` included with this distribution.
Users are responsible for complying with the terms and conditions of the original licenses
when using or distributing these portions of the software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE

README.md

@@ -1,53 +1,66 @@
<div align="center">
# EmoLLM - Mental Health Large Language Model
<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-shield]][stars-url]
<br />
<!-- PROJECT LOGO -->
</div>
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/logo.jpeg" alt="Logo" width="30%">
</a>
<div align="center">
<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Issues][issues-shield]][issues-url]
[![OpenXLab_App][OpenXLab_App-image]][OpenXLab_App-url]
[![OpenXLab_Model][OpenXLab_Model-image]][OpenXLab_Model-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-shield]][stars-url]
</div>
<h3 align="center">EmoLLM</h3>
<p align="center">
简体中文| <a href="README_EN.md" >English</a>
<div align="center">
简体中文| <a href="README_EN.md" >English</a>
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM"><strong>探索本项目的文档 »</strong></a>
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM/tree/main/demo">查看Demo</a>
<a href="https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0">体验EmoLLM 2.0</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">报告Bug</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">提出新特性</a>
</p>
</div>
</p>
<!-- This README.md is intended for developers -->
**EmoLLM** is a series of mental health large language models able to support the **understand users - support users - help users** counseling pipeline, instruction fine-tuned from base `LLM`s. A star would be much appreciated~⭐⭐. The `LLM` fine-tuning configurations open-sourced so far are as follows:
<div align="center">
| Model | Type |
| :-------------------: | :------: |
| InternLM2_7B_chat | qlora |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_1_8B_chat | full fine-tuning |
| Qwen_7b_chat | qlora |
| Qwen1_5-0_5B-Chat | full fine-tuning |
| Baichuan2_13B_chat | qlora |
| ChatGLM3_6B | lora |
| DeepSeek MoE_16B_chat | qlora |
| Mixtral 8x7B_instruct | qlora |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_1_8B_chat | full fine-tuning |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
| Qwen1_5-0_5B-Chat | full fine-tuning |
| Baichuan2_13B_chat | QLORA |
| ChatGLM3_6B | LORA |
| DeepSeek MoE_16B_chat | QLORA |
| Mixtral 8x7B_instruct | QLORA |
| …… | …… |
</div>
Everyone is welcome to contribute to this project~
---
@@ -63,10 +76,13 @@
- Prevention and intervention measures: the mental health model also covers strategies for preventing psychological problems and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
- Assessment and diagnostic tools: effective promotion of mental health requires scientific tools to assess an individual's psychological state and diagnose potential psychological problems.
### Recent Updates
- 【2024.3.9】 Added concurrency to accelerate QA pair generation
- 【2024.3.3】 [Full fine-tuned version based on InternLM2-7B-chat open-sourced](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full) (requires two A100*80G); professional evaluation updated, see [evaluate](./evaluate/); PaddleOCR-based PDF-to-txt tool script updated, see [scripts](./scripts/)
- 【2024.2.29】 Objective evaluation calculation updated, see [evaluate](./evaluate/); a series of datasets updated, see [datasets](./datasets/).
### 🎇Recent Updates
- 【2024.3.12】 Released [aiwei](https://aistudio.baidu.com/community/app/63335) on the Baidu AI Studio (PaddlePaddle) platform
- 【2024.3.11】 **EmoLLM V2.0 improves on EmoLLM V1.0 across the board and has surpassed Role-playing ChatGPT on counseling tasks!** [Click to try EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0); updated [dataset statistics and details](./datasets/) and the [roadmap](./assets/Roadmap_ZH.png)
- 【2024.3.9】 Added concurrency to accelerate [QA pair generation](./scripts/qa_generation/); added the [RAG pipeline](./rag/)
- 【2024.3.3】 [Full fine-tuned version EmoLLM V2.0 based on InternLM2-7B-chat open-sourced](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full) (requires two A100*80G); professional evaluation updated, see [evaluate](./evaluate/); PaddleOCR-based PDF-to-txt tool script updated, see [scripts](./scripts/)
- 【2024.2.29】 Objective evaluation calculation updated, see [evaluate](./evaluate/); a series of datasets updated, see [datasets](./datasets/)
- 【2024.2.27】 Updated the English README and a series of datasets (licking-dog and single-turn dialogue)
- 【2024.2.23】 Released the `gentle-lady psychologist Ai Wei` based on InternLM2_7B_chat_qlora: [click to get the model weights](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei), [configuration file](xtuner_config/aiwei-internlm2_chat_7b_qlora.py), [online demo](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
- 【2024.2.23】 Updated [several fine-tuning configurations](/xtuner_config/); added [data_pro.json](/datasets/data_pro.json) (more samples, fuller scenarios, richer content) and [aiwei.json](/datasets/aiwei.json) (dedicated to the gentle-lady role play, with Emoji); `gentle-lady psychologist Ai Wei` coming soon
@@ -89,11 +105,11 @@
- 【2024.2.3】 [Project promo video](https://www.bilibili.com/video/BV1N7421N76X/) completed 😊
- 【2024.1.27】 Improved the data construction docs, fine-tuning guide, deployment guide, README, and other documentation 👏
- 【2024.1.25】 Completed the first version of EmoLLM and deployed it online https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
- 【2024.1.25】 EmoLLM V1.0 deployed online https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
</details>
### Roadmap
### 🎯Roadmap
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
@@ -103,22 +119,23 @@
## Contents
- [EmoLLM - Mental Health Large Language Model](#emollm-心理健康大模型)
- [Recent Updates](#最近更新)
- [🎇Recent Updates](#最近更新)
- [🎯Roadmap](#路线图)
- [Contents](#目录)
- [Pre-development configuration requirements](#开发前的配置要求)
- [**User Guide**](#使用指南)
- [File directory description](#文件目录说明)
- [Data construction](#数据构建)
- [Fine-tuning guide](#微调指南)
- [Deployment guide](#部署指南)
- [RAG (Retrieval-Augmented Generation) pipeline](#rag检索增强生成pipeline)
- [Frameworks used](#使用到的框架)
- [How to participate in this project](#如何参与本项目)
- [Version control](#版本控制)
- [Authors (in no particular order)](#作者排名不分先后)
- [Copyright notice](#版权说明)
- [Special thanks](#特别鸣谢)
- [Star History](#star-history)
- [🌟 Contributors](#-contributors)
- [Communication group](#交流群)
###### Pre-development configuration requirements
@@ -133,31 +150,17 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
```
2. Read in order, or pick the parts you are interested in:
   - [File directory description](#文件目录说明)
   - [Data construction](#数据构建)
   - [Fine-tuning guide](#微调指南)
   - [Deployment guide](#部署指南)
   - [RAG](#rag检索增强生成pipeline)
   - See more details
### File Directory Description
```
├─assets:image resources
├─datasets:datasets
├─demo:demo scripts
├─generate_data:data generation guide
│  └─xinghuo
├─scripts:some available tools
└─xtuner_config:fine-tuning guide
    └─images
```
### Data Construction
Please read the [data construction guide](generate_data/tutorial.md) for reference
- Please read the [data construction guide](generate_data/tutorial.md) for reference
The dataset used for this fine-tuning can be found at [datasets](datasets/data.json)
- The dataset used for fine-tuning can be found at [datasets](datasets/data.json)
### Fine-tuning Guide
@@ -165,16 +168,24 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
### Deployment Guide
See the [deployment guide](demo/README.md) for details
- Demo deployment: see the [deployment guide](demo/README.md) for details
- Quantized deployment based on [LMDeploy](https://github.com/InternLM/lmdeploy/): see [deploy](./deploy/lmdeploy.md)
### RAG (Retrieval-Augmented Generation) Pipeline
- See [RAG](./rag/) for details
<details>
<summary>More details</summary>
### Frameworks Used
- [Xtuner](https://github.com/InternLM/xtuner)
- [Xtuner](https://github.com/InternLM/xtuner): for fine-tuning
- [Transformers](https://github.com/huggingface/transformers)
- [Pytorch](https://pytorch.org/)
- [LMDeploy](https://github.com/InternLM/lmdeploy/): for quantized deployment
- [Streamlit](https://streamlit.io/): for building demos
- [DeepSpeed](https://github.com/microsoft/DeepSpeed): for parallel training
- …
#### How to Participate in This Project
@@ -187,45 +198,44 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
### Version Control
This project uses Git for version control. You can see the currently available versions in the repository.
</details>
### Authors (in no particular order)
[aJupyter](https://github.com/aJupyter)@Datawhale member, Master's student at Nankai University
[jujimeizuo](https://github.com/jujimeizuo)@Master's student at Jiangnan University
[Smiling&amp;Weeping](https://github.com/Smiling-Weeping-zhr)@Undergraduate at Harbin Institute of Technology (Weihai)
[Farewell](https://github.com/8baby8)@PaddlePaddle Pilot Team regional director, core developer of the Wenxin large model
[ZhouXinAo](https://github.com/zxazys)@Master's student at Nankai University
[MING_X](https://github.com/MING-ZCH)@Undergraduate at Huazhong University of Science and Technology
[Z_L](https://github.com/JasonLLLLLLLLLLL)@SWUFE
[MrCatAI](https://github.com/MrCatAI)@AI mover
[ZeyuBa](https://github.com/ZeyuBa)@Master's student at Institute of Automation
[aiyinyuedejustin](https://github.com/aiyinyuedejustin)@Master's student at University of Pennsylvania
[Nobody-ML](https://github.com/Nobody-ML)@Undergraduate at China University of Petroleum (East China)
[chg0901](https://github.com/chg0901)@PhD candidate at Kwangwoon University
[Mxoder](https://github.com/Mxoder)@Undergraduate at Beihang University
[Anooyman](https://github.com/Anooyman)@Master's student at Nanjing University of Science and Technology
| Username | School/Organization | Remarks | Contributions |
| :----------: | :--------------------: | :-------------------: | :----------: |
| [aJupyter](https://github.com/aJupyter) | Master's student at Nankai University | DataWhale member | Project initiator |
| [jujimeizuo](https://github.com/jujimeizuo) | Master's student at Jiangnan University | | |
| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Undergraduate at Harbin Institute of Technology (Weihai) | | |
| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team regional director | Core developer of the Wenxin large model | |
| [zxazys](https://github.com/zxazys) | Master's student at Nankai University | | |
| [MING-ZCH](https://github.com/MING-ZCH) | Undergraduate at Huazhong University of Science and Technology | | |
| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | SWUFE | | |
| [MrCatAI](https://github.com/MrCatAI) | AI mover | | |
| [ZeyuBa](https://github.com/ZeyuBa) | Master's student at Institute of Automation | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | Master's student at University of Pennsylvania | | |
| [Nobody-ML](https://github.com/Nobody-ML) | Undergraduate at China University of Petroleum (East China) | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora/) | Main maintainer of MiniSora | Data cleaning, document translation |
| [Mxoder](https://github.com/Mxoder) | Undergraduate at Beihang University | | |
| [Anooyman](https://github.com/Anooyman) | Master's student at Nanjing University of Science and Technology | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Master's student at Xidian University (research year 0) | | |
| [SantiagoTOP](https://github.com/santiagoTOP) | Master's student at Taiyuan University of Technology | | |
### Copyright Notice
The project is licensed under the MIT License; see [LICENSE](https://github.com/aJupyter/EmoLLM/blob/master/LICENSE) for details
The project is licensed under the MIT License; see [LICENSE](https://github.com/SmartFlowAI/EmoLLM/blob/main/LICENSE) for details
### Citation
If this project helps your work, please cite it with the following format:
```bibtex
@misc{EmoLLM,
title={EmoLLM},
author={EmoLLM},
url={https://github.com/SmartFlowAI/EmoLLM/},
year={2024}
}
```
### Special Thanks
@@ -233,6 +243,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
- [Shanghai Artificial Intelligence Laboratory](https://www.shlab.org.cn/)
- [Vansin (assistant)](https://github.com/vansin)
- [Bloom up (WeChat Official Account promotion)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
- Abu (M.A. in Psychology, Peking University)
<!-- links -->
@@ -240,7 +251,6 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
<!-- [linkedin-url]: https://linkedin.com/in/aJupyter -->
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=SmartFlowAI/EmoLLM&type=Date)](https://star-history.com/#SmartFlowAI/EmoLLM&Date)
@@ -259,14 +269,17 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
[issues-shield]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg?style=flat-square
[issues-url]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE
[license-url]: https://github.com/SmartFlowAI/EmoLLM/blob/main/LICENSE
[OpenXLab_App-image]: https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg
[OpenXLab_Model-image]: https://cdn-static.openxlab.org.cn/header/openxlab_models.svg
[OpenXLab_App-url]: https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0
[OpenXLab_Model-url]: https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full
## Communication Group
- If the QR code expires, please check the Issues section
<p align="center">
  <img width="30%" src="https://github.com/SmartFlowAI/EmoLLM/assets/62385492/55ecd0aa-4832-4269-ad57-4c26f9aa286b" alt="EmoLLM official communication group">
</p>


@@ -1,4 +1,15 @@
# EmoLLM - Large Languge Model for Mental Health
<div align="center">
# EmoLLM - Large Language Model for Mental Health
</div>
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/logo.jpeg" alt="Logo" width="30%">
</a>
<div align="center">
<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
@@ -6,13 +17,8 @@
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-shield]][stars-url]
<br />
<!-- PROJECT LOGO -->
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/logo.jpeg" alt="Logo" width="30%">
</a>
</div>
<h3 align="center">EmoLLM</h3>
@@ -35,24 +41,31 @@
<!-- 本篇README.md面向开发者 -->
**EmoLLM** is a series of large language models designed to understand, support and help customers in mental health counseling. It is fine-tuned from the LLM instructions. We really appreciate it if you can give it a star~⭐⭐. The open-sourced configuration is as follows:
**EmoLLM** is a series of large language models designed to understand, support and help customers in mental health counseling. It is fine-tuned from the LLM instructions. We really appreciate it if you could give it a star~⭐⭐. The open-sourced configuration is as follows:
| model | type |
<div align="center">
| Model | Type |
| :-------------------: | :------: |
| InternLM2_7B_chat | qlora |
| InternLM2_7B_chat | full finetuning |
| InternLM2_1_8B_chat | full finetuning |
| Qwen_7b_chat | qlora |
| Qwen1_5-0_5B-Chat | full finetuning |
| Baichuan2_13B_chat | qlora |
| ChatGLM3_6B | lora |
| DeepSeek MoE_16B_chat | qlora |
| Mixtral 8x7B_instruct | qlora |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_1_8B_chat | full fine-tuning |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
| Qwen1_5-0_5B-Chat | full fine-tuning |
| Baichuan2_13B_chat | QLORA |
| ChatGLM3_6B | LORA |
| DeepSeek MoE_16B_chat | QLORA |
| Mixtral 8x7B_instruct | QLORA |
| …… | …… |
</div>
Everyone is welcome to contribute to this project ~
---
The Model is aimed at fully understanding and promoting the mental health of individuals, groups, and society. This model typically includes the following key components:
The Model aims to fully understand and promote the mental health of individuals, groups, and society. This model typically includes the following key components:
- Cognitive factors: Involving an individual's thought patterns, belief systems, cognitive biases, and problem-solving abilities. Cognitive factors significantly impact mental health as they affect how individuals interpret and respond to life events.
- Emotional factors: Including emotion regulation, emotional expression, and emotional experiences. Emotional health is a crucial part of mental health, involving how individuals manage and express their emotions and how they recover from negative emotions.
@@ -63,8 +76,10 @@ The Model is aimed at fully understanding and promoting the mental health of ind
- Prevention and intervention measures: The Mental Health Grand Model also includes strategies for preventing psychological issues and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
- Assessment and diagnostic tools: Effective promotion of mental health requires scientific tools to assess individuals' psychological states and diagnose potential psychological issues.
### Recent Updates
- 【2024.3.9】New concurrency feature speeds up QA pair generation
- 【2024.3.3】 [Based on InternLM2-7B-chat full amount of fine-tuned version of open source](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full), need two A100*80G, update professional evaluation, see [evaluate](./evaluate/), update PaddleOCR-based PDF to txt tool scripts, see [scripts](./scripts/).
- 【2024.3.12】 Released [aiwei](https://aistudio.baidu.com/community/app/63335) on the Baidu AI Studio (PaddlePaddle) platform
- 【2024.3.11】 **EmoLLM V2.0 is greatly improved in all scores compared to EmoLLM V1.0 and surpasses the performance of Role-playing ChatGPT on counseling tasks!** [Click to experience EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0); updated [dataset statistics and details](./datasets/) and the [Roadmap](./assets/Roadmap_ZH.png)
- 【2024.3.9】 Added concurrency to accelerate [QA pair generation](./scripts/qa_generation/); added the [RAG pipeline](./rag/)
- 【2024.3.3】 [Based on InternLM2-7B-chat full fine-tuned version EmoLLM V2.0 open sourced](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full), need two A100*80G, update professional evaluation, see [evaluate](./evaluate/), update PaddleOCR-based PDF to txt tool scripts, see [scripts](./scripts/).
- 【2024.2.29】 Updated objective assessment calculations, see [evaluate](./evaluate/) for details. A series of datasets have also been updated, see [datasets](./datasets/) for details.
- 【2024.2.27】 Updated English README and a series of datasets (licking dogs and one-round dialogue)
- 【2024.2.23】The "Gentle Lady Psychologist Ai Wei" based on InternLM2_7B_chat_qlora was launched. [Click here to obtain the model weights](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei), [configuration file](xtuner_config/aiwei-internlm2_chat_7b_qlora.py), [online experience link](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
@@ -91,7 +106,7 @@ The Model is aimed at fully understanding and promoting the mental health of ind
- 【2024.2.3】 [Project video](https://www.bilibili.com/video/BV1N7421N76X/) released on bilibili 😊
- 【2024.1.27】 Complete data construction documentation, fine-tuning guide, deployment guide, Readme, and other related documents 👏
- 【2024.1.25】 Complete the first version of EmoLLM and deploy it online https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
- 【2024.1.25】 EmoLLM V1.0 has been deployed online https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
</details>
@@ -104,9 +119,9 @@ The Model is aimed at fully understanding and promoting the mental health of ind
## Contents
- [EmoLLM - Large Languge Model for Mental Health](#emollm---large-languge-model-for-mental-health)
- [Everyone is welcome to contribute to this project ~](#everyone-is-welcome-to-contribute-to-this-project-)
- [EmoLLM - Large Language Model for Mental Health](#emollm---large-language-model-for-mental-health)
- [Recent Updates](#recent-updates)
- [Roadmap](#roadmap)
- [Contents](#contents)
- [Pre-development Configuration Requirements.](#pre-development-configuration-requirements)
- [**User Guide**](#user-guide)
@@ -114,6 +129,7 @@ The Model is aimed at fully understanding and promoting the mental health of ind
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- [RAG (Retrieval Augmented Generation) Pipeline](#rag-retrieval-augmented-generation-pipeline)
- [Frameworks Used](#frameworks-used)
- [How to participate in this project](#how-to-participate-in-this-project)
- [Version control](#version-control)
@@ -122,6 +138,7 @@ The Model is aimed at fully understanding and promoting the mental health of ind
- [Acknowledgments](#acknowledgments)
- [Star History](#star-history)
- [🌟 Contributors](#-contributors)
- [Communication group](#communication-group)
###### Pre-development Configuration Requirements.
@@ -147,21 +164,21 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
### File Directory Explanation
```
├─assets:Image Resources
├─datasets:Dataset
├─demo:demo scripts
├─generate_data:Data Generation Guide
├─assets: Image Resources
├─datasets: Dataset
├─demo: demo scripts
├─generate_data: Data Generation Guide
│ └─xinghuo
├─scriptsSome Available Tools
├─scripts: Some Available Tools
└─xtuner_config:Fine-tuning Guide
└─images
```
### Data Construction
Please read the [Data Construction Guide](generate_data/tutorial.md) for reference.
- Please read the [Data Construction Guide](generate_data/tutorial.md) for reference.
The dataset used for this fine-tuning can be found at [datasets](datasets/data.json)
- The dataset used for this fine-tuning can be found at [datasets](datasets/data.json)
### Fine-tuning Guide
@@ -169,7 +186,12 @@ For details, see the [fine-tuning guide](xtuner_config/README.md)
### Deployment Guide
For details, see the [deployment guide](demo/README.md)
- Demo deployment: see the [deployment guide](./demo/README.md) for details.
- Quantized deployment based on [LMDeploy](https://github.com/InternLM/lmdeploy/): see [deploy](./deploy/lmdeploy.md)
### RAG (Retrieval Augmented Generation) Pipeline
- See [RAG](./rag/)
<details>
<summary>Additional Details</summary>
@@ -179,6 +201,9 @@ For details, see the [deployment guide](demo/README.md)
- [Xtuner](https://github.com/InternLM/xtuner)
- [Transformers](https://github.com/huggingface/transformers)
- [Pytorch](https://pytorch.org/)
- [LMDeploy](https://github.com/InternLM/lmdeploy/): for quantized deployment
- [Streamlit](https://streamlit.io/): for building demos
- [DeepSpeed](https://github.com/microsoft/DeepSpeed): for parallel training
- …
#### How to participate in this project
@@ -193,39 +218,31 @@ Contributions make the open-source community an excellent place for learning, in
### Version control
This project uses Git for version control. You can see the current available versions in the repository.
This project uses Git for version control. You can see the currently available versions in the repository.
</details>
### Authors (in no particular order)
[aJupyter](https://github.com/aJupyter)@Datawhale member, Master's student at Nankai University
| Username | School/Organization | Remarks | Contributions |
| :-------: | :-------------------: | :------------------: | :--------: |
| [aJupyter](https://github.com/aJupyter) | Nankai University, Master's student | DataWhale member | Project initiator |
| [jujimeizuo](https://github.com/jujimeizuo) | Jiangnan University, Master's student | | |
| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Harbin Institute of Technology (Weihai), Undergraduate student | | |
| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team Regional Director | Wenxin Large Model core developer | |
| [zxazys](https://github.com/zxazys) | Nankai University, Master's student | | |
| [MING-ZCH](https://github.com/MING-ZCH) | Huazhong University of Science and Technology, Undergraduate student | | |
| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | SWUFE (Southwestern University of Finance and Economics) | | |
| [MrCatAI](https://github.com/MrCatAI) | AI Mover | | |
| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin|Data Cleaning and Docs Translation|
| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |
[jujimeizuo](https://github.com/jujimeizuo)@Master's student at Jiangnan University
[Smiling&amp;Weeping](https://github.com/Smiling-Weeping-zhr)@Undergraduate student at Harbin Institute of Technology (Weihai)
[Farewell](https://github.com/8baby8)@
[ZhouXinAo](https://github.com/zxazys)@Master's student at Nankai University
[MING_X](https://github.com/MING-ZCH) @Undergraduate at Huazhong University of Science and Technology
[Z_L](https://github.com/JasonLLLLLLLLLLL)@swufe
[MrCatAI](https://github.com/MrCatAI)@AI Removal of Labour
[ZeyuBa](https://github.com/ZeyuBa)@Master's student at Institute of Automation
[aiyinyuedejustin](https://github.com/aiyinyuedejustin)@Master's student at University of Pennsylvania
[Nobody-ML](https://github.com/Nobody-ML)@Undergraduate at China University of Petroleum (East China)
[chg0901](https://github.com/chg0901)@PhD Candidate at Kwangwoon University
[Mxoder](https://github.com/Mxoder)@Undergraduate at Beihang University
[Anooyman](https://github.com/Anooyman) @Master of Nanjing University of Science and Technology
### Copyright Notice
@@ -238,6 +255,7 @@ The project is licensed under the MIT License. Please refer to the details
- [Shanghai Artificial Intelligence Laboratory](https://www.shlab.org.cn/)
- [Vanin](https://github.com/vansin)
- [Bloom up (WeChat Official Account Promotion)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
- Abu (M.A. in Psychology, Peking University)
<!-- links -->

assets/turbomind结构.png (new binary file, 698 KiB)


@@ -26,3 +26,17 @@
| *Role-play* | SoulStar | QA | 11200+ |
| *Role-play* | tiangou | Conversation | 3900+ |
| …… | …… | …… | …… |
## Dataset Sources
**General**
* dataset `data` from this repo
* dataset `data_pro` from this repo
* dataset `multi_turn_dataset_1` from [Smile](https://github.com/qiuhuachuan/smile)
* dataset `multi_turn_dataset_2` from [CPsyCounD](https://github.com/CAS-SIAT-XinHai/CPsyCoun)
* dataset `single_turn_dataset_1` from this repo
* dataset `single_turn_dataset_2` from this repo
**Role-play**
* dataset `aiwei` from this repo
* dataset `tiangou` from this repo
* dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)


@@ -26,3 +26,18 @@
| *Role-play* | SoulStar | QA | 11200+ |
| *Role-play* | tiangou | Conversation | 3900+ |
| …… | …… | …… | …… |
## Source
**General**
* dataset `data` from this repo
* dataset `data_pro` from this repo
* dataset `multi_turn_dataset_1` from [Smile](https://github.com/qiuhuachuan/smile)
* dataset `multi_turn_dataset_2` from [CPsyCounD](https://github.com/CAS-SIAT-XinHai/CPsyCoun)
* dataset `single_turn_dataset_1` from this repo
* dataset `single_turn_dataset_2` from this repo
**Role-play**
* dataset `aiwei` from this repo
* dataset `tiangou` from this repo
* dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)

deploy/Imdeploy_EN.md Normal file

@@ -0,0 +1,227 @@
![](../assets/emoxlmdeploy.png)
# Local deployment of LMDeploy
## 1. Environment configuration
<details>
<summary>Specific deployment environment</summary>
Package Version
------------------------- -----------
accelerate 0.27.2
addict 2.4.0
aiofiles 23.2.1
aiohttp 3.9.3
aiosignal 1.3.1
aliyun-python-sdk-core 2.14.0
aliyun-python-sdk-kms 2.16.2
altair 5.2.0
annotated-types 0.6.0
anyio 4.2.0
async-timeout 4.0.3
attrs 23.2.0
blinker 1.7.0
Brotli 1.0.9
cachetools 5.3.3
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 2.0.4
click 8.1.7
contourpy 1.2.0
crcmod 1.7
cryptography 41.0.3
cycler 0.12.1
datasets 2.17.0
dill 0.3.8
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.109.2
ffmpy 0.3.2
filelock 3.13.1
fire 0.5.0
flash-attn 2.4.2
fonttools 4.49.0
frozenlist 1.4.1
fsspec 2023.10.0
fuzzywuzzy 0.18.0
gitdb 4.0.11
GitPython 3.1.42
gmpy2 2.1.2
gradio 3.50.2
gradio_client 0.6.1
h11 0.14.0
httpcore 1.0.3
httpx 0.26.0
huggingface-hub 0.20.3
idna 3.4
importlib-metadata 6.11.0
importlib-resources 6.1.1
Jinja2 3.1.2
jmespath 0.10.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lmdeploy 0.2.4
markdown-it-py 3.0.0
MarkupSafe 2.1.1
matplotlib 3.8.3
mdurl 0.1.2
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mmengine-lite 0.10.3
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.26.2
nvidia-cublas-cu11 11.11.3.6
nvidia-cuda-runtime-cu11 11.8.89
nvidia-nccl-cu11 2.19.3
openxlab 0.0.34
orjson 3.9.14
oss2 2.17.0
packaging 23.2
pandas 2.2.0
peft 0.8.2
Pillow 9.5.0
pip 23.3.1
platformdirs 4.2.0
protobuf 4.25.3
psutil 5.9.8
pyarrow 15.0.0
pyarrow-hotfix 0.6
pybind11 2.11.1
pycparser 2.21
pycryptodome 3.20.0
pydantic 2.6.1
pydantic_core 2.16.2
pydeck 0.8.1b0
pydub 0.25.1
Pygments 2.17.2
Pympler 1.0.1
pynvml 11.5.0
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.9
pytz 2023.4
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0.1
referencing 0.33.0
regex 2023.12.25
requests 2.28.2
rich 13.4.2
rpds-py 0.18.0
safetensors 0.4.2
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 60.2.0
shortuuid 1.0.11
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
starlette 0.36.3
streamlit 1.24.0
sudo 1.0.0
sympy 1.11.1
tenacity 8.2.3
termcolor 2.4.0
tiktoken 0.6.0
tokenizers 0.15.2
toml 0.10.2
tomli 2.0.1
toolz 0.12.1
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
tornado 6.4
tqdm 4.65.2
transformers 4.37.1
triton 2.2.0
typing_extensions 4.9.0
tzdata 2024.1
tzlocal 4.3.1
urllib3 1.26.18
uvicorn 0.27.1
validators 0.22.0
watchdog 4.0.0
websockets 11.0.3
wheel 0.41.2
xxhash 3.4.1
yapf 0.40.2
yarl 1.9.4
zipp 3.17.0
</details>
lmdeploy is not installed yet, so we will install it manually next; installing the latest stable version is recommended. If you use the InternStudio development environment, run the following commands first, otherwise an error will occur.
```
# Resolved ModuleNotFoundError: No module named 'packaging' problem
pip install packaging
# Use flash_attn's precompiled package to solve slow installation problems
pip install /root/share/wheels/flash_attn-2.4.2+cu118torch2.0cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
```
The default installation includes only the runtime dependencies, but here we also need deployment and quantization, so install with [all]. You can then inspect the lmdeploy package again:
```
pip install 'lmdeploy[all]==v0.1.0'
```
However, lmdeploy v0.1.0 does not support TurboMind conversion for InternLM2-7B-chat.
Note that lmdeploy needs to be upgraded:
```
# We used lmdeploy version 0.2.4
pip install --upgrade lmdeploy
```
## 2. Model conversion
To use a TurboMind inference model, the model must first be converted into TurboMind format. Online and offline conversion are supported: online conversion loads a Huggingface model directly, while offline conversion saves the converted model before loading it.
TurboMind is an efficient inference engine for LLM inference, based on NVIDIA's FasterTransformer. Its main features include support for LLaMa-structured models, a persistent batch inference mode, and a scalable KV cache manager.
### 2.1 Online conversion
lmdeploy supports reading Huggingface model weights directly. Currently, three types are supported:
- models quantized by lmdeploy on huggingface.co, namely llama2-70b-4bit and internlm-chat-20b-4bit
- other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
An example is as follows:
```
# Requires a network environment with access to Huggingface
lmdeploy chat turbomind internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
```
The two commands above show how to load a Huggingface model directly: the first loads a version quantized with lmdeploy, and the second loads another LLM model.
We can also launch the local Huggingface model directly, as shown below.
```
lmdeploy chat turbomind /EmoLLM --model-name internlm2-chat-7b
```
The preceding command starts a local dialogue interface; you can talk to the LLM through Bash.
### 2.2 Offline conversion
The offline transformation requires converting the model to the lmdeploy TurboMind format before starting the service, as shown below.
```
# Convert the model (FastTransformer format: TurboMind)
lmdeploy convert internlm2-chat-7b /EmoLLM
```
Upon completion, a workspace folder will be generated in the current directory. These are the files that TurboMind and Triton need for "model inference."
## 3. Run locally
### 3.1 TurboMind Inference + Command line local dialog
After the model transformation is complete, we have the conditions to use model inference, and then we can proceed to the real model inference.
Let's try Bash Local Chat first, and then use Local Chat to call TurboMind instead of API Server. In simple terms, TurboMind is executed directly by command line code. So, there is a difference between the actual architecture diagram and the previous one.
There are several ways to run it, such as TurboMind, PyTorch, and DeepSpeed. The PyTorch and DeepSpeed options actually go through Huggingface's Transformers package: PyTorch means the native Transformers path, while DeepSpeed uses DeepSpeed as the inference framework. The PyTorch/DeepSpeed paths are currently immature and not production-ready, so they are not recommended.
Run the following command.
```
# Turbomind + Bash Local Chat
lmdeploy chat turbomind ./workspace
```
To exit, type exit and press Enter twice. At this point the server is the locally running model (TurboMind), and the command line acts as the front end.
### 3.2 TurboMind Inference + API service
In the section above, we started the client directly from the command line. Next, let's look at how to serve the model with lmdeploy.
First, start the service with the following command.
```
lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port ${server_port} --tp 1
```
For details, please see the [documentation](https://lmdeploy.readthedocs.io/zh-cn/stable/serving/restful_api.html)
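Once the server is running, you can query it over HTTP. Below is a minimal sketch, assuming the server was started with `--server-port 23333` and that your lmdeploy version exposes the OpenAI-style `/v1/chat/completions` route; check the documentation linked above for the exact routes and payloads of your version.
```python
import requests

# Hypothetical client call against the api_server started above.
resp = requests.post(
    "http://0.0.0.0:23333/v1/chat/completions",   # match your --server-port
    json={
        "model": "internlm2-chat-7b",
        "messages": [{"role": "user", "content": "Lately I feel anxious and can't sleep."}],
        "temperature": 0.8,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```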


@@ -1,6 +1,16 @@
![](../assets/emoxlmdeploy.png)
# Local deployment with LMDeploy
## 0. About LMDeploy
LMDeploy, developed jointly by the [MMDeploy](https://github.com/open-mmlab/mmdeploy) and [MMRazor](https://github.com/open-mmlab/mmrazor) teams, is a full suite of lightweighting, deployment, and serving solutions covering LLM tasks. This powerful toolbox provides the following core features:
- Efficient inference: LMDeploy implements key features such as Persistent Batch (i.e. Continuous Batch), Blocked K/V Cache, dynamic split-and-fuse, tensor parallelism, and high-performance compute kernels. Its inference performance is 1.8x that of vLLM.
- Reliable quantization: LMDeploy supports weight quantization and k/v quantization, and 4-bit model inference is 2.4x as efficient as FP16. The reliability of the quantized models has been fully validated through OpenCompass evaluation.
- Convenient serving: with its request-distribution service, LMDeploy supports multi-model inference services across multiple machines and GPUs.
- Stateful inference: by caching the attention k/v of multi-turn dialogues, it remembers the conversation history and avoids reprocessing historical sessions, which significantly improves efficiency in long-text multi-turn dialogue scenarios.
## 1. Environment configuration
<details>
<summary>Specific deployment environment</summary>
@@ -168,17 +178,19 @@ pip install /root/share/wheels/flash_attn-2.4.2+cu118torch2.0cxx11abiTRUE-cp310-
```
pip install 'lmdeploy[all]==v0.1.0'
```
However, lmdeploy v0.1.0 does not support TurboMind conversion for InternLM2-7B-chat
EmoLLM is trained from InternLM2, but lmdeploy v0.1.0 does not support TurboMind conversion for InternLM2-7B-chat
Note that lmdeploy needs to be upgraded:
```
# We use lmdeploy version 0.2.4
pip install --upgrade lmdeploy
```
## 2. Model conversion
To run inference with a TurboMind model, you must first convert the model into TurboMind format. Online and offline conversion are supported: online conversion loads a Huggingface model directly, while offline conversion saves the converted model first and then loads it.
To run inference with TurboMind, the inference engine in LMDeploy, you must first convert the model into TurboMind format. Online and offline conversion are supported: online conversion loads a Huggingface model directly, while offline conversion saves the converted model first and then loads it.
TurboMind is an efficient inference engine for LLM inference, developed on top of NVIDIA's FasterTransformer. Its main features include support for LLaMa-structured models, a persistent batch inference mode, and an extensible KV cache manager.
The TurboMind architecture is shown below:
![turbomind architecture](../assets/turbomind结构.png)
### 2.1 Online conversion
lmdeploy supports reading Huggingface model weights directly. Currently three types are supported:
@@ -200,7 +212,7 @@ lmdeploy chat turbomind /EmoLLM --model-name internlm2-chat-7b
### 2.2 Offline conversion
Offline conversion requires converting the model into the lmdeploy TurboMind format before starting the service, as shown below.
```
# Convert the model (FastTransformer format: TurboMind)
lmdeploy convert internlm2-chat-7b /EmoLLM
```
After execution, a ```workspace``` folder is generated in the current directory, containing the files that TurboMind and Triton need for "model inference".
@@ -225,3 +237,164 @@ lmdeploy chat turbomind ./workspace
```
lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port ${server_port} --tp 1
```
For details, see the [documentation](https://lmdeploy.readthedocs.io/zh-cn/stable/serving/restful_api.html)
## 4. Model quantization
Model quantization mainly covers KV Cache quantization and model weight quantization. Quantization is a strategy that trades lower precision of parameters or intermediate results for memory savings (and the performance gains that come with them).
Prerequisite concepts:
- Compute-bound: most of the inference time is spent on numerical computation; compute-bound scenarios can be sped up by using faster compute units.
- Memory-bound: most of the inference time is spent reading data; such scenarios are usually optimized by reducing the number of memory accesses, raising the compute-to-memory-access ratio, or lowering the volume of data accessed.
Because of their decoder-only architecture, common LLMs spend most of their inference time in the token-by-token generation (decoding) stage, a typical memory-bound scenario.
To tackle the memory-bound problem in LLM inference, we can use **KV Cache quantization** and **4-bit weight-only quantization (W4A16)**. KV Cache quantization quantizes the intermediate K and V context results produced during token-by-token decoding to INT8 (dequantizing them again at compute time), which lowers memory usage during generation. 4-bit weight quantization converts FP16 model weights to INT4; at kernel compute time the memory traffic drops to 1/4 of the FP16 model, greatly reducing the memory-access cost. "Weight only" means that only the weights are quantized and computation still uses FP16 (the INT4 weights must be dequantized first).
### 4.1 KV Cache quantization
#### 4.1.1 Quantization steps
KV Cache quantization turns the K/V of already generated sequences into INT8. The procedure consists of three steps.
Step 1: compute minmax statistics. The idea is to collect statistics of the computation results at different positions in each layer for the given input samples.
- For attention K and V: take the per-head, per-dimension maximum, minimum, and absolute maximum over all tokens. For each layer, each of the three statistics is a `(num_heads, head_dim)` matrix. These statistics are used for the KV Cache in this subsection.
- For each layer's input: take the per-dimension maximum, minimum, mean, absolute maximum, and absolute mean. Every position of every layer has its own statistics, mostly `(hidden_dim, )` vectors; in FFN layers the widths differ, since the structure first widens and then narrows back. These statistics are used for the weight quantization in the next subsection, mainly in the scaling step.
Run step 1 as follows:
```bash
# compute minmax
lmdeploy lite calibrate \
  --model /EmoLLM \
  --calib_dataset "c4" \
  --calib_samples 128 \
  --calib_seqlen 2048 \
  --work_dir ./quant_output
```
This command selects 128 input samples, each 2048 tokens long, from the C4 dataset; running them through the model produces the statistics above. Note that if GPU memory is insufficient, you can reduce the number of samples or the sample length.
> This step downloads the "c4" dataset from Huggingface, which often fails in mainland China. Users on InternStudio need to replace the dataset-loading code, in two steps:
>
> - Step 1: copy `calib_dataloader.py` into the install directory to replace the file: `cp /root/share/temp/datasets/c4/calib_dataloader.py /root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/utils/`
> - Step 2: copy the dataset (c4) to the directory below: `cp -r /root/share/temp/datasets/c4/ /root/.cache/huggingface/datasets/`
Step 2: derive the quantization parameters from the minmax statistics. This mainly uses the formulas below to obtain each layer's K/V zero point (zp) and scale:
```bash
zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp
```
With these two values you can quantize and dequantize: store the quantized values of the historical K and V, and dequantize them at use time.
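As a sanity check, the zp/scale arithmetic above is easy to reproduce in plain Python. This is only an illustration of the formulas, not lmdeploy's actual kernel code:
```python
import numpy as np

# Toy K tensor standing in for one head's cached keys.
k = (np.random.randn(128) * 3).astype(np.float32)

zp = (k.min() + k.max()) / 2          # zero point
scale = (k.max() - k.min()) / 255     # spread mapped onto 256 INT8 bins

q = np.clip(np.round((k - zp) / scale), -128, 127).astype(np.int8)  # quant (stored)
f = q.astype(np.float32) * scale + zp                               # dequant (at use time)

print("max abs error:", np.abs(f - k).max())  # bounded by about scale / 2
```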
Run step 2 as follows:
```bash
# derive quantization parameters from minmax
lmdeploy lite kv_qparams \
  --work_dir ./quant_output  \
  --turbomind_dir workspace/triton_models/weights/ \
  --kv_sym False \
  --num_tp 1
```
In this command, `num_tp` (introduced earlier) is the tensor parallelism degree. Each layer's zero points and scales are stored in the `workspace` parameter directory for later use. When `kv_sym` is `True`, an alternative (symmetric) quantization method is used, which relies on the absolute maximum from step 1 instead of the minimum and maximum.
Step 3: modify the configuration, i.e. edit `weights/config.ini` (the KV INT8 switch) and set `quant_policy` to 4.
One extra note for this step: if you are using TurboMind 1.0, you also need to change the parameter `use_context_fmha` to 0.
You can now run all the previous services as before, except that KV Cache quantization is enabled, saving (runtime) GPU memory.
### 4.2 W4A16 quantization
#### 4.2.1 Quantization steps
The "A" in W4A16 stands for Activation, which stays FP16; only the weights are quantized to 4 bits. The procedure can also be viewed as three steps.
Step 1: same as 4.1.1, not repeated here.
Step 2: quantize the weight model. Use the statistics obtained in step 1 to quantize the parameters, in two sub-steps:
- scale the parameters;
- quantize as a whole.
Run step 2 as follows:
```bash
# quantize the weight model
lmdeploy lite auto_awq \
  --model /EmoLLM \
  --w_bits 4 \
  --w_group_size 128 \
  --work_dir ./quant_output
```
In the command, `w_bits` is the quantization bit width, `w_group_size` is the group size for quantization statistics, and `work_dir` is where the quantized model is written. A special note: because there is no `torch.int4`, eight 4-bit weights are packed into one int32 value for storage, so if you load these quantized parameters you will find they are of type int32.
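To make the "no `torch.int4`" point concrete, here is an illustrative sketch of packing eight 4-bit values into one int32 word. The exact bit layout lmdeploy/AWQ uses may differ; this only shows why the loaded tensors appear as int32:
```python
import struct

def pack_int4(vals):
    """Pack eight 4-bit values (0..15) into one signed 32-bit word."""
    word = 0
    for i, v in enumerate(vals[:8]):
        word |= (v & 0xF) << (4 * i)          # 4 bits per value
    # reinterpret the 32 raw bits as a signed int32, which is how they are stored
    return struct.unpack("<i", struct.pack("<I", word))[0]

def unpack_int4(word):
    w = struct.unpack("<I", struct.pack("<i", word))[0]
    return [(w >> (4 * i)) & 0xF for i in range(8)]

vals = [3, 15, 0, 7, 9, 1, 12, 5]
packed = pack_int4(vals)                      # one int32 holds all eight weights
assert unpack_int4(packed) == vals
```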
Last step: convert to TurboMind format.
```bash
# convert the model layout; output goes to the default path ./workspace
lmdeploy convert internlm2-chat-7b ./quant_output \
    --model-format awq \
    --group-size 128
```
This `group-size` is the `w_group_size` from the previous step. If you do not want to overwrite the earlier `workspace`, specify an output directory with `--dst_path`, for example:
```bash
lmdeploy convert internlm2-chat-7b ./quant_output \
    --model-format awq \
    --group-size 128 \
    --dst_path ./workspace_quant
```
As in the previous section, you can now run all the services as before, this time with the quantized model.
One final note: weight quantization and KV Cache quantization can be used together, for maximum memory savings.
### 4.3 Best practices
First, be clear that serving and quantization are not directly related. The main purpose of quantization is to reduce memory usage, which comes in two parts: model weights and intermediate computation results. The former corresponds to W4A16 quantization, the latter to KV Cache quantization.
Besides reducing memory, quantization usually also improves performance, since lower-precision floating point computes more efficiently than higher precision, and integers much more so.
So our advice is: try the various configurations and see whether the quality meets your needs, generally by testing on your own dataset. The concrete steps:
- Step 1: try the normal (non-quantized) version first and evaluate.
  - If the quality is insufficient, try a larger model or fine-tuning.
  - If the quality is acceptable, go to the next step.
- Step 2: try the normal version + KV Cache quantization and evaluate.
  - If the quality is insufficient, go back to the previous step.
  - If the quality is acceptable, go to the next step.
- Step 3: try the quantized version and evaluate.
  - If the quality is insufficient, go back to the previous step.
  - If the quality is acceptable, go to the next step.
- Step 4: try the quantized version + KV Cache quantization and evaluate.
  - If the quality is insufficient, go back to the previous step.
  - If the quality is acceptable, adopt this configuration.
Beyond this workflow, note that which quantized version to use and which features to enable **also depends on framework and GPU support**; for example, some frameworks may not support W4A16 inference, in which case a converted model cannot be used.
From practical experience, in general:
- the higher the precision, the more memory is used and the lower the inference efficiency, but usually the better the quality;
- server-side inference usually uses the non-quantized version or half-precision, BF16, or INT8 quantization, and rarely lower-precision quantization;
- on-device inference almost always uses quantized, mostly low-precision versions, mainly because of limited compute resources.
The above is for project development. If you are just experimenting for fun:
- if resources are sufficient (having a GPU matters a lot), use the normal non-quantized version;
- if you only have a CPU (whatever the chip), try a quantized version;
- if the generated text is long and memory runs out, enable KV Cache quantization.
Choose flexibly according to your actual situation.
For more details, see the [LMDeploy official documentation](https://lmdeploy.readthedocs.io/zh-cn/latest/quantization/w4a16.html)


@@ -0,0 +1,50 @@
# EmoLLM's general evaluation
## Introduction
This document explains how to use the `eval.py` and `metric.py` scripts, which evaluate the generation results of EmoLLM, a large language model for mental health.
## Installation
- Python 3.x
- PyTorch
- Transformers
- Datasets
- NLTK
- Rouge
- Jieba
It can be installed using the following command:
```bash
pip install torch transformers datasets nltk rouge jieba
```
## Usage
### convert.py
Converts raw multi-turn conversation data into single-turn data for evaluation; a rough sketch of the idea follows.
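The actual logic lives in `convert.py`; as an illustration only, assuming the multi-turn format used elsewhere in this repo (a list of `{"conversation": [{"input": ..., "output": ...}, ...]}` records), the flattening could look like this:
```python
import json

def flatten_multi_turn(src_path, dst_path):
    """Split every turn of each multi-turn conversation into its own QA pair."""
    with open(src_path, encoding="utf-8") as f:
        records = json.load(f)
    singles = [
        {"input": turn["input"], "output": turn["output"]}
        for record in records
        for turn in record["conversation"]
    ]
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(singles, f, ensure_ascii=False, indent=2)
```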
### eval.py
The `eval.py` script is used to generate the doctor's response and evaluate it, mainly divided into the following parts:
1. Load the model and tokenizer.
2. Set test parameters, such as the number of test data and batch size.
3. Obtain data.
4. Generate responses and evaluate.
### metric.py
The `metric.py` script contains the functions that compute the evaluation metrics. They can be configured for character-level or word-level evaluation and currently include BLEU and ROUGE scores; a minimal sketch follows.
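For intuition, a minimal version of such a scoring function might look like the sketch below, built on the `jieba`, `nltk`, and `rouge` packages from the install list above; the real `metric.py` may differ in tokenization and smoothing details:
```python
import jieba
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge

def score_pair(reference, hypothesis, word_level=True):
    """Return BLEU-1..4 and ROUGE scores for one reference/hypothesis pair."""
    if word_level:
        ref = list(jieba.cut(reference))    # word level: jieba segmentation
        hyp = list(jieba.cut(hypothesis))
    else:
        ref = list(reference)               # character level
        hyp = list(hypothesis)
    smooth = SmoothingFunction().method1
    weights = [(1, 0, 0, 0), (0.5, 0.5, 0, 0),
               (1 / 3, 1 / 3, 1 / 3, 0), (0.25, 0.25, 0.25, 0.25)]
    bleu = [sentence_bleu([ref], hyp, weights=w, smoothing_function=smooth)
            for w in weights]
    rouge = Rouge().get_scores(" ".join(hyp), " ".join(ref))[0]
    return bleu, rouge
```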
## Results
Test the data in data.json with the following results:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Qwen1_5-0_5B-chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat_qlora | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
| InternLM2_7B_chat_full | 32.45% | 10.82% | 20.17% | 30.48% | 15.67% | 8.84% | 5.02% |


@@ -14,19 +14,21 @@
## Evaluation Results
* Evaluated model: [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) (InternLM2_7B_chat_qlora)
* Evaluated models:
  * [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) (InternLM2_7B_chat_qlora)
  * [EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0) (InternLM2_7B_chat_full)
* Scores:
| Metric | Value |
|-------------------|------------|
| Comprehensiveness | 1.32 |
| Professionalism | 2.20 |
| Authenticity | 2.10 |
| Safety | 1.00 |
| Model | Comprehensiveness | Professionalism | Authenticity | Safety |
|-------------------|-----------------------|-------------------|-----------------|---------|
| InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
| InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |
## Comparison
* [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) improves substantially over InternLM2_7B_Chat; its ability on counseling tasks is comparable to Role-playing ChatGPT
* EmoLLM V2.0 improves over EmoLLM V1.0 on all metrics and has surpassed Role-playing ChatGPT on counseling tasks!
* EmoLLM V1.0 improves substantially over InternLM2_7B_Chat; its ability on counseling tasks is comparable to Role-playing ChatGPT
* The comparison chart comes from the paper "CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling"
![image](https://github.com/MING-ZCH/EmoLLM/assets/119648793/abc9f626-11bc-4ec8-84a4-427c4600a720)


@@ -14,19 +14,21 @@ The evaluation method, metric, and dataset from the paper《CPsyCoun: A Report-b
## Result
* Model: [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model)(InternLM2_7B_chat_qlora)
* Model:
* [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) (InternLM2_7B_chat_qlora)
* [EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0) (InternLM2_7B_chat_full)
* Score
| Metric | Value |
|-------------------|------------|
| Comprehensiveness | 1.32 |
| Professionalism | 2.20 |
| Authenticity | 2.10 |
| Safety | 1.00 |
| Model | Comprehensiveness | Professionalism | Authenticity | Safety |
|-------------------|-----------------------|-------------------|-----------------|---------|
| InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
| InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |
## Comparison
* [EmoLLM V1.0](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) is greatly improved over InternLM2_7B_Chat; its performance on counseling tasks is comparable to Role-playing ChatGPT
* EmoLLM V2.0 is greatly improved in all scores compared to EmoLLM V1.0 and surpasses the performance of Role-playing ChatGPT on counseling tasks!
* EmoLLM V1.0 is greatly improved over InternLM2_7B_Chat; its performance on counseling tasks is comparable to Role-playing ChatGPT
* The comparison results come from the paper "CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling"
![image](https://github.com/MING-ZCH/EmoLLM/assets/119648793/abc9f626-11bc-4ec8-84a4-427c4600a720)


@@ -9,11 +9,13 @@
| Qwen1_5-0_5B-chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat_qlora | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
| InternLM2_7B_chat_full | 32.45% | 10.82% | 20.17% | 30.48% | 15.67% | 8.84% | 5.02% |
## Professional Metric Evaluation
* See [Professional_evaluation.md](./Professional_evaluation.md) for the specific metrics and evaluation method
| Model | Comprehensiveness | rofessionalism | Authenticity | Safety |
| Model | Comprehensiveness | Professionalism | Authenticity | Safety |
|-------------------|-----------------------|-------------------|-----------------|---------|
| InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
| InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |

generate_data/OCR.md Normal file

@@ -0,0 +1,122 @@
# Using PaddleOCR to batch-recognize psychology-related PDF documents
## 1. Introduction
This project uses the PaddleOCR library to batch-recognize and process the text in PDF files. PaddleOCR is an open-source optical character recognition (OCR) toolkit built on the PaddlePaddle deep learning framework; it is efficient, accurate, and easy to use. This document explains how to use PaddleOCR to batch-process PDF files and extract the text they contain.
## 2. Environment preparation
### 2.1 Install PaddlePaddle
First, make sure the PaddlePaddle deep learning framework is installed. It can be installed via pip:
```bash
pip install paddlepaddle
```
Note: depending on your operating system and hardware (e.g. whether you use a GPU), you may need a specific PaddlePaddle build; see the official PaddlePaddle documentation for installation details.
### 2.2 Install PaddleOCR
Next, install the PaddleOCR library:
```bash
pip install paddleocr
```
### 2.3 Install other dependencies
The project may also need some other dependencies, such as PDF-processing libraries; install them according to the project's actual needs.
## 3. PDF preprocessing
Since PaddleOCR mainly processes image data, the PDF files must first be converted to image format. You can use a PDF-processing library (such as PyMuPDF or PDFMiner) to convert each page of a PDF into an image file saved as a separate picture.
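For example, with PyMuPDF (the `fitz` module) the page-to-image step might look like the sketch below; the `pdf_images` folder name matches the recognition code that follows, while the DPI value is just a starting point to tune. Note that the snippets below glob `*.jpg`, so adjust the output extension or the glob accordingly:
```python
import os
import fitz  # PyMuPDF: pip install pymupdf

def pdf_to_images(pdf_path, out_dir="pdf_images", dpi=200):
    """Render every page of a PDF to an image file for OCR."""
    os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_number, page in enumerate(doc):
        pix = page.get_pixmap(dpi=dpi)                      # rasterize the page
        pix.save(os.path.join(out_dir, f"page_{page_number:04d}.png"))
    doc.close()
```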
## 4. Batch recognition
### 4.1 Load the PaddleOCR model
First, import the PaddleOCR library and load a pretrained model:
```python
from paddleocr import PaddleOCR, draw_ocr
# initialize the OCR model (using the default English model)
ocr = PaddleOCR(use_angle_cls=True, lang='en')
```
Note: the `lang` parameter selects the recognition language; choose the language model you need.
### 4.2 Batch-recognize the PDF images
Next, write code that iterates over all converted PDF image files and recognizes the text with PaddleOCR:
```python
import os
import glob
import cv2
# assume the PDF page images are saved in the "pdf_images" folder
image_dir = "pdf_images"
image_list = glob.glob(os.path.join(image_dir, "*.jpg"))  # adjust the file extension as needed
results = []
for img_path in image_list:
    # read the image
    img = cv2.imread(img_path)
    # run the OCR model
    result = ocr.ocr(img, use_gpu=False)
    # collect the recognition result
    results.append((img_path, result))
```
### 4.3 Process the recognition results
The recognition result `result` is a list in which each element is a tuple of text and position information. You can process these results further as needed, e.g. extract the text or save it to a file.
## 5. Notes
- Make sure the text in the PDF files is legible, to improve recognition accuracy.
- Adjust the PaddleOCR model's parameters and configuration as needed to obtain better recognition results.
- When processing a large number of PDF files, make sure system resources are sufficient, to avoid problems caused by insufficient memory or compute.
## 6. Example code
The following simple example shows how to batch-recognize PDF files with PaddleOCR:
```python
# import the necessary libraries
import cv2
from paddleocr import PaddleOCR, draw_ocr
import os
import glob
# initialize the OCR model
ocr = PaddleOCR(use_angle_cls=True, lang='en')
# set the PDF image folder path
image_dir = "pdf_images"
# get the list of PDF image files
image_list = glob.glob(os.path.join(image_dir, "*.jpg"))  # adjust the file extension as needed
# batch recognition
results = []
for img_path in image_list:
    # read the image
    img = cv2.imread(img_path)
    # run the OCR model
    result = ocr.ocr(img, use_gpu=False)
    # visualize the result (optional)
    # image_show = draw_ocr(img, result, font_path='./doc/fonts/simfang.ttf')
    # cv2.imshow('ocr_result', image_show)
    # cv2.waitKey(0)
    # cv2.destroyAllWindows()
    # collect the recognition result
    results.append((img_path, result))
# process the recognition results
```
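The example stops short of the final step. One way to finish it, extracting the recognized text lines and writing them to txt files, is sketched below; note that the exact nesting of `result` can vary between PaddleOCR versions, so inspect one result before relying on this layout:
```python
# Continuation sketch: flatten the OCR results into plain-text files.
for img_path, result in results:
    lines = []
    for page in result:                 # PaddleOCR may nest results per image
        for item in page:
            text, confidence = item[1]  # item = [bounding_box, (text, score)]
            lines.append(text)
    txt_path = os.path.splitext(img_path)[0] + ".txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
```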


@@ -1,20 +1,22 @@
# EMO mental health large model fine-tuning data generation tutorial
# EmoLLM fine-tuning data generation tutorial
**I. Objectives and background**
To give our mental health model better expressive quality, we must have high-quality datasets. To achieve this goal, we decided to use four powerful AI large models (Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu AI) to generate conversation data. In addition, we will deepen the dataset's cognitive content by adding a small self-cognition dataset, improving the model's generalization ability.
To give our mental health model better expressive quality, we must have high-quality datasets. To achieve this goal, we decided to use four powerful Chinese large models (Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu GLM) to generate conversation data. In addition, we will deepen the dataset's cognitive content by adding a small self-cognition dataset, improving the model's generalization ability.
**II. Dataset generation method**
1. **Model selection and data preparation**
   Choose the four large language models Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu AI; obtain the APIs for calling the corresponding interfaces; and prepare to generate dialogue data.
2. **Single-turn and multi-turn dialogue data generation**
   Choose the four large language models Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu GLM; obtain the APIs for calling the corresponding interfaces; and prepare to generate dialogue data.
3. **Single-turn and multi-turn dialogue data generation**
   Using these four models, we generated 10,000 single-turn and multi-turn dialogue samples, ensuring the diversity, complexity, and validity of the data.
   Because mental activity is often complex, and to guarantee data diversity, we selected 16 * 28 = `448` scenarios for dataset generation; for the specific scenario names, see the configuration of the `emotions_list` and `areas_of_life` parameters in config.yml.
3. **Inclusion of a self-cognition dataset**
5. **Inclusion of a self-cognition dataset**
   To strengthen the model's cognitive ability, we deliberately added a self-cognition dataset. It helps the model better understand context and improves the naturalness and coherence of conversations.
@@ -22,40 +24,40 @@
1. **Initialization**
   * Install the required software and libraries:
   ```bash
   pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
   ```
   * Prepare the input data and configuration parameters: see `config.yml` (fully commented)
2. **Model selection and configuration**
   * Select a suitable model for your needs: so that everyone can play with large models, we chose InternLM2-7B as our baseline model (it can even be deployed and fine-tuned on a consumer GPU)
   * Configure and adjust the model as necessary: fine-tune with XTuner according to our dataset and configuration strategy
3. **Data generation**
   * Generate data with Tongyi Qianwen:
   ```bash
   # run in a terminal
   bash run_qwen.bash
   ```
   * Generate data with Baidu Wenxin:
   ```bash
   # run in a terminal
   python ernie_gen_data.py
   ```
   * Generate data with Zhipu AI:
   * Generate data with Zhipu GLM:
   ```bash
   # run in a terminal
   python zhipuai_gen_data.py
   ```
   * Generate data with iFlytek Spark:
   ```bash
   # run in a terminal
   python ./xinghuo/gen_data.py
@@ -63,7 +65,7 @@
4. **Integration of the self-cognition dataset**
   * The self-cognition dataset, well, you have to write it by hand following the format~ like below:
   * The self-cognition dataset must be generated manually, following the format below:
```json
[
{
@@ -85,16 +87,17 @@
]
```
5. **Dataset integration**
   Before integrating the datasets, check the generated data for formatting errors, type mismatches, and the like. Use check.py to check the data, then use merge_json.py to merge all the json files into one overall json file.
6. **Evaluation and optimization**
7. **Evaluation and optimization**
   * Evaluate the generated dataset with appropriate evaluation metrics
   * Optimize and adjust as necessary based on the evaluation results
7. **Testing and deployment**
   * Evaluate the trained model on an independent test set
   * Adjust and optimize as necessary based on the test results
   * Deploy the final model in a real application


@@ -0,0 +1,106 @@
# EmoLLM fine-tuning data generation tutorial
**I. Objectives and Background**
In order to give our large mental health model better expressiveness, we must have high-quality datasets. To achieve this goal, we decided to use four powerful Chinese large models: **Wenxin Yiyan**, **Tongyi Qianwen**, **iFlytek Spark**, and **Zhipu GLM** to generate conversation data. In addition, we will enhance the cognitive depth of the dataset and improve the generalization ability of the model by adding a small self-cognition dataset.
**II. Dataset generation method**
1. **Model selection and data preparation**
   Choose four large language models, namely Wenxin Yiyan, Tongyi Qianwen, iFlytek Spark, and Zhipu GLM; obtain the APIs for calling the corresponding interfaces; and prepare to generate dialogue data.
2. **Single-turn and multi-turn dialogue data generation**
   Using these four models, we generated 10,000 single-turn and multi-turn conversation samples. In doing so, we ensured the diversity, complexity, and validity of our data.
   Because mental activity is often complex, and to ensure the diversity of the data, we selected 16 * 28 = `448` scenarios for dataset generation. For the specific scenario names, please refer to the configuration of the two parameters `emotions_list` and `areas_of_life` in config.yml.
3. **Inclusion of a self-cognition dataset**
In order to enhance the cognitive ability of the model, we specially added a part of self-cognitive dataset. These datasets help the model better understand the context and improve the naturalness and coherence of the conversation.
**III. Practical steps**
1. **Initialize**
* Install the required software and libraries
```bash
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```
* Prepare input data and configuration parameters
See `config.yml` for annotations
2. **Model selection and configuration**
* Select the right model for your needs
In order to enable everyone to play with the large model, we chose InternLM2-7B as our baseline model (it can even be deployed and fine-tuned on consumer GPUs).
* Make necessary configurations and adjustments to the model
Use XTuner for fine-tuning based on our dataset and configuration strategy.
3. **Data generation**
* Data generation using Tongyi Qianwen
```bash
# Terminal operation
bash run_qwen.bash
```
* Data generation using Wenxin Yiyan
```bash
# Terminal operation
python ernie_gen_data.py
```
* Data generation using Zhipu GLM
```bash
# Terminal operation
python zhipuai_gen_data.py
```
* Data generation using iFlytek Spark
```bash
# Terminal operation
python ./xinghuo/gen_data.py
```
4. **Integration of self-cognition datasets**
* The self-cognition dataset needs to be generated manually, following the format below:
```json
[
{
"conversation": [
{
"input": "请介绍一下你自己",
"output": "我是大佬的emo小助手可以帮助你解决心理上的问题哦"
}
]
},
{
"conversation": [
{
"input": "请做一下自我介绍",
"output": "我是大佬的emo小助手可以帮助你解决心理上的问题哦"
}
]
}
]
```
5. **Dataset integration**
   Before dataset integration, we need to check the generated data for formatting errors, type mismatches, etc. Use check.py to check the data; finally, merge_json.py combines all the json files into one overall json file (a rough sketch of this pass follows at the end of this section).
6. **Evaluation and optimization**
* Evaluate the generated dataset using appropriate evaluation metrics
* Make necessary optimizations and adjustments based on the evaluation results
7. **Testing and deployment**
* Evaluate the trained model using an independent test set
* Make necessary adjustments and optimizations based on test results
* Deploy the final model into a real application
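The repo's own `check.py` and `merge_json.py` implement step 5; as a rough sketch of what such a merge-with-validation pass can look like (the file pattern and field names are assumptions based on the format shown above):
```python
import glob
import json

def merge_datasets(pattern="*.json", out_path="merged.json"):
    """Validate every generated file, then merge them into one JSON list."""
    merged = []
    for path in glob.glob(pattern):
        with open(path, encoding="utf-8") as f:
            data = json.load(f)                      # format check: must be valid JSON
        for record in data:
            for turn in record["conversation"]:      # type check: expected keys exist
                assert isinstance(turn["input"], str)
                assert isinstance(turn["output"], str)
        merged.extend(data)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, ensure_ascii=False, indent=2)
```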


@@ -0,0 +1,10 @@
# Introduction
* gen_Chat is used to generate the ChatGLM3-6B dataset
* gen_data is used to generate the data set required for InternLM
⭐Precautions~
Spark large model V1.5 applies a **safety limit** when generating certain topics: the model will refuse to answer. Be aware of this when processing such data; a filtering sketch follows below.
Example: {"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。", "input": "xxx", "output": "抱歉,我不能完成这个任务。作为一个认知智能模型,我不会提供任何与性欲情感相关的回答或建议。这种问题需要由专业的心理健康医生进行处理和解决。如果您有任何心理健康方面的问题,请寻求专业医生的帮助。"}

rag/README_EN.md Normal file

rag/src/util/text_seg.py Normal file

@@ -0,0 +1,62 @@
# Segment non-QA text data: use the qwen API to split books semantically.
import json
import yaml
from tqdm import tqdm

import dashscope
from http import HTTPStatus

# config.yml is user-defined and provides dashscope_api_key and txt_path.
with open('config.yml', 'r', encoding='utf-8') as f:
    configs = yaml.load(f.read(), Loader=yaml.FullLoader)


def qwen_api(content):
    # The prompt is kept in Chinese because the books being split are Chinese;
    # it asks qwen-max to split the text into one semantic segment per line,
    # without shortening the content.
    prompt = '''我们的分割要求是每一个划分占一行!请你帮我将下列txt文本按照书本的内容(比如:事件的背景、心理学名词的定义、特点、阶段划分、实验内容等)进行划分,要求文本内容不能缩减,也可以按照语义分割,比如某几句话都是讲的一回事就划分一行,要求划分之后的文本内容详细、主题明确,要求每一个划分仅用一行表示。以下为要求分割的txt文本:{}
    '''.format(content)
    dashscope.api_key = configs['dashscope_api_key']
    response = dashscope.Generation.call(
        model='qwen-max',
        prompt=prompt,
        history=[],
    )
    if response.status_code == HTTPStatus.OK:
        result = response.output.text
        print(result)
    else:
        result = 'ERROR'
    return result


def save_jsonl(data_lis, file_path):
    # Write a list of dicts to a jsonl file, one dict per line (helper, currently unused).
    with open(file_path, 'at', encoding='utf-8') as file:
        for item in data_lis:
            file.write(json.dumps(item, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    # Read the whole book as a single string.
    with open(configs['txt_path'], 'r', encoding='utf-8') as f:
        text = f.read()
    # Step by 2500 characters but read 3500-character windows, so consecutive
    # chunks overlap and no segment is cut off mid-thought.
    for i in tqdm(range(0, len(text), 2500)):
        content = text[i:i + 3500]
        answer = qwen_api(content)
        # Append each answer (one segment per line) to the output file.
        with open('seg1.txt', 'a', encoding='utf-8') as out:
            out.write(answer)


@@ -1,37 +1,95 @@
# QA Generation Pipeline
# RAG database construction pipeline
## 1. Usage
## **Purpose**
1. Check that the dependencies in `requirements.txt` are satisfied.
2. Adjust the `system_prompt` in the code to match the latest version of the repo, to guarantee the diversity and stability of the generated QA.
3. Put the txt files into the `data` folder at the same level as `model`.
4. Configure the required API KEY in `config/config.py` and start from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.
Build QA knowledge pairs from professional psychology books to give RAG a counseling knowledge base, making EmoLLM's answers more professional and reliable. To achieve this, we use dozens of psychology books to build the RAG knowledge base. The main construction pipeline is as follows:
### 1.1 How to get an API KEY
## **Construction pipeline**
Currently only qwen is included.
## **Step 1: PDF to TXT**
#### 1.1.1 Qwen
- Purpose
  - Convert the collected PDF psychology books into TXT text files, to ease subsequent information extraction
Go to [DashScope API-KEY management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.
- Required tools
## 2. Notes
  - [pdf2txt](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/pdf2txt.py)
### 2.1 System Prompt
  - [PaddleOCR PDF-processing usage reference](https://github.com/SmartFlowAI/EmoLLM/blob/main/generate_data/OCR.md)
  - Install the necessary python libraries:
```bash
pip install paddlepaddle
pip install opencv-python
pip install paddleocr
```
Note that the current parsing scheme assumes the model generates markdown-wrapped json blocks; when changing the system prompt, keep this invariant.
- Notes
  - If you cannot install paddleocr with **pip install paddleocr**, consider installing from a whl file: [download address](https://pypi.org/project/paddleocr/#files)
  - Start the script from the command line: python pdf2txt.py [name of the folder holding the PDFs]
### 2.2 Sliding Window
## **Step 2: Screen the PDFs**
The sliding window's `window_size` and `overlap_size` can both be changed in the `get_txt_content` function in `util/data_loader.py`. Currently it is a sliding window over sentence splits.
- Screening purpose
### 2.3 Corpus Format
  - Use an LLM to remove non-professional psychology books
Currently only the txt format is supported; put the cleaned book texts in the `data` folder, and the program will recursively scan all txt files under it.
  - Screening criteria: the book contains counseling-related content, such as:
## TODO
    - school of counseling - specific counseling methods
    - mental illness - disease characteristics
    - mental illness - treatment methods
1. Support more models (Gemini, GPT, ChatGLM, ...)
2. Support multi-threaded model calls
3. Support more text formats (PDF, ...)
4. Support more ways of splitting text
- Screening method:
  - Initial screening by title
  - If you cannot tell whether a book is counseling-related, use kimi/GLM-4 to check whether it contains counseling-related knowledge (querying one book at a time is recommended)
  - ```markdown
    Reference prompt:
    You are an experienced psychology professor, familiar with psychology and psychological counseling. I need your help with the task of "identifying whether a book contains counseling knowledge". Please take a deep breath, think step by step, and give your answer. If your answer satisfies me, I will tip you 100k!
    The task is as follows:
    Determine whether the book contains the following counseling-related knowledge:
    '''
    school of counseling - specific counseling methods
    mental illness - disease characteristics
    mental illness - treatment methods
    '''
    Please take a deep breath, go through the book step by step, and complete the task carefully.
    ```
## **Step 3: Extract QA pairs**
- Use an LLM to efficiently construct QA knowledge pairs from the book content
- Extraction pipeline
  - Prepare the processed txt text data
  - Configure the [script files](https://github.com/SmartFlowAI/EmoLLM/tree/main/scripts/qa_generation) as required
  - Adjust window_size and overlap_size sensibly according to your needs or the extraction results
- Usage
  - Check that the dependencies in `requirements.txt` are satisfied.
  - Adjust the `system_prompt` in the code to match the latest version of the repo, to guarantee the diversity and stability of the generated QA.
  - Put the txt files into the `data` folder at the same level as `model`.
  - Configure the required API KEY in `config/config.py` and start from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.
- How to get an API KEY
  - Currently only qwen is included.
  - Qwen
    - Go to [DashScope API-KEY management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.
- Notes
  - System Prompt
    - Note that the current parsing scheme assumes the model generates markdown-wrapped json blocks; when changing the system prompt, keep this invariant.
  - Sliding Window
    - The sliding window's `window_size` and `overlap_size` can both be changed in the `get_txt_content` function in `util/data_loader.py`. Currently it is a sliding window over sentence splits (see the sketch after this list).
  - Corpus Format
    - Currently only the txt format is supported; put the cleaned book texts in the `data` folder, and the program will recursively scan all txt files under it.
## **步骤四清洗QA对**
- 清洗目的

View File

@ -0,0 +1,95 @@
# RAG Database Building Process
## **Purpose**
Build QA knowledge pairs from professional psychology books to give RAG a psychological counseling knowledge base, making our EmoLLM's answers more professional and reliable. To achieve this goal, we use dozens of psychology books to build this RAG knowledge base. The main building process is as follows:
## **Build process**
## **Step 1: PDF to TXT**
- Purpose
  - Convert the collected PDF psychology books into TXT text files to ease later information extraction
- Tools required
  - [pdf2txt](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/pdf2txt.py)
  - [PaddleOCR PDF processing usage reference](https://github.com/SmartFlowAI/EmoLLM/blob/main/generate_data/OCR.md)
  - Install the necessary python libraries:
  ```bash
  pip install paddlepaddle
  pip install opencv-python
  pip install paddleocr
  ```
- Precautions
  - If you are unable to install paddleocr with **pip install paddleocr**, consider installing from a whl file, [download address](https://pypi.org/project/paddleocr/#files)
  - Launch the script from the command line: python pdf2txt.py [name of the file where the PDFs are stored] (an illustrative sketch follows below)
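For orientation, the core of such a PDF-to-TXT pass usually renders each page to an image and runs OCR on it. A minimal sketch, assuming PyMuPDF (fitz) for page rendering; the repo's pdf2txt.py is the authoritative implementation and may differ:

```python
import cv2
import fitz  # PyMuPDF: pip install pymupdf — an assumption, not necessarily what pdf2txt.py uses
import numpy as np
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # Chinese recognition model

def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    doc = fitz.open(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        for page in doc:
            pix = page.get_pixmap(dpi=200)  # render the page as an RGB image
            img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
            img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)  # PaddleOCR expects BGR
            result = ocr.ocr(img, cls=True)
            for line in result[0] or []:  # each recognized line: [box, (text, confidence)]
                out.write(line[1][0] + "\n")

pdf_to_txt("book.pdf", "book.txt")
```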
## **Step 2: Screen the PDFs**
- Purpose of screening
  - Use an LLM to weed out books that are not professional psychology texts
- Screening criterion: the book contains counseling-related content, such as:
- Schools of Counseling - Specific Counseling Methods
- Mental Illness - Characteristics of the Disease
- Mental Illness - Treatment
- Screening method:
- Initial screening based on title
- If the title alone is not enough to tell whether a book is counseling-related, use kimi/GLM-4 to check whether it contains counseling-related knowledge (checking only one book at a time is recommended; a scripted version is sketched after the reference prompt below)
- ```markdown
Reference prompt:
You are an experienced psychology professor, familiar with psychology knowledge and psychological counseling. I need your help with the task "identify whether a book contains counseling knowledge"; please take a deep breath, think step by step, and give your answer. If your answer satisfies me, I will give you a 100,000 tip!
The specific task is as follows:
Determine whether the book contains the following counseling-related knowledge:
'''
Schools of Counseling - Specific Counseling Approaches
Mental Illness - Characteristics of Illness
Mental Illness - Treatment Approaches
'''
Please take a deep breath and review the book step by step and complete the task carefully.
```
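If many candidates need screening, the same check can be scripted against an LLM API. A sketch using the ZhipuAI SDK for GLM-4 — the SDK call shape, model name, prompt, and API key are all assumptions; the documented route remains the manual one-book-at-a-time check above:

```python
from zhipuai import ZhipuAI  # pip install zhipuai (assumed SDK; verify against current docs)

client = ZhipuAI(api_key="your-api-key-here")  # hypothetical placeholder key

# Condensed screening prompt, adapted from the reference prompt above.
SCREEN_PROMPT = (
    "你是一位经验丰富的心理学教授,熟悉心理学知识和心理咨询。"
    "请判断书籍《{title}》是否包含心理咨询相关知识"
    "(心理咨询流派、心理疾病特征、治疗方法等),只回答“是”或“否”。"
)

def looks_like_counseling_book(title: str) -> bool:
    resp = client.chat.completions.create(
        model="glm-4",
        messages=[{"role": "user", "content": SCREEN_PROMPT.format(title=title)}],
    )
    # startswith avoids matching "是" inside a negative answer such as "不是"
    return resp.choices[0].message.content.strip().startswith("是")

for title in ["心理咨询面谈技术", "股票投资入门"]:
    print(title, looks_like_counseling_book(title))
```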
## **Step 3: Extraction of QA pairs**
- Use an LLM to efficiently construct QA knowledge pairs from the book content
- Extraction process
  - Prepare the processed txt text data
  - Configure the [script files](https://github.com/SmartFlowAI/EmoLLM/tree/main/scripts/qa_generation) as required
- Modify window_size and overlap_size reasonably according to your own needs or extraction results.
- Usage
- Check that the dependencies in `requirements.txt` are satisfied.
- Adjust `system_prompt` in the code to ensure consistency with the latest version of the repo, to ensure diversity and stability of the generated QA.
- Place the txt file in the `data` folder in the same directory as the `model`.
- Configure the required API KEYs in `config/config.py` and start from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.
- How to get an API KEY
  - Currently only qwen is included.
  - Qwen
    - Go to [DashScope API-KEY management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.
- Precautions
- System Prompt
- Note that the current parsing scheme is based on the premise that the model generates markdown-wrapped json blocks, and you need to make sure that this remains true when you change the system prompt.
- Sliding Window
- The `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the window slides over sentence-level splits (see the sketch after this list).
  - Corpus Format
- Currently only the txt format is supported, you can put the cleaned book text in the `data` folder, and the program will recursively retrieve all the txt files in that folder.
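For intuition, a sentence-level sliding window of this kind can be sketched as follows. This is a simplified stand-in, not the actual `get_txt_content` code, and the default sizes are made up:

```python
import re

def sliding_windows(text: str, window_size: int = 8, overlap_size: int = 2):
    """Split text into sentences, then yield windows of `window_size`
    sentences, each overlapping the previous window by `overlap_size`."""
    # Split after Chinese/Western sentence-ending punctuation, keeping it attached.
    sentences = [s for s in re.split(r"(?<=[。!?.!?])", text) if s.strip()]
    step = window_size - overlap_size
    for start in range(0, len(sentences), step):
        yield "".join(sentences[start:start + window_size])
        if start + window_size >= len(sentences):
            break  # last window already covers the tail

for chunk in sliding_windows("第一句。第二句!第三句?" * 5):
    print(chunk)
```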
## **Step 4: Cleaning of QA pairs**
- Purpose of cleaning

View File

@ -0,0 +1,24 @@
You are a QA pair generator robot; you will automatically generate appropriate QA pairs from the content of the psychology book I provide. The requirements are as follows:
- For the text I gave you, you need to generate five such QA pairs
- QA should not repeat the content, and the answer should not be too long
- Answer in Simplified Chinese
- The generated QA pair needs to be wrapped in json code blocks in markdown format
Here is the reference format:
```json
[
{
"question": "...",
"answer": "..."
},
{
"question": "...",
"answer": "..."
},
...
]
```
Here is the text given:
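Downstream code can then rely on the markdown-wrapped json block this prompt demands (see the System Prompt note earlier). A minimal parsing sketch, which may differ from the repo's actual parser:

```python
import json
import re

def extract_qa_pairs(reply: str) -> list:
    """Pull the first ```json ... ``` block out of a model reply and parse it."""
    match = re.search(r"```json\s*(.*?)```", reply, re.DOTALL)
    if match is None:
        raise ValueError("no markdown-wrapped json block in model reply")
    return json.loads(match.group(1))

reply = '这是生成结果:\n```json\n[{"question": "什么是共情?", "answer": "……"}]\n```'
print(extract_qa_pairs(reply))
```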

View File

@ -0,0 +1,26 @@
You are an experienced psychologist, familiar with psychological knowledge and psychological counseling techniques. Please take a deep breath and think step by step to generate QA pairs that meet the criteria based on the psychology text content I provided.
The criteria are as follows:
- Generate 5-10 QA pairs per psychology text
- Each QA pair should pick, according to the psychology text content, the most suitable topic from "psychology knowledge; specific counseling methods; characteristics of mental illness; treatment methods for mental illness"
- QA should not repeat the content, and the answer should not be too long
- QA pairs must be in Simplified Chinese
- The generated QA pair needs to be wrapped in json code blocks in markdown format
The reference format is as follows:
```json
[
{
"question": "...",
"answer": "..."
},
{
"question": "...",
"answer": "..."
},
...
]
```
The following is the content of a given psychology text:

View File

@ -218,7 +218,8 @@ def main():
user_avator = "assets/user.png"
robot_avator = "assets/robot.jpeg"
st.title("EmoLLM-温柔御姐艾薇aiwei")
# st.title("EmoLLM-温柔御姐艾薇aiwei")
st.title("EmoLLM-艾薇aiweiAI心理咨询")
generation_config = prepare_generation_config()
@ -264,4 +265,4 @@ def main():
if __name__ == "__main__":
main()
main()