Merge pull request #1 from SmartFlowAI/main

Sync
This commit is contained in:
HongCheng 2024-03-18 22:23:25 +09:00 committed by GitHub
commit 72a7746d8b
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
19 changed files with 1092 additions and 609 deletions

2
.gitignore vendored

@ -3,6 +3,8 @@ ESConv.json
tmp/
zhipuai/
data/
pdf/
.idea/
*.jsonl
*.json

575
README.md

@ -1,287 +1,288 @@
<div align="center">
# EmoLLM - Large Language Model for Mental Health
</div>
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/logo.jpeg" alt="Logo" width="30%">
</a>
<div align="center">
<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Issues][issues-shield]][issues-url]
[![OpenXLab_App][OpenXLab_App-image]][OpenXLab_App-url]
[![OpenXLab_Model][OpenXLab_Model-image]][OpenXLab_Model-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-shield]][stars-url]
</div>
<h3 align="center">EmoLLM</h3>
<div align="center">
简体中文| <a href="README_EN.md" >English</a>
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM"><strong>Explore the documentation of this project »</strong></a>
<br />
<br />
<a href="https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0">Try EmoLLM 2.0</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">Report a Bug</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">Propose a New Feature</a>
</div>
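<!-- This README.md is intended for developers -->
**EmoLLM** is a series of large language models for mental health that support the **understand the user - support the user - help the user** counseling pipeline. The models are instruction fine-tuned from `LLM`s. Stars are welcome~⭐⭐. The `LLM` fine-tuning configurations open-sourced so far are as follows:
<div align="center">
| Model | Type |
| :-------------------: | :------: |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_1_8B_chat | full fine-tuning |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
| Qwen1_5-0_5B-Chat | full fine-tuning |
| Baichuan2_13B_chat | QLORA |
| ChatGLM3_6B | LORA |
| DeepSeek MoE_16B_chat | QLORA |
| Mixtral 8x7B_instruct | QLORA |
| …… | …… |
</div>
Everyone is welcome to contribute to this project~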
---
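The Mental Health Grand Model is a comprehensive concept that aims to fully understand and promote the mental health of individuals, groups, and society as a whole. The model typically includes the following key components:
- Cognitive factors: an individual's thought patterns, belief systems, cognitive biases, and problem-solving abilities. Cognitive factors strongly affect mental health because they shape how individuals interpret and respond to life events.
- Emotional factors: emotion regulation, emotional expression, and emotional experience. Emotional health is an important part of mental health; it covers how individuals manage and express their emotions and how they recover from negative ones.
- Behavioral factors: an individual's behavior patterns, habits, and coping strategies. This includes stress management skills, social skills, and self-efficacy, that is, confidence in one's own abilities.
- Social environment: external factors such as family, work, community, and cultural background, which affect an individual's mental health both directly and indirectly.
- Physical health: physical and mental health are closely related. Good physical health can promote mental health, and vice versa.
- Psychological resilience: an individual's capacity to recover and adapt in the face of adversity. People with strong resilience are better able to bounce back from challenges and to learn and grow from them.
- Prevention and intervention measures: the Mental Health Grand Model also includes strategies for preventing psychological problems and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
- Assessment and diagnostic tools: promoting mental health effectively requires scientific tools to assess an individual's psychological state and to diagnose possible psychological problems.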
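### 🎇Recent Updates
- 【2024.3.12】 Released [Ai Wei](https://aistudio.baidu.com/community/app/63335) on the Baidu PaddlePaddle platform
- 【2024.3.11】 **EmoLLM V2.0 improves across the board over EmoLLM V1.0 and surpasses Role-playing ChatGPT on psychological counseling tasks!** [Click to try EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0). Also updated the [dataset statistics and details](./datasets/) and the [roadmap](./assets/Roadmap_ZH.png)
- 【2024.3.9】 Added concurrency to speed up [QA pair generation](./scripts/qa_generation/) and the [RAG pipeline](./rag/)
- 【2024.3.3】 [EmoLLM V2.0, a full fine-tune of InternLM2-7B-chat, is open-sourced](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full) (requires two A100 80G GPUs); updated the professional evaluation, see [evaluate](./evaluate/); updated the PaddleOCR-based PDF-to-txt tool script, see [scripts](./scripts/)
- 【2024.2.29】 Updated the objective evaluation calculation, see [evaluate](./evaluate/); updated a series of datasets, see [datasets](./datasets/)
- 【2024.2.27】 Updated the English README and a series of datasets ("simp"-style (舔狗) and single-turn dialogue datasets)
- 【2024.2.23】 Released `Gentle Lady Psychologist Ai Wei`, based on InternLM2_7B_chat_qlora: [get the model weights](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei), [configuration file](xtuner_config/aiwei-internlm2_chat_7b_qlora.py), [online demo](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
- 【2024.2.23】 Updated [several fine-tuning configurations](/xtuner_config/); added [data_pro.json](/datasets/data_pro.json) (larger, covering more scenarios, richer content) and [aiwei.json](/datasets/aiwei.json) (dedicated to the gentle-lady role-play, with Emoji); `Gentle Lady Psychologist Ai Wei` is coming soon
- 【2024.2.18】 [Full fine-tune of Qwen1_5-0_5B-Chat open-sourced](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary); those with limited compute can give it a try~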
<details>
<summary>View More</summary>
- 【2024.2.6】 EmoLLM has reached 18.7k downloads on the [**Openxlab**](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) platform. Everyone is welcome to try it!
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="Model download count">
</p>
- 【2024.2.5】 The project was featured in a post by the WeChat official account **NLP工程化** (NLP Engineering): [post link](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A). A shout-out to the author; everyone is welcome to follow!! 🥳🥳
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/47868d6a-2e91-4aa9-a630-e594c14295b4" alt="WeChat official account QR code">
</p>
- 【2024.2.3】 [Project promotional video](https://www.bilibili.com/video/BV1N7421N76X/) completed 😊
- 【2024.1.27】 Improved the data construction documentation, fine-tuning guide, deployment guide, README, and other related documents 👏
- 【2024.1.25】 EmoLLM V1.0 is deployed online at https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
</details>
### 🎯Roadmap
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/Roadmap_ZH.png" alt="Roadmap_ZH">
</a>
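## Contents
- [EmoLLM - Large Language Model for Mental Health](#emollm---large-language-model-for-mental-health)
- [🎇Recent Updates](#recent-updates)
- [🎯Roadmap](#roadmap)
- [Contents](#contents)
- [Pre-development Configuration Requirements](#pre-development-configuration-requirements)
- [**User Guide**](#user-guide)
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- [RAG (Retrieval Augmented Generation) Pipeline](#rag-retrieval-augmented-generation-pipeline)
- [Frameworks Used](#frameworks-used)
- [How to participate in this project](#how-to-participate-in-this-project)
- [Authors (in no particular order)](#authors-in-no-particular-order)
- [Copyright Notice](#copyright-notice)
- [Acknowledgments](#acknowledgments)
- [Star History](#star-history)
- [🌟 Contributors](#-contributors)
- [Communication group](#communication-group)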
###### Pre-development Configuration Requirements
- Hardware: A100 40G (only for InternLM2_7B_chat + QLoRA fine-tuning + DeepSpeed ZeRO-2 optimization)
###### **User Guide**
1. Clone the repo
```sh
git clone https://github.com/SmartFlowAI/EmoLLM.git
```
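2. Read in order, or pick the sections that interest you:
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- [RAG](#rag-retrieval-augmented-generation-pipeline)
- View more details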
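### Data Construction
- See the [Data Construction Guide](generate_data/tutorial.md)
- The dataset used for fine-tuning is available at [datasets](datasets/data.json)
### Fine-tuning Guide
See the [fine-tuning guide](xtuner_config/README.md) for details
### Deployment Guide
- Demo deployment: see the [deployment guide](demo/README.md)
- Quantized deployment based on [LMDeploy](https://github.com/InternLM/lmdeploy/): see [deploy](./deploy/lmdeploy.md)
### RAG (Retrieval Augmented Generation) Pipeline
- See [RAG](./rag/)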
<details>
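<summary>Additional Details</summary>
### Frameworks Used
- [Xtuner](https://github.com/InternLM/xtuner): for fine-tuning
- [Transformers](https://github.com/huggingface/transformers)
- [Pytorch](https://pytorch.org/)
- [LMDeploy](https://github.com/InternLM/lmdeploy/): for quantized deployment
- [Streamlit](https://streamlit.io/): for building the demo
- [DeepSpeed](https://github.com/microsoft/DeepSpeed): for parallel training
- …
#### How to participate in this project
Contributions make the open-source community a great place to learn, inspire, and create. Any contribution you make is **greatly appreciated**.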
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
</details>
### Authors (in no particular order)
| Username | School/Organization | Remarks | Contributions |
| :----------: | :--------------------: | :-------------------: | :----------: |
| [aJupyter](https://github.com/aJupyter) | Nankai University, Master's student | DataWhale member | Project initiator |
| [jujimeizuo](https://github.com/jujimeizuo) | Jiangnan University, Master's student | | |
| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Harbin Institute of Technology (Weihai), Undergraduate student | | |
| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team Regional Director | Wenxin Large Model core developer | |
| [zxazys](https://github.com/zxazys) | Nankai University, Master's student | | |
| [MING-ZCH](https://github.com/MING-ZCH) | Huazhong University of Science and Technology, Undergraduate student | | |
| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | swufe | | |
| [MrCatAI](https://github.com/MrCatAI) | AI Mover | | |
| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora/) | MiniSora main maintainer | Data cleaning, document translation |
| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, incoming Master's student | | |
| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |
| [zealot52099](https://github.com/zealot52099) | AI Mover | | Data cleaning, RAG |
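### Copyright Notice
This project is licensed under the MIT License; see [LICENSE](https://github.com/SmartFlowAI/EmoLLM/blob/main/LICENSE) for details
### Citation
If this project helps your work, please cite it in the following format: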
```bibtex
@misc{EmoLLM,
title={EmoLLM},
author={EmoLLM},
url={https://github.com/SmartFlowAI/EmoLLM/},
year={2024}
}
```
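### Acknowledgments
- [Sanbu](https://github.com/sanbuphy)
- [Shanghai Artificial Intelligence Laboratory](https://www.shlab.org.cn/)
- [闻星 (assistant)](https://github.com/vansin)
- [扫地升 (WeChat official account promotion)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
- Abu (Master's in Psychology, Peking University)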
<!-- links -->
<!-- [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=flat-square&logo=linkedin&colorB=555 -->
<!-- [linkedin-url]: https://linkedin.com/in/aJupyter -->
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=SmartFlowAI/EmoLLM&type=Date)](https://star-history.com/#SmartFlowAI/EmoLLM&Date)
## 🌟 Contributors
[![EmoLLM contributors](https://contrib.rocks/image?repo=SmartFlowAI/EmoLLM&max=50)](https://github.com/SmartFlowAI/EmoLLM/graphs/contributors)
[your-project-path]: SmartflowAI/EmoLLM
[contributors-shield]: https://img.shields.io/github/contributors/SmartflowAI/EmoLLM.svg?style=flat-square
[contributors-url]: https://github.com/SmartflowAI/EmoLLM/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/SmartflowAI/EmoLLM.svg?style=flat-square
[forks-url]: https://github.com/SmartflowAI/EmoLLM/network/members
[stars-shield]: https://img.shields.io/github/stars/SmartflowAI/EmoLLM.svg?style=flat-square
[stars-url]: https://github.com/SmartflowAI/EmoLLM/stargazers
[issues-shield]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg?style=flat-square
[issues-url]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/SmartFlowAI/EmoLLM/blob/main/LICENSE
[OpenXLab_App-image]: https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg
[OpenXLab_Model-image]: https://cdn-static.openxlab.org.cn/header/openxlab_models.svg
[OpenXLab_App-url]: https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0
[OpenXLab_Model-url]: https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full
## Communication group
- If it has expired, please go to the Issues section
<p align="center">
<img width="30%" src="https://github.com/SmartFlowAI/EmoLLM/assets/62385492/55ecd0aa-4832-4269-ad57-4c26f9aa286b" alt="EmoLLM official communication group">
</p>


@ -1,300 +1,300 @@
<div align="center">
# EmoLLM - Large Language Model for Mental Health
</div>
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/logo.jpeg" alt="Logo" width="30%">
</a>
<div align="center">
<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Issues][issues-shield]][issues-url]
[![OpenXLab_App][OpenXLab_App-image]][OpenXLab_App-url]
[![OpenXLab_Model][OpenXLab_Model-image]][OpenXLab_Model-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-shield]][stars-url]
</div>
<h3 align="center">EmoLLM</h3>
<p align="center">
<a href="README.md">简体中文</a> | English
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM"><strong>Explore the documentation of this project »</strong></a>
<br />
<br />
<a href="https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0">EmoLLM 2.0 Demo</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">Report a Bug</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">Propose a New Feature</a>
</p>
</p>
<!-- This README.md is intended for developers -->
**EmoLLM** is a series of large language models designed to understand, support, and help users throughout mental health counseling. The models are instruction fine-tuned from LLMs. We would really appreciate a star~⭐⭐. The open-sourced fine-tuning configurations are as follows:
<div align="center">
| Model | Type |
| :-------------------: | :------: |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_1_8B_chat | full fine-tuning |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
| Qwen1_5-0_5B-Chat | full fine-tuning |
| Baichuan2_13B_chat | QLORA |
| ChatGLM3_6B | LORA |
| DeepSeek MoE_16B_chat | QLORA |
| Mixtral 8x7B_instruct | QLORA |
| …… | …… |
</div>
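The table distinguishes full fine-tuning from LoRA and QLoRA. As a rough illustration of why the low-rank variants train far fewer weights, the sketch below counts trainable parameters for a single weight matrix. The matrix shape and rank are illustrative assumptions, not values taken from the EmoLLM configs.

```python
def full_trainable_params(d_out: int, d_in: int) -> int:
    """Full fine-tuning updates every entry of the d_out x d_in weight matrix."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical 4096 x 4096 projection with rank 64 (assumed values).
d_out, d_in, rank = 4096, 4096, 64
full = full_trainable_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
ratio = lora / full
print(f"full: {full:,}  LoRA: {lora:,}  ratio: {ratio:.4f}")
```

QLoRA trains the same kind of adapters but keeps the frozen base weights in 4-bit precision, so the trainable count is unchanged while the memory for the base model shrinks.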
Everyone is welcome to contribute to this project ~
---
The Mental Health Grand Model is a comprehensive concept that aims to fully understand and promote the mental health of individuals, groups, and society as a whole. The model typically includes the following key components:
- Cognitive factors: Involving an individual's thought patterns, belief systems, cognitive biases, and problem-solving abilities. Cognitive factors significantly impact mental health as they affect how individuals interpret and respond to life events.
- Emotional factors: Including emotion regulation, emotional expression, and emotional experiences. Emotional health is a crucial part of mental health, involving how individuals manage and express their emotions and how they recover from negative emotions.
- Behavioral factors: Concerning an individual's behavior patterns, habits, and coping strategies. This includes stress management skills, social skills, and self-efficacy, which is the confidence in one's abilities.
- Social environment: Comprising external factors such as family, work, community, and cultural background, which have direct and indirect impacts on an individual's mental health.
- Physical health: There is a close relationship between physical and mental health. Good physical health can promote mental health and vice versa.
- Psychological resilience: Refers to an individual's ability to recover from adversity and adapt. Those with strong psychological resilience can bounce back from challenges and learn and grow from them.
- Prevention and intervention measures: The Mental Health Grand Model also includes strategies for preventing psychological issues and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
- Assessment and diagnostic tools: Effective promotion of mental health requires scientific tools to assess individuals' psychological states and diagnose potential psychological issues.
### Recent Updates
- 【2024.3.12】 Released [Ai Wei](https://aistudio.baidu.com/community/app/63335) on the Baidu PaddlePaddle platform
- 【2024.3.11】 **EmoLLM V2.0 improves on EmoLLM V1.0 across all scores and surpasses Role-playing ChatGPT on counseling tasks!** [Click to try EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0). Also updated the [dataset statistics and details](./datasets/) and the [Roadmap](./assets/Roadmap_ZH.png)
- 【2024.3.9】 Added concurrency to speed up [QA pair generation](./scripts/qa_generation/) and the [RAG pipeline](./rag/)
- 【2024.3.3】 [EmoLLM V2.0, a full fine-tune of InternLM2-7B-chat, has been open-sourced](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full) (requires two A100 80G GPUs). Updated the professional evaluation, see [evaluate](./evaluate/). Updated the PaddleOCR-based PDF-to-txt tool scripts, see [scripts](./scripts/).
- 【2024.2.29】 Updated objective assessment calculations, see [evaluate](./evaluate/) for details. A series of datasets have also been updated, see [datasets](./datasets/) for details.
- 【2024.2.27】 Updated the English README and a series of datasets ("simp"-style (舔狗) and single-turn dialogue datasets)
- 【2024.2.23】The "Gentle Lady Psychologist Ai Wei" based on InternLM2_7B_chat_qlora was launched. [Click here to obtain the model weights](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei), [configuration file](xtuner_config/aiwei-internlm2_chat_7b_qlora.py), [online experience link](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
- 【2024.2.23】Updated [several fine-tuning configurations](/xtuner_config/), added [data_pro.json](/datasets/data_pro.json) (more quantity, more comprehensive scenarios, richer content) and [aiwei.json](/datasets/aiwei.json) (dedicated to the gentle lady role-play, featuring Emoji expressions), the "Gentle Lady Psychologist Ai Wei" is coming soon.
- 【2024.2.18】 The full fine-tuned version based on Qwen1_5-0_5B-Chat has been [open-sourced](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary). Friends with limited computational resources can now dive in and explore it.
<details>
<summary>View More</summary>
- 【2024.2.6】 EmoLLM has reached 18.7k downloads on the [**Openxlab**](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) platform. Everyone is welcome to try it!
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="Model download count">
</p>
- 【2024.2.5】 The project has been promoted by the official WeChat account NLP Engineering. Here's the [link](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A) to the article. Welcome everyone to follow!! 🥳🥳
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/47868d6a-2e91-4aa9-a630-e594c14295b4" alt="WeChat official account QR code">
</p>
- 【2024.2.3】 [Project Video](https://www.bilibili.com/video/BV1N7421N76X/) at bilibili 😊
- 【2024.1.27】 Completed the data construction documentation, fine-tuning guide, deployment guide, Readme, and other related documents 👏
- 【2024.1.25】 EmoLLM V1.0 has been deployed online at https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
</details>
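The 2024.3.9 entry mentions concurrency for QA pair generation. A minimal sketch of that idea, assuming each text chunk is handled by one independent, I/O-bound call: `generate_qa` here is a hypothetical stand-in for an LLM API request and is not taken from the scripts in `./scripts/qa_generation/`.

```python
from concurrent.futures import ThreadPoolExecutor


def generate_qa(chunk: str) -> dict:
    """Hypothetical stand-in for one LLM API call that turns a chunk into a QA pair."""
    return {"question": f"What does this passage say? {chunk[:20]}", "answer": chunk}


def generate_all(chunks: list[str], max_workers: int = 4) -> list[dict]:
    # executor.map preserves input order, so results line up with chunks.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(generate_qa, chunks))


chunks = ["Sleep affects mood.", "Exercise can reduce stress.", "Social support matters."]
qa_pairs = generate_all(chunks)
print(len(qa_pairs), qa_pairs[0]["answer"])
```

Threads suit this case because the time is spent waiting on network responses; a real script would also need rate limiting and error handling, which are omitted here.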
### Roadmap
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/Roadmap_EN.png" alt="Roadmap_EN">
</a>
## Contents
- [EmoLLM - Large Language Model for Mental Health](#emollm---large-language-model-for-mental-health)
- [Recent Updates](#recent-updates)
- [Roadmap](#roadmap)
- [Contents](#contents)
- [Pre-development Configuration Requirements.](#pre-development-configuration-requirements)
- [**User Guide**](#user-guide)
- [File Directory Explanation](#file-directory-explanation)
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- [RAG (Retrieval Augmented Generation) Pipeline](#rag-retrieval-augmented-generation-pipeline)
- [Frameworks Used](#frameworks-used)
- [How to participate in this project](#how-to-participate-in-this-project)
- [Version control](#version-control)
- [Authors (in no particular order)](#authors-in-no-particular-order)
- [Copyright Notice](#copyright-notice)
- [Acknowledgments](#acknowledgments)
- [Star History](#star-history)
- [🌟 Contributors](#-contributors)
- [Communication group](#communication-group)
###### Pre-development Configuration Requirements.
- A100 40G (specifically for InternLM2_7B_chat + qlora fine-tuning + deepspeed zero2 optimization)
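For a sense of scale behind this requirement, the sketch below estimates the memory taken by the model weights alone at two precisions. The nominal 7e9 parameter count is an assumption, and the estimate deliberately leaves out activations, adapter gradients, and optimizer state, which account for much of the remaining GPU memory during fine-tuning.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GiB (no activations, gradients, or optimizer state)."""
    return num_params * bits_per_param / 8 / 2**30


params_7b = 7e9  # nominal parameter count assumed for a "7B" model
fp16_gib = weight_memory_gib(params_7b, 16)
int4_gib = weight_memory_gib(params_7b, 4)
print(f"fp16 weights: {fp16_gib:.2f} GiB, 4-bit weights: {int4_gib:.2f} GiB")
```

The 4x reduction in weight memory is what lets a QLoRA run keep the frozen base model small and spend the rest of the card on training state.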
###### **User Guide**
1. Clone the repo
```sh
git clone https://github.com/SmartFlowAI/EmoLLM.git
```
2. Read in sequence, or pick the sections you're interested in:
- [File Directory Explanation](#file-directory-explanation)
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- View More Details
### File Directory Explanation
```
├─assets: Image Resources
├─datasets: Dataset
├─demo: demo scripts
├─generate_data: Data Generation Guide
│ └─xinghuo
├─scripts: Some Available Tools
└─xtuner_config: Fine-tuning Guide
    └─images
```
### Data Construction
- Please read the [Data Construction Guide](generate_data/tutorial.md) for reference.
- The dataset used for this fine-tuning can be found at [datasets](datasets/data.json)
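Before fine-tuning, it can help to sanity-check the dataset file. The sketch below validates an in-memory sample; the record layout (a list of objects with a `conversation` array of `input`/`output` turns and an optional `system` field) is an assumption modeled on XTuner-style multi-turn data, so verify it against the actual `datasets/data.json` before relying on it.

```python
import json

# Assumed record layout (XTuner-style multi-turn conversations); the sample text is invented.
sample_json = """
[
  {"conversation": [
    {"system": "You are a supportive counselor.",
     "input": "I feel anxious lately.",
     "output": "That sounds hard. What has been on your mind?"},
    {"input": "Mostly exams.",
     "output": "Exam pressure is common. Let's look at what helps you cope."}
  ]}
]
"""


def validate(records: list) -> int:
    """Return the total number of turns, raising ValueError on a malformed record."""
    turns = 0
    for i, record in enumerate(records):
        conversation = record.get("conversation")
        if not isinstance(conversation, list) or not conversation:
            raise ValueError(f"record {i}: missing or empty 'conversation'")
        for j, turn in enumerate(conversation):
            if not {"input", "output"}.issubset(turn):
                raise ValueError(f"record {i}, turn {j}: needs 'input' and 'output'")
            turns += 1
    return turns


records = json.loads(sample_json)
total_turns = validate(records)
print(f"{len(records)} conversation(s), {total_turns} turn(s)")
```

To check a real file, replace `json.loads(sample_json)` with `json.load` on an opened file handle.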
### Fine-tuning Guide
For details, see the [fine-tuning guide](xtuner_config/README.md)
### Deployment Guide
- Demo deployment: see [deployment guide](./demo/README.md) for details.
- Quantized deployment based on [LMDeploy](https://github.com/InternLM/lmdeploy/): see [deploy](./deploy/lmdeploy.md)
### RAG (Retrieval Augmented Generation) Pipeline
- See [RAG](./rag/)
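For readers new to retrieval augmented generation, the general idea can be sketched as: score stored passages against the query, keep the best ones, and prepend them to the prompt. The code below is not the implementation in `./rag/`; it swaps in bag-of-words cosine similarity where a real pipeline would typically use dense embeddings and a vector store, and the passages are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the query."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: cosine(q, tokenize(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    joined = "\n".join(f"- {c}" for c in context)
    return f"Use the following context to answer.\n{joined}\n\nQuestion: {query}"


documents = [
    "Deep breathing exercises can ease acute anxiety.",
    "Regular sleep schedules improve mood stability.",
    "Journaling helps people process difficult emotions.",
]
query = "How can I ease anxiety quickly?"
top = retrieve(query, documents, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The resulting prompt would then be passed to the fine-tuned model, so that its answer is grounded in the retrieved passage.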
<details>
<summary>Additional Details</summary>
### Frameworks Used
- [Xtuner](https://github.com/InternLM/xtuner)
- [Transformers](https://github.com/huggingface/transformers)
- [Pytorch](https://pytorch.org/)
- [LMDeploy](https://github.com/InternLM/lmdeploy/): for quantized deployment
- [Streamlit](https://streamlit.io/): for building demos
- [DeepSpeed](https://github.com/microsoft/DeepSpeed): for parallel training
- …
#### How to participate in this project
Contributions make the open-source community an excellent place for learning, inspiration, and creation. Any contribution you make is greatly appreciated.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
### Version control
This project uses Git for version control. You can see the currently available versions in the repository.
</details>
### Authors (in no particular order)
| Username | School/Organization | Remarks | Contributions |
| :-------: | :-------------------: | :------------------: | :--------: |
| [aJupyter](https://github.com/aJupyter) | Nankai University, Master's student | DataWhale member | Project initiator |
| [jujimeizuo](https://github.com/jujimeizuo) | Jiangnan University, Master's student | | |
| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Harbin Institute of Technology (Weihai), Undergraduate student | | |
| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team Regional Director | Wenxin Large Model core developer | |
| [zxazys](https://github.com/zxazys) | Nankai University, Master's student | | |
| [MING-ZCH](https://github.com/MING-ZCH) | Huazhong University of Science and Technology, Undergraduate student | | |
| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | SWUFE (Southwestern University of Finance and Economics) | | |
| [MrCatAI](https://github.com/MrCatAI) | AI Mover | | |
| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin|Data Cleaning and Docs Translation|
| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, incoming Master's student | | |
| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |
### Copyright Notice
The project is licensed under the MIT License. For details, see [LICENSE](https://github.com/aJupyter/EmoLLM/blob/master/LICENSE)
### Acknowledgments
- [Sanbu](https://github.com/sanbuphy)
- [Shanghai Artificial Intelligence Laboratory](https://www.shlab.org.cn/)
- [Vanin](https://github.com/vansin)
- [Bloom up (WeChat Official Account Promotion)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
- Abu (M.A. in Psychology, Peking University)
<!-- links -->
<!-- [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=flat-square&logo=linkedin&colorB=555 -->
<!-- [linkedin-url]: https://linkedin.com/in/aJupyter -->
<!-- Too few to be worth including -->
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=SmartFlowAI/EmoLLM&type=Date)](https://star-history.com/#SmartFlowAI/EmoLLM&Date)
## 🌟 Contributors
[![EmoLLM contributors](https://contrib.rocks/image?repo=SmartFlowAI/EmoLLM&max=50)](https://github.com/SmartFlowAI/EmoLLM/graphs/contributors)
[your-project-path]: SmartflowAI/EmoLLM
[contributors-shield]: https://img.shields.io/github/contributors/SmartflowAI/EmoLLM.svg?style=flat-square
[contributors-url]: https://github.com/SmartflowAI/EmoLLM/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/SmartflowAI/EmoLLM.svg?style=flat-square
[forks-url]: https://github.com/SmartflowAI/EmoLLM/network/members
[stars-shield]: https://img.shields.io/github/stars/SmartflowAI/EmoLLM.svg?style=flat-square
[stars-url]: https://github.com/SmartflowAI/EmoLLM/stargazers
[issues-shield]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg?style=flat-square
[issues-url]: https://github.com/SmartflowAI/EmoLLM/issues
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE
[OpenXLab_App-image]: https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg
[OpenXLab_Model-image]: https://cdn-static.openxlab.org.cn/header/openxlab_models.svg
[OpenXLab_App-url]: https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0
[OpenXLab_Model-url]: https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full
## Communication group
- If the group QR code has expired, please go to the Issues section.
<p align="center">
<img width="30%" src="https://github.com/SmartFlowAI/EmoLLM/assets/62385492/55ecd0aa-4832-4269-ad57-4c26f9aa286b" alt="EmoLLM official communication group">
</p>
<div align="center">
# EmoLLM - Large Language Model for Mental Health
</div>
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/logo.jpeg" alt="Logo" width="30%">
</a>
<div align="center">
<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Issues][issues-shield]][issues-url]
[![OpenXLab_App][OpenXLab_App-image]][OpenXLab_App-url]
[![OpenXLab_Model][OpenXLab_Model-image]][OpenXLab_Model-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-shield]][stars-url]
</div>
<h3 align="center">EmoLLM</h3>
<p align="center">
<a href="README.md">简体中文</a> | English
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM"><strong>Explore the documentation of this project »</strong></a>
<br />
<br />
<a href="https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0">EmoLLM 2.0 Demo</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">Report a Bug</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">Propose a New Feature</a>
</p>
</p>
<!-- This README.md is intended for developers -->
**EmoLLM** is a series of large language models for mental health that supports the full counseling chain of **understanding users, supporting users, and helping users**. The models are obtained by instruction fine-tuning existing `LLM`s. We would really appreciate a star~⭐⭐. The open-sourced `LLM` fine-tuning configurations are as follows:
<div align="center">
| Model | Type |
| :-------------------: | :------: |
| InternLM2_7B_chat | QLORA |
| InternLM2_7B_chat | full fine-tuning |
| InternLM2_1_8B_chat | full fine-tuning |
| InternLM2_20B_chat | LORA |
| Qwen_7b_chat | QLORA |
| Qwen1_5-0_5B-Chat | full fine-tuning |
| Baichuan2_13B_chat | QLORA |
| ChatGLM3_6B | LORA |
| DeepSeek MoE_16B_chat | QLORA |
| Mixtral 8x7B_instruct | QLORA |
| …… | …… |
</div>
Everyone is welcome to contribute to this project ~
---
The Mental Health Grand Model is a comprehensive concept that aims to fully understand and promote the mental health of individuals, groups, and society as a whole. It typically includes the following key components:
- Cognitive factors: Involving an individual's thought patterns, belief systems, cognitive biases, and problem-solving abilities. Cognitive factors significantly impact mental health as they affect how individuals interpret and respond to life events.
- Emotional factors: Including emotion regulation, emotional expression, and emotional experiences. Emotional health is a crucial part of mental health, involving how individuals manage and express their emotions and how they recover from negative emotions.
- Behavioral factors: Concerning an individual's behavior patterns, habits, and coping strategies. This includes stress management skills, social skills, and self-efficacy, which is the confidence in one's abilities.
- Social environment: Comprising external factors such as family, work, community, and cultural background, which have direct and indirect impacts on an individual's mental health.
- Physical health: There is a close relationship between physical and mental health. Good physical health can promote mental health and vice versa.
- Psychological resilience: Refers to an individual's ability to recover from adversity and adapt. Those with strong psychological resilience can bounce back from challenges and learn and grow from them.
- Prevention and intervention measures: The Mental Health Grand Model also includes strategies for preventing psychological issues and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
- Assessment and diagnostic tools: Effective promotion of mental health requires scientific tools to assess individuals' psychological states and diagnose potential psychological issues.
### Recent Updates
- 【2024.3.12】 Released on the Baidu PaddlePaddle platform: [aiwei](https://aistudio.baidu.com/community/app/63335)
- 【2024.3.11】 **EmoLLM V2.0 improves greatly on EmoLLM V1.0 in all scores and surpasses role-playing ChatGPT on counseling tasks!** [Click to try EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0). Also updated [dataset statistics and details](./datasets/) and the [Roadmap](./assets/Roadmap_ZH.png)
- 【2024.3.9】 Added concurrency to speed up [QA pair generation](./scripts/qa_generation/); added the [RAG pipeline](./rag/)
- 【2024.3.3】 [Open-sourced EmoLLM V2.0, a full fine-tune of InternLM2-7B-chat](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full) (requires two A100 80G GPUs); updated the professional evaluation, see [evaluate](./evaluate/); updated the PaddleOCR-based PDF-to-TXT tool scripts, see [scripts](./scripts/).
- 【2024.2.29】 Updated objective assessment calculations, see [evaluate](./evaluate/) for details. A series of datasets have also been updated, see [datasets](./datasets/) for details.
- 【2024.2.27】 Updated the English README and a series of datasets (the "simp" (舔狗) dataset and the single-turn dialogue dataset)
- 【2024.2.23】The "Gentle Lady Psychologist Ai Wei" based on InternLM2_7B_chat_qlora was launched. [Click here to obtain the model weights](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei), [configuration file](xtuner_config/aiwei-internlm2_chat_7b_qlora.py), [online experience link](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
- 【2024.2.23】Updated [several fine-tuning configurations](/xtuner_config/), added [data_pro.json](/datasets/data_pro.json) (more quantity, more comprehensive scenarios, richer content) and [aiwei.json](/datasets/aiwei.json) (dedicated to the gentle lady role-play, featuring Emoji expressions), the "Gentle Lady Psychologist Ai Wei" is coming soon.
- 【2024.2.18】 The full fine-tuned version based on Qwen1_5-0_5B-Chat has been [open-sourced](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary). Friends with limited computational resources can now dive in and explore it.
<details>
<summary>View More</summary>
- 【2024.2.6】 [Open-sourced based on the Qwen1_5-0_5B-Chat full-scale fine-tuned version](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary), friends with limited computing power can start experimenting~
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="Model downloads">
</p>
- 【2024.2.5】 The project has been promoted by the official WeChat account NLP Engineering. Here's the [link](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A) to the article. Welcome everyone to follow!! 🥳🥳
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/47868d6a-2e91-4aa9-a630-e594c14295b4" alt="WeChat official account QR code">
</p>
- 【2024.2.3】 [Project Video](https://www.bilibili.com/video/BV1N7421N76X/) on bilibili 😊
- 【2024.1.27】 Completed the data construction documentation, fine-tuning guide, deployment guide, README, and other related documents 👏
- 【2024.1.25】 EmoLLM V1.0 has been deployed online at https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
</details>
### Roadmap
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/Roadmap_EN.png" alt="Roadmap_EN">
</a>
## Contents
- [EmoLLM - Large Language Model for Mental Health](#emollm---large-language-model-for-mental-health)
- [Recent Updates](#recent-updates)
- [Roadmap](#roadmap)
- [Contents](#contents)
- [Pre-development Configuration Requirements.](#pre-development-configuration-requirements)
- [**User Guide**](#user-guide)
- [File Directory Explanation](#file-directory-explanation)
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- [RAG (Retrieval Augmented Generation) Pipeline](#rag-retrieval-augmented-generation-pipeline)
- [Frameworks Used](#frameworks-used)
- [How to participate in this project](#how-to-participate-in-this-project)
- [Version control](#version-control)
- [Authors (in no particular order)](#authors-in-no-particular-order)
- [Copyright Notice](#copyright-notice)
- [Acknowledgments](#acknowledgments)
- [Star History](#star-history)
- [🌟 Contributors](#-contributors)
- [Communication group](#communication-group)
###### Pre-development Configuration Requirements.
- A100 40G (specifically for InternLM2_7B_chat + qlora fine-tuning + deepspeed zero2 optimization)
###### **User Guide**
1. Clone the repo
```sh
git clone https://github.com/SmartFlowAI/EmoLLM.git
```
1. Read in sequence or read sections you're interested in
- [File Directory Explanation](#file-directory-explanation)
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- View More Details
### File Directory Explanation
```
├─assets: Image Resources
├─datasets: Dataset
├─demo: demo scripts
├─generate_data: Data Generation Guide
│ └─xinghuo
├─scripts: Some Available Tools
└─xtuner_config: Fine-tuning Guide
└─images
```
### Data Construction
- Please read the [Data Construction Guide](generate_data/tutorial.md) for reference.
- The dataset used for this fine-tuning can be found at [datasets](datasets/data.json)
### Fine-tuning Guide
For details, see the [fine-tuning guide](xtuner_config/README.md)
### Deployment Guide
- Demo deployment: see [deployment guide](./demo/README.md) for details.
- Quantized deployment based on [LMDeploy](https://github.com/InternLM/lmdeploy/): see [deploy](./deploy/lmdeploy.md)
### RAG (Retrieval Augmented Generation) Pipeline
- See [RAG](./rag/)
<details>
<summary>Additional Details</summary>
### Frameworks Used
- [Xtuner](https://github.com/InternLM/xtuner)
- [Transformers](https://github.com/huggingface/transformers)
- [Pytorch](https://pytorch.org/)
- [LMDeploy](https://github.com/InternLM/lmdeploy/): for quantized deployment
- [Streamlit](https://streamlit.io/): for building demos
- [DeepSpeed](https://github.com/microsoft/DeepSpeed): for parallel training
- …
### How to participate in this project
Contributions make the open-source community an excellent place for learning, inspiration, and creation. Any contribution you make is greatly appreciated.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
### Version control
This project uses Git for version control. You can see the currently available versions in the repository.
</details>
### Authors (in no particular order)
| Username | School/Organization | Remarks | Contributions |
| :-------: | :-------------------: | :------------------: | :--------: |
| [aJupyter](https://github.com/aJupyter) | Nankai University, Master's student | DataWhale member | Project initiator |
| [jujimeizuo](https://github.com/jujimeizuo) | Jiangnan University, Master's student | | |
| [Smiling-Weeping-zhr](https://github.com/Smiling-Weeping-zhr) | Harbin Institute of Technology (Weihai), Undergraduate student | | |
| [8baby8](https://github.com/8baby8) | PaddlePaddle Pilot Team Regional Director | Wenxin Large Model core developer | |
| [zxazys](https://github.com/zxazys) | Nankai University, Master's student | | |
| [MING-ZCH](https://github.com/MING-ZCH) | Huazhong University of Science and Technology, Undergraduate student | | |
| [JasonLLLLLLLLLLL](https://github.com/JasonLLLLLLLLLLL) | SWUFE (Southwestern University of Finance and Economics) | | |
| [MrCatAI](https://github.com/MrCatAI) | AI Mover | | |
| [ZeyuBa](https://github.com/ZeyuBa) | Institute of Automation, Master's student | | |
| [aiyinyuedejustin](https://github.com/aiyinyuedejustin) | University of Pennsylvania, Master's student | | |
| [Nobody-ML](https://github.com/Nobody-ML) | China University of Petroleum (East China), Undergraduate student | | |
| [chg0901](https://github.com/chg0901) | [MiniSora](https://github.com/mini-sora/minisora) |Maintainer and Admin|Data Cleaning and Docs Translation|
| [Mxoder](https://github.com/Mxoder) | Beihang University, Undergraduate student | | |
| [Anooyman](https://github.com/Anooyman) | Nanjing University of Science and Technology, Master's student | | |
| [Vicky-3021](https://github.com/Vicky-3021) | Xidian University, Master's student (Research Year 0) | | |
| [SantiagoTOP](https://github.com/santiagoTOP) | Taiyuan University of Technology, Master's student | | |
| [zealot52099](https://github.com/zealot52099) | AI Mover | |Data Processing and RAG|
### Copyright Notice
The project is licensed under the MIT License. See [LICENSE](https://github.com/aJupyter/EmoLLM/blob/master/LICENSE) for details.
### Acknowledgments
- [Sanbu](https://github.com/sanbuphy)
- [Shanghai Artificial Intelligence Laboratory](https://www.shlab.org.cn/)
- [Vanin](https://github.com/vansin)
- [Bloom up (WeChat Official Account Promotion)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
- Abu (M.A. in Psychology, Peking University)
<!-- links -->
<!-- [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=flat-square&logo=linkedin&colorB=555 -->
<!-- [linkedin-url]: https://linkedin.com/in/aJupyter -->
<!-- Too few to be worth including -->
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=SmartFlowAI/EmoLLM&type=Date)](https://star-history.com/#SmartFlowAI/EmoLLM&Date)
## 🌟 Contributors
[![EmoLLM contributors](https://contrib.rocks/image?repo=SmartFlowAI/EmoLLM&max=50)](https://github.com/SmartFlowAI/EmoLLM/graphs/contributors)
[your-project-path]: SmartflowAI/EmoLLM
[contributors-shield]: https://img.shields.io/github/contributors/SmartflowAI/EmoLLM.svg?style=flat-square
[contributors-url]: https://github.com/SmartflowAI/EmoLLM/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/SmartflowAI/EmoLLM.svg?style=flat-square
[forks-url]: https://github.com/SmartflowAI/EmoLLM/network/members
[stars-shield]: https://img.shields.io/github/stars/SmartflowAI/EmoLLM.svg?style=flat-square
[stars-url]: https://github.com/SmartflowAI/EmoLLM/stargazers
[issues-shield]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg?style=flat-square
[issues-url]: https://github.com/SmartflowAI/EmoLLM/issues
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE
[OpenXLab_App-image]: https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg
[OpenXLab_Model-image]: https://cdn-static.openxlab.org.cn/header/openxlab_models.svg
[OpenXLab_App-url]: https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0
[OpenXLab_Model-url]: https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full
## Communication group
- If the group QR code has expired, please go to the Issues section.
<p align="center">
<img width="30%" src="https://github.com/SmartFlowAI/EmoLLM/assets/62385492/55ecd0aa-4832-4269-ad57-4c26f9aa286b" alt="EmoLLM official communication group">
</p>


@ -23,7 +23,14 @@ def qwen_api(data, emo):
病人病人的咨询或陈述
医生医生的安抚和建议
'''
response = dashscope.Generation.call(
try:
response = dashscope.Generation.call(
model='qwen-max',
prompt=prompt,
history=[],
)
except:
response = dashscope.Generation.call(
model='qwen-max',
prompt=prompt,
history=[],
@ -54,14 +61,17 @@ if __name__ == '__main__':
emotions_lis = configs['emotions_list']
areas_of_life = configs['areas_of_life']
ai_tool = 'qwen'
save_interval = 5
total_num_each_emo_area = 5
conversation_lis = []
for emo in emotions_lis:
for area in areas_of_life:
for area in areas_of_life:
for emo in emotions_lis:
gen_path = f'./{ai_tool}/{area}/{emo}.jsonl'
for i in tqdm(range(100), desc='{emo}, {area}'.format(emo=emo, area=area)):
for i in tqdm(range(total_num_each_emo_area), desc='{emo}, {area}'.format(emo=emo, area=area)):
one_conversation = {
"conversation": []
}
@ -98,8 +108,7 @@ if __name__ == '__main__':
)
conversation_lis.append(one_conversation)
# Save to disk at a fixed interval of generated conversations
if ((i+1) % 10 == 0):
if ((i+1) % save_interval == 0):
save_jsonl(data_lis=conversation_lis, file_path=gen_path)
print(f'generate {gen_path}')
conversation_lis = [] # reset the buffer


@ -100,7 +100,10 @@
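5. **Dataset integration**
Before integrating the dataset, we must check the generated data for format errors, type mismatches, and similar problems. We use check.py to check the data, and finally use merge_json.py to merge all the JSON files into a single JSON file.
Before integrating the dataset, we must check the generated data for format errors, type mismatches, and similar problems.
* First, use `check.py` to check the data.
* Then, use `merge_json.py` to merge all the JSON files into a single JSON file.
6. **Evaluation and optimization**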


@ -82,12 +82,15 @@ if __name__ == '__main__':
areas_of_life = configs['areas_of_life']
ai_tool = 'zhipuai'
save_interval = 5
total_num_each_emo_area = 5
conversation_lis = []
for emo in emotions_lis:
for area in areas_of_life:
for area in areas_of_life:
for emo in emotions_lis:
gen_path = f'./{ai_tool}/{area}/{emo}.jsonl'
for i in tqdm(range(100), desc='{emo}, {area}'.format(emo=emo, area=area)):
for i in tqdm(range(total_num_each_emo_area), desc='{emo}, {area}'.format(emo=emo, area=area)):
res = zhipu_api(area, emo)
print(res)
if res == 'null':
@ -95,7 +98,7 @@ if __name__ == '__main__':
continue
conversation_lis.append(convert(res))
if ((i+1) % 10 == 0):
if ((i+1) % save_interval == 0):
# path = f'./{args.data}.jsonl'
save_jsonl(data_lis=conversation_lis, file_path=gen_path)
print(f'generate {gen_path}')


@ -0,0 +1,66 @@
# EmoLLM RAG
## **Module purpose**
Relevant information is retrieved according to the user's question so that EmoLLM's answers are more professional and reliable. The retrieved content includes, but is not limited to:
- Psychology-related theories
- Psychological methodologies
- Classic cases
- User background knowledge
## **Datasets**
- Cleaned QA pairs: each QA pair is embedded as one sample
- Filtered TXT texts
  - Generate embeddings directly from the TXT text (segmented by token length)
  - Filter out irrelevant content such as tables of contents, then generate embeddings from the TXT text (segmented by token length)
  - Filter out irrelevant content such as tables of contents, then segment the TXT semantically and generate embeddings
  - Split the TXT according to its table-of-contents structure and generate embeddings following that hierarchy
For details on dataset construction, please refer to [qa_generation_README](https://github.com/SmartFlowAI/EmoLLM/blob/ccfa75c493c4685e84073dfbc53c50c09a2988e3/scripts/qa_generation/README.md)
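
Segmentation by token length can be pictured as a sliding window over the token sequence. The sketch below is only an illustration under stated assumptions: the function name `chunk_by_length` and the parameters `window_size` and `overlap_size` are hypothetical, and the tokens are plain list items rather than the output of whatever tokenizer the project actually uses.

```python
from typing import List


def chunk_by_length(tokens: List[str], window_size: int, overlap_size: int) -> List[List[str]]:
    """Split tokens into windows of at most window_size items.

    Adjacent windows share overlap_size items so that text cut at a
    window boundary still appears with some context in the next window.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap_size < window_size:
        raise ValueError("overlap_size must be in [0, window_size)")

    step = window_size - overlap_size
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window_size])
        # Stop once the current window already reaches the end of the input.
        if start + window_size >= len(tokens):
            break
    return chunks
```

Each chunk would then be embedded as one sample. A larger overlap keeps more context across chunk boundaries at the cost of more vectors in the DB.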
## **Components**
### [BCEmbedding](https://github.com/netease-youdao/BCEmbedding?tab=readme-ov-file)
- [bce-embedding-base_v1](https://hf-mirror.com/maidalun1020/bce-embedding-base_v1): embedding model, used to build vector DB
- [bce-reranker-base_v1](https://hf-mirror.com/maidalun1020/bce-reranker-base_v1): rerank model, used to rerank retrieved documents
### [Langchain](https://python.langchain.com/docs/get_started)
LangChain is an open source framework for building large language model (LLM) based applications. LangChain provides a variety of tools and abstractions to increase the customization, accuracy, and relevance of the information generated by your models.
### [FAISS](https://faiss.ai/)
FAISS is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that can search sets of vectors of any size. Since LangChain already integrates FAISS, this project uses that integration instead of developing against the native FAISS documentation. [FAISS in Langchain](https://python.langchain.com/docs/integrations/vectorstores/faiss)
### [RAGAS](https://github.com/explodinggradients/ragas)
RAGAS is a classic evaluation framework for RAG. It evaluates a pipeline along the following three aspects:
- Faithfulness: The answers given should be generated based on the given context.
- Answer Relevance: The generated answer should solve the actual question asked.
- Context Relevance: The retrieved information should be highly concentrated and contain as little irrelevant information as possible.
More evaluation metrics, such as context recall, were added later.
## **Details**
### RAG pipeline
- Build the vector DB from the dataset
- Embed the question entered by the user
- Search the vector DB with the resulting embedding
- Rerank the recalled documents
- Generate the final answer from the user's question and the recalled documents
**Note**: The above process only runs when the user chooses to use RAG
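
The steps above can be sketched in a few lines. This is a minimal, hedged illustration rather than the project's implementation: `embed`, `rerank`, and `generate` are hypothetical callables standing in for the embedding model, the rerank model, and the LLM, and the in-memory list `index` stands in for the vector DB. Only the names `retrieval_num` and `select_num` mirror settings that appear in this commit's `config.py`.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, index: Sequence[Tuple[str, Vector]], k: int) -> List[str]:
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def rag_answer(
    query: str,
    embed: Callable[[str], Vector],
    index: Sequence[Tuple[str, Vector]],
    rerank: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
    use_rag: bool = True,
    retrieval_num: int = 10,
    select_num: int = 3,
) -> str:
    """Embed the query, search, rerank, then generate from the kept documents."""
    if not use_rag:
        # The user did not opt in to RAG: answer without retrieved context.
        return generate(query, [])
    candidates = retrieve(embed(query), index, retrieval_num)
    selected = rerank(query, candidates)[:select_num]
    return generate(query, selected)
```

Retrieving more candidates than are finally kept (`retrieval_num` larger than `select_num`) gives the reranker room to promote documents that the embedding search ranked lower.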
### Follow-up actions
- Add RAGAS evaluation results to the generation process. For example, when a generated result cannot solve the user's problem, the answer is regenerated.
- Add web retrieval for cases where the corresponding information cannot be found in the vector DB
- Add multi-channel retrieval to increase the recall rate: multiple similar queries are generated from the user input, and each of them is used for retrieval.
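
As a rough illustration of the multi-channel idea in the last item, the sketch below merges the results of several query variants. It is an assumption-laden example: `multi_query_retrieve` and its `search` argument are hypothetical names, and how the similar queries are produced (for example by an LLM) is left out. The merge is a plain first-seen de-duplication, not a method the project has committed to.

```python
from typing import Callable, Iterable, List


def multi_query_retrieve(
    queries: Iterable[str],
    search: Callable[[str], List[str]],
    limit: int,
) -> List[str]:
    """Run search for every query variant and merge the results.

    Duplicates are dropped and first-seen order is kept, so documents
    found by earlier variants come first.
    """
    merged: List[str] = []
    seen = set()
    for query in queries:
        for doc in search(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged[:limit]
```

The merged list could then be passed to the rerank step, which decides the final order across all variants.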


@ -1,4 +1,6 @@
sentence_transformers
transformers
numpy
loguru
loguru
langchain
torch


@ -3,6 +3,7 @@ import os
cur_dir = os.path.dirname(os.path.abspath(__file__)) # config
src_dir = os.path.dirname(cur_dir) # src
base_dir = os.path.dirname(src_dir) # base
model_repo = 'ajupyter/EmoLLM_aiwei'
# model
model_dir = os.path.join(base_dir, 'model') # model
@ -17,3 +18,6 @@ knowledge_pkl_path = os.path.join(data_dir, 'knowledge.pkl') # pickle
# log
log_dir = os.path.join(base_dir, 'log') # log
log_path = os.path.join(log_dir, 'log.log') # file
select_num = 3
retrieval_num = 10


@ -5,8 +5,19 @@ import numpy as np
from typing import Tuple
from sentence_transformers import SentenceTransformer
from config.config import knowledge_json_path, knowledge_pkl_path
from config.config import knowledge_json_path, knowledge_pkl_path, model_repo
from util.encode import load_embedding, encode_qa
from util.pipeline import EmoLLMRAG
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import streamlit as st
from openxlab.model import download
download(
model_repo=model_repo,
output='model'
)
"""
@ -62,6 +73,19 @@ def main():
## 2. Concatenate contents into the prompt and pass it to the LLM as {known content}
## 3. Ask the LLM to reply based on the known content
@st.cache_resource
def load_model():
model = (
AutoModelForCausalLM.from_pretrained("model", trust_remote_code=True)
.to(torch.bfloat16)
.cuda()
)
tokenizer = AutoTokenizer.from_pretrained("model", trust_remote_code=True)
return model, tokenizer
if __name__ == '__main__':
main()
#main()
query = ''
model, tokenizer = load_model()
rag_obj = EmoLLMRAG(model)
response = rag_obj.main(query)

114
rag/src/util/pipeline.py Normal file

@ -0,0 +1,114 @@
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from transformers.utils import logging

from config.config import retrieval_num, select_num

logger = logging.get_logger(__name__)


class EmoLLMRAG(object):
    """
    EmoLLM RAG Pipeline
        1. Embed the query
        2. Retrieve data from the vector DB
        3. Rerank the retrieved results
        4. Pass the query and the retrieved content to the LLM
    """

    def __init__(self, model) -> None:
        """
        Initialize with the given model.

        DataProcessing obj: handles data processing, including embedding and rerank
        vectorstores: loads the vector DB; it should be rebuilt if it does not exist
        system prompt: the predefined system prompt
        prompt template: the template for the final input to the LLM
        """
        self.model = model
        self.vectorstores = self._load_vector_db()
        self.system_prompt = self._get_system_prompt()
        self.prompt_template = self._get_prompt_template()
        # Waiting for the embedding team to wrap the corresponding interface
        #self.data_process_obj = DataProcessing()

    def _load_vector_db(self):
        """
        Load the vector DB through the interface provided by the embedding module
        """
        return

    def _get_system_prompt(self) -> str:
        """
        Load the system prompt
        """
        return ''

    def _get_prompt_template(self) -> str:
        """
        Load the prompt template
        """
        return ''

    def get_retrieval_content(self, query, rerank_flag=False) -> str:
        """
        Input: the user's question, whether to rerank
        Output: the retrieved (and optionally reranked) content
        """
        content = ''
        documents = self.vectorstores.similarity_search(query, k=retrieval_num)
        # If rerank is requested, call the interface to rerank documents
        if rerank_flag:
            pass
            # To be enabled once the interface is available
            #documents = self.data_process_obj.rerank_documents(documents, select_num)
        for doc in documents:
            content += doc.page_content
        return content

    def generate_answer(self, query, content) -> str:
        """
        Input: the user's question, the retrieved content
        Output: the result generated by the model
        """
        # Build the template
        # The first version does not use history, so the system prompt is put directly into the template
        prompt = PromptTemplate(
            template=self.prompt_template,
            input_variables=["query", "content", "system_prompt"],
        )
        # Define the chain
        # The output format is a string
        rag_chain = prompt | self.model | StrOutputParser()
        # Run
        generation = rag_chain.invoke(
            {
                "query": query,
                "content": content,
                "system_prompt": self.system_prompt
            }
        )
        return generation

    def main(self, query) -> str:
        """
        Input: the user's question
        Output: the result generated by the LLM

        Defines the whole RAG pipeline flow and dispatches each module.

        TODO:
            Add the RAGAS scoring system
        """
        content = self.get_retrieval_content(query)
        response = self.generate_answer(query, content)
        return response


@ -0,0 +1,111 @@
import os
import json
import time
from tqdm import tqdm
import concurrent.futures
from datetime import datetime
import numpy as np

from config.config import result_dir, clean_dir, storage_interval, window_size, overlap_size, multi_process_num
from model.qwen import call_qwen_single_turn, call_qwen_Psychology_QA_Pairs
from util.logger import get_logger
from util.data_loader import get_jsonl_file_paths, get_file_list, get_QA_pairs, get_txt_content, capture_qa, merge_sub_qa_generation, save_to_file

logger = get_logger()


def single_thread_generate(thread_num, interval, model_caller, storage_jsonl_path, contents):
    storage_counter = 0
    judge_list = []
    for content in tqdm(contents):
        # print('content: ', content)
        try:
            # model_caller passes content to the model and returns the model's output as response
            response = model_caller(content)
            # print('response: ', response)
            if response == '1':
                content = json.loads(content)
                judge_list.append(content)
                storage_counter += 1
            else:
                continue
            # Once interval items have accumulated, save judge_list to storage_jsonl_path
            if storage_counter % interval == 0:
                save_to_file(storage_jsonl_path, judge_list)
                storage_counter = 0
                judge_list = []
        except Exception as exc:
            logger.error("QA generation error : %s" % (exc))
    # Finally, save whatever is left in judge_list
    if judge_list:
        save_to_file(storage_jsonl_path, judge_list)
        judge_list = []


"""
Clean QA pairs
model_name: name of the model to call; only qwen is implemented for now
interval: storage interval, i.e. how many items are collected between writes; too small an interval increases IO overhead
"""
def clean_qa(
    model_name: str = 'qwen',
    interval: int = 10,
):
    # current_time = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    if model_name == 'qwen':
        model_caller = call_qwen_Psychology_QA_Pairs
    else:
        logger.warning('This model is currently not supported and will call the default model - qwen.')
        model_caller = call_qwen_Psychology_QA_Pairs
        model_name = 'qwen'
    logger.info(f'The called model is: {model_name}.')
    logger.info(f'The storage interval is: {interval}.')

    file_lists = get_jsonl_file_paths()  # paths of all .jsonl files under the judge_dir folder
    for file_path in file_lists:
        # All QA pairs in one jsonl file
        contents = get_QA_pairs(file_path)
        # print(contents)
        file_name = os.path.basename(file_path)
        print(file_name)
        storage_jsonl_path = os.path.join(
            clean_dir, f'{file_name}')
        logger.info(f'The generated QA will be stored in {storage_jsonl_path}.')

        contents_array = np.array(contents)
        chunks = np.array_split(contents_array, multi_process_num)

        # Build the list of per-thread arguments
        parameters_list = list()
        for thread_num, chunk in enumerate(chunks):
            parameters_list.append(
                [thread_num, interval, model_caller, storage_jsonl_path, list(chunk)]
            )

        with concurrent.futures.ThreadPoolExecutor(max_workers=multi_process_num) as executor:
            # Submit single_thread_generate once per parameter set
            futures = [executor.submit(single_thread_generate, *parameters) for parameters in parameters_list]
            for future in concurrent.futures.as_completed(futures):
                try:
                    future.result()
                except Exception as exc:
                    logger.error("Thread generated an exception: %s" % (exc))

        merge_sub_qa_generation(result_dir, storage_jsonl_path)


if __name__ == '__main__':
    # Create the folder for cleaned data
    os.makedirs('./data/cleaned', exist_ok=True)
    clean_qa(interval=storage_interval)


@ -93,3 +93,34 @@
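## **Step 4: Clean the QA pairs**
- Purpose of cleaning
- Improve the quality of the extracted QA data by removing QA pairs that are unrelated to psychology
- Cleaning method
- Use a prompt to have the LLM judge each given QA pair
- **Reference prompt**
- ```markdown
You are an experienced psychological counselor and are familiar with psychology. Based on the QA pair I provide, judge whether this QA pair belongs to the field of psychology.
The criteria are as follows:
- If the current QA pair belongs to the field of psychology, return 1
- If the current QA pair does not belong to the field of psychology, return 0
The following is the content of the given psychology QA pair:
```
- Cleaning tools
- Configure `DASHSCOPE_API_KEY` in `config/config.py`; see Step 3 for how to obtain the API key
- Use the provided cleaning script [QA_clean](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/qa_generation/QA_clean.py)
- How to use
- Prepare the QA pair data to be cleaned
- Put the data into the `data` folder at the same level as the `model` folder
- Modify `judge_dir` in `config/config.py` according to the folder name
- If the folder holding the data is named `xxx`, set `judge_dir = os.path.join(data_dir, 'xxx')`
- The cleaned QA pairs are stored in `jsonl` format under `data/cleaned`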


@ -93,3 +93,40 @@ Using books specialized in psychology to build QA knowledge pairs for RAG to pro
## **Step 4: Cleaning of QA pairs**
- Purpose of cleaning
- Improve the quality of extracted QA data and clean out QA pairs that are not relevant to psychology
- Cleaning Methods
- Use a prompt to have the LLM judge each given QA pair
- **Reference prompt**
- ```markdown
You are an experienced counselor and are familiar with psychology. Based on the QA pair I have provided, determine if this QA pair is psychological in nature.
The criteria are as follows:
- If the current QA pair belongs to the category of psychology, then return 1
- If the current QA pair does not belong to the category of psychology, then return 0.
The following is the content of the given psychology QA pair:
```
- Cleaning Tools
- Configure `DASHSCOPE_API_KEY` in `config/config.py`; see Step 3 for how to obtain the API key.
- Use the provided cleaning script [QA_clean](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/qa_generation/QA_clean.py)
- How to use
- Prepare the QA pair data to be cleaned
- Put the data into the `data` folder at the same level as the `model` folder.
- Modify `judge_dir` in `config/config.py` according to the folder name.
- If the folder holding the data is named `xxx`, set `judge_dir = os.path.join(data_dir, 'xxx')`.
- The cleaned QA pairs are stored in `jsonl` format under `data/cleaned`.
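
The cleaning rule described above (keep a QA pair only when the LLM returns 1) can be sketched as follows. This is an illustration under assumptions, not the `QA_clean.py` script itself: `filter_psychology_pairs`, `to_jsonl`, and the `judge` callable are hypothetical names, and the `question`/`answer` fields in the usage example are made up. In practice `judge` would send the reference prompt plus the QA pair to the LLM.

```python
import json
from typing import Callable, Iterable, List


def filter_psychology_pairs(lines: Iterable[str], judge: Callable[[str], str]) -> List[dict]:
    """Keep the JSON lines that the judge labels '1' (psychology-related)."""
    kept: List[dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the jsonl input
        if judge(line).strip() == "1":
            kept.append(json.loads(line))
    return kept


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as jsonl text, one JSON object per line."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

A stub judge is enough to try it out:

```python
lines = [
    '{"question": "What is CBT?", "answer": "A therapy."}',
    '{"question": "Best pizza?", "answer": "Margherita."}',
]
kept = filter_psychology_pairs(lines, judge=lambda text: "1" if "CBT" in text else "0")
print(to_jsonl(kept))
```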


@ -0,0 +1,8 @@
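You are an experienced psychological counselor with solid knowledge of psychology. Based on the QA pair I provide, determine whether it belongs to the field of psychology.
The criteria are as follows:
- If the current QA pair belongs to the field of psychology, return 1
- If the current QA pair does not belong to the field of psychology, return 0
The following is the given psychology QA pair: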


@ -10,7 +10,9 @@ base_dir = os.path.dirname(cur_dir) # ba
model_dir = os.path.join(base_dir, 'model') # model
# data
data_dir = os.path.join(base_dir, 'data')
clean_dir = os.path.join(data_dir, 'cleaned')
judge_dir = os.path.join(data_dir, '数据整合')
result_dir = os.path.join(data_dir, 'generated') # result
# log
@ -18,7 +20,9 @@ log_dir = os.path.join(base_dir, 'log') # lo
log_file_path = os.path.join(log_dir, 'log.log') # file
# Prompt content
system_prompt_file_path = os.path.join(base_dir, 'system_prompt_v2.md') # system prompt
wash_prompt_file_path = os.path.join(base_dir, 'choose_prompt.md')
"""
@ -28,7 +32,6 @@ system_prompt_file_path = os.path.join(base_dir, 'system_prompt_v2.md') # sy
DASHSCOPE_API_KEY = ''
"""
Control parameters
"""
@ -36,3 +39,4 @@ storage_interval = 10
window_size = 8
overlap_size = 2
multi_process_num = 3
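
Since `DASHSCOPE_API_KEY` is left empty in the config, one way to avoid committing a real key is to fall back to an environment variable. The sketch below is an assumption about how this could be done, not part of the commit; `resolve_api_key` is a hypothetical helper.

```python
import os


def resolve_api_key(configured: str = "", env_var: str = "DASHSCOPE_API_KEY") -> str:
    """Return the configured key, or the value of `env_var` if none is configured."""
    key = configured or os.environ.get(env_var, "")
    if not key:
        # Fail early with a clear message instead of sending an unauthenticated request.
        raise RuntimeError(f"No API key configured and {env_var} is not set")
    return key
```

A value set explicitly in `config/config.py` takes precedence, so existing setups would keep working unchanged.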


@ -5,7 +5,7 @@ from dashscope.api_entities.dashscope_response import Role
from config.config import DASHSCOPE_API_KEY
from util.logger import get_logger
from util.prompt_loader import load_system_prompt, load_wash_prompt
dashscope.api_key = DASHSCOPE_API_KEY
@ -39,3 +39,31 @@ def call_qwen_single_turn(query: str) -> str:
            response.code, response.message
        ))
        return ""


def call_qwen_Psychology_QA_Pairs(query: str) -> str:
    """Ask Qwen to judge one QA pair, using the cleaning prompt as the system message."""
    messages = [
        {
            'role': Role.SYSTEM,
            'content': load_wash_prompt()
        },
        {
            'role': Role.USER,
            'content': query
        }
    ]
    response = Generation.call(
        model='qwen-max-1201',
        messages=messages,
        result_format='message',
        stream=False,
        incremental_output=False
    )
    if response.status_code == HTTPStatus.OK:
        # The reply is expected to be "1" (psychology) or "0" (not psychology).
        return response.output.choices[0]['message']['content']
    else:
        logger.error('Request id: %s, Status code: %s, error code: %s, error message: %s' % (
            response.request_id, response.status_code,
            response.code, response.message
        ))
        return ""


@ -4,11 +4,39 @@ import json
import glob
from typing import List, Dict
from config.config import data_dir, judge_dir
from util.logger import get_logger
logger = get_logger()
"""
递归获取 数据整合 下的所有 .jsonl 文件列表
"""
def get_jsonl_file_paths() -> List[str]:
json_file_paths = []
# 遍历根目录及其所有子目录
for dirpath, dirnames, filenames in os.walk(judge_dir):
# 对每个文件进行检查
for filename in filenames:
# 使用正则表达式匹配以.jsonl结尾的文件名
if re.search(r'\.jsonl$', filename):
# 构建完整的文件路径并添加到列表中
json_file_path = os.path.join(dirpath, filename)
json_file_paths.append(json_file_path)
return json_file_paths
def get_QA_pairs(json_path):
with open(json_path, 'r', encoding='utf-8') as f:
content = f.read().strip()
# 按照换行符分割字符串
QA_Pairs = content.split('\n')
return QA_Pairs
"""
Recursively collect all .txt files under data_dir
"""
@ -47,7 +75,7 @@ def get_txt_content(
    res = []
    sentences_amount = len(sentences)
    start_index, end_index = 0, sentences_amount - window_size
    # check length
    if window_size < overlap_size:
        logger.error("window_size must be greater than or equal to overlap_size")
        return None
@ -56,7 +84,7 @@ def get_txt_content(
        return ['\n'.join(sentences)]
    for i in range(start_index, end_index + 1, overlap_size):
        res.append('\n'.join(sentences[i: i + window_size]))
    return res
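
To make the windowing concrete, here is a self-contained sketch of the same loop shape. `sliding_windows` is a hypothetical stand-in for the part of `get_txt_content` shown above, and its short-input branch is an assumption because that condition falls outside the hunk. Note that, as the loop is written, it advances by `overlap_size`, so with `window_size = 8` and `overlap_size = 2` consecutive chunks share six sentences.

```python
from typing import List


def sliding_windows(sentences: List[str], window_size: int, step: int) -> List[str]:
    """Join sentences into chunks of `window_size`, advancing `step` sentences each time."""
    if len(sentences) <= window_size:
        # Assumed behaviour for short inputs: return everything as a single chunk.
        return ["\n".join(sentences)]
    last_start = len(sentences) - window_size
    return [
        "\n".join(sentences[i: i + window_size])
        for i in range(0, last_start + 1, step)
    ]


sentences = [f"s{i}" for i in range(12)]
chunks = sliding_windows(sentences, window_size=8, step=2)
```

With 12 sentences the windows start at indices 0, 2, and 4, so three chunks are produced and the last one ends at the final sentence.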
@ -80,6 +108,7 @@ def capture_qa(content: str) -> List[Dict]:
logger.warning("No JSON block found.")
return None
"""
Write storage_list to storage_jsonl_path
"""
@ -88,6 +117,7 @@ def save_to_file(storage_jsonl_path, storage_list):
        for item in storage_list:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
"""
Merge the files produced by concurrent workers into a single file
"""
@ -102,5 +132,4 @@ def merge_sub_qa_generation(directory, storage_jsonl_path):
            for line in f:
                file_contents.append(json.loads(line))
        os.remove(file_path)
    save_to_file(storage_jsonl_path, file_contents)
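
The file-discovery and line-splitting helpers added in this file can also be expressed with `pathlib` and per-line JSON parsing. The following is an alternative sketch, not the code in the commit; `find_jsonl_files` and `iter_records` are hypothetical names, and the demo writes to a temporary directory rather than to `judge_dir`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Iterator, List


def find_jsonl_files(root: Path) -> List[Path]:
    """Recursively list every .jsonl file under `root`, in sorted order."""
    return sorted(root.rglob("*.jsonl"))


def iter_records(path: Path) -> Iterator[Dict]:
    """Yield one parsed JSON object per non-empty line."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "book_a").mkdir()
    (root / "book_a" / "qa.jsonl").write_text(
        '{"question": "q1", "answer": "a1"}\n\n{"question": "q2", "answer": "a2"}\n',
        encoding="utf-8",
    )
    (root / "notes.txt").write_text("not a QA file", encoding="utf-8")
    files = find_jsonl_files(root)
    records = [record for path in files for record in iter_records(path)]
```

Skipping blank lines and parsing each record as it is read avoids passing empty strings downstream, which a plain `split('\n')` would produce for files containing empty lines.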


@ -1,7 +1,14 @@
from config.config import system_prompt_file_path
from config.config import wash_prompt_file_path


def load_system_prompt() -> str:
    with open(system_prompt_file_path, 'r', encoding='utf-8') as f:
        system_prompt = f.read()
    return system_prompt


def load_wash_prompt() -> str:
    with open(wash_prompt_file_path, 'r', encoding='utf-8') as f:
        wash_prompt = f.read()
    return wash_prompt