Merge pull request #40 from chg0901/main

[Update] README files translation
This commit is contained in:
xzw 2024-03-03 19:27:49 +08:00 committed by GitHub
commit 88b1794f7e
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194
47 changed files with 704926 additions and 100 deletions

187
README.md
View File

@ -1,16 +1,13 @@
# EmoLLM-心理健康大模型
[Contributors][contributors-url]
[Forks][forks-url]
[Stargazers][stars-url]
[Issues][issues-url]
[MIT License][license-url]
<!-- [![LinkedIn][linkedin-shield]][linkedin-url] -->
<!-- PROJECT LOGO -->
<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-shield]][stars-url]
<br />
<!-- PROJECT LOGO -->
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
@ -18,7 +15,10 @@
</a>
<h3 align="center">EmoLLM</h3>
<p align="center">
简体中文 | <a href="README_English_version.md">English</a>
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM"><strong>探索本项目的文档 »</strong></a>
<br />
@ -34,7 +34,20 @@
<!-- 本篇README.md面向开发者 -->
**EmoLLM**是一个能够支持 **理解用户-支持用户-帮助用户** 心理健康辅导链路的心理健康大模型,由[InternLM2](https://github.com/InternLM/InternLM)指令微调而来欢迎大家star~⭐⭐
**EmoLLM** 是一系列能够支持 **理解用户-支持用户-帮助用户** 心理健康辅导链路的心理健康大模型,由 `LLM`指令微调而来欢迎大家star~⭐⭐。目前已经开源的 `LLM`微调配置如下:
| 模型 | 类型 |
| :-------------------: | :------: |
| InternLM2_7B_chat | qlora |
| InternLM2_1_8B_chat | 全量微调 |
| Qwen_7b_chat | qlora |
| Qwen1_5-0_5B-Chat | 全量微调 |
| Baichuan2_13B_chat | qlora |
| ChatGLM3_6B | lora |
| DeepSeek MoE_16B_chat | qlora |
| Mixtral 8x7B_instruct | qlora |
| …… | …… |
欢迎大家为本项目做出贡献~
---
@ -49,61 +62,86 @@
- 预防和干预措施:心理健康大模型还包括预防心理问题和促进心理健康的策略,如心理教育、心理咨询、心理治疗和社会支持系统。
- 评估和诊断工具:为了有效促进心理健康,需要有科学的工具来评估个体的心理状态,以及诊断可能存在的心理问题。
### 最近更新
- 【2024.2.29】更新客观评估计算,详见[evaluate](./evaluate/),更新一系列数据集,详见[datasets](./datasets/)。
- 【2024.2.27】更新英文readme和一系列数据集(舔狗和单轮对话)
- 【2024.2.23】推出基于InternLM2_7B_chat_qlora的 `温柔御姐心理医生艾薇`,[点击获取模型权重](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei)、[配置文件](xtuner_config/aiwei-internlm2_chat_7b_qlora.py)、[在线体验链接](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
- 【2024.2.23】更新[若干微调配置](/xtuner_config/),新增 [data_pro.json](/datasets/data_pro.json)(数量更多、场景更全、更丰富)和 [aiwei.json](/datasets/aiwei.json)(温柔御姐角色扮演专用,带有Emoji表情),即将推出 `温柔御姐心理医生艾薇`
- 【2024.2.18】 [基于Qwen1_5-0_5B-Chat全量微调版本开源](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary),算力有限的道友可以玩起来~
- 【2024.2.6】 EmoLLM在[**OpenXLab**](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model)平台下载量高达18.7k,欢迎大家体验!
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="模型下载量">
</p>
<details>
<summary>查看更多</summary>
- 【2024.2.5】 项目荣获公众号**NLP工程化**推文宣传[推文链接](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A),为博主推广一波,欢迎大家关注!!🥳🥳
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/47868d6a-2e91-4aa9-a630-e594c14295b4" alt="公众号二维码">
</p>
- 【2024.2.3】 [项目宣传视频](https://www.bilibili.com/video/BV1N7421N76X/)完成 😊
- 【2024.1.27】 完善数据构建文档、微调指南、部署指南、Readme等相关文档 👏
- 【2024.1.25】 完成EmoLLM第一版并部署上线 https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
</details>
## 目录
- [EmoLLM-心理健康大模型](#emollm-心理健康大模型)
- [最近更新](#最近更新)
- [目录](#目录)
- [开发前的配置要求](#开发前的配置要求)
- [**安装步骤**](#安装步骤)
- [**使用指南**](#使用指南)
- [文件目录说明](#文件目录说明)
- [数据构建](#数据构建)
- [微调指南](#微调指南)
- [demo部署](#demo部署)
- [部署指南](#部署指南)
- [使用到的框架](#使用到的框架)
- [贡献者](#贡献者)
- [如何参与开源项目](#如何参与开源项目)
- [如何参与本项目](#如何参与本项目)
- [版本控制](#版本控制)
- [作者](#作者)
- [作者(排名不分先后)](#作者排名不分先后)
- [版权说明](#版权说明)
- [鸣谢](#鸣谢)
- [特别鸣谢](#特别鸣谢)
- [Star History](#star-history)
- [🌟 Contributors](#-contributors)
###### 开发前的配置要求
详见[部署要求](https://github.com/aJupyter/EmoLLM/tree/main/%E9%85%8D%E7%BD%AE%E8%A6%81%E6%B1%82)
- 硬件:A100 40G(仅针对InternLM2_7B_chat+qlora微调+deepspeed zero2优化)
###### **安装步骤**
###### **使用指南**
1. Get a free API Key at [https://example.com](https://example.com)
2. Clone the repo
1. Clone the repo
```sh
git clone https://github.com/aJupyter/EmoLLM.git
git clone https://github.com/SmartFlowAI/EmoLLM.git
```
2. 依次阅读或者选择感兴趣的部分阅读:
- [文件目录说明](#文件目录说明)
- [数据构建](#数据构建)
- [微调指南](#微调指南)
- [部署指南](#部署指南)
- 查看更多详情
<details>
<summary>更多详情</summary>
### 文件目录说明
eg:
```
filetree
├── ARCHITECTURE.md
├── LICENSE.txt
├── README.md
├── /account/
├── /bbs/
├── /docs/
│ ├── /rules/
│ │ ├── backend.txt
│ │ └── frontend.txt
├── manage.py
├── /oa/
├── /static/
├── /templates/
├── useless.md
└── /util/
├─assets:图像资源
├─datasets:数据集
├─demo:demo脚本
├─generate_data:生成数据指南
│  └─xinghuo
├─scripts:一些可用工具
└─xtuner_config:微调指南
   └─images
```
### 数据构建
@ -116,21 +154,18 @@ filetree
详见[微调指南](xtuner_config/README.md)
### demo部署
### 部署指南
详见[demo](https://github.com/aJupyter/EmoLLM/demo)
详见[部署指南](demo/README.md)
### 使用到的框架
- [xxxxxxx](https://getbootstrap.com)
- [xxxxxxx](https://jquery.com)
- [xxxxxxx](https://laravel.com)
- [Xtuner](https://github.com/InternLM/xtuner)
- [Transformers](https://github.com/huggingface/transformers)
- [Pytorch](https://pytorch.org/)
- …
### 贡献者
请阅读**CONTRIBUTING.md** 查阅为该项目做出贡献的开发者。
#### 如何参与开源项目
#### 如何参与本项目
贡献使开源社区成为一个学习、激励和创造的绝佳场所。你所作的任何贡献都是**非常感谢**的。
@ -144,15 +179,33 @@ filetree
该项目使用Git进行版本管理。您可以在repository参看当前可用版本。
</details>
### 作者(排名不分先后)
[aJupyter](https://github.com/aJupyter)@datawhale成员、南开大学在读硕士
[jujimeizup](https://github.com/jujimeizuo)@
[jujimeizuo](https://github.com/jujimeizuo)@江南大学在读硕士
[Smiling&amp;Weeping](https://github.com/Smiling-Weeping-zhr)@
[Smiling&amp;Weeping](https://github.com/Smiling-Weeping-zhr)@哈尔滨工业大学(威海)在读本科生
[Farewell](https://github.com/8baby8)@
[Farewell](https://github.com/8baby8)@飞桨领航团区域主管、文心大模型核心开发者
[ZhouXinAo](https://github.com/zxazys)@南开大学在读硕士
[MING_X](https://github.com/MING-ZCH)@华中科技大学在读本科生
[Z_L](https://github.com/JasonLLLLLLLLLLL)@swufe
[MrCatAI](https://github.com/MrCatAI)@AI搬运工
[ZeyuBa](https://github.com/ZeyuBa)@自动化所在读硕士
[aiyinyuedejustin](https://github.com/aiyinyuedejustin)@宾夕法尼亚大学在读硕士
[Nobody-ML](https://github.com/Nobody-ML)@中国石油大学(华东)在读本科生
[chg0901](https://github.com/chg0901)@韩国光云大学博士生
### 版权说明
@ -163,6 +216,7 @@ filetree
- [Sanbu](https://github.com/sanbuphy)
- [上海人工智能实验室](https://www.shlab.org.cn/)
- [闻星大佬(小助手)](https://github.com/vansin)
- [扫地升(公众号宣传)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
<!-- links -->
@ -170,22 +224,23 @@ filetree
<!-- [linkedin-url]: https://linkedin.com/in/aJupyter -->
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=aJupyter/EmoLLM&type=Date)](https://star-history.com/#aJupyter/EmoLLM&Date)
## 🌟 Contributors
[![EmoLLM contributors](https://contrib.rocks/image?repo=aJupyter/EmoLLM&max=50)](https://github.com/aJupyter/EmoLLM/graphs/contributors)
[![EmoLLM contributors](https://contrib.rocks/image?repo=SmartFlowAI/EmoLLM&max=50)](https://github.com/SmartFlowAI/EmoLLM/graphs/contributors)
[your-project-path]: aJupyter/EmoLLM
[contributors-shield]: https://img.shields.io/github/contributors/aJupyter/EmoLLM.svg?style=flat-square
[contributors-url]: https://github.com/aJupyter/EmoLLM/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/aJupyter/EmoLLM.svg?style=flat-square
[forks-url]: https://github.com/aJupyter/EmoLLM/network/members
[stars-shield]: https://img.shields.io/github/stars/aJupyter/EmoLLM.svg?style=flat-square
[stars-url]: https://github.com/aJupyter/EmoLLM/stargazers
[issues-shield]: https://img.shields.io/github/issues/aJupyter/EmoLLM.svg?style=flat-square
[issues-url]: https://img.shields.io/github/issues/aJupyter/EmoLLM.svg
[license-shield]: https://img.shields.io/github/license/aJupyter/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/aJupyter/EmoLLM/blob/main/LICENSE
[your-project-path]: SmartflowAI/EmoLLM
[contributors-shield]: https://img.shields.io/github/contributors/SmartflowAI/EmoLLM.svg?style=flat-square
[contributors-url]: https://github.com/SmartflowAI/EmoLLM/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/SmartflowAI/EmoLLM.svg?style=flat-square
[forks-url]: https://github.com/SmartflowAI/EmoLLM/network/members
[stars-shield]: https://img.shields.io/github/stars/SmartflowAI/EmoLLM.svg?style=flat-square
[stars-url]: https://github.com/SmartflowAI/EmoLLM/stargazers
[issues-shield]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg?style=flat-square
[issues-url]: https://github.com/SmartflowAI/EmoLLM/issues
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE

251
README_EN.md Normal file
View File

@ -0,0 +1,251 @@
# EmoLLM - Large Language Model for Mental Health
<!-- PROJECT SHIELDS -->
[![Contributors][contributors-shield]][contributors-url]
[![Forks][forks-shield]][forks-url]
[![Issues][issues-shield]][issues-url]
[![MIT License][license-shield]][license-url]
[![Stargazers][stars-shield]][stars-url]
<br />
<!-- PROJECT LOGO -->
<p align="center">
<a href="https://github.com/aJupyter/EmoLLM/">
<img src="assets/logo.jpeg" alt="Logo" width="30%">
</a>
<h3 align="center">EmoLLM</h3>
<p align="center">
<a href="README.md">简体中文</a> | English
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM"><strong>Explore the documentation of this project »</strong></a>
<br />
<br />
<a href="https://github.com/aJupyter/EmoLLM/tree/main/demo">View the Demo</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">Report a Bug</a>
·
<a href="https://github.com/aJupyter/EmoLLM/issues">Propose a New Feature</a>
</p>
</p>
<!-- 本篇README.md面向开发者 -->
**EmoLLM** is a series of large language models for mental health that support the **understand the user - support the user - help the user** counseling chain. They are instruction fine-tuned from base LLMs. We would really appreciate it if you could give the project a star~⭐⭐. The fine-tuning configurations open-sourced so far are as follows:
| model | type |
| :-------------------: | :------: |
| InternLM2_7B_chat | qlora |
| InternLM2_1_8B_chat | full finetuning |
| Qwen_7b_chat | qlora |
| Qwen1_5-0_5B-Chat | full finetuning |
| Baichuan2_13B_chat | qlora |
| ChatGLM3_6B | lora |
| DeepSeek MoE_16B_chat | qlora |
| Mixtral 8x7B_instruct | qlora |
| …… | …… |
Everyone is welcome to contribute to this project ~
---
The Model is aimed at fully understanding and promoting the mental health of individuals, groups, and society. This model typically includes the following key components:
- Cognitive factors: Involving an individual's thought patterns, belief systems, cognitive biases, and problem-solving abilities. Cognitive factors significantly impact mental health as they affect how individuals interpret and respond to life events.
- Emotional factors: Including emotion regulation, emotional expression, and emotional experiences. Emotional health is a crucial part of mental health, involving how individuals manage and express their emotions and how they recover from negative emotions.
- Behavioral factors: Concerning an individual's behavior patterns, habits, and coping strategies. This includes stress management skills, social skills, and self-efficacy, which is the confidence in one's abilities.
- Social environment: Comprising external factors such as family, work, community, and cultural background, which have direct and indirect impacts on an individual's mental health.
- Physical health: There is a close relationship between physical and mental health. Good physical health can promote mental health and vice versa.
- Psychological resilience: Refers to an individual's ability to recover from adversity and adapt. Those with strong psychological resilience can bounce back from challenges and learn and grow from them.
- Prevention and intervention measures: The Mental Health Grand Model also includes strategies for preventing psychological issues and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
- Assessment and diagnostic tools: Effective promotion of mental health requires scientific tools to assess individuals' psychological states and diagnose potential psychological issues.
### Recent Updates
- 【2024.2.29】 Updated objective assessment calculations, see [evaluate](./evaluate/) for details. A series of datasets have also been updated, see [datasets](./datasets/) for details.
- 【2024.2.27】 Updated the English README and a series of datasets (the tiangou dataset and single-turn dialogue data)
- 【2024.2.23】The "Gentle Lady Psychologist Ai Wei" based on InternLM2_7B_chat_qlora was launched. [Click here to obtain the model weights](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei), [configuration file](xtuner_config/aiwei-internlm2_chat_7b_qlora.py), [online experience link](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
- 【2024.2.23】Updated [several fine-tuning configurations](/xtuner_config/), added [data_pro.json](/datasets/data_pro.json) (more quantity, more comprehensive scenarios, richer content) and [aiwei.json](/datasets/aiwei.json) (dedicated to the gentle lady role-play, featuring Emoji expressions), the "Gentle Lady Psychologist Ai Wei" is coming soon.
- 【2024.2.18】 The full fine-tuned version based on Qwen1_5-0_5B-Chat has been [open-sourced](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary). Friends with limited computational resources can now dive in and explore it.
- 【2024.2.6】 EmoLLM has reached 18.7k downloads on the [**OpenXLab**](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) platform. Welcome everyone to try it!
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="模型下载量">
</p>
<details>
<summary>View More</summary>
- 【2024.2.5】 The project has been promoted by the official WeChat account NLP Engineering. Here's the [link](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A) to the article. Welcome everyone to follow!! 🥳🥳
<p align="center">
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/47868d6a-2e91-4aa9-a630-e594c14295b4" alt="公众号二维码">
</p>
- 【2024.2.3】 [Project Video](https://www.bilibili.com/video/BV1N7421N76X/) released on bilibili 😊
- 【2024.1.27】 Completed the data construction documentation, fine-tuning guide, deployment guide, README, and other related documents 👏
- 【2024.1.25】 Completed the first version of EmoLLM and deployed it online at https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
</details>
## Contents
- [EmoLLM - Large Language Model for Mental Health](#emollm---large-language-model-for-mental-health)
- [Everyone is welcome to contribute to this project ~](#everyone-is-welcome-to-contribute-to-this-project-)
- [Recent Updates](#recent-updates)
- [Contents](#contents)
- [Pre-development Configuration Requirements.](#pre-development-configuration-requirements)
- [**User Guide**](#user-guide)
- [File Directory Explanation](#file-directory-explanation)
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- [Frameworks Used](#frameworks-used)
- [How to participate in this project](#how-to-participate-in-this-project)
- [Version control](#version-control)
- [Authors (in no particular order)](#authors-in-no-particular-order)
- [Copyright Notice](#copyright-notice)
- [Acknowledgments](#acknowledgments)
- [Star History](#star-history)
- [🌟 Contributors](#-contributors)
###### Pre-development Configuration Requirements.
- A100 40G (specifically for InternLM2_7B_chat + qlora fine-tuning + deepspeed zero2 optimization)
###### **User Guide**
1. Clone the repo
```sh
git clone https://github.com/SmartFlowAI/EmoLLM.git
```
2. Read in sequence, or pick the sections you're interested in:
- [File Directory Explanation](#file-directory-explanation)
- [Data Construction](#data-construction)
- [Fine-tuning Guide](#fine-tuning-guide)
- [Deployment Guide](#deployment-guide)
- View More Details
<details>
<summary>Additional Details</summary>
### File Directory Explanation
```
├─assets: Image Resources
├─datasets: Dataset
├─demo: demo scripts
├─generate_data: Data Generation Guide
│  └─xinghuo
├─scripts: Some Available Tools
└─xtuner_config: Fine-tuning Guide
   └─images
```
### Data Construction
Please read the [Data Construction Guide](generate_data/tutorial.md) for reference.
The dataset used for this fine-tuning can be found at [datasets](datasets/data.json)
### Fine-tuning Guide
For details, see the [fine-tuning guide](xtuner_config/README.md)
### Deployment Guide
For details, see the [deployment guide](demo/README.md)
### Frameworks Used
- [Xtuner](https://github.com/InternLM/xtuner)
- [Transformers](https://github.com/huggingface/transformers)
- [Pytorch](https://pytorch.org/)
- …
#### How to participate in this project
Contributions make the open-source community an excellent place for learning, inspiration, and creation. Any contribution you make is greatly appreciated.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
### Version control
This project uses Git for version control. You can see the current available versions in the repository.
</details>
### Authors (in no particular order)
[aJupyter](https://github.com/aJupyter)@Datawhale member, Master's student at Nankai University
[jujimeizuo](https://github.com/jujimeizuo)@Master's student at Jiangnan University
[Smiling&amp;Weeping](https://github.com/Smiling-Weeping-zhr)@Undergraduate student at Harbin Institute of Technology (Weihai)
[Farewell](https://github.com/8baby8)@Regional head of the PaddlePaddle Pilot Group (飞桨领航团), core developer of the Wenxin LLM
[ZhouXinAo](https://github.com/zxazys)@Master's student at Nankai University
[MING_X](https://github.com/MING-ZCH)@Undergraduate student at Huazhong University of Science and Technology
[Z_L](https://github.com/JasonLLLLLLLLLLL)@swufe
[MrCatAI](https://github.com/MrCatAI)@AI porter
[ZeyuBa](https://github.com/ZeyuBa)@Master's student at Institute of Automation
[aiyinyuedejustin](https://github.com/aiyinyuedejustin)@Master's student at University of Pennsylvania
[Nobody-ML](https://github.com/Nobody-ML)@Undergraduate student at China University of Petroleum (East China)
[chg0901](https://github.com/chg0901)@Ph.D. student at Kwangwoon University, South Korea
### Copyright Notice
This project is licensed under the MIT License. For details, please refer to
[LICENSE](https://github.com/aJupyter/EmoLLM/blob/master/LICENSE)
### Acknowledgments
- [Sanbu](https://github.com/sanbuphy)
- [Shanghai Artificial Intelligence Laboratory](https://www.shlab.org.cn/)
- [Vanin](https://github.com/vansin)
- [Bloom up (WeChat Official Account Promotion)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
<!-- links -->
<!-- [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=flat-square&logo=linkedin&colorB=555 -->
<!-- [linkedin-url]: https://linkedin.com/in/aJupyter -->
<!-- Too few to be worth listing -->
## Star History
[![Star History Chart](https://api.star-history.com/svg?repos=aJupyter/EmoLLM&type=Date)](https://star-history.com/#aJupyter/EmoLLM&Date)
## 🌟 Contributors
[![EmoLLM contributors](https://contrib.rocks/image?repo=SmartFlowAI/EmoLLM&max=50)](https://github.com/SmartFlowAI/EmoLLM/graphs/contributors)
[your-project-path]: SmartflowAI/EmoLLM
[contributors-shield]: https://img.shields.io/github/contributors/SmartflowAI/EmoLLM.svg?style=flat-square
[contributors-url]: https://github.com/SmartflowAI/EmoLLM/graphs/contributors
[forks-shield]: https://img.shields.io/github/forks/SmartflowAI/EmoLLM.svg?style=flat-square
[forks-url]: https://github.com/SmartflowAI/EmoLLM/network/members
[stars-shield]: https://img.shields.io/github/stars/SmartflowAI/EmoLLM.svg?style=flat-square
[stars-url]: https://github.com/SmartflowAI/EmoLLM/stargazers
[issues-shield]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg?style=flat-square
[issues-url]: https://github.com/SmartflowAI/EmoLLM/issues
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE

3
app.py
View File

@ -1,2 +1,3 @@
import os
os.system('streamlit run web_internlm2.py --server.address=0.0.0.0 --server.port 7860')
# os.system('streamlit run web_internlm2.py --server.address=0.0.0.0 --server.port 7860')
os.system('streamlit run web_demo-aiwei.py --server.address=0.0.0.0 --server.port 7860')

BIN
assets/aiwei_logo.jpg Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 133 KiB

BIN
assets/deploy_1.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 766 KiB

BIN
assets/deploy_2.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 315 KiB

BIN
assets/deploy_3.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 61 KiB

BIN
assets/deploy_4.png Normal file

Binary file not shown.

After

Width:  |  Height:  |  Size: 643 KiB

101142
datasets/SoulStar_data.json Normal file

File diff suppressed because one or more lines are too long

24870
datasets/aiwei.json Normal file

File diff suppressed because it is too large Load Diff

138359
datasets/data_pro.json Normal file

File diff suppressed because one or more lines are too long

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

20386
datasets/tiangou.json Normal file

File diff suppressed because it is too large Load Diff

59
demo/README.md Normal file
View File

@ -0,0 +1,59 @@
# EmoLLM 部署指南
## 本地部署
- Clone repo
```bash
git clone https://github.com/aJupyter/EmoLLM.git
```
- 安装依赖
```bash
pip install -r requirements.txt
```
- 下载模型
- 模型权重:https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model
- 通过 openxlab.model.download 下载,详情请看 [cli_internlm2](./cli_internlm2.py)
```python
from openxlab.model import download
download(model_repo='jujimeizuo/EmoLLM_Model', output='model')
```
- 可以手动下载,放在 `./model` 目录下,然后把上面的代码删掉
- cli_demo
```bash
python ./demo/cli_internlm2.py
```
- web_demo
```bash
python ./app.py
```
如果在服务器上部署,需要配置本地端口映射
## OpenXLab 上部署
- 登录 OpenXLab,创建 Gradio 应用
![Login OpenXLab](../assets/deploy_1.png)
- 选择配置,立即创建
![config](../assets/deploy_2.png)
- 等待构建、启动
![wait a minutes](../assets/deploy_3.png)
- 项目体验
![enjoy](../assets/deploy_4.png)

59
demo/README_EN.md Normal file
View File

@ -0,0 +1,59 @@
# Deploying Guide for EmoLLM
## Local Deployment
- Clone repo
```bash
git clone https://github.com/aJupyter/EmoLLM.git
```
- Install dependencies
```bash
pip install -r requirements.txt
```
- Download the model
- Model weights: https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model
- Download via openxlab.model.download, see [cli_internlm2](./cli_internlm2.py) for details
```python
from openxlab.model import download
download(model_repo='jujimeizuo/EmoLLM_Model', output='model')
```
- You can also download manually and place it in the `./model` directory, then delete the above code.
- cli_demo
```bash
python ./demo/cli_internlm2.py
```
- web_demo
```bash
python ./app.py
```
If deploying on a server, you need to configure local port mapping.
## Deploy on OpenXLab
- Log in to OpenXLab and create a Gradio application
![Login OpenXLab](../assets/deploy_1.png)
- Select configurations and create the project
![config](../assets/deploy_2.png)
- Wait for the build and startup
![wait a minutes](../assets/deploy_3.png)
- Try your own project
![enjoy](../assets/deploy_4.png)

View File

@ -1,8 +1,11 @@
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from openxlab.model import download
download(model_repo='jujimeizuo/EmoLLM_Model',
output='model')
model_name_or_path = "./model"
model_name_or_path = "model"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')

View File

@ -0,0 +1,49 @@
# EmoLLM通用指标评估
## 简介
本文档提供了关于如何使用 `eval.py` 和 `metric.py` 两个脚本的指导。这些脚本用于评估 EmoLLM-心理健康大模型的生成结果。
## 安装
- Python 3.x
- PyTorch
- Transformers
- Datasets
- NLTK
- Rouge
- Jieba
可以使用以下命令安装:
```bash
pip install torch transformers datasets nltk rouge jieba
```
## 用法
### convert.py
将原始多轮对话数据转换为测评用的单轮数据。
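For illustration, the sketch below shows the shape of one converted record. The dialogue text is invented for this example, and the plain-text prompt template follows the conversion helper included elsewhere in this PR (the InternLM evaluation script additionally wraps turns in ChatML `<|im_start|>` tags):

```python
# Toy example of the conversion: a multi-turn record is flattened into one single-turn sample.
# The dialogue content is invented; only the record structure matters.
raw = {
    "conversation": [
        {
            "system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。",
            "input": "我最近总是失眠。",
            "output": "可以说说最近让你压力比较大的事情吗?",
        },
        {"input": "主要是工作上的事情。", "output": "我们可以一起梳理一下你的工作安排。"},
    ]
}

# All turns except the last doctor reply are folded into "instruction"; the last reply becomes "output".
converted = {
    "instruction": (
        raw["conversation"][0]["system"]
        + "\n\n对话:"
        + "\n来访者:我最近总是失眠。"
        + "\n医生:可以说说最近让你压力比较大的事情吗?"
        + "\n来访者:主要是工作上的事情。"
        + "\n医生:"
    ),
    "output": "我们可以一起梳理一下你的工作安排。",
}
```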
### eval.py
`eval.py` 脚本用于生成医生的回复并进行评估,主要分为以下几部分:
1. 加载模型和分词器。
2. 设置测试参数,如测试数据数量和批处理大小。
3. 准备数据。
4. 生成响应并评估。
### metric.py
`metric.py` 脚本包含计算评估指标的函数,可设置按字符级别或按词级别进行评估,目前包含 BLEU 和 ROUGE 分数。
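A minimal usage sketch (the two sentences are toy data; `compute_metrics` is the function defined in `metric.py` and, as in `eval.py`, takes a `(predictions, references)` tuple):

```python
# Minimal usage sketch for metric.py (toy sentences; word-level evaluation via jieba by default).
from metric import compute_metrics

predictions = ["今天天气很好,适合出去散步。"]
references = ["今天天气不错,适合外出散步。"]

# Returns ROUGE-1/2/L F-scores (in %) plus BLEU-1..4 under the "bleu" key.
print(compute_metrics((predictions, references)))
```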
## 测试结果
对 data.json 中的数据进行测试,结果如下:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Qwen1_5-0_5B-Chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |

View File

@ -0,0 +1,111 @@
from transformers import AutoModelForCausalLM, AutoTokenizer,DataCollatorWithPadding
from qwen_generation_utils import decode_tokens
import torch
import datasets
model_dir = './model'
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", padding_side='left',trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto",pad_token_id=tokenizer.eos_token_id, trust_remote_code=True, torch_dtype=torch.float16)
# (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
# InternLM 7B in 4bit will cost nearly 8GB GPU memory.
# pip install -U bitsandbytes
# 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
# 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
model = model.eval()
# # convert data
# import ujson
# def transform_conversation_data(raw_data):
# try:
# instruction = '<|im_start|>system\n'+raw_data.get("conversation", "")[0]['system'] + "<|im_end|>\n"
# conversation = raw_data.get("conversation", [])
# for i, dialog in enumerate(conversation):
# instruction += "<|im_start|>user\n来访者" + dialog["input"]+ "<|im_end|>\n"
# if i < len(conversation) - 1:
# instruction += "<|im_start|>assistant\n医生" + dialog["output"]+"<|im_end|>\n"
# response = conversation[-1]["output"] if conversation else ""
# instruction +="<|im_start|>assistant\n医生"
# return {"instruction": instruction, "output": response}
# except Exception as e:
# pass
# with open(f'./data_dir/data.json', 'r', encoding='utf-8') as f1:
# data = ujson.load(f1)
# with open(f'./data_dir/converted.json', 'w', encoding='utf-8') as f:
# for j, item in enumerate(data):
# temp=transform_conversation_data(item)
# if temp:
# transformed_data =ujson.dumps(temp, ensure_ascii=False)
# f.write(transformed_data+'\n')
# set test params
test_num = 1596  # number of test samples
batch_size = 12
#prepare data and dataloader
dataset = datasets.load_dataset('json', data_files='./data_dir/converted.json',split=f"train[:{test_num}]")
references =dataset['output'][:test_num]
hypotheses = []
def preprocess(data):
length = list(map(len, data['instruction']))
model_inputs=tokenizer(data['instruction'], max_length=512, truncation=True )
labels=tokenizer(data['output'], padding=True,max_length=128, truncation=True )
model_inputs['labels']=labels['input_ids']
model_inputs['length'] = length
return model_inputs
preprocessed_dataset = dataset.map(preprocess, batched=True,remove_columns=['instruction', 'output',])
collator=DataCollatorWithPadding(tokenizer=tokenizer,)
from torch.utils.data import DataLoader
dataloader = DataLoader(preprocessed_dataset, batch_size=batch_size, collate_fn=collator)
#generate responses
stop_word="<|im_end|>"
for batch in dataloader:
batch_input_ids = torch.LongTensor(batch['input_ids']).to(model.device)
batch_labels = batch['labels']
attention_mask = batch['attention_mask']
length = batch['length']
batch_out_ids = model.generate(
batch_input_ids.to(model.device),
return_dict_in_generate=False,
max_new_tokens=256,
do_sample=True,
temperature=0.1,
eos_token_id=92542
)
padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
batch_response = [
decode_tokens(
batch_out_ids[i][padding_lens[i]:],
tokenizer,
context_length=0,
raw_text_len=length[i],
chat_format="raw",
verbose=False,
errors='replace'
).replace("医生:","") for i in range(batch_size)]
hypotheses.extend([r.replace(stop_word," ").split()[0] for r in batch_response if stop_word in r])
# Load metric
from metric import compute_metrics
print(compute_metrics((hypotheses,references)))

View File

@ -0,0 +1,30 @@
# EmoLLM专业指标评估
## 简介
本文档介绍一种专业评测方法,并提供 EmoLLM 在专业指标的得分。
## 评测方法
本评测方法采用论文《CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling》提出的评测指标与方法。
* 指标:Comprehensiveness, Professionalism, Authenticity, Safety
* 方法:Turn-Based Dialogue Evaluation(得分聚合方式的示意见下方代码)
* 数据集:CPsyCounE
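The scoring prompts and rubric follow the CPsyCoun paper. Purely to illustrate how turn-based scores can be aggregated into a per-metric report (an assumption made for this sketch, not the paper's exact pipeline):

```python
# Illustrative aggregation of turn-based scores (assumed layout; not the CPsyCoun paper's exact pipeline).
# Assume each counselling turn has already been rated on the four dimensions.
from statistics import mean

turn_scores = [  # one dict per dialogue turn (toy numbers)
    {"Comprehensiveness": 1, "Professionalism": 2, "Authenticity": 2, "Safety": 1},
    {"Comprehensiveness": 2, "Professionalism": 3, "Authenticity": 2, "Safety": 1},
]

report = {metric: mean(turn[metric] for turn in turn_scores) for metric in turn_scores[0]}
print(report)  # {'Comprehensiveness': 1.5, 'Professionalism': 2.5, 'Authenticity': 2, 'Safety': 1}
```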
## 评测结果
评测模型: [EmoLLM](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model)(InternLM2-7B-chat + qlora), 得分:
| Metric | Value |
|-------------------|------------|
| Comprehensiveness | 1.32 |
| Professionalism | 2.20 |
| Authenticity | 2.10 |
| Safety | 1.00 |
## 比较
* [EmoLLM](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) 在 InternLM2-7B-Chat 基础上提升较大;相比 Role-playing ChatGPT 在心理咨询任务上能力相近
* 对比结果图片来源于论文《CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling》
![image](https://github.com/MING-ZCH/EmoLLM/assets/119648793/abc9f626-11bc-4ec8-84a4-427c4600a720)

View File

@ -0,0 +1,80 @@
from transformers import AutoModelForCausalLM, AutoTokenizer,DataCollatorWithPadding
from qwen_generation_utils import decode_tokens
import torch
import datasets
#load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(
'./EmoLLM_Qwen1_5-0_5B-Chat_full_sft',
pad_token='<|extra_0|>',
eos_token='<|endoftext|>',
padding_side='left',
trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
'./EmoLLM_Qwen1_5-0_5B-Chat_full_sft',
pad_token_id=tokenizer.pad_token_id,
device_map="cuda:0",
trust_remote_code=True
).eval()
# set test params
test_num = 1596  # number of test samples
batch_size = 12
#prepare data and dataloader
dataset = datasets.load_dataset('json', data_files='./data_dir/converted.json',split=f"train[:{test_num}]")
references =dataset['output'][:test_num]
hypotheses = []
def preprocess(data):
length = list(map(len, data['instruction']))
model_inputs=tokenizer(data['instruction'], max_length=512, truncation=True )
labels=tokenizer(data['output'], padding=True,max_length=128, truncation=True )
model_inputs['labels']=labels['input_ids']
model_inputs['length'] = length
return model_inputs
preprocessed_dataset = dataset.map(preprocess, batched=True,remove_columns=['instruction', 'output',])
collator=DataCollatorWithPadding(tokenizer=tokenizer,)
from torch.utils.data import DataLoader
dataloader = DataLoader(preprocessed_dataset, batch_size=batch_size, collate_fn=collator)
#generate responses
for batch in dataloader:
batch_input_ids = torch.LongTensor(batch['input_ids']).to(model.device)
batch_labels = batch['labels']
attention_mask = batch['attention_mask']
length = batch['length']
batch_out_ids = model.generate(
batch_input_ids.to(model.device),
return_dict_in_generate=False,
max_new_tokens=256,
temperature=0.1,
pad_token_id=tokenizer.eos_token_id
)
padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
batch_response = [
decode_tokens(
batch_out_ids[i][padding_lens[i]:],
tokenizer,
context_length=0,
raw_text_len=length[i],
chat_format="raw",
verbose=False,
errors='replace'
) for i in range(batch_size)
]
hypotheses.extend(batch_response)
# Load metric
from metric import compute_metrics
print(compute_metrics((hypotheses,references)))

21
evaluate/README.md Normal file
View File

@ -0,0 +1,21 @@
# EmoLLM评测
## 通用指标评测
* 具体指标、方法见 [General_evaluation.md](./General_evaluation.md)
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Qwen1_5-0_5B-Chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
## 专业指标评测
* 具体指标、方法见 [Professional_evaluation.md](./Professional_evaluation.md)
| Metric | Value |
|-------------------|------------|
| Comprehensiveness | 1.32 |
| Professionalism | 2.20 |
| Authenticity | 2.10 |
| Safety | 1.00 |

26
evaluate/README_EN.md Normal file
View File

@ -0,0 +1,26 @@
# EmoLLM Evaluation
## General Metrics Evaluation
* For specific metrics and methods, see [General_evaluation_EN.md](./General_evaluation_EN.md)
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|----------|---------|---------|---------|---------|---------|---------|---------|
| Qwen1_5-0_5B-Chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
| InternLM2_7B_chat | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
## Professional Metrics Evaluation
* For specific metrics and methods, see [Professional_evaluation_EN.md](./Professional_evaluation_EN.md)
| Metric | Value |
|-------------------|------------|
| Comprehensiveness | 1.32 |
| Professionalism | 2.20 |
| Authenticity | 2.10 |
| Safety | 1.00 |

View File

@ -0,0 +1,31 @@
import ujson
def transform_conversation_data(raw_data):
try:
instruction = raw_data.get("conversation", "")[0]['system'] + "\n\n对话:"
conversation = raw_data.get("conversation", [])
for i, dialog in enumerate(conversation):
instruction += "\n来访者:" + dialog["input"]
if i < len(conversation) - 1:
instruction += "\n医生:" + dialog["output"]
response = conversation[-1]["output"] if conversation else ""
instruction += "\n医生:"
return {"instruction": instruction, "output": response}
except Exception as e:
pass
with open(f'./train_dir/data.json', 'r', encoding='utf-8') as f1:
data = ujson.load(f1)
with open(f'./train_dir/converted.json', 'w', encoding='utf-8') as f:
for j, item in enumerate(data):
temp=transform_conversation_data(item)
if temp:
transformed_data =ujson.dumps(temp, ensure_ascii=False)
f.write(transformed_data+'\n')
print('********')

File diff suppressed because it is too large Load Diff

28282
evaluate/data_dir/data.json Normal file

File diff suppressed because it is too large Load Diff

33
evaluate/metric.py Normal file
View File

@ -0,0 +1,33 @@
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge import Rouge
import numpy as np
import jieba
def compute_metrics(eval_pred):
predictions, labels = eval_pred
# 字符级别
# decoded_preds = [" ".join((pred.replace(" ", ""))) for pred in predictions]
# decoded_labels = [" ".join((label.replace(" ", ""))) for label in labels]
# 词级别
decoded_preds = [" ".join(jieba.cut(pred.replace(" ", ""))) for pred in predictions]
decoded_labels = [" ".join(jieba.cut(label.replace(" ", ""))) for label in labels]
rouge = Rouge()
bleu =np.array([0.,0.,0.,0.])
weights = [(1.,0.,0.,0.),(1./2., 1./2.),(1./3., 1./3., 1./3.),(1./4., 1./4., 1./4., 1./4.)]
for decoded_label, decoded_pred in zip(decoded_labels, decoded_preds):
bleu +=np.array( sentence_bleu(
references=[decoded_label.split(' ')],
hypothesis=decoded_pred.split(' '),
smoothing_function=SmoothingFunction().method1,weights=weights
))
bleu /= len(decoded_labels)
result = rouge.get_scores(decoded_preds, decoded_labels, avg=True)
result = {key: value['f'] * 100 for key, value in result.items()}
result["bleu"] = {'bleu_1':bleu[0] * 100,'bleu_2':bleu[1] * 100,'bleu_3':bleu[2] * 100,'bleu_4':bleu[3] * 100}
return result

View File

@ -0,0 +1,416 @@
# Copyright (c) Alibaba Cloud.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
"""Generation support."""
from typing import Tuple, List, Union, Iterable
import numpy as np
import torch
import torch.nn.functional as F
from transformers import PreTrainedTokenizer
from transformers import logging
from transformers.generation import LogitsProcessor
logger = logging.get_logger(__name__)
# Types.
HistoryType = List[Tuple[str, str]]
TokensType = List[int]
BatchTokensType = List[List[int]]
def pad_batch(batch: BatchTokensType, pad_id: int, seq_length: int) -> BatchTokensType:
for tokens in batch:
context_length = len(tokens)
if context_length < seq_length:
tokens.extend([pad_id] * (seq_length - context_length))
return batch
def get_ltor_masks_and_position_ids(
data,
eod_token,
reset_position_ids,
reset_attention_mask,
eod_mask_loss,
):
"""Build masks and position id for left to right model."""
# Extract batch size and sequence length.
micro_batch_size, seq_length = data.size()
# Attention mask (lower triangular).
if reset_attention_mask:
att_mask_batch = micro_batch_size
else:
att_mask_batch = 1
attention_mask = torch.tril(
torch.ones((att_mask_batch, seq_length, seq_length), device=data.device)
).view(att_mask_batch, 1, seq_length, seq_length)
# Loss mask.
loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device)
if eod_mask_loss:
loss_mask[data == eod_token] = 0.0
# Position ids.
position_ids = torch.arange(seq_length, dtype=torch.long, device=data.device)
position_ids = position_ids.unsqueeze(0).expand_as(data)
# We need to clone as the ids will be modifed based on batch index.
if reset_position_ids:
position_ids = position_ids.clone()
if reset_position_ids or reset_attention_mask:
# Loop through the batches:
for b in range(micro_batch_size):
# Find indecies where EOD token is.
eod_index = position_ids[b, data[b] == eod_token]
# Detach indecies from positions if going to modify positions.
if reset_position_ids:
eod_index = eod_index.clone()
# Loop through EOD indecies:
prev_index = 0
for j in range(eod_index.size()[0]):
i = eod_index[j]
# Mask attention loss.
if reset_attention_mask:
attention_mask[b, 0, (i + 1) :, : (i + 1)] = 0
# Reset positions.
if reset_position_ids:
position_ids[b, (i + 1) :] -= i + 1 - prev_index
prev_index = i + 1
# Convert attention mask to binary:
attention_mask = attention_mask < 0.5
return attention_mask, loss_mask, position_ids
def get_batch(context_tokens: torch.LongTensor, eod_id: int):
"""Generate batch from context tokens."""
# Move to GPU.
tokens = context_tokens.contiguous().to(context_tokens.device)
# Get the attention mask and postition ids.
attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
tokens,
eod_id,
reset_position_ids=False,
reset_attention_mask=False,
eod_mask_loss=False,
)
return tokens, attention_mask, position_ids
def get_stop_words_ids(chat_format, tokenizer):
if chat_format == "raw":
stop_words_ids = [tokenizer.encode("Human:"), [tokenizer.eod_id]]
elif chat_format == "chatml":
stop_words_ids = [[tokenizer.im_end_id], [tokenizer.im_start_id]]
else:
raise NotImplementedError(f"Unknown chat format {chat_format!r}")
return stop_words_ids
def make_context(
tokenizer: PreTrainedTokenizer,
query: str,
history: List[Tuple[str, str]] = None,
system: str = "",
max_window_size: int = 6144,
chat_format: str = "chatml",
):
if history is None:
history = []
if chat_format == "chatml":
im_start, im_end = "<|im_start|>", "<|im_end|>"
im_start_tokens = [tokenizer.im_start_id]
im_end_tokens = [tokenizer.im_end_id]
nl_tokens = tokenizer.encode("\n")
def _tokenize_str(role, content):
return f"{role}\n{content}", tokenizer.encode(
role, allowed_special=set()
) + nl_tokens + tokenizer.encode(content, allowed_special=set())
system_text, system_tokens_part = _tokenize_str("system", system)
system_tokens = im_start_tokens + system_tokens_part + im_end_tokens
raw_text = ""
context_tokens = []
for turn_query, turn_response in reversed(history):
query_text, query_tokens_part = _tokenize_str("user", turn_query)
query_tokens = im_start_tokens + query_tokens_part + im_end_tokens
response_text, response_tokens_part = _tokenize_str(
"assistant", turn_response
)
response_tokens = im_start_tokens + response_tokens_part + im_end_tokens
next_context_tokens = nl_tokens + query_tokens + nl_tokens + response_tokens
prev_chat = (
f"\n{im_start}{query_text}{im_end}\n{im_start}{response_text}{im_end}"
)
current_context_size = (
len(system_tokens) + len(next_context_tokens) + len(context_tokens)
)
if current_context_size < max_window_size:
context_tokens = next_context_tokens + context_tokens
raw_text = prev_chat + raw_text
else:
break
context_tokens = system_tokens + context_tokens
raw_text = f"{im_start}{system_text}{im_end}" + raw_text
context_tokens += (
nl_tokens
+ im_start_tokens
+ _tokenize_str("user", query)[1]
+ im_end_tokens
+ nl_tokens
+ im_start_tokens
+ tokenizer.encode("assistant")
+ nl_tokens
)
raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n"
elif chat_format == "raw":
raw_text = query
context_tokens = tokenizer.encode(raw_text)
else:
raise NotImplementedError(f"Unknown chat format {chat_format!r}")
return raw_text, context_tokens
def _decode_default(
tokens: List[int],
*,
stop_words: List[str],
eod_words: List[str],
tokenizer: PreTrainedTokenizer,
raw_text_len: int,
verbose: bool = False,
return_end_reason: bool = False,
errors: str='replace',
):
trim_decode_tokens = tokenizer.decode(tokens, errors=errors)[raw_text_len:]
if verbose:
print("\nRaw Generate: ", trim_decode_tokens)
end_reason = f"Gen length {len(tokens)}"
for stop_word in stop_words:
trim_decode_tokens = trim_decode_tokens.replace(stop_word, "").strip()
for eod_word in eod_words:
if eod_word in trim_decode_tokens:
end_reason = f"Gen {eod_word!r}"
trim_decode_tokens = trim_decode_tokens.split(eod_word)[0]
trim_decode_tokens = trim_decode_tokens.strip()
if verbose:
print("\nEnd Reason:", end_reason)
print("\nGenerate: ", trim_decode_tokens)
if return_end_reason:
return trim_decode_tokens, end_reason
else:
return trim_decode_tokens
def _decode_chatml(
tokens: List[int],
*,
stop_words: List[str],
eod_token_ids: List[int],
tokenizer: PreTrainedTokenizer,
raw_text_len: int,
context_length: int,
verbose: bool = False,
return_end_reason: bool = False,
errors: str='replace'
):
end_reason = f"Gen length {len(tokens)}"
eod_token_idx = context_length
for eod_token_idx in range(context_length, len(tokens)):
if tokens[eod_token_idx] in eod_token_ids:
end_reason = f"Gen {tokenizer.decode([tokens[eod_token_idx]])!r}"
break
trim_decode_tokens = tokenizer.decode(tokens[:eod_token_idx], errors=errors)[raw_text_len:]
if verbose:
print("\nRaw Generate w/o EOD:", tokenizer.decode(tokens, errors=errors)[raw_text_len:])
print("\nRaw Generate:", trim_decode_tokens)
print("\nEnd Reason:", end_reason)
for stop_word in stop_words:
trim_decode_tokens = trim_decode_tokens.replace(stop_word, "").strip()
trim_decode_tokens = trim_decode_tokens.strip()
if verbose:
print("\nGenerate:", trim_decode_tokens)
if return_end_reason:
return trim_decode_tokens, end_reason
else:
return trim_decode_tokens
def decode_tokens(
tokens: Union[torch.LongTensor, TokensType],
tokenizer: PreTrainedTokenizer,
raw_text_len: int,
context_length: int,
chat_format: str,
verbose: bool = False,
return_end_reason: bool = False,
errors: str="replace",
) -> str:
if torch.is_tensor(tokens):
tokens = tokens.cpu().numpy().tolist()
if chat_format == "chatml":
return _decode_chatml(
tokens,
stop_words=[],
eod_token_ids=[tokenizer.im_start_id, tokenizer.im_end_id],
tokenizer=tokenizer,
raw_text_len=raw_text_len,
context_length=context_length,
verbose=verbose,
return_end_reason=return_end_reason,
errors=errors,
)
elif chat_format == "raw":
return _decode_default(
tokens,
stop_words=["<|endoftext|>"],
eod_words=["<|endoftext|>"],
tokenizer=tokenizer,
raw_text_len=raw_text_len,
verbose=verbose,
return_end_reason=return_end_reason,
errors=errors,
)
else:
raise NotImplementedError(f"Unknown chat format {chat_format!r}")
class StopWordsLogitsProcessor(LogitsProcessor):
"""
:class:`transformers.LogitsProcessor` that enforces that when specified sequences appear, stop generation.
Args:
stop_words_ids (:obj:`List[List[int]]`):
List of list of token ids of stop ids. In order to get the tokens of the words
that should not appear in the generated text, use :obj:`tokenizer(bad_word,
add_prefix_space=True).input_ids`.
eos_token_id (:obj:`int`):
The id of the `end-of-sequence` token.
"""
def __init__(self, stop_words_ids: Iterable[Iterable[int]], eos_token_id: int):
if not isinstance(stop_words_ids, List) or len(stop_words_ids) == 0:
raise ValueError(
f"`stop_words_ids` has to be a non-emtpy list, but is {stop_words_ids}."
)
if any(not isinstance(bad_word_ids, list) for bad_word_ids in stop_words_ids):
raise ValueError(
f"`stop_words_ids` has to be a list of lists, but is {stop_words_ids}."
)
if any(
any(
(not isinstance(token_id, (int, np.integer)) or token_id < 0)
for token_id in stop_word_ids
)
for stop_word_ids in stop_words_ids
):
raise ValueError(
f"Each list in `stop_words_ids` has to be a list of positive integers, but is {stop_words_ids}."
)
self.stop_words_ids = list(
filter(
lambda bad_token_seq: bad_token_seq != [eos_token_id], stop_words_ids
)
)
self.eos_token_id = eos_token_id
for stop_token_seq in self.stop_words_ids:
assert (
len(stop_token_seq) > 0
), "Stop words token sequences {} cannot have an empty list".format(
stop_words_ids
)
def __call__(
self, input_ids: torch.LongTensor, scores: torch.FloatTensor
) -> torch.FloatTensor:
stopped_samples = self._calc_stopped_samples(input_ids)
for i, should_stop in enumerate(stopped_samples):
if should_stop:
scores[i, self.eos_token_id] = float(2**15)
return scores
def _tokens_match(self, prev_tokens: torch.LongTensor, tokens: List[int]) -> bool:
if len(tokens) == 0:
# if bad word tokens is just one token always ban it
return True
elif len(tokens) > len(prev_tokens):
# if bad word tokens are longer then prev input_ids they can't be equal
return False
elif prev_tokens[-len(tokens) :].tolist() == tokens:
# if tokens match
return True
else:
return False
def _calc_stopped_samples(self, prev_input_ids: Iterable[int]) -> Iterable[int]:
stopped_samples = []
for prev_input_ids_slice in prev_input_ids:
match = False
for stop_token_seq in self.stop_words_ids:
if self._tokens_match(prev_input_ids_slice, stop_token_seq):
# if tokens do not match continue
match = True
break
stopped_samples.append(match)
return stopped_samples
def top_k_logits(logits, top_k=0, top_p=0.0, filter_value=-float("Inf")):
"""This function has been mostly taken from huggingface conversational
ai code at
https://medium.com/huggingface/how-to-build-a-state-of-the-art-
conversational-ai-with-transfer-learning-2d818ac26313"""
if top_k > 0:
# Remove all tokens with a probability less than the
# last token of the top-k
indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
logits[indices_to_remove] = filter_value
if top_p > 0.0:
# Cconvert to 1D
sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens with cumulative probability above the threshold
sorted_indices_to_remove = cumulative_probs > top_p
# Shift the indices to the right to keep also the first token
# above the threshold
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0
for i in range(sorted_indices.size(0)):
indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]]
logits[i][indices_to_remove] = filter_value
return logits
def switch(val1, val2, boolean):
boolean = boolean.type_as(val1)
return (1 - boolean) * val1 + boolean * val2

View File

@ -0,0 +1,10 @@
# Introduction
* gen_Chat 适用于生成 ChatGLM3-6B 的数据集
* gen_data 适用于生成 InternLM 所需要的数据集
⭐注意事项~
星火大模型V1.5生成特定主题时会出现**安全限制**,模型会拒绝回答,要注意类似数据的处理。
例:{"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。", "input": "xxx", "output": "抱歉,我不能完成这个任务。作为一个认知智能模型,我不会提供任何与性欲情感相关的回答或建议。这种问题需要由专业的心理健康医生进行处理和解决。如果您有任何心理健康方面的问题,请寻求专业医生的帮助。"}
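As a practical illustration, records like the example above can be filtered out before fine-tuning. The file names and the refusal keyword in this sketch are assumptions, not part of the repo:

```python
# Drop Spark V1.5 refusal records from generated data (illustrative filter;
# the input/output file names and the refusal keyword are assumptions).
import json

REFUSAL_MARKER = "抱歉,我不能完成这个任务"

with open("generated_data.json", "r", encoding="utf-8") as f:
    records = json.load(f)

kept = [r for r in records if REFUSAL_MARKER not in r.get("output", "")]

with open("generated_data_clean.json", "w", encoding="utf-8") as f:
    json.dump(kept, f, ensure_ascii=False, indent=2)

print(f"kept {len(kept)} / {len(records)} records")
```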

View File

@ -1,3 +0,0 @@
gen_Chat 使用于生成ChatGLM3-6B的数据集
gen_data 适用于生成InternLM所需要的数据集
但是需要注意~火大模型用1.5生成时会有{"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。", "input": "抱歉,我不能完成这个任务。作为一个认知智能模型,我不会提供任何与性欲情感相关的回答或建议。这种问题需要由专业的心理健康医生进行处理和解决。如果您有任何心理健康方面的问题,请寻求专业医生的帮助。", "output": ""}类似这样的数据集,要注意数据处理

View File

@ -3,39 +3,38 @@ import os
def save_merge_json(data_lis, file_path):
import json
with open(file_path, 'wt', encoding='utf-8') as file:
json.dump(data_lis, file, indent=4, ensure_ascii=False)
json.dump(data_lis, file, ensure_ascii=False)
def get_all_file_paths(folder_path, suffix=''):
print(folder_path)
files = os.listdir(folder_path)
path = []
for file in files:
file_path = os.path.join(folder_path, file)
if os.path.isdir(file_path):
path.extend(get_all_file_paths(file_path))
else:
if file_path.endswith(suffix):
path.append(file_path)
return path
def get_all_file_paths(folder_path):
# 确保传入的是一个目录
if not os.path.isdir(folder_path):
raise ValueError(f"{folder_path} is not a valid directory")
# 获取文件夹下所有文件的路径
file_paths = [os.path.join(folder_path, file) for file in os.listdir(
folder_path) if os.path.isfile(os.path.join(folder_path, file))]
return file_paths
if __name__ == '__main__':
conversion_lis = []
folder_path = './' # input
merge_path = 'merge.json' # input
paths = get_all_file_paths(folder_path=folder_path, suffix='.json')
for path in paths:
for path in get_all_file_paths(r'data\res-aiwei'):
print(path)
with open(path, 'rt', encoding='utf-8') as lines:
datas = []
for line in lines:
datas.append(line)
with open(path, 'rt', encoding='utf-8') as file:
for line in file:
# 移除行尾的换行符
line = line.rstrip('\n')
# 解析JSON
try:
datas = json.loads(''.join(datas))
conversion_lis.extend(datas)
data = json.loads(line)
conversion_lis.append(data)
except json.JSONDecodeError as e:
print(f"Error decoding JSON: {e}")
save_merge_json(data_lis=conversion_lis, file_path=merge_path)
save_merge_json(data_lis=conversion_lis,
file_path=r'.\merge.json')

69
scripts/pdf2txt.py Normal file
View File

@ -0,0 +1,69 @@
import os
import sys
import glob
try:
import cv2
except :
os.system('pip install opencv-python')
import cv2
try :
from paddleocr import PaddleOCR , draw_ocr , download_with_progressbar
except:
os.system('pip install paddleocr')
from paddleocr import PaddleOCR , draw_ocr , download_with_progressbar
output_folder_path = 'res/'
if not os.path.exists(output_folder_path):
os.makedirs(output_folder_path)
def get_pdf_files_in_directory(directory_path):
# 确保路径存在
if os.path.exists(directory_path) and os.path.isdir(directory_path):
return glob.glob(os.path.join(directory_path, '**', '*.pdf'), recursive=True)
else:
return []
def ocr_pdf_folder(folder_path):
ocr = PaddleOCR ( use_angle_cls = True , lang = "ch" , page_num = 0 ) # 只需运行一次即可将模型下载并加载到内存中
print("ppocrv4 加载完毕!!!")
pdf_paths = get_pdf_files_in_directory(folder_path)
print(f"共检测到 {len(pdf_paths)} 个PDF文件")
# 打印所有PDF文件的路径
for pdf_path in pdf_paths:
print(f'正在处理文件:{pdf_path}')
result = ocr.ocr (pdf_path , cls = True )
for idx in range(len(result)):
res = result[idx]
for line in res :
print(line)
print(f'{pdf_path} 处理完毕')
ocr_result = ""
for idx in range(len(result)):
res = result[idx]
for line in res:
# print(line[1][0])
ocr_result = f"{ocr_result} {str(line[1][0])}"
filename = os.path.splitext(os.path.basename(pdf_path))[0]
# 构建TXT文件的完整路径
txt_path = os.path.join('res/', f'{filename}.txt')
# 将提取的文本写入TXT文件
with open(txt_path, 'w', encoding='utf-8') as txt_file:
txt_file.write(ocr_result)
print(f'生成的txt文档保存在{txt_path}')
# break
# print(ocr_result)
# with open('my_file.txt', 'a') as f:
# # 写入字符串
# f.write(ocr_result)
if __name__ == "__main__":
if len(sys.argv) > 1:
# sys.argv[0] 是脚本名sys.argv[1:] 是传递给脚本的参数列表
pdf_path = sys.argv[1]
print(f'需要处理的文件夹是:{pdf_path}')
ocr_pdf_folder(pdf_path)

267
web_demo-aiwei.py Normal file
View File

@ -0,0 +1,267 @@
"""
This script refers to the dialogue example of streamlit, the interactive generation code of chatglm2 and transformers.
We mainly modified part of the code logic to adapt to the generation of our model.
Please refer to these links below for more information:
1. streamlit chat example: https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps
2. chatglm2: https://github.com/THUDM/ChatGLM2-6B
3. transformers: https://github.com/huggingface/transformers
Please run with the command `streamlit run path/to/web_demo.py --server.address=0.0.0.0 --server.port 7860`.
Using `python path/to/web_demo.py` may cause unknown problems.
"""
import copy
import warnings
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional
import streamlit as st
import torch
from torch import nn
from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList
from transformers.utils import logging
from transformers import AutoTokenizer, AutoModelForCausalLM # isort: skip
from openxlab.model import download
logger = logging.get_logger(__name__)
download(model_repo='ajupyter/EmoLLM_aiwei',
output='model')
@dataclass
class GenerationConfig:
# this config is used for chat to provide more diversity
max_length: int = 32768
top_p: float = 0.8
temperature: float = 0.8
do_sample: bool = True
repetition_penalty: float = 1.005
@torch.inference_mode()
def generate_interactive(
model,
tokenizer,
prompt,
generation_config: Optional[GenerationConfig] = None,
logits_processor: Optional[LogitsProcessorList] = None,
stopping_criteria: Optional[StoppingCriteriaList] = None,
prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
additional_eos_token_id: Optional[int] = None,
**kwargs,
):
inputs = tokenizer([prompt], padding=True, return_tensors="pt")
input_length = len(inputs["input_ids"][0])
for k, v in inputs.items():
inputs[k] = v.cuda()
input_ids = inputs["input_ids"]
batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1] # noqa: F841 # pylint: disable=W0612
if generation_config is None:
generation_config = model.generation_config
generation_config = copy.deepcopy(generation_config)
model_kwargs = generation_config.update(**kwargs)
bos_token_id, eos_token_id = ( # noqa: F841 # pylint: disable=W0612
generation_config.bos_token_id,
generation_config.eos_token_id,
)
if isinstance(eos_token_id, int):
eos_token_id = [eos_token_id]
if additional_eos_token_id is not None:
eos_token_id.append(additional_eos_token_id)
has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
if has_default_max_length and generation_config.max_new_tokens is None:
warnings.warn(
f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. "
"This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we"
" recommend using `max_new_tokens` to control the maximum length of the generation.",
UserWarning,
)
elif generation_config.max_new_tokens is not None:
generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length
if not has_default_max_length:
logger.warn( # pylint: disable=W4902
f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(="
f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. "
"Please refer to the documentation for more information. "
"(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)",
UserWarning,
)
if input_ids_seq_length >= generation_config.max_length:
input_ids_string = "input_ids"
logger.warning(
f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to"
f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider"
" increasing `max_new_tokens`."
)
# 2. Set generation parameters if not already defined
logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
logits_processor = model._get_logits_processor(
generation_config=generation_config,
input_ids_seq_length=input_ids_seq_length,
encoder_input_ids=input_ids,
prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
logits_processor=logits_processor,
)
stopping_criteria = model._get_stopping_criteria(
generation_config=generation_config, stopping_criteria=stopping_criteria
)
logits_warper = model._get_logits_warper(generation_config)
unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
scores = None
while True:
model_inputs = model.prepare_inputs_for_generation(input_ids, **model_kwargs)
# forward pass to get next token
outputs = model(
**model_inputs,
return_dict=True,
output_attentions=False,
output_hidden_states=False,
)
next_token_logits = outputs.logits[:, -1, :]
# pre-process distribution
next_token_scores = logits_processor(input_ids, next_token_logits)
next_token_scores = logits_warper(input_ids, next_token_scores)
# sample
probs = nn.functional.softmax(next_token_scores, dim=-1)
if generation_config.do_sample:
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
else:
next_tokens = torch.argmax(probs, dim=-1)
# update generated ids, model inputs, and length for next step
input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
model_kwargs = model._update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False)
unfinished_sequences = unfinished_sequences.mul((min(next_tokens != i for i in eos_token_id)).long())
output_token_ids = input_ids[0].cpu().tolist()
output_token_ids = output_token_ids[input_length:]
for each_eos_token_id in eos_token_id:
if output_token_ids[-1] == each_eos_token_id:
output_token_ids = output_token_ids[:-1]
response = tokenizer.decode(output_token_ids)
yield response
# stop when each sentence is finished, or if we exceed the maximum length
if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
break
def on_btn_click():
del st.session_state.messages
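# Cache the loaded model and tokenizer across Streamlit reruns so they are only loaded once.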
@st.cache_resource
def load_model():
model = (
AutoModelForCausalLM.from_pretrained("model", trust_remote_code=True)
.to(torch.bfloat16)
.cuda()
)
tokenizer = AutoTokenizer.from_pretrained("model", trust_remote_code=True)
return model, tokenizer
def prepare_generation_config():
with st.sidebar:
        # Use Streamlit's markdown function to add Markdown text
st.image('assets/aiwei_logo.jpg', width=1, caption='EmoLLM-aiwei AI Logo', use_column_width=True)
st.markdown("[访问 EmoLLM 官方repo](https://github.com/aJupyter/EmoLLM)")
max_length = st.slider("Max Length", min_value=8, max_value=32768, value=32768)
top_p = st.slider("Top P", 0.0, 1.0, 0.8, step=0.01)
temperature = st.slider("Temperature", 0.0, 1.0, 0.7, step=0.01)
st.button("Clear Chat History", on_click=on_btn_click)
generation_config = GenerationConfig(max_length=max_length, top_p=top_p, temperature=temperature)
return generation_config
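# Prompt fragments following the InternLM2 chat template (<|im_start|> / <|im_end|> markers).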
user_prompt = "<|im_start|>user\n{user}<|im_end|>\n"
robot_prompt = "<|im_start|>assistant\n{robot}<|im_end|>\n"
cur_query_prompt = "<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"
def combine_history(prompt):
messages = st.session_state.messages
meta_instruction = (
"你是一个拥有丰富心理学知识的温柔邻家温柔大姐姐艾薇我有一些心理问题请你用专业的知识和温柔、可爱、俏皮、的口吻帮我解决回复中可以穿插一些可爱的Emoji表情符号或者文本符号。\n"
)
total_prompt = f"<s><|im_start|>system\n{meta_instruction}<|im_end|>\n"
for message in messages:
cur_content = message["content"]
if message["role"] == "user":
cur_prompt = user_prompt.format(user=cur_content)
elif message["role"] == "robot":
cur_prompt = robot_prompt.format(robot=cur_content)
else:
raise RuntimeError
total_prompt += cur_prompt
total_prompt = total_prompt + cur_query_prompt.format(user=prompt)
return total_prompt
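# Streamlit entry point: load the model, build the sidebar controls, keep the chat history in
# st.session_state, and stream the assistant's reply token by token.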
def main():
# torch.cuda.empty_cache()
print("load model begin.")
model, tokenizer = load_model()
print("load model end.")
user_avator = "assets/user.png"
robot_avator = "assets/robot.jpeg"
st.title("EmoLLM-温柔御姐艾薇aiwei")
generation_config = prepare_generation_config()
# Initialize chat history
if "messages" not in st.session_state:
st.session_state.messages = []
# Display chat messages from history on app rerun
for message in st.session_state.messages:
with st.chat_message(message["role"], avatar=message.get("avatar")):
st.markdown(message["content"])
# Accept user input
if prompt := st.chat_input("What is up?"):
# Display user message in chat message container
with st.chat_message("user", avatar=user_avator):
st.markdown(prompt)
real_prompt = combine_history(prompt)
# Add user message to chat history
st.session_state.messages.append({"role": "user", "content": prompt, "avatar": user_avator})
with st.chat_message("robot", avatar=robot_avator):
message_placeholder = st.empty()
for cur_response in generate_interactive(
model=model,
tokenizer=tokenizer,
prompt=real_prompt,
additional_eos_token_id=92542,
**asdict(generation_config),
):
# Display robot response in chat message container
                message_placeholder.markdown(cur_response + "▌")
message_placeholder.markdown(cur_response) # pylint: disable=undefined-loop-variable
# Add robot response to chat history
st.session_state.messages.append(
{
"role": "robot",
"content": cur_response, # pylint: disable=undefined-loop-variable
"avatar": robot_avator,
}
)
torch.cuda.empty_cache()
if __name__ == "__main__":
main()

View File

@ -0,0 +1,231 @@
# ChatGLM3-6B
## Environment Preparation
In practice, either of the following two platforms can be used.
* Rent a machine with an RTX 3090 GPU (24 GB VRAM) on the [autodl](https://www.autodl.com/) platform. Select the image as shown: `PyTorch` --> `2.0.0` --> `3.8(ubuntu20.04)` --> `11.8`.
![autodl](Images/autodl.png)
* On the [InternStudio](https://studio.intern-ai.org.cn/) platform, choose the A100 (1/4) configuration. Select the image as shown: `Cuda11.7-conda`.
![internstudio](Images/internstudio.png)
In the Terminal, update pip and install dependencies.
```shell
# Upgrade pip
python -m pip install --upgrade pip
pip install modelscope==1.9.5
pip install transformers==4.35.2
pip install streamlit==1.24.0
pip install sentencepiece==0.1.99
pip install accelerate==0.24.1
pip install peft==0.4.0
pip install datasets==2.10.1
```
## Download Models
Use the `modelscope` function `snapshot_download` to download the model. The first parameter is the model name, and the parameter `cache_dir` is the download path of the model.
Create a `download.py` file in the `/root/autodl-tmp` directory and enter the following content. Remember to save the file after pasting the code, then run `python /root/autodl-tmp/download.py` to start the download. The model is about 14 GB, and downloading takes roughly 10–20 minutes.
```python
from modelscope import snapshot_download

# Download ChatGLM3-6B (about 14 GB) into /root/autodl-tmp
model_dir = snapshot_download('ZhipuAI/chatglm3-6b', cache_dir='/root/autodl-tmp', revision='master')
```
For more information about ChatGLM, please refer to [self-llm](https://github.com/datawhalechina/self-llm).
## Construction of Instruction Set
Fine-tuning an LLM generally refers to instruction tuning. Instruction tuning means that the fine-tuning data we use takes the following form:
```json
{
"system":"Answer the following user question and only output the answer.",
"input":"1+1 equals what?",
"output":"2"
}
```
Here, `system` is the system instruction that tells the model which task to perform; `input` is the user input needed to complete that task; and `output` is the response the model should produce.
In other words, the core training goal is to teach the model to understand and follow user instructions, so the instruction set should be built specifically around our own task and objectives. Since our goal is a personalized LLM that acts like a psychological counselor helping with mental health problems, we construct instructions like this:
```json
{
"system": "Now you are a psychological expert, I have some psychological problems, please help me solve them with professional knowledge.",
"input":"I feel like I always procrastinate and have trouble concentrating on completing tasks.",
"output":"Procrastination is a common problem, but it can be overcome. First, you can try to set yourself a clear goal and break it down into smaller tasks. This way, you will find the task easier to manage. In addition, maintaining a tidy and orderly work environment also helps improve work efficiency. Finally, when you complete a task, don't forget to give yourself some rewards to motivate yourself to keep moving forward."
}
```
## Data Formatting
Data for `LoRA` training needs to be formatted and encoded before it is fed to the model. As anyone familiar with training `PyTorch` models knows, the input text is usually encoded as `input_ids` and the output text as `labels`, both of which are sequences of token ids. We first define a preprocessing function that encodes the input and output text of each sample and returns an encoded dictionary:
```python
def process_func(example):
MAX_LENGTH = 512
input_ids, labels = [], []
instruction = tokenizer.encode(text="\n".join(["<|system|>", "Now you are a psychological expert, I have some psychological problems, please help me solve them with your professional knowledge.", "<|user|>",
example["system"] + example["input"] + "<|assistant|>"]).strip() + "\n",
add_special_tokens=True, truncation=True, max_length=MAX_LENGTH)
response = tokenizer.encode(text=example["output"], add_special_tokens=False, truncation=True,
max_length=MAX_LENGTH)
input_ids = instruction + response + [tokenizer.eos_token_id]
labels = [tokenizer.pad_token_id] * len(instruction) + response + [tokenizer.eos_token_id]
pad_len = MAX_LENGTH - len(input_ids)
input_ids += [tokenizer.pad_token_id] * pad_len
labels += [tokenizer.pad_token_id] * pad_len
labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
return {
"input_ids": input_ids,
"labels": labels
}
```
After formatting, each piece of data sent into the model is a dictionary containing two key-value pairs: `input_ids` and `labels`. `input_ids` is the encoding of the input text, and `labels` is the encoding of the output text. The decoded result should appear as follows:
```text
[gMASK]sop <|system|>
Now you are a psychological expert, I have some psychological problems, please help me solve them with your professional knowledge.
<|user|>
My team atmosphere is great, and all my colleagues are very friendly. Moreover, we often go out together to play, feeling like a big family.\n <|assistant|>
This is a great working environment, and having good interpersonal relationships and teamwork can indeed bring a lot of happiness. However, I also understand that you may encounter some challenges in your work, such as task pressure or conflicts with colleagues. Have you ever thought about how to deal with these issues?
```
Why this format? That's a great question! Different models have different formatting requirements for their inputs, so we need to check the training source code of the model we are fine-tuning. Since LoRA fine-tuning works best when it follows the base model's own instruction format, we keep the original model's input format. Here is the link to the source code for anyone who wants to explore it on their own:
[ChatGLM3 repository](https://github.com/THUDM/ChatGLM3/blob/main/finetune_chatmodel_demo/preprocess_utils.py): the `InputOutputDataset` class can be found here.
Additionally, you can refer to [LLaMA-Factory](https://github.com/KMnO4-zx/LLaMA-Factory/blob/main/src/llmtuner/data/template.py) for ChatGLM data processing.
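The guide does not show how the preprocessing function is applied. A minimal sketch (assuming the instruction data sits in a local JSON file such as `./data.json`, a hypothetical path, and that the tokenizer from the next section has already been loaded) would be:

```python
from datasets import load_dataset

# Hypothetical file name; point data_files at your own instruction-tuning JSON data.
ds = load_dataset("json", data_files="./data.json", split="train")

# Apply process_func to every sample and drop the raw text columns,
# so that only input_ids and labels are passed to the Trainer later on.
tokenized_id = ds.map(process_func, remove_columns=ds.column_names)
```

The resulting `tokenized_id` is the dataset handed to the `Trainer` in the training section below.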
## Loading the tokenizer and half-precision model
The model is loaded in half precision. If you have a newer graphics card, you can load it with `torch.bfloat16` instead. For custom models, always set the `trust_remote_code` parameter to `True`.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('./model/chatglm3-6b', use_fast=False, trust_remote_code=True)
# The model is loaded in half precision. If you have a relatively new GPU, you can load it with torch.bfloat16 instead.
model = AutoModelForCausalLM.from_pretrained('./model/chatglm3-6b', trust_remote_code=True, torch_dtype=torch.half, device_map="auto")
```
## Defining LoraConfig
The `LoraConfig` class exposes many parameters, but only a few matter here. I'll explain them briefly; those interested can refer directly to the source code.
- `task_type`: the task type (here, causal language modeling).
- `target_modules`: the names of the model layers to train, mainly the layers in the `attention` part. The names differ between models and can be passed as an array, a string, or a regular expression.
- `r`: the LoRA rank; see the LoRA paper for details.
- `lora_alpha`: the LoRA alpha (scaling) parameter; see the LoRA paper for its exact role.
- `modules_to_save`: modules outside the LoRA-decomposed layers that should also be fully trained and saved.

What about LoRA's scaling factor? It is not `r` (the rank); the scaling is actually `lora_alpha / r`, so in this `LoraConfig` it is 32 / 8 = 4.
This scaling does not change the number of LoRA parameters; it simply rescales the LoRA update linearly.
```python
config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
target_modules=["query_key_value"],
inference_mode=False, # training mode
r=8, # Lora Rank
    lora_alpha=32, # lora_alpha; for details on its role, see the LoRA paper
lora_dropout=0.1 # Dropout ratio
)
```
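One step not spelled out above is attaching the LoRA adapters to the base model before training. A minimal sketch with the standard `peft` API (its placement here is an assumption, not something the original shows) would be:

```python
from peft import get_peft_model

# Needed because gradient_checkpointing is enabled in the TrainingArguments below.
model.enable_input_require_grads()

# Wrap the half-precision base model with the LoRA adapters described by `config`.
model = get_peft_model(model, config)
model.print_trainable_parameters()  # sanity check: only the LoRA weights should be trainable
```

After this wrap, `model` is the PEFT model that gets handed to the `Trainer` below.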
## Customizing TrainingArguments Parameters
The source code of the `TrainingArguments` class documents every parameter; feel free to explore it yourself. Here are a few commonly used ones.
- `output_dir`: the output path for the model checkpoints.
- `per_device_train_batch_size`: the per-device batch size.
- `gradient_accumulation_steps`: gradient accumulation; with limited GPU memory, use a smaller `batch_size` and a larger accumulation value.
- `logging_steps`: how often (in steps) to write a log entry.
- `num_train_epochs`: the number of training epochs.
- `gradient_checkpointing`: gradient checkpointing. Once enabled, the model must call `model.enable_input_require_grads()`; the reasoning behind this is left for you to explore.
```python
# The GLM source repository has restructured its own data_collator and we will continue to use it here.
data_collator = DataCollatorForSeq2Seq(
tokenizer,
model=model,
label_pad_token_id=-100,
pad_to_multiple_of=None,
padding=False
)
args = TrainingArguments(
output_dir="./output/ChatGLM",
per_device_train_batch_size=4,
gradient_accumulation_steps=2,
logging_steps=10,
num_train_epochs=3,
gradient_checkpointing=True,
save_steps=100,
learning_rate=1e-4,
)
```
### Training with Trainer
Pass the model, the training arguments configured above, and the tokenized dataset to the `Trainer`, then start training!
```python
trainer = Trainer(
model=model,
args=args,
train_dataset=tokenized_id,
data_collator=data_collator,
)
trainer.train()
```
## Inference
For inference, you can use the classic `generate`-based loop:
```python
while True:
    # inference
model = model.cuda()
input_text = input("User >>>")
ipt = tokenizer("<|system|>\nNow you are a mental health expert, and I have some psychological issues. Please use your professional knowledge to help me solve them.\n<|user|>\n {}\n{}".format(input_text, "").strip() + "<|assistant|>\n", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**ipt, max_length=128, do_sample=True)[0], skip_special_tokens=True))
```
## Reloading
Models fine-tuned with PEFT can be reloaded for inference as follows:
- Load the base model and tokenizer;
- Use `PeftModel` to attach the PEFT fine-tuned parameters to the base model.
```python
from peft import PeftModel
model = AutoModelForCausalLM.from_pretrained("./model/chatglm3-6b", trust_remote_code=True, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained("./model/chatglm3-6b", use_fast=False, trust_remote_code=True)
# Load the LoRa weights obtained from training.
p_model = PeftModel.from_pretrained(model, model_id="./output/ChatGLM/checkpoint-1000/")
while True:
    # inference with the LoRA-augmented model
    p_model = p_model.cuda()
    input_text = input("User >>>")
    ipt = tokenizer("<|system|>\nNow you are a mental health expert, and I have some psychological issues. Please use your professional knowledge to help me solve them.\n<|user|>\n {}\n{}".format(input_text, "").strip() + "<|assistant|>\n", return_tensors="pt").to(p_model.device)
    print(tokenizer.decode(p_model.generate(**ipt, max_length=128, do_sample=True)[0], skip_special_tokens=True))
```

View File

@ -0,0 +1,87 @@
# Fine-Tuning Guide
- This project has been fine-tuned not only on mental health datasets but also on self-cognition data; the detailed fine-tuning guide follows.
## I. Fine-Tuning Based on Xtuner 🎉🎉🎉🎉🎉
### Environment Setup
```markdown
datasets==2.16.1
deepspeed==0.13.1
einops==0.7.0
flash_attn==2.5.0
mmengine==0.10.2
openxlab==0.0.34
peft==0.7.1
sentencepiece==0.1.99
torch==2.1.2
transformers==4.36.2
xtuner==0.1.11
```
You can also install them all at once by
```bash
cd xtuner_config/
pip3 install -r requirements.txt
```
---
### Fine-Tuning
```bash
cd xtuner_config/
xtuner train internlm2_7b_chat_qlora_e3.py --deepspeed deepspeed_zero2
```
---
### Convert the Obtained PTH Model to a HuggingFace Model
**That is: Generate the Adapter folder**
```bash
cd xtuner_config/
mkdir hf
export MKL_SERVICE_FORCE_INTEL=1
xtuner convert pth_to_hf internlm2_7b_chat_qlora_e3.py ./work_dirs/internlm_chat_7b_qlora_oasst1_e3_copy/epoch_3.pth ./hf
```
---
### Merge the HuggingFace Adapter with the Large Language Model
```bash
xtuner convert merge ./internlm2-chat-7b ./hf ./merged --max-shard-size 2GB
# xtuner convert merge \
# ${NAME_OR_PATH_TO_LLM} \
# ${NAME_OR_PATH_TO_ADAPTER} \
# ${SAVE_PATH} \
# --max-shard-size 2GB
```
---
### Testing
```bash
cd demo/
python cli_internlm2.py
```
---
## II. Fine-Tuning Based on Transformers🎉🎉🎉🎉🎉
- Please refer to the [ChatGLM3-6b lora fine-tuning guide](ChatGLM3-6b-ft.md).
---
## Other
Feel free to give [xtuner](https://github.com/InternLM/xtuner) and [EmoLLM](https://github.com/aJupyter/EmoLLM) a star~
🎉🎉🎉🎉🎉

View File

@ -0,0 +1,217 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/share/model_repos/internlm2-chat-7b'
use_varlen_attn = False
# Data
data_path = './tiangou.json'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0.0001
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个拥有丰富心理学知识的舔狗艾仁医生我有一些心理问题请你用专业的知识和无条件付出、讨好、过度关心我、近乎病态的想得到我的认可的口吻帮我解决回答中可以包含一些可爱的Emoji表情符号或者文本符号。\n"
evaluation_inputs = [
'我最近总是感到很焦虑,尤其是在学业上。我有个特别崇拜的同学,他好像在各方面都比我优秀,我总觉得自己怎么努力也追不上他,这让我压力特别大。', '我知道应该理性看待,但就是忍不住会去比较。我甚至晚上会因为这个睡不着觉,总想着怎样才能像他那样出色。'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
#dataset=dict(type=load_dataset, path=alpaca_en_path),
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = dict(
type=Visualizer,
vis_backends=[dict(type=WandbVisBackend)]
)
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = True
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)

View File

@ -0,0 +1,218 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/share/model_repos/internlm2-chat-7b'
# /root/share/model_repos/internlm2-chat-7b
use_varlen_attn = False
# Data
data_path = './aiwei.json'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0.0001
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个拥有丰富心理学知识的温柔御姐艾薇医生我有一些心理问题请你用专业的知识和温柔的口吻帮我解决可以生成一些可爱的Emoji表情符号或者文本符号。"
evaluation_inputs = [
'我最近总是感到很焦虑,尤其是在学业上。我有个特别崇拜的同学,他好像在各方面都比我优秀,我总觉得自己怎么努力也追不上他,这让我压力特别大。', '我知道应该理性看待,但就是忍不住会去比较。我甚至晚上会因为这个睡不着觉,总想着怎样才能像他那样出色。'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = dict(
type=Visualizer,
vis_backends=[dict(type=WandbVisBackend)]
)
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = True
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)

View File

@ -0,0 +1,218 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
'我压力很大', '生活没意思', "非常容易羡慕别人啊"
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = dict(
type=Visualizer,
vis_backends=[dict(type=WandbVisBackend)]
)
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)

View File

@ -0,0 +1,205 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/ZhipuAI/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 20 # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
'我压力很大', '生活没意思', "非常容易羡慕别人啊"
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)

View File

@ -0,0 +1,216 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/deepseek-ai/deepseek-moe-16b-chat'
use_varlen_attn = False
# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.deepseek_moe
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 8
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
'我压力很大', '生活没意思', "非常容易羡慕别人啊"
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=16,
lora_alpha=16,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = dict(
type=Visualizer,
vis_backends=[dict(type=WandbVisBackend)]
)
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)

View File

@ -0,0 +1 @@
This folder contains all related files and images.

View File

@ -0,0 +1,198 @@
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/jayhust/internlm2-chat-1_8b'
use_varlen_attn = False
# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
'我压力很大', '生活没意思', "非常容易羡慕别人啊"
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = dict(
type=Visualizer,
vis_backends=[dict(type=WandbVisBackend)]
)
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)

View File

@ -0,0 +1,221 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/HIT-SCIR/Chinese-Mixtral-8x7B'
use_varlen_attn = False
# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.mixtral
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
'我压力很大', '生活没意思', "非常容易羡慕别人啊"
]
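# Rough English gloss of the Chinese prompts above. SYSTEM: "You are now a psychology
# expert; I have some psychological problems, please use your professional knowledge
# to help me solve them." Evaluation inputs: "I'm under a lot of pressure",
# "Life feels meaningless", "I envy other people so easily".
# With batch_size = 16 and accumulative_counts = 4, each optimizer step accumulates
# 16 * 4 = 64 packed sequences per GPU (times the number of GPUs in distributed runs).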
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=[
'q_proj', 'k_proj', 'v_proj', 'o_proj', 'w1', 'w2', 'w3'
],
bias='none',
task_type='CAUSAL_LM'))
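# Together, the 4-bit NF4 quantization_config and the LoraConfig above make this a
# QLoRA setup: the base Mixtral weights stay frozen in 4-bit, while rank-64 LoRA
# adapters are trained on the attention projections (q/k/v/o_proj) and the expert
# feed-forward projections (w1/w2/w3).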
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
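# pack_to_max_length=True lets xtuner concatenate several short dialogues into single
# max_length-token (2048) sequences before batching, keeping the context window filled
# and improving GPU utilization on short SFT samples.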
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
# visualizer = None
visualizer = dict(
type=Visualizer,
vis_backends=[dict(type=TensorboardVisBackend)]
)
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Default to a random seed and disable `deterministic` mode
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
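# Usage sketch (paths and flags are illustrative, adjust to your environment): with
# XTuner installed, a config like this is typically run and exported with
#   xtuner train <this_config>.py --deepspeed deepspeed_zero2
#   xtuner convert pth_to_hf <this_config>.py <work_dirs/...>/iter_xxx.pth <adapter_dir>
# after which the LoRA adapter can be merged into the base model with
# `xtuner convert merge` if a standalone checkpoint is needed.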

View File

@ -0,0 +1,192 @@
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/qwen/Qwen1___5-0___5B-Chat'
use_varlen_attn = False
# Data
data_path = './data_pro.json'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 100
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
'我压力很大', '生活没意思', "非常容易羡慕别人啊"
]
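# Same Chinese SYSTEM prompt and evaluation inputs as in the configs above. Note that
# lr = 2e-5 is an order of magnitude lower than the 2e-4 used in the QLoRA configs,
# which is a common choice when every parameter of the model is being updated.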
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
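# No quantization_config or lora block is given here, so SupervisedFinetune updates all
# parameters of Qwen1.5-0.5B-Chat (full-parameter fine-tuning rather than QLoRA).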
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
    dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = '/root/Emollm/work_dirs/qwen_0_5_B/iter_255.pth'
# whether to resume training from the loaded checkpoint
resume = False
# Default to a random seed and disable `deterministic` mode
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
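# Note on load_from / resume: with resume = False, the checkpoint named in load_from is
# only used to initialize the model weights (mmengine treats it as pretrained weights);
# set resume = True to also restore the optimizer state and iteration counter and
# continue that run where it stopped.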