Merge pull request #40 from chg0901/main
[Update] README files translation
commit
88b1794f7e
187
README.md
@@ -1,16 +1,13 @@
|
||||
# EmoLLM-心理健康大模型
|
||||
|
||||
[Contributors][contributors-url]
|
||||
[Forks][forks-url]
|
||||
[Stargazers][stars-url]
|
||||
[Issues][issues-url]
|
||||
[MIT License][license-url]
|
||||
|
||||
<!-- [![LinkedIn][linkedin-shield]][linkedin-url] -->
|
||||
|
||||
<!-- PROJECT LOGO -->
|
||||
|
||||
<!-- PROJECT SHIELDS -->
|
||||
[![Contributors][contributors-shield]][contributors-url]
|
||||
[![Forks][forks-shield]][forks-url]
|
||||
[![Issues][issues-shield]][issues-url]
|
||||
[![MIT License][license-shield]][license-url]
|
||||
[![Stargazers][stars-shield]][stars-url]
|
||||
<br />
|
||||
<!-- PROJECT LOGO -->
|
||||
|
||||
<p align="center">
|
||||
<a href="https://github.com/aJupyter/EmoLLM/">
|
||||
@@ -18,7 +15,10 @@
|
||||
</a>
|
||||
|
||||
<h3 align="center">EmoLLM</h3>
|
||||
|
||||
<p align="center">
|
||||
简体中文| <a href="README_English_version.md" >English</a>
|
||||
<br />
|
||||
<br />
|
||||
<a href="https://github.com/aJupyter/EmoLLM"><strong>探索本项目的文档 »</strong></a>
|
||||
<br />
|
||||
@@ -34,7 +34,20 @@
|
||||
|
||||
<!-- 本篇README.md面向开发者 -->
|
||||
|
||||
**EmoLLM**是一个能够支持 **理解用户-支持用户-帮助用户** 心理健康辅导链路的心理健康大模型,由[InternLM2](https://github.com/InternLM/InternLM)指令微调而来,欢迎大家star~⭐⭐
|
||||
**EmoLLM** 是一系列能够支持 **理解用户-支持用户-帮助用户** 心理健康辅导链路的心理健康大模型,由 `LLM`指令微调而来,欢迎大家star~⭐⭐。目前已经开源的 `LLM`微调配置如下:
|
||||
|
||||
| 模型 | 类型 |
|
||||
| :-------------------: | :------: |
|
||||
| InternLM2_7B_chat | qlora |
|
||||
| InternLM2_1_8B_chat | 全量微调 |
|
||||
| Qwen_7b_chat | qlora |
|
||||
| Qwen1_5-0_5B-Chat | 全量微调 |
|
||||
| Baichuan2_13B_chat | qlora |
|
||||
| ChatGLM3_6B | lora |
|
||||
| DeepSeek MoE_16B_chat | qlora |
|
||||
| Mixtral 8x7B_instruct | qlora |
|
||||
| …… | …… |
|
||||
欢迎大家为本项目做出贡献~
|
||||
|
||||
---
|
||||
|
||||
@@ -49,61 +62,86 @@
|
||||
- 预防和干预措施:心理健康大模型还包括预防心理问题和促进心理健康的策略,如心理教育、心理咨询、心理治疗和社会支持系统。
|
||||
- 评估和诊断工具:为了有效促进心理健康,需要有科学的工具来评估个体的心理状态,以及诊断可能存在的心理问题。
|
||||
|
||||
### 最近更新
|
||||
- 【2024.2.29】更新客观评估计算,详见[evaluate](./evaluate/),更新一系列数据集,详见[datasets](./datasets/)。
|
||||
- 【2024.2.27】更新英文readme和一系列数据集(舔狗和单轮对话)
|
||||
- 【2024.2.23】推出基于InternLM2_7B_chat_qlora的 `温柔御姐心理医生艾薇`,[点击获取模型权重](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei),[配置文件](xtuner_config/aiwei-internlm2_chat_7b_qlora.py),[在线体验链接](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
|
||||
- 【2024.2.23】更新[若干微调配置](/xtuner_config/),新增 [data_pro.json](/datasets/data_pro.json)(数量更多、场景更全、更丰富)和 [aiwei.json](/datasets/aiwei.json)(温柔御姐角色扮演专用,带有Emoji表情),即将推出 `温柔御姐心理医生艾薇`
|
||||
- 【2024.2.18】 [基于Qwen1_5-0_5B-Chat全量微调版本开源](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary),算力有限的道友可以玩起来~
|
||||
- 【2024.2.6】 EmoLLM在[**Openxlab** ](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) 平台下载量高达18.7k,欢迎大家体验!
|
||||
|
||||
<p align="center">
|
||||
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="模型下载量">
|
||||
</p>
|
||||
|
||||
<details>
|
||||
<summary>查看更多</summary>
|
||||
|
||||
- 【2024.2.5】 项目荣获公众号**NLP工程化**推文宣传[推文链接](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A),为博主推广一波,欢迎大家关注!!🥳🥳
|
||||
|
||||
<p align="center">
|
||||
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/47868d6a-2e91-4aa9-a630-e594c14295b4" alt="公众号二维码">
|
||||
</p>
|
||||
|
||||
- 【2024.2.3】 [项目宣传视频](https://www.bilibili.com/video/BV1N7421N76X/)完成 😊
|
||||
- 【2024.1.27】 完善数据构建文档、微调指南、部署指南、Readme等相关文档 👏
|
||||
- 【2024.1.25】 完成EmoLLM第一版并部署上线 https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
|
||||
|
||||
</details>
|
||||
|
||||
## 目录
|
||||
|
||||
- [EmoLLM-心理健康大模型](#emollm-心理健康大模型)
|
||||
- [最近更新](#最近更新)
|
||||
- [目录](#目录)
|
||||
- [开发前的配置要求](#开发前的配置要求)
|
||||
- [**安装步骤**](#安装步骤)
|
||||
- [**使用指南**](#使用指南)
|
||||
- [文件目录说明](#文件目录说明)
|
||||
- [数据构建](#数据构建)
|
||||
- [微调指南](#微调指南)
|
||||
- [demo部署](#demo部署)
|
||||
- [部署指南](#部署指南)
|
||||
- [使用到的框架](#使用到的框架)
|
||||
- [贡献者](#贡献者)
|
||||
- [如何参与开源项目](#如何参与开源项目)
|
||||
- [如何参与本项目](#如何参与本项目)
|
||||
- [版本控制](#版本控制)
|
||||
- [作者](#作者)
|
||||
- [作者(排名不分先后)](#作者排名不分先后)
|
||||
- [版权说明](#版权说明)
|
||||
- [鸣谢](#鸣谢)
|
||||
- [特别鸣谢](#特别鸣谢)
|
||||
- [Star History](#star-history)
|
||||
- [🌟 Contributors](#-contributors)
|
||||
|
||||
###### 开发前的配置要求
|
||||
|
||||
详见[部署要求](https://github.com/aJupyter/EmoLLM/tree/main/%E9%85%8D%E7%BD%AE%E8%A6%81%E6%B1%82)
|
||||
- 硬件:A100 40G(仅针对InternLM2_7B_chat+qlora微调+deepspeed zero2优化)
|
||||
|
||||
###### **安装步骤**
|
||||
###### **使用指南**
|
||||
|
||||
1. Get a free API Key at [https://example.com](https://example.com)
|
||||
2. Clone the repo
|
||||
1. Clone the repo
|
||||
|
||||
```sh
|
||||
git clone https://github.com/aJupyter/EmoLLM.git
|
||||
git clone https://github.com/SmartFlowAI/EmoLLM.git
|
||||
```
|
||||
|
||||
2. 依次阅读或者选择感兴趣的部分阅读:
|
||||
- [文件目录说明](#文件目录说明)
|
||||
- [数据构建](#数据构建)
|
||||
- [微调指南](#微调指南)
|
||||
- [部署指南](#部署指南)
|
||||
- 查看更多详情
|
||||
|
||||
<details>
|
||||
<summary>更多详情</summary>
|
||||
|
||||
### 文件目录说明
|
||||
|
||||
eg:
|
||||
|
||||
```
|
||||
filetree
|
||||
├── ARCHITECTURE.md
|
||||
├── LICENSE.txt
|
||||
├── README.md
|
||||
├── /account/
|
||||
├── /bbs/
|
||||
├── /docs/
|
||||
│ ├── /rules/
|
||||
│ │ ├── backend.txt
|
||||
│ │ └── frontend.txt
|
||||
├── manage.py
|
||||
├── /oa/
|
||||
├── /static/
|
||||
├── /templates/
|
||||
├── useless.md
|
||||
└── /util/
|
||||
|
||||
├─assets:图像资源
|
||||
├─datasets:数据集
|
||||
├─demo:demo脚本
|
||||
├─generate_data:生成数据指南
|
||||
│ └─xinghuo
|
||||
├─scripts:一些可用工具
|
||||
└─xtuner_config:微调指南
|
||||
└─images
|
||||
```
|
||||
|
||||
### 数据构建
|
||||
@@ -116,21 +154,18 @@ filetree
|
||||
|
||||
详见[微调指南](xtuner_config/README.md)
|
||||
|
||||
### demo部署
|
||||
### 部署指南
|
||||
|
||||
详见[demo](https://github.com/aJupyter/EmoLLM/demo)
|
||||
详见[部署指南](demo/README.md)
|
||||
|
||||
### 使用到的框架
|
||||
|
||||
- [xxxxxxx](https://getbootstrap.com)
|
||||
- [xxxxxxx](https://jquery.com)
|
||||
- [xxxxxxx](https://laravel.com)
|
||||
- [Xtuner](https://github.com/InternLM/xtuner)
|
||||
- [Transformers](https://github.com/huggingface/transformers)
|
||||
- [Pytorch](https://pytorch.org/)
|
||||
- …
|
||||
|
||||
### 贡献者
|
||||
|
||||
请阅读**CONTRIBUTING.md** 查阅为该项目做出贡献的开发者。
|
||||
|
||||
#### 如何参与开源项目
|
||||
#### 如何参与本项目
|
||||
|
||||
贡献使开源社区成为一个学习、激励和创造的绝佳场所。你所作的任何贡献都是**非常感谢**的。
|
||||
|
||||
@@ -144,15 +179,33 @@ filetree
|
||||
|
||||
该项目使用Git进行版本管理。您可以在repository参看当前可用版本。
|
||||
|
||||
</details>
|
||||
|
||||
### 作者(排名不分先后)
|
||||
|
||||
[aJupyter](https://github.com/aJupyter)@datawhale成员、南开大学在读硕士
|
||||
|
||||
[jujimeizup](https://github.com/jujimeizuo)@
|
||||
[jujimeizuo](https://github.com/jujimeizuo)@江南大学在读硕士
|
||||
|
||||
[Smiling&Weeping](https://github.com/Smiling-Weeping-zhr)@
|
||||
[Smiling&Weeping](https://github.com/Smiling-Weeping-zhr)@哈尔滨工业大学(威海)在读本科生
|
||||
|
||||
[Farewell](https://github.com/8baby8)@
|
||||
[Farewell](https://github.com/8baby8)@飞桨领航团区域主管、文心大模型核心开发者
|
||||
|
||||
[ZhouXinAo](https://github.com/zxazys)@南开大学在读硕士
|
||||
|
||||
[MING_X](https://github.com/MING-ZCH)@华中科技大学在读本科生
|
||||
|
||||
[Z_L](https://github.com/JasonLLLLLLLLLLL)@swufe
|
||||
|
||||
[MrCatAI](https://github.com/MrCatAI)@AI搬用工
|
||||
|
||||
[ZeyuBa](https://github.com/ZeyuBa)@自动化所在读硕士
|
||||
|
||||
[aiyinyuedejustin](https://github.com/aiyinyuedejustin)@宾夕法尼亚大学在读硕士
|
||||
|
||||
[Nobody-ML](https://github.com/Nobody-ML)@中国石油大学(华东)在读本科生
|
||||
|
||||
[chg0901](https://github.com/chg0901)@韩国光云大学博士生
|
||||
|
||||
### 版权说明
|
||||
|
||||
@@ -163,6 +216,7 @@ filetree
|
||||
- [Sanbu](https://github.com/sanbuphy)
|
||||
- [上海人工智能实验室](https://www.shlab.org.cn/)
|
||||
- [闻星大佬(小助手)](https://github.com/vansin)
|
||||
- [扫地升(公众号宣传)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
|
||||
|
||||
<!-- links -->
|
||||
|
||||
@@ -170,22 +224,23 @@ filetree
|
||||
|
||||
<!-- [linkedin-url]: https://linkedin.com/in/aJupyter -->
|
||||
|
||||
|
||||
## Star History
|
||||
|
||||
[](https://star-history.com/#aJupyter/EmoLLM&Date)
|
||||
|
||||
## 🌟 Contributors
|
||||
|
||||
[](https://github.com/aJupyter/EmoLLM/graphs/contributors)
|
||||
[](https://github.com/SmartFlowAI/EmoLLM/graphs/contributors)
|
||||
|
||||
[your-project-path]: aJupyter/EmoLLM
|
||||
[contributors-shield]: https://img.shields.io/github/contributors/aJupyter/EmoLLM.svg?style=flat-square
|
||||
[contributors-url]: https://github.com/aJupyter/EmoLLM/graphs/contributors
|
||||
[forks-shield]: https://img.shields.io/github/forks/aJupyter/EmoLLM.svg?style=flat-square
|
||||
[forks-url]: https://github.com/aJupyter/EmoLLM/network/members
|
||||
[stars-shield]: https://img.shields.io/github/stars/aJupyter/EmoLLM.svg?style=flat-square
|
||||
[stars-url]: https://github.com/aJupyter/EmoLLM/stargazers
|
||||
[issues-shield]: https://img.shields.io/github/issues/aJupyter/EmoLLM.svg?style=flat-square
|
||||
[issues-url]: https://img.shields.io/github/issues/aJupyter/EmoLLM.svg
|
||||
[license-shield]: https://img.shields.io/github/license/aJupyter/EmoLLM.svg?style=flat-square
|
||||
[license-url]: https://github.com/aJupyter/EmoLLM/blob/main/LICENSE
|
||||
[your-project-path]: SmartflowAI/EmoLLM
|
||||
[contributors-shield]: https://img.shields.io/github/contributors/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[contributors-url]: https://github.com/SmartflowAI/EmoLLM/graphs/contributors
|
||||
[forks-shield]: https://img.shields.io/github/forks/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[forks-url]: https://github.com/SmartflowAI/EmoLLM/network/members
|
||||
[stars-shield]: https://img.shields.io/github/stars/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[stars-url]: https://github.com/SmartflowAI/EmoLLM/stargazers
|
||||
[issues-shield]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[issues-url]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg
|
||||
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE
|
||||
|
251
README_EN.md
Normal file
@@ -0,0 +1,251 @@
|
||||
# EmoLLM - Large Language Model for Mental Health
|
||||
|
||||
<!-- PROJECT SHIELDS -->
|
||||
[![Contributors][contributors-shield]][contributors-url]
|
||||
[![Forks][forks-shield]][forks-url]
|
||||
[![Issues][issues-shield]][issues-url]
|
||||
[![MIT License][license-shield]][license-url]
|
||||
[![Stargazers][stars-shield]][stars-url]
|
||||
<br />
|
||||
<!-- PROJECT LOGO -->
|
||||
|
||||
<p align="center">
|
||||
<a href="https://github.com/aJupyter/EmoLLM/">
|
||||
<img src="assets/logo.jpeg" alt="Logo" width="30%">
|
||||
</a>
|
||||
|
||||
<h3 align="center">EmoLLM</h3>
|
||||
|
||||
<p align="center">
|
||||
<a href="README.md">简体中文</a> | English
|
||||
<br />
|
||||
<br />
|
||||
<a href="https://github.com/aJupyter/EmoLLM"><strong>Explore the documentation of this project »</strong></a>
|
||||
<br />
|
||||
<br />
|
||||
<a href="https://github.com/aJupyter/EmoLLM/tree/main/demo">View the Demo</a>
|
||||
·
|
||||
<a href="https://github.com/aJupyter/EmoLLM/issues">Report a Bug</a>
|
||||
·
|
||||
<a href="https://github.com/aJupyter/EmoLLM/issues">Propose a New Feature</a>
|
||||
</p>
|
||||
|
||||
</p>
|
||||
|
||||
<!-- 本篇README.md面向开发者 -->
|
||||
|
||||
|
||||
**EmoLLM** is a series of large language models that support the **understand users - support users - help users** mental health counseling pipeline. They are obtained by instruction fine-tuning of `LLM`s. We would really appreciate it if you could give the project a star~⭐⭐. The currently open-sourced `LLM` fine-tuning configurations are as follows:
|
||||
|
||||
| model | type |
|
||||
| :-------------------: | :------: |
|
||||
| InternLM2_7B_chat | qlora |
|
||||
| InternLM2_1_8B_chat | full finetuning |
|
||||
| Qwen_7b_chat | qlora |
|
||||
| Qwen1_5-0_5B-Chat | full finetuning |
|
||||
| Baichuan2_13B_chat | qlora |
|
||||
| ChatGLM3_6B | lora |
|
||||
| DeepSeek MoE_16B_chat | qlora |
|
||||
| Mixtral 8x7B_instruct | qlora |
|
||||
| …… | …… |
|
||||
Everyone is welcome to contribute to this project ~
|
||||
---
|
||||
|
||||
The Model is aimed at fully understanding and promoting the mental health of individuals, groups, and society. This model typically includes the following key components:
|
||||
|
||||
- Cognitive factors: Involving an individual's thought patterns, belief systems, cognitive biases, and problem-solving abilities. Cognitive factors significantly impact mental health as they affect how individuals interpret and respond to life events.
|
||||
- Emotional factors: Including emotion regulation, emotional expression, and emotional experiences. Emotional health is a crucial part of mental health, involving how individuals manage and express their emotions and how they recover from negative emotions.
|
||||
- Behavioral factors: Concerning an individual's behavior patterns, habits, and coping strategies. This includes stress management skills, social skills, and self-efficacy, which is the confidence in one's abilities.
|
||||
- Social environment: Comprising external factors such as family, work, community, and cultural background, which have direct and indirect impacts on an individual's mental health.
|
||||
- Physical health: There is a close relationship between physical and mental health. Good physical health can promote mental health and vice versa.
|
||||
- Psychological resilience: Refers to an individual's ability to recover from adversity and adapt. Those with strong psychological resilience can bounce back from challenges and learn and grow from them.
|
||||
- Prevention and intervention measures: The Mental Health Grand Model also includes strategies for preventing psychological issues and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
|
||||
- Assessment and diagnostic tools: Effective promotion of mental health requires scientific tools to assess individuals' psychological states and diagnose potential psychological issues.
|
||||
### Recent Updates
|
||||
- 【2024.2.29】 Updated objective assessment calculations, see [evaluate](./evaluate/) for details. A series of datasets have also been updated, see [datasets](./datasets/) for details.
|
||||
- 【2024.2.27】 Updated the English README and a series of datasets (tiangou and single-turn dialogue)
|
||||
- 【2024.2.23】The "Gentle Lady Psychologist Ai Wei" based on InternLM2_7B_chat_qlora was launched. [Click here to obtain the model weights](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_aiwei), [configuration file](xtuner_config/aiwei-internlm2_chat_7b_qlora.py), [online experience link](https://openxlab.org.cn/apps/detail/ajupyter/EmoLLM-aiwei)
|
||||
|
||||
- 【2024.2.23】Updated [several fine-tuning configurations](/xtuner_config/), added [data_pro.json](/datasets/data_pro.json) (more quantity, more comprehensive scenarios, richer content) and [aiwei.json](/datasets/aiwei.json) (dedicated to the gentle lady role-play, featuring Emoji expressions), the "Gentle Lady Psychologist Ai Wei" is coming soon.
|
||||
|
||||
- 【2024.2.18】 The full fine-tuned version based on Qwen1_5-0_5B-Chat has been [open-sourced](https://www.modelscope.cn/models/aJupyter/EmoLLM_Qwen1_5-0_5B-Chat_full_sft/summary). Friends with limited computational resources can now dive in and explore it.
|
||||
|
||||
- 【2024.2.6】 EmoLLM has reached 18.7k downloads on the [**OpenXLab**](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) platform. Welcome everyone to try it!
|
||||
|
||||
<p align="center">
|
||||
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/7e931682-c54d-4ded-bc67-79130c68d744" alt="模型下载量">
|
||||
</p>
|
||||
|
||||
<details>
|
||||
<summary>View More</summary>
|
||||
|
||||
- 【2024.2.5】 The project has been promoted by the official WeChat account NLP Engineering. Here's the [link](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A) to the article. Welcome everyone to follow!! 🥳🥳
|
||||
|
||||
<p align="center">
|
||||
<img src="https://github.com/aJupyter/EmoLLM/assets/62385492/47868d6a-2e91-4aa9-a630-e594c14295b4" alt="公众号二维码">
|
||||
</p>
|
||||
|
||||
- 【2024.2.3】 [Project video](https://www.bilibili.com/video/BV1N7421N76X/) released on Bilibili 😊
|
||||
- 【2024.1.27】 Complete data construction documentation, fine-tuning guide, deployment guide, Readme, and other related documents 👏
|
||||
- 【2024.1.25】 Complete the first version of EmoLLM and deploy it online https://openxlab.org.cn/apps/detail/jujimeizuo/EmoLLM 😀
|
||||
|
||||
</details>
|
||||
|
||||
## Contents
|
||||
|
||||
- [EmoLLM - Large Language Model for Mental Health](#emollm---large-language-model-for-mental-health)
|
||||
- [Everyone is welcome to contribute to this project ~](#everyone-is-welcome-to-contribute-to-this-project-)
|
||||
- [Recent Updates](#recent-updates)
|
||||
- [Contents](#contents)
|
||||
- [Pre-development Configuration Requirements](#pre-development-configuration-requirements)
|
||||
- [**User Guide**](#user-guide)
|
||||
- [File Directory Explanation](#file-directory-explanation)
|
||||
- [Data Construction](#data-construction)
|
||||
- [Fine-tuning Guide](#fine-tuning-guide)
|
||||
- [Deployment Guide](#deployment-guide)
|
||||
- [Frameworks Used](#frameworks-used)
|
||||
- [How to participate in this project](#how-to-participate-in-this-project)
|
||||
- [Version control](#version-control)
|
||||
- [Authors (in no particular order)](#authors-in-no-particular-order)
|
||||
- [Copyright Notice](#copyright-notice)
|
||||
- [Acknowledgments](#acknowledgments)
|
||||
- [Star History](#star-history)
|
||||
- [🌟 Contributors](#-contributors)
|
||||
|
||||
###### Pre-development Configuration Requirements
|
||||
|
||||
- A100 40G (specifically for InternLM2_7B_chat + qlora fine-tuning + deepspeed zero2 optimization)
|
||||
|
||||
###### **User Guide**
|
||||
|
||||
1. Clone the repo
|
||||
|
||||
```sh
|
||||
git clone https://github.com/SmartFlowAI/EmoLLM.git
|
||||
```
|
||||
|
||||
2. Read through in order, or jump to the sections you're interested in:
|
||||
- [File Directory Explanation](#file-directory-explanation)
|
||||
- [Data Construction](#data-construction)
|
||||
- [Fine-tuning Guide](#fine-tuning-guide)
|
||||
- [Deployment Guide](#deployment-guide)
|
||||
- View More Details
|
||||
|
||||
<details>
|
||||
<summary>Additional Details</summary>
|
||||
|
||||
### File Directory Explanation
|
||||
|
||||
```
|
||||
├─assets:Image Resources
|
||||
├─datasets:Dataset
|
||||
├─demo:demo scripts
|
||||
├─generate_data:Data Generation Guide
|
||||
│ └─xinghuo
|
||||
├─scripts:Some Available Tools
|
||||
└─xtuner_config:Fine-tuning Guide
|
||||
└─images
|
||||
```
|
||||
|
||||
### Data Construction
|
||||
|
||||
Please read the [Data Construction Guide](generate_data/tutorial.md) for reference.
|
||||
|
||||
The dataset used for this fine-tuning can be found at [datasets](datasets/data.json)
|
||||
|
||||
### Fine-tuning Guide
|
||||
|
||||
For details, see the [fine-tuning guide](xtuner_config/README.md)
|
||||
|
||||
### Deployment Guide
|
||||
|
||||
For details, see the [deployment guide](demo/README.md)
|
||||
|
||||
### Frameworks Used
|
||||
|
||||
- [Xtuner](https://github.com/InternLM/xtuner)
|
||||
- [Transformers](https://github.com/huggingface/transformers)
|
||||
- [Pytorch](https://pytorch.org/)
|
||||
- …
|
||||
|
||||
#### How to participate in this project
|
||||
|
||||
Contributions make the open-source community an excellent place for learning, inspiration, and creation. Any contribution you make is greatly appreciated.
|
||||
|
||||
1. Fork the Project
|
||||
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
|
||||
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
|
||||
4. Push to the Branch (`git push origin feature/AmazingFeature`)
|
||||
5. Open a Pull Request
|
||||
|
||||
### Version control
|
||||
|
||||
This project uses Git for version control. You can see the current available versions in the repository.
|
||||
|
||||
</details>
|
||||
|
||||
### Authors (in no particular order)
|
||||
|
||||
[aJupyter](https://github.com/aJupyter)@Datawhale member, Master's student at Nankai University
|
||||
|
||||
[jujimeizuo](https://github.com/jujimeizuo)@Master's student at Jiangnan University
|
||||
|
||||
[Smiling&Weeping](https://github.com/Smiling-Weeping-zhr)@Undergraduate student at Harbin Institute of Technology (Weihai)
|
||||
|
||||
[Farewell](https://github.com/8baby8)@Regional lead of the PaddlePaddle Pilot Group (飞桨领航团), core developer of the ERNIE (文心) large model
|
||||
|
||||
[ZhouXinAo](https://github.com/zxazys)@Master's student at Nankai University
|
||||
|
||||
[MING_X](https://github.com/MING-ZCH)@Undergraduate student at Huazhong University of Science and Technology
|
||||
|
||||
[Z_L](https://github.com/JasonLLLLLLLLLLL)@swufe
|
||||
|
||||
[MrCatAI](https://github.com/MrCatAI)@AI porter
|
||||
|
||||
[ZeyuBa](https://github.com/ZeyuBa)@Master's student at Institute of Automation
|
||||
|
||||
[aiyinyuedejustin](https://github.com/aiyinyuedejustin)@Master's student at University of Pennsylvania
|
||||
|
||||
[Nobody-ML](https://github.com/Nobody-ML)@Undergraduate student at China University of Petroleum (East China)
|
||||
|
||||
[chg0901](https://github.com/chg0901)@Ph.D. student at Kwangwoon University, South Korea
|
||||
|
||||
### Copyright Notice
|
||||
|
||||
The project is licensed under the MIT License. For details, please refer to
|
||||
[LICENSE](https://github.com/aJupyter/EmoLLM/blob/master/LICENSE)
|
||||
|
||||
### Acknowledgments
|
||||
|
||||
- [Sanbu](https://github.com/sanbuphy)
|
||||
- [Shanghai Artificial Intelligence Laboratory](https://www.shlab.org.cn/)
|
||||
- [Vansin](https://github.com/vansin)
|
||||
- [Bloom up (WeChat Official Account Promotion)](https://mp.weixin.qq.com/s/78lrRl2tlXEKUfElnkVx4A)
|
||||
|
||||
<!-- links -->
|
||||
|
||||
<!-- [linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=flat-square&logo=linkedin&colorB=555 -->
|
||||
|
||||
<!-- [linkedin-url]: https://linkedin.com/in/aJupyter -->
|
||||
|
||||
<!-- 太少了,没必要放 -->
|
||||
|
||||
## Star History
|
||||
|
||||
[](https://star-history.com/#aJupyter/EmoLLM&Date)
|
||||
|
||||
## 🌟 Contributors
|
||||
|
||||
[](https://github.com/SmartFlowAI/EmoLLM/graphs/contributors)
|
||||
|
||||
[your-project-path]: SmartflowAI/EmoLLM
|
||||
[contributors-shield]: https://img.shields.io/github/contributors/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[contributors-url]: https://github.com/SmartflowAI/EmoLLM/graphs/contributors
|
||||
[forks-shield]: https://img.shields.io/github/forks/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[forks-url]: https://github.com/SmartflowAI/EmoLLM/network/members
|
||||
[stars-shield]: https://img.shields.io/github/stars/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[stars-url]: https://github.com/SmartflowAI/EmoLLM/stargazers
|
||||
[issues-shield]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[issues-url]: https://img.shields.io/github/issues/SmartflowAI/EmoLLM.svg
|
||||
[license-shield]: https://img.shields.io/github/license/SmartflowAI/EmoLLM.svg?style=flat-square
|
||||
[license-url]: https://github.com/SmartflowAI/EmoLLM/blob/main/LICENSE
|
3
app.py
@@ -1,2 +1,3 @@
|
||||
import os
|
||||
os.system('streamlit run web_internlm2.py --server.address=0.0.0.0 --server.port 7860')
|
||||
# os.system('streamlit run web_internlm2.py --server.address=0.0.0.0 --server.port 7860')
|
||||
os.system('streamlit run web_demo-aiwei.py --server.address=0.0.0.0 --server.port 7860')
|
||||
|
BIN
assets/aiwei_logo.jpg
Normal file
Binary file not shown.
After Width: | Height: | Size: 133 KiB |
BIN
assets/deploy_1.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 766 KiB |
BIN
assets/deploy_2.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 315 KiB |
BIN
assets/deploy_3.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 61 KiB |
BIN
assets/deploy_4.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 643 KiB |
101142
datasets/SoulStar_data.json
Normal file
File diff suppressed because one or more lines are too long
24870
datasets/aiwei.json
Normal file
File diff suppressed because it is too large
Load Diff
138359
datasets/data_pro.json
Normal file
File diff suppressed because one or more lines are too long
146974
datasets/multi_turn_dataset_1.json
Normal file
File diff suppressed because it is too large
Load Diff
110310
datasets/multi_turn_dataset_2.json
Normal file
File diff suppressed because it is too large
Load Diff
56166
datasets/single_turn_dataset_1.json
Normal file
File diff suppressed because one or more lines are too long
73170
datasets/single_turn_dataset_2.json
Normal file
File diff suppressed because one or more lines are too long
20386
datasets/tiangou.json
Normal file
File diff suppressed because it is too large
Load Diff
59
demo/README.md
Normal file
@@ -0,0 +1,59 @@
|
||||
# EmoLLM 部署指南
|
||||
|
||||
## 本地部署
|
||||
|
||||
- Clone repo
|
||||
|
||||
```bash
|
||||
git clone https://github.com/aJupyter/EmoLLM.git
|
||||
```
|
||||
|
||||
- 安装依赖
|
||||
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
- 下载模型
|
||||
- 模型权重:https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model
|
||||
- 通过 openxlab.model.download 下载,详情请看 [cli_internlm2](./cli_internlm2.py)
|
||||
|
||||
```python
|
||||
from openxlab.model import download
|
||||
|
||||
download(model_repo='jujimeizuo/EmoLLM_Model', output='model')
|
||||
```
|
||||
|
||||
- 可以手动下载,放在 `./model` 目录下,然后把上面的代码删掉
|
||||
|
||||
- cli_demo
|
||||
|
||||
```bash
|
||||
python ./demo/cli_internlm2.py
|
||||
```
|
||||
|
||||
- web_demo
|
||||
|
||||
```bash
|
||||
python ./app.py
|
||||
```
|
||||
|
||||
如果在服务器上部署,需要配置本地端口映射
|
||||
|
||||
## OpenXLab 上部署
|
||||
|
||||
- 登陆 OpenXLab,创建 Gradio 应用
|
||||
|
||||

|
||||
|
||||
- 选择配置,立即创建
|
||||
|
||||

|
||||
|
||||
- 等待构建、启动
|
||||
|
||||

|
||||
|
||||
- 项目体验
|
||||
|
||||

|
59
demo/README_EN.md
Normal file
@@ -0,0 +1,59 @@
|
||||
# EmoLLM Deployment Guide
|
||||
|
||||
## Local Deployment
|
||||
|
||||
- Clone repo
|
||||
|
||||
```bash
|
||||
git clone https://github.com/aJupyter/EmoLLM.git
|
||||
```
|
||||
|
||||
- Install dependencies
|
||||
|
||||
```bash
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
- Download the model
|
||||
- Model weights:https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model
|
||||
- Download via openxlab.model.download, see [cli_internlm2](./cli_internlm2.py) for details
|
||||
|
||||
```python
|
||||
from openxlab.model import download
|
||||
|
||||
download(model_repo='jujimeizuo/EmoLLM_Model', output='model')
|
||||
```
|
||||
|
||||
- You can also download manually and place it in the `./model` directory, then delete the above code.
|
||||
|
||||
- cli_demo
|
||||
|
||||
```bash
|
||||
python ./demo/cli_internlm2.py
|
||||
```
|
||||
|
||||
- web_demo
|
||||
|
||||
```bash
|
||||
python ./app.py
|
||||
```
|
||||
|
||||
If deploying on a server, you need to configure local port mapping.
|
||||
|
||||
## Deploy on OpenXLab
|
||||
|
||||
- Log in to OpenXLab and create a Gradio application
|
||||
|
||||

|
||||
|
||||
- Select configurations and create the project
|
||||
|
||||

|
||||
|
||||
- Wait for the build and startup
|
||||
|
||||

|
||||
|
||||
- Try your own project
|
||||
|
||||

|
@@ -1,8 +1,11 @@
|
||||
import torch
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
from openxlab.model import download
|
||||
|
||||
download(model_repo='jujimeizuo/EmoLLM_Model',
|
||||
output='model')
|
||||
|
||||
model_name_or_path = "./model"
|
||||
model_name_or_path = "model"
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
|
||||
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')
|
||||
|
49
evaluate/Genera_evaluation.md
Normal file
@@ -0,0 +1,49 @@
|
||||
# EmoLLM通用指标评估
|
||||
|
||||
## 简介
|
||||
|
||||
本文档提供了关于如何使用 `eval.py` 和 `metric.py` 两个脚本的指导。这些脚本用于评估 EmoLLM-心理健康大模型的生成结果。
|
||||
|
||||
## 安装
|
||||
|
||||
- Python 3.x
|
||||
- PyTorch
|
||||
- Transformers
|
||||
- Datasets
|
||||
- NLTK
|
||||
- Rouge
|
||||
- Jieba
|
||||
|
||||
可以使用以下命令安装:
|
||||
|
||||
```bash
|
||||
pip install torch transformers datasets nltk rouge jieba
|
||||
```
|
||||
|
||||
## 用法
|
||||
|
||||
### convert.py
|
||||
|
||||
将原始多轮对话数据转换为测评用的单轮数据。
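转换前后的数据形式大致如下(对话内容为虚构示例,仅用于说明格式):

```python
# 原始多轮数据:data.json 中的一条记录(内容为虚构示例)
raw = {
    "conversation": [
        {"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。",
         "input": "我最近总是失眠。", "output": "可以说说最近有什么让你压力比较大的事情吗?"},
        {"input": "工作上的事情太多了。", "output": "听起来确实不容易,我们可以一起梳理一下。"},
    ]
}

# convert.py 会把 system 与历史轮次拼接进 instruction,并以"医生:"结尾,
# 最后一轮的 output 作为参考答案,得到 converted.json 中的一条单轮样本:
# {"instruction": "现在你是一个心理专家,……\n\n对话:\n来访者:我最近总是失眠。\n医生:可以说说……\n来访者:工作上的事情太多了。\n医生:",
#  "output": "听起来确实不容易,我们可以一起梳理一下。"}
```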
|
||||
|
||||
### eval.py
|
||||
|
||||
`eval.py` 脚本用于生成医生的回复并进行评估,主要分为以下几部分(列表之后附有一个简化的流程示意):
|
||||
|
||||
1. 加载模型和分词器。
|
||||
2. 设置测试参数,如测试数据数量和批处理大小。
|
||||
3. 准备数据。
|
||||
4. 生成响应并评估。
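下面是一个串起上述步骤的简化流程示意(模型与数据路径仅为示例,未使用批处理,完整实现请以 InternLM2_7B_chat_eval.py、Qwen1_5-0_5B-Chat_eval.py 为准):

```python
import torch
import datasets
from transformers import AutoModelForCausalLM, AutoTokenizer
from metric import compute_metrics

# 1. 加载模型和分词器("./model" 路径仅为示例)
tokenizer = AutoTokenizer.from_pretrained("./model", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "./model", trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
).eval()

# 2. 设置测试参数
test_num = 8  # 测试数据条数(示例值)

# 3. 准备数据:converted.json 每行一条 {"instruction": ..., "output": ...}
dataset = datasets.load_dataset(
    "json", data_files="./data_dir/converted.json", split=f"train[:{test_num}]"
)
references = dataset["output"]

# 4. 生成响应并评估
hypotheses = []
for sample in dataset:
    inputs = tokenizer(sample["instruction"], return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=256)
    reply = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    hypotheses.append(reply)

print(compute_metrics((hypotheses, references)))
```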
|
||||
|
||||
### metric.py
|
||||
|
||||
`metric.py` 脚本包含计算评估指标的函数,可设置按字符级别或按词级别进行评估,目前包含 BLEU 和 ROUGE 分数。
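字符级别与词级别的区别只在切分方式,示意如下(jieba 为上文列出的依赖):

```python
import jieba

text = "我最近总是失眠"
char_level = " ".join(text)             # 字符级别:逐字以空格连接
word_level = " ".join(jieba.cut(text))  # 词级别:先用 jieba 分词,再以空格连接
```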
|
||||
|
||||
## 测试结果
|
||||
|
||||
对data.json中的数据进行测试,结果如下:
|
||||
|
||||
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|
||||
|----------|---------|---------|---------|---------|---------|---------|---------|
|
||||
| Qwen1_5-0_5B-Chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
|
||||
| InternLM2_7B_chat | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
|
111
evaluate/InternLM2_7B_chat_eval.py
Normal file
@@ -0,0 +1,111 @@
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer,DataCollatorWithPadding
|
||||
from qwen_generation_utils import decode_tokens
|
||||
import torch
|
||||
import datasets
|
||||
|
||||
|
||||
model_dir = './model'
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_dir, device_map="auto", padding_side='left',trust_remote_code=True)
|
||||
# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and might cause OOM Error.
|
||||
model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto",pad_token_id=tokenizer.eos_token_id, trust_remote_code=True, torch_dtype=torch.float16)
|
||||
# (Optional) If on low resource devices, you can load model in 4-bit or 8-bit to further save GPU memory via bitsandbytes.
|
||||
# InternLM 7B in 4bit will cost nearly 8GB GPU memory.
|
||||
# pip install -U bitsandbytes
|
||||
# 8-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_8bit=True)
|
||||
# 4-bit: model = AutoModelForCausalLM.from_pretrained(model_dir, device_map="auto", trust_remote_code=True, load_in_4bit=True)
|
||||
model = model.eval()
|
||||
|
||||
# # convert data
|
||||
# import ujson
|
||||
# def transform_conversation_data(raw_data):
|
||||
# try:
|
||||
# instruction = '<|im_start|>system\n'+raw_data.get("conversation", "")[0]['system'] + "<|im_end|>\n"
|
||||
|
||||
# conversation = raw_data.get("conversation", [])
|
||||
# for i, dialog in enumerate(conversation):
|
||||
# instruction += "<|im_start|>user\n来访者:" + dialog["input"]+ "<|im_end|>\n"
|
||||
|
||||
# if i < len(conversation) - 1:
|
||||
# instruction += "<|im_start|>assistant\n医生:" + dialog["output"]+"<|im_end|>\n"
|
||||
|
||||
# response = conversation[-1]["output"] if conversation else ""
|
||||
|
||||
# instruction +="<|im_start|>assistant\n医生:"
|
||||
|
||||
# return {"instruction": instruction, "output": response}
|
||||
|
||||
# except Exception as e:
|
||||
# pass
|
||||
|
||||
|
||||
# with open(f'./data_dir/data.json', 'r', encoding='utf-8') as f1:
|
||||
# data = ujson.load(f1)
|
||||
# with open(f'./data_dir/converted.json', 'w', encoding='utf-8') as f:
|
||||
# for j, item in enumerate(data):
|
||||
# temp=transform_conversation_data(item)
|
||||
# if temp:
|
||||
# transformed_data =ujson.dumps(temp, ensure_ascii=False)
|
||||
# f.write(transformed_data+'\n')
|
||||
|
||||
#set test params
|
||||
|
||||
|
||||
#set test params
|
||||
test_num=1596 #测试数据条数
|
||||
batch_size=12
|
||||
|
||||
|
||||
#prepare data and dataloader
|
||||
dataset = datasets.load_dataset('json', data_files='./data_dir/converted.json',split=f"train[:{test_num}]")
|
||||
references =dataset['output'][:test_num]
|
||||
|
||||
hypotheses = []
|
||||
def preprocess(data):
|
||||
length = list(map(len, data['instruction']))
|
||||
model_inputs=tokenizer(data['instruction'], max_length=512, truncation=True )
|
||||
labels=tokenizer(data['output'], padding=True,max_length=128, truncation=True )
|
||||
model_inputs['labels']=labels['input_ids']
|
||||
model_inputs['length'] = length
|
||||
return model_inputs
|
||||
preprocessed_dataset = dataset.map(preprocess, batched=True,remove_columns=['instruction', 'output',])
|
||||
|
||||
|
||||
collator=DataCollatorWithPadding(tokenizer=tokenizer,)
|
||||
from torch.utils.data import DataLoader
|
||||
|
||||
dataloader = DataLoader(preprocessed_dataset, batch_size=batch_size, collate_fn=collator)
|
||||
|
||||
#generate responses
|
||||
stop_word="<|im_end|>"
|
||||
for batch in dataloader:
|
||||
batch_input_ids = torch.LongTensor(batch['input_ids']).to(model.device)
|
||||
batch_labels = batch['labels']
|
||||
attention_mask = batch['attention_mask']
|
||||
length = batch['length']
|
||||
batch_out_ids = model.generate(
|
||||
batch_input_ids.to(model.device),
|
||||
return_dict_in_generate=False,
|
||||
max_new_tokens=256,
|
||||
do_sample=True,
|
||||
temperature=0.1,
|
||||
eos_token_id=92542
|
||||
)
|
||||
|
||||
padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
|
||||
batch_response = [
|
||||
decode_tokens(
|
||||
batch_out_ids[i][padding_lens[i]:],
|
||||
tokenizer,
|
||||
context_length=0,
|
||||
raw_text_len=length[i],
|
||||
chat_format="raw",
|
||||
verbose=False,
|
||||
errors='replace'
|
||||
).replace("医生:","") for i in range(batch_size)]
|
||||
hypotheses.extend([r.replace(stop_word," ").split()[0] for r in batch_response if stop_word in r])
|
||||
|
||||
|
||||
# Load metric
|
||||
from metric import compute_metrics
|
||||
|
||||
print(compute_metrics((hypotheses,references)))
|
30
evaluate/Professional_evaluation.md
Normal file
@@ -0,0 +1,30 @@
|
||||
# EmoLLM专业指标评估
|
||||
|
||||
## 简介
|
||||
|
||||
本文档介绍一种专业评测方法,并提供 EmoLLM 在专业指标的得分。
|
||||
|
||||
## 评测方法
|
||||
|
||||
本评测方法采用论文《CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling》提出的评测指标与方法。
|
||||
|
||||
* 指标:Comprehensiveness, Professionalism, Authenticity, Safety
|
||||
* 方法:Turn-Based Dialogue Evaluation
|
||||
* 数据集:CPsyCounE
|
||||
|
||||
## 评测结果
|
||||
|
||||
评测模型: [EmoLLM](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model)(InternLM2-7B-chat + qlora), 得分:
|
||||
| Metric | Value |
|
||||
|-------------------|------------|
|
||||
| Comprehensiveness | 1.32 |
|
||||
| Professionalism | 2.20 |
|
||||
| Authenticity | 2.10 |
|
||||
| Safety | 1.00 |
|
||||
|
||||
## 比较
|
||||
|
||||
* [EmoLLM](https://openxlab.org.cn/models/detail/jujimeizuo/EmoLLM_Model) 在 InternLM2-7B-Chat 基础上提升较大;相比 Role-playing ChatGPT 在心理咨询任务上能力相近
|
||||
|
||||
* 对比结果图片来源于论文《CPsyCoun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for Chinese Psychological Counseling》
|
||||

|
80
evaluate/Qwen1_5-0_5B-Chat_eval.py
Normal file
@@ -0,0 +1,80 @@
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer,DataCollatorWithPadding
|
||||
from qwen_generation_utils import decode_tokens
|
||||
import torch
|
||||
import datasets
|
||||
|
||||
#load model and tokenizer
|
||||
tokenizer = AutoTokenizer.from_pretrained(
|
||||
'./EmoLLM_Qwen1_5-0_5B-Chat_full_sft',
|
||||
pad_token='<|extra_0|>',
|
||||
eos_token='<|endoftext|>',
|
||||
padding_side='left',
|
||||
trust_remote_code=True
|
||||
)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
'./EmoLLM_Qwen1_5-0_5B-Chat_full_sft',
|
||||
pad_token_id=tokenizer.pad_token_id,
|
||||
device_map="cuda:0",
|
||||
trust_remote_code=True
|
||||
).eval()
|
||||
|
||||
|
||||
#set test params
|
||||
test_num=1596 #测试数据条数
|
||||
batch_size=12
|
||||
|
||||
|
||||
#prepare data and dataloader
|
||||
dataset = datasets.load_dataset('json', data_files='./data_dir/converted.json',split=f"train[:{test_num}]")
|
||||
references =dataset['output'][:test_num]
|
||||
|
||||
hypotheses = []
|
||||
def preprocess(data):
|
||||
length = list(map(len, data['instruction']))
|
||||
model_inputs=tokenizer(data['instruction'], max_length=512, truncation=True )
|
||||
labels=tokenizer(data['output'], padding=True,max_length=128, truncation=True )
|
||||
model_inputs['labels']=labels['input_ids']
|
||||
model_inputs['length'] = length
|
||||
return model_inputs
|
||||
preprocessed_dataset = dataset.map(preprocess, batched=True,remove_columns=['instruction', 'output',])
|
||||
|
||||
|
||||
collator=DataCollatorWithPadding(tokenizer=tokenizer,)
|
||||
from torch.utils.data import DataLoader
|
||||
|
||||
dataloader = DataLoader(preprocessed_dataset, batch_size=batch_size, collate_fn=collator)
|
||||
|
||||
|
||||
|
||||
#generate responses
|
||||
for batch in dataloader:
|
||||
batch_input_ids = torch.LongTensor(batch['input_ids']).to(model.device)
|
||||
batch_labels = batch['labels']
|
||||
attention_mask = batch['attention_mask']
|
||||
length = batch['length']
|
||||
batch_out_ids = model.generate(
|
||||
batch_input_ids.to(model.device),
|
||||
return_dict_in_generate=False,
|
||||
max_new_tokens=256,
|
||||
temperature=0.1,
|
||||
pad_token_id=tokenizer.eos_token_id
|
||||
)
|
||||
padding_lens = [batch_input_ids[i].eq(tokenizer.pad_token_id).sum().item() for i in range(batch_input_ids.size(0))]
|
||||
batch_response = [
|
||||
decode_tokens(
|
||||
batch_out_ids[i][padding_lens[i]:],
|
||||
tokenizer,
|
||||
context_length=0,
|
||||
raw_text_len=length[i],
|
||||
chat_format="raw",
|
||||
verbose=False,
|
||||
errors='replace'
|
||||
) for i in range(batch_size)
|
||||
]
|
||||
hypotheses.extend(batch_response)
|
||||
|
||||
|
||||
# Load metric
|
||||
from metric import compute_metrics
|
||||
|
||||
print(compute_metrics((hypotheses,references)))
|
21
evaluate/README.md
Normal file
@@ -0,0 +1,21 @@
|
||||
# EmoLLM评测
|
||||
|
||||
## 通用指标评测
|
||||
|
||||
* 具体指标、方法见 [General_evaluation.md](./General_evaluation.md)
|
||||
|
||||
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
|
||||
|----------|---------|---------|---------|---------|---------|---------|---------|
|
||||
| Qwen1_5-0_5B-Chat | 27.23% | 8.55% | 17.05% | 26.65% | 13.11% | 7.19% | 4.05% |
|
||||
| InternLM2_7B_chat | 37.86% | 15.23% | 24.34% | 39.71% | 22.66% | 14.26% | 9.21% |
|
||||
|
||||
## 专业指标评测
|
||||
|
||||
* 具体指标、方法见 [Professional_evaluation.md](./Professional_evaluation.md)
|
||||
|
||||
| Metric | Value |
|
||||
|-------------------|------------|
|
||||
| Comprehensiveness | 1.32 |
|
||||
| Professionalism | 2.20 |
|
||||
| Authenticity | 2.10 |
|
||||
| Safety | 1.00 |
|
26
evaluate/README_EN.md
Normal file
@@ -0,0 +1,26 @@
|
||||
# EmoLLM Evaluation
|
||||
|
||||
## General Metrics Evaluation
|
||||
|
||||
* For specific metrics and methods, see [General_evaluation.md](./General_evaluation_EN.md)
|
||||
|
||||
| Metric | Value |
|
||||
|---------|----------------------|
|
||||
| ROUGE-1 | 27.23% |
|
||||
| ROUGE-2 | 8.55% |
|
||||
| ROUGE-L | 17.05% |
|
||||
| BLEU-1 | 26.65% |
|
||||
| BLEU-2 | 13.11% |
|
||||
| BLEU-3 | 7.19% |
|
||||
| BLEU-4 | 4.05% |
|
||||
|
||||
## Professional Metrics Evaluation
|
||||
|
||||
* For specific metrics and methods, see [Professional_evaluation_EN.md](./Professional_evaluation_EN.md)
|
||||
|
||||
| Metric | Value |
|
||||
|-------------------|------------|
|
||||
| Comprehensiveness | 1.32 |
|
||||
| Professionalism | 2.20 |
|
||||
| Authenticity | 2.10 |
|
||||
| Safety | 1.00 |
|
31
evaluate/data_dir/convert.py
Normal file
@@ -0,0 +1,31 @@
|
||||
import ujson
|
||||
def transform_conversation_data(raw_data):
|
||||
try:
|
||||
instruction = raw_data.get("conversation", "")[0]['system'] + "\n\n对话:"
|
||||
|
||||
conversation = raw_data.get("conversation", [])
|
||||
for i, dialog in enumerate(conversation):
|
||||
instruction += "\n来访者:" + dialog["input"]
|
||||
|
||||
if i < len(conversation) - 1:
|
||||
instruction += "\n医生:" + dialog["output"]
|
||||
|
||||
response = conversation[-1]["output"] if conversation else ""
|
||||
|
||||
instruction += "\n医生:"
|
||||
|
||||
return {"instruction": instruction, "output": response}
|
||||
|
||||
except Exception as e:
|
||||
pass
|
||||
|
||||
|
||||
with open(f'./train_dir/data.json', 'r', encoding='utf-8') as f1:
|
||||
data = ujson.load(f1)
|
||||
with open(f'./train_dir/converted.json', 'w', encoding='utf-8') as f:
|
||||
for j, item in enumerate(data):
|
||||
temp=transform_conversation_data(item)
|
||||
if temp:
|
||||
transformed_data =ujson.dumps(temp, ensure_ascii=False)
|
||||
f.write(transformed_data+'\n')
|
||||
print('********')
|
1596
evaluate/data_dir/converted.json
Normal file
File diff suppressed because it is too large
Load Diff
28282
evaluate/data_dir/data.json
Normal file
File diff suppressed because it is too large
Load Diff
33
evaluate/metric.py
Normal file
@@ -0,0 +1,33 @@
|
||||
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
|
||||
from rouge import Rouge
|
||||
import numpy as np
|
||||
import jieba
|
||||
def compute_metrics(eval_pred):
|
||||
predictions, labels = eval_pred
|
||||
|
||||
# 字符级别
|
||||
# decoded_preds = [" ".join((pred.replace(" ", ""))) for pred in predictions]
|
||||
# decoded_labels = [" ".join((label.replace(" ", ""))) for label in labels]
|
||||
|
||||
# 词级别
|
||||
decoded_preds = [" ".join(jieba.cut(pred.replace(" ", ""))) for pred in predictions]
|
||||
decoded_labels = [" ".join(jieba.cut(label.replace(" ", ""))) for label in labels]
|
||||
|
||||
|
||||
|
||||
|
||||
rouge = Rouge()
|
||||
|
||||
bleu =np.array([0.,0.,0.,0.])
|
||||
weights = [(1.,0.,0.,0.),(1./2., 1./2.),(1./3., 1./3., 1./3.),(1./4., 1./4., 1./4., 1./4.)]
|
||||
for decoded_label, decoded_pred in zip(decoded_labels, decoded_preds):
|
||||
bleu +=np.array( sentence_bleu(
|
||||
references=[decoded_label.split(' ')],
|
||||
hypothesis=decoded_pred.split(' '),
|
||||
smoothing_function=SmoothingFunction().method1,weights=weights
|
||||
))
|
||||
bleu /= len(decoded_labels)
|
||||
result = rouge.get_scores(decoded_preds, decoded_labels, avg=True)
|
||||
result = {key: value['f'] * 100 for key, value in result.items()}
|
||||
result["bleu"] = {'bleu_1':bleu[0] * 100,'bleu_2':bleu[1] * 100,'bleu_3':bleu[2] * 100,'bleu_4':bleu[3] * 100}
|
||||
return result
|
416
evaluate/qwen_generation_utils.py
Normal file
@@ -0,0 +1,416 @@
|
||||
# Copyright (c) Alibaba Cloud.
|
||||
#
|
||||
# This source code is licensed under the license found in the
|
||||
# LICENSE file in the root directory of this source tree.
|
||||
|
||||
"""Generation support."""
|
||||
|
||||
from typing import Tuple, List, Union, Iterable
|
||||
|
||||
import numpy as np
|
||||
import torch
|
||||
import torch.nn.functional as F
|
||||
from transformers import PreTrainedTokenizer
|
||||
from transformers import logging
|
||||
from transformers.generation import LogitsProcessor
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
# Types.
|
||||
HistoryType = List[Tuple[str, str]]
|
||||
TokensType = List[int]
|
||||
BatchTokensType = List[List[int]]
|
||||
|
||||
|
||||
def pad_batch(batch: BatchTokensType, pad_id: int, seq_length: int) -> BatchTokensType:
|
||||
for tokens in batch:
|
||||
context_length = len(tokens)
|
||||
if context_length < seq_length:
|
||||
tokens.extend([pad_id] * (seq_length - context_length))
|
||||
return batch
|
||||
|
||||
|
||||
def get_ltor_masks_and_position_ids(
|
||||
data,
|
||||
eod_token,
|
||||
reset_position_ids,
|
||||
reset_attention_mask,
|
||||
eod_mask_loss,
|
||||
):
|
||||
"""Build masks and position id for left to right model."""
|
||||
|
||||
# Extract batch size and sequence length.
|
||||
micro_batch_size, seq_length = data.size()
|
||||
|
||||
# Attention mask (lower triangular).
|
||||
if reset_attention_mask:
|
||||
att_mask_batch = micro_batch_size
|
||||
else:
|
||||
att_mask_batch = 1
|
||||
attention_mask = torch.tril(
|
||||
torch.ones((att_mask_batch, seq_length, seq_length), device=data.device)
|
||||
).view(att_mask_batch, 1, seq_length, seq_length)
|
||||
|
||||
# Loss mask.
|
||||
loss_mask = torch.ones(data.size(), dtype=torch.float, device=data.device)
|
||||
if eod_mask_loss:
|
||||
loss_mask[data == eod_token] = 0.0
|
||||
|
||||
# Position ids.
|
||||
position_ids = torch.arange(seq_length, dtype=torch.long, device=data.device)
|
||||
position_ids = position_ids.unsqueeze(0).expand_as(data)
|
||||
# We need to clone as the ids will be modifed based on batch index.
|
||||
if reset_position_ids:
|
||||
position_ids = position_ids.clone()
|
||||
|
||||
if reset_position_ids or reset_attention_mask:
|
||||
# Loop through the batches:
|
||||
for b in range(micro_batch_size):
|
||||
|
||||
# Find indecies where EOD token is.
|
||||
eod_index = position_ids[b, data[b] == eod_token]
|
||||
# Detach indecies from positions if going to modify positions.
|
||||
if reset_position_ids:
|
||||
eod_index = eod_index.clone()
|
||||
|
||||
# Loop through EOD indecies:
|
||||
prev_index = 0
|
||||
for j in range(eod_index.size()[0]):
|
||||
i = eod_index[j]
|
||||
# Mask attention loss.
|
||||
if reset_attention_mask:
|
||||
attention_mask[b, 0, (i + 1) :, : (i + 1)] = 0
|
||||
# Reset positions.
|
||||
if reset_position_ids:
|
||||
position_ids[b, (i + 1) :] -= i + 1 - prev_index
|
||||
prev_index = i + 1
|
||||
|
||||
# Convert attention mask to binary:
|
||||
attention_mask = attention_mask < 0.5
|
||||
|
||||
return attention_mask, loss_mask, position_ids
|
||||
|
||||
|
||||
def get_batch(context_tokens: torch.LongTensor, eod_id: int):
|
||||
"""Generate batch from context tokens."""
|
||||
# Move to GPU.
|
||||
tokens = context_tokens.contiguous().to(context_tokens.device)
|
||||
# Get the attention mask and postition ids.
|
||||
attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
|
||||
tokens,
|
||||
eod_id,
|
||||
reset_position_ids=False,
|
||||
reset_attention_mask=False,
|
||||
eod_mask_loss=False,
|
||||
)
|
||||
return tokens, attention_mask, position_ids
|
||||
|
||||
|
||||
def get_stop_words_ids(chat_format, tokenizer):
|
||||
if chat_format == "raw":
|
||||
stop_words_ids = [tokenizer.encode("Human:"), [tokenizer.eod_id]]
|
||||
elif chat_format == "chatml":
|
||||
stop_words_ids = [[tokenizer.im_end_id], [tokenizer.im_start_id]]
|
||||
else:
|
||||
raise NotImplementedError(f"Unknown chat format {chat_format!r}")
|
||||
return stop_words_ids
|
||||
|
||||
|
||||
def make_context(
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
query: str,
|
||||
history: List[Tuple[str, str]] = None,
|
||||
system: str = "",
|
||||
max_window_size: int = 6144,
|
||||
chat_format: str = "chatml",
|
||||
):
|
||||
if history is None:
|
||||
history = []
|
||||
|
||||
if chat_format == "chatml":
|
||||
im_start, im_end = "<|im_start|>", "<|im_end|>"
|
||||
im_start_tokens = [tokenizer.im_start_id]
|
||||
im_end_tokens = [tokenizer.im_end_id]
|
||||
nl_tokens = tokenizer.encode("\n")
|
||||
|
||||
def _tokenize_str(role, content):
|
||||
return f"{role}\n{content}", tokenizer.encode(
|
||||
role, allowed_special=set()
|
||||
) + nl_tokens + tokenizer.encode(content, allowed_special=set())
|
||||
|
||||
system_text, system_tokens_part = _tokenize_str("system", system)
|
||||
system_tokens = im_start_tokens + system_tokens_part + im_end_tokens
|
||||
|
||||
raw_text = ""
|
||||
context_tokens = []
|
||||
|
||||
for turn_query, turn_response in reversed(history):
|
||||
query_text, query_tokens_part = _tokenize_str("user", turn_query)
|
||||
query_tokens = im_start_tokens + query_tokens_part + im_end_tokens
|
||||
response_text, response_tokens_part = _tokenize_str(
|
||||
"assistant", turn_response
|
||||
)
|
||||
response_tokens = im_start_tokens + response_tokens_part + im_end_tokens
|
||||
|
||||
next_context_tokens = nl_tokens + query_tokens + nl_tokens + response_tokens
|
||||
prev_chat = (
|
||||
f"\n{im_start}{query_text}{im_end}\n{im_start}{response_text}{im_end}"
|
||||
)
|
||||
|
||||
current_context_size = (
|
||||
len(system_tokens) + len(next_context_tokens) + len(context_tokens)
|
||||
)
|
||||
if current_context_size < max_window_size:
|
||||
context_tokens = next_context_tokens + context_tokens
|
||||
raw_text = prev_chat + raw_text
|
||||
else:
|
||||
break
|
||||
|
||||
context_tokens = system_tokens + context_tokens
|
||||
raw_text = f"{im_start}{system_text}{im_end}" + raw_text
|
||||
context_tokens += (
|
||||
nl_tokens
|
||||
+ im_start_tokens
|
||||
+ _tokenize_str("user", query)[1]
|
||||
+ im_end_tokens
|
||||
+ nl_tokens
|
||||
+ im_start_tokens
|
||||
+ tokenizer.encode("assistant")
|
||||
+ nl_tokens
|
||||
)
|
||||
raw_text += f"\n{im_start}user\n{query}{im_end}\n{im_start}assistant\n"
|
||||
|
||||
elif chat_format == "raw":
|
||||
raw_text = query
|
||||
context_tokens = tokenizer.encode(raw_text)
|
||||
else:
|
||||
raise NotImplementedError(f"Unknown chat format {chat_format!r}")
|
||||
|
||||
return raw_text, context_tokens
|
||||
|
||||
|
||||
def _decode_default(
|
||||
tokens: List[int],
|
||||
*,
|
||||
stop_words: List[str],
|
||||
eod_words: List[str],
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
raw_text_len: int,
|
||||
verbose: bool = False,
|
||||
return_end_reason: bool = False,
|
||||
errors: str='replace',
|
||||
):
|
||||
trim_decode_tokens = tokenizer.decode(tokens, errors=errors)[raw_text_len:]
|
||||
if verbose:
|
||||
print("\nRaw Generate: ", trim_decode_tokens)
|
||||
|
||||
end_reason = f"Gen length {len(tokens)}"
|
||||
for stop_word in stop_words:
|
||||
trim_decode_tokens = trim_decode_tokens.replace(stop_word, "").strip()
|
||||
for eod_word in eod_words:
|
||||
if eod_word in trim_decode_tokens:
|
||||
end_reason = f"Gen {eod_word!r}"
|
||||
trim_decode_tokens = trim_decode_tokens.split(eod_word)[0]
|
||||
trim_decode_tokens = trim_decode_tokens.strip()
|
||||
if verbose:
|
||||
print("\nEnd Reason:", end_reason)
|
||||
print("\nGenerate: ", trim_decode_tokens)
|
||||
|
||||
if return_end_reason:
|
||||
return trim_decode_tokens, end_reason
|
||||
else:
|
||||
return trim_decode_tokens
|
||||
|
||||
|
||||
def _decode_chatml(
|
||||
tokens: List[int],
|
||||
*,
|
||||
stop_words: List[str],
|
||||
eod_token_ids: List[int],
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
raw_text_len: int,
|
||||
context_length: int,
|
||||
verbose: bool = False,
|
||||
return_end_reason: bool = False,
|
||||
errors: str='replace'
|
||||
):
|
||||
end_reason = f"Gen length {len(tokens)}"
|
||||
eod_token_idx = context_length
|
||||
for eod_token_idx in range(context_length, len(tokens)):
|
||||
if tokens[eod_token_idx] in eod_token_ids:
|
||||
end_reason = f"Gen {tokenizer.decode([tokens[eod_token_idx]])!r}"
|
||||
break
|
||||
|
||||
trim_decode_tokens = tokenizer.decode(tokens[:eod_token_idx], errors=errors)[raw_text_len:]
|
||||
if verbose:
|
||||
print("\nRaw Generate w/o EOD:", tokenizer.decode(tokens, errors=errors)[raw_text_len:])
|
||||
print("\nRaw Generate:", trim_decode_tokens)
|
||||
print("\nEnd Reason:", end_reason)
|
||||
for stop_word in stop_words:
|
||||
trim_decode_tokens = trim_decode_tokens.replace(stop_word, "").strip()
|
||||
trim_decode_tokens = trim_decode_tokens.strip()
|
||||
if verbose:
|
||||
print("\nGenerate:", trim_decode_tokens)
|
||||
|
||||
if return_end_reason:
|
||||
return trim_decode_tokens, end_reason
|
||||
else:
|
||||
return trim_decode_tokens
|
||||
|
||||
|
||||
def decode_tokens(
|
||||
tokens: Union[torch.LongTensor, TokensType],
|
||||
tokenizer: PreTrainedTokenizer,
|
||||
raw_text_len: int,
|
||||
context_length: int,
|
||||
chat_format: str,
|
||||
verbose: bool = False,
|
||||
return_end_reason: bool = False,
|
||||
errors: str="replace",
|
||||
) -> str:
|
||||
if torch.is_tensor(tokens):
|
||||
tokens = tokens.cpu().numpy().tolist()
|
||||
|
||||
if chat_format == "chatml":
|
||||
return _decode_chatml(
|
||||
tokens,
|
||||
stop_words=[],
|
||||
eod_token_ids=[tokenizer.im_start_id, tokenizer.im_end_id],
|
||||
tokenizer=tokenizer,
|
||||
raw_text_len=raw_text_len,
|
||||
context_length=context_length,
|
||||
verbose=verbose,
|
||||
return_end_reason=return_end_reason,
|
||||
errors=errors,
|
||||
)
|
||||
elif chat_format == "raw":
|
||||
return _decode_default(
|
||||
tokens,
|
||||
stop_words=["<|endoftext|>"],
|
||||
eod_words=["<|endoftext|>"],
|
||||
tokenizer=tokenizer,
|
||||
raw_text_len=raw_text_len,
|
||||
verbose=verbose,
|
||||
return_end_reason=return_end_reason,
|
||||
errors=errors,
|
||||
)
|
||||
else:
|
||||
raise NotImplementedError(f"Unknown chat format {chat_format!r}")
|
||||
|
||||
|
||||
class StopWordsLogitsProcessor(LogitsProcessor):
|
||||
"""
|
||||
:class:`transformers.LogitsProcessor` that enforces that when specified sequences appear, generation stops.
|
||||
|
||||
Args:
|
||||
stop_words_ids (:obj:`List[List[int]]`):
|
||||
List of list of token ids of stop ids. In order to get the tokens of the words
|
||||
that should not appear in the generated text, use :obj:`tokenizer(bad_word,
|
||||
add_prefix_space=True).input_ids`.
|
||||
eos_token_id (:obj:`int`):
|
||||
The id of the `end-of-sequence` token.
|
||||
"""
|
||||
|
||||
def __init__(self, stop_words_ids: Iterable[Iterable[int]], eos_token_id: int):
|
||||
|
||||
if not isinstance(stop_words_ids, List) or len(stop_words_ids) == 0:
|
||||
raise ValueError(
|
||||
f"`stop_words_ids` has to be a non-emtpy list, but is {stop_words_ids}."
|
||||
)
|
||||
if any(not isinstance(bad_word_ids, list) for bad_word_ids in stop_words_ids):
|
||||
raise ValueError(
|
||||
f"`stop_words_ids` has to be a list of lists, but is {stop_words_ids}."
|
||||
)
|
||||
if any(
|
||||
any(
|
||||
(not isinstance(token_id, (int, np.integer)) or token_id < 0)
|
||||
for token_id in stop_word_ids
|
||||
)
|
||||
for stop_word_ids in stop_words_ids
|
||||
):
|
||||
raise ValueError(
|
||||
f"Each list in `stop_words_ids` has to be a list of positive integers, but is {stop_words_ids}."
|
||||
)
|
||||
|
||||
self.stop_words_ids = list(
|
||||
filter(
|
||||
lambda bad_token_seq: bad_token_seq != [eos_token_id], stop_words_ids
|
||||
)
|
||||
)
|
||||
self.eos_token_id = eos_token_id
|
||||
for stop_token_seq in self.stop_words_ids:
|
||||
assert (
|
||||
len(stop_token_seq) > 0
|
||||
), "Stop words token sequences {} cannot have an empty list".format(
|
||||
stop_words_ids
|
||||
)
|
||||
|
||||
def __call__(
|
||||
self, input_ids: torch.LongTensor, scores: torch.FloatTensor
|
||||
) -> torch.FloatTensor:
|
||||
stopped_samples = self._calc_stopped_samples(input_ids)
|
||||
for i, should_stop in enumerate(stopped_samples):
|
||||
if should_stop:
|
||||
scores[i, self.eos_token_id] = float(2**15)
|
||||
return scores
|
||||
|
||||
def _tokens_match(self, prev_tokens: torch.LongTensor, tokens: List[int]) -> bool:
|
||||
if len(tokens) == 0:
|
||||
# if bad word tokens is just one token always ban it
|
||||
return True
|
||||
elif len(tokens) > len(prev_tokens):
|
||||
# if bad word tokens are longer then prev input_ids they can't be equal
|
||||
return False
|
||||
elif prev_tokens[-len(tokens) :].tolist() == tokens:
|
||||
# if tokens match
|
||||
return True
|
||||
else:
|
||||
return False
|
||||
|
||||
def _calc_stopped_samples(self, prev_input_ids: Iterable[int]) -> Iterable[int]:
|
||||
stopped_samples = []
|
||||
for prev_input_ids_slice in prev_input_ids:
|
||||
match = False
|
||||
for stop_token_seq in self.stop_words_ids:
|
||||
if self._tokens_match(prev_input_ids_slice, stop_token_seq):
|
||||
# a stop sequence matched: mark this sample as stopped
|
||||
match = True
|
||||
break
|
||||
stopped_samples.append(match)
|
||||
|
||||
return stopped_samples
|
||||
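def _stop_words_processor_usage_sketch(model, tokenizer, input_ids):
    """Illustrative usage sketch (not part of the original file).

    Shows how `StopWordsLogitsProcessor` could be attached to a HuggingFace
    `generate` call. The stop word "Observation:" and the argument values are
    placeholders; adapt them to your own setup.
    """
    from transformers import LogitsProcessorList

    stop_words_ids = [tokenizer.encode("Observation:")]  # one token-id list per stop word
    processors = LogitsProcessorList(
        [StopWordsLogitsProcessor(stop_words_ids=stop_words_ids,
                                  eos_token_id=tokenizer.eos_token_id)]
    )
    # Once any stop sequence appears, the processor boosts the EOS score so decoding halts.
    return model.generate(input_ids, logits_processor=processors, max_new_tokens=256)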
|
||||
|
||||
def top_k_logits(logits, top_k=0, top_p=0.0, filter_value=-float("Inf")):
|
||||
"""This function has been mostly taken from huggingface conversational
|
||||
ai code at
|
||||
https://medium.com/huggingface/how-to-build-a-state-of-the-art-
|
||||
conversational-ai-with-transfer-learning-2d818ac26313"""
|
||||
|
||||
if top_k > 0:
|
||||
# Remove all tokens with a probability less than the
|
||||
# last token of the top-k
|
||||
indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
|
||||
logits[indices_to_remove] = filter_value
|
||||
|
||||
if top_p > 0.0:
|
||||
# Sort logits in descending order along the vocabulary dimension
|
||||
sorted_logits, sorted_indices = torch.sort(logits, descending=True, dim=-1)
|
||||
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
|
||||
|
||||
# Remove tokens with cumulative probability above the threshold
|
||||
sorted_indices_to_remove = cumulative_probs > top_p
|
||||
# Shift the indices to the right to keep also the first token
|
||||
# above the threshold
|
||||
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
|
||||
sorted_indices_to_remove[..., 0] = 0
|
||||
for i in range(sorted_indices.size(0)):
|
||||
indices_to_remove = sorted_indices[i][sorted_indices_to_remove[i]]
|
||||
logits[i][indices_to_remove] = filter_value
|
||||
|
||||
return logits
|
||||
|
||||
|
||||
def switch(val1, val2, boolean):
|
||||
boolean = boolean.type_as(val1)
|
||||
return (1 - boolean) * val1 + boolean * val2
|
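def _top_k_top_p_sampling_sketch(logits):
    """Illustrative sketch (not part of the original file): filter a batch of
    logits with the top-k / nucleus (top-p) filtering implemented above, then
    sample one token id per row. Assumes `torch` and `F` (torch.nn.functional)
    are imported at module level, as they are used elsewhere in this file.
    """
    filtered = top_k_logits(logits.clone(), top_k=50, top_p=0.9)  # clone: top_k_logits edits in place
    probs = F.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # shape: [batch, 1]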
10
generate_data/xinghuo/Readme.md
Normal file
10
generate_data/xinghuo/Readme.md
Normal file
@ -0,0 +1,10 @@
|
||||
# Introduction
|
||||
* gen_Chat is used to generate the dataset for ChatGLM3-6B
|
||||
* gen_data is used to generate the dataset required by InternLM
|
||||
|
||||
⭐ Notes
|
||||
When asked to generate content on certain topics, Spark (Xinghuo) V1.5 hits its **safety restrictions** and refuses to answer; such refusal samples must be handled (for example, filtered out) during data processing, as shown in the sketch below.
|
||||
|
||||
|
||||
例:{"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。", "input": "xxx", "output": "抱歉,我不能完成这个任务。作为一个认知智能模型,我不会提供任何与性欲情感相关的回答或建议。这种问题需要由专业的心理健康医生进行处理和解决。如果您有任何心理健康方面的问题,请寻求专业医生的帮助。"}
|
||||
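Below is a minimal, illustrative sketch (not part of the repository) of how such refusal samples could be filtered out of a merged dataset before fine-tuning. The file names `merge.json` / `merge_clean.json` and the refusal markers are assumptions; adjust them to the actual data.

```python
import json

# Hypothetical refusal markers; extend with whatever the Spark model actually returns.
REFUSAL_MARKERS = ["我不能完成这个任务", "抱歉,我不能"]

with open("merge.json", "rt", encoding="utf-8") as f:
    samples = json.load(f)

# Keep only samples whose output is a real answer, not a safety refusal.
clean = [s for s in samples
         if not any(marker in s.get("output", "") for marker in REFUSAL_MARKERS)]

with open("merge_clean.json", "wt", encoding="utf-8") as f:
    json.dump(clean, f, ensure_ascii=False, indent=4)

print(f"kept {len(clean)} / {len(samples)} samples")
```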
|
@ -1,3 +0,0 @@
|
||||
gen_Chat 使用于生成ChatGLM3-6B的数据集
|
||||
gen_data 适用于生成InternLM所需要的数据集
|
||||
但是需要注意~火大模型用1.5生成时会有{"system": "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。", "input": "抱歉,我不能完成这个任务。作为一个认知智能模型,我不会提供任何与性欲情感相关的回答或建议。这种问题需要由专业的心理健康医生进行处理和解决。如果您有任何心理健康方面的问题,请寻求专业医生的帮助。", "output": ""}类似这样的数据集,要注意数据处理
|
@ -3,39 +3,38 @@ import os
|
||||
|
||||
|
||||
def save_merge_json(data_lis, file_path):
|
||||
import json
|
||||
|
||||
with open(file_path, 'wt', encoding='utf-8') as file:
|
||||
json.dump(data_lis, file, indent=4, ensure_ascii=False)
|
||||
json.dump(data_lis, file, ensure_ascii=False)
|
||||
|
||||
|
||||
def get_all_file_paths(folder_path, suffix=''):
|
||||
print(folder_path)
|
||||
files = os.listdir(folder_path)
|
||||
path = []
|
||||
for file in files:
|
||||
file_path = os.path.join(folder_path, file)
|
||||
if os.path.isdir(file_path):
|
||||
path.extend(get_all_file_paths(file_path))
|
||||
else:
|
||||
if file_path.endswith(suffix):
|
||||
path.append(file_path)
|
||||
return path
|
||||
def get_all_file_paths(folder_path):
|
||||
# Make sure the argument is a directory
|
||||
if not os.path.isdir(folder_path):
|
||||
raise ValueError(f"{folder_path} is not a valid directory")
|
||||
|
||||
# Get the paths of all files directly under the folder
|
||||
file_paths = [os.path.join(folder_path, file) for file in os.listdir(
|
||||
folder_path) if os.path.isfile(os.path.join(folder_path, file))]
|
||||
return file_paths
|
||||
|
||||
|
||||
if __name__ == '__main__':
|
||||
conversion_lis = []
|
||||
folder_path = './' # input
|
||||
merge_path = 'merge.json' # input
|
||||
paths = get_all_file_paths(folder_path=folder_path, suffix='.json')
|
||||
|
||||
for path in paths:
|
||||
for path in get_all_file_paths(r'data\res-aiwei'):
|
||||
print(path)
|
||||
with open(path, 'rt', encoding='utf-8') as lines:
|
||||
datas = []
|
||||
for line in lines:
|
||||
datas.append(line)
|
||||
|
||||
with open(path, 'rt', encoding='utf-8') as file:
|
||||
for line in file:
|
||||
# Strip the trailing newline
|
||||
line = line.rstrip('\n')
|
||||
# Parse the JSON object on this line
|
||||
try:
|
||||
datas = json.loads(''.join(datas))
|
||||
conversion_lis.extend(datas)
|
||||
data = json.loads(line)
|
||||
conversion_lis.append(data)
|
||||
except json.JSONDecodeError as e:
|
||||
print(f"Error decoding JSON: {e}")
|
||||
save_merge_json(data_lis=conversion_lis, file_path=merge_path)
|
||||
save_merge_json(data_lis=conversion_lis,
|
||||
file_path=r'.\merge.json')
|
||||
|
69
scripts/pdf2txt.py
Normal file
69
scripts/pdf2txt.py
Normal file
@ -0,0 +1,69 @@
|
||||
import os
|
||||
import sys
|
||||
import glob
|
||||
try:
|
||||
import cv2
|
||||
except ImportError:
|
||||
os.system('pip install opencv-python')
|
||||
import cv2
|
||||
try:
|
||||
from paddleocr import PaddleOCR, draw_ocr, download_with_progressbar
|
||||
except ImportError:
|
||||
os.system('pip install paddleocr')
|
||||
from paddleocr import PaddleOCR, draw_ocr, download_with_progressbar
|
||||
output_folder_path = 'res/'
|
||||
if not os.path.exists(output_folder_path):
|
||||
os.makedirs(output_folder_path)
|
||||
|
||||
def get_pdf_files_in_directory(directory_path):
|
||||
# Make sure the path exists and is a directory
|
||||
if os.path.exists(directory_path) and os.path.isdir(directory_path):
|
||||
return glob.glob(os.path.join(directory_path, '**', '*.pdf'), recursive=True)
|
||||
else:
|
||||
return []
|
||||
def ocr_pdf_folder(folder_path):
|
||||
ocr = PaddleOCR(use_angle_cls=True, lang="ch", page_num=0)  # run once to download the model and load it into memory
|
||||
print("ppocrv4 加载完毕!!!")
|
||||
pdf_paths = get_pdf_files_in_directory(folder_path)
|
||||
print(f"共检测到 {len(pdf_paths)} 个PDF文件")
|
||||
# Process every detected PDF file
|
||||
for pdf_path in pdf_paths:
|
||||
print(f'正在处理文件:{pdf_path}')
|
||||
|
||||
result = ocr.ocr(pdf_path, cls=True)
|
||||
for idx in range(len(result)):
|
||||
res = result[idx]
|
||||
for line in res:
|
||||
print(line)
|
||||
print(f'{pdf_path} 处理完毕')
|
||||
ocr_result = ""
|
||||
for idx in range(len(result)):
|
||||
res = result[idx]
|
||||
for line in res:
|
||||
# print(line[1][0])
|
||||
ocr_result = f"{ocr_result} {str(line[1][0])}"
|
||||
|
||||
filename = os.path.splitext(os.path.basename(pdf_path))[0]
|
||||
|
||||
# Build the full path of the output TXT file
|
||||
txt_path = os.path.join('res/', f'{filename}.txt')
|
||||
|
||||
# Write the extracted text to the TXT file
|
||||
with open(txt_path, 'w', encoding='utf-8') as txt_file:
|
||||
txt_file.write(ocr_result)
|
||||
|
||||
print(f'生成的txt文档保存在{txt_path}')
|
||||
# break
|
||||
# print(ocr_result)
|
||||
# with open('my_file.txt', 'a') as f:
|
||||
# # 写入字符串
|
||||
# f.write(ocr_result)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
if len(sys.argv) > 1:
|
||||
# sys.argv[0] is the script name; sys.argv[1:] are the arguments passed to the script
|
||||
pdf_path = sys.argv[1]
|
||||
print(f'需要处理的文件夹是:{pdf_path}')
|
||||
ocr_pdf_folder(pdf_path)
|
||||
|
267
web_demo-aiwei.py
Normal file
267
web_demo-aiwei.py
Normal file
@ -0,0 +1,267 @@
|
||||
"""
|
||||
This script refers to the dialogue example of streamlit, the interactive generation code of chatglm2 and transformers.
|
||||
We mainly modified part of the code logic to adapt to the generation of our model.
|
||||
Please refer to these links below for more information:
|
||||
1. streamlit chat example: https://docs.streamlit.io/knowledge-base/tutorials/build-conversational-apps
|
||||
2. chatglm2: https://github.com/THUDM/ChatGLM2-6B
|
||||
3. transformers: https://github.com/huggingface/transformers
|
||||
Please run with the command `streamlit run path/to/web_demo.py --server.address=0.0.0.0 --server.port 7860`.
|
||||
Using `python path/to/web_demo.py` may cause unknown problems.
|
||||
"""
|
||||
import copy
|
||||
import warnings
|
||||
from dataclasses import asdict, dataclass
|
||||
from typing import Callable, List, Optional
|
||||
|
||||
import streamlit as st
|
||||
import torch
|
||||
from torch import nn
|
||||
from transformers.generation.utils import LogitsProcessorList, StoppingCriteriaList
|
||||
from transformers.utils import logging
|
||||
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM # isort: skip
|
||||
from openxlab.model import download
|
||||
|
||||
logger = logging.get_logger(__name__)
|
||||
|
||||
download(model_repo='ajupyter/EmoLLM_aiwei',
|
||||
output='model')
|
||||
|
||||
@dataclass
|
||||
class GenerationConfig:
|
||||
# this config is used for chat to provide more diversity
|
||||
max_length: int = 32768
|
||||
top_p: float = 0.8
|
||||
temperature: float = 0.8
|
||||
do_sample: bool = True
|
||||
repetition_penalty: float = 1.005
|
||||
|
||||
|
||||
@torch.inference_mode()
|
||||
def generate_interactive(
|
||||
model,
|
||||
tokenizer,
|
||||
prompt,
|
||||
generation_config: Optional[GenerationConfig] = None,
|
||||
logits_processor: Optional[LogitsProcessorList] = None,
|
||||
stopping_criteria: Optional[StoppingCriteriaList] = None,
|
||||
prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
|
||||
additional_eos_token_id: Optional[int] = None,
|
||||
**kwargs,
|
||||
):
|
||||
inputs = tokenizer([prompt], padding=True, return_tensors="pt")
|
||||
input_length = len(inputs["input_ids"][0])
|
||||
for k, v in inputs.items():
|
||||
inputs[k] = v.cuda()
|
||||
input_ids = inputs["input_ids"]
|
||||
batch_size, input_ids_seq_length = input_ids.shape[0], input_ids.shape[-1] # noqa: F841 # pylint: disable=W0612
|
||||
if generation_config is None:
|
||||
generation_config = model.generation_config
|
||||
generation_config = copy.deepcopy(generation_config)
|
||||
model_kwargs = generation_config.update(**kwargs)
|
||||
bos_token_id, eos_token_id = ( # noqa: F841 # pylint: disable=W0612
|
||||
generation_config.bos_token_id,
|
||||
generation_config.eos_token_id,
|
||||
)
|
||||
if isinstance(eos_token_id, int):
|
||||
eos_token_id = [eos_token_id]
|
||||
if additional_eos_token_id is not None:
|
||||
eos_token_id.append(additional_eos_token_id)
|
||||
has_default_max_length = kwargs.get("max_length") is None and generation_config.max_length is not None
|
||||
if has_default_max_length and generation_config.max_new_tokens is None:
|
||||
warnings.warn(
|
||||
f"Using `max_length`'s default ({generation_config.max_length}) to control the generation length. "
|
||||
"This behaviour is deprecated and will be removed from the config in v5 of Transformers -- we"
|
||||
" recommend using `max_new_tokens` to control the maximum length of the generation.",
|
||||
UserWarning,
|
||||
)
|
||||
elif generation_config.max_new_tokens is not None:
|
||||
generation_config.max_length = generation_config.max_new_tokens + input_ids_seq_length
|
||||
if not has_default_max_length:
|
||||
logger.warn( # pylint: disable=W4902
|
||||
f"Both `max_new_tokens` (={generation_config.max_new_tokens}) and `max_length`(="
|
||||
f"{generation_config.max_length}) seem to have been set. `max_new_tokens` will take precedence. "
|
||||
"Please refer to the documentation for more information. "
|
||||
"(https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)",
|
||||
UserWarning,
|
||||
)
|
||||
|
||||
if input_ids_seq_length >= generation_config.max_length:
|
||||
input_ids_string = "input_ids"
|
||||
logger.warning(
|
||||
f"Input length of {input_ids_string} is {input_ids_seq_length}, but `max_length` is set to"
|
||||
f" {generation_config.max_length}. This can lead to unexpected behavior. You should consider"
|
||||
" increasing `max_new_tokens`."
|
||||
)
|
||||
|
||||
# 2. Set generation parameters if not already defined
|
||||
logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
|
||||
stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
|
||||
|
||||
logits_processor = model._get_logits_processor(
|
||||
generation_config=generation_config,
|
||||
input_ids_seq_length=input_ids_seq_length,
|
||||
encoder_input_ids=input_ids,
|
||||
prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
|
||||
logits_processor=logits_processor,
|
||||
)
|
||||
|
||||
stopping_criteria = model._get_stopping_criteria(
|
||||
generation_config=generation_config, stopping_criteria=stopping_criteria
|
||||
)
|
||||
logits_warper = model._get_logits_warper(generation_config)
|
||||
|
||||
unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
|
||||
scores = None
|
||||
while True:
|
||||
model_inputs = model.prepare_inputs_for_generation(input_ids, **model_kwargs)
|
||||
# forward pass to get next token
|
||||
outputs = model(
|
||||
**model_inputs,
|
||||
return_dict=True,
|
||||
output_attentions=False,
|
||||
output_hidden_states=False,
|
||||
)
|
||||
|
||||
next_token_logits = outputs.logits[:, -1, :]
|
||||
|
||||
# pre-process distribution
|
||||
next_token_scores = logits_processor(input_ids, next_token_logits)
|
||||
next_token_scores = logits_warper(input_ids, next_token_scores)
|
||||
|
||||
# sample
|
||||
probs = nn.functional.softmax(next_token_scores, dim=-1)
|
||||
if generation_config.do_sample:
|
||||
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
|
||||
else:
|
||||
next_tokens = torch.argmax(probs, dim=-1)
|
||||
|
||||
# update generated ids, model inputs, and length for next step
|
||||
input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
|
||||
model_kwargs = model._update_model_kwargs_for_generation(outputs, model_kwargs, is_encoder_decoder=False)
|
||||
unfinished_sequences = unfinished_sequences.mul((min(next_tokens != i for i in eos_token_id)).long())
|
||||
|
||||
output_token_ids = input_ids[0].cpu().tolist()
|
||||
output_token_ids = output_token_ids[input_length:]
|
||||
for each_eos_token_id in eos_token_id:
|
||||
if output_token_ids[-1] == each_eos_token_id:
|
||||
output_token_ids = output_token_ids[:-1]
|
||||
response = tokenizer.decode(output_token_ids)
|
||||
|
||||
yield response
|
||||
# stop when each sentence is finished, or if we exceed the maximum length
|
||||
if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
|
||||
break
|
||||
|
||||
|
||||
def on_btn_click():
|
||||
del st.session_state.messages
|
||||
|
||||
|
||||
@st.cache_resource
|
||||
def load_model():
|
||||
model = (
|
||||
AutoModelForCausalLM.from_pretrained("model", trust_remote_code=True)
|
||||
.to(torch.bfloat16)
|
||||
.cuda()
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("model", trust_remote_code=True)
|
||||
return model, tokenizer
|
||||
|
||||
|
||||
def prepare_generation_config():
|
||||
with st.sidebar:
|
||||
# Use Streamlit's markdown helper to add Markdown text
|
||||
st.image('assets/aiwei_logo.jpg', width=1, caption='EmoLLM-aiwei AI Logo', use_column_width=True)
|
||||
st.markdown("[访问 EmoLLM 官方repo](https://github.com/aJupyter/EmoLLM)")
|
||||
|
||||
max_length = st.slider("Max Length", min_value=8, max_value=32768, value=32768)
|
||||
top_p = st.slider("Top P", 0.0, 1.0, 0.8, step=0.01)
|
||||
temperature = st.slider("Temperature", 0.0, 1.0, 0.7, step=0.01)
|
||||
st.button("Clear Chat History", on_click=on_btn_click)
|
||||
|
||||
generation_config = GenerationConfig(max_length=max_length, top_p=top_p, temperature=temperature)
|
||||
|
||||
return generation_config
|
||||
|
||||
|
||||
user_prompt = "<|im_start|>user\n{user}<|im_end|>\n"
|
||||
robot_prompt = "<|im_start|>assistant\n{robot}<|im_end|>\n"
|
||||
cur_query_prompt = "<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n"
|
||||
|
||||
|
||||
def combine_history(prompt):
|
||||
messages = st.session_state.messages
|
||||
meta_instruction = (
|
||||
"你是一个拥有丰富心理学知识的温柔邻家温柔大姐姐艾薇,我有一些心理问题,请你用专业的知识和温柔、可爱、俏皮、的口吻帮我解决,回复中可以穿插一些可爱的Emoji表情符号或者文本符号。\n"
|
||||
)
|
||||
total_prompt = f"<s><|im_start|>system\n{meta_instruction}<|im_end|>\n"
|
||||
for message in messages:
|
||||
cur_content = message["content"]
|
||||
if message["role"] == "user":
|
||||
cur_prompt = user_prompt.format(user=cur_content)
|
||||
elif message["role"] == "robot":
|
||||
cur_prompt = robot_prompt.format(robot=cur_content)
|
||||
else:
|
||||
raise RuntimeError
|
||||
total_prompt += cur_prompt
|
||||
total_prompt = total_prompt + cur_query_prompt.format(user=prompt)
|
||||
return total_prompt
|
||||
|
||||
|
||||
def main():
|
||||
# torch.cuda.empty_cache()
|
||||
print("load model begin.")
|
||||
model, tokenizer = load_model()
|
||||
print("load model end.")
|
||||
|
||||
user_avator = "assets/user.png"
|
||||
robot_avator = "assets/robot.jpeg"
|
||||
|
||||
st.title("EmoLLM-温柔御姐艾薇(aiwei)")
|
||||
|
||||
generation_config = prepare_generation_config()
|
||||
|
||||
# Initialize chat history
|
||||
if "messages" not in st.session_state:
|
||||
st.session_state.messages = []
|
||||
|
||||
# Display chat messages from history on app rerun
|
||||
for message in st.session_state.messages:
|
||||
with st.chat_message(message["role"], avatar=message.get("avatar")):
|
||||
st.markdown(message["content"])
|
||||
|
||||
# Accept user input
|
||||
if prompt := st.chat_input("What is up?"):
|
||||
# Display user message in chat message container
|
||||
with st.chat_message("user", avatar=user_avator):
|
||||
st.markdown(prompt)
|
||||
real_prompt = combine_history(prompt)
|
||||
# Add user message to chat history
|
||||
st.session_state.messages.append({"role": "user", "content": prompt, "avatar": user_avator})
|
||||
|
||||
with st.chat_message("robot", avatar=robot_avator):
|
||||
message_placeholder = st.empty()
|
||||
for cur_response in generate_interactive(
|
||||
model=model,
|
||||
tokenizer=tokenizer,
|
||||
prompt=real_prompt,
|
||||
additional_eos_token_id=92542,
|
||||
**asdict(generation_config),
|
||||
):
|
||||
# Display robot response in chat message container
|
||||
message_placeholder.markdown(cur_response + "▌")
|
||||
message_placeholder.markdown(cur_response) # pylint: disable=undefined-loop-variable
|
||||
# Add robot response to chat history
|
||||
st.session_state.messages.append(
|
||||
{
|
||||
"role": "robot",
|
||||
"content": cur_response, # pylint: disable=undefined-loop-variable
|
||||
"avatar": robot_avator,
|
||||
}
|
||||
)
|
||||
torch.cuda.empty_cache()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
231
xtuner_config/ChatGLM3-6b-ft_EN.md
Normal file
231
xtuner_config/ChatGLM3-6b-ft_EN.md
Normal file
@ -0,0 +1,231 @@
|
||||
# ChatGLM3-6B
|
||||
|
||||
## Environment Preparation
|
||||
|
||||
In practice, we have two platforms available for selection.
|
||||
|
||||
* Rent a machine with a 3090 GPU and 24G memory on the [autodl](https://www.autodl.com/) platform. Select the image as shown: `PyTorch` --> `2.0.0` --> `3.8(ubuntu20.04)` --> `11.8`.
|
||||

|
||||
* On the [InternStudio](https://studio.intern-ai.org.cn/) platform, choose the configuration of A100(1/4). Select the image as shown: `Cuda11.7-conda`.
|
||||

|
||||
|
||||
In the Terminal, update pip and install dependencies.
|
||||
|
||||
```shell
|
||||
# Upgrade pip
|
||||
python -m pip install --upgrade pip
|
||||
|
||||
pip install modelscope==1.9.5
|
||||
pip install transformers==4.35.2
|
||||
pip install streamlit==1.24.0
|
||||
pip install sentencepiece==0.1.99
|
||||
pip install accelerate==0.24.1
|
||||
pip install peft==0.4.0
|
||||
pip install datasets==2.10.1
|
||||
```
|
||||
|
||||
## Download Models
|
||||
|
||||
Use the `modelscope` function `snapshot_download` to download the model. The first parameter is the model name, and the parameter `cache_dir` is the download path of the model.
|
||||
|
||||
Create a `download.py` file in the `/root/autodl-tmp` directory and enter the following content. After pasting the code, remember to save the file. Run `python /root/autodl-tmp/download.py` to execute the download. The model is about 14 GB, so the download takes roughly 10–20 minutes.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from modelscope import snapshot_download, AutoModel, AutoTokenizer
|
||||
import os
|
||||
model_dir = snapshot_download('ZhipuAI/chatglm3-6b', cache_dir='/root/autodl-tmp', revision='master')
|
||||
```
|
||||
|
||||
For more information about ChatGLM and other LLM walkthroughs, please refer to [self-llm](https://github.com/datawhalechina/self-llm)
|
||||
|
||||
## Construction of Instruction Set
|
||||
|
||||
Fine-tuning of LLM generally refers to the process of instruction tuning. Instruction tuning means that the fine-tuning data we use is in the form of:
|
||||
|
||||
```json
|
||||
{
|
||||
"system":"Answer the following user question and only output the answer.",
|
||||
"input":"1+1 equals what?",
|
||||
"output":"2"
|
||||
}
|
||||
```
|
||||
|
||||
Among them, `system` is the user's instruction, informing the model of the task it needs to complete; `input` is the user's input, which is necessary input for completing the user's instruction; `output` is the output the model should give.
|
||||
|
||||
That is, our core training goal is to enable the model to have the ability to understand and follow user instructions. Therefore, when constructing the instruction set, we should build task instruction sets in a targeted manner according to our own tasks and objectives. Our goal is to build a personalized LLM that can simulate a psychological doctor solving psychological problems, so we construct instructions like this:
|
||||
|
||||
```json
|
||||
{
|
||||
"system": "Now you are a psychological expert, I have some psychological problems, please help me solve them with professional knowledge.",
|
||||
"input":"I feel like I always procrastinate and have trouble concentrating on completing tasks.",
|
||||
"output":"Procrastination is a common problem, but it can be overcome. First, you can try to set yourself a clear goal and break it down into smaller tasks. This way, you will find the task easier to manage. In addition, maintaining a tidy and orderly work environment also helps improve work efficiency. Finally, when you complete a task, don't forget to give yourself some rewards to motivate yourself to keep moving forward."
|
||||
}
|
||||
```
|
||||
|
||||
## Data Formatting
|
||||
|
||||
Data for `Lora` training needs to be formatted and encoded before being input into the model for training. As those familiar with the training process of `Pytorch` models know, we usually need to encode the input text as `input_ids` and the output text as `labels`, and the results of encoding are multi-dimensional vectors. We first define a preprocessing function that encodes the input and output text for each sample and returns an encoded dictionary:
|
||||
|
||||
```python
|
||||
def process_func(example):
|
||||
MAX_LENGTH = 512
|
||||
input_ids, labels = [], []
|
||||
instruction = tokenizer.encode(text="\n".join(["<|system|>", "Now you are a psychological expert, I have some psychological problems, please help me solve them with your professional knowledge.", "<|user|>",
|
||||
example["system"] + example["input"] + "<|assistant|>"]).strip() + "\n",
|
||||
add_special_tokens=True, truncation=True, max_length=MAX_LENGTH)
|
||||
|
||||
response = tokenizer.encode(text=example["output"], add_special_tokens=False, truncation=True,
|
||||
max_length=MAX_LENGTH)
|
||||
|
||||
input_ids = instruction + response + [tokenizer.eos_token_id]
|
||||
labels = [tokenizer.pad_token_id] * len(instruction) + response + [tokenizer.eos_token_id]
|
||||
pad_len = MAX_LENGTH - len(input_ids)
|
||||
input_ids += [tokenizer.pad_token_id] * pad_len
|
||||
labels += [tokenizer.pad_token_id] * pad_len
|
||||
labels = [(l if l != tokenizer.pad_token_id else -100) for l in labels]
|
||||
|
||||
return {
|
||||
"input_ids": input_ids,
|
||||
"labels": labels
|
||||
}
|
||||
```
|
||||
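To apply this function to the whole instruction set, you would typically map it over a Hugging Face `Dataset`. Below is a minimal sketch (the JSON path is an assumption; adjust it to your data file) that produces the `tokenized_id` dataset used by the `Trainer` later in this guide:

```python
import pandas as pd
from datasets import Dataset

# Assumed location of the instruction data; change to your own file.
df = pd.read_json('./data/merge.json')
ds = Dataset.from_pandas(df)

# Encode every sample with process_func and drop the raw text columns.
tokenized_id = ds.map(process_func, remove_columns=ds.column_names)
```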
|
||||
After formatting, each piece of data sent into the model is a dictionary containing two key-value pairs: `input_ids` and `labels`. `input_ids` is the encoding of the input text, and `labels` is the encoding of the output text. The decoded result should appear as follows:
|
||||
|
||||
```text
|
||||
[gMASK]sop <|system|>
|
||||
Now you are a psychological expert, I have some psychological problems, please help me solve them with your professional knowledge.
|
||||
<|user|>
|
||||
My team atmosphere is great, and all my colleagues are very friendly. Moreover, we often go out together to play, feeling like a big family.\n <|assistant|>
|
||||
This is a great working environment, and having good interpersonal relationships and teamwork can indeed bring a lot of happiness. However, I also understand that you may encounter some challenges in your work, such as task pressure or conflicts with colleagues. Have you ever thought about how to deal with these issues?
|
||||
```
|
||||
|
||||
Why is it in this form? That's a great question! Different models have different formatting requirements for their inputs, so we need to refer to the base model's training source code to check the expected format. Since LoRA fine-tuning works best when it follows the original model's instruction format, we keep the original model's input format. Here are the links to the source code for those who want to explore it on their own:
|
||||
|
||||
[hugging face ChatGLM3 repository](https://github.com/THUDM/ChatGLM3/blob/main/finetune_chatmodel_demo/preprocess_utils.py): The `InputOutputDataset` class can be found here.
|
||||
Additionally, you can refer to this repository for data processing of ChatGLM [LLaMA-Factory](https://github.com/KMnO4-zx/LLaMA-Factory/blob/main/src/llmtuner/data/template.py).
|
||||
|
||||
## Loading the tokenizer and half-precision model
|
||||
|
||||
The model is loaded in half precision. If you have a newer graphics card, you can load it in `torch.bfloat16` instead. For custom models, always set the `trust_remote_code` parameter to `True`.
|
||||
|
||||
```python
|
||||
tokenizer = AutoTokenizer.from_pretrained('./model/chatglm3-6b', use_fast=False, trust_remote_code=True)
|
||||
|
||||
# The model is loaded in half-precision format. If you have a relatively new GPU, you can load it in torch.bfloat format.
|
||||
model = AutoModelForCausalLM.from_pretrained('./model/chatglm3-6b', trust_remote_code=True, torch_dtype=torch.half, device_map="auto")
|
||||
```
|
||||
|
||||
## Defining LoraConfig
|
||||
|
||||
The `LoraConfig` class allows you to set many parameters, but there are only a few main ones. I'll briefly explain them; those interested can directly refer to the source code.
|
||||
|
||||
- `task_type`: the task type; here causal language modeling (`CAUSAL_LM`).
|
||||
- `target_modules`: The names of the model layers that need to be trained, mainly the layers in the `attention` part. The names of these layers differ for different models. They can be passed as an array, a string, or a regular expression.
|
||||
- `r`: the LoRA rank; see the LoRA paper for how it affects the low-rank decomposition.
|
||||
- `lora_alpha`: the LoRA alpha scaling factor; see the LoRA paper for its exact role.
|
||||
- `modules_to_save`: modules that, in addition to the LoRA-injected layers, are trained in full and saved.
|
||||
|
||||
What is this LoRA scaling about? It is not `r` (the rank). The scaling factor is `lora_alpha / r`, so with `lora_alpha=32` and `r=8` in this `LoraConfig` the scaling is 4.
|
||||
|
||||
This scaling does not change the size of the LoRA parameters; it simply rescales the low-rank update by a constant factor before it is added to the frozen weights.
|
||||
|
||||
```python
|
||||
config = LoraConfig(
|
||||
task_type=TaskType.CAUSAL_LM,
|
||||
target_modules=["query_key_value"],
|
||||
inference_mode=False, # training mode
|
||||
r=8, # Lora Rank
|
||||
lora_alpha=32, # LoRA alpha; for details see the LoRA paper. Effective scaling = lora_alpha / r = 4
|
||||
lora_dropout=0.1 # Dropout ratio
|
||||
)
|
||||
```
|
||||
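As a concrete check of the scaling discussed above, here is a small sketch (the layer shapes are hypothetical) of the effective weight update LoRA applies, showing that `lora_alpha / r` only rescales the low-rank update:

```python
import torch

r, lora_alpha = 8, 32
scaling = lora_alpha / r          # = 4.0, matching the text above

# Hypothetical shapes for one adapted projection layer.
d_out, d_in = 4096, 4096
W = torch.zeros(d_out, d_in)      # frozen base weight
A = torch.randn(r, d_in) * 0.01   # LoRA "down" matrix
B = torch.zeros(d_out, r)         # LoRA "up" matrix (zero-initialised)

# Effective weight at inference time: the update is rescaled, never resized.
W_effective = W + scaling * (B @ A)
```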
|
||||
## Customizing TrainingArguments Parameters
|
||||
|
||||
The source code of the `TrainingArguments` class also introduces the specific functions of each parameter. Of course, everyone is encouraged to explore it on their own, but I'll mention a few commonly used ones here.
|
||||
|
||||
- `output_dir`: The output path for the model.
|
||||
- `per_device_train_batch_size`: As the name suggests, `batch_size`.
|
||||
- `gradient_accumulation_steps`: Gradient accumulation. If you have a smaller GPU memory, you can set a smaller `batch_size` and increase the gradient accumulation.
|
||||
- `logging_steps`: How many steps to output a `log`.
|
||||
- `num_train_epochs`: As the name suggests, `epoch`.
|
||||
- `gradient_checkpointing`: gradient checkpointing. Once enabled, the model must call `model.enable_input_require_grads()` (see the sketch after the next code block); the details are left for you to explore.
|
||||
|
||||
```python
|
||||
# Following the GLM source repository, we use a Seq2Seq-style data collator here.
|
||||
|
||||
data_collator = DataCollatorForSeq2Seq(
|
||||
tokenizer,
|
||||
model=model,
|
||||
label_pad_token_id=-100,
|
||||
pad_to_multiple_of=None,
|
||||
padding=False
|
||||
)
|
||||
|
||||
args = TrainingArguments(
|
||||
output_dir="./output/ChatGLM",
|
||||
per_device_train_batch_size=4,
|
||||
gradient_accumulation_steps=2,
|
||||
logging_steps=10,
|
||||
num_train_epochs=3,
|
||||
gradient_checkpointing=True,
|
||||
save_steps=100,
|
||||
learning_rate=1e-4,
|
||||
)
|
||||
```
|
||||
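Because `gradient_checkpointing=True` is set above, remember the call mentioned in the parameter list; a one-line sketch of where it goes, before the `Trainer` is constructed:

```python
# Required whenever gradient checkpointing is enabled.
model.enable_input_require_grads()
```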
|
||||
### Training with Trainer
|
||||
|
||||
Pass in the model, the training arguments defined above, and the tokenized dataset, then start training!
|
||||
|
||||
```python
|
||||
trainer = Trainer(
|
||||
model=model,
|
||||
args=args,
|
||||
train_dataset=tokenized_id,
|
||||
data_collator=data_collator,
|
||||
)
|
||||
trainer.train()
|
||||
```
|
||||
|
||||
## Inference
|
||||
|
||||
You can use the classic `generate`-based approach for inference.
|
||||
|
||||
```python
|
||||
while True:
|
||||
# inference
|
||||
model = model.cuda()
|
||||
input_text = input("User >>>")
|
||||
ipt = tokenizer("<|system|>\nNow you are a mental health expert, and I have some psychological issues. Please use your professional knowledge to help me solve them.\n<|user|>\n {}\n{}".format(input_text, "").strip() + "<|assistant|>\n", return_tensors="pt").to(model.device)
|
||||
print(tokenizer.decode(model.generate(**ipt, max_length=128, do_sample=True)[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Reloading
|
||||
|
||||
Models fine-tuned with PEFT can be reloaded for inference as follows:
|
||||
|
||||
- Load the source model and tokenizer;
|
||||
- Use `PeftModel` to merge the source model with the parameters fine-tuned by PEFT.
|
||||
|
||||
|
||||
```python
|
||||
from peft import PeftModel
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("./model/chatglm3-6b", trust_remote_code=True, low_cpu_mem_usage=True)
|
||||
tokenizer = AutoTokenizer.from_pretrained("./model/chatglm3-6b", use_fast=False, trust_remote_code=True)
|
||||
|
||||
# Load the LoRa weights obtained from training.
|
||||
p_model = PeftModel.from_pretrained(model, model_id="./output/ChatGLM/checkpoint-1000/")
|
||||
|
||||
|
||||
while True:
|
||||
# inference
|
||||
model = model.cuda()
|
||||
input_text = input("User >>>")
|
||||
ipt = tokenizer("<|system|>\nNow you are a mental health expert, and I have some psychological issues. Please use your professional knowledge to help me solve them.\n<|user|>\n {}\n{}".format(input_text, "").strip() + "<|assistant|>\n", return_tensors="pt").to(model.device)
|
||||
print(tokenizer.decode(model.generate(**ipt, max_length=128, do_sample=True)[0], skip_special_tokens=True))
|
||||
|
||||
```
|
87
xtuner_config/README_EN.md
Normal file
87
xtuner_config/README_EN.md
Normal file
@ -0,0 +1,87 @@
|
||||
# Fine-Tuning Guide
|
||||
|
||||
- This project has been fine-tuned not only on mental health datasets but also on self-cognition data; the detailed fine-tuning guide follows.
|
||||
|
||||
## I. Fine-Tuning Based on Xtuner 🎉🎉🎉🎉🎉
|
||||
|
||||
### Environment Setup
|
||||
|
||||
```markdown
|
||||
datasets==2.16.1
|
||||
deepspeed==0.13.1
|
||||
einops==0.7.0
|
||||
flash_attn==2.5.0
|
||||
mmengine==0.10.2
|
||||
openxlab==0.0.34
|
||||
peft==0.7.1
|
||||
sentencepiece==0.1.99
|
||||
torch==2.1.2
|
||||
transformers==4.36.2
|
||||
xtuner==0.1.11
|
||||
```
|
||||
|
||||
You can also install them all at once by
|
||||
|
||||
```bash
|
||||
cd xtuner_config/
|
||||
pip3 install -r requirements.txt
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Fine-Tuning
|
||||
|
||||
```bash
|
||||
cd xtuner_config/
|
||||
xtuner train internlm2_7b_chat_qlora_e3.py --deepspeed deepspeed_zero2
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Convert the Obtained PTH Model to a HuggingFace Model
|
||||
|
||||
**That is: Generate the Adapter folder**
|
||||
|
||||
```bash
|
||||
cd xtuner_config/
|
||||
mkdir hf
|
||||
export MKL_SERVICE_FORCE_INTEL=1
|
||||
|
||||
xtuner convert pth_to_hf internlm2_7b_chat_qlora_e3.py ./work_dirs/internlm_chat_7b_qlora_oasst1_e3_copy/epoch_3.pth ./hf
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Merge the HuggingFace Adapter with the Large Language Model
|
||||
|
||||
```bash
|
||||
xtuner convert merge ./internlm2-chat-7b ./hf ./merged --max-shard-size 2GB
|
||||
# xtuner convert merge \
|
||||
# ${NAME_OR_PATH_TO_LLM} \
|
||||
# ${NAME_OR_PATH_TO_ADAPTER} \
|
||||
# ${SAVE_PATH} \
|
||||
# --max-shard-size 2GB
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Testing
|
||||
|
||||
```bash
|
||||
cd demo/
|
||||
python cli_internlm2.py
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## II. Fine-Tuning Based on Transformers🎉🎉🎉🎉🎉
|
||||
|
||||
- Please refer to the [ChatGLM3-6b lora fine-tuning guide](ChatGLM3-6b-ft.md).
|
||||
|
||||
---
|
||||
|
||||
## Other
|
||||
|
||||
Feel free to give [xtuner](https://github.com/InternLM/xtuner) and [EmoLLM](https://github.com/aJupyter/EmoLLM) a star~
|
||||
|
||||
🎉🎉🎉🎉🎉
|
217
xtuner_config/airen-internlm2_chat_7b_qlora.py
Normal file
217
xtuner_config/airen-internlm2_chat_7b_qlora.py
Normal file
@ -0,0 +1,217 @@
|
||||
# Copyright (c) OpenMMLab. All rights reserved.
|
||||
import torch
|
||||
from datasets import load_dataset
|
||||
from mmengine.dataset import DefaultSampler
|
||||
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
|
||||
LoggerHook, ParamSchedulerHook)
|
||||
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
|
||||
from peft import LoraConfig
|
||||
from torch.optim import AdamW
|
||||
from transformers import (AutoModelForCausalLM, AutoTokenizer,
|
||||
BitsAndBytesConfig)
|
||||
|
||||
from xtuner.dataset import process_hf_dataset
|
||||
from xtuner.dataset.collate_fns import default_collate_fn
|
||||
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
|
||||
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
|
||||
VarlenAttnArgsToMessageHubHook)
|
||||
from xtuner.engine.runner import TrainLoop
|
||||
from xtuner.model import SupervisedFinetune
|
||||
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
|
||||
|
||||
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
|
||||
#######################################################################
|
||||
# PART 1 Settings #
|
||||
#######################################################################
|
||||
# Model
|
||||
pretrained_model_name_or_path = '/root/share/model_repos/internlm2-chat-7b'
|
||||
use_varlen_attn = False
|
||||
|
||||
# Data
|
||||
data_path = './tiangou.json'
|
||||
prompt_template = PROMPT_TEMPLATE.internlm2_chat
|
||||
max_length = 2048
|
||||
pack_to_max_length = True
|
||||
|
||||
# Scheduler & Optimizer
|
||||
batch_size = 16 # per_device
|
||||
accumulative_counts = 1
|
||||
dataloader_num_workers = 0
|
||||
max_epochs = 3
|
||||
optim_type = AdamW
|
||||
lr = 1e-5
|
||||
betas = (0.9, 0.999)
|
||||
weight_decay = 0.0001
|
||||
max_norm = 1 # grad clip
|
||||
warmup_ratio = 0.03
|
||||
|
||||
# Save
|
||||
save_steps = 100
|
||||
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
|
||||
|
||||
# Evaluate the generation performance during the training
|
||||
evaluation_freq = 100
|
||||
SYSTEM = "现在你是一个拥有丰富心理学知识的舔狗艾仁医生,我有一些心理问题,请你用专业的知识和无条件付出、讨好、过度关心我、近乎病态的想得到我的认可的口吻帮我解决,回答中可以包含一些可爱的Emoji表情符号或者文本符号。\n"
|
||||
evaluation_inputs = [
|
||||
'我最近总是感到很焦虑,尤其是在学业上。我有个特别崇拜的同学,他好像在各方面都比我优秀,我总觉得自己怎么努力也追不上他,这让我压力特别大。', '我知道应该理性看待,但就是忍不住会去比较。我甚至晚上会因为这个睡不着觉,总想着怎样才能像他那样出色。'
|
||||
]
|
||||
|
||||
#######################################################################
|
||||
# PART 2 Model & Tokenizer #
|
||||
#######################################################################
|
||||
tokenizer = dict(
|
||||
type=AutoTokenizer.from_pretrained,
|
||||
pretrained_model_name_or_path=pretrained_model_name_or_path,
|
||||
trust_remote_code=True,
|
||||
padding_side='right')
|
||||
|
||||
model = dict(
|
||||
type=SupervisedFinetune,
|
||||
use_varlen_attn=use_varlen_attn,
|
||||
llm=dict(
|
||||
type=AutoModelForCausalLM.from_pretrained,
|
||||
pretrained_model_name_or_path=pretrained_model_name_or_path,
|
||||
trust_remote_code=True,
|
||||
torch_dtype=torch.float16,
|
||||
quantization_config=dict(
|
||||
type=BitsAndBytesConfig,
|
||||
load_in_4bit=True,
|
||||
load_in_8bit=False,
|
||||
llm_int8_threshold=6.0,
|
||||
llm_int8_has_fp16_weight=False,
|
||||
bnb_4bit_compute_dtype=torch.float16,
|
||||
bnb_4bit_use_double_quant=True,
|
||||
bnb_4bit_quant_type='nf4')),
|
||||
lora=dict(
|
||||
type=LoraConfig,
|
||||
r=64,
|
||||
lora_alpha=16,
|
||||
lora_dropout=0.1,
|
||||
bias='none',
|
||||
task_type='CAUSAL_LM'))
|
||||
|
||||
#######################################################################
|
||||
# PART 3 Dataset & Dataloader #
|
||||
#######################################################################
|
||||
alpaca_en = dict(
|
||||
type=process_hf_dataset,
|
||||
#dataset=dict(type=load_dataset, path=alpaca_en_path),
|
||||
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
|
||||
tokenizer=tokenizer,
|
||||
max_length=max_length,
|
||||
dataset_map_fn=None,
|
||||
template_map_fn=dict(
|
||||
type=template_map_fn_factory, template=prompt_template),
|
||||
remove_unused_columns=True,
|
||||
shuffle_before_pack=True,
|
||||
pack_to_max_length=pack_to_max_length,
|
||||
use_varlen_attn=use_varlen_attn)
|
||||
|
||||
train_dataloader = dict(
|
||||
batch_size=batch_size,
|
||||
num_workers=dataloader_num_workers,
|
||||
dataset=alpaca_en,
|
||||
sampler=dict(type=DefaultSampler, shuffle=True),
|
||||
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
|
||||
|
||||
#######################################################################
|
||||
# PART 4 Scheduler & Optimizer #
|
||||
#######################################################################
|
||||
# optimizer
|
||||
optim_wrapper = dict(
|
||||
type=AmpOptimWrapper,
|
||||
optimizer=dict(
|
||||
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
|
||||
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
|
||||
accumulative_counts=accumulative_counts,
|
||||
loss_scale='dynamic',
|
||||
dtype='float16')
|
||||
|
||||
# learning policy
|
||||
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
|
||||
param_scheduler = [
|
||||
dict(
|
||||
type=LinearLR,
|
||||
start_factor=1e-5,
|
||||
by_epoch=True,
|
||||
begin=0,
|
||||
end=warmup_ratio * max_epochs,
|
||||
convert_to_iter_based=True),
|
||||
dict(
|
||||
type=CosineAnnealingLR,
|
||||
eta_min=0.0,
|
||||
by_epoch=True,
|
||||
begin=warmup_ratio * max_epochs,
|
||||
end=max_epochs,
|
||||
convert_to_iter_based=True)
|
||||
]
|
||||
|
||||
# train, val, test setting
|
||||
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
|
||||
|
||||
#######################################################################
|
||||
# PART 5 Runtime #
|
||||
#######################################################################
|
||||
# Log the dialogue periodically during the training process, optional
|
||||
custom_hooks = [
|
||||
dict(type=DatasetInfoHook, tokenizer=tokenizer),
|
||||
dict(
|
||||
type=EvaluateChatHook,
|
||||
tokenizer=tokenizer,
|
||||
every_n_iters=evaluation_freq,
|
||||
evaluation_inputs=evaluation_inputs,
|
||||
system=SYSTEM,
|
||||
prompt_template=prompt_template)
|
||||
]
|
||||
|
||||
if use_varlen_attn:
|
||||
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
|
||||
|
||||
# configure default hooks
|
||||
default_hooks = dict(
|
||||
# record the time of every iteration.
|
||||
timer=dict(type=IterTimerHook),
|
||||
# print log every 10 iterations.
|
||||
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
|
||||
# enable the parameter scheduler.
|
||||
param_scheduler=dict(type=ParamSchedulerHook),
|
||||
# save checkpoint per `save_steps`.
|
||||
checkpoint=dict(
|
||||
type=CheckpointHook,
|
||||
by_epoch=False,
|
||||
interval=save_steps,
|
||||
max_keep_ckpts=save_total_limit),
|
||||
# set sampler seed in distributed environment.
|
||||
sampler_seed=dict(type=DistSamplerSeedHook),
|
||||
)
|
||||
|
||||
# configure environment
|
||||
env_cfg = dict(
|
||||
# whether to enable cudnn benchmark
|
||||
cudnn_benchmark=False,
|
||||
# set multi process parameters
|
||||
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
|
||||
# set distributed parameters
|
||||
dist_cfg=dict(backend='nccl'),
|
||||
)
|
||||
|
||||
# set visualizer
|
||||
visualizer = dict(
|
||||
type=Visualizer,
|
||||
vis_backends=[dict(type=WandbVisBackend)]
|
||||
)
|
||||
|
||||
# set log level
|
||||
log_level = 'INFO'
|
||||
|
||||
# load from which checkpoint
|
||||
load_from = None
|
||||
|
||||
# whether to resume training from the loaded checkpoint
|
||||
resume = True
|
||||
|
||||
# Defaults to use random seed and disable `deterministic`
|
||||
randomness = dict(seed=None, deterministic=False)
|
||||
|
||||
# set log processor
|
||||
log_processor = dict(by_epoch=False)
|
218
xtuner_config/aiwei-internlm2_chat_7b_qlora.py
Normal file
218
xtuner_config/aiwei-internlm2_chat_7b_qlora.py
Normal file
@ -0,0 +1,218 @@
|
||||
# Copyright (c) OpenMMLab. All rights reserved.
|
||||
import torch
|
||||
from datasets import load_dataset
|
||||
from mmengine.dataset import DefaultSampler
|
||||
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
|
||||
LoggerHook, ParamSchedulerHook)
|
||||
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
|
||||
from peft import LoraConfig
|
||||
from torch.optim import AdamW
|
||||
from transformers import (AutoModelForCausalLM, AutoTokenizer,
|
||||
BitsAndBytesConfig)
|
||||
|
||||
from xtuner.dataset import process_hf_dataset
|
||||
from xtuner.dataset.collate_fns import default_collate_fn
|
||||
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
|
||||
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
|
||||
VarlenAttnArgsToMessageHubHook)
|
||||
from xtuner.engine.runner import TrainLoop
|
||||
from xtuner.model import SupervisedFinetune
|
||||
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
|
||||
|
||||
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
|
||||
|
||||
#######################################################################
|
||||
# PART 1 Settings #
|
||||
#######################################################################
|
||||
# Model
|
||||
pretrained_model_name_or_path = '/root/share/model_repos/internlm2-chat-7b'
|
||||
# /root/share/model_repos/internlm2-chat-7b
|
||||
use_varlen_attn = False
|
||||
|
||||
# Data
|
||||
data_path = './aiwei.json'
|
||||
prompt_template = PROMPT_TEMPLATE.internlm2_chat
|
||||
max_length = 2048
|
||||
pack_to_max_length = True
|
||||
|
||||
# Scheduler & Optimizer
|
||||
batch_size = 16 # per_device
|
||||
accumulative_counts = 1
|
||||
dataloader_num_workers = 0
|
||||
max_epochs = 5
|
||||
optim_type = AdamW
|
||||
lr = 1e-5
|
||||
betas = (0.9, 0.999)
|
||||
weight_decay = 0.0001
|
||||
max_norm = 1 # grad clip
|
||||
warmup_ratio = 0.03
|
||||
|
||||
# Save
|
||||
save_steps = 100
|
||||
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
|
||||
|
||||
# Evaluate the generation performance during the training
|
||||
evaluation_freq = 100
|
||||
SYSTEM = "现在你是一个拥有丰富心理学知识的温柔御姐艾薇医生,我有一些心理问题,请你用专业的知识和温柔的口吻帮我解决,可以生成一些可爱的Emoji表情符号或者文本符号。"
|
||||
evaluation_inputs = [
|
||||
'我最近总是感到很焦虑,尤其是在学业上。我有个特别崇拜的同学,他好像在各方面都比我优秀,我总觉得自己怎么努力也追不上他,这让我压力特别大。', '我知道应该理性看待,但就是忍不住会去比较。我甚至晚上会因为这个睡不着觉,总想着怎样才能像他那样出色。'
|
||||
]
|
||||
|
||||
#######################################################################
|
||||
# PART 2 Model & Tokenizer #
|
||||
#######################################################################
|
||||
tokenizer = dict(
|
||||
type=AutoTokenizer.from_pretrained,
|
||||
pretrained_model_name_or_path=pretrained_model_name_or_path,
|
||||
trust_remote_code=True,
|
||||
padding_side='right')
|
||||
|
||||
model = dict(
|
||||
type=SupervisedFinetune,
|
||||
use_varlen_attn=use_varlen_attn,
|
||||
llm=dict(
|
||||
type=AutoModelForCausalLM.from_pretrained,
|
||||
pretrained_model_name_or_path=pretrained_model_name_or_path,
|
||||
trust_remote_code=True,
|
||||
torch_dtype=torch.float16,
|
||||
quantization_config=dict(
|
||||
type=BitsAndBytesConfig,
|
||||
load_in_4bit=True,
|
||||
load_in_8bit=False,
|
||||
llm_int8_threshold=6.0,
|
||||
llm_int8_has_fp16_weight=False,
|
||||
bnb_4bit_compute_dtype=torch.float16,
|
||||
bnb_4bit_use_double_quant=True,
|
||||
bnb_4bit_quant_type='nf4')),
|
||||
lora=dict(
|
||||
type=LoraConfig,
|
||||
r=64,
|
||||
lora_alpha=16,
|
||||
lora_dropout=0.1,
|
||||
bias='none',
|
||||
task_type='CAUSAL_LM'))
|
||||
|
||||
#######################################################################
|
||||
# PART 3 Dataset & Dataloader #
|
||||
#######################################################################
|
||||
alpaca_en = dict(
|
||||
type=process_hf_dataset,
|
||||
dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
|
||||
tokenizer=tokenizer,
|
||||
max_length=max_length,
|
||||
dataset_map_fn=None,
|
||||
template_map_fn=dict(
|
||||
type=template_map_fn_factory, template=prompt_template),
|
||||
remove_unused_columns=True,
|
||||
shuffle_before_pack=True,
|
||||
pack_to_max_length=pack_to_max_length,
|
||||
use_varlen_attn=use_varlen_attn)
|
||||
|
||||
train_dataloader = dict(
|
||||
batch_size=batch_size,
|
||||
num_workers=dataloader_num_workers,
|
||||
dataset=alpaca_en,
|
||||
sampler=dict(type=DefaultSampler, shuffle=True),
|
||||
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
|
||||
|
||||
#######################################################################
|
||||
# PART 4 Scheduler & Optimizer #
|
||||
#######################################################################
|
||||
# optimizer
|
||||
optim_wrapper = dict(
|
||||
type=AmpOptimWrapper,
|
||||
optimizer=dict(
|
||||
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
|
||||
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
|
||||
accumulative_counts=accumulative_counts,
|
||||
loss_scale='dynamic',
|
||||
dtype='float16')
|
||||
|
||||
# learning policy
|
||||
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
|
||||
param_scheduler = [
|
||||
dict(
|
||||
type=LinearLR,
|
||||
start_factor=1e-5,
|
||||
by_epoch=True,
|
||||
begin=0,
|
||||
end=warmup_ratio * max_epochs,
|
||||
convert_to_iter_based=True),
|
||||
dict(
|
||||
type=CosineAnnealingLR,
|
||||
eta_min=0.0,
|
||||
by_epoch=True,
|
||||
begin=warmup_ratio * max_epochs,
|
||||
end=max_epochs,
|
||||
convert_to_iter_based=True)
|
||||
]
|
||||
|
||||
# train, val, test setting
|
||||
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
|
||||
|
||||
#######################################################################
|
||||
# PART 5 Runtime #
|
||||
#######################################################################
|
||||
# Log the dialogue periodically during the training process, optional
|
||||
custom_hooks = [
|
||||
dict(type=DatasetInfoHook, tokenizer=tokenizer),
|
||||
dict(
|
||||
type=EvaluateChatHook,
|
||||
tokenizer=tokenizer,
|
||||
every_n_iters=evaluation_freq,
|
||||
evaluation_inputs=evaluation_inputs,
|
||||
system=SYSTEM,
|
||||
prompt_template=prompt_template)
|
||||
]
|
||||
|
||||
if use_varlen_attn:
|
||||
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
|
||||
|
||||
# configure default hooks
|
||||
default_hooks = dict(
|
||||
# record the time of every iteration.
|
||||
timer=dict(type=IterTimerHook),
|
||||
# print log every 10 iterations.
|
||||
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
|
||||
# enable the parameter scheduler.
|
||||
param_scheduler=dict(type=ParamSchedulerHook),
|
||||
# save checkpoint per `save_steps`.
|
||||
checkpoint=dict(
|
||||
type=CheckpointHook,
|
||||
by_epoch=False,
|
||||
interval=save_steps,
|
||||
max_keep_ckpts=save_total_limit),
|
||||
# set sampler seed in distributed environment.
|
||||
sampler_seed=dict(type=DistSamplerSeedHook),
|
||||
)
|
||||
|
||||
# configure environment
|
||||
env_cfg = dict(
|
||||
# whether to enable cudnn benchmark
|
||||
cudnn_benchmark=False,
|
||||
# set multi process parameters
|
||||
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
|
||||
# set distributed parameters
|
||||
dist_cfg=dict(backend='nccl'),
|
||||
)
|
||||
|
||||
# set visualizer
|
||||
visualizer = dict(
|
||||
type=Visualizer,
|
||||
vis_backends=[dict(type=WandbVisBackend)]
|
||||
)
|
||||
|
||||
# set log level
|
||||
log_level = 'INFO'
|
||||
|
||||
# load from which checkpoint
|
||||
load_from = None
|
||||
|
||||
# whether to resume training from the loaded checkpoint
|
||||
resume = True
|
||||
|
||||
# Defaults to use random seed and disable `deterministic`
|
||||
randomness = dict(seed=None, deterministic=False)
|
||||
|
||||
# set log processor
|
||||
log_processor = dict(by_epoch=False)
|
218
xtuner_config/baichuan2_13b_chat_qlora_alpaca_e3.py
Normal file
218
xtuner_config/baichuan2_13b_chat_qlora_alpaca_e3.py
Normal file
@ -0,0 +1,218 @@
|
||||
# Copyright (c) OpenMMLab. All rights reserved.
|
||||
import torch
|
||||
from datasets import load_dataset
|
||||
from mmengine.dataset import DefaultSampler
|
||||
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
|
||||
LoggerHook, ParamSchedulerHook)
|
||||
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
|
||||
from peft import LoraConfig
|
||||
from torch.optim import AdamW
|
||||
from transformers import (AutoModelForCausalLM, AutoTokenizer,
|
||||
BitsAndBytesConfig)
|
||||
|
||||
from xtuner.dataset import process_hf_dataset
|
||||
from xtuner.dataset.collate_fns import default_collate_fn
|
||||
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
|
||||
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
|
||||
VarlenAttnArgsToMessageHubHook)
|
||||
from xtuner.engine.runner import TrainLoop
|
||||
from xtuner.model import SupervisedFinetune
|
||||
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
|
||||
|
||||
|
||||
from mmengine.visualization import Visualizer,WandbVisBackend, TensorboardVisBackend
|
||||
|
||||
#######################################################################
|
||||
# PART 1 Settings #
|
||||
#######################################################################
|
||||
# Model
|
||||
pretrained_model_name_or_path = '/root/model/baichuan-inc/Baichuan2-13B-Chat'
|
||||
use_varlen_attn = False
|
||||
|
||||
# Data
|
||||
data_path = './merge.json'
|
||||
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
|
||||
max_length = 2048
|
||||
pack_to_max_length = True
|
||||
|
||||
# Scheduler & Optimizer
|
||||
batch_size = 16 # per_device
|
||||
accumulative_counts = 4
|
||||
dataloader_num_workers = 0
|
||||
max_epochs = 3
|
||||
optim_type = AdamW
|
||||
lr = 2e-4
|
||||
betas = (0.9, 0.999)
|
||||
weight_decay = 0
|
||||
max_norm = 1 # grad clip
|
||||
warmup_ratio = 0.03
|
||||
|
||||
# Save
|
||||
save_steps = 100
|
||||
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
|
||||
|
||||
# Evaluate the generation performance during the training
|
||||
evaluation_freq = 100
|
||||
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
|
||||
evaluation_inputs = [
|
||||
'我压力很大', '生活没意思', "非常容易羡慕别人啊"
|
||||
]
|
||||
|
||||
#######################################################################
|
||||
# PART 2 Model & Tokenizer #
|
||||
#######################################################################
|
||||
tokenizer = dict(
|
||||
type=AutoTokenizer.from_pretrained,
|
||||
pretrained_model_name_or_path=pretrained_model_name_or_path,
|
||||
trust_remote_code=True,
|
||||
padding_side='right')
|
||||
|
||||
model = dict(
|
||||
type=SupervisedFinetune,
|
||||
use_varlen_attn=use_varlen_attn,
|
||||
llm=dict(
|
||||
type=AutoModelForCausalLM.from_pretrained,
|
||||
pretrained_model_name_or_path=pretrained_model_name_or_path,
|
||||
trust_remote_code=True,
|
||||
torch_dtype=torch.float16,
|
||||
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

#######################################################################
#                      PART 3  Dataset & Dataloader                  #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                   #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                          #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = dict(
    type=Visualizer,
    vis_backends=[dict(type=WandbVisBackend)]
)

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
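For readers skimming the diff: the `quantization_config` and `lora` blocks above are what make this a QLoRA setup. The sketch below is illustrative only and is not the project's loading code (xtuner builds these objects from the config dict at runtime, and the model path shown is a placeholder); it shows roughly what those two blocks amount to as plain transformers/peft calls.

```python
# Illustrative QLoRA assembly, assuming standard transformers/peft APIs.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (mirrors quantization_config above).
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4')

base = AutoModelForCausalLM.from_pretrained(
    'path/to/base-model',          # placeholder for the config's local model path
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_cfg)

# Low-rank adapters on top of the quantized base (mirrors the lora block above).
peft_model = get_peft_model(
    base,
    LoraConfig(r=64, lora_alpha=16, lora_dropout=0.1,
               bias='none', task_type='CAUSAL_LM'))
peft_model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

In the repository these configs are launched through the xtuner CLI (`xtuner train <config>`), so the snippet is for orientation only.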
205
xtuner_config/chatglm3_6b_lora_alpaca_e3.py
Normal file
@ -0,0 +1,205 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE

#######################################################################
#                          PART 1  Settings                          #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/ZhipuAI/chatglm3-6b'
use_varlen_attn = False

# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True

# Scheduler & Optimizer
batch_size = 20  # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 100
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
    '我压力很大', '生活没意思', "非常容易羡慕别人啊"
]

#######################################################################
#                      PART 2  Model & Tokenizer                     #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    encode_special_tokens=True,
    padding_side='left')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
    ),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

#######################################################################
#                      PART 3  Dataset & Dataloader                  #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                   #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                          #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
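A note on the numbers in this ChatGLM3 LoRA config. The arithmetic below is not part of the config; it is a small illustrative check of what the settings imply per GPU when `pack_to_max_length=True` (every sample is a 2048-token packed sequence) and how long the linear warmup phase lasts.

```python
# Back-of-the-envelope numbers implied by the settings above (per GPU).
batch_size = 20            # packed sequences per forward pass
accumulative_counts = 4    # gradient-accumulation steps per optimizer update
max_length = 2048          # tokens per packed sequence
warmup_ratio = 0.03
max_epochs = 3

seqs_per_update = batch_size * accumulative_counts     # 80 sequences
tokens_per_update = seqs_per_update * max_length       # 163,840 tokens
warmup_epochs = warmup_ratio * max_epochs              # LinearLR covers the first 0.09 epoch
print(seqs_per_update, tokens_per_update, warmup_epochs)
```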
216
xtuner_config/deepseek_moe_16b_chat_qlora_oasst1_e3.py
Normal file
@ -0,0 +1,216 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE

from mmengine.visualization import Visualizer, WandbVisBackend, TensorboardVisBackend

#######################################################################
#                          PART 1  Settings                          #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/deepseek-ai/deepseek-moe-16b-chat'
use_varlen_attn = False

# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.deepseek_moe
max_length = 2048
pack_to_max_length = True

# Scheduler & Optimizer
batch_size = 16  # per_device
accumulative_counts = 8
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 100
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
    '我压力很大', '生活没意思', "非常容易羡慕别人啊"
]

#######################################################################
#                      PART 2  Model & Tokenizer                     #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        bias='none',
        task_type='CAUSAL_LM'))

#######################################################################
#                      PART 3  Dataset & Dataloader                  #
#######################################################################
train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=train_dataset,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                   #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                          #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = dict(
    type=Visualizer,
    vis_backends=[dict(type=WandbVisBackend)]
)

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
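After QLoRA training with a config like the DeepSeek MoE one above, only the adapter weights are produced; they are normally converted and merged with the xtuner CLI. The snippet below is a hedged sketch of the underlying peft operation only, assuming the standard peft API; the adapter path and output directory are placeholders, and merging a 16B model in fp16 needs a machine with sufficient memory.

```python
# Illustrative only: merging a trained LoRA adapter back into the base weights.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    'deepseek-ai/deepseek-moe-16b-chat',   # hub ID as a stand-in for the local path
    trust_remote_code=True,
    torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, './path/to/lora-adapter')  # placeholder
merged = merged.merge_and_unload()          # folds the adapters into the base weights
merged.save_pretrained('./merged-model')    # placeholder output directory
```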
1
xtuner_config/images/README_EN.md
Normal file
@ -0,0 +1 @@
This folder contains all related files and images.
198
xtuner_config/internlm2_1_8b_full_alpaca_e3.py
Normal file
@ -0,0 +1,198 @@
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE

from mmengine.visualization import Visualizer, WandbVisBackend, TensorboardVisBackend

#######################################################################
#                          PART 1  Settings                          #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/jayhust/internlm2-chat-1_8b'
use_varlen_attn = False

# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True

# Scheduler & Optimizer
batch_size = 16  # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 100
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
    '我压力很大', '生活没意思', "非常容易羡慕别人啊"
]

#######################################################################
#                      PART 2  Model & Tokenizer                     #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True))

#######################################################################
#                      PART 3  Dataset & Dataloader                  #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                   #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                          #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = dict(
    type=Visualizer,
    vis_backends=[dict(type=WandbVisBackend)]
)

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
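Unlike the QLoRA configs, this InternLM2-1.8B file has no `lora` or `quantization_config` entry, so every weight of the base model is updated, which is also why its learning rate is 2e-5 rather than 2e-4. The check below is illustrative only; it uses the public hub ID as a stand-in for the local path in the config.

```python
# Quick, illustrative check that a full-parameter setup trains every weight.
from transformers import AutoModelForCausalLM

m = AutoModelForCausalLM.from_pretrained(
    'internlm/internlm2-chat-1_8b', trust_remote_code=True)  # stand-in for the local path
trainable = sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f'{trainable / 1e9:.2f}B trainable parameters')  # all parameters require grad here
```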
221
xtuner_config/mixtral_8x7b_instruct_qlora_oasst1_e3.py
Normal file
@ -0,0 +1,221 @@
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE

from mmengine.visualization import Visualizer, WandbVisBackend, TensorboardVisBackend

#######################################################################
#                          PART 1  Settings                          #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/HIT-SCIR/Chinese-Mixtral-8x7B'
use_varlen_attn = False

# Data
data_path = './merge.json'
prompt_template = PROMPT_TEMPLATE.mixtral
max_length = 2048
pack_to_max_length = True

# Scheduler & Optimizer
batch_size = 16  # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 500
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
    '我压力很大', '生活没意思', "非常容易羡慕别人啊"
]

#######################################################################
#                      PART 2  Model & Tokenizer                     #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        target_modules=[
            'q_proj', 'k_proj', 'v_proj', 'o_proj', 'w1', 'w2', 'w3'
        ],
        bias='none',
        task_type='CAUSAL_LM'))

#######################################################################
#                      PART 3  Dataset & Dataloader                  #
#######################################################################
train_dataset = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=train_dataset,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                   #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                          #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
# visualizer = None
visualizer = dict(
    type=Visualizer,
    vis_backends=[dict(type=TensorboardVisBackend)]
)

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
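The Mixtral config is the only one here that sets `target_modules` explicitly: LoRA is injected into the attention projections and into the per-expert FFN matrices (`w1`/`w2`/`w3`) of the MoE blocks. The snippet below is just an illustrative restatement of that block as a standalone peft `LoraConfig`, not additional project code.

```python
# Illustrative only: the LoRA block above expressed as a plain peft LoraConfig.
from peft import LoraConfig

mixtral_lora = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1,
    bias='none', task_type='CAUSAL_LM',
    # attention projections + per-expert FFN matrices in the MoE layers
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'w1', 'w2', 'w3'])
```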
192
xtuner_config/qwen1_5_0_5_B_full.py
Normal file
@ -0,0 +1,192 @@
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE

#######################################################################
#                          PART 1  Settings                          #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/model/qwen/Qwen1___5-0___5B-Chat'
use_varlen_attn = False

# Data
data_path = './data_pro.json'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True

# Scheduler & Optimizer
batch_size = 16  # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 100
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = "现在你是一个心理专家,我有一些心理问题,请你用专业的知识帮我解决。"
evaluation_inputs = [
    '我压力很大', '生活没意思', "非常容易羡慕别人啊"
]

#######################################################################
#                      PART 2  Model & Tokenizer                     #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True))

#######################################################################
#                      PART 3  Dataset & Dataloader                  #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=data_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    sampler=dict(type=DefaultSampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                   #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                          #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = '/root/Emollm/work_dirs/qwen_0_5_B/iter_255.pth'

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
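One detail worth flagging in the Qwen1.5-0.5B full-SFT config: `load_from` points at an earlier checkpoint while `resume` stays `False`. Under mmengine semantics this loads the saved weights as a warm start but restarts the optimizer, scheduler and iteration counter; only `resume = True` would additionally restore that training state. The comments below are illustrative, not part of the config.

```python
# Illustrative only: how the two runtime flags above interact in mmengine.
load_from = '/root/Emollm/work_dirs/qwen_0_5_B/iter_255.pth'  # checkpoint whose weights are loaded
resume = False   # weights are loaded, but optimizer/scheduler/iteration restart from zero
# resume = True  # would also restore optimizer, scheduler and iteration count from the checkpoint
```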