Dev (#135)

2024-03-24 00:48:06 +08:00 · 2024-03-24 00:48:06 +08:00 · d9c44ff9d1
commit d9c44ff9d1
parent 3232964494 06f9fe543b
17 changed files with 261 additions and 12 deletions
--- a/README.md
+++ b/README.md
@ -78,7 +78,9 @@

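 ### 🎇Recent Updates

- 【2024.3.12】 Released [Aiwei](https://aistudio.baidu.com/community/app/63335) on the Baidu PaddlePaddle (飞浆) platform
+- 【2024.3.25】 Released [Daddy-like Boyfriend Psychological Counselor](https://aistudio.baidu.com/community/app/68787) on the Baidu PaddlePaddle (飞桨) platform
+- 【2024.3.24】 Released the InternLM2-Base-7B QLoRA fine-tuned model on the OpenXLab and ModelScope platforms; for details, see [InternLM2-Base-7B QLoRA](./xtuner_config/README_internlm2_7b_base_qlora.md)
+- 【2024.3.12】 Released [Aiwei](https://aistudio.baidu.com/community/app/63335) on the Baidu PaddlePaddle (飞桨) platform
 - 【2024.3.11】 **EmoLLM V2.0 improves on EmoLLM V1.0 across the board and now surpasses Role-playing ChatGPT on psychological counseling tasks!** [Click to try EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0); updated [dataset statistics and details](./datasets/) and the [roadmap](./assets/Roadmap_ZH.png)
 - 【2024.3.9】 Added concurrency to speed up [QA pair generation](./scripts/qa_generation/) and the [RAG pipeline](./rag/)
 - 【2024.3.3】 [Open-sourced EmoLLM V2.0, a full fine-tune of InternLM2-7B-chat](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full), which needs two A100*80G; updated the professional evaluation, see [evaluate](./evaluate/); updated the PaddleOCR-based PDF-to-txt tool script, see [scripts](./scripts/)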
@ -236,7 +238,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
 |                 [ZeyuBa](https://github.com/ZeyuBa)                  |         Institute of Automation, Master's student          |  |           |
 |       [aiyinyuedejustin](https://github.com/aiyinyuedejustin)        |        University of Pennsylvania, Master's student        |  |           |
 |              [Nobody-ML](https://github.com/Nobody-ML)               | China University of Petroleum (East China), Undergraduate student |  |           |
-|                [chg0901](https://github.com/chg0901)                 | [MiniSora](https://github.com/mini-sora/minisora/) |Main maintainer and admin of [MiniSora](https://github.com/mini-sora/minisora/)| LLM fine-tuning, data cleaning, docs translation |
+|                [chg0901](https://github.com/chg0901)                 | [MiniSora](https://github.com/mini-sora/minisora/) |Main maintainer and admin of [MiniSora](https://github.com/mini-sora/minisora/)| LLM pre-training and fine-tuning, model uploading, data cleaning, docs translation |
 |                 [Mxoder](https://github.com/Mxoder)                  |         Beihang University, Undergraduate student          |  |           |
 |               [Anooyman](https://github.com/Anooyman)                | Nanjing University of Science and Technology, Master's student |  |           |
 |             [Vicky-3021](https://github.com/Vicky-3021)              |   Xidian University, Master's student (Research Year 0)    |  |           |
--- a/README_EN.md
+++ b/README_EN.md
@ -77,8 +77,12 @@ The Model aims to fully understand and promote the mental health of individuals,
 - Psychological resilience: Refers to an individual's ability to recover from adversity and adapt. Those with strong psychological resilience can bounce back from challenges and learn and grow from them.
 - Prevention and intervention measures: The Mental Health Grand Model also includes strategies for preventing psychological issues and promoting mental health, such as psychological education, counseling, therapy, and social support systems.
 - Assessment and diagnostic tools: Effective promotion of mental health requires scientific tools to assess individuals' psychological states and diagnose potential psychological issues.
+
 ### Recent Updates
- 【2024.3.12】 Released on Baidu Flying Pulp Platform [aiwei](https://aistudio.baidu.com/community/app/63335)
+
+- 【2024.3.25】 [Daddy-like Boy-Friend](https://aistudio.baidu.com/community/app/68787) is released on the Baidu PaddlePaddle AI Studio platform
+- 【2024.3.24】 The InternLM2-Base-7B QLoRA fine-tuned model has been released on the OpenXLab and ModelScope platforms. For more details, please refer to [InternLM2-Base-7B QLoRA](./xtuner_config/README_internlm2_7b_base_qlora.md).
+- 【2024.3.12】 [aiwei](https://aistudio.baidu.com/community/app/63335) is released on the Baidu PaddlePaddle AI Studio platform
 - 【2024.3.11】 **EmoLLM V2.0 is greatly improved in all scores compared to EmoLLM V1.0. Surpasses the performance of Role-playing ChatGPT on counseling tasks!** [Click to experience EmoLLM V2.0](https://openxlab.org.cn/apps/detail/Farewell1/EmoLLMV2.0), update [dataset statistics and details](./datasets/), [Roadmap](./assets/Roadmap_ZH.png)
 - 【2024.3.9】 Add concurrency acceleration [QA pair generation](./scripts/qa_generation/), [RAG pipeline](./rag/)
 - 【2024.3.3】 [Based on InternLM2-7B-chat full fine-tuned version EmoLLM V2.0 open sourced](https://openxlab.org.cn/models/detail/ajupyter/EmoLLM_internlm2_7b_full), need two A100*80G, update professional evaluation, see [evaluate](./evaluate/), update PaddleOCR-based PDF to txt tool scripts, see [scripts](./scripts/).
@ -248,7 +252,7 @@ This project uses Git for version control. You can see the currently available v
 |                   [ZeyuBa](https://github.com/ZeyuBa)                   |              Institute of Automation, Master's student               |  |                                    |
 |         [aiyinyuedejustin](https://github.com/aiyinyuedejustin)         |             University of Pennsylvania, Master's student             |  |                                    |
 |                [Nobody-ML](https://github.com/Nobody-ML)                |  China University of Petroleum (East China), Undergraduate student   |  |                                    |
-|                  [chg0901](https://github.com/chg0901)                  |          [MiniSora](https://github.com/mini-sora/minisora)           |Maintainer and Admin of [MiniSora](https://github.com/mini-sora/minisora) | LLM Fine-Tuning, Data Cleaning and Docs Translation |
+|                  [chg0901](https://github.com/chg0901)                  |          [MiniSora](https://github.com/mini-sora/minisora)           |Maintainer and Admin of [MiniSora](https://github.com/mini-sora/minisora) | LLM Pre-Training and Fine-Tuning, Model Uploading, Data Cleaning and Docs Translation |
 |                   [Mxoder](https://github.com/Mxoder)                   |              Beihang University, Undergraduate student               |  |                                    |
 |                 [Anooyman](https://github.com/Anooyman)                 |    Nanjing University of Science and Technology, Master's student    |  |                                    |
 |               [Vicky-3021](https://github.com/Vicky-3021)               |        Xidian University, Master's student (Research Year 0)         |  |                                    |
--- a/datasets/README.md
+++ b/datasets/README.md
@ -4,11 +4,13 @@
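 * By format, the data falls into two types: **QA** and **Conversation**
 * Data summary: General (**6 datasets**); Role-play (**3 datasets**)

- ## Dataset Types
+## Dataset Types
+
 * **General**: general-purpose datasets covering psychological knowledge, counseling techniques, and other general content
 * **Role-play**: role-play datasets containing dialogue data in the style of specific characters

 ## Data Types
+
 * **QA**: question-answer pairs
 * **Conversation**: multi-turn dialogues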

@ -28,7 +30,9 @@
 |     ……      |          ……           |      ……      |   ……    |

 ## Dataset Sources
-**General**:
+
+### **General**
+
 * Dataset data comes from this project
 * Dataset data_pro comes from this project
 * Dataset multi_turn_dataset_1 is sourced from [Smile](https://github.com/qiuhuachuan/smile)
@ -36,24 +40,28 @@
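 * Dataset single_turn_dataset_1 comes from this project
 * Dataset single_turn_dataset_2 comes from this project

-**Role-play**:
+### **Role-play**
+
 * Dataset aiwei comes from this project
 * Dataset tiangou comes from this project
 * Dataset SoulStar is sourced from [SoulStar](https://github.com/Nobody-ML/SoulStar)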

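 ## Dataset Deduplication
+
 The datasets are deduplicated with a combination of exact matching and fuzzy matching (Simhash) to improve the fine-tuned model. While keeping dataset quality high, the threshold can be tuned to reduce the risk of losing important data to false matches.

-**Introduction to the Simhash Algorithm**
+### **Introduction to the Simhash Algorithm**
+
 Simhash (similarity hash) is an algorithm for detecting similar or duplicate items in large amounts of data. It works by converting text into numeric fingerprints that are highly similar for similar texts. Simhash is particularly effective for text data, especially at large scale.

-**Simhash Implementation Steps**
+### **Simhash Implementation Steps**
+
 * Text preprocessing: convert the text into a format suitable for Simhash. This may include word segmentation, stop-word removal, stemming, and so on.
 * Fingerprint generation: apply Simhash to the preprocessed text to produce a numeric fingerprint. Each fingerprint is a hash value representing the text content.
 * Fingerprint comparison: identify duplicate or similar records by comparing how similar their hash values are. The defining property of Simhash is that texts with small differences still produce highly similar hash values.
 * Threshold selection: set a similarity threshold; two fingerprints are treated as similar or duplicate records only when their similarity exceeds it.
 * Handling similar records: records flagged as similar can be reviewed manually or merged automatically to eliminate duplicates.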
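
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's `deduplicate.py`: the helper names (`tokenize`, `simhash`, `hamming`, `deduplicate`), the character-bigram tokenizer, the 64-bit MD5-based token hash, and the `max_distance=3` threshold are all hypothetical choices. Similarity is expressed here as a Hamming distance, so a smaller distance means more similar fingerprints.

```python
import hashlib
from collections import Counter


def tokenize(text, n=2):
    """Split text into character n-grams (a stand-in for real word segmentation)."""
    text = "".join(text.split())  # drop all whitespace
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text, bits=64):
    """Return a `bits`-wide Simhash fingerprint of `text` as an int."""
    votes = [0] * bits
    for token, weight in Counter(tokenize(text)).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each token votes +weight or -weight on every bit position.
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts, max_distance=3):
    """Keep the first text of every group of exact or near duplicates."""
    seen_exact = set()
    kept, fingerprints = [], []
    for text in texts:
        if text in seen_exact:  # exact match
            continue
        seen_exact.add(text)
        fp = simhash(text)
        # Fuzzy match: close fingerprints are treated as duplicates.
        if any(hamming(fp, other) <= max_distance for other in fingerprints):
            continue
        kept.append(text)
        fingerprints.append(fp)
    return kept
```

Raising `max_distance` removes more near duplicates at the cost of more false matches; lowering it keeps more data. The pairwise comparison is quadratic in the number of kept records, which is fine for a sketch but would need indexing for large datasets.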

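-## Usage
-### deduplicate.py
-`deduplicate.py` deduplicates the .json data in a model-named folder under datasets (for example, 'datasets/qwen') and writes the deduplicated data to the `datasets/qwen/dedup` folder.
+### deduplicate.py Usage
+
+`deduplicate.py` deduplicates the .json data in a model-named folder under datasets (for example, 'datasets/qwen') and writes the deduplicated data to the `datasets/qwen/dedup` folder.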
--- a/scripts/README_Model_Uploading.md
+++ b/scripts/README_Model_Uploading.md
@ -0,0 +1,235 @@
+# Model Upload Guide
+
+## OpenXLab (Puyuan) Platform
+
+### About the OpenXLab Platform
+
+<div align="center">
+<img src="./asserts/openxlab.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
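+The OpenXLab (Puyuan) content platform is a one-stop AI service platform for AI researchers and developers, comprising a Dataset Center, a Model Center, and an App Center. It provides the materials needed for model training as well as hosting for model inference applications. The platform also aims to build a community around AI datasets, models, and applications where AI researchers can share and exchange ideas. Through it, researchers can better showcase their models and spark creativity, supporting the sustainable development of the AI ecosystem.
+
+For more information, see the [OpenXLab platform introduction](https://openxlab.org.cn/docs/intro.html)
+
+<!--  App Center: provides application hosting. By following the platform conventions, users can build a model inference demo with a simple front-end wrapper component (Gradio). The App Center offers free application deployment, and ordinary users can try models interactively there, which helps them find the academic models or application services they need. The front-end wrapper component and the platform SDK help AI developers build AI applications quickly and easily.
+
+ Model Center: supports a rich set of model management options. Built on a standard for model metadata, it lets users upload, store, search, and evaluate all kinds of models. The platform's command-line tool makes it easy for AI developers to upload and publish model files, and an object storage system provides large-file storage with fast upload and download.
+
+ Dataset Center: supports diverse data management. It offers browsing, search, and download of public datasets, plus upload, management, and publishing of private datasets, so users can openly share datasets they build. It gives AI researchers free open-source datasets, including classic datasets from many fields in a unified format. The platform's search lets researchers find the datasets they need quickly, and the unified format makes it easy to develop cross-dataset tasks.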
+
+<div align="center">
+<img src="./asserts/平台概述.e6d980f8.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
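+On the content platform, model repositories store the weight files of models, and app repositories manage deployed applications. To reduce contributors' maintenance cost, the training code for models and the code for applications are hosted on GitHub. Contributors only need to maintain the GitHub repository rather than several copies of the code; the content platform provides only weight storage and application deployment.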
+
+<div align="center">
+<img src="./asserts/GitHub与平台的关系.bee7809e.png" width="600"/>
+  <div align="center">
+  </div>
+</div> -->
+
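+## Model Creation Process
+
+### Key Points
+
+- The OpenXLab Model Center currently supports uploading files with Git commands
+- Before uploading this way, make sure Git is installed
+- Uploading requires permission checks, so we recommend connecting to InternLM AI Studio over remote SSH in VSCode and obtaining the XLab key
+
+### Creation Steps
+
+- Step 1: Click the "Create Model" button
+- Step 2: Fill in the repository information
+- **Step 3: Upload the model files**
+
+For more details and step-by-step instructions, see [**Model Creation Process** (steps 1 and 2)](https://openxlab.org.cn/docs/models/%E6%A8%A1%E5%9E%8B%E5%88%9B%E5%BB%BA%E6%B5%81%E7%A8%8B.html) and [**Uploading Models** (step 3)](https://openxlab.org.cn/docs/models/%E4%B8%8A%E4%BC%A0%E6%A8%A1%E5%9E%8B.html). Below we list the basic steps we used and the points that need attention.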
+
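+## Uploading the Model
+
+### Upload Steps
+
+- **Step 1: Install git lfs**
+- **Step 2: Configure git and lfs**
+- **Step 3: Configure the OpenXLab key**
+- Step 4: Arrange the files in the local folder
+- Step 5: Upload the model files in the local folder to OpenXLab
+- Step 6: Check the upload and add README.md
+
+The screenshot below shows a run where everything went smoothly; it does not include the `Install git lfs` step described below.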
+
+<div align="center">
+<img src="./asserts/full_upload.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
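+### 1. Install git lfs
+
+```bash
+# Pipe the repository setup script to bash; curl alone only prints it (prefix bash with sudo if not running as root)
+curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
+apt install git-lfs
+```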
+
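+### 2. Configure git and lfs
+
+```bash
+git config --global user.name "your username"
+git config --global user.email "your email"  # git also needs an email address to create commits
+
+git lfs install # this step is essential
+git clone https://code.openxlab.org.cn//chg0901/EmoLLM-InternLM7B-base.git  # URL of the model repo to upload to, created in steps 1 and 2 of the model creation process
+
+```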
+
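+### 3. Configure the OpenXLab Key
+
+- See [**Key Management**](https://openxlab.org.cn/security?tab=git) to obtain your Git Access Token
+- Click the "**Add Token**" button
+- Because you will upload files later, select the **"writable" permission** when creating the token
+
+### 4. Arrange Files in the Local Folder (the folder has the same name as the repository)
+
+Copy the merged model files into the folder created by git clone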
+
+```bash
+cd ./merge
+cp ./* /root/EmoLLM-InternLM7B-base/
+```
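
The copy step can also be sketched in Python. Note that `cp ./*` without `-r` skips subdirectories, and the `*` glob does not match hidden files by default, so a recursive copy may be safer if the merged folder has nested content. The helper name `copy_merged_model` is hypothetical, and the sketch assumes Python 3.8+ (for `dirs_exist_ok`).

```python
import shutil
from pathlib import Path


def copy_merged_model(merge_dir, repo_dir):
    """Copy everything under merge_dir into repo_dir, keeping subfolders.

    Returns the repository's files (relative paths, excluding .git) after the copy.
    """
    merge_dir, repo_dir = Path(merge_dir), Path(repo_dir)
    if not merge_dir.is_dir():
        raise FileNotFoundError(f"merged model folder not found: {merge_dir}")
    # dirs_exist_ok=True allows copying into the already-cloned repository folder.
    shutil.copytree(merge_dir, repo_dir, dirs_exist_ok=True)
    return sorted(
        p.relative_to(repo_dir).as_posix()
        for p in repo_dir.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(repo_dir).parts
    )
```

With the paths used in this guide, the call would be `copy_merged_model("./merge", "/root/EmoLLM-InternLM7B-base")`; the returned list is a quick way to check what `git add -A` will pick up.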
+
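+### 5. Upload the Model Files in the Local Folder to OpenXLab
+
+```bash
+cd /root/EmoLLM-InternLM7B-base  # run the git commands inside the cloned repository
+git add -A
+git commit -m "commit EmoLLM-InternLM7B-base"
+git push
+```
+
+During the push, you are prompted for the username and password three times: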
+
+<div align="center">
+<img src="./asserts/username_password.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
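+### 6. Check the Upload and Add README.md
+
+After uploading the model, you can also copy a previously uploaded `README.md` file into your own repository.
+
+Once that is done, you can see your model repo.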
+
+<div align="center">
+<img src="./asserts/result1.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
+<div align="center">
+<img src="./asserts/result2.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
+### Possible Issues
+
+The screenshots below show the error, the fixes, and the bash commands used.
+
+This problem occurs when an upload fails or is interrupted.
+
+<div align="center">
+<img src="./asserts/upload_error.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
+<div align="center">
+<img src="./asserts/upload_error_solution.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
+<div align="center">
+<img src="./asserts/upload_error_solution2.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
+The bash commands are as follows:
+
+```bash
+git add -A
+git commit -m "commit EmoLLM-InternLM7B-base"
+git push  # the error occurs here
+
+# solution 1
+git gc --prune=now
+git remote prune origin
+git push
+
+# solution 2 (in case solution 1 does not work)
+git update-ref -d refs/heads/main
+git fetch
+git merge origin/main
+
+# error resolved; upload again
+git push
+git commit -m "commit EmoLLM-InternLM7B-base"
+git push
+```
+
+## ModelScope
+
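+### About the ModelScope Platform
+
+ModelScope aims to build a next-generation open-source Model-as-a-Service sharing platform, offering AI developers flexible, easy-to-use, low-cost one-stop model services and making models easier to apply.
+
+By gathering industry-leading pre-trained models, the platform hopes to reduce duplicated R&D costs for developers, provide a greener, more open AI development environment and model services, and contribute to a green "digital economy".
+The ModelScope platform provides many kinds of high-quality models as open source, which developers can try and download for free.
+
+### Model Creation
+
+Creating a model on ModelScope is similar to doing so on OpenXLab, so we do not repeat the details here; open the [ModelScope model creation page](https://modelscope.cn/models/create) and fill in the form.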
+
+<div align="center">
+<img src="./asserts/ms_create.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
+<div align="center">
+<img src="./asserts/ms_config.png" width="600"/>
+  <div align="center">
+  </div>
+</div>
+
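+### Uploading a Model with the Python SDK
+
+You can use the modelscope model hub to upload a trained model to the ModelScope platform.
+
+Uploading to ModelScope is much simpler than uploading to OpenXLab. After creating the model on the ModelScope community website, you only need to **configure an access token (obtain it from ModelScope under `Personal Center -> Access Token`)** and then upload the local model directory through the `push_model` interface.
+
+Note that **ModelScope requires the uploaded model directory to contain a `configuration.json` file**. The merged model directory we trained contains only `config.json`, so you can copy that file and rename the copy.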
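
The copy-and-rename step can be sketched as follows. This is a minimal illustration using only the Python standard library; the helper name `ensure_configuration_json` is hypothetical, and it assumes, as described above, that a copy of `config.json` is acceptable as `configuration.json`.

```python
import shutil
from pathlib import Path


def ensure_configuration_json(model_dir):
    """Create configuration.json from config.json if it is missing.

    Returns the path to configuration.json; config.json is left untouched.
    """
    model_dir = Path(model_dir)
    target = model_dir / "configuration.json"
    if target.exists():
        return target  # nothing to do
    source = model_dir / "config.json"
    if not source.is_file():
        raise FileNotFoundError(f"no config.json found in {model_dir}")
    shutil.copyfile(source, target)  # copy, so the original file stays in place
    return target
```

Calling `ensure_configuration_json("my_model_dir")` before the upload leaves both files in the directory, so tools that expect `config.json` keep working.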
+
+```python
+from modelscope.hub.api import HubApi
+
+YOUR_ACCESS_TOKEN = 'obtain from ModelScope Personal Center -> Access Token'
+
+api = HubApi()
+api.login(YOUR_ACCESS_TOKEN)
+api.push_model(
+    model_id="yourname/your_model_id", 
+    model_dir="my_model_dir" # local model directory; it must contain configuration.json
+)
+```
+
+<div align="center">
+<img src="./asserts/ms_upload.png" width="900"/>
+  <div align="center">
+  </div>
+</div>
--- a/scripts/asserts/GitHub与平台的关系.bee7809e.png
+++ b/scripts/asserts/GitHub与平台的关系.bee7809e.png
--- a/scripts/asserts/full_upload.png
+++ b/scripts/asserts/full_upload.png
--- a/scripts/asserts/ms_config.png
+++ b/scripts/asserts/ms_config.png
--- a/scripts/asserts/ms_create.png
+++ b/scripts/asserts/ms_create.png
--- a/scripts/asserts/ms_upload.png
+++ b/scripts/asserts/ms_upload.png
--- a/scripts/asserts/openxlab.png
+++ b/scripts/asserts/openxlab.png
--- a/scripts/asserts/result1.png
+++ b/scripts/asserts/result1.png
--- a/scripts/asserts/result2.png
+++ b/scripts/asserts/result2.png
--- a/scripts/asserts/upload_error.png
+++ b/scripts/asserts/upload_error.png
--- a/scripts/asserts/upload_error_solution.png
+++ b/scripts/asserts/upload_error_solution.png
--- a/scripts/asserts/upload_error_solution2.png
+++ b/scripts/asserts/upload_error_solution2.png
--- a/scripts/asserts/username_password.png
+++ b/scripts/asserts/username_password.png
--- a/scripts/asserts/平台概述.e6d980f8.png
+++ b/scripts/asserts/平台概述.e6d980f8.png