[merge] merge new docs from dev bench (#173)

2024-04-09 22:50:32 +08:00 · 7a19c513a1
commit 7a19c513a1
parent 5abda388b3 6e41cba3d9
4 changed files with 14 additions and 4 deletions
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -2,7 +2,7 @@
 * Datasets fall into two categories by purpose: **General** and **Role-play**
 * Data come in two formats: **QA** and **Conversation**
-* Summary: General (**6 datasets**); Role-play (**3 datasets**)
+* Summary: General (**6 datasets**); Role-play (**5 datasets**)
 ## Dataset types
@@ -27,6 +27,8 @@
 | *Role-play* |         aiwei         | Conversation |  4000+  |
 | *Role-play* |       SoulStar        |      QA      | 11200+  |
 | *Role-play* |        tiangou        | Conversation |  3900+  |
+| *Role-play* |        mother         | Conversation | 24,500+ |
+| *Role-play* |       scientist       | Conversation | 28,400+ |
 |     ……      |          ……           |      ……      |   ……    |
 ## Dataset sources
@@ -45,6 +47,8 @@
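 * Dataset aiwei comes from this project
 * Dataset tiangou comes from this project
 * Dataset SoulStar comes from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* Dataset mother comes from this project
+* Dataset scientist comes from this project
 ## Dataset deduplication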
--- a/datasets/README_EN.md
+++ b/datasets/README_EN.md
@@ -2,7 +2,7 @@
 * Category of dataset: **General** and **Role-play**
 * Type of data: **QA** and **Conversation**
-* Summary: General(**6 datasets**), Role-play(**3 datasets**)
+* Summary: General(**6 datasets**), Role-play(**5 datasets**)
 ## Category
 * **General**: generic datasets, covering psychological knowledge, counseling techniques, etc.
@@ -25,6 +25,8 @@
 | *Role-play* |         aiwei         | Conversation |  4000+  |
 | *Role-play* |       SoulStar        |      QA      | 11200+  |
 | *Role-play* |        tiangou        | Conversation |  3900+  |
+| *Role-play* |        mother         | Conversation | 24,500+ |
+| *Role-play* |       scientist       | Conversation | 28,400+ |
 |     ……      |          ……           |      ……      |   ……    |
@@ -41,6 +43,8 @@
 * dataset `aiwei` from this repo
 * dataset `tiangou` from this repo
 * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* dataset `mother` from this repo
+* dataset `scientist` from this repo
 **Dataset Deduplication**:
 Deduplicate the dataset by combining exact matching with fuzzy matching (SimHash), which improves the effectiveness of the fine-tuned model. Adjusting the similarity threshold reduces the risk of losing important data to false matches while keeping the dataset quality high.
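
The deduplication note above can be illustrated with a minimal sketch. This is not the repository's actual script: the function names (`simhash`, `hamming`, `deduplicate`), the use of character trigrams hashed with MD5, the 64-bit fingerprint, and the default threshold of 3 are all assumptions chosen for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a SimHash fingerprint over character trigrams (assumed feature choice)."""
    # Lowercase and drop whitespace so formatting differences do not matter.
    compact = re.sub(r"\s+", "", text.lower())
    grams = [compact[i:i + 3] for i in range(max(len(compact) - 2, 1))]
    weights = [0] * bits
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(records, threshold: int = 3):
    """Keep the first occurrence of each record.

    A record is dropped if it exactly matches an earlier one (after stripping),
    or if its fingerprint is within `threshold` bits of a kept record.
    A lower threshold removes fewer records, lowering the risk of false matches.
    """
    seen_exact = set()
    kept_prints = []
    kept = []
    for text in records:
        key = text.strip()
        if key in seen_exact:
            continue  # exact duplicate
        seen_exact.add(key)
        fp = simhash(key)
        if any(hamming(fp, other) <= threshold for other in kept_prints):
            continue  # fuzzy duplicate
        kept_prints.append(fp)
        kept.append(text)
    return kept
```

The pairwise comparison is quadratic in the number of kept records, which is acceptable for a sketch; a larger dataset would typically bucket fingerprints first. The threshold is the knob the note refers to: raising it removes more near-duplicates at the cost of more false matches.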
--- a/evaluate/README.md
+++ b/evaluate/README.md
@@ -20,4 +20,4 @@
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora |      1.32       |        2.20       |      2.10       | 1.00    |
 | InternLM2_7B_chat_full  |      1.40       |        2.45       |      2.24       | 1.00    |
-
+| InternLM2_20B_chat_lora |      1.42       |        2.39       |      2.22       | 1.00    |
--- a/evaluate/README_EN.md
+++ b/evaluate/README_EN.md
@@ -19,3 +19,5 @@
 |       Model       |    Comprehensiveness  |  Professionalism  |  Authenticity   | Safety  |
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora |      1.32       |        2.20       |      2.10       | 1.00    |
 | InternLM2_7B_chat_full  |      1.40       |        2.45       |      2.24       | 1.00    |
+| InternLM2_20B_chat_lora |      1.42       |        2.39       |      2.22       | 1.00    |