[DOC] Update README.md in datasets and evaluate (#172)

2024-04-09 21:34:30 +08:00
commit 6e41cba3d9
parent e97b1a86ad 700edfb9e8
4 changed files with 14 additions and 4 deletions
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -2,7 +2,7 @@

 * Datasets fall into two categories by purpose: **General** and **Role-play**
 * Data falls into two types by format: **QA** and **Conversation**
-* Summary: General (**6 datasets**); Role-play (**3 datasets**)
+* Summary: General (**6 datasets**); Role-play (**5 datasets**)

 ## Dataset Categories

@@ -27,6 +27,8 @@
 | *Role-play* |         aiwei         | Conversation |  4000+  |
 | *Role-play* |       SoulStar        |      QA      | 11200+  |
 | *Role-play* |        tiangou        | Conversation |  3900+  |
+| *Role-play* |        mother         | Conversation | 24,500+ |
+| *Role-play* |       scientist       | Conversation | 28,400+ |
 |     ……      |          ……           |      ……      |   ……    |

 ## Dataset Sources
@@ -45,6 +47,8 @@
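 * Dataset aiwei comes from this project
 * Dataset tiangou comes from this project
 * Dataset SoulStar comes from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* Dataset mother comes from this project
+* Dataset scientist comes from this project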

 ## Dataset Deduplication

--- a/datasets/README_EN.md
+++ b/datasets/README_EN.md
@@ -2,7 +2,7 @@

 * Category of dataset: **General** and **Role-play**
 * Type of data: **QA** and **Conversation**
-* Summary: General(**6 datasets**), Role-play(**3 datasets**)
+* Summary: General(**6 datasets**), Role-play(**5 datasets**)

 ## Category
 * **General**: generic dataset, including psychological Knowledge, counseling technology, etc.
@@ -25,6 +25,8 @@
 | *Role-play* |         aiwei         | Conversation |  4000+  |
 | *Role-play* |       SoulStar        |      QA      | 11200+  |
 | *Role-play* |        tiangou        | Conversation |  3900+  |
+| *Role-play* |        mother         | Conversation | 24,500+ |
+| *Role-play* |       scientist       | Conversation | 28,400+ |
 |     ……      |          ……           |      ……      |   ……    |


@@ -41,8 +43,10 @@
 * dataset `aiwei` from this repo
 * dataset `tiangou` from this repo
 * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* dataset `mother` from this repo
+* dataset `scientist` from this repo

 **Dataset Deduplication**：
 Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced via adjusting the threshold.

-https://algonotes.readthedocs.io/en/latest/Simhash.html
+https://algonotes.readthedocs.io/en/latest/Simhash.html
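
The deduplication step described in `README_EN.md` (exact matching followed by Simhash fuzzy matching with an adjustable threshold) can be sketched roughly as below. This is a minimal illustration and not the project's actual script: the function names, the character 3-gram features, the BLAKE2b hash, and the default threshold of 3 bits are all assumptions made for the example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a Simhash fingerprint built from character 3-grams (assumed features)."""
    # Normalize so that case and whitespace differences do not affect the fingerprint.
    text = re.sub(r"\s+", "", text.lower())
    grams = [text[i:i + 3] for i in range(max(len(text) - 2, 1))]
    weights = [0] * bits
    for gram in grams:
        digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(records: list[str], threshold: int = 3) -> list[str]:
    """Drop exact duplicates, then near-duplicates within `threshold` bits."""
    seen_exact = set()
    kept_hashes = []
    kept = []
    for rec in records:
        if rec in seen_exact:  # exact (absolute) match
            continue
        seen_exact.add(rec)
        fp = simhash(rec)
        # Fuzzy match: compare against every fingerprint kept so far.
        if any(hamming(fp, k) <= threshold for k in kept_hashes):
            continue
        kept_hashes.append(fp)
        kept.append(rec)
    return kept
```

Lowering `threshold` keeps more records and reduces the chance of discarding distinct data through a false match; raising it removes more near-duplicates. The pairwise comparison is quadratic in the number of kept records, so a large dataset would likely need an index over the fingerprints.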
--- a/evaluate/README.md
+++ b/evaluate/README.md
@@ -20,4 +20,4 @@
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora |      1.32       |        2.20       |      2.10       | 1.00    |
 | InternLM2_7B_chat_full  |      1.40       |        2.45       |      2.24       | 1.00    |
-
+| InternLM2_20B_chat_lora |      1.42       |        2.39       |      2.22       | 1.00    |
--- a/evaluate/README_EN.md
+++ b/evaluate/README_EN.md
@@ -19,3 +19,5 @@
 |       Model       |    Comprehensiveness  |  Professionalism  |  Authenticity   | Safety  |
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora |      1.32       |        2.20       |      2.10       | 1.00    |
+| InternLM2_7B_chat_full  |      1.40       |        2.45       |      2.24       | 1.00    |
+| InternLM2_20B_chat_lora |      1.42       |        2.39       |      2.22       | 1.00    |