From 64a80258a5fd43d5a5218fa128041cebe25c898e Mon Sep 17 00:00:00 2001
From: MING_X <119648793+MING-ZCH@users.noreply.github.com>
Date: Tue, 9 Apr 2024 19:06:11 +0800
Subject: [PATCH 1/4] Update README.md

---
 evaluate/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/evaluate/README.md b/evaluate/README.md
index d42a3c9..f4758fd 100644
--- a/evaluate/README.md
+++ b/evaluate/README.md
@@ -20,4 +20,4 @@
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
 | InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |
-
+| InternLM2_20B_chat_lora | 1.42 | 2.39 | 2.22 | 1.00 |

From b6e81c8b10b1b1842ceb60d236b2570ca137224b Mon Sep 17 00:00:00 2001
From: MING_X <119648793+MING-ZCH@users.noreply.github.com>
Date: Tue, 9 Apr 2024 19:07:14 +0800
Subject: [PATCH 2/4] Update README_EN.md

---
 evaluate/README_EN.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/evaluate/README_EN.md b/evaluate/README_EN.md
index b46b0ce..402cde8 100644
--- a/evaluate/README_EN.md
+++ b/evaluate/README_EN.md
@@ -19,3 +19,5 @@
 | Model | Comprehensiveness | rofessionalism | Authenticity | Safety |
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
+| InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |
+| InternLM2_20B_chat_lora | 1.42 | 2.39 | 2.22 | 1.00 |

From 360dc212a5a9695063f6766f0312d1ee2a8b0d09 Mon Sep 17 00:00:00 2001
From: MING_X <119648793+MING-ZCH@users.noreply.github.com>
Date: Tue, 9 Apr 2024 19:31:14 +0800
Subject: [PATCH 3/4] Update README.md

---
 datasets/README.md | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/datasets/README.md b/datasets/README.md
index bd0fcf9..94c2aad 100644
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -2,7 +2,7 @@
 
 * 数据集按用处分为两种类型:**General** 和 **Role-play**
 * 数据按格式分为两种类型:**QA** 和 **Conversation**
-* 数据汇总:General(**6个数据集**);Role-play(**3个数据集**)
+* 数据汇总:General(**6个数据集**);Role-play(**5个数据集**)
 
 ## 数据集类型
 
@@ -27,6 +27,8 @@
 | *Role-play* | aiwei | Conversation | 4000+ |
 | *Role-play* | SoulStar | QA | 11200+ |
 | *Role-play* | tiangou | Conversation | 3900+ |
+| *Role-play* | mother | Conversation | 24,500+ |
+| *Role-play* | scientist | Conversation | 28,400+ |
 | …… | …… | …… | …… |
 
 ## 数据集来源
@@ -45,6 +47,8 @@
 * 数据集 aiwei 来自本项目
 * 数据集 tiangou 来自本项目
 * 数据集 SoulStar 来源 [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* 数据集 mother 来自本项目
+* 数据集 scientist 来自本项目
 
 ## 数据集去重
 

From 700edfb9e8316c008def9ba548e3933817ab929a Mon Sep 17 00:00:00 2001
From: MING_X <119648793+MING-ZCH@users.noreply.github.com>
Date: Tue, 9 Apr 2024 20:53:17 +0800
Subject: [PATCH 4/4] Update README_EN.md

---
 datasets/README_EN.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/datasets/README_EN.md b/datasets/README_EN.md
index 835de61..d77741f 100644
--- a/datasets/README_EN.md
+++ b/datasets/README_EN.md
@@ -2,7 +2,7 @@
 
 * Category of dataset: **General** and **Role-play**
 * Type of data: **QA** and **Conversation**
-* Summary: General(**6 datasets**), Role-play(**3 datasets**)
+* Summary: General(**6 datasets**), Role-play(**5 datasets**)
 
 ## Category
 * **General**: generic dataset, including psychological Knowledge, counseling technology, etc.
@@ -25,6 +25,8 @@
 | *Role-play* | aiwei | Conversation | 4000+ |
 | *Role-play* | SoulStar | QA | 11200+ |
 | *Role-play* | tiangou | Conversation | 3900+ |
+| *Role-play* | mother | Conversation | 24,500+ |
+| *Role-play* | scientist | Conversation | 28,400+ |
 | …… | …… | …… | …… |
 
 
@@ -41,8 +43,10 @@
 * dataset `aiwei` from this repo
 * dataset `tiangou` from this repo
 * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* dataset `mother` from this repo
+* dataset `scientist` from this repo
 
 **Dataset Deduplication**:
 Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced via adjusting the threshold.
 
-https://algonotes.readthedocs.io/en/latest/Simhash.html
\ No newline at end of file
+https://algonotes.readthedocs.io/en/latest/Simhash.html