From 700edfb9e8316c008def9ba548e3933817ab929a Mon Sep 17 00:00:00 2001
From: MING_X <119648793+MING-ZCH@users.noreply.github.com>
Date: Tue, 9 Apr 2024 20:53:17 +0800
Subject: [PATCH] Update README_EN.md

---
 datasets/README_EN.md | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/datasets/README_EN.md b/datasets/README_EN.md
index 835de61..d77741f 100644
--- a/datasets/README_EN.md
+++ b/datasets/README_EN.md
@@ -2,7 +2,7 @@
 
 * Category of dataset: **General** and **Role-play**
 * Type of data: **QA** and **Conversation**
-* Summary: General(**6 datasets**), Role-play(**3 datasets**)
+* Summary: General(**6 datasets**), Role-play(**5 datasets**)
 
 ## Category
 * **General**: generic dataset, including psychological Knowledge, counseling technology, etc.
@@ -25,6 +25,8 @@
 | *Role-play* | aiwei | Conversation | 4000+ |
 | *Role-play* | SoulStar | QA | 11200+ |
 | *Role-play* | tiangou | Conversation | 3900+ |
+| *Role-play* | mother | Conversation | 24,500+ |
+| *Role-play* | scientist | Conversation | 28,400+ |
 | …… | …… | …… | …… |
 
 
@@ -41,8 +43,10 @@
 * dataset `aiwei` from this repo
 * dataset `tiangou` from this repo
 * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* dataset `mother` from this repo
+* dataset `scientist` from this repo
 
 **Dataset Deduplication**:
 Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced via adjusting the threshold.
 
-https://algonotes.readthedocs.io/en/latest/Simhash.html
\ No newline at end of file
+https://algonotes.readthedocs.io/en/latest/Simhash.html
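
Note on the **Dataset Deduplication** paragraph touched by this patch: it describes combining absolute (exact) matching with Simhash-based fuzzy matching and tuning a threshold. The sketch below is one possible reading of that idea, not the repository's actual implementation. The function names (`simhash`, `hamming_distance`, `deduplicate`), the character 3-gram shingling, the use of MD5 as the per-shingle hash, and the default threshold of 3 bits are all assumptions made for illustration.

```python
import hashlib

FINGERPRINT_BITS = 64  # assumed fingerprint width


def simhash(text: str, shingle_size: int = 3) -> int:
    """Return a 64-bit Simhash fingerprint built from character shingles."""
    weights = [0] * FINGERPRINT_BITS
    # Character n-grams avoid needing a word tokenizer (useful for Chinese text).
    count = max(len(text) - shingle_size + 1, 1)
    for i in range(count):
        shingle = text[i:i + shingle_size]
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each shingle votes +1 or -1 on every bit position.
        for bit in range(FINGERPRINT_BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(records: list[str], threshold: int = 3) -> list[str]:
    """Drop exact duplicates first, then near-duplicates by Simhash distance.

    A record is treated as a near-duplicate when its fingerprint is within
    `threshold` bits of a record that was already kept.
    """
    seen_exact: set[str] = set()
    kept: list[str] = []
    kept_fingerprints: list[int] = []
    for text in records:
        key = " ".join(text.split())  # normalize whitespace for exact matching
        if key in seen_exact:
            continue
        seen_exact.add(key)
        fp = simhash(key)
        if any(hamming_distance(fp, other) <= threshold for other in kept_fingerprints):
            continue
        kept.append(text)
        kept_fingerprints.append(fp)
    return kept
```

In this sketch, lowering `threshold` keeps more records (fewer false matches, more surviving near-duplicates), while raising it removes more aggressively. That is the trade-off the paragraph refers to when it mentions adjusting the threshold to avoid losing important data. The pairwise comparison is quadratic in the number of kept records, so a large dataset would likely need an index over fingerprint blocks instead.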