Update README_EN.md

2024-04-09 20:53:17 +08:00
commit 700edfb9e8
parent 360dc212a5
1 changed file with 6 additions and 2 deletions
--- a/datasets/README_EN.md
+++ b/datasets/README_EN.md
@@ -2,7 +2,7 @@
 * Category of dataset: **General** and **Role-play**
 * Type of data: **QA** and **Conversation**
-* Summary: General(**6 datasets**), Role-play(**3 datasets**)
+* Summary: General(**6 datasets**), Role-play(**5 datasets**)
 ## Category
 * **General**: generic dataset, including psychological Knowledge, counseling technology, etc.
@@ -25,6 +25,8 @@
 | *Role-play* |         aiwei         | Conversation |  4000+  |
 | *Role-play* |       SoulStar        |      QA      | 11200+  |
 | *Role-play* |        tiangou        | Conversation |  3900+  |
+| *Role-play* |        mother         | Conversation | 24,500+ |
+| *Role-play* |       scientist       | Conversation | 28,400+ |
 |     ……      |          ……           |      ……      |   ……    |
@@ -41,8 +43,10 @@
 * dataset `aiwei` from this repo
 * dataset `tiangou` from this repo
 * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* dataset `mother` from this repo
+* dataset `scientist` from this repo
 **Dataset Deduplication**：
 Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced via adjusting the threshold.
-https://algonotes.readthedocs.io/en/latest/Simhash.html
+https://algonotes.readthedocs.io/en/latest/Simhash.html
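
The deduplication approach described in the README text (exact matching combined with Simhash fuzzy matching and an adjustable threshold) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's actual implementation: the function names `simhash`, `hamming`, and `deduplicate`, the character 3-gram features, the MD5-based hashing, and the default threshold of 3 bits are all hypothetical choices made for this example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a Simhash fingerprint from character 3-grams (assumed feature choice).

    Works for bits <= 128 because the per-feature hash is taken from an MD5 digest.
    """
    # Normalize: lowercase and drop all whitespace so trivial variants collide.
    normalized = re.sub(r"\s+", "", text.lower())
    grams = [normalized[i:i + 3] for i in range(max(len(normalized) - 2, 1))]

    weights = [0] * bits
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1

    # A bit is set in the fingerprint when its accumulated vote is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(records, threshold=3):
    """Keep the first occurrence of each record.

    A record is dropped when it is an exact duplicate, or when its fingerprint
    is within `threshold` bits of an already kept record.
    """
    seen_exact = set()
    fingerprints = []
    kept = []
    for text in records:
        if text in seen_exact:  # absolute (exact) matching
            continue
        seen_exact.add(text)
        fp = simhash(text)
        if any(hamming(fp, other) <= threshold for other in fingerprints):
            continue  # fuzzy (Simhash) matching
        fingerprints.append(fp)
        kept.append(text)
    return kept
```

Lowering `threshold` makes the fuzzy match stricter, which reduces the risk of discarding distinct samples at the cost of keeping more near-duplicates. The pairwise comparison here is quadratic in the number of kept records, so a large dataset would likely need an index over fingerprint segments instead.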