From 8de90d35f1f8ec8d35de8c37717da95edfe4d697 Mon Sep 17 00:00:00 2001
From: MING_X <119648793+MING-ZCH@users.noreply.github.com>
Date: Sun, 21 Apr 2024 17:33:33 +0800
Subject: [PATCH] Update README_EN.md

---
 datasets/README_EN.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/datasets/README_EN.md b/datasets/README_EN.md
index 82d8b75..bf5d726 100644
--- a/datasets/README_EN.md
+++ b/datasets/README_EN.md
@@ -47,6 +47,4 @@
 * dataset `scientist` from this repo
 
 **Dataset Deduplication**:
-Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced via adjusting the threshold.
-
-https://algonotes.readthedocs.io/en/latest/Simhash.html
+Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced by adjusting the threshold.