diff --git a/datasets/README.md b/datasets/README.md
index b5d966b..eebb02b 100644
--- a/datasets/README.md
+++ b/datasets/README.md
@@ -56,16 +56,16 @@
 
 ### **Introduction to the Simhash Algorithm**
 
-Simhash (similarity hashing) is an algorithm for detecting similar or duplicate items in large amounts of data. It works by converting text into a set of numerical fingerprints, which are highly similar for similar texts. The Simhash algorithm is particularly effective for text data, especially when processing large amounts of it.
+Simhash (similarity hashing) is an algorithm for detecting similar or duplicate items in large amounts of data. It works by converting text into a set of numerical fingerprints, which are highly similar for similar texts. The Simhash algorithm is particularly effective for text data, especially when processing large amounts of it. For a detailed introduction, see [Simhash](https://algonotes.readthedocs.io/en/latest/Simhash.html).
 
 ### **Simhash Implementation Steps**
 
-*Text preprocessing: convert the text data into a format suitable for Simhash. This may include word segmentation, stop-word removal, stemming, and so on.
-*Generate Simhash fingerprints: apply the Simhash algorithm to the preprocessed text to produce a set of numerical fingerprints. Each fingerprint is a hash value of the text content.
-*Compare fingerprints: identify duplicate or similar records by comparing how similar their hash values are. A key property of Simhash is that texts with small differences still produce highly similar hash values.
-*Set a threshold: choose a similarity threshold; two fingerprints are treated as similar or duplicate records only when their similarity exceeds it.
-*Handle similar records: records flagged as similar can be reviewed manually or merged automatically to eliminate duplicates.
+* Text preprocessing: convert the text data into a format suitable for Simhash. This may include word segmentation, stop-word removal, stemming, and so on.
+* Generate Simhash fingerprints: apply the Simhash algorithm to the preprocessed text to produce a set of numerical fingerprints. Each fingerprint is a hash value of the text content.
+* Compare fingerprints: identify duplicate or similar records by comparing how similar their hash values are. A key property of Simhash is that texts with small differences still produce highly similar hash values.
+* Set a threshold: choose a similarity threshold; two fingerprints are treated as similar or duplicate records only when their similarity exceeds it.
+* Handle similar records: records flagged as similar can be reviewed manually or merged automatically to eliminate duplicates.
 
 ### deduplicate.py Usage
 
-`deduplicate.py` deduplicates the .json data in the model-named folders under datasets (for example: 'datasets/qwen') and writes the deduplicated data to the `datasets/qwen/dedup` folder.
+`deduplicate.py` deduplicates the model-named .json data in datasets (for example: 'datasets/qwen') and writes the deduplicated data to the `datasets/qwen/dedup` folder. The code is in the `datasets/processed` folder.
diff --git a/datasets/README_EN.md b/datasets/README_EN.md
index 82d8b75..d683561 100644
--- a/datasets/README_EN.md
+++ b/datasets/README_EN.md
@@ -29,8 +29,8 @@
 | *Role-play* | scientist | Conversation | 28,400+ |
 | …… | …… | …… | …… |
 
-
 ## Source
+
 **General**:
 * dataset `data` from this repo
 * dataset `data_pro` from this repo
@@ -46,7 +46,23 @@
 * dataset `mother` from this repo
 * dataset `scientist` from this repo
 
-**Dataset Deduplication**:
-Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced via adjusting the threshold.
+## Dataset Deduplication
+
+Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced by adjusting the threshold.
+
+### Simhash Algorithm Introduction
+
+Simhash is an algorithm used to detect similar or duplicate items in large amounts of data. It works by converting text into a set of numerical fingerprints that have a high degree of similarity for similar text. The Simhash algorithm is particularly effective for processing text data, especially when dealing with large amounts of data. Detailed introduction can be found in[Simhash](https://algonotes.readthedocs.io/en/latest/Simhash.html).
+
+### Simhash realization steps
+
+* Text preprocessing: Convert text data into a format suitable for Simhash processing. This may include word segmentation, stop word removal, stemming, etc.
+* Generate Simhash fingerprints: Apply the Simhash algorithm to the preprocessed text to generate a set of numerical fingerprints. Each fingerprint represents a hash of the text content.
+* Compare fingerprints: Identify duplicate or similar records by comparing the similarity of hash values. The characteristic of Simhash is that the generated hash values have a high degree of similarity even when the text has a small amount of difference.
+* Determine threshold: Set a similarity threshold. Only when the similarity of two fingerprints exceeds this threshold, they are considered to represent similar or duplicate records.
+* Process similar records: Records marked as similar can be further manually reviewed or automatically merged to eliminate duplication.
+
+### `deduplicate.py` Usage
+
+`deduplicate.py` is used to deduplicate the .json data named after the model in datasets (for example: 'datasets/qwen'), and output the deduplicated data to the `datasets/qwen/dedup` folder. See the code in the `datasets/processed` folder.
 
-https://algonotes.readthedocs.io/en/latest/Simhash.html
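
Note on the Simhash steps documented in this patch: the preprocessing, fingerprinting, comparison, and threshold steps can be sketched roughly as below. This is a minimal illustration and not the code in `deduplicate.py`. The names `tokenize`, `simhash`, `hamming_distance`, and `dedupe`, the regex tokenizer, the MD5-derived 64-bit token hash, and the default `max_distance=3` are all assumptions made for the example. The README speaks of a similarity threshold; the sketch expresses the same idea as a maximum Hamming distance, where a smaller distance means higher similarity.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase and split on non-word characters.

    Assumption: a stand-in for real preprocessing. Chinese text would
    need a proper word segmenter, since this regex keeps a run of CJK
    characters as a single token.
    """
    return [t for t in re.split(r"\W+", text.lower()) if t]


def simhash(text, bits=64):
    """Return a `bits`-wide Simhash fingerprint of `text` as an int."""
    weights = [0] * bits
    for token in tokenize(text):
        # Hash each token to `bits` bits (MD5 is used only for mixing).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bits with a positive vote total are set in the fingerprint.
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def dedupe(texts, max_distance=3):
    """Keep the first text of every group of near-duplicates."""
    kept, fingerprints = [], []
    for text in texts:
        fp = simhash(text)
        if all(hamming_distance(fp, seen) > max_distance for seen in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```

Raising `max_distance` removes more near-duplicates at the cost of more false matches, which is the trade-off the README describes when it mentions adjusting the threshold. The pairwise comparison in `dedupe` is quadratic in the number of kept records, so a large dataset would likely need an index over fingerprint segments instead.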