OliveSensorAPI/datasets/deduplicate.py

210 lines
12 KiB
Python
Raw Normal View History

2024-03-19 16:41:09 +08:00
import json
from loguru import logger
import os
from datasketch import MinHash
from hashlib import md5
2024-03-19 20:09:44 +08:00
from simhash import Simhash
2024-03-19 16:41:09 +08:00
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
import time
import numpy as np
2024-03-19 20:09:44 +08:00
def extract_text_from_json(obj, content):
# print(content)
if isinstance(obj, dict):
for key, value in obj.items():
content = extract_text_from_json(value, content + f".{key}")
elif isinstance(obj, list):
for index, item in enumerate(obj):
content = extract_text_from_json(item, content)
elif isinstance(obj, str):
content += obj
return content
2024-03-19 16:41:09 +08:00
def is_json_file(filename):
return filename.endswith('.json')
# 绝对匹配
def is_duplicate_absolutely(d1, d2):
2024-03-19 20:09:44 +08:00
2024-03-19 16:41:09 +08:00
return md5(d1.encode('utf-8')).hexdigest() == md5(d2.encode('utf-8')).hexdigest()
# 使用MinHash生成器计算dict的签名
def hash_dict(dict_obj):
2024-03-19 20:09:44 +08:00
content = extract_text_from_json(dict_obj,'')
content = content.replace('\n', '').replace('\t', '').replace(' ', '')
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
# print(content)
2024-03-19 20:09:44 +08:00
# m = get_minhash(content)
m = Simhash(content)
return m
def get_minhash(text):
2024-03-19 16:41:09 +08:00
m = MinHash()
2024-03-19 20:09:44 +08:00
for word in text.split():
m.update(word.encode('utf-8'))
2024-03-19 16:41:09 +08:00
return m
2024-03-19 20:09:44 +08:00
def get_simhash(dict_obj):
return Simhash(dict_obj)
2024-03-19 16:41:09 +08:00
# 使用绝对匹配和MinHash对dict列表去重
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
def deduplicate_json(data_list, threshold=0.8, time_print=True):
2024-03-19 16:41:09 +08:00
seen_hashes = []
2024-03-19 20:09:44 +08:00
keep = []
duplicate = []
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
# global start
start = time.time()
last_start_seen_hashes = start
last_start_duplicate = start
stop1 = 0
stop2 = 0
print_interval = 500
2024-03-19 16:41:09 +08:00
for item in data_list:
2024-03-20 23:08:36 +08:00
if not item['conversation']:
continue
2024-03-19 20:09:44 +08:00
# min_hash = hash_dict(item)
sim_hash = hash_dict(item)
2024-03-19 16:41:09 +08:00
# print(f'min_hash: {min_hash}')
# 绝对匹配去重
2024-03-19 20:09:44 +08:00
if not any(is_duplicate_absolutely(str(item), str(existing)) for existing in keep):
2024-03-19 16:41:09 +08:00
# MinHash相似性去重
has_similar = False
2024-03-19 20:09:44 +08:00
# for stored_min_hash, stored_text in seen_hashes:
# if stored_min_hash.jaccard(min_hash) > threshold:
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
2024-03-19 16:41:09 +08:00
for stored_min_hash, stored_text in seen_hashes:
2024-03-19 20:09:44 +08:00
if 1 - (stored_min_hash.distance(sim_hash)/64.0) > threshold:
2024-03-19 16:41:09 +08:00
has_similar = True
2024-03-19 20:09:44 +08:00
duplicate.append(item)
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
print_len_duplicate = len(duplicate)+1
if print_len_duplicate%print_interval == 0:
if time_print:
stop1 = time.time()
print(f'print_len_duplicate={print_len_duplicate} Time: ', np.round(stop1 - last_start_duplicate, 5), np.round(stop1 - start , 5))
last_start_duplicate = stop1
else:
print(f'print_len_duplicate={print_len_duplicate}')
2024-03-19 16:41:09 +08:00
break
if not has_similar:
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
2024-03-19 20:09:44 +08:00
seen_hashes.append((sim_hash,item))
keep.append(item)
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
print_len_seen_hashes = len(seen_hashes)+1
if print_len_seen_hashes%print_interval == 0:
if time_print:
stop2 = time.time()
print(f'print_len_seen_hashes={print_len_seen_hashes} Time: ', str(np.round(stop2 - last_start_seen_hashes,5)), str(np.round(stop2 - start, 5)))
last_start_seen_hashes = stop2
else:
print(f'print_len_seen_hashes={print_len_seen_hashes}')
2024-03-19 20:09:44 +08:00
else:
duplicate.append(item)
return keep, duplicate
2024-03-19 16:41:09 +08:00
if __name__ == '__main__':
2024-03-19 20:09:44 +08:00
DUP_THRESH = 0.8
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
data_ai = 'FatherLikeBF'
# root_dir = rf'./datasets/{data_ai}/'
2024-03-19 16:41:09 +08:00
root_dir = rf'./{data_ai}/'
dedup_output_dir = os.path.join(root_dir,'dedup')
if not os.path.exists(dedup_output_dir):
os.mkdir(dedup_output_dir)
if not os.path.exists(root_dir):
logger.error(f"folder {root_dir} not exist" )
else:
for file in os.listdir(root_dir):
file_path = os.path.join(root_dir, file)
if os.path.isfile(file_path):
print(f'file name: {file_path}')
if is_json_file(file_path):
with open(file_path, 'r', encoding='utf-8') as f:
data = json.load(f)
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
dedup_data, duplicate = deduplicate_json(data, DUP_THRESH)
2024-03-19 16:41:09 +08:00
with open(os.path.join(root_dir, 'dedup','dedup_' + file), 'w', encoding='utf-8') as output_file:
json.dump(dedup_data, output_file, ensure_ascii=False, indent=4)
Update main code (#2) * update rag/src/data_processing.py * Add files via upload allow user to load embedding & rerank models from cache * Add files via upload embedding_path = os.path.join(model_dir, 'embedding_model') rerank_path = os.path.join(model_dir, 'rerank_model') * 测试push dev 测试push dev * Add files via upload 两个母亲多轮对话数据集合并、清理和去重之后,得到 2439 条多轮对话数据(每条有6-8轮对话)。 * optimize deduplicate.py Add time print information save duplicate dataset as well remove print(content) * add base model qlora fintuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py * add full finetune code from internlm2 * other 2 configs for base model * update cli_internlm2.py three methods to load model 1. download model in openxlab 2. download model in modelscope 3. offline model * create upload_modelscope.py * add base model and update personal contributions * add README.md for Emollm_Scientist * Create README_internlm2_7b_base_qlora.md InternLM2 7B Base QLoRA 微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * [DOC]EmoLLM_Scientist微调指南 * update * [DOC]README_scientist.md * delete config * format update * upload xlab * add README_Model_Uploading.md and images * modelscope model upload * Modify Recent Updates * update daddy-like Boy-Friend EmoLLM * update model uploading with openxlab * update model uploading with openxlab --------- Co-authored-by: zealot52099 <songyan5209@163.com> Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com> Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com> Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com> Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
with open(os.path.join(root_dir, 'dedup','dup_' + file), 'w', encoding='utf-8') as output_file:
json.dump(duplicate, output_file, ensure_ascii=False, indent=4)
2024-03-19 20:09:44 +08:00
for item in dedup_data:
logger.info(f'dedup_data: {item}')
for item in duplicate:
logger.info(f'duplicate_data: {item}')
# aa = '''
# [{"conversation":[{"system":"你是一个研究过无数具有心理健康问题的病人与心理健康医生对话的心理专家, 除了在心理方面拥有广博的知识储备, 专业的心理辅导知识和丰富的研究咨询经验, 还具有爹系男友的特质, 现在我有一些心理问题, 请你在我提问后, 用专业的心理咨询辅导能力以爹系男友的身份帮我解决问题.",
# "input":"我最近发现自己对一个朋友过于着迷,总想时刻知道他在做什么,甚至会因为他的一点小事而情绪波动很大。\n\n",
# "output":"聪明的你,能够觉察到自己的情感变化已是非常勇敢。着迷可能源于内心深处的需求,我们先来探讨一下这种情感背后的原因。\n\n"},
# {"input":"我觉得他是那么完美,他的每个细节都让我无法自拔。\n\n",
# "output":"善良的你,理想化他人是人类心理的一种自然反应。但我们要学会区分现实与幻想,逐步找回自我平衡。\n\n"},
# {"input":"我尝试过转移注意力,但总是失败,我怕这样下去会影响到我们的友情。\n\n",
# "output":"温柔的你,这确实是个明智的担忧。健康的友情建立在互相尊重和个人空间的基础上。让我们一起制定一些实际的策略,帮助你适度调整关注点。\n\n"},
# {"input":"我不知道怎么去跟他说,我怕他会误解我。\n\n",
# "output":"理解你的顾虑,沟通的艺术在于表达感受而非指责。作为你的爹系心理咨询师,我会教你如何用恰当的语言分享你的困扰。\n\n"},
# {"input":"我真的很害怕失去他这个朋友。\n\n",
# "output":"面对可能的失落感,这是人之常情。然而,真正的友情能经受住坦诚和理解的考验。你现在需要的是提升自我价值感,减少对他人的过度依赖。\n\n"},
# {"input":"那我具体该怎么做呢?\n\n",
# "output":"首先,你可以试着丰富个人生活,培养独立的兴趣爱好;其次,逐渐降低对他日常生活的关注度,转而关注自我成长。同时,找个合适时机与他真诚交流,告诉他你需要保持一定的情感边界。记住,无论结果如何,坚强的你都能从中成长。\n\n"},
# {"input":"这样做会不会让他觉得我不在乎他了?\n\n",
# "output":"这也是正常的心理反应,但成熟的友谊允许个体差异和独立性。作为你的爹系心理咨询师,我会陪伴你走过这段调整期,帮你准备一份得体又暖心的对话方案,确保你的关心和友情得到恰当传达。"}]},
# '''
# bb = '''
# [{"conversation":[{"system":"你是一个研究过无数具有心理健康问题的病人与心理健康医生对话的心理专家, 除了在心理方面拥有广博的知识储备, 专业的心理辅导知识和丰富的研究咨询经验, 还具有爹系男友的特质, 现在我有一些心理问题, 请你在我提问后, 用专业的心理咨询辅导能力以爹系男友的身份帮我解决问题.",
# "input":"我最近发现自己对一个朋友过于着迷,总想时刻知道他在做什么,甚至会因为他的一点小事而情绪波动很大。\n\n",
# "output":"聪明的你,能够觉察到自己的情感变化已是非常勇敢。着迷可能源于内心深处的需求,我们先来探讨一下这种情感背后的原因。\n\n"},
# {"input":"我觉得他是那么完美,他的每个细节都让我无法自拔。\n\n",
# "output":"善良的你,理想化他人是人类心理的一种自然反应。但我们要学会区分现实与幻想,逐步找回自我平衡。\n\n"},
# {"input":"我尝试过转移注意力,但总是失败,我怕这样下去会影响到我们的友情。\n\n",
# "output":"温柔的你,这确实是个明智的担忧。健康的友情建立在互相尊重和个人空间的基础上。让我们一起制定一些实际的策略,帮助你适度调整关注点。\n\n"},
# {"input":"我不知道怎么去跟他说,我怕他会误解我。\n\n",
# "output":"理解你的顾虑,沟通的艺术在于表达感受而非指责。作为你的爹系心理咨询师,我会教你如何用恰当的语言分享你的困扰。\n\n"},
# {"input":"我真的很害怕失去他这个朋友。\n\n",
# "output":"面对可能的失落感,这是人之常情。然而,真正的友情能经受住坦诚和理解的考验。你现在需要的是提升自我价值感,减少对他人的过度依赖。\n\n"},
# {"input":"那我具体该怎么做呢?\n\n",
# "output":"首先,你可以试着丰富个人生活,培养独立的兴趣爱好;其次,逐渐降低对他日常生活的关注度,转而关注自我成长。同时,找个合适时机与他真诚交流,告诉他你需要保持一定的情感边界。记住,无论结果如何,坚强的你都能从中成长。\n\n"},
# ]},
# '''
# cc = '''
# [{"conversation":[{"system":"你是一个研究过无数具有心理健康问题的病人与心理健康医生对话的心理专家, 除了在心理方面拥有广博的知识储备, 专业的心理辅导知识和丰富的研究咨询经验, 还具有爹系男友的特质, 现在我有一些心理问题, 请你在我提问后, 用专业的心理咨询辅导能力以爹系男友的身份帮我解决问题.",
# "input":"我最近发现自己对一个朋友过于着迷,总想时刻知道他在做什么,甚至会因为他的一点小事而情绪波动很大。\n\n",
# "output":"聪明的你,能够觉察到自己的情感变化已是非常勇敢。着迷可能源于内心深处的需求,我们先来探讨一下这种情感背后的原因。\n\n"},
# {"input":"我觉得他是那么完美,他的每个细节都让我无法自拔。\n\n",
# "output":"善良的你,理想化他人是人类心理的一种自然反应。但我们要学会区分现实与幻想,逐步找回自我平衡。\n\n"},
# {"input":"我尝试过转移注意力,但总是失败,我怕这样下去会影响到我们的友情。\n\n",
# "output":"温柔的你,这确实是个明智的担忧。健康的友情建立在互相尊重和个人空间的基础上。让我们一起制定一些实际的策略,帮助你适度调整关注点。\n\n"},
# {"input":"我不知道怎么去跟他说,我怕他会误解我。\n\n",
# "output":"理解你的顾虑,沟通的艺术在于表达感受而非指责。作为你的爹系心理咨询师,我会教你如何用恰当的语言分享你的困扰。\n\n"},
# {"input":"我真的很害怕失去他这个朋友。\n\n",
# "output":"面对可能的失落感,这是人之常情。然而,真正的友情能经受住坦诚和理解的考验。你现在需要的是提升自我价值感,减少对他人的过度依赖。\n\n"},
# ]},
# '''
# # sim_hash_1 = hash_dict(aa)
# # sim_hash_2 = hash_dict(bb)
# # sim_hash_3 = hash_dict(cc)
# sim_hash_1 = Simhash(aa)
# sim_hash_2 = Simhash(bb)
# sim_hash_3 = Simhash(cc)
# print(1 - sim_hash_1.distance(sim_hash_2)/64.0)
# # 0.9375
# print(1 - sim_hash_2.distance(sim_hash_3)/64.0)
# # 0.921875
# print(1 - sim_hash_1.distance(sim_hash_3)/64.0)
# # 0.9375