# EmoLLM's datasets

- Category of dataset: General and Role-play
- Type of data: QA and Conversation
- Summary: General (6 datasets), Role-play (3 datasets)
## Category

- General: generic dataset, including psychological knowledge, counseling techniques, etc.
- Role-play: role-playing dataset, including character-specific conversation style data, etc.
## Type

- QA: question-and-answer pairs
- Conversation: multi-turn consultation dialogues
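The two record shapes above can be illustrated with a short sketch. The field names below (`question`, `answer`, `conversation`, `role`, `content`) are assumptions for illustration only; the actual JSON schemas used by the datasets in this directory may differ.

```python
# Illustrative only: field names are hypothetical, not the repo's actual schema.

# A QA record is a single question-and-answer pair.
qa_record = {
    "question": "What is cognitive behavioral therapy?",
    "answer": "CBT is a structured, goal-oriented form of psychotherapy.",
}

# A Conversation record holds multiple alternating turns of a consultation.
conversation_record = {
    "conversation": [
        {"role": "user", "content": "I've been feeling anxious lately."},
        {"role": "assistant", "content": "I'm sorry to hear that. What's been on your mind?"},
        {"role": "user", "content": "Mostly pressure at work."},
        {"role": "assistant", "content": "Work stress is very common. Let's explore it together."},
    ],
}

def is_multi_turn(record):
    """Return True for a Conversation record (a list of more than one turn)."""
    return len(record.get("conversation", [])) > 1
```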
## Summary

Category | Dataset | Type | Total |
---|---|---|---|
General | data | Conversation | 5,600+ |
General | data_pro | Conversation | 36,500+ |
General | multi_turn_dataset_1 | Conversation | 36,000+ |
General | multi_turn_dataset_2 | Conversation | 27,000+ |
General | single_turn_dataset_1 | QA | 14,000+ |
General | single_turn_dataset_2 | QA | 18,300+ |
Role-play | aiwei | Conversation | 4,000+ |
Role-play | SoulStar | QA | 11,200+ |
Role-play | tiangou | Conversation | 3,900+ |
…… | …… | …… | …… |
## Source

General:

- dataset `data` from this repo
- dataset `data_pro` from this repo
- dataset `multi_turn_dataset_1` from Smile
- dataset `multi_turn_dataset_2` from CPsyCounD
- dataset `single_turn_dataset_1` from this repo
- dataset `single_turn_dataset_2` from this repo

Role-play:

- dataset `aiwei` from this repo
- dataset `tiangou` from this repo
- dataset `SoulStar` from SoulStar
## Dataset Deduplication

Exact (absolute) matching is combined with fuzzy matching (Simhash) to deduplicate the dataset, improving the effectiveness of the fine-tuned model. While keeping dataset quality high, the risk of discarding important data through false matches can be reduced by tuning the Simhash distance threshold.
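The combined exact + Simhash approach can be sketched as follows. This is a minimal illustration, not the repository's actual `deduplicate.py`; the tokenization (whitespace split), hash function (MD5), and the default Hamming-distance threshold of 3 are all assumptions chosen for the sketch.

```python
import hashlib

def simhash(text, bits=64):
    """Compute a Simhash fingerprint: sum per-bit votes from token hashes."""
    votes = [0] * bits
    for token in text.split():  # assumed tokenization: whitespace split
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def deduplicate(texts, threshold=3):
    """Keep a text only if it is neither an exact duplicate nor within
    `threshold` Hamming distance (fuzzy duplicate) of any kept text."""
    kept, seen_exact, seen_hashes = [], set(), []
    for text in texts:
        if text in seen_exact:  # absolute (exact) match
            continue
        fp = simhash(text)
        if any(hamming(fp, s) <= threshold for s in seen_hashes):  # fuzzy match
            continue
        seen_exact.add(text)
        seen_hashes.append(fp)
        kept.append(text)
    return kept
```

Raising `threshold` removes more near-duplicates but increases the chance of dropping genuinely distinct samples, which is the trade-off the paragraph above describes.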