xzw a12a7ef107 add base model qlora fintuning config file and optimize deduplicate.py (#128 )		2024-03-23 19:20:17 +08:00
processed	Update process_merge.py	2024-03-21 16:07:18 +09:00
aiwei.json	feat：Add new finetune configurations and datasets	2024-02-23 11:36:58 +08:00
data_pro.json	feat：Add new finetune configurations and datasets	2024-02-23 11:36:58 +08:00
data.json	update data.json (delete 4 empty data)	2024-03-21 15:56:54 +09:00
deduplicate.py	optimize deduplicate.py	2024-03-23 15:24:45 +09:00
mother.json	Add files via upload	2024-03-22 15:13:30 -07:00
multi_turn_dataset_1.json	upload smile.dataset	2024-02-28 17:44:48 +08:00
multi_turn_dataset_2.json	Add files via upload	2024-02-28 21:18:02 +08:00
README_EN.md	[DOC] update datasets/README_EN.md	2024-03-20 17:52:23 +08:00
README.md	[DOC]update datesets/README.md	2024-03-21 08:24:15 +08:00
scientist.json	1111	2024-03-20 23:25:07 +08:00
single_turn_dataset_1.json	Upload datasets	2024-02-27 22:01:53 +08:00
single_turn_dataset_2.json	Upload datasets	2024-02-27 22:01:53 +08:00
SoulStar_data.json	add SoulStar_data	2024-03-03 17:28:26 +08:00
tiangou.json	feat：Add new finetune configurations and datasets	2024-02-24 22:39:10 +08:00

README_EN.md

EmoLLM's datasets

Category of dataset: General and Role-play
Type of data: QA and Conversation
Summary: General (6 datasets), Role-play (3 datasets)

Type

QA: question-and-answer pair
Conversation: multi-turn consultation dialogue
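
As a rough illustration of the two types, the sketch below models a sample as a list of turns, where a QA sample has exactly one turn and a Conversation sample has several. The class and field names (`Turn`, `Sample`, `user`, `assistant`) are hypothetical and chosen only for this example; they are not the actual schema of the JSON files in this directory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    user: str       # hypothetical field: what the help-seeker says
    assistant: str  # hypothetical field: what the counselor/model replies


@dataclass
class Sample:
    turns: List[Turn]

    def __post_init__(self) -> None:
        # A sample without any turn is neither QA nor Conversation.
        if not self.turns:
            raise ValueError("a sample needs at least one turn")

    @property
    def kind(self) -> str:
        # One turn -> question-and-answer pair; more -> multi-turn dialogue.
        return "QA" if len(self.turns) == 1 else "Conversation"


qa = Sample([Turn("<question>", "<answer>")])
dialogue = Sample([
    Turn("<first message>", "<first reply>"),
    Turn("<follow-up>", "<second reply>"),
])
print(qa.kind, dialogue.kind)
```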

Summary

Category	Dataset	Type	Total
General	data	Conversation	5,600+
General	data_pro	Conversation	36,500+
General	multi_turn_dataset_1	Conversation	36,000+
General	multi_turn_dataset_2	Conversation	27,000+
General	single_turn_dataset_1	QA	14,000+
General	single_turn_dataset_2	QA	18,300+
Role-play	aiwei	Conversation	4,000+
Role-play	SoulStar	QA	11,200+
Role-play	tiangou	Conversation	3,900+
……	……	……	……

Source

General:

dataset data from this repo
dataset data_pro from this repo
dataset multi_turn_dataset_1 from Smile
dataset multi_turn_dataset_2 from CPsyCounD
dataset single_turn_dataset_1 from this repo
dataset single_turn_dataset_2 from this repo

Role-play:

dataset aiwei from this repo
dataset tiangou from this repo
dataset SoulStar from SoulStar

Dataset Deduplication: exact matching is combined with fuzzy matching (the SimHash algorithm) to deduplicate the dataset, which improves the effectiveness of fine-tuning. The similarity threshold can be adjusted to reduce the risk of dropping important data through false matches while keeping the dataset quality high.

https://algonotes.readthedocs.io/en/latest/Simhash.html
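
The combination of exact and fuzzy matching described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in deduplicate.py: the function names, the character 3-gram features, the MD5-based feature hash, and the default `threshold=3` are all assumptions made for the example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64, ngram: int = 3) -> int:
    """Compute a SimHash fingerprint from character n-grams (assumed features)."""
    cleaned = re.sub(r"\s+", "", text.lower())
    count = max(len(cleaned) - ngram + 1, 1)
    features = [cleaned[i:i + ngram] for i in range(count)]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for bit in range(bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts, threshold: int = 3):
    """Drop exact duplicates first, then near duplicates by SimHash distance."""
    seen_exact = set()
    kept, fingerprints = [], []
    for text in texts:
        key = text.strip()
        if key in seen_exact:  # exact match
            continue
        seen_exact.add(key)
        fp = simhash(key)
        # Fuzzy match: close to the fingerprint of any text already kept.
        if any(hamming_distance(fp, other) <= threshold for other in fingerprints):
            continue
        kept.append(text)
        fingerprints.append(fp)
    return kept
```

A larger `threshold` removes more near duplicates but raises the chance of discarding samples that are merely similar; a smaller one keeps more data at the cost of leaving some near duplicates in. The pairwise comparison here is quadratic in the number of kept samples, so it only suits small inputs.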
