History

MING_X 64aaa2442a Update README.md		2024-04-21 17:34:39 +08:00
processed	Add files via upload	2024-04-20 21:09:34 +09:00
aiwei.json	feat: Add new finetune configurations and datasets	2024-02-23 11:36:58 +08:00
data_pro.json	feat: Add new finetune configurations and datasets	2024-02-23 11:36:58 +08:00
data.json	update data.json (delete 4 empty data)	2024-03-21 15:56:54 +09:00
LICENSE	test push dev	2024-03-22 20:45:13 +08:00
mother_v1.json	Add files via upload	2024-04-09 23:05:22 +08:00
mother_v2.json	Add files via upload	2024-04-09 23:05:22 +08:00
multi_turn_dataset_1.json	upload smile.dataset	2024-02-28 17:44:48 +08:00
multi_turn_dataset_2.json	Add files via upload	2024-02-28 21:18:02 +08:00
README_EN.md	Update README_EN.md	2024-04-21 17:33:33 +08:00
README.md	Update README.md	2024-04-21 17:34:39 +08:00
scientist.json	1111	2024-03-20 23:25:07 +08:00
self_cognition_EmoLLM.json	Add files via upload	2024-04-20 21:08:48 +09:00
single_turn_dataset_1.json	Upload datasets	2024-02-27 22:01:53 +08:00
single_turn_dataset_2.json	Upload datasets	2024-02-27 22:01:53 +08:00
SoulStar_data.json	add SoulStar_data	2024-03-03 17:28:26 +08:00
tiangou.json	feat: Add new finetune configurations and datasets	2024-02-24 22:39:10 +08:00

README_EN.md

EmoLLM's datasets

Category of dataset: General and Role-play
Type of data: QA and Conversation
Summary: General (6 datasets), Role-play (5 datasets)

Type

QA: question-and-answer pair
Conversation: multi-turn consultation dialogue

Summary

Category	Dataset	Type	Total
General	data	Conversation	5,600+
General	data_pro	Conversation	36,500+
General	multi_turn_dataset_1	Conversation	36,000+
General	multi_turn_dataset_2	Conversation	27,000+
General	single_turn_dataset_1	QA	14,000+
General	single_turn_dataset_2	QA	18,300+
Role-play	aiwei	Conversation	4,000+
Role-play	SoulStar	QA	11,200+
Role-play	tiangou	Conversation	3,900+
Role-play	mother	Conversation	40,300+
Role-play	scientist	Conversation	28,400+
……	……	……	……

Source

General:

dataset data from this repo
dataset data_pro from this repo
dataset multi_turn_dataset_1 from Smile
dataset multi_turn_dataset_2 from CPsyCounD
dataset single_turn_dataset_1 from this repo
dataset single_turn_dataset_2 from this repo

Role-play:

dataset aiwei from this repo
dataset tiangou from this repo
dataset SoulStar from SoulStar
dataset mother from this repo
dataset scientist from this repo

Dataset Deduplication: The datasets are deduplicated by combining exact matching with fuzzy matching (SimHash), which improves the effectiveness of the fine-tuned model. Adjusting the similarity threshold keeps dataset quality high while reducing the risk of discarding important data through incorrect matches.
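
The two-stage deduplication described above could look roughly like the following sketch. It is not the script used in this repo: the function names, the character 3-gram features, the MD5-based 64-bit fingerprint, and the default threshold of 3 differing bits are all assumptions made for illustration. Each record is serialized to a string, exact duplicates are dropped first, and a record is then treated as a near-duplicate when its SimHash fingerprint lies within `threshold` bits (Hamming distance) of one already kept.

```python
import hashlib
import json


def simhash(text: str, ngram: int = 3, bits: int = 64) -> int:
    """Return a SimHash fingerprint built from character n-grams (illustrative)."""
    # Texts shorter than `ngram` yield a single feature: the whole text.
    grams = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive overall.
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(records, threshold: int = 3):
    """Drop exact duplicates, then near-duplicates within `threshold` bits.

    `threshold` is a hypothetical tuning knob: a smaller value keeps more
    records (fewer false matches), a larger value removes more aggressively.
    """
    seen_exact = set()
    kept, fingerprints = [], []
    for rec in records:
        text = json.dumps(rec, ensure_ascii=False, sort_keys=True)
        key = hashlib.md5(text.encode("utf-8")).hexdigest()
        if key in seen_exact:  # stage 1: exact match
            continue
        seen_exact.add(key)
        fp = simhash(text)
        # stage 2: fuzzy match against every fingerprint kept so far
        if any(hamming(fp, other) <= threshold for other in fingerprints):
            continue
        fingerprints.append(fp)
        kept.append(rec)
    return kept
```

Calling `deduplicate(json.load(f))` on a loaded dataset file would return the filtered list. The pairwise comparison in stage 2 is quadratic in the number of kept records, so for datasets of the sizes listed in the summary table an index over fingerprint segments would likely be needed in practice.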
