OliveSensorAPI/datasets/README_EN.md
2024-03-20 17:52:23 +08:00

48 lines
2.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# EmoLLM's datasets
* Category of dataset: **General** and **Role-play**
* Type of data: **QA** and **Conversation**
* Summary: General(**6 datasets**), Role-play(**3 datasets**)
## Category
* **General**: generic dataset, including psychological Knowledge, counseling technology, etc.
* **Role-play**: role-playing dataset, including character-specific conversation style data, etc.
## Type
* **QA**: question-and-answer pair
* **Conversation**: multi-turn consultation dialogue
## Summary
| Category | Dataset | Type | Total |
| :---------: | :-------------------: | :----------: | :-----: |
| *General* | data | Conversation | 5600+ |
| *General* | data_pro | Conversation | 36500+ |
| *General* | multi_turn_dataset_1 | Conversation | 36,000+ |
| *General* | multi_turn_dataset_2 | Conversation | 27,000+ |
| *General* | single_turn_dataset_1 | QA | 14000+ |
| *General* | single_turn_dataset_2 | QA | 18300+ |
| *Role-play* | aiwei | Conversation | 4000+ |
| *Role-play* | SoulStar | QA | 11200+ |
| *Role-play* | tiangou | Conversation | 3900+ |
| …… | …… | …… | …… |
## Source
**General**
* dataset `data` from this repo
* dataset `data_pro` from this repo
* dataset `multi_turn_dataset_1` from [Smile](https://github.com/qiuhuachuan/smile)
* dataset `multi_turn_dataset_2` from [CPsyCounD](https://github.com/CAS-SIAT-XinHai/CPsyCoun)
* dataset `single_turn_dataset_1` from this repo
* dataset `single_turn_dataset_2` from this repo
**Role-play**
* dataset `aiwei` from this repo
* dataset `tiangou` from this repo
* dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)
**Dataset Deduplication**
Combine absolute matching with fuzzy matching (Simhash) algorithms to deduplicate the dataset, thereby enhancing the effectiveness of the fine-tuning model. While ensuring the high quality of the dataset, the risk of losing important data due to incorrect matches can be reduced via adjusting the threshold.
https://algonotes.readthedocs.io/en/latest/Simhash.html