[DOC] Update README.md in datasets and evaluate (#172)
commit 6e41cba3d9
@@ -2,7 +2,7 @@

 * Datasets fall into two categories by purpose: **General** and **Role-play**
 * Data comes in two formats: **QA** and **Conversation**
-* Data summary: General (**6 datasets**); Role-play (**3 datasets**)
+* Data summary: General (**6 datasets**); Role-play (**5 datasets**)

 ## Dataset Types

@@ -27,6 +27,8 @@
 | *Role-play* | aiwei | Conversation | 4000+ |
 | *Role-play* | SoulStar | QA | 11200+ |
 | *Role-play* | tiangou | Conversation | 3900+ |
+| *Role-play* | mother | Conversation | 24,500+ |
+| *Role-play* | scientist | Conversation | 28,400+ |
 | …… | …… | …… | …… |

 ## Dataset Sources
@@ -45,6 +47,8 @@
 * Dataset aiwei comes from this project
 * Dataset tiangou comes from this project
 * Dataset SoulStar comes from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* Dataset mother comes from this project
+* Dataset scientist comes from this project

 ## Dataset Deduplication

@@ -2,7 +2,7 @@

 * Category of dataset: **General** and **Role-play**
 * Type of data: **QA** and **Conversation**
-* Summary: General (**6 datasets**), Role-play (**3 datasets**)
+* Summary: General (**6 datasets**), Role-play (**5 datasets**)

 ## Category
 * **General**: generic dataset, including psychological knowledge, counseling techniques, etc.
@@ -25,6 +25,8 @@
 | *Role-play* | aiwei | Conversation | 4000+ |
 | *Role-play* | SoulStar | QA | 11200+ |
 | *Role-play* | tiangou | Conversation | 3900+ |
+| *Role-play* | mother | Conversation | 24,500+ |
+| *Role-play* | scientist | Conversation | 28,400+ |
 | …… | …… | …… | …… |


@@ -41,8 +43,10 @@
 * dataset `aiwei` from this repo
 * dataset `tiangou` from this repo
 * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* dataset `mother` from this repo
+* dataset `scientist` from this repo

 **Dataset Deduplication**:
 The dataset is deduplicated by combining exact matching with fuzzy matching (SimHash), which improves the effectiveness of the fine-tuned model. Adjusting the similarity threshold keeps dataset quality high while reducing the risk of dropping important data through false matches.

-https://algonotes.readthedocs.io/en/latest/Simhash.html
+https://algonotes.readthedocs.io/en/latest/Simhash.html

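For reviewers unfamiliar with the deduplication step described in the README hunk above, here is a minimal Python sketch of exact matching followed by SimHash fuzzy matching. It is not the project's actual script: the function names (`simhash`, `hamming`, `deduplicate`), the character 3-gram features, the 64-bit MD5-derived fingerprint, and the default threshold of 3 bits are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterable, List


def simhash(text: str, bits: int = 64, ngram: int = 3) -> int:
    """Return a SimHash fingerprint built from character n-grams (assumed feature choice)."""
    # Character n-grams avoid needing a tokenizer for Chinese or English text.
    grams = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(samples: Iterable[str], threshold: int = 3) -> List[str]:
    """Drop exact duplicates first, then near-duplicates within `threshold` bits."""
    seen_exact = set()
    kept_hashes: List[int] = []
    kept: List[str] = []
    for sample in samples:
        key = sample.strip()
        if key in seen_exact:  # exact (absolute) match
            continue
        seen_exact.add(key)
        fp = simhash(key)
        # Fuzzy match: a small Hamming distance means the texts are near-identical.
        if any(hamming(fp, other) <= threshold for other in kept_hashes):
            continue
        kept_hashes.append(fp)
        kept.append(sample)
    return kept
```

Lowering `threshold` keeps more samples and reduces the chance of discarding distinct data on a false match; raising it removes more near-duplicates. The linear scan over `kept_hashes` is quadratic overall, so a large corpus would need an index over fingerprint segments instead.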
@@ -20,4 +20,4 @@
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
 | InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |
-
+| InternLM2_20B_chat_lora | 1.42 | 2.39 | 2.22 | 1.00 |

@@ -19,3 +19,5 @@
 | Model | Comprehensiveness | Professionalism | Authenticity | Safety |
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
+| InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |
+| InternLM2_20B_chat_lora | 1.42 | 2.39 | 2.22 | 1.00 |