[merge] merge new docs from dev bench (#173)

MING_X 2024-04-09 22:50:32 +08:00 committed by GitHub
commit 7a19c513a1
4 changed files with 14 additions and 4 deletions

@@ -2,7 +2,7 @@
 * Datasets fall into two categories by purpose: **General** and **Role-play**
 * Data comes in two formats: **QA** and **Conversation**
-* Summary: General (**6 datasets**), Role-play (**3 datasets**)
+* Summary: General (**6 datasets**), Role-play (**5 datasets**)
 ## Dataset Categories
@@ -27,6 +27,8 @@
 | *Role-play* | aiwei | Conversation | 4000+ |
 | *Role-play* | SoulStar | QA | 11200+ |
 | *Role-play* | tiangou | Conversation | 3900+ |
+| *Role-play* | mother | Conversation | 24,500+ |
+| *Role-play* | scientist | Conversation | 28,400+ |
 | …… | …… | …… | …… |
 ## Dataset Sources
@@ -45,6 +47,8 @@
 * Dataset aiwei comes from this project
 * Dataset tiangou comes from this project
 * Dataset SoulStar comes from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* Dataset mother comes from this project
+* Dataset scientist comes from this project
 ## Dataset Deduplication

@@ -2,7 +2,7 @@
 * Category of dataset: **General** and **Role-play**
 * Type of data: **QA** and **Conversation**
-* Summary: General(**6 datasets**), Role-play(**3 datasets**)
+* Summary: General(**6 datasets**), Role-play(**5 datasets**)
 ## Category
 * **General**: generic dataset, including psychological knowledge, counseling techniques, etc.
@@ -25,6 +25,8 @@
 | *Role-play* | aiwei | Conversation | 4000+ |
 | *Role-play* | SoulStar | QA | 11200+ |
 | *Role-play* | tiangou | Conversation | 3900+ |
+| *Role-play* | mother | Conversation | 24,500+ |
+| *Role-play* | scientist | Conversation | 28,400+ |
 | …… | …… | …… | …… |
@@ -41,6 +43,8 @@
 * dataset `aiwei` from this repo
 * dataset `tiangou` from this repo
 * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar)
+* dataset `mother` from this repo
+* dataset `scientist` from this repo
 **Dataset Deduplication**
 Exact matching is combined with fuzzy matching (Simhash) to deduplicate the dataset, which improves the effectiveness of the fine-tuned model. Adjusting the similarity threshold reduces the risk of dropping important data through false matches while keeping dataset quality high.
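
The deduplication paragraph above names two stages, exact matching and Simhash-based fuzzy matching with an adjustable threshold. The following is a minimal sketch of that idea, not the repository's actual implementation (which this diff does not show). The function names `simhash`, `hamming_distance`, and `deduplicate`, the character-bigram tokenization, and the default threshold of 3 bits are all assumptions made for illustration.

```python
import hashlib


def simhash(text: str) -> int:
    """Return a 64-bit Simhash fingerprint built from character bigrams.

    Character bigrams are an assumption chosen so the sketch works for both
    Chinese and English text without a tokenizer.
    """
    tokens = [text[i:i + 2] for i in range(max(len(text) - 1, 1))]
    weights = [0] * 64
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for bit in range(64):
            weights[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(records, threshold: int = 3):
    """Keep the first occurrence of each record.

    Stage 1 drops exact duplicates after whitespace normalization.
    Stage 2 drops records whose fingerprint is within `threshold` bits of an
    already kept record. A smaller threshold removes fewer records.
    """
    seen_exact = set()
    kept_fingerprints = []
    kept = []
    for text in records:
        normalized = " ".join(text.split())
        if normalized in seen_exact:
            continue  # exact match
        fingerprint = simhash(normalized)
        if any(hamming_distance(fingerprint, other) <= threshold
               for other in kept_fingerprints):
            continue  # fuzzy match
        seen_exact.add(normalized)
        kept_fingerprints.append(fingerprint)
        kept.append(text)
    return kept
```

Lowering `threshold` makes the fuzzy stage stricter about what counts as a duplicate, which is the trade-off the paragraph describes: fewer false matches at the cost of letting more near-duplicates through. The linear scan over kept fingerprints is quadratic in the worst case, so a larger dataset would likely need an index over fingerprint segments.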

@@ -20,4 +20,4 @@
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
 | InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |
| InternLM2_20B_chat_lora | 1.42 | 2.39 | 2.22 | 1.00 |

@@ -19,3 +19,5 @@
 | Model | Comprehensiveness | Professionalism | Authenticity | Safety |
 |-------------------|-----------------------|-------------------|-----------------|---------|
 | InternLM2_7B_chat_qlora | 1.32 | 2.20 | 2.10 | 1.00 |
+| InternLM2_7B_chat_full | 1.40 | 2.45 | 2.24 | 1.00 |
+| InternLM2_20B_chat_lora | 1.42 | 2.39 | 2.22 | 1.00 |