[merge] merge new docs from dev bench (#173)
commit 7a19c513a1
				| @ -2,7 +2,7 @@ | |||||||
| 
 | 
 | ||||||
| * Datasets fall into two categories by use: **General** and **Role-play** | * Datasets fall into two categories by use: **General** and **Role-play** | ||||||
| * Data comes in two formats: **QA** and **Conversation** | * Data comes in two formats: **QA** and **Conversation** | ||||||
| * Summary: General (**6 datasets**); Role-play (**3 datasets**) | * Summary: General (**6 datasets**); Role-play (**5 datasets**) | ||||||
| 
 | 
 | ||||||
| ## Dataset Categories | ## Dataset Categories | ||||||
| 
 | 
 | ||||||
| @ -27,6 +27,8 @@ | |||||||
| | *Role-play* |         aiwei         | Conversation |  4000+  | | | *Role-play* |         aiwei         | Conversation |  4000+  | | ||||||
| | *Role-play* |       SoulStar        |      QA      | 11200+  | | | *Role-play* |       SoulStar        |      QA      | 11200+  | | ||||||
| | *Role-play* |        tiangou        | Conversation |  3900+  | | | *Role-play* |        tiangou        | Conversation |  3900+  | | ||||||
|  | | *Role-play* |        mother         | Conversation | 24,500+ | | ||||||
|  | | *Role-play* |       scientist       | Conversation | 28,400+ | | ||||||
| |     ……      |          ……           |      ……      |   ……    | | |     ……      |          ……           |      ……      |   ……    | | ||||||
| 
 | 
 | ||||||
| ## Dataset Sources | ## Dataset Sources | ||||||
| @ -45,6 +47,8 @@ | |||||||
| * Dataset aiwei comes from this project | * Dataset aiwei comes from this project | ||||||
| * Dataset tiangou comes from this project | * Dataset tiangou comes from this project | ||||||
| * Dataset SoulStar comes from [SoulStar](https://github.com/Nobody-ML/SoulStar) | * Dataset SoulStar comes from [SoulStar](https://github.com/Nobody-ML/SoulStar) | ||||||
|  | * Dataset mother comes from this project | ||||||
|  | * Dataset scientist comes from this project | ||||||
| 
 | 
 | ||||||
| ## Dataset Deduplication | ## Dataset Deduplication | ||||||
| 
 | 
 | ||||||
|  | |||||||
| @ -2,7 +2,7 @@ | |||||||
| 
 | 
 | ||||||
| * Category of dataset: **General** and **Role-play** | * Category of dataset: **General** and **Role-play** | ||||||
| * Type of data: **QA** and **Conversation** | * Type of data: **QA** and **Conversation** | ||||||
| * Summary: General(**6 datasets**), Role-play(**3 datasets**) | * Summary: General(**6 datasets**), Role-play(**5 datasets**) | ||||||
| 
 | 
 | ||||||
|  ## Category |  ## Category | ||||||
| * **General**: general-purpose datasets, including psychological knowledge, counseling techniques, etc. | * **General**: general-purpose datasets, including psychological knowledge, counseling techniques, etc. | ||||||
| @ -25,6 +25,8 @@ | |||||||
| | *Role-play* |         aiwei         | Conversation |  4000+  | | | *Role-play* |         aiwei         | Conversation |  4000+  | | ||||||
| | *Role-play* |       SoulStar        |      QA      | 11200+  | | | *Role-play* |       SoulStar        |      QA      | 11200+  | | ||||||
| | *Role-play* |        tiangou        | Conversation |  3900+  | | | *Role-play* |        tiangou        | Conversation |  3900+  | | ||||||
|  | | *Role-play* |        mother         | Conversation | 24,500+ | | ||||||
|  | | *Role-play* |       scientist       | Conversation | 28,400+ | | ||||||
| |     ……      |          ……           |      ……      |   ……    | | |     ……      |          ……           |      ……      |   ……    | | ||||||
| 
 | 
 | ||||||
| 
 | 
 | ||||||
| @ -41,6 +43,8 @@ | |||||||
| * dataset `aiwei` from this repo | * dataset `aiwei` from this repo | ||||||
| * dataset `tiangou` from this repo | * dataset `tiangou` from this repo | ||||||
| * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar) | * dataset `SoulStar` from [SoulStar](https://github.com/Nobody-ML/SoulStar) | ||||||
|  | * dataset `mother` from this repo | ||||||
|  | * dataset `scientist` from this repo | ||||||
| 
 | 
 | ||||||
| **Dataset Deduplication**: | **Dataset Deduplication**: | ||||||
| Combine exact matching with fuzzy matching (SimHash) to deduplicate the dataset, which improves the effectiveness of the fine-tuned model. Adjusting the similarity threshold keeps the dataset high quality while reducing the risk of losing important data to incorrect matches. | Combine exact matching with fuzzy matching (SimHash) to deduplicate the dataset, which improves the effectiveness of the fine-tuned model. Adjusting the similarity threshold keeps the dataset high quality while reducing the risk of losing important data to incorrect matches. | ||||||
|  | |||||||
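The deduplication approach described above (exact matching followed by SimHash fuzzy matching with an adjustable threshold) can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: the function names `simhash`, `hamming`, and `deduplicate`, the character 3-gram features, the 64-bit fingerprint, and the default threshold of 3 are all assumptions chosen for the example.

```python
import hashlib
from typing import Iterable, List


def simhash(text: str, bits: int = 64, ngram: int = 3) -> int:
    """Return a SimHash fingerprint built from character n-grams (assumed features)."""
    # Character n-grams work for both Chinese and English text.
    features = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(records: Iterable[str], threshold: int = 3) -> List[str]:
    """Keep the first occurrence of each record.

    A record is dropped if it exactly matches a kept record, or if its
    fingerprint is within `threshold` bits of a kept record's fingerprint.
    """
    kept: List[str] = []
    seen_exact = set()
    fingerprints: List[int] = []
    for record in records:
        key = record.strip()
        if key in seen_exact:  # exact match
            continue
        fp = simhash(key)
        if any(hamming(fp, other) <= threshold for other in fingerprints):
            continue  # fuzzy match
        seen_exact.add(key)
        fingerprints.append(fp)
        kept.append(record)
    return kept
```

A lower `threshold` removes only near-identical records, while a higher one deduplicates more aggressively at the cost of more incorrect matches. The pairwise comparison here is quadratic in the number of kept records, so a dataset of tens of thousands of conversations would likely need an index over fingerprint segments instead.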
| @ -20,4 +20,4 @@ | |||||||
| |-------------------|-----------------------|-------------------|-----------------|---------| | |-------------------|-----------------------|-------------------|-----------------|---------| | ||||||
| | InternLM2_7B_chat_qlora |      1.32       |        2.20       |      2.10       | 1.00    | | | InternLM2_7B_chat_qlora |      1.32       |        2.20       |      2.10       | 1.00    | | ||||||
| | InternLM2_7B_chat_full  |      1.40       |        2.45       |      2.24       | 1.00    | | | InternLM2_7B_chat_full  |      1.40       |        2.45       |      2.24       | 1.00    | | ||||||
| 
 | | InternLM2_20B_chat_lora |      1.42       |        2.39       |      2.22       | 1.00    | | ||||||
|  | |||||||
| @ -19,3 +19,5 @@ | |||||||
| |       Model       |    Comprehensiveness  |  Professionalism  |  Authenticity   | Safety  | | |       Model       |    Comprehensiveness  |  Professionalism  |  Authenticity   | Safety  | | ||||||
| |-------------------|-----------------------|-------------------|-----------------|---------| | |-------------------|-----------------------|-------------------|-----------------|---------| | ||||||
| | InternLM2_7B_chat_qlora |      1.32       |        2.20       |      2.10       | 1.00    | | | InternLM2_7B_chat_qlora |      1.32       |        2.20       |      2.10       | 1.00    | | ||||||
|  | | InternLM2_7B_chat_full  |      1.40       |        2.45       |      2.24       | 1.00    | | ||||||
|  | | InternLM2_20B_chat_lora |      1.42       |        2.39       |      2.22       | 1.00    | | ||||||
|  | |||||||
	 MING_X