OliveSensorAPI/datasets
Anooyman de0674ccf7
Update main code (#2)
* update rag/src/data_processing.py

* Add files via upload

allow user to load embedding & rerank models from cache

* Add files via upload

embedding_path = os.path.join(model_dir, 'embedding_model')  
rerank_path = os.path.join(model_dir, 'rerank_model')

* Test push to dev

Test push to dev

* Add files via upload

After merging, cleaning, and deduplicating the two mother multi-turn dialogue datasets, 2,439 multi-turn dialogues remain (each with 6-8 turns).

* optimize deduplicate.py

Add time print information
save duplicate dataset as well
remove print(content)

* add base model qlora finetuning config file: internlm2_7b_base_qlora_e10_M_1e4_32_64.py

* add full finetune code from internlm2

* other 2 configs for base model

* update cli_internlm2.py

 three methods to load model

1. download model in openxlab
2. download model in modelscope
3. offline model

* create upload_modelscope.py

* add base model and update personal contributions

* add README.md for Emollm_Scientist

* Create README_internlm2_7b_base_qlora.md

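InternLM2 7B Base QLoRA fine-tuning guide

* [DOC] EmoLLM_Scientist fine-tuning guide

* [DOC] EmoLLM_Scientist fine-tuning guide

* [DOC] EmoLLM_Scientist fine-tuning guide

* [DOC] EmoLLM_Scientist fine-tuning guide

* [DOC] EmoLLM_Scientist fine-tuning guide

* [DOC] EmoLLM_Scientist fine-tuning guide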

* update

* [DOC]README_scientist.md

* delete config

* format update

* upload xlab

* add README_Model_Uploading.md and images

* modelscope model upload

* Modify Recent Updates

* update daddy-like Boy-Friend EmoLLM

* update model uploading with openxlab

* update model uploading with openxlab

---------

Co-authored-by: zealot52099 <songyan5209@163.com>
Co-authored-by: xzw <62385492+aJupyter@users.noreply.github.com>
Co-authored-by: zealot52099 <67356208+zealot52099@users.noreply.github.com>
Co-authored-by: Bryce Wang <90940753+brycewang2018@users.noreply.github.com>
Co-authored-by: HongCheng <kwchenghong@gmail.com>
2024-03-24 11:51:19 +08:00
processed Update process_merge.py 2024-03-21 16:07:18 +09:00
aiwei.json feat:Add new finetune configurations and datasets 2024-02-23 11:36:58 +08:00
data_pro.json feat:Add new finetune configurations and datasets 2024-02-23 11:36:58 +08:00
data.json update data.json (delete 4 empty data) 2024-03-21 15:56:54 +09:00
deduplicate.py Update main code (#2) 2024-03-24 11:51:19 +08:00
LICENSE Update main code (#2) 2024-03-24 11:51:19 +08:00
mother.json Update main code (#2) 2024-03-24 11:51:19 +08:00
multi_turn_dataset_1.json upload smile.dataset 2024-02-28 17:44:48 +08:00
multi_turn_dataset_2.json Add files via upload 2024-02-28 21:18:02 +08:00
README_EN.md [DOC] update datasets/README_EN.md 2024-03-20 17:52:23 +08:00
README.md Update main code (#2) 2024-03-24 11:51:19 +08:00
scientist.json 1111 2024-03-20 23:25:07 +08:00
single_turn_dataset_1.json Upload datasets 2024-02-27 22:01:53 +08:00
single_turn_dataset_2.json Upload datasets 2024-02-27 22:01:53 +08:00
SoulStar_data.json add SoulStar_data 2024-03-03 17:28:26 +08:00
tiangou.json feat:Add new finetune configurations and datasets 2024-02-24 22:39:10 +08:00

EmoLLM's datasets

  • Category of dataset: General and Role-play
  • Type of data: QA and Conversation
  • Summary: General (6 datasets), Role-play (3 datasets)

Category

  • General: generic dataset, including psychological knowledge, counseling techniques, etc.
  • Role-play: role-playing dataset, including character-specific conversation style data, etc.

Type

  • QA: question-and-answer pair
  • Conversation: multi-turn consultation dialogue

Summary

| Category  | Dataset               | Type         | Total   |
|-----------|-----------------------|--------------|---------|
| General   | data                  | Conversation | 5,600+  |
| General   | data_pro              | Conversation | 36,500+ |
| General   | multi_turn_dataset_1  | Conversation | 36,000+ |
| General   | multi_turn_dataset_2  | Conversation | 27,000+ |
| General   | single_turn_dataset_1 | QA           | 14,000+ |
| General   | single_turn_dataset_2 | QA           | 18,300+ |
| Role-play | aiwei                 | Conversation | 4,000+  |
| Role-play | SoulStar              | QA           | 11,200+ |
| Role-play | tiangou               | Conversation | 3,900+  |
| ……        | ……                    | ……           | ……      |

Source

General

  • dataset data from this repo
  • dataset data_pro from this repo
  • dataset multi_turn_dataset_1 from Smile
  • dataset multi_turn_dataset_2 from CPsyCounD
  • dataset single_turn_dataset_1 from this repo
  • dataset single_turn_dataset_2 from this repo

Role-play

  • dataset aiwei from this repo
  • dataset tiangou from this repo
  • dataset SoulStar from SoulStar

Dataset Deduplication

Combine exact matching with fuzzy matching (the Simhash algorithm) to deduplicate the dataset, which improves the effectiveness of the fine-tuned model. Adjusting the similarity threshold keeps dataset quality high while reducing the risk of discarding important data through false matches.

https://algonotes.readthedocs.io/en/latest/Simhash.html
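
The deduplication idea described above can be sketched as follows. This is a minimal illustration, not the contents of this repo's deduplicate.py: the function names (`simhash`, `hamming_distance`, `deduplicate`), the use of character 3-grams, the 64-bit fingerprint, and the default threshold of 3 bits are all assumptions chosen for the example. Records are assumed to be JSON-serializable objects such as the conversation entries in the dataset files.

```python
import hashlib
import json


def simhash(text: str, bits: int = 64, ngram: int = 3) -> int:
    """Compute a Simhash fingerprint from character n-grams (assumed feature choice)."""
    # Character n-grams avoid the need for a word tokenizer on Chinese text.
    grams = [text[i:i + ngram] for i in range(max(len(text) - ngram + 1, 1))]
    weights = [0] * bits
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(records, threshold: int = 3):
    """Split records into (kept, duplicates).

    A record is a duplicate if it matches an earlier kept record exactly,
    or if its fingerprint is within `threshold` bits of a kept fingerprint.
    """
    seen_exact = set()
    kept_fingerprints = []
    kept, duplicates = [], []
    for record in records:
        text = json.dumps(record, ensure_ascii=False, sort_keys=True)
        exact_key = hashlib.md5(text.encode("utf-8")).hexdigest()
        if exact_key in seen_exact:  # exact match: cheap fast path
            duplicates.append(record)
            continue
        fp = simhash(text)
        # Fuzzy match: linear scan, fine for a sketch but O(n^2) overall.
        if any(hamming_distance(fp, other) <= threshold for other in kept_fingerprints):
            duplicates.append(record)
            continue
        seen_exact.add(exact_key)
        kept_fingerprints.append(fp)
        kept.append(record)
    return kept, duplicates
```

A lower threshold removes only near-identical records and so risks fewer false matches; a higher one removes more aggressively. Returning the duplicates alongside the kept records makes it possible to inspect what was dropped before settling on a threshold.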