OliveSensorAPI/datasets
processed Update process_merge.py 2024-03-21 16:07:18 +09:00
aiwei.json feat:Add new finetune configurations and datasets 2024-02-23 11:36:58 +08:00
data_pro.json feat:Add new finetune configurations and datasets 2024-02-23 11:36:58 +08:00
data.json update data.json (delete 4 empty data) 2024-03-21 15:56:54 +09:00
deduplicate.py Update main code (#2) 2024-03-24 11:51:19 +08:00
LICENSE Update main code (#2) 2024-03-24 11:51:19 +08:00
mother_v1.json Update code (#8) 2024-04-14 10:09:17 +08:00
mother_v2.json Update code (#8) 2024-04-14 10:09:17 +08:00
multi_turn_dataset_1.json upload smile.dataset 2024-02-28 17:44:48 +08:00
multi_turn_dataset_2.json Add files via upload 2024-02-28 21:18:02 +08:00
README_EN.md Update code (#8) 2024-04-14 10:09:17 +08:00
README.md Update code (#8) 2024-04-14 10:09:17 +08:00
scientist.json 1111 2024-03-20 23:25:07 +08:00
single_turn_dataset_1.json Upload datasets 2024-02-27 22:01:53 +08:00
single_turn_dataset_2.json Upload datasets 2024-02-27 22:01:53 +08:00
SoulStar_data.json add SoulStar_data 2024-03-03 17:28:26 +08:00
tiangou.json feat:Add new finetune configurations and datasets 2024-02-24 22:39:10 +08:00

EmoLLM's datasets

  • Category of dataset: General and Role-play
  • Type of data: QA and Conversation
  • Summary: General (6 datasets), Role-play (5 datasets)

Category

  • General: general-purpose datasets covering psychological knowledge, counseling techniques, etc.
  • Role-play: role-playing datasets containing conversation data in character-specific styles, etc.

Type

  • QA: question-and-answer pairs
  • Conversation: multi-turn counseling dialogues

Summary

| Category  | Dataset               | Type         | Total   |
|-----------|-----------------------|--------------|---------|
| General   | data                  | Conversation | 5,600+  |
| General   | data_pro              | Conversation | 36,500+ |
| General   | multi_turn_dataset_1  | Conversation | 36,000+ |
| General   | multi_turn_dataset_2  | Conversation | 27,000+ |
| General   | single_turn_dataset_1 | QA           | 14,000+ |
| General   | single_turn_dataset_2 | QA           | 18,300+ |
| Role-play | aiwei                 | Conversation | 4,000+  |
| Role-play | SoulStar              | QA           | 11,200+ |
| Role-play | tiangou               | Conversation | 3,900+  |
| Role-play | mother                | Conversation | 40,300+ |
| Role-play | scientist             | Conversation | 28,400+ |
| ……        | ……                    | ……           | ……      |

Source

General

  • data: from this repo
  • data_pro: from this repo
  • multi_turn_dataset_1: from Smile
  • multi_turn_dataset_2: from CPsyCounD
  • single_turn_dataset_1: from this repo
  • single_turn_dataset_2: from this repo

Role-play

  • aiwei: from this repo
  • tiangou: from this repo
  • SoulStar: from SoulStar
  • mother: from this repo
  • scientist: from this repo

Dataset Deduplication

The datasets are deduplicated by combining exact matching with fuzzy matching (the SimHash algorithm), which improves the effectiveness of the fine-tuned model. Adjusting the similarity threshold keeps dataset quality high while reducing the risk of losing important data through incorrect matches.

https://algonotes.readthedocs.io/en/latest/Simhash.html
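
The two-stage idea described above (exact matching first, then SimHash-based fuzzy matching with an adjustable threshold) can be sketched roughly as follows. This is a minimal illustration and not the code in deduplicate.py: the function names, the character 3-gram features, the whitespace/lowercase normalization, and the default threshold of 3 differing bits are all assumptions chosen for the example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64, ngram: int = 3) -> int:
    """Compute a SimHash fingerprint from character n-grams (illustrative choice)."""
    # Assumed normalization: lowercase and strip all whitespace.
    normalized = re.sub(r"\s+", "", text.lower())
    count = max(1, len(normalized) - ngram + 1)
    grams = [normalized[i:i + ngram] for i in range(count)]

    weights = [0] * bits
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(bits):
            weights[b] += 1 if (h >> b) & 1 else -1

    fingerprint = 0
    for b in range(bits):
        if weights[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts, threshold: int = 3):
    """Keep the first occurrence of each text; drop exact and near duplicates.

    `threshold` is the maximum Hamming distance at which two texts are
    treated as duplicates (hypothetical default, tune per dataset).
    """
    seen_exact = set()
    kept_fingerprints = []
    kept = []
    for text in texts:
        # Stage 1: exact matching on a hash of the raw text.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in seen_exact:
            continue
        seen_exact.add(key)
        # Stage 2: fuzzy matching against fingerprints of kept texts.
        fp = simhash(text)
        if any(hamming_distance(fp, other) <= threshold for other in kept_fingerprints):
            continue
        kept_fingerprints.append(fp)
        kept.append(text)
    return kept
```

A lower threshold drops only very close matches, so less data is lost to incorrect matches. A higher threshold removes more near duplicates but is more likely to discard distinct samples. The pairwise comparison here is quadratic in the number of kept texts, so a large dataset would likely need an index over fingerprint segments instead.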