QA Generation Pipeline

1. Usage

  1. Check that the dependencies in requirements.txt are installed.
  2. Adjust the system_prompt in the code so that it matches the latest version in the repo; this helps keep the generated QA pairs diverse and stable.
  3. Place the txt files in a data folder in the same directory as model.
  4. Configure the required API KEY in config/config.py and run main.py. The generated QA pairs are stored in jsonl format under data/generated.

1.1 API KEY obtaining method

Currently only Qwen is supported.

1.1.1 Qwen

Go to the Alibaba Cloud Model Service Lingji (DashScope) API-KEY management page (aliyun.com), click "Create new API-KEY", and set the obtained API KEY as DASHSCOPE_API_KEY in config/config.py.
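A minimal sketch of the relevant config entry. The variable name DASHSCOPE_API_KEY comes from the repo; falling back to an environment variable is an assumption, not the repo's actual code:

```python
# config/config.py (sketch; env-var fallback is an assumption)
import os

DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "sk-your-key-here")
```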

2. Precautions

2.1 System Prompt

Note that the current parsing scheme assumes the model outputs JSON blocks wrapped in markdown code fences; if you change the system prompt, make sure the model still responds in that format.
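The parsing premise can be sketched as follows; extract_json_blocks is an illustrative name, not the repo's actual function:

```python
import json
import re

# Sketch of the assumed parsing premise: the model wraps each QA pair in a
# markdown ```json fence, and we extract the fenced blocks and parse them.
def extract_json_blocks(reply):
    blocks = re.findall(r"```json\s*(.*?)\s*```", reply, re.DOTALL)
    return [json.loads(b) for b in blocks]

fence = "```"  # built separately so the sample reply stays readable
reply = f'{fence}json\n{{"question": "Q1", "answer": "A1"}}\n{fence}'
```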

2.2 Sliding Window

Both window_size and overlap_size of the sliding window can be changed in the get_txt_content function in util/data_loader.py. The window currently slides over the text sentence by sentence.
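A hedged sketch of a sentence-based sliding window under the window_size and overlap_size parameters named above; the actual implementation lives in get_txt_content in util/data_loader.py and may differ:

```python
# Illustrative only: group sentences into overlapping windows.
def sliding_windows(sentences, window_size=4, overlap_size=1):
    step = window_size - overlap_size  # how far the window advances each time
    windows = []
    for i in range(0, len(sentences), step):
        windows.append(" ".join(sentences[i:i + window_size]))
        if i + window_size >= len(sentences):  # last window reached the end
            break
    return windows
```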

2.3 Corpus Format

At present only the txt format is supported. Place the cleaned book text under the data folder; the program recursively retrieves all txt files under that folder.

TODO

  1. Support more models (Gemini, GPT, ChatGLM...)
  2. Support multi-threaded model calls
  3. Support more text formats (PDF...)
  4. Support more ways to split text