# QA Generation Pipeline
## 1. Usage

- Check that the dependencies in `requirements.txt` are satisfied.
- Adjust the `system_prompt` in the code so that it matches the latest version in the repo; this keeps the generated QA diverse and stable.
- Put the txt files into the `data` folder in the same directory as `model`.
- Configure the required API KEY in `config/config.py` and start from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.
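The files under `data/generated` are JSONL, one QA pair per line. A minimal sketch for loading them afterwards (the field names `question`/`answer` are an assumption; check them against your actual output):

```python
import json
from pathlib import Path


def load_qa_pairs(generated_dir="data/generated"):
    """Load every QA pair from the jsonl files under the output folder."""
    pairs = []
    for path in sorted(Path(generated_dir).rglob("*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    pairs.append(json.loads(line))
    return pairs
```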
### 1.1 Obtaining an API KEY

Currently only Qwen is supported.
#### 1.1.1 Qwen

Go to the API-KEY management page of Alibaba Cloud's DashScope model service (aliyun.com), click "Create new API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.
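A sketch of what the key entry in `config/config.py` might look like (the repo's actual variable layout may differ; reading from the environment is an assumption, not the repo's confirmed behaviour):

```python
# config/config.py (sketch)
import os

# Prefer the environment variable so the key is not committed to the repo;
# fall back to an empty string if it is not set.
DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "")
```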
## 2. Notes

### 2.1 System Prompt

The current parsing scheme assumes that the model generates JSON blocks wrapped in markdown code fences; if you change the system prompt, make sure this remains the case.
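The parsing premise above can be sketched as follows (the repo's real parser may differ; this only illustrates the markdown-fenced-JSON assumption):

```python
import json
import re


def extract_json_block(model_output: str):
    """Parse the first markdown-fenced JSON block from model output."""
    # Match ```json ... ``` (the "json" tag is optional), across newlines.
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", model_output, re.DOTALL)
    if match is None:
        raise ValueError("no markdown-fenced JSON block found in model output")
    return json.loads(match.group(1))
```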
### 2.2 Sliding Window

Both `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the window slides over text split by sentence.
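A sketch of the sentence-based sliding window described above (the real implementation is `get_txt_content` in `util/data_loader.py` and may split sentences differently):

```python
import re


def sliding_window(text, window_size=5, overlap_size=1):
    """Slide a window of `window_size` sentences, overlapping by `overlap_size`."""
    # Split on Chinese or English sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])\s*", text) if s]
    step = max(1, window_size - overlap_size)  # guard against step <= 0
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + window_size]))
        if start + window_size >= len(sentences):
            break
    return windows
```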
### 2.3 Corpus Format

Currently only the txt format is supported. Place the cleaned book text under the `data` folder; the program will recursively retrieve all txt files under it.
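The recursive retrieval can be sketched like this (the function name is illustrative, not the repo's):

```python
from pathlib import Path


def find_txt_files(data_dir="data"):
    """Recursively collect every .txt file under the data folder."""
    return sorted(Path(data_dir).rglob("*.txt"))
```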
## TODO

- Support more models (Gemini, GPT, ChatGLM, ...)
- Support multi-threaded model calls
- Support more text formats (PDF, ...)
- Support more ways to split text