# QA Generation Pipeline

## 1. Usage

1. Check that the dependencies in `requirements.txt` are installed.
2. Adjust the `system_prompt` in the code so that it is consistent with the latest version of the repo; this helps keep the generated QA diverse and stable.
3. Put the txt files into the `data` folder, which is in the same directory as `model`.
4. Configure the required API key in `config/config.py` and run `main.py`. The generated QA pairs are stored in JSONL format under `data/generated`.
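
As an illustration of the JSONL output mentioned in step 4, here is a minimal sketch of writing and reading one QA pair per line. The helper names and the `question`/`answer` field names are assumptions made for this example; the actual schema is whatever `main.py` emits.

```python
import json
from pathlib import Path


def append_qa_pairs(qa_pairs, out_path):
    """Append each QA pair as one JSON object per line (JSONL)."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("a", encoding="utf-8") as f:
        for pair in qa_pairs:
            # ensure_ascii=False keeps non-ASCII text human-readable in the file
            f.write(json.dumps(pair, ensure_ascii=False) + "\n")


def read_qa_pairs(path):
    """Load a JSONL file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Appending line by line means a partially completed run still leaves a valid file that can be inspected or resumed.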

### 1.1 Obtaining an API key

Currently only Qwen is supported.

#### 1.1.1 Qwen

Go to [DashScope API-KEY management (aliyun.com)](https://dashscope.console.aliyun.com/apiKey), click "Create new API-KEY", and fill the obtained key into `DASHSCOPE_API_KEY` in `config/config.py`.
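
One possible way to fill in the key without committing it to the repo is to read it from an environment variable. This is only a sketch: apart from the `DASHSCOPE_API_KEY` name, the layout of `config/config.py` and the `require_api_key` helper are assumptions, not the actual file contents.

```python
import os

# Hypothetical fragment of config/config.py: take the key from the environment,
# falling back to an empty string if the variable is not set.
DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "")


def require_api_key(key):
    """Return the key unchanged, or fail early with a clear message if it is empty."""
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is empty; set it in config/config.py or the environment."
        )
    return key
```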

## 2. Precautions

### 2.1 System Prompt

Note that the current parsing scheme assumes the model returns JSON blocks wrapped in markdown code fences. If you change the system prompt, make sure the model still produces output in this form.
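
To make the constraint concrete, here is a minimal sketch of how fenced JSON blocks could be pulled out of a model reply. It is not the repo's parser; the function name and regular expression are assumptions for illustration.

```python
import json
import re

# The fence marker is built programmatically so this example nests safely in markdown.
FENCE = "`" * 3
# Matches a fenced block, with or without a "json" language tag after the opening fence.
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*\n(.*?)" + FENCE, re.DOTALL)


def extract_json_blocks(response_text):
    """Return the parsed content of every fenced block that contains valid JSON."""
    blocks = []
    for match in FENCE_RE.finditer(response_text):
        try:
            blocks.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip malformed blocks instead of aborting the whole run.
            continue
    return blocks
```

A reply that contains no fenced block yields an empty list under this sketch, which is why a prompt change that drops the fences would silently produce no QA pairs.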

### 2.2 Sliding Window

Both the `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the window is measured in sentences.
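
The following sketch shows one way a sentence-based sliding window with these two parameters can work. It is not the code in `get_txt_content`; the helper names and the sentence-splitting rule are assumptions.

```python
import re


def split_sentences(text):
    """Naive splitter: cut after common Chinese and English sentence terminators."""
    parts = re.findall(r"[^。！？!?.]+[。！？!?.]*", text)
    return [p.strip() for p in parts if p.strip()]


def sliding_windows(sentences, window_size, overlap_size):
    """Group sentences into windows of window_size that share overlap_size sentences."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap_size < window_size:
        raise ValueError("overlap_size must satisfy 0 <= overlap_size < window_size")
    step = window_size - overlap_size
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(sentences[start:start + window_size])
        if start + window_size >= len(sentences):
            break  # the last window already reaches the end of the text
    return windows
```

A larger `overlap_size` gives each chunk more shared context with its neighbour, at the cost of more model calls for the same text.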

### 2.3 Corpus Format

At present, only the txt format is supported. Place the cleaned book text under the `data` folder; the program recursively retrieves all txt files under that folder.
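
Recursive retrieval of this kind can be sketched with `pathlib`. The function name is hypothetical and the repo's loader may be implemented differently.

```python
from pathlib import Path


def find_txt_files(data_dir):
    """Return every .txt file under data_dir, including nested folders, in sorted order."""
    return sorted(p for p in Path(data_dir).rglob("*.txt") if p.is_file())
```

Sorting keeps the processing order stable between runs, which makes the generated output easier to compare.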

## TODO

1. Support more models (Gemini, GPT, ChatGLM...)
2. Support multi-threaded model calls
3. Support more text formats (PDF...)
4. Support more ways to split text