commit
						8a36a3bd9a
					
				| @ -1,11 +0,0 @@ | |||||||
| # Cleaning QA pairs |  | ||||||
| Call Qwen to judge whether each QA pair belongs to the field of psychology, and remove the QA pairs that do not |  | ||||||
| 
 |  | ||||||
| ## Step 1 |  | ||||||
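| 1. Prepare the QA pair data to be cleaned |  | ||||||
| 2. Put the data into the data folder at the same level as model |  | ||||||
| 3. Modify judge_dir in config/config.py according to the folder name. I did not rename the folder, so my judge_dir is judge_dir = os.path.join(data_dir, '数据整合') |  | ||||||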
| 
 |  | ||||||
| ## Step 2 |  | ||||||
| 1. Run QA_clean.py |  | ||||||
| 2. The cleaned QA pairs are saved in jsonl format under data/cleaned |  | ||||||
| @ -93,3 +93,34 @@ | |||||||
| ## **Step 4: Cleaning QA pairs** | ## **Step 4: Cleaning QA pairs** | ||||||
| 
 | 
 | ||||||
| - Purpose of cleaning | - Purpose of cleaning | ||||||
|  | 
 | ||||||
|  |   - Improve the quality of the extracted QA data by removing QA pairs unrelated to psychology | ||||||
|  | 
 | ||||||
|  | - Cleaning method | ||||||
|  | 
 | ||||||
|  |   - Use a prompt to have the LLM judge each given QA pair | ||||||
|  | 
 | ||||||
|  |   - **Reference prompt** | ||||||
|  | 
 | ||||||
|  |   - ```markdown | ||||||
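|  |     You are an experienced counselor and are familiar with psychology. Based on the QA pair I provide, determine whether this QA pair belongs to the category of psychology. | ||||||
|  |      | ||||||
|  |     The criteria are as follows: | ||||||
|  |      | ||||||
|  |     - If the current QA pair belongs to the category of psychology, then return 1 | ||||||
|  |     - If the current QA pair does not belong to the category of psychology, then return 0 | ||||||
|  |      | ||||||
|  |      | ||||||
|  |     The following is the content of the given psychology QA pair: | ||||||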
|  |     ``` | ||||||
|  | 
 | ||||||
|  | - Cleaning tools | ||||||
|  |   - Configure `DASHSCOPE_API_KEY` in `config/config.py`; see Step 3 for how to obtain the `API_KEY` | ||||||
|  |   - Use the provided cleaning script [QA_clean](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/qa_generation/QA_clean.py) | ||||||
|  | 
 | ||||||
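|  | - How to use | ||||||
|  |   - Prepare the QA pair data to be cleaned | ||||||
|  |   - Put the data into the `data` folder at the same level as `model` | ||||||
|  |   - Modify `judge_dir` in `config/config.py` according to the folder name. | ||||||
|  |   - If the folder that stores the data is named `xxx`, then set `judge_dir = os.path.join(data_dir, 'xxx')` | ||||||
|  |   - The cleaned QA pairs are saved in `jsonl` format under `data/cleaned` | ||||||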
|  | |||||||
| @ -93,3 +93,40 @@ Using books specialized in psychology to build QA knowledge pairs for RAG to pro | |||||||
| ## **Step 4: Cleaning of QA pairs** | ## **Step 4: Cleaning of QA pairs** | ||||||
| 
 | 
 | ||||||
| - Purpose of cleaning | - Purpose of cleaning | ||||||
|  |   - Improve the quality of extracted QA data and clean out QA pairs that are not relevant to psychology | ||||||
|  | 
 | ||||||
|  | - Cleaning Methods | ||||||
|  | 
 | ||||||
|  |   - Use a prompt to have the LLM judge each given QA pair | ||||||
|  | 
 | ||||||
|  |   - **Reference prompt** | ||||||
|  | 
 | ||||||
|  |   - ```markdown | ||||||
|  |     You are an experienced counselor and are familiar with psychology. Based on the QA pair I have provided, determine if this QA pair is psychological in nature. | ||||||
|  |      | ||||||
|  |     The criteria are as follows: | ||||||
|  |      | ||||||
|  |     - If the current QA pair belongs to the category of psychology, then return 1 | ||||||
|  |     - If the current QA pair does not belong to the category of psychology, then return 0. | ||||||
|  |      | ||||||
|  |      | ||||||
|  |     The following is the content of the given psychology QA pair: | ||||||
|  |     ``` | ||||||
|  | 
 | ||||||
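The prompt-based judgment described above could be wired up roughly as follows. This is a minimal sketch, not the code in `QA_clean.py`: the helper names `build_judge_message` and `parse_verdict`, the `question`/`answer` field names, and the choice to serialize the QA pair as JSON are all assumptions made for illustration. The actual call to the LLM is left out on purpose.

```python
import json

# Reference prompt from this step, reproduced as a Python string.
JUDGE_PROMPT = (
    "You are an experienced counselor and are familiar with psychology. "
    "Based on the QA pair I have provided, determine if this QA pair is "
    "psychological in nature.\n\n"
    "The criteria are as follows:\n\n"
    "- If the current QA pair belongs to the category of psychology, then return 1\n"
    "- If the current QA pair does not belong to the category of psychology, then return 0\n\n"
    "The following is the content of the given psychology QA pair:\n"
)


def build_judge_message(qa_pair: dict) -> str:
    """Append one QA pair (serialized as JSON, an assumption) to the prompt."""
    return JUDGE_PROMPT + json.dumps(qa_pair, ensure_ascii=False)


def parse_verdict(reply: str) -> bool:
    """Keep a QA pair only when the reply is exactly '1' after stripping.

    Anything else (including '0' or free-form text) counts as 'not psychology',
    which is a conservative choice made for this sketch.
    """
    return reply.strip() == "1"
```

Treating every reply other than `1` as a rejection keeps the filter strict; a more lenient parser that searches the reply for a digit is equally plausible.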
|  | - Cleaning Tools | ||||||
|  | 
 | ||||||
|  |   - Configure `DASHSCOPE_API_KEY` in `config/config.py`; see Step 3 for how to obtain the `API_KEY`. | ||||||
|  | 
 | ||||||
|  |   - Use the provided cleaning script [QA_clean](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/qa_generation/QA_clean.py) | ||||||
|  | 
 | ||||||
|  | - How to use | ||||||
|  | 
 | ||||||
|  |   - Prepare the QA pair data to be cleaned | ||||||
|  | 
 | ||||||
|  |   - Put the data into the `data` folder at the same level as `model`. | ||||||
|  | 
 | ||||||
|  |   - Modify `judge_dir` in `config/config.py` according to the folder name. | ||||||
|  | 
 | ||||||
|  |   - If the folder that stores the data is named `xxx`, then set `judge_dir = os.path.join(data_dir, 'xxx')`. | ||||||
|  | 
 | ||||||
|  |   - The cleaned QA pairs are stored as `jsonl` under `data/cleaned`. | ||||||
|  | |||||||
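The usage steps above amount to filtering QA pairs and writing the survivors as `jsonl`. The sketch below illustrates that flow under stated assumptions: `clean_qa_pairs` and `write_jsonl` are hypothetical helpers rather than functions from `QA_clean.py`, the `judge` callable stands in for the LLM call, and the `question`/`answer` keys and the output file name are placeholders.

```python
import json
import os
from typing import Callable, Iterable, List


def clean_qa_pairs(qa_pairs: Iterable[dict], judge: Callable[[dict], bool]) -> List[dict]:
    """Keep only the QA pairs for which judge() returns True."""
    return [qa for qa in qa_pairs if judge(qa)]


def write_jsonl(records: Iterable[dict], path: str) -> None:
    """Write one JSON object per line, creating the parent folder if needed."""
    out_dir = os.path.dirname(path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    # Assumed layout: a `data` folder next to `model`, with input in data/xxx.
    data_dir = os.path.join(os.getcwd(), "data")
    judge_dir = os.path.join(data_dir, "xxx")
    cleaned_path = os.path.join(data_dir, "cleaned", "cleaned.jsonl")  # placeholder name

    sample = [
        {"question": "What is cognitive dissonance?", "answer": "A state of mental conflict."},
        {"question": "How do I boil an egg?", "answer": "Simmer it for several minutes."},
    ]
    # Stand-in judge for demonstration only; the real check would ask the LLM.
    kept = clean_qa_pairs(sample, lambda qa: "cognitive" in qa["question"])
    write_jsonl(kept, cleaned_path)
```

Passing the judge as a callable keeps the file handling testable without an API key.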