Update RAG README

2024-03-17 20:37:26 +08:00
commit 88218bfd4b
parent 4473c924f7
3 changed files with 68 additions and 11 deletions
--- a/scripts/qa_generation/Clean_QA.md
+++ b/scripts/qa_generation/Clean_QA.md
@@ -1,11 +0,0 @@
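-# Cleaning QA Pairs
-Call Qwen to judge whether each QA pair belongs to the field of psychology, and remove the QA pairs that do not
-
-## Step 1
-1. Prepare the QA pair data to be cleaned
-2. Put the data into the data folder at the same level as model
-3. Modify judge_dir in config/config.py according to the folder name. I did not rename the folder myself, so my judge_dir is judge_dir = os.path.join(data_dir, '数据整合')
-
-## Step 2
-1. Run QA_clean.py
-2. The cleaned QA pairs are saved in jsonl format under data/cleaned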
--- a/scripts/qa_generation/README.md
+++ b/scripts/qa_generation/README.md
@@ -93,3 +93,34 @@
 ## **Step 4: Cleaning QA Pairs**

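 - Purpose of cleaning
+
+  - Improve the quality of the extracted QA data by removing QA pairs unrelated to psychology
+
+- Cleaning method
+
+  - Use a prompt to have the LLM judge each given QA pair
+
+  - **Reference prompt**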
+
+  - ```markdown
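+    You are an experienced psychological counselor who is familiar with psychology. Based on the QA pair I provide, determine whether this QA pair falls within the scope of psychology.
+    
+    The criteria are as follows:
+    
+    - If the current QA pair falls within the scope of psychology, return 1
+    - If the current QA pair does not fall within the scope of psychology, return 0
+    
+    
+    The following is the content of the given psychology QA pair: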
+    ```
+
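+- Cleaning tools
+  - Configure `DASHSCOPE_API_KEY` in `config/config.py`; see Step 3 for how to obtain the `API_KEY`
+  - Use the provided cleaning script [QA_clean.py](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/qa_generation/QA_clean.py)
+
+- How to use
+  - Prepare the QA pair data to be cleaned
+  - Put the data into the data folder at the same level as model
+  - Modify `judge_dir` in `config/config.py` according to the folder name.
+  - If the folder storing the data is named `xxx`, then `judge_dir` is `judge_dir = os.path.join(data_dir, 'xxx')`
+  - The cleaned QA pairs are saved in `jsonl` format under `data/cleaned`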
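
The path settings described in the usage steps above can be sketched as follows. This is a minimal sketch, not the project's actual `config/config.py`: `build_paths`, `base_dir`, and `cleaned_dir` are hypothetical names introduced for illustration, and reading the key from an environment variable is an assumption. Only `data_dir`, `judge_dir`, `DASHSCOPE_API_KEY`, and the `data/cleaned` output location come from the README.

```python
import os


def build_paths(base_dir: str, folder_name: str) -> dict:
    """Return the directories used by the cleaning step (hypothetical helper)."""
    data_dir = os.path.join(base_dir, "data")        # the data folder next to model
    judge_dir = os.path.join(data_dir, folder_name)  # folder holding the QA pairs to clean
    cleaned_dir = os.path.join(data_dir, "cleaned")  # where the cleaned jsonl is written
    return {"data_dir": data_dir, "judge_dir": judge_dir, "cleaned_dir": cleaned_dir}


# Assumption: the key is taken from the environment instead of being hard-coded.
DASHSCOPE_API_KEY = os.environ.get("DASHSCOPE_API_KEY", "")

# Example: the data lives in a folder named 'xxx' under the current directory's data folder.
paths = build_paths(os.getcwd(), "xxx")
print(paths["judge_dir"])
```

Keeping the folder name as a single argument mirrors the README's instruction that only `judge_dir` needs to change when the data folder is renamed.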
--- a/scripts/qa_generation/README_EN.md
+++ b/scripts/qa_generation/README_EN.md
@@ -93,3 +93,40 @@ Using books specialized in psychology to build QA knowledge pairs for RAG to pro
 ## **Step 4: Cleaning of QA pairs**

 - Purpose of cleaning
+  - Improve the quality of extracted QA data and clean out QA pairs that are not relevant to psychology
+
+- Cleaning Methods
+
+  - Use a prompt to have the LLM judge each given QA pair
+
+  - **Reference prompt**
+
+  - ```markdown
+    You are an experienced counselor and are familiar with psychology. Based on the QA pair I have provided, determine if this QA pair is psychological in nature.
+    
+    The criteria are as follows:
+    
+    - If the current QA pair belongs to the category of psychology, then return 1
+    - If the current QA pair does not belong to the category of psychology, then return 0.
+    
+    
+    The following is the content of the given psychology QA pair:
+    ```
+
+- Cleaning Tools
+
+  - Configure `DASHSCOPE_API_KEY` in `config/config.py`; see Step 3 for how to get the `API_KEY`.
+
+  - Use the provided cleaning script [QA_clean.py](https://github.com/SmartFlowAI/EmoLLM/blob/main/scripts/qa_generation/QA_clean.py)
+
+- How to use
+
+  - Prepare the QA pair data to be cleaned
+
+  - Put the data into the data folder at the same level as model.
+
+  - Modify `judge_dir` in `config/config.py` according to the folder name.
+
+  - If the folder storing the data is named `xxx`, then `judge_dir` is `judge_dir = os.path.join(data_dir, 'xxx')`.
+
+  - The cleaned QA pairs are stored as `jsonl` under `data/cleaned`.
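
The cleaning flow described above (append a QA pair to the reference prompt, ask the LLM for 1 or 0, keep the pairs judged 1, write them as `jsonl`) might look roughly like the sketch below. It is not the code of `QA_clean.py`. The function names, the `question`/`answer` keys, and the assumption that the input is also `jsonl` are all hypothetical. The real Qwen call is deliberately left out: `judge` is any callable that takes the prompt and returns the model's reply, and `keyword_judge` is an offline stand-in used only so the example runs without an API key.

```python
import json
import os
from typing import Callable, Dict

# Reference prompt from the README; the QA pair to judge is appended to it.
JUDGE_PROMPT = (
    "You are an experienced counselor and are familiar with psychology. "
    "Based on the QA pair I have provided, determine if this QA pair is "
    "psychological in nature.\n\n"
    "The criteria are as follows:\n\n"
    "- If the current QA pair belongs to the category of psychology, then return 1\n"
    "- If the current QA pair does not belong to the category of psychology, then return 0.\n\n"
    "The following is the content of the given psychology QA pair:\n"
)


def build_prompt(qa: Dict[str, str]) -> str:
    """Append one QA pair, serialized as JSON, to the reference prompt."""
    return JUDGE_PROMPT + json.dumps(qa, ensure_ascii=False)


def is_psychology(reply: str) -> bool:
    """Keep a pair only when the reply, stripped of whitespace, is exactly '1'."""
    return reply.strip() == "1"


def clean_file(src: str, dst: str, judge: Callable[[str], str]) -> int:
    """Filter a jsonl file of QA pairs into dst; return how many pairs were kept."""
    os.makedirs(os.path.dirname(dst) or ".", exist_ok=True)
    kept = 0
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            qa = json.loads(line)
            if is_psychology(judge(build_prompt(qa))):
                fout.write(json.dumps(qa, ensure_ascii=False) + "\n")
                kept += 1
    return kept


def keyword_judge(prompt: str) -> str:
    """Hypothetical offline stand-in for the LLM call, for demonstration only."""
    return "1" if "anxiety" in prompt.lower() else "0"
```

In real use, `judge` would wrap the Qwen request configured through `DASHSCOPE_API_KEY`, and `dst` would point under `data/cleaned`. Treating anything other than an exact `1` as a rejection is a conservative choice for this sketch: malformed replies drop the pair instead of letting it through.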