# RAG Database Building Process
## Purpose

We use books specializing in psychology to build QA knowledge pairs for RAG, providing a counseling knowledge base that makes EmoLLM's answers more professional and reliable. To achieve this goal we use dozens of psychology books to build the RAG knowledge base. The main building process is as follows:
## Building Process
### Step 1: PDF to TXT
- Purpose
  - Convert the collected PDF versions of psychology books into TXT text files to facilitate subsequent information extraction.
- Required tools
  - Install the necessary Python libraries:
    - `pip install paddlepaddle`
    - `pip install opencv-python`
    - `pip install paddleocr`
- Notes
  - If you cannot install paddleocr with `pip install paddleocr`, consider installing from the whl file (see the download address).
  - Start the script from the command line: `python pdf2txt.py [directory containing the PDF files]`
### Step 2: Screening PDFs
- Purpose of screening
  - Use an LLM to filter out books that are not professional psychology books.
- Screening criteria: the book should contain counseling-related content, such as:
  - Schools of counseling - specific counseling methods
  - Mental illness - characteristics of the illness
  - Mental illness - treatment
- Screening method:
  - Perform an initial screening based on the title.
  - If you cannot tell whether it is a counseling-related book, use kimi/GLM-4 to check whether it contains counseling-related knowledge (it is recommended to check only one book at a time).
  - Reference prompt:
    > You are an experienced psychology professor who is familiar with psychology and counseling. I need you to help me with the task "Identify whether a book contains knowledge of counseling"; take a deep breath, think step by step, and give me your answer. If your answer satisfies me, I will give you a 100,000 tip! The task is as follows: determine whether the book contains the following counseling-related knowledge: ''' Schools of counseling - specific counseling approaches; Mental illness - characteristics of the illness; Mental illness - treatment approaches ''' Please take a deep breath, review the book step by step, and complete the task carefully.
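When screening many books, it helps to fill the reference prompt programmatically, one book per call. A minimal sketch (the `build_screening_prompt` helper and its parameters are hypothetical, not part of the repo):

```python
# Hypothetical helper: fill the reference screening prompt for one book.
# The function name and the CRITERIA wording are illustrative, not from the repo.

CRITERIA = (
    "Schools of counseling - specific counseling approaches\n"
    "Mental illness - characteristics of the illness\n"
    "Mental illness - treatment approaches"
)

def build_screening_prompt(book_title: str, excerpt: str) -> str:
    """Return the screening prompt for a single book (one book per call)."""
    return (
        "You are an experienced psychology professor who is familiar with "
        "psychology and counseling. I need you to help me with the task "
        '"Identify whether a book contains knowledge of counseling"; '
        "take a deep breath, think step by step, and give me your answer.\n"
        "Determine whether the book contains the following "
        "counseling-related knowledge:\n"
        f"'''\n{CRITERIA}\n'''\n"
        f"Book title: {book_title}\n"
        f"Excerpt:\n{excerpt}"
    )

prompt = build_screening_prompt("Theories of Counseling", "Chapter 1 ...")
```

The resulting string can then be sent to kimi/GLM-4 through whatever chat interface or API you use for screening.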
### Step 3: Extracting QA Pairs
- Purpose
  - Based on the content of each book, use an LLM to efficiently construct QA knowledge pairs.
- Extraction process
  - Prepare the processed txt text data.
  - Configure the request script file as needed.
  - Adjust `window_size` and `overlap_size` according to your own needs or the extraction results.
- Usage
  - Check that the dependencies in `requirements.txt` are satisfied.
  - Adjust `system_prompt` in the code to keep it consistent with the latest version of the repo and to ensure diversity and stability of the generated QA pairs.
  - Place the txt files in the `data` folder in the same directory as the `model`.
  - Configure the required API KEYs in `config/config.py` and start from `main.py`. The generated QA pairs are stored in jsonl format under `data/generated`.
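The jsonl output holds one JSON object per line. A minimal sketch of writing and reading that format (the `question`/`answer` field names and the `qa.jsonl` filename are assumptions for illustration, not necessarily the repo's exact schema):

```python
import json
from pathlib import Path

# Write generated QA pairs as one JSON object per line (jsonl).
# Field names "question"/"answer" are assumptions; check the actual
# files produced under data/generated for the repo's real schema.
qa_pairs = [
    {"question": "What is cognitive behavioral therapy?",
     "answer": "A structured, short-term form of psychotherapy ..."},
]

out_dir = Path("data/generated")
out_dir.mkdir(parents=True, exist_ok=True)
out_file = out_dir / "qa.jsonl"  # hypothetical filename
with out_file.open("w", encoding="utf-8") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")

# Reading the file back, line by line:
loaded = [json.loads(line)
          for line in out_file.read_text(encoding="utf-8").splitlines()]
```

`ensure_ascii=False` keeps Chinese text readable in the output instead of escaping it to `\uXXXX` sequences.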
- Obtaining an API KEY
  - Currently only qwen is supported.
  - Qwen
    - Go to Model Service Lingji - API-KEY Management (aliyun.com), click "Create New API-KEY", and fill the obtained API KEY into `DASHSCOPE_API_KEY` in `config/config.py`.
- Notes
  - System prompt
    - The current parsing scheme assumes that the model generates markdown-wrapped JSON blocks; if you change the system prompt, make sure this remains true.
  - Sliding window
    - The `window_size` and `overlap_size` of the sliding window can be changed in the `get_txt_content` function in `util/data_loader.py`. Currently the sliding window splits the text by sentence.
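Because the parser expects markdown-wrapped JSON blocks, extraction from a model reply can be sketched as follows (illustrative only; the repo's actual parsing code may differ):

```python
import json
import re

# Find every markdown-wrapped JSON block in an LLM response and parse it.
# Illustrative sketch; the repo's actual parser may differ.
JSON_BLOCK = re.compile(r"```json\s*(.*?)\s*```", re.DOTALL)

def extract_json_blocks(response: str) -> list:
    """Return a list with the parsed contents of each JSON block."""
    return [json.loads(block) for block in JSON_BLOCK.findall(response)]

fence = "`" * 3  # built dynamically to avoid literal fences in this example
reply = "\n".join([
    "Here are the QA pairs:",
    fence + "json",
    '[{"question": "What is empathy?", "answer": "Understanding others."}]',
    fence,
])
qa = extract_json_blocks(reply)
```

If a changed system prompt makes the model emit bare JSON without the fences, a parser like this silently returns an empty list, which is why the note above matters.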
- Book file format (corpus format)
  - Currently only the txt format is supported. Put the cleaned book text in the `data` folder; the program will recursively retrieve all txt files in that folder.
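The sentence-based sliding window described above can be sketched as follows (a simplified, hypothetical stand-in for the logic in `get_txt_content`; the real implementation in `util/data_loader.py` may differ):

```python
import re

def sliding_window(text: str, window_size: int = 8, overlap_size: int = 2):
    """Split text into sentences, then yield windows of window_size
    sentences, each overlapping the previous window by overlap_size.
    Simplified sketch; assumes window_size > overlap_size."""
    # Naive split after Chinese/English sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])\s*", text) if s]
    step = window_size - overlap_size
    for start in range(0, max(len(sentences) - overlap_size, 1), step):
        yield " ".join(sentences[start:start + window_size])

chunks = list(sliding_window("One. Two. Three. Four. Five.",
                             window_size=3, overlap_size=1))
```

The overlap means each chunk repeats the tail sentences of the previous chunk, so QA pairs are less likely to miss facts that straddle a window boundary.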
### Step 4: Cleaning QA Pairs
- Purpose of cleaning