Add glm-4-9b-chat fine-tuning documentation (#251)
Commit eb24f4b691
@@ -297,7 +297,7 @@ git clone https://github.com/SmartFlowAI/EmoLLM.git
 | [dream00001](https://github.com/dream00001) | Nankai University, Master's student | | Front-end and back-end development |
 | [王几行XING](https://zhihu.com/people/brycewang1898) | Peking University, Master's graduate | | Data cleaning, LLM fine-tuning, front-end and back-end development |
 | [思在] | Peking University, Master's graduate (Microsoft, USA) | | LLM fine-tuning, front-end and back-end development |
+| [TingWei](https://github.com/wwewwt) | University of Electronic Science and Technology of China, Master's graduate | WeChat official account: AI大模型在手 | Fine-tuning |

 ### Copyright Notice

 This project is licensed under the MIT License; see [LICENSE](https://github.com/SmartFlowAI/EmoLLM/blob/main/LICENSE) for details.
@@ -299,6 +299,7 @@ This project uses Git for version control. You can see the currently available v
 | [dream00001](https://github.com/dream00001) | Nankai University, Master's student | | Front-end and back-end development |
 | [王几行XING](https://zhihu.com/people/brycewang1898) | Peking University, Master's graduate | | Data processing, LLM fine-tuning, front-end and back-end development |
 | [思在] | Peking University, Master's graduate (Microsoft) | | LLM fine-tuning, front-end and back-end development |
+| [TingWei](https://github.com/wwewwt) | University of Electronic Science and Technology of China, Master's graduate | | LLM fine-tuning |

 ### Copyright Notice
doc/GLM-4-9B-chat Lora 微调(llama-factory).md (new file)
@@ -0,0 +1,286 @@
# GLM-4-9B-chat LoRA Fine-Tuning

This document explains how to LoRA-fine-tune the glm-4-9b-chat model with the llama-factory framework. LoRA is a parameter-efficient fine-tuning method; for a deeper look at how it works, see the blog post [知乎|深入浅出 Lora](https://zhuanlan.zhihu.com/p/650197598).

## 1. Environment Preparation

We have tried two platforms; either one works:

* On the [autodl](https://www.autodl.com/) platform, rent a machine with a 24 GB GPU such as an RTX 3090, and select the image `PyTorch` --> `2.0.0` --> `3.8(ubuntu20.04)` --> `11.8`.
* On the [InternStudio](https://studio.intern-ai.org.cn/) platform, select an A100 (1/4) configuration with the `Cuda11.7-conda` image.

Then open a terminal, switch the pip source, and install the dependencies.

### Environment Configuration

With the basic environment configured, you still need to install some third-party libraries, using the following commands:
```bash
python -m pip install --upgrade pip
# Switch to the Tsinghua PyPI mirror to speed up package installation
pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Install LLaMA-Factory
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
# The step above also installs torch, transformers, datasets and the other required dependencies
```
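
If the installation succeeded, the `llamafactory-cli` entry point should now be on your PATH; a quick way to verify this (assuming a current LLaMA-Factory release, which ships a `version` subcommand):

```bash
# Should print the installed LLaMA-Factory version without errors
llamafactory-cli version
```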
## 2. Model Download

Use the `snapshot_download` function from `modelscope` to download the model: the first argument is the model name, and the `cache_dir` argument sets the download path.

Create a `download.py` file under `/root/autodl-tmp`, paste in the code below, and remember to save the file. Then run `python /root/autodl-tmp/download.py` to start the download. The model is about 14 GB, so downloading takes roughly 10-20 minutes.
```python
from modelscope import snapshot_download

# Download glm-4-9b-chat from ModelScope into /root/autodl-tmp
model_dir = snapshot_download('ZhipuAI/glm-4-9b-chat', cache_dir='/root/autodl-tmp', revision='master')
```
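
As a quick sanity check once the download finishes, you can try loading the tokenizer from the downloaded directory (a minimal sketch; it assumes `transformers` is installed and that modelscope placed the files under `cache_dir/ZhipuAI/glm-4-9b-chat`):

```python
from transformers import AutoTokenizer

# glm-4-9b-chat ships custom tokenizer code, so trust_remote_code is required
tokenizer = AutoTokenizer.from_pretrained(
    '/root/autodl-tmp/ZhipuAI/glm-4-9b-chat', trust_remote_code=True)
print(len(tokenizer))  # vocabulary size, confirming the files load correctly
```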
## 3. Building the Instruction Dataset (Alpaca Format)

LLaMA-Factory supports datasets in both the alpaca and the sharegpt format; for this fine-tune we use the alpaca format.

### Supervised Fine-Tuning Data Format

During supervised instruction fine-tuning, the content of the `instruction` column is concatenated with the content of the `input` column to form the human instruction, i.e. the instruction is `instruction\ninput`. The content of the `output` column is the model's answer.

If present, the content of the `system` column is used as the system prompt.

The `history` column is a list of string pairs, each pair holding the instruction and the answer of one earlier turn of the conversation. Note that during supervised fine-tuning, the answers in the history **are also used for training**.
```json
[
  {
    "instruction": "human instruction (required)",
    "input": "human input (optional)",
    "output": "model answer (required)",
    "system": "system prompt (optional)",
    "history": [
      ["instruction of turn 1 (optional)", "answer of turn 1 (optional)"],
      ["instruction of turn 2 (optional)", "answer of turn 2 (optional)"]
    ]
  }
]
```
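
To make the concatenation rule above concrete, here is a toy illustration (the record text is made up) of how the human instruction is assembled from the two columns:

```python
record = {
    "instruction": "请总结下面这段话。",  # made-up example record
    "input": "LoRA 在原权重旁加入低秩矩阵进行训练。",
    "output": "LoRA 用低秩矩阵实现高效微调。",
}
# The human instruction is the two columns joined by a newline
human_instruction = record["instruction"] + "\n" + record["input"]
print(human_instruction)
```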
### Converting Single-Turn Data

Use the following script to convert the [datasets](../datasets/) to alpaca format:
```python
import json
import os
import re

# Choose the dataset to convert
file_name = "single_turn_dataset_1.json"
#file_name = "single_turn_dataset_2.json"

system_prompt = "Put your system prompt here, if you want one"

with open(f'../{file_name}', 'rt', encoding='utf-8') as file:
    data = json.load(file)

converted_data = [{"instruction": item["prompt"],
                   "input": "",
                   "output": item["completion"],
                   "system": system_prompt
                  } for item in data]

for i in range(len(converted_data)):

    # Data cleaning: remove special symbols
    if "🐳" in converted_data[i]["output"]:
        converted_data[i]["output"] = converted_data[i]["output"].replace("🐳", "")

    # Data cleaning: drop self-introductions such as "你好,我是红烧肉",
    # which would damage the model's self-identity
    if '好,我是' in converted_data[i]["output"]:
        converted_data[i]["output"] = converted_data[i]["output"].strip()
        intro_pattern = r"^[^\n]+\n"
        converted_data[i]["output"] = re.sub(intro_pattern, "", converted_data[i]["output"]).strip()

# Make sure the output directory exists before writing
os.makedirs('./processed', exist_ok=True)
with open(f'./processed/{file_name}', 'w', encoding='utf-8') as f:
    json.dump(converted_data, f, ensure_ascii=False, indent=4)
print(f'./processed/{file_name} Done')
```
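
For reference, the list comprehension above expects each source record to carry `prompt` and `completion` keys; a made-up example of the mapping it performs:

```python
# Made-up source record in the shape the script reads
src = {"prompt": "最近总是失眠,怎么办?", "completion": "可以先尝试固定作息、睡前远离手机。"}

# The alpaca record the script would emit for it
alpaca = {
    "instruction": src["prompt"],
    "input": "",
    "output": src["completion"],
    "system": "Put your system prompt here, if you want one",
}
```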
### Converting Multi-Turn Data

Use the following script to convert the [datasets](../datasets/) to alpaca format:
```python
import json
import os

from tqdm import tqdm

# Choose the dataset to convert
file_name = "data.json"
#file_name = "data_pro.json"
#file_name = "multi_turn_dataset_1.json"
#file_name = "multi_turn_dataset_2.json"
#file_name = "aiwei.json"

system_prompt = "Put your system prompt here, if you want one"

with open(f'../{file_name}', 'rt', encoding='utf-8') as file:
    data = json.load(file)

# Walk through the original data and convert each conversation:
# every turn except the last becomes history, and the last turn
# becomes the instruction/output pair
converted_data = []
for item in tqdm(data):
    conversation = item['conversation']
    history = [(c['input'], c['output']) for c in conversation[:-1]]
    last_item = conversation[-1]
    converted_data.append({
        "instruction": last_item['input'],
        "input": "",
        "output": last_item['output'],
        "system": system_prompt,
        "history": history
    })

# Make sure the output directory exists before writing
os.makedirs('./processed', exist_ok=True)
with open(f'./processed/{file_name}', 'w', encoding='utf-8') as f:
    json.dump(converted_data, f, ensure_ascii=False, indent=4)
```
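
For reference, the script assumes each source item holds a `conversation` list of `input`/`output` turns. In a made-up two-turn item like the one below, the first turn would land in `history` and the second would become the final `instruction`/`output` pair:

```json
{
  "conversation": [
    {"input": "最近压力好大。", "output": "愿意和我聊聊发生了什么吗?"},
    {"input": "工作总是加班。", "output": "长期加班确实很消耗人,我们一起想想办法。"}
  ]
}
```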
### Converting Role-Play Data

The code is the same as above; pick the single-turn or multi-turn version depending on the original dataset. Take care to set the appropriate `system_prompt` for each role, for example as in the sketch below.
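
One convenient way to manage the per-role prompts is a mapping from dataset file to role prompt, looked up before conversion. This is only a sketch; the prompt strings below are illustrative placeholders, not the project's actual role prompts:

```python
# Hypothetical mapping from role-play dataset file to its system prompt
ROLE_SYSTEM_PROMPTS = {
    "aiwei.json": "你是温柔耐心的心理咨询师艾薇……",    # placeholder text
    "tiangou.json": "你是卑微体贴的舔狗……",            # placeholder text
    "scientist.json": "你是严谨理性的科学家……",        # placeholder text
}

file_name = "aiwei.json"
system_prompt = ROLE_SYSTEM_PROMPTS[file_name]  # then run the converter as above
```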
### Merging the Datasets

For convenience (to avoid registering too many datasets in LLaMA-Factory), all of the processed alpaca-format datasets (each file is one JSON array) are merged into a single file (one large JSON array). The merge code is as follows:
```python
import json

# Collect all records from every processed file into one list
merged_data = []
file_list = [
    "single_turn_dataset_1.json",
    "single_turn_dataset_2.json",
    "self_cognition_EmoLLM.json",
    "ruozhiba_raw.json",
    "data.json",
    "data_pro.json",
    "multi_turn_dataset_1.json",
    "multi_turn_dataset_2.json",
    "aiwei.json",
    "tiangou.json",
    "SoulStar_data.json",
    "mother_v2.json",
    "scientist.json"
]

# Read each file and append its records
for filename in file_list:
    with open(f"./processed/{filename}", 'r', encoding='utf-8') as file:
        data = json.load(file)
        merged_data.extend(data)

# Write the merged data to a new JSON file
with open('emo_glm4_merged_data.json', 'w', encoding='utf-8') as output_file:
    json.dump(merged_data, output_file, ensure_ascii=False, indent=4)

print("Merge finished; saved to emo_glm4_merged_data.json.")
```
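
Before registering the merged file, a quick structural check can catch conversion mistakes early (a small sketch, not part of the original pipeline): every alpaca record should at least carry the `instruction`, `input`, and `output` keys:

```python
import json

with open('emo_glm4_merged_data.json', encoding='utf-8') as f:
    records = json.load(f)

# Verify that every record has the three required alpaca keys
for i, rec in enumerate(records):
    missing = {"instruction", "input", "output"} - rec.keys()
    assert not missing, f"record {i} is missing keys: {missing}"
print(f"{len(records)} records passed the check")
```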
### Registering the Dataset in LLaMA-Factory

Edit the data/dataset_info.json file in the LLaMA-Factory directory and add the following entry:

```json
"emo_merged": {
  "file_name": "absolute path to the emo_glm4_merged_data.json file"
}
```
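
Since dataset_info.json is easy to break by hand-editing, it is worth re-parsing it once before training (a small sketch, run from the LLaMA-Factory directory):

```python
import json

# Confirm the file is still valid JSON and the new entry is present
with open("data/dataset_info.json", encoding="utf-8") as f:
    info = json.load(f)
print(info["emo_merged"])
```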
## 4. Fine-Tuning the Model

Create a new configuration file emo_glm4_lora_sft.yaml in the LLaMA-Factory directory:

```yaml
### model
model_name_or_path: absolute path to the glm-4-9b-chat model directory

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
# dataset must match the name registered in data/dataset_info.json
dataset: emo_merged
template: glm4
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
# output_dir is where training checkpoints, logs, etc. are saved
output_dir: saves/emo-glm4-epoch10/lora/sft
logging_steps: 10
#save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_strategy: epoch

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 10.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true

### eval
do_eval: false
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 10
```
Run the following commands to start fine-tuning:

```bash
cd LLaMA-Factory
llamafactory-cli train emo_glm4_lora_sft.yaml
```
After training finishes, create a new configuration file emo_glm4_lora_sft_export.yaml in the LLaMA-Factory directory:

```yaml
### model
model_name_or_path: absolute path to the glm-4-9b-chat model directory
# the output_dir used in emo_glm4_lora_sft.yaml above
adapter_name_or_path: saves/emo-glm4-epoch10/lora/sft
template: glm4
finetuning_type: lora

### export
export_dir: models/EmoLLM-glm-4-9b-chat
export_size: 2
export_device: cpu
export_legacy_format: false
```
## 5. Merging the Model

Run the following commands to merge the LoRA weights into the base model:

```bash
cd LLaMA-Factory
llamafactory-cli export emo_glm4_lora_sft_export.yaml
```

The complete LoRA-fine-tuned model is then available in the models/EmoLLM-glm-4-9b-chat directory.

The model weights have been open-sourced: [ModelScope](https://www.modelscope.cn/models/wwewwt/EmoLLM-glm-4-9b-chat/summary)
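
To try the merged model, the standard transformers chat workflow should work (a minimal sketch, assuming `transformers` and `accelerate` are installed and a GPU is available; the prompt text is made up):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "models/EmoLLM-glm-4-9b-chat"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map="auto",
    trust_remote_code=True).eval()

messages = [{"role": "user", "content": "最近压力很大,晚上睡不好,怎么办?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```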