![](../assets/emoxlmdeploy.png)
# Local deployment of LMDeploy
## 1.Environment configuration
<details>
<summary>Specific deployment environment</summary>
Package Version
------------------------- -----------
accelerate 0.27.2
addict 2.4.0
aiofiles 23.2.1
aiohttp 3.9.3
aiosignal 1.3.1
aliyun-python-sdk-core 2.14.0
aliyun-python-sdk-kms 2.16.2
altair 5.2.0
annotated-types 0.6.0
anyio 4.2.0
async-timeout 4.0.3
attrs 23.2.0
blinker 1.7.0
Brotli 1.0.9
cachetools 5.3.3
certifi 2023.11.17
cffi 1.16.0
charset-normalizer 2.0.4
click 8.1.7
contourpy 1.2.0
crcmod 1.7
cryptography 41.0.3
cycler 0.12.1
datasets 2.17.0
dill 0.3.8
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.109.2
ffmpy 0.3.2
filelock 3.13.1
fire 0.5.0
flash-attn 2.4.2
fonttools 4.49.0
frozenlist 1.4.1
fsspec 2023.10.0
fuzzywuzzy 0.18.0
gitdb 4.0.11
GitPython 3.1.42
gmpy2 2.1.2
gradio 3.50.2
gradio_client 0.6.1
h11 0.14.0
httpcore 1.0.3
httpx 0.26.0
huggingface-hub 0.20.3
idna 3.4
importlib-metadata 6.11.0
importlib-resources 6.1.1
Jinja2 3.1.2
jmespath 0.10.0
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lmdeploy 0.2.4
markdown-it-py 3.0.0
MarkupSafe 2.1.1
matplotlib 3.8.3
mdurl 0.1.2
mkl-fft 1.3.8
mkl-random 1.2.4
mkl-service 2.4.0
mmengine-lite 0.10.3
mpmath 1.3.0
multidict 6.0.5
multiprocess 0.70.16
networkx 3.1
ninja 1.11.1.1
numpy 1.26.2
nvidia-cublas-cu11 11.11.3.6
nvidia-cuda-runtime-cu11 11.8.89
nvidia-nccl-cu11 2.19.3
openxlab 0.0.34
orjson 3.9.14
oss2 2.17.0
packaging 23.2
pandas 2.2.0
peft 0.8.2
Pillow 9.5.0
pip 23.3.1
platformdirs 4.2.0
protobuf 4.25.3
psutil 5.9.8
pyarrow 15.0.0
pyarrow-hotfix 0.6
pybind11 2.11.1
pycparser 2.21
pycryptodome 3.20.0
pydantic 2.6.1
pydantic_core 2.16.2
pydeck 0.8.1b0
pydub 0.25.1
Pygments 2.17.2
Pympler 1.0.1
pynvml 11.5.0
pyOpenSSL 23.2.0
pyparsing 3.1.1
PySocks 1.7.1
python-dateutil 2.8.2
python-multipart 0.0.9
pytz 2023.4
pytz-deprecation-shim 0.1.0.post0
PyYAML 6.0.1
referencing 0.33.0
regex 2023.12.25
requests 2.28.2
rich 13.4.2
rpds-py 0.18.0
safetensors 0.4.2
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 60.2.0
shortuuid 1.0.11
six 1.16.0
smmap 5.0.1
sniffio 1.3.0
starlette 0.36.3
streamlit 1.24.0
sudo 1.0.0
sympy 1.11.1
tenacity 8.2.3
termcolor 2.4.0
tiktoken 0.6.0
tokenizers 0.15.2
toml 0.10.2
tomli 2.0.1
toolz 0.12.1
torch 2.0.1
torchaudio 2.0.2
torchvision 0.15.2
tornado 6.4
tqdm 4.65.2
transformers 4.37.1
triton 2.2.0
typing_extensions 4.9.0
tzdata 2024.1
tzlocal 4.3.1
urllib3 1.26.18
uvicorn 0.27.1
validators 0.22.0
watchdog 4.0.0
websockets 11.0.3
wheel 0.41.2
xxhash 3.4.1
yapf 0.40.2
yarl 1.9.4
zipp 3.17.0
</details>
lmdeploy is not installed yet, so we will install it manually next. It is recommended to install the latest stable version. If you are using the InternStudio development environment, run the following commands first, otherwise the installation will fail.
```
# Resolve the "ModuleNotFoundError: No module named 'packaging'" issue
pip install packaging
# Install flash_attn from a precompiled wheel to avoid a slow source build
pip install /root/share/wheels/flash_attn-2.4.2+cu118torch2.0cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
```
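Optionally, you can verify that the precompiled flash-attn wheel imports correctly. This is just a quick sanity check and assumes Python 3.10, CUDA 11.8 and torch 2.0, matching the wheel above:
```
# Should print 2.4.2 if the wheel was installed correctly
python -c "import flash_attn; print(flash_attn.__version__)"
```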
By default, only the runtime dependencies are installed, but here we also need to deploy and quantize models, so we install the [all] extra. You can then inspect the lmdeploy package again, as shown after the install command below.
```
pip install 'lmdeploy[all]==v0.1.0'
```
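A simple way to inspect the installed package is via pip; this is a minimal check, and the reported version will simply reflect whatever was just installed:
```
# Show the installed lmdeploy package metadata
pip show lmdeploy
```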
However, lmdeploy 0.1.0 does not support TurboMind conversion of InternLM2-7B-chat, so the lmdeploy version needs to be upgraded:
```
# We used version 0.2.4 of lmdeploy
pip install --upgrade lmdeploy
```
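To confirm that the upgrade took effect, you can print the version from Python; lmdeploy exposes a `__version__` attribute, and 0.2.4 was the version used when writing this guide:
```
# Should print the upgraded version, e.g. 0.2.4
python -c "import lmdeploy; print(lmdeploy.__version__)"
```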
## 2.Model conversion
To run inference with TurboMind, the model must first be converted into the TurboMind format. Both online and offline conversion are supported: online conversion loads a Huggingface model directly, while offline conversion saves the converted model to disk before loading it.
TurboMind is an efficient engine for LLM inference, built on NVIDIA's FasterTransformer. Its main features include support for LLaMA-style model architectures, a persistent batch inference mode, and a scalable KV cache manager.
### 2.1 Online conversion
lmdeploy supports reading Huggingface model weights directly. Currently, the following types are supported:
- Models quantized with lmdeploy and published on huggingface.co, such as llama2-70b-4bit and internlm-chat-20b-4bit
- Other LM models on huggingface.co, such as Qwen/Qwen-7B-Chat
An example is as follows:
```
# Requires a network environment with access to Huggingface
lmdeploy chat turbomind internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b
lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
```
The two commands above show how to load a Huggingface model directly: the first loads a version quantized with lmdeploy, and the second loads an ordinary LLM.
We can also load a local Huggingface model directly, as shown below.
```
lmdeploy chat turbomind /EmoLLM --model-name internlm2-chat-7b
```
Each of the preceding commands starts a local dialog interface in the terminal, where you can talk to the LLM directly from Bash.
### 2.2 Offline conversion
Offline conversion requires converting the model into lmdeploy's TurboMind format before starting the service, as shown below.
```
# Convert the model to the TurboMind (FasterTransformer-based) format
lmdeploy convert internlm2-chat-7b /EmoLLM
```
Upon completion, a workspace folder will be generated in the current directory, containing the files that TurboMind and Triton need for model inference.
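You can take a quick look at the generated files; the exact layout depends on the lmdeploy version, so treat this as an illustrative check rather than a fixed structure:
```
# List the converted TurboMind artifacts
ls ./workspace
```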
## 3.Run locally
### 3.1 TurboMind Inference + Command line local dialog
Once the model conversion is complete, we have everything needed for inference and can proceed to actual model inference.
Let's try Bash Local Chat first. In this mode, Local Chat calls TurboMind directly instead of going through the API Server; in simple terms, TurboMind is executed directly from the command line, so the actual architecture differs from the diagram shown earlier.
There are several backends to run it with, such as TurboMind, PyTorch, and DeepSpeed. The PyTorch and DeepSpeed backends actually run on top of Huggingface's Transformers package: PyTorch means the native Transformers implementation, while DeepSpeed uses DeepSpeed as the inference framework. The PyTorch/DeepSpeed backends are currently immature and not production-ready, so they are not recommended.
Run the following command.
```
# Turbomind + Bash Local Chat
lmdeploy chat turbomind ./workspace
```
To exit, type exit and press Enter twice. In this setup, the server is the locally running model (TurboMind), and the command line acts as the front end.
### 3.2 TurboMind Inference + API service
In the part above, we started the client directly from the command line. Next, let's see how to use lmdeploy to serve the model as an API.
First, start the service with the following command.
```
lmdeploy serve api_server ./workspace --server-name 0.0.0.0 --server-port ${server_port} --tp 1
```
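Once the server is running, you can send a request to its OpenAI-style chat endpoint. The snippet below is a minimal sketch that assumes the server was started with --server-port 23333 and that the served model is named internlm2-chat-7b; check the actual endpoints and model name against the linked documentation or the /v1/models endpoint.
```
curl http://0.0.0.0:23333/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "internlm2-chat-7b",
        "messages": [{"role": "user", "content": "Hello, who are you?"}]
      }'
```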
For details, please see the [documentation](https://lmdeploy.readthedocs.io/zh-cn/stable/serving/restful_api.html).