[Ascend Inference Optimization] A Deployment Guide for Mooncake on Ascend NPUs
This article resolves the environment mismatch in the official vllm-ascend image and documents the full process of configuring Mooncake with vllm-ascend. It first covers the Mooncake environment setup (installing dependencies, compiling, and starting the service), then resolves the version conflict between vllm and torch by aligning both to compatible versions, and finally validates the result with the lmcache benchmark. The tests show: 1) without Mooncake, the inflection point where TTFT increases matches the NPU memory capacity; 2) with Mooncake, the inflection point matches the Mooncake storage capacity configured for the inference service.
Main problems addressed:
- environment mismatch in the official vllm-ascend image
- Mooncake configuration
- vllm-ascend launch environment configuration
The article ends with an evaluation; the results largely match expectations.
The detailed configuration steps are as follows:
1. Mooncake environment setup
1.1 Installing Mooncake
Mooncake has to be cloned and compiled from source; a direct `pip install` is not usable on this platform.
```bash
git clone https://github.com/kvcache-ai/Mooncake.git
apt-get install mpich libmpich-dev -y
cd Mooncake
bash dependencies.sh -y
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON   # required on Ascend
make -j
make install
```
To verify the installation:

```python
import mooncake
```

No error on import means the install succeeded.
1.2 Starting Mooncake
Start command:

```bash
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.05 --rpc_thread_num 32 --metrics_port 10022
```
Log:

```
WARNING: Logging before InitGoogleLogging() is written to STDERR
W20251205 13:07:34.889658 162445 master.cpp:133] port is deprecated, use rpc_port instead
I20251205 13:07:34.889760 162445 master.cpp:296] Master service started on port 50058, enable_gc=0, max_threads=32, enable_metric_reporting=1, metrics_port=9003, default_kv_lease_ttl=5000, default_kv_soft_pin_ttl=1800000, allow_evict_soft_pinned_objects=1, eviction_ratio=0.05, eviction_high_watermark_ratio=0.9, enable_ha=0, etcd_endpoints=, client_ttl=10, rpc_thread_num=32, rpc_port=50058, rpc_address=0.0.0.0, rpc_conn_timeout_seconds=0, rpc_enable_tcp_no_delay=1, cluster_id=mooncake_cluster, memory_allocator=offset
I20251205 13:07:34.908459 162445 rpc_service.cpp:181] HTTP metrics server started on port 9003
I20251205 13:07:34.908710 162453 rpc_service.cpp:49] Master Metrics: Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
```
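Once the master is up, a quick way to confirm it is accepting connections is to probe its RPC port with a plain TCP connect. This is a minimal sketch, not a Mooncake API; the host is a placeholder and the port matches the start command above:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical host; 50088 is the port passed to mooncake_master above.
    print("master reachable:", port_open("127.0.0.1", 50088))
```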
2. Functional test
Prepare the Mooncake configuration file; it will later be passed to the inference service. Sample:
```json
{
    "local_hostname": "xxxxxx",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "global_segment_size": 62474836480,
    "master_server_address": "xxxxxxx:50088",
    "use_ascend_direct": true
}
```
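A small sanity check before launch can catch a malformed file or missing keys. This sketch only assumes the key names shown in the sample above; the required-key set and the file path are illustrative, not a Mooncake-defined schema:

```python
import json

# Keys taken from the sample config above; treated as required for this check.
REQUIRED_KEYS = {
    "local_hostname",
    "metadata_server",
    "protocol",
    "global_segment_size",
    "master_server_address",
}

def check_mooncake_config(path: str) -> dict:
    """Load the Mooncake JSON config and verify the expected keys exist."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(cfg["global_segment_size"], int):
        raise TypeError("global_segment_size must be an integer byte count")
    return cfg
```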
2.1 vllm-ascend+mooncake
Environment setup
The vllm-ascend 0.11.0rc3 release image already bundles the Mooncake component, but the vllm and vllm-ascend versions inside the image do not match, so the related components must be reinstalled.
```shell
# pip list | grep torch
torch              2.7.1+cpu
torch_npu          2.7.1
torchvision        0.22.1
# pip list | grep vllm
vllm               0.11.2+empty    /vllm-workspace/vllm
vllm_ascend        0.11.0rc3       /vllm-workspace/vllm-ascend
```
vllm needs to be downgraded to 0.11.0. However, downgrading vllm alone still produces an error:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch-npu 2.7.1 requires torch==2.7.1, but you have torch 2.8.0 which is incompatible.
vllm-ascend 0.11.0rc3 requires torch==2.7.1, but you have torch 2.8.0 which is incompatible.
```
The vllm and torch versions are also coupled, so the related packages must be updated together:

```bash
pip install vllm==0.11.0 torch_npu==2.8.0
```
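After reinstalling, the torch / torch_npu pairing can be checked programmatically. A minimal sketch using only package metadata; the pairing rule (torch_npu tracking the same torch version) is inferred from the resolver errors above, not from official torch_npu documentation:

```python
from importlib.metadata import PackageNotFoundError, version

def base_version(v: str) -> str:
    """Strip local version tags like '+cpu' so '2.8.0+cpu' compares as '2.8.0'."""
    return v.split("+")[0]

def check_torch_pairing() -> bool:
    """Compare installed torch and torch_npu base versions for compatibility."""
    try:
        torch_v = base_version(version("torch"))
        npu_v = base_version(version("torch-npu"))
    except PackageNotFoundError as e:
        raise RuntimeError(f"package not installed: {e}")
    return torch_v == npu_v
```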
After installation completes, test with `vllm serve`:
```
vllm serve --help
INFO 12-08 13:40:01 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 12-08 13:40:01 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 12-08 13:40:01 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-08 13:40:01 [__init__.py:207] Platform plugin ascend is activated
WARNING 12-08 13:40:07 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2_5OmniModel is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_omni_thinker:AscendQwen2_5OmniThinkerForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3_2:CustomDeepseekV3ForCausalLM.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
INFO 12-08 13:40:08 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
usage: vllm serve [model_tag] [options]

Launch a local OpenAI-compatible API server to serve LLM completions via HTTP. Defaults to Qwen/Qwen3-0.6B if no model is specified.

Search by using: `--help=<ConfigGroup>` to explore options by section (e.g., --help=ModelConfig, --help=Frontend)
Use `--help=all` to show all available flags at once.

Config Groups:
  positional arguments
  options
  Frontend                 Arguments for the OpenAI-compatible frontend server.
  ModelConfig              Configuration for the model.
  LoadConfig               Configuration for loading the model weights.
  StructuredOutputsConfig  Dataclass which contains structured outputs config for the engine.
  ParallelConfig           Configuration for the distributed execution.
  CacheConfig              Configuration for the KV cache.
  MultiModalConfig         Controls the behavior of multimodal models.
  LoRAConfig               Configuration for LoRA.
  ObservabilityConfig      Configuration for observability - metrics and tracing.
  SchedulerConfig          Scheduler configuration.
  VllmConfig               Dataclass which contains all vllm-related configuration. This simplifies passing around the distinct configurations in the codebase.

For full list: vllm serve --help=all
For a section: vllm serve --help=ModelConfig (case-insensitive)
For a flag: vllm serve --help=max-model-len (_ or - accepted)
Documentation: https://docs.vllm.ai
```
Mooncake configuration file, provisioning 100 GB of SSD capacity:
```json
{
    "local_hostname": "1.1.1.1",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "global_segment_size": 105474836480,
    "master_server_address": "2.2.2.2:50058",
    "use_ascend_direct": true,
    "alloc_in_same_node": true
}
```
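`global_segment_size` is an integer byte count; converting it confirms the capacity the master later reports (98.23 GB in the master log further down). A minimal helper, assuming binary (GiB, 2^30-byte) units:

```python
def bytes_to_gib(n: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n / 2**30

def gib_to_bytes(gib: float) -> int:
    """Convert GiB to the integer byte count expected by global_segment_size."""
    return int(gib * 2**30)

# Segment size from the config above.
print(round(bytes_to_gib(105474836480), 2))  # → 98.23
```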
Running the service:

```bash
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTHONPATH=$PYTHONPATH:/vllm-workspace/vllm
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/opt/files/src/kv-cache/conf/mooncake_vllm.json"
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
vllm serve /opt/models/Qwen2p5-7B-Instruct/ --served-model-name qwen2p5-7b-mooncake --dtype bfloat16 --max-model-len 32768 --tensor-parallel-size 1 --host 0.0.0.0 --port 31001 --enforce-eager --enable-prefix-caching --block-size 128 --max-num-batched-tokens 8192 --gpu-memory-utilization 0.59 --kv-transfer-config '{ "kv_connector": "MooncakeConnectorStoreV1", "kv_role": "kv_both", "kv_connector_extra_config": { "use_layerwise": false, "mooncake_rpc_port": "0", "load_async": true, "register_buffer": true } }'
```
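The `--kv-transfer-config` value is a JSON string; composing it in code avoids shell-quoting mistakes. This sketch simply reproduces the dict from the command above and serializes it (the connector name and keys are taken from that command, not from vllm documentation):

```python
import json

kv_transfer_config = {
    "kv_connector": "MooncakeConnectorStoreV1",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "use_layerwise": False,
        "mooncake_rpc_port": "0",
        "load_async": True,
        "register_buffer": True,
    },
}

# Single-quoted for the shell, as in the serve command above.
cli_arg = "--kv-transfer-config '" + json.dumps(kv_transfer_config) + "'"
print(cli_arg)
```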
Key logs from the startup process:
#### Inference-framework logs
```
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:29 [factory.py:51] Creating v1 connector with name: MooncakeConnectorV1 and engine_id: 3e9a8a19-e986-4353-b826-e25fe09b5146
(EngineCore_DP0 pid=2392) WARNING 12-08 14:01:29 [base.py:86] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20251208 14:01:29.602275 2392 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20251208 14:01:29.602348 2392 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 30.189.250.94 port: 12001
I20251208 14:01:29.602769 2392 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 30.189.250.94:16341
I20251208 14:01:29.602919 2392 ascend_direct_transport.cpp:86] install AscendDirectTransport for: 30.189.250.94:16341
I20251208 14:01:29.602973 2392 ascend_direct_transport.cpp:477] Find available between 26000 and 27000
I20251208 14:01:29.603039 2392 ascend_direct_transport.cpp:442] AscendDirectTransport set segment desc: host_ip=30.189.250.94, host_port=26957, deviceLogicId=0
I20251208 14:01:29.603081 2392 ascend_direct_transport.cpp:164] Set adxl.BufferPool to:4:8
I20251208 14:01:29.611251 2392 ascend_direct_transport.cpp:177] Success to initialize adxl engine:30.189.250.94:26957 with device_id:0
I20251208 14:01:29.611310 2392 ascend_direct_transport.cpp:186] Set connection timeout to:10000
I20251208 14:01:29.611330 2392 ascend_direct_transport.cpp:195] Set transfer timeout to:10000
I20251208 14:01:29.613250 2591 ascend_direct_transport.cpp:512] AscendDirectTransport worker thread started
I20251208 14:01:29.613384 2392 client_metric.cpp:76] Client metrics enabled (default enabled)
....
....
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [worker_v1.py:256] Available memory: 21181885132, total memory: 65452113920
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [kv_cache_utils.py:1087] GPU KV cache size: 369,280 tokens
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 11.27x
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [mooncake_engine.py:102] num_blocks: 2975, block_shape: torch.Size([128, 4, 128])
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [mooncake_engine.py:105] Registering KV_Caches. use_mla: False, shape torch.Size([2885, 128, 4, 128])
```
#### Mooncake master logs
After the inference framework has started successfully:
```
I20251208 14:01:26.540841 484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 0 B | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
I20251208 14:01:29.618252 490 master_service.cpp:651] Storage root directory or cluster ID is not set. persisting data is disabled.
I20251208 14:01:36.541100 484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 98.23 GB (0.0%) | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
I20251208 14:01:46.541324 484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 98.23 GB (0.0%) | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
```
3. Evaluation
Using the long-doc-qa method from the lmcache benchmark, build a dataset and call the same service twice, measuring TTFT latency.
Model: Qwen2.5-7B
Test-set characteristics: input length 10,000 tokens, output length 50 tokens.
20 GB of storage holds the KV cache for roughly 370K tokens, i.e. about 35 samples.
Tool: KV cache calculator
https://docs.lmcache.ai/getting_started/kv_cache_calculator.html
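The 370K-token figure can be reproduced by hand. For Qwen2.5-7B the sketch below assumes 28 layers, 4 KV heads (GQA), head dim 128, and bf16 (2 bytes); these model parameters come from the public model config, not from this article. Each token then costs 2 (K and V) × layers × kv_heads × head_dim × dtype bytes:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache cost of one token: K and V tensors per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Qwen2.5-7B geometry: 28 layers, 4 KV heads, head_dim 128, bf16.
per_token = kv_bytes_per_token(28, 4, 128)   # 57344 B = 56 KiB per token
tokens_in_20gb = 20 * 2**30 // per_token     # ≈ 374K tokens
samples = tokens_in_20gb // 10_000           # ≈ 37 samples at 10K input tokens
print(per_token, tokens_in_20gb, samples)
```

This agrees with the startup log ("GPU KV cache size: 369,280 tokens" from ~20 GB of available NPU memory) and with the ~35-sample estimate above.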
3.1 vllm-ascend+mooncake
TTFT (seconds) by number of samples:

| Config | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| HBM-20GB | 0.4104 | 0.3742 | 0.3371 | 3.0918 | 3.1023 | 2.9289 | 2.8651 | 2.8182 | 2.7941 | 2.7379 | 2.6827 |
| HBM-40GB | 0.4176 | 0.3368 | 0.2921 | 0.3228 | 0.3171 | 0.2584 | 0.2509 | 2.7229 | 2.7128 | 2.6875 | 2.6654 |
| HBM-20GB+mooncake-20GB | 0.5016 | 0.4471 | 0.4231 | 3.6502 | 3.158 | 3.1518 | 3.0134 | 3.1044 | 2.9536 | 2.9289 | 2.9066 |
| HBM-20GB+mooncake-40GB | 0.5126 | 0.5122 | 0.4367 | 0.5718 | 0.5514 | 0.5507 | 3.027 | 3.1057 | 2.8675 | 2.8953 | 2.8716 |
| HBM-20GB+mooncake-60GB | 0.5043 | 0.4528 | 0.425 | 0.5332 | 0.5146 | 0.5031 | 0.5487 | 0.5292 | 0.5367 | 0.5358 | 2.8285 |
Interpretation:
1. Without Mooncake, the inflection point where TTFT increases matches expectations: it coincides with the number of samples the NPU memory can hold.
2. With Mooncake, the inflection point likewise matches expectations: it coincides with the number of samples the Mooncake storage configured for the inference service can hold.
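The interpretation above can be cross-checked numerically: dividing each capacity by the per-sample KV cache cost predicts where each row of the table should jump. The 57,344 bytes/token constant comes from the assumed Qwen2.5-7B geometry discussed earlier (28 layers × 4 KV heads × 128 head dim × 2 bytes × K and V):

```python
def predicted_inflection(capacity_gb: int, tokens_per_sample: int = 10_000,
                         kv_bytes_per_token: int = 57_344) -> int:
    """Number of cached samples a capacity can hold before eviction begins."""
    return capacity_gb * 2**30 // (kv_bytes_per_token * tokens_per_sample)

for gb in (20, 40, 60):
    print(gb, "GB ->", predicted_inflection(gb), "samples")
```

The predictions (37, 74, and 112 samples) line up with the table: HBM-20GB jumps between 30 and 40 samples, HBM-40GB between 70 and 80, and HBM-20GB+mooncake-60GB only after 110.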