[Ascend Inference Optimization] A Deployment Guide for Mooncake on Ascend NPUs
This article resolves the environment mismatch in the official vllm-ascend image and documents the full process of configuring Mooncake with vllm-ascend. It first covers the Mooncake environment setup (installing dependencies, compiling, and starting the service), then resolves the version conflict between vllm and torch by aligning both to compatible versions, and finally validates the result with the lmcache benchmark. The tests show: 1) without Mooncake, the inflection point where TTFT increases matches the NPU memory capacity; 2) with Mooncake, the inflection point matches the Mooncake storage capacity configured for the inference service.
Main problems addressed:
- environment mismatch in the official vllm-ascend image
- Mooncake configuration
- vllm-ascend launch environment configuration
The article ends with an evaluation; the results largely match expectations.
The detailed configuration steps are as follows:
1. Mooncake environment setup
1.1 Installing Mooncake
Mooncake has to be cloned and compiled from source; a direct `pip install` is not usable on this platform.
```bash
git clone https://github.com/kvcache-ai/Mooncake.git
apt-get install mpich libmpich-dev -y
cd Mooncake
bash dependencies.sh -y
mkdir build
cd build
cmake .. -DUSE_ASCEND_DIRECT=ON   # required on Ascend
make -j
make install
```
To verify the installation:

```python
import mooncake
```

No error on import means the install succeeded.
1.2 Starting Mooncake
Start command:

```bash
mooncake_master --port 50088 --eviction_high_watermark_ratio 0.9 --eviction_ratio 0.05 --rpc_thread_num 32 --metrics_port 10022
```
Log:

```
WARNING: Logging before InitGoogleLogging() is written to STDERR
W20251205 13:07:34.889658 162445 master.cpp:133] port is deprecated, use rpc_port instead
I20251205 13:07:34.889760 162445 master.cpp:296] Master service started on port 50058, enable_gc=0, max_threads=32, enable_metric_reporting=1, metrics_port=9003, default_kv_lease_ttl=5000, default_kv_soft_pin_ttl=1800000, allow_evict_soft_pinned_objects=1, eviction_ratio=0.05, eviction_high_watermark_ratio=0.9, enable_ha=0, etcd_endpoints=, client_ttl=10, rpc_thread_num=32, rpc_port=50058, rpc_address=0.0.0.0, rpc_conn_timeout_seconds=0, rpc_enable_tcp_no_delay=1, cluster_id=mooncake_cluster, memory_allocator=offset
I20251205 13:07:34.908459 162445 rpc_service.cpp:181] HTTP metrics server started on port 9003
I20251205 13:07:34.908710 162453 rpc_service.cpp:49] Master Metrics: Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
```
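Once the master is up, a quick way to confirm it is accepting connections is to probe its RPC port with a plain TCP connect. This is a minimal sketch, not a Mooncake API; the host is a placeholder and the port matches the start command above:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical host; 50088 is the port passed to mooncake_master above.
    print("master reachable:", port_open("127.0.0.1", 50088))
```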
2. Functional test
Prepare the Mooncake configuration file; it will later be passed to the inference service. Sample:
```json
{
    "local_hostname": "xxxxxx",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "global_segment_size": 62474836480,
    "master_server_address": "xxxxxxx:50088",
    "use_ascend_direct": true
}
```
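A small sanity check before launch can catch a malformed file or missing keys. This sketch only assumes the key names shown in the sample above; the required-key set and the file path are illustrative, not a Mooncake-defined schema:

```python
import json

# Keys taken from the sample config above; treated as required for this check.
REQUIRED_KEYS = {
    "local_hostname",
    "metadata_server",
    "protocol",
    "global_segment_size",
    "master_server_address",
}

def check_mooncake_config(path: str) -> dict:
    """Load the Mooncake JSON config and verify the expected keys exist."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(cfg["global_segment_size"], int):
        raise TypeError("global_segment_size must be an integer byte count")
    return cfg
```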
2.1 vllm-ascend+mooncake
Environment setup
The vllm-ascend 0.11.0rc3 release image already bundles the Mooncake component, but the vllm and vllm-ascend versions inside the image do not match, so the related components must be reinstalled.
```shell
# pip list | grep torch
torch              2.7.1+cpu
torch_npu          2.7.1
torchvision        0.22.1
# pip list | grep vllm
vllm               0.11.2+empty    /vllm-workspace/vllm
vllm_ascend        0.11.0rc3       /vllm-workspace/vllm-ascend
```
vllm needs to be downgraded to 0.11.0. However, downgrading vllm alone still produces an error:
```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torch-npu 2.7.1 requires torch==2.7.1, but you have torch 2.8.0 which is incompatible.
vllm-ascend 0.11.0rc3 requires torch==2.7.1, but you have torch 2.8.0 which is incompatible.
```
The vllm and torch versions are also coupled, so the related packages must be updated together:

```bash
pip install vllm==0.11.0 torch_npu==2.8.0
```
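After reinstalling, the torch / torch_npu pairing can be checked programmatically. A minimal sketch using only package metadata; the pairing rule (torch_npu tracking the same torch version) is inferred from the resolver errors above, not from official torch_npu documentation:

```python
from importlib.metadata import PackageNotFoundError, version

def base_version(v: str) -> str:
    """Strip local version tags like '+cpu' so '2.8.0+cpu' compares as '2.8.0'."""
    return v.split("+")[0]

def check_torch_pairing() -> bool:
    """Compare installed torch and torch_npu base versions for compatibility."""
    try:
        torch_v = base_version(version("torch"))
        npu_v = base_version(version("torch-npu"))
    except PackageNotFoundError as e:
        raise RuntimeError(f"package not installed: {e}")
    return torch_v == npu_v
```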
After installation completes, test with `vllm serve`:
```
vllm serve --help
INFO 12-08 13:40:01 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 12-08 13:40:01 [__init__.py:38] - ascend -> vllm_ascend:register
INFO 12-08 13:40:01 [__init__.py:41] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 12-08 13:40:01 [__init__.py:207] Platform plugin ascend is activated
WARNING 12-08 13:40:07 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libcudart.so.12: cannot open shared object file: No such file or directory')
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3VLMoeForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLMoeForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl_without_padding:AscendQwen3VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen2_5OmniModel is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_omni_thinker:AscendQwen2_5OmniThinkerForConditionalGeneration.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture DeepseekV32ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v3_2:CustomDeepseekV3ForCausalLM.
WARNING 12-08 13:40:07 [registry.py:582] Model architecture Qwen3NextForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen3_next:CustomQwen3NextForCausalLM.
INFO 12-08 13:40:08 [importing.py:63] Triton not installed or not compatible; certain GPU-related functions will not be available.
usage: vllm serve [model_tag] [options]

Launch a local OpenAI-compatible API server to serve LLM completions via HTTP. Defaults to Qwen/Qwen3-0.6B if no model is specified.

Search by using: `--help=<ConfigGroup>` to explore options by section (e.g., --help=ModelConfig, --help=Frontend)
Use `--help=all` to show all available flags at once.

Config Groups:
  positional arguments
  options
  Frontend                 Arguments for the OpenAI-compatible frontend server.
  ModelConfig              Configuration for the model.
  LoadConfig               Configuration for loading the model weights.
  StructuredOutputsConfig  Dataclass which contains structured outputs config for the engine.
  ParallelConfig           Configuration for the distributed execution.
  CacheConfig              Configuration for the KV cache.
  MultiModalConfig         Controls the behavior of multimodal models.
  LoRAConfig               Configuration for LoRA.
  ObservabilityConfig      Configuration for observability - metrics and tracing.
  SchedulerConfig          Scheduler configuration.
  VllmConfig               Dataclass which contains all vllm-related configuration. This simplifies passing around the distinct configurations in the codebase.

For full list: vllm serve --help=all
For a section: vllm serve --help=ModelConfig (case-insensitive)
For a flag: vllm serve --help=max-model-len (_ or - accepted)
Documentation: https://docs.vllm.ai
```
Mooncake configuration file, provisioning 100 GB of SSD capacity:
```json
{
    "local_hostname": "1.1.1.1",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "global_segment_size": 105474836480,
    "master_server_address": "2.2.2.2:50058",
    "use_ascend_direct": true,
    "alloc_in_same_node": true
}
```
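`global_segment_size` is an integer byte count; converting it confirms the capacity the master later reports (98.23 GB in the master log further down). A minimal helper, assuming binary (GiB, 2^30-byte) units:

```python
def bytes_to_gib(n: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n / 2**30

def gib_to_bytes(gib: float) -> int:
    """Convert GiB to the integer byte count expected by global_segment_size."""
    return int(gib * 2**30)

# Segment size from the config above.
print(round(bytes_to_gib(105474836480), 2))  # → 98.23
```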
Running the service:

```bash
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export PYTHONPATH=$PYTHONPATH:/vllm-workspace/vllm
export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/python/site-packages:$LD_LIBRARY_PATH
export MOONCAKE_CONFIG_PATH="/opt/files/src/kv-cache/conf/mooncake_vllm.json"
export ACL_OP_INIT_MODE=1
export ASCEND_BUFFER_POOL=4:8
export ASCEND_CONNECT_TIMEOUT=10000
export ASCEND_TRANSFER_TIMEOUT=10000
vllm serve /opt/models/Qwen2p5-7B-Instruct/ --served-model-name qwen2p5-7b-mooncake --dtype bfloat16 --max-model-len 32768 --tensor-parallel-size 1 --host 0.0.0.0 --port 31001 --enforce-eager --enable-prefix-caching --block-size 128 --max-num-batched-tokens 8192 --gpu-memory-utilization 0.59 --kv-transfer-config '{ "kv_connector": "MooncakeConnectorStoreV1", "kv_role": "kv_both", "kv_connector_extra_config": { "use_layerwise": false, "mooncake_rpc_port": "0", "load_async": true, "register_buffer": true } }'
```
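The `--kv-transfer-config` value is a JSON string; composing it in code avoids shell-quoting mistakes. This sketch simply reproduces the dict from the command above and serializes it (the connector name and keys are taken from that command, not from vllm documentation):

```python
import json

kv_transfer_config = {
    "kv_connector": "MooncakeConnectorStoreV1",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {
        "use_layerwise": False,
        "mooncake_rpc_port": "0",
        "load_async": True,
        "register_buffer": True,
    },
}

# Single-quoted for the shell, as in the serve command above.
cli_arg = "--kv-transfer-config '" + json.dumps(kv_transfer_config) + "'"
print(cli_arg)
```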
Key logs from the startup process:
#### Inference-framework logs
```
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:29 [factory.py:51] Creating v1 connector with name: MooncakeConnectorV1 and engine_id: 3e9a8a19-e986-4353-b826-e25fe09b5146
(EngineCore_DP0 pid=2392) WARNING 12-08 14:01:29 [base.py:86] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20251208 14:01:29.602275 2392 transfer_engine.cpp:486] Metrics reporting is disabled (set MC_TE_METRIC=1 to enable)
I20251208 14:01:29.602348 2392 transfer_engine.cpp:91] Transfer Engine parseHostNameWithPort. server_name: 30.189.250.94 port: 12001
I20251208 14:01:29.602769 2392 transfer_engine.cpp:146] Transfer Engine RPC using P2P handshake, listening on 30.189.250.94:16341
I20251208 14:01:29.602919 2392 ascend_direct_transport.cpp:86] install AscendDirectTransport for: 30.189.250.94:16341
I20251208 14:01:29.602973 2392 ascend_direct_transport.cpp:477] Find available between 26000 and 27000
I20251208 14:01:29.603039 2392 ascend_direct_transport.cpp:442] AscendDirectTransport set segment desc: host_ip=30.189.250.94, host_port=26957, deviceLogicId=0
I20251208 14:01:29.603081 2392 ascend_direct_transport.cpp:164] Set adxl.BufferPool to:4:8
I20251208 14:01:29.611251 2392 ascend_direct_transport.cpp:177] Success to initialize adxl engine:30.189.250.94:26957 with device_id:0
I20251208 14:01:29.611310 2392 ascend_direct_transport.cpp:186] Set connection timeout to:10000
I20251208 14:01:29.611330 2392 ascend_direct_transport.cpp:195] Set transfer timeout to:10000
I20251208 14:01:29.613250 2591 ascend_direct_transport.cpp:512] AscendDirectTransport worker thread started
I20251208 14:01:29.613384 2392 client_metric.cpp:76] Client metrics enabled (default enabled)
....
....
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [worker_v1.py:256] Available memory: 21181885132, total memory: 65452113920
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [kv_cache_utils.py:1087] GPU KV cache size: 369,280 tokens
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [kv_cache_utils.py:1091] Maximum concurrency for 32,768 tokens per request: 11.27x
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [mooncake_engine.py:102] num_blocks: 2975, block_shape: torch.Size([128, 4, 128])
(EngineCore_DP0 pid=2392) INFO 12-08 14:01:39 [mooncake_engine.py:105] Registering KV_Caches. use_mla: False, shape torch.Size([2885, 128, 4, 128])
```
#### Mooncake master logs
After the inference framework has started successfully:
```
I20251208 14:01:26.540841 484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 0 B | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
I20251208 14:01:29.618252 490 master_service.cpp:651] Storage root directory or cluster ID is not set. persisting data is disabled.
I20251208 14:01:36.541100 484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 98.23 GB (0.0%) | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
I20251208 14:01:46.541324 484 rpc_service.cpp:40] Master Metrics: Mem Storage: 0 B / 98.23 GB (0.0%) | SSD Storage: 0 B / 0 B | Keys: 0 (soft-pinned: 0) | Requests (Success/Total): PutStart=0/0, PutEnd=0/0, PutRevoke=0/0, Get=0/0, Exist=0/0, Del=0/0, DelAll=0/0, | Batch Requests (Req=Success/PartialSuccess/Total, Item=Success/Total): PutStart:(Req=0/0/0, Item=0/0), PutEnd:(Req=0/0/0, Item=0/0), PutRevoke:(Req=0/0/0, Item=0/0), Get:(Req=0/0/0, Item=0/0), ExistKey:(Req=0/0/0, Item=0/0), | Eviction: Success/Attempts=0/0, keys=0, size=0 B
```
3. Evaluation
Using the long-doc-qa method from the lmcache benchmark, build a dataset and call the same service twice, measuring TTFT latency.
Model: Qwen2.5-7B
Test-set characteristics: input length 10,000 tokens, output length 50 tokens.
20 GB of storage holds the KV cache for roughly 370K tokens, i.e. about 35 samples.
Tool: KV cache calculator
https://docs.lmcache.ai/getting_started/kv_cache_calculator.html
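The 370K-token figure can be reproduced by hand. For Qwen2.5-7B the sketch below assumes 28 layers, 4 KV heads (GQA), head dim 128, and bf16 (2 bytes); these model parameters come from the public model config, not from this article. Each token then costs 2 (K and V) × layers × kv_heads × head_dim × dtype bytes:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV cache cost of one token: K and V tensors per layer, per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Qwen2.5-7B geometry: 28 layers, 4 KV heads, head_dim 128, bf16.
per_token = kv_bytes_per_token(28, 4, 128)   # 57344 B = 56 KiB per token
tokens_in_20gb = 20 * 2**30 // per_token     # ≈ 374K tokens
samples = tokens_in_20gb // 10_000           # ≈ 37 samples at 10K input tokens
print(per_token, tokens_in_20gb, samples)
```

This agrees with the startup log ("GPU KV cache size: 369,280 tokens" from ~20 GB of available NPU memory) and with the ~35-sample estimate above.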
3.1 vllm-ascend+mooncake
TTFT (seconds) by number of samples:

| Config | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 | 110 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| HBM-20GB | 0.4104 | 0.3742 | 0.3371 | 3.0918 | 3.1023 | 2.9289 | 2.8651 | 2.8182 | 2.7941 | 2.7379 | 2.6827 |
| HBM-40GB | 0.4176 | 0.3368 | 0.2921 | 0.3228 | 0.3171 | 0.2584 | 0.2509 | 2.7229 | 2.7128 | 2.6875 | 2.6654 |
| HBM-20GB+mooncake-20GB | 0.5016 | 0.4471 | 0.4231 | 3.6502 | 3.158 | 3.1518 | 3.0134 | 3.1044 | 2.9536 | 2.9289 | 2.9066 |
| HBM-20GB+mooncake-40GB | 0.5126 | 0.5122 | 0.4367 | 0.5718 | 0.5514 | 0.5507 | 3.027 | 3.1057 | 2.8675 | 2.8953 | 2.8716 |
| HBM-20GB+mooncake-60GB | 0.5043 | 0.4528 | 0.425 | 0.5332 | 0.5146 | 0.5031 | 0.5487 | 0.5292 | 0.5367 | 0.5358 | 2.8285 |
Interpretation:
1. Without Mooncake, the inflection point where TTFT increases matches expectations: it coincides with the number of samples the NPU memory can hold.
2. With Mooncake, the inflection point likewise matches expectations: it coincides with the number of samples the Mooncake storage configured for the inference service can hold.
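The interpretation above can be cross-checked numerically: dividing each capacity by the per-sample KV cache cost predicts where each row of the table should jump. The 57,344 bytes/token constant comes from the assumed Qwen2.5-7B geometry discussed earlier (28 layers × 4 KV heads × 128 head dim × 2 bytes × K and V):

```python
def predicted_inflection(capacity_gb: int, tokens_per_sample: int = 10_000,
                         kv_bytes_per_token: int = 57_344) -> int:
    """Number of cached samples a capacity can hold before eviction begins."""
    return capacity_gb * 2**30 // (kv_bytes_per_token * tokens_per_sample)

for gb in (20, 40, 60):
    print(gb, "GB ->", predicted_inflection(gb), "samples")
```

The predictions (37, 74, and 112 samples) line up with the table: HBM-20GB jumps between 30 and 40 samples, HBM-40GB between 70 and 80, and HBM-20GB+mooncake-60GB only after 110.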