This article describes how to deploy DeepSeek-V4-Flash-W8A8 with vLLM-Ascend on two Ascend servers with 8 NPUs each.

I. Model

Model path: /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp
(70 safetensors shards, 280 GB in total; download the model on both machines in advance)

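II. Step 1: Preparation (on both machines)

1.1 Check the network

# Confirm that the two machines can reach each other over the private network
ping -c 2 {Node0 private IP}   # run on Node1: ping Node0
ping -c 2 {Node1 private IP}   # run on Node0: ping Node1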

1.2 Download the model (required on both machines)

modelscope download --model Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp

Default download location: /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp
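Since the model must be complete on both machines before startup, it can help to confirm the shard count on each node. The following is a minimal sketch, assuming the shards sit directly in the model directory with a .safetensors extension; the function names count_shards and check_shards are hypothetical helpers, not part of modelscope.

```shell
# count_shards DIR: print the number of *.safetensors entries directly under DIR.
count_shards() {
  find "$1" -maxdepth 1 -name '*.safetensors' | wc -l | tr -d ' '
}

# check_shards DIR EXPECTED: succeed only if the shard count equals EXPECTED.
check_shards() {
  local actual
  actual="$(count_shards "$1")"
  if [ "$actual" -eq "$2" ]; then
    echo "OK: $actual shards in $1"
  else
    echo "MISMATCH: found $actual shards in $1, expected $2" >&2
    return 1
  fi
}
```

For this model, the call would be check_shards /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp 70, run once on each machine.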

1.3 Pull the image (on both machines)

# If the official quay.io registry is slow, pull through a mirror and retag
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.13.0rc3
docker tag m.daocloud.io/quay.io/ascend/vllm-ascend:v0.13.0rc3 quay.io/ascend/vllm-ascend:v0.13.0rc3

Image: quay.io/ascend/vllm-ascend:v0.13.0rc3


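III. Step 2: Start the container (on both machines)

Suggested container names

  • Node0: vllm-ascend-deepseek-v4-node0
  • Node1: vllm-ascend-deepseek-v4-node1

The commands in this article use the same name, vllm-ascend-deepseek-v4, on both machines. If you adopt the per-node names above, change --name here and the container name in the docker exec commands of Step 3 to match.

Full docker run command (identical on both machines)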

docker run -d \
  --name vllm-ascend-deepseek-v4 \
  --net=host \
  --shm-size=1g \
  --device /dev/davinci0 --device /dev/davinci1 --device /dev/davinci2 --device /dev/davinci3 \
  --device /dev/davinci4 --device /dev/davinci5 --device /dev/davinci6 --device /dev/davinci7 \
  --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /root/.cache:/root/.cache \
  quay.io/ascend/vllm-ascend:v0.13.0rc3 bash -c "sleep infinity"
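The docker run command above maps /dev/davinci0 through /dev/davinci7, so it is worth confirming that all eight device nodes exist on the host first. A minimal sketch follows; missing_npu_devices is a hypothetical helper, and the directory and count arguments exist only to make it easy to test.

```shell
# missing_npu_devices [DEV_DIR] [COUNT]: print every expected davinciN node
# that is absent from DEV_DIR (defaults: /dev and 8 devices).
missing_npu_devices() {
  local dev_dir="${1:-/dev}"
  local count="${2:-8}"
  local i=0
  while [ "$i" -lt "$count" ]; do
    [ -e "$dev_dir/davinci$i" ] || echo "davinci$i"
    i=$((i + 1))
  done
}
```

Running missing_npu_devices with no arguments on each host prints nothing when all eight devices are present.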

IV. Step 3: Start vLLM (key points: startup order and environment variables)

Core environment variables (set on both machines)

export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1

Multi-node communication environment variables for DP=2 (set on both machines)

export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP="<this node's private IP on bond1>"      # Node0: 10.25.66.9  Node1: 10.25.66.11
export GLOO_SOCKET_IFNAME="bond1"
export TP_SOCKET_IFNAME="bond1"
export HCCL_SOCKET_IFNAME="bond1"
export HCCL_BUFFSIZE=200               
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
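HCCL_IF_IP is the only variable in this list whose value differs between the two nodes. Rather than typing it by hand, it can be derived from the bond1 interface. The sketch below assumes the iproute2 ip command is available and that bond1 carries a single IPv4 address; first_ipv4 is a hypothetical helper name.

```shell
# first_ipv4: read the output of `ip -o -4 addr show dev IFACE` on stdin and
# print the first IPv4 address without its prefix length.
first_ipv4() {
  awk '$3 == "inet" { split($4, parts, "/"); print parts[1]; exit }'
}
```

A possible use is export HCCL_IF_IP="$(ip -o -4 addr show dev bond1 | first_ipv4)", which should give 10.25.66.9 on Node0 and 10.25.66.11 on Node1 in the setup described here.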

Startup order: start Node1 (worker) first, then start Node0 (master) about 5 seconds later.


4.1 Start Node1 (worker, rank=1) first

Run on Node1. The export lines are placed inside the bash -c string because docker exec does not pass the host shell's environment into the container:

docker exec -d vllm-ascend-deepseek-v4 bash -c '
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP="{Node1 private IP}"
export GLOO_SOCKET_IFNAME="bond1"
export TP_SOCKET_IFNAME="bond1"
export HCCL_SOCKET_IFNAME="bond1"
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10

vllm serve /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp \
 --host 0.0.0.0 \
 --port 8005 \
 --headless \
 --data-parallel-size 2 \
 --data-parallel-size-local 1 \
 --data-parallel-start-rank 1 \
 --data-parallel-address 10.25.66.9 \
 --data-parallel-rpc-port 13389 \
 --tensor-parallel-size 8 \
 --quantization ascend \
 --seed 1024 \
 --served-model-name ds \
 --max-num-seqs 64 \
 --max-model-len 131072 \
 --max-num-batched-tokens 8192 \
 --trust-remote-code \
 --chat-template /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
 --async-scheduling \
 --no-enable-prefix-caching \
 --gpu-memory-utilization 0.94 \
 --compilation-config "{\"cudagraph_mode\": \"FULL_DECODE_ONLY\"}" \
 --additional-config "{\"enable_cpu_binding\": \"true\", \"multistream_overlap_shared_expert\": true}" \
 --speculative-config "{\"num_speculative_tokens\": 3, \"method\": \"deepseek_mtp\"}"
'

Wait 5 seconds, then start Node0.


4.2 Start Node0 (master, rank=0) 5 seconds later

Run on Node0. The export lines are placed inside the bash -c string because docker exec does not pass the host shell's environment into the container:

docker exec -d vllm-ascend-deepseek-v4 bash -c '
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP="{Node0 private IP}"
export GLOO_SOCKET_IFNAME="bond1"
export TP_SOCKET_IFNAME="bond1"
export HCCL_SOCKET_IFNAME="bond1"
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10

vllm serve /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp \
 --host 0.0.0.0 \
 --port 8005 \
 --data-parallel-size 2 \
 --data-parallel-size-local 1 \
 --data-parallel-start-rank 0 \
 --data-parallel-address 10.25.66.9 \
 --data-parallel-rpc-port 13389 \
 --tensor-parallel-size 8 \
 --quantization ascend \
 --seed 1024 \
 --served-model-name ds \
 --max-num-seqs 64 \
 --max-model-len 131072 \
 --max-num-batched-tokens 8192 \
 --trust-remote-code \
 --chat-template /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
 --async-scheduling \
 --no-enable-prefix-caching \
 --gpu-memory-utilization 0.94 \
 --compilation-config "{\"cudagraph_mode\": \"FULL_DECODE_ONLY\"}" \
 --additional-config "{\"enable_cpu_binding\": \"true\", \"multistream_overlap_shared_expert\": true}" \
 --speculative-config "{\"num_speculative_tokens\": 3, \"method\": \"deepseek_mtp\"}"
'

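Note: the Node0 command has no --headless flag. Node0 runs the API server, while the headless Node1 only runs engine workers.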
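The two vllm serve commands are identical except for --headless and --data-parallel-start-rank. To make that difference explicit, the node-specific part can be isolated as in the sketch below; node_specific_flags is a hypothetical helper, and the flag values are the ones used in this article.

```shell
# node_specific_flags RANK: print the only vllm serve flags that differ
# between the two nodes in this deployment.
node_specific_flags() {
  case "$1" in
    0) echo "--data-parallel-start-rank 0" ;;
    1) echo "--headless --data-parallel-start-rank 1" ;;
    *) echo "unsupported rank: $1" >&2; return 1 ;;
  esac
}
```

All remaining flags can then live in one shared list, which reduces the risk of the two commands drifting apart when a parameter is tuned.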


V. Step 4: Verification

# Run from either machine; the API server listens on Node0
curl http://{Node0 private IP}:8005/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
   "model": "ds",
   "messages": [{"role": "user", "content": "Hello, who are you?"}],
   "max_tokens": 100,
   "temperature": 0
 }'

A reply that begins with "I'm DeepSeek..." indicates success.
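Because the server is started detached with docker exec -d, the request above fails until the model has finished loading. A minimal retry helper is sketched below; wait_until is a hypothetical name, and the timeout is only approximate because the time spent inside the probed command is not counted.

```shell
# wait_until TIMEOUT INTERVAL CMD...: rerun CMD every INTERVAL seconds until it
# succeeds; give up once roughly TIMEOUT seconds of sleeping have accumulated.
wait_until() {
  local timeout="$1"
  local interval="$2"
  local waited=0
  shift 2
  while ! "$@" >/dev/null 2>&1; do
    waited=$((waited + interval))
    if [ "$waited" -gt "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}
```

One way to use it is wait_until 1800 10 curl -sf http://{Node0 private IP}:8005/health, which polls the health endpoint of the vLLM server; the 1800-second and 10-second values are assumptions to adjust to the actual load time.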

Function call verification

curl http://{Node0 private IP}:8005/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
   "model": "ds",
   "messages": [{"role": "user", "content": "What is the weather in Shanghai? Use the get_weather tool."}],
   "tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get current weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"]}}}],
   "max_tokens": 200,
   "temperature": 0
 }'
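In the OpenAI-compatible response format, a successful function call appears as a tool_calls array in choices[0].message rather than as plain content. The sketch below is a crude text check for that; called_tool is a hypothetical helper, it relies on grep instead of a JSON parser, and the sample response shape used to exercise it is an assumption rather than captured output.

```shell
# called_tool NAME: read a chat completion response on stdin and succeed if it
# contains a tool_calls field and a function named NAME.
called_tool() {
  local body
  body="$(cat)"
  printf '%s' "$body" | grep -q '"tool_calls"' &&
    printf '%s' "$body" | grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$1\""
}
```

Piping the curl output above into called_tool get_weather returns success only when the model actually requested the tool. For stricter checks, a JSON-aware tool such as jq would be more robust.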

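VI. Known issues

1: HCCL error code 7 (port binding conflict)

  • Cause: --enable-expert-parallel was added. The MoE in DeepSeek-V4-Flash already has expert parallelism built in, which conflicts with this flag
  • Fix: do not add --enable-expert-parallel

2: Wrong startup order

  • Node1 (rank=1) must start first; wait about 5 seconds, then start Node0 (rank=0)
  • In the reverse order, Node0 reports a connection failure

3: NPU zombie process (drvErr=-8020)

  • Symptom: npu-smi info works on the host, but inside the container it fails with: dcmi model initialized failed, because the device is used. ret is -8020
  • Fix:
cat /proc/uda/namespace_node  # find the leftover root_tgid
kill -9 <root_tgid>           # root_tgid=1 belongs to a system process; do not kill it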