DeepSeek-V4-Flash-W8A8 Two-Node DP=2 Deployment
This article describes how to deploy DeepSeek-V4-Flash-W8A8 across two Ascend servers with 8 NPUs each.
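I. Model
Model path: /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp (70 safetensors shards, 280 GB; download it on both machines in advance)
II. Step 1: Preparation (on both machines)
1.1 Verify the network
# Confirm that the two machines can reach each other over the internal network
ping -c 2 {Node0 internal IP}  # run on Node1 to ping Node0
ping -c 2 {Node1 internal IP}  # run on Node0 to ping Node1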
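The two ping checks can be wrapped in a small helper so that a failure is reported per peer. This is a minimal sketch, not part of the original procedure: the function name check_peers and the PING_CMD override (useful for a dry run without network access) are assumptions made for illustration.

```shell
# check_peers IP...: probe each peer and report which ones are unreachable.
# PING_CMD is a hypothetical override; by default two ICMP probes are sent.
check_peers() {
    failed=0
    for peer in "$@"; do
        if ${PING_CMD:-ping -c 2} "$peer" >/dev/null 2>&1; then
            echo "OK   $peer"
        else
            echo "FAIL $peer"
            failed=1
        fi
    done
    # Non-zero exit status if at least one peer did not answer.
    return "$failed"
}

# Example (IPs taken from the environment variable list in this article):
# check_peers 10.25.66.9 10.25.66.11
```

Running it on each node with the other node's IP reproduces the manual check and gives a usable exit status for scripts.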
1.2 Download the model (needed on both machines)
modelscope download --model Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp
Default download location: /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp
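Because the download is large, it may be worth confirming on each machine that all shards arrived before starting anything. The sketch below counts shard files and compares the count with the 70 stated above. It assumes the shards sit directly in the model directory with a .safetensors extension; the function names are illustrative.

```shell
# count_shards DIR: number of *.safetensors entries directly under DIR.
count_shards() {
    find "$1" -maxdepth 1 -name '*.safetensors' | wc -l | tr -d ' '
}

# check_model_dir DIR [EXPECTED]: succeed only if the shard count matches.
# EXPECTED defaults to 70, the shard count given in this article.
check_model_dir() {
    dir=$1
    expected=${2:-70}
    n=$(count_shards "$dir")
    if [ "$n" -eq "$expected" ]; then
        echo "OK: $n shards"
    else
        echo "MISMATCH: found $n, expected $expected"
        return 1
    fi
}

# Example:
# check_model_dir /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp
```

A matching count does not prove the files are intact, only that none is missing.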
1.3 Pull the image (on both machines)
# If the official quay.io registry is slow, pull through a mirror and retag
docker pull m.daocloud.io/quay.io/ascend/vllm-ascend:v0.13.0rc3
docker tag m.daocloud.io/quay.io/ascend/vllm-ascend:v0.13.0rc3 quay.io/ascend/vllm-ascend:v0.13.0rc3
Image: quay.io/ascend/vllm-ascend:v0.13.0rc3
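The pull-then-retag step can be expressed as a fallback. The sketch below tries the official registry first and only uses the mirror if that pull fails, which is a slightly different policy from the manual step above (a slow pull does not fail). The function name and the DOCKER and MIRROR_PREFIX variables are assumptions added for illustration and for dry runs.

```shell
# pull_with_mirror IMAGE: pull IMAGE, falling back to a registry mirror.
# DOCKER and MIRROR_PREFIX are hypothetical overrides, not docker settings.
pull_with_mirror() {
    image=$1
    mirror=${MIRROR_PREFIX:-m.daocloud.io}
    docker_cmd=${DOCKER:-docker}
    # First try the official registry.
    if $docker_cmd pull "$image"; then
        return 0
    fi
    # Fall back to the mirror, then retag so later commands can use
    # the official image name unchanged.
    $docker_cmd pull "$mirror/$image" && $docker_cmd tag "$mirror/$image" "$image"
}

# Example:
# pull_with_mirror quay.io/ascend/vllm-ascend:v0.13.0rc3
```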
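III. Step 2: Start the container (on both machines)
Suggested container names
- Node0: vllm-ascend-deepseek-v4-node0
- Node1: vllm-ascend-deepseek-v4-node1
The commands in this article use the single name vllm-ascend-deepseek-v4 on both machines; if you adopt the per-node names, change --name and every docker exec accordingly.
Full docker run command (identical on both machines)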
docker run -d \
  --name vllm-ascend-deepseek-v4 \
  --net=host \
  --shm-size=1g \
  --device /dev/davinci0 --device /dev/davinci1 --device /dev/davinci2 --device /dev/davinci3 \
  --device /dev/davinci4 --device /dev/davinci5 --device /dev/davinci6 --device /dev/davinci7 \
  --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
  -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
  -v /etc/ascend_install.info:/etc/ascend_install.info \
  -v /etc/hccn.conf:/etc/hccn.conf \
  -v /root/.cache:/root/.cache \
  quay.io/ascend/vllm-ascend:v0.13.0rc3 bash -c "sleep infinity"
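The eight --device /dev/davinciN flags follow a simple pattern, so they can be generated rather than typed. This is an optional sketch; the function name is an assumption, and the device count of 8 comes from the command above.

```shell
# davinci_device_flags [COUNT]: print "--device /dev/davinciN" for N = 0..COUNT-1.
davinci_device_flags() {
    count=${1:-8}
    i=0
    flags=""
    while [ "$i" -lt "$count" ]; do
        flags="$flags --device /dev/davinci$i"
        i=$((i + 1))
    done
    # Strip the leading space added by the first iteration.
    echo "${flags# }"
}

# Example (left unquoted on purpose so the shell splits it into arguments):
# docker run -d --name vllm-ascend-deepseek-v4 $(davinci_device_flags 8) \
#   --device /dev/davinci_manager --device /dev/devmm_svm --device /dev/hisi_hdc \
#   (remaining -v mounts and image name as in the command above)
```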
IV. Step 3: Start vLLM (key points: startup order and environment variables)
Core environment variables (set on both machines)
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
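Environment variables for DP=2 multi-node communication (set on both machines; this list already includes the core variables)
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP="<this node's internal bond1 IP>"  # Node0: 10.25.66.9, Node1: 10.25.66.11
export GLOO_SOCKET_IFNAME="bond1"
export TP_SOCKET_IFNAME="bond1"
export HCCL_SOCKET_IFNAME="bond1"
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
These variables must end up in the environment of the vllm process inside the container. docker exec does not forward variables exported in the host shell, so either export them in the container shell or pass them with docker exec -e.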
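HCCL_IF_IP is the only value that differs between the two nodes, so it can be derived from the bond1 interface instead of being typed by hand. The sketch below parses the one-line output of the ip command. The function name is an assumption, and it presumes bond1 carries a single IPv4 address.

```shell
# ipv4_of_iface: read `ip -4 -o addr show dev IFACE` output on stdin and
# print the first IPv4 address without its prefix length.
ipv4_of_iface() {
    awk '{
        for (i = 1; i < NF; i++) {
            if ($i == "inet") {
                split($(i + 1), parts, "/")
                print parts[1]
                exit
            }
        }
    }'
}

# Example:
# export HCCL_IF_IP="$(ip -4 -o addr show dev bond1 | ipv4_of_iface)"
```

If the interface has no IPv4 address the function prints nothing, so it is worth checking that HCCL_IF_IP is non-empty before launching.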
Startup order: start Node1 (worker) first, then Node0 (master), 5 seconds apart.
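If both nodes are driven from one control machine, the ordering can be encoded in a small wrapper so it cannot be reversed by accident. This is only a sketch: the function name is illustrative, and the ssh targets and script names in the usage comment are hypothetical placeholders for whatever launches each node.

```shell
# start_in_order WORKER_CMD MASTER_CMD [DELAY]:
# run the worker launch command, wait DELAY seconds (default 5, the gap
# used in this article), then run the master launch command.
start_in_order() {
    worker_cmd=$1
    master_cmd=$2
    delay=${3:-5}
    # Abort if the worker launch itself fails; starting the master alone
    # would only produce a connection error later.
    sh -c "$worker_cmd" || return 1
    sleep "$delay"
    sh -c "$master_cmd"
}

# Example (hypothetical script names, passwordless ssh assumed):
# start_in_order "ssh 10.25.66.11 ./start_node1.sh" "ssh 10.25.66.9 ./start_node0.sh" 5
```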
4.1 Start Node1 (worker, rank=1) first
Run on Node1:
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP="{Node1 internal IP}"  # Node1's bond1 IP, 10.25.66.11 in this setup
export GLOO_SOCKET_IFNAME="bond1"
export TP_SOCKET_IFNAME="bond1"
export HCCL_SOCKET_IFNAME="bond1"
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
docker exec -d vllm-ascend-deepseek-v4 bash -c '
vllm serve /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp \
--host 0.0.0.0 \
--port 8005 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address 10.25.66.9 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name ds \
--max-num-seqs 64 \
--max-model-len 131072 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--chat-template /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
--async-scheduling \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.94 \
--compilation-config "{\"cudagraph_mode\": \"FULL_DECODE_ONLY\"}" \
--additional-config "{\"enable_cpu_binding\": \"true\", \"multistream_overlap_shared_expert\": true}" \
--speculative-config "{\"num_speculative_tokens\": 3, \"method\": \"deepseek_mtp\"}"
'
Wait 5 seconds, then start Node0.
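docker exec starts the process with the container's environment, not the caller's, so variables exported in the host shell need to be handed over explicitly. One option is to turn the exported variables into -e NAME=value flags. The helper below is a sketch with an illustrative name; it assumes the values contain no whitespace, which holds for the variables listed in this article.

```shell
# docker_env_flags NAME...: print one "-e NAME=value" line for every named
# variable that is set in the current shell; unset names are skipped.
docker_env_flags() {
    for name in "$@"; do
        eval "is_set=\${$name+x}"
        if [ -n "$is_set" ]; then
            eval "value=\$$name"
            printf -- '-e %s=%s\n' "$name" "$value"
        fi
    done
}

# Example (left unquoted on purpose so the shell splits it into arguments):
# docker exec -d $(docker_env_flags HCCL_IF_IP HCCL_BUFFSIZE HCCL_CONNECT_TIMEOUT) \
#   vllm-ascend-deepseek-v4 bash -c 'vllm serve (arguments as in the commands in this article)'
# Add the remaining variable names from the environment list in the same way.
```

The alternative is to put the export lines inside the bash -c command string itself, ahead of vllm serve.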
4.2 Start Node0 (master, rank=0) after waiting 5 seconds
Run on Node0:
export HCCL_OP_EXPANSION_MODE="AIV"
export HCCL_IF_IP="{Node0 internal IP}"  # Node0's bond1 IP, 10.25.66.9 in this setup
export GLOO_SOCKET_IFNAME="bond1"
export TP_SOCKET_IFNAME="bond1"
export HCCL_SOCKET_IFNAME="bond1"
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
docker exec -d vllm-ascend-deepseek-v4 bash -c '
vllm serve /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp \
--host 0.0.0.0 \
--port 8005 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 0 \
--data-parallel-address 10.25.66.9 \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name ds \
--max-num-seqs 64 \
--max-model-len 131072 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--chat-template /root/.cache/modelscope/hub/models/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
--async-scheduling \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.94 \
--compilation-config "{\"cudagraph_mode\": \"FULL_DECODE_ONLY\"}" \
--additional-config "{\"enable_cpu_binding\": \"true\", \"multistream_overlap_shared_expert\": true}" \
--speculative-config "{\"num_speculative_tokens\": 3, \"method\": \"deepseek_mtp\"}"
'
Note: Node0 is started without --headless.
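Loading the weights takes a while, so a request sent right after launch will simply fail to connect. A polling loop such as the sketch below can wait for the API server on Node0. It assumes the server exposes the /health endpoint of upstream vLLM's OpenAI-compatible server; the function name, the retry defaults, and the PROBE_CMD override are illustrative choices.

```shell
# wait_until_ready URL [ATTEMPTS] [DELAY_SECONDS]:
# poll URL until it answers with a success status or attempts run out.
# PROBE_CMD is a hypothetical override used for dry runs.
wait_until_ready() {
    url=$1
    attempts=${2:-60}
    delay=${3:-10}
    n=1
    while [ "$n" -le "$attempts" ]; do
        if ${PROBE_CMD:-curl -sf -o /dev/null} "$url"; then
            echo "ready after $n attempt(s)"
            return 0
        fi
        n=$((n + 1))
        sleep "$delay"
    done
    echo "not ready after $attempts attempt(s)" >&2
    return 1
}

# Example (Node0 address and port as used in this article):
# wait_until_ready "http://10.25.66.9:8005/health"
```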
V. Step 4: Verification
# Verify from either machine; Node1 runs headless, so send the request to Node0
curl http://{Node0 internal IP}:8005/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ds",
"messages": [{"role": "user", "content": "Hello, who are you?"}],
"max_tokens": 100,
"temperature": 0
}'
A reply along the lines of "I'm DeepSeek..." means the deployment works.
Function call verification
curl http://{Node0 internal IP}:8005/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "ds",
"messages": [{"role": "user", "content": "What is the weather in Shanghai? Use the get_weather tool."}],
"tools": [{"type": "function", "function": {"name": "get_weather", "description": "Get current weather for a city", "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"]}}}],
"max_tokens": 200,
"temperature": 0
}'
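To tell at a glance what either request returned, the JSON body can be piped through a rough classifier. The sketch below uses grep only and assumes the response follows the OpenAI chat completions shape, with a "choices" array and, for tool calls, a non-empty "tool_calls" array. It is a coarse check, not a JSON parser, and the function name is illustrative. Whether tool_calls is populated also depends on the server's tool-calling configuration, which the launch commands in this article do not set explicitly.

```shell
# classify_response: read a chat completion response body on stdin and print
#   tool_call  if it contains a non-empty "tool_calls" array
#   text       if it contains "choices" but no tool call
#   error      otherwise (exit status 1)
classify_response() {
    body=$(cat)
    if printf '%s' "$body" | grep -q '"tool_calls"[[:space:]]*:[[:space:]]*\[[[:space:]]*{'; then
        echo tool_call
    elif printf '%s' "$body" | grep -q '"choices"'; then
        echo text
    else
        echo error
        return 1
    fi
}

# Example: append "| classify_response" to either curl command in this section.
```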
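VI. Known issues
1: HCCL error code 7 (port binding conflict)
- Cause: --enable-expert-parallel was added. DeepSeek-V4-Flash's MoE already has expert parallelism built in, which conflicts with this flag.
- Fix: do not pass --enable-expert-parallel.
2: Wrong startup order
- Node1 (rank=1) must start first; wait about 5 seconds, then start Node0 (rank=0).
- If the order is reversed, Node0 reports a connection failure.
3: NPU zombie process (drvErr=-8020)
- Symptom: npu-smi info works on the host, but inside the container it fails with "dcmi model initialized failed, because the device is used. ret is -8020".
- Fix:
cat /proc/uda/namespace_node  # find the leftover root_tgid
kill -9 <root_tgid>           # root_tgid=1 is a system process; do not kill it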
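Since killing root_tgid=1 would take down the system, a small guard around the kill step can help. The sketch below takes root_tgid values that you have already read from /proc/uda/namespace_node by hand; it makes no assumption about that file's format. The function name and the KILL_CMD override (for dry runs) are illustrative.

```shell
# kill_stale_tgids TGID...: kill each given process ID, refusing PID 1 and
# anything that is not a plain number.
# KILL_CMD is a hypothetical override; the default matches the manual step.
kill_stale_tgids() {
    for tgid in "$@"; do
        case "$tgid" in
            ''|*[!0-9]*)
                echo "skip (not a PID): $tgid"
                ;;
            1)
                echo "skip (system process): $tgid"
                ;;
            *)
                ${KILL_CMD:-kill -9} "$tgid" && echo "killed: $tgid"
                ;;
        esac
    done
}

# Dry run first, then for real:
# KILL_CMD=echo kill_stale_tgids 1 <root_tgid>
# kill_stale_tgids <root_tgid>
```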