Deploying Qwen3-30B-A3B on a Huawei Ascend 910B Server and Benchmarking Inference with EvalScope
Abstract: This article walks through the complete workflow for running the Qwen3-30B-A3B large model on Huawei Ascend 910B NPUs with the MindIE and vllm-ascend inference engines. The test machine has 8 910B cards: cards 0-3 are given to MindIE and cards 4-7 to vllm-ascend. It covers MindIE container deployment, configuration changes, and service startup, as well as the quick deployment path for vllm-ascend. Performance is measured with EvalScope; with 1024-token inputs, MindIE delivers higher overall throughput than vllm-ascend.
Run the Qwen3-30B-A3B model on Huawei Ascend 910B NPUs with the MindIE and vllm-ascend inference engines, then run a quick inference performance benchmark.
1. Preparation
1.1 Environment
| Item | Value | Notes |
|---|---|---|
| Model | Qwen3-30B-A3B | MindIE needs at least 2 cards to run this model; 4 recommended |
| Server model | Atlas 800I A2 | 1 unit |
| NPU | 910B4 | 64 GB per card |
| Driver | >=24.1.0 | |
| MindIE | >=2.1.RC1 | |
| vllm-ascend | v0.11.0rc0 | |
The server has 8 cards in total: cards 0-3 are used by MindIE and cards 4-7 by vllm-ascend.
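Before splitting the cards between the two engines, it is worth confirming on the host that all 8 NPUs are visible and idle (this assumes the Ascend driver and its bundled npu-smi tool are installed):

```bash
# List all NPUs on the host; all 8 cards should report healthy and idle
npu-smi info
```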
1.2 Model Preparation
```bash
# Download the model weights from ModelScope and relax permissions for the service user
modelscope download --model Qwen/Qwen3-30B-A3B --local_dir /model/Qwen3-30B-A3B
chmod -R 750 /model/Qwen3-30B-A3B
```
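A quick sanity check of the download before starting any containers; the exact file list depends on the ModelScope snapshot:

```bash
# The directory should contain config.json, tokenizer files, and the safetensors shards
ls -lh /model/Qwen3-30B-A3B
```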
2. Running with MindIE
2.1 Start the Container
```bash
docker run -itd --privileged --name=qwen3-30b-a3b-mindie --net=host --shm-size=500g \
--device=/dev/davinci0 \
--device=/dev/davinci1 \
--device=/dev/davinci2 \
--device=/dev/davinci3 \
--device=/dev/davinci_manager \
--device=/dev/devmm_svm \
--device=/dev/hisi_hdc \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-v /usr/local/sbin/:/usr/local/sbin/ \
-v /var/log/npu/slog/:/var/log/npu/slog \
-v /var/log/npu/profiling/:/var/log/npu/profiling \
-v /var/log/npu/dump/:/var/log/npu/dump \
-v /var/log/npu/:/usr/slog \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /model:/model \
swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-800I-A2-py311-openeuler24.03-lts \
/bin/bash
```
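Optionally verify that the container sees cards 0-3; this assumes npu-smi is reachable inside the container via the /usr/local/sbin mount above:

```bash
# Should list the davinci0-3 devices passed into the container
docker exec -it qwen3-30b-a3b-mindie npu-smi info
```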
2.2 Modify the Configuration
```bash
# Enter the container and open the service configuration
docker exec -it qwen3-30b-a3b-mindie bash
cd /usr/local/Ascend/mindie/latest/mindie-service
vim conf/config.json
```
The parameters that must be modified are listed below (not in the order in which they appear in the config file); other parameters can be adjusted as needed:
"httpsEnabled":false,禁用 HTTPS"npuDeviceIds":[[0,1,2,3]],NPU 卡号,下标从0开始"modelName":qwen3-30b-a3b,模型名称,后续调用模型服务时使用"modelWeightPath":/model/Qwen3-30B-A3B,挂载到容器内模型权重路径"worldSize":4,模型使用的 NPU 卡总数量
- `ASCEND_RT_VISIBLE_DEVICES` selects physical cards, while `npuDeviceIds` uses logical card IDs: whichever physical cards `ASCEND_RT_VISIBLE_DEVICES` selects, `npuDeviceIds` always starts from index 0 (see the sketch below).
- An NPU can be mounted into multiple containers, but only one container can use it at a time.
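For example, to run MindIE on physical cards 4-7 instead, only the environment variable changes; config.json keeps its zero-based logical IDs:

```bash
# Expose physical cards 4-7 to the MindIE process; inside the process they become
# logical cards 0-3, so "npuDeviceIds": [[0,1,2,3]] in config.json stays unchanged
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
```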
The configuration file used for testing:
```json
{
"Version" : "1.0.0",
"ServerConfig" :
{
"ipAddress" : "0.0.0.0",
"managementIpAddress" : "127.0.0.2",
"port" : 1025,
"managementPort" : 1026,
"metricsPort" : 1027,
"allowAllZeroIpListening" : true,
"maxLinkNum" : 1000,
"httpsEnabled" : false,
"fullTextEnabled" : false,
"tlsCaPath" : "security/ca/",
"tlsCaFile" : ["ca.pem"],
"tlsCert" : "security/certs/server.pem",
"tlsPk" : "security/keys/server.key.pem",
"tlsPkPwd" : "security/pass/key_pwd.txt",
"tlsCrlPath" : "security/certs/",
"tlsCrlFiles" : ["server_crl.pem"],
"managementTlsCaFile" : ["management_ca.pem"],
"managementTlsCert" : "security/certs/management/server.pem",
"managementTlsPk" : "security/keys/management/server.key.pem",
"managementTlsPkPwd" : "security/pass/management/key_pwd.txt",
"managementTlsCrlPath" : "security/management/certs/",
"managementTlsCrlFiles" : ["server_crl.pem"],
"kmcKsfMaster" : "tools/pmt/master/ksfa",
"kmcKsfStandby" : "tools/pmt/standby/ksfb",
"inferMode" : "standard",
"interCommTLSEnabled" : true,
"interCommPort" : 1121,
"interCommTlsCaPath" : "security/grpc/ca/",
"interCommTlsCaFiles" : ["ca.pem"],
"interCommTlsCert" : "security/grpc/certs/server.pem",
"interCommPk" : "security/grpc/keys/server.key.pem",
"interCommPkPwd" : "security/grpc/pass/key_pwd.txt",
"interCommTlsCrlPath" : "security/grpc/certs/",
"interCommTlsCrlFiles" : ["server_crl.pem"],
"openAiSupport" : "vllm",
"tokenTimeout" : 600,
"e2eTimeout" : 600,
"distDPServerEnabled":false
},
"BackendConfig" : {
"backendName" : "mindieservice_llm_engine",
"modelInstanceNumber" : 1,
"npuDeviceIds" : [[0,1,2,3]],
"tokenizerProcessNumber" : 8,
"multiNodesInferEnabled" : false,
"multiNodesInferPort" : 1120,
"interNodeTLSEnabled" : true,
"interNodeTlsCaPath" : "security/grpc/ca/",
"interNodeTlsCaFiles" : ["ca.pem"],
"interNodeTlsCert" : "security/grpc/certs/server.pem",
"interNodeTlsPk" : "security/grpc/keys/server.key.pem",
"interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt",
"interNodeTlsCrlPath" : "security/grpc/certs/",
"interNodeTlsCrlFiles" : ["server_crl.pem"],
"interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
"interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb",
"ModelDeployConfig" :
{
"maxSeqLen" : 8192,
"maxInputTokenLen" : 6144,
"truncation" : false,
"ModelConfig" : [
{
"modelInstanceType" : "Standard",
"modelName" : "qwen3-30b-a3b",
"modelWeightPath" : "/model/Qwen3-30B-A3B",
"worldSize" : 4,
"cpuMemSize" : 5,
"npuMemSize" : -1,
"backendType" : "atb",
"trustRemoteCode" : false,
"async_scheduler_wait_time": 120,
"kv_trans_timeout": 10,
"kv_link_timeout": 1080
}
]
},
"ScheduleConfig" :
{
"templateType" : "Standard",
"templateName" : "Standard_LLM",
"cacheBlockSize" : 128,
"maxPrefillBatchSize" : 10,
"maxPrefillTokens" : 6144,
"prefillTimeMsPerReq" : 150,
"prefillPolicyType" : 0,
"decodeTimeMsPerReq" : 50,
"decodePolicyType" : 0,
"maxBatchSize" : 200,
"maxIterTimes" : 2048,
"maxPreemptCount" : 0,
"supportSelectBatch" : false,
"maxQueueDelayMicroseconds" : 5000
}
}
}
```
2.3 Start the Model Service
```bash
# Restrict the service to cards 0-3, then launch the MindIE daemon
docker exec -it qwen3-30b-a3b-mindie bash
export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
cd /usr/local/Ascend/mindie/latest/mindie-service
./bin/mindieservice_daemon
```
When the log prints `Daemon start success!`, the model service has started successfully.
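Since `openAiSupport` is set to `"vllm"` in the configuration above, the service should expose an OpenAI-compatible API on the configured port 1025. A minimal smoke test, assuming the standard /v1/chat/completions route:

```bash
# Send one short request to the MindIE endpoint configured above
curl -s http://127.0.0.1:1025/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```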
3. Running with vllm-ascend
Running the model with vllm-ascend is more straightforward: mount cards 4-7 into the container at startup, and the model service will use those cards.
```bash
export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc0
docker run -itd \
--name qwen3-30b-a3b-vllm-ascend \
--shm-size=1g \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-v /model:/model \
-p 8000:8000 \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-e VLLM_USE_MODELSCOPE=True \
$IMAGE \
vllm serve /model/Qwen3-30B-A3B --served-model-name qwen3-30b-a3b --host 0.0.0.0 --port 8000 --tensor-parallel-size 4 --enable_expert_parallel --gpu-memory-utilization 0.9 --enable-prefix-caching --enable-chunked-prefill
```
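Once the container is up, vLLM's standard OpenAI-compatible endpoints can be used to verify the service on the published port 8000:

```bash
# Confirm the served model name, then send a short test request
curl -s http://127.0.0.1:8000/v1/models
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen3-30b-a3b", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
```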
4. Inference Performance Testing
4.1 Test Tool
EvalScope was chosen as the benchmarking tool; an example script follows.
EvalScope supports both command-line and Python execution, and the two produce the same results. Here the Python script approach is used, with a separate wrapper script for batch runs.
```python
# llm-bench.py
from evalscope.perf.main import run_perf_benchmark
from evalscope.perf.arguments import Arguments

task_cfg = Arguments(
    parallel=[1],                  # concurrency level(s); the list form allows sweeps
    number=[10],                   # total request count per concurrency level
    model='qwen3-30b-a3b',         # served model name
    url='http://127.0.0.1:8000/v1/chat/completions',
    api='openai',
    dataset='random',              # synthetic random prompts
    min_tokens=1024,               # output length bounds
    max_tokens=1024,
    prefix_length=0,
    min_prompt_length=1024,        # input length bounds
    max_prompt_length=1024,
    tokenizer_path='/model/Qwen3-30B-A3B',
    extra_args={'ignore_eos': True}  # force full-length generations
)
results = run_perf_benchmark(task_cfg)
```
Run it:
```bash
python3 llm-bench.py
```
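For the batch runs mentioned above, one option is a shell loop over the (concurrency, request-count) pairs from the results table, using EvalScope's command-line mode. A sketch, assuming the CLI flags mirror the Python `Arguments` fields; output length is set to 256 to match the result table below:

```bash
# Sweep (concurrency, requests) pairs against the MindIE endpoint; change --url for vllm-ascend
for pair in "1 50" "16 50" "32 64" "64 128" "128 256"; do
  set -- $pair
  evalscope perf \
    --model qwen3-30b-a3b \
    --url http://127.0.0.1:1025/v1/chat/completions \
    --api openai \
    --dataset random \
    --parallel $1 --number $2 \
    --min-tokens 256 --max-tokens 256 \
    --min-prompt-length 1024 --max-prompt-length 1024 \
    --tokenizer-path /model/Qwen3-30B-A3B \
    --extra-args '{"ignore_eos": true}'
done
```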
4.2 Test Results
The results are shown below; overall, MindIE performs better than vllm-ascend.
- Input length is 1024 tokens, output length is 256 tokens
- vllm-ascend exited abnormally beyond 448 concurrent requests, so only the cases below were tested for now
- Results are for reference only
| Concurrency (batch size) | Requests | MindIE TTFT (s) | MindIE TPOT (s) | MindIE throughput (tokens/s) | vllm-ascend TTFT (s) | vllm-ascend TPOT (s) | vllm-ascend throughput (tokens/s) |
|---|---|---|---|---|---|---|---|
| 1 | 50 | 0.0958 | 0.0219 | 229.0896 | 0.2579 | 0.0216 | 225.4876 |
| 16 | 50 | 0.4537 | 0.0284 | 2198.1873 | 2.2589 | 0.0319 | 1623.3467 |
| 32 | 64 | 0.7763 | 0.0372 | 3988.11 | 5.4252 | 0.0336 | 2922.9758 |
| 64 | 128 | 1.3651 | 0.0506 | 5727.4234 | 8.9482 | 0.0471 | 3900.7664 |
| 96 | 192 | 1.9668 | 0.061 | 6987.6632 | 12.3918 | 0.0593 | 4449.3021 |
| 128 | 256 | 2.5768 | 0.0708 | 7896.7329 | 17.6229 | 0.0733 | 4491.8562 |
| 192 | 384 | 3.7235 | 0.096 | 8679.6975 | 28.9566 | 0.0905 | 4709.636 |
| 224 | 448 | 3.9525 | 0.1051 | 8017.7596 | 33.6394 | 0.1131 | 4579.7825 |
| 256 | 512 | 4.1468 | 0.1167 | 8398.7053 | 28.033 | 0.1085 | 5866.5547 |
| 320 | 640 | 4.7421 | 0.1451 | 8281.9079 | 35.1066 | 0.1014 | 5879.3744 |
| 384 | 768 | 5.5375 | 0.1748 | 8702.9784 | 39.3735 | 0.1053 | 6652.2664 |
| 448 | 896 | 6.0939 | 0.2022 | 8498.057 | 56.4764 | 0.1272 | 4731.1409 |