Running the Qwen3-30B-A3B model on Huawei Ascend 910B NPUs with the MindIE and vllm-ascend inference engines, followed by a quick inference performance test.

1. Preparation

1.1 Environment

| Item | Value | Notes |
| --- | --- | --- |
| Model | Qwen3-30B-A3B | MindIE needs at least 2 NPUs for this model; 4 recommended |
| Server | Atlas 800I A2 | 1 unit |
| NPU | 910B4 | 64 GB per card |
| Driver | >= 24.1.0 | |
| MindIE | >= 2.1.RC1 | |
| vllm-ascend | v0.11.0rc0 | |

The server has 8 NPUs in total: cards 0-3 are assigned to MindIE, and cards 4-7 to vllm-ascend.
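Before splitting the cards between the two engines, it is worth confirming that all 8 NPUs are visible and idle on the host. npu-smi ships with the Ascend driver, so a quick check might look like this:

# List every NPU with its health status, power draw, and HBM usage;
# the device IDs shown here are the physical card numbers referenced below
npu-smi info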

1.2 Model Preparation

modelscope download --model Qwen/Qwen3-30B-A3B --local_dir /model/Qwen3-30B-A3B

chmod -R 750 /model/Qwen3-30B-A3B

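After the download completes, it is worth sanity-checking that the weights landed where expected (the file names below follow the usual ModelScope/Hugging Face layout, shown for illustration):

# Expect config.json, tokenizer files, and sharded *.safetensors weights
ls -lh /model/Qwen3-30B-A3B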

2. Running with MindIE

2.1 Start the Container

docker run -itd --privileged --name=qwen3-30b-a3b-mindie --net=host --shm-size=500g \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
    -v /usr/local/sbin/:/usr/local/sbin/ \
    -v /var/log/npu/slog/:/var/log/npu/slog \
    -v /var/log/npu/profiling/:/var/log/npu/profiling \
    -v /var/log/npu/dump/:/var/log/npu/dump \
    -v /var/log/npu/:/usr/slog \
    -v /etc/hccn.conf:/etc/hccn.conf \
    -v /model:/model \
    swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-800I-A2-py311-openeuler24.03-lts \
    /bin/bash

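The docker run command above mounts /usr/local/sbin/ from the host, where npu-smi usually lives, so the mounted cards can be checked from inside the container as well:

# Expect the four mounted devices (davinci0-3) to be listed
docker exec qwen3-30b-a3b-mindie npu-smi info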

2.2 Modify the Configuration

docker exec -it qwen3-30b-a3b-mindie bash

cd /usr/local/Ascend/mindie/latest/mindie-service
vim conf/config.json


The following parameters must be changed (not listed in the order they appear in the config file); other parameters can be adjusted as needed.

  • "httpsEnabled"false,禁用 HTTPS
  • "npuDeviceIds"[[0,1,2,3]],NPU 卡号,下标从0开始
  • "modelName"qwen3-30b-a3b,模型名称,后续调用模型服务时使用
  • "modelWeightPath"/model/Qwen3-30B-A3B,挂载到容器内模型权重路径
  • "worldSize"4,模型使用的 NPU 卡总数量
  1. ASCEND_RT_VISIBLE_DEVICES选择物理卡,npuDeviceIds使用逻辑卡。即无论 ASCEND_RT_VISIBLE_DEVICES选择了什么卡,npuDeviceIds下标一律从0开始。
  2. NPU 可以被多个容器挂载,但只能被一个容器使用。
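For example, in a hypothetical deployment on physical cards 4-7, the config would still address them as logical devices 0-3:

# Physical cards 4-7 are exposed to the process as logical devices 0-3
export ASCEND_RT_VISIBLE_DEVICES=4,5,6,7
# config.json then still uses: "npuDeviceIds" : [[0,1,2,3]]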

The configuration file used for testing is as follows:

{
    "Version" : "1.0.0",

    "ServerConfig" :
    {
        "ipAddress" : "0.0.0.0",
        "managementIpAddress" : "127.0.0.2",
        "port" : 1025,
        "managementPort" : 1026,
        "metricsPort" : 1027,
        "allowAllZeroIpListening" : true,
        "maxLinkNum" : 1000,
        "httpsEnabled" : false,
        "fullTextEnabled" : false,
        "tlsCaPath" : "security/ca/",
        "tlsCaFile" : ["ca.pem"],
        "tlsCert" : "security/certs/server.pem",
        "tlsPk" : "security/keys/server.key.pem",
        "tlsPkPwd" : "security/pass/key_pwd.txt",
        "tlsCrlPath" : "security/certs/",
        "tlsCrlFiles" : ["server_crl.pem"],
        "managementTlsCaFile" : ["management_ca.pem"],
        "managementTlsCert" : "security/certs/management/server.pem",
        "managementTlsPk" : "security/keys/management/server.key.pem",
        "managementTlsPkPwd" : "security/pass/management/key_pwd.txt",
        "managementTlsCrlPath" : "security/management/certs/",
        "managementTlsCrlFiles" : ["server_crl.pem"],
        "kmcKsfMaster" : "tools/pmt/master/ksfa",
        "kmcKsfStandby" : "tools/pmt/standby/ksfb",
        "inferMode" : "standard",
        "interCommTLSEnabled" : true,
        "interCommPort" : 1121,
        "interCommTlsCaPath" : "security/grpc/ca/",
        "interCommTlsCaFiles" : ["ca.pem"],
        "interCommTlsCert" : "security/grpc/certs/server.pem",
        "interCommPk" : "security/grpc/keys/server.key.pem",
        "interCommPkPwd" : "security/grpc/pass/key_pwd.txt",
        "interCommTlsCrlPath" : "security/grpc/certs/",
        "interCommTlsCrlFiles" : ["server_crl.pem"],
        "openAiSupport" : "vllm",
        "tokenTimeout" : 600,
        "e2eTimeout" : 600,
        "distDPServerEnabled":false
    },

    "BackendConfig" : {
        "backendName" : "mindieservice_llm_engine",
        "modelInstanceNumber" : 1,
        "npuDeviceIds" : [[0,1,2,3]],
        "tokenizerProcessNumber" : 8,
        "multiNodesInferEnabled" : false,
        "multiNodesInferPort" : 1120,
        "interNodeTLSEnabled" : true,
        "interNodeTlsCaPath" : "security/grpc/ca/",
        "interNodeTlsCaFiles" : ["ca.pem"],
        "interNodeTlsCert" : "security/grpc/certs/server.pem",
        "interNodeTlsPk" : "security/grpc/keys/server.key.pem",
        "interNodeTlsPkPwd" : "security/grpc/pass/mindie_server_key_pwd.txt",
        "interNodeTlsCrlPath" : "security/grpc/certs/",
        "interNodeTlsCrlFiles" : ["server_crl.pem"],
        "interNodeKmcKsfMaster" : "tools/pmt/master/ksfa",
        "interNodeKmcKsfStandby" : "tools/pmt/standby/ksfb",
        "ModelDeployConfig" :
        {
            "maxSeqLen" : 8192,
            "maxInputTokenLen" : 6144,
            "truncation" : false,
            "ModelConfig" : [
                {
                    "modelInstanceType" : "Standard",
                    "modelName" : "qwen3-30b-a3b",
                    "modelWeightPath" : "/model/Qwen3-30B-A3B",
                    "worldSize" : 4,
                    "cpuMemSize" : 5,
                    "npuMemSize" : -1,
                    "backendType" : "atb",
                    "trustRemoteCode" : false,
                    "async_scheduler_wait_time": 120,
                    "kv_trans_timeout": 10,
                    "kv_link_timeout": 1080
                }
            ]
        },

        "ScheduleConfig" :
        {
            "templateType" : "Standard",
            "templateName" : "Standard_LLM",
            "cacheBlockSize" : 128,

            "maxPrefillBatchSize" : 10,
            "maxPrefillTokens" : 6144,
            "prefillTimeMsPerReq" : 150,
            "prefillPolicyType" : 0,

            "decodeTimeMsPerReq" : 50,
            "decodePolicyType" : 0,

            "maxBatchSize" : 200,
            "maxIterTimes" : 2048,
            "maxPreemptCount" : 0,
            "supportSelectBatch" : false,
            "maxQueueDelayMicroseconds" : 5000
        }
    }
}


2.3 Start the Model Service

docker exec -it qwen3-30b-a3b-mindie bash

export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3
cd /usr/local/Ascend/mindie/latest/mindie-service
./bin/mindieservice_daemon


When the log prints Daemon start success!, the model service has started successfully.
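With "openAiSupport" : "vllm" in the config, the service should expose an OpenAI-compatible API on the service port (1025 above). A minimal smoke test, assuming the configuration shown earlier:

curl -s http://127.0.0.1:1025/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen3-30b-a3b",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
    }'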

3. Running with vllm-ascend

Running the model with vllm-ascend is more straightforward: mount cards 4-7 into the container at startup, and the model service will use cards 4-7.

export IMAGE=quay.io/ascend/vllm-ascend:v0.11.0rc0
docker run -itd \
    --name qwen3-30b-a3b-vllm-ascend \
    --shm-size=1g \
    --device /dev/davinci4 \
    --device /dev/davinci5 \
    --device /dev/davinci6 \
    --device /dev/davinci7 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -v /model:/model \
    -p 8000:8000 \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -e VLLM_USE_MODELSCOPE=True \
    $IMAGE \
    vllm serve /model/Qwen3-30B-A3B \
        --served-model-name qwen3-30b-a3b \
        --host 0.0.0.0 \
        --port 8000 \
        --tensor-parallel-size 4 \
        --enable-expert-parallel \
        --gpu-memory-utilization 0.9 \
        --enable-prefix-caching \
        --enable-chunked-prefill

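The first startup takes a while to load the weights. Follow the container logs, then verify the standard vLLM OpenAI-compatible endpoints once the server is up:

# Watch startup progress until the API server reports it is serving
docker logs -f qwen3-30b-a3b-vllm-ascend

# The model should be listed under its served name
curl -s http://127.0.0.1:8000/v1/models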

4. Inference Performance Testing

4.1 Test Tool

EvalScope was chosen as the test tool; an example script is shown below.

EvalScope supports both command-line and Python execution, and the choice does not affect the results. Here the Python script approach is used, with a separate script written to run the benchmark in batches (the equivalent CLI invocation is sketched after the run command below).

# llm-bench.py
from evalscope.perf.main import run_perf_benchmark
from evalscope.perf.arguments import Arguments

task_cfg = Arguments(
    parallel=[1],
    number=[10],
    model='qwen3-30b-a3b',
    url='http://127.0.0.1:8000/v1/chat/completions',
    api='openai',
    dataset='random',
    min_tokens=1024,
    max_tokens=1024,
    prefix_length=0,
    min_prompt_length=1024,
    max_prompt_length=1024,
    tokenizer_path='/model/Qwen3-30B-A3B',
    extra_args={'ignore_eos': True}
)

results = run_perf_benchmark(task_cfg)


Run it:

python3 llm-bench.py

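For reference, the same run can be expressed with EvalScope's CLI; the flags below simply mirror the Arguments fields above and should be double-checked against evalscope perf --help for the installed version:

evalscope perf \
    --parallel 1 \
    --number 10 \
    --model qwen3-30b-a3b \
    --url http://127.0.0.1:8000/v1/chat/completions \
    --api openai \
    --dataset random \
    --min-tokens 1024 \
    --max-tokens 1024 \
    --prefix-length 0 \
    --min-prompt-length 1024 \
    --max-prompt-length 1024 \
    --tokenizer-path /model/Qwen3-30B-A3B \
    --extra-args '{"ignore_eos": true}'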

4.2 Test Results

The test results are shown below. Overall, MindIE performs somewhat better than vllm-ascend.

  1. Input context length is 1024; output context length is 256.
  2. Beyond 448 concurrency, vllm-ascend exited abnormally, so only the cases below were tested for now.
  3. Results are for reference only.
| Concurrency (batch size) | Requests | MindIE TTFT (s) | MindIE TPOT (s) | MindIE throughput (tokens/s) | vllm-ascend TTFT (s) | vllm-ascend TPOT (s) | vllm-ascend throughput (tokens/s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 50 | 0.0958 | 0.0219 | 229.0896 | 0.2579 | 0.0216 | 225.4876 |
| 16 | 50 | 0.4537 | 0.0284 | 2198.1873 | 2.2589 | 0.0319 | 1623.3467 |
| 32 | 64 | 0.7763 | 0.0372 | 3988.11 | 5.4252 | 0.0336 | 2922.9758 |
| 64 | 128 | 1.3651 | 0.0506 | 5727.4234 | 8.9482 | 0.0471 | 3900.7664 |
| 96 | 192 | 1.9668 | 0.061 | 6987.6632 | 12.3918 | 0.0593 | 4449.3021 |
| 128 | 256 | 2.5768 | 0.0708 | 7896.7329 | 17.6229 | 0.0733 | 4491.8562 |
| 192 | 384 | 3.7235 | 0.096 | 8679.6975 | 28.9566 | 0.0905 | 4709.636 |
| 224 | 448 | 3.9525 | 0.1051 | 8017.7596 | 33.6394 | 0.1131 | 4579.7825 |
| 256 | 512 | 4.1468 | 0.1167 | 8398.7053 | 28.033 | 0.1085 | 5866.5547 |
| 320 | 640 | 4.7421 | 0.1451 | 8281.9079 | 35.1066 | 0.1014 | 5879.3744 |
| 384 | 768 | 5.5375 | 0.1748 | 8702.9784 | 39.3735 | 0.1053 | 6652.2664 |
| 448 | 896 | 6.0939 | 0.2022 | 8498.057 | 56.4764 | 0.1272 | 4731.1409 |