Initial Setup of a Huawei Ascend 910B GPU Server

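This article walks through the environment setup for a Huawei Ascend 910B GPU server: 1) prepare the dependencies, file permissions, and user account, then install the driver and firmware; 2) install the CANN toolkit and configure its environment variables; 3) optionally install the nputop monitoring tool; 4) configure IP addresses and routes for the NPU network interfaces; 5) deploy the Docker runtime so that containers can use GPU resources; 6) install the Device Plugin component in K8S so that the cluster can schedule NPU resources. It also shows how to verify that each component works.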

INSNNP_LZM · Published 2025-11-19 14:04:20

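This article describes how to install the driver, firmware, CANN, and other base tools on a Huawei Ascend 910B GPU server, how to configure the NPU cards, and how to set up the environment that lets containers use GPU resources.
Notes:

The terms GPU and NPU are used interchangeably in this article; both refer to the Ascend NPU.
The steps below were performed on an Atlas 800I A2 inference server.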

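1. Installing the driver and firmware

The driver and firmware must be installed on the server before programs can use the GPU resources.

1.1 Downloading the driver and firmware

Driver and firmware download page (login required): https://www.hiascend.com/hardware/firmware-drivers/community

Select the server model, driver and firmware version, component, and package format that match your environment; the page then shows the download links for the driver and firmware.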

1.2 Installing the driver and firmware

# Install gcc, g++, make, and other build dependencies first; on Ubuntu 22.04 the build-essential package provides them
apt install build-essential

# Create the HwHiAiUser account used by the driver
useradd HwHiAiUser

# Make the driver and firmware installers executable
chmod +x Ascend-hdk-910b-npu-driver_25.2.0_linux-aarch64.run
chmod +x Ascend-hdk-910b-npu-firmware_7.7.0.6.236.run

# Install the driver
./Ascend-hdk-910b-npu-driver_25.2.0_linux-aarch64.run --full --install-for-all
# Install the firmware
./Ascend-hdk-910b-npu-firmware_7.7.0.6.236.run --full

# Reboot the server after installation
reboot
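Before running the installers, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch: the helper name missing_tools is hypothetical, and the tool list (gcc, g++, make) comes from the comment above rather than from an official prerequisite list.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
    return 0
}

# Build tools named in the article as prerequisites for the driver installer.
missing=$(missing_tools gcc g++ make)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing build tools:" $missing
else
    echo "All build tools found"
fi
```

If anything is reported as missing, install build-essential (or the equivalent packages on your distribution) before starting the driver installer.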

After the reboot, run npu-smi info to view information about the cards.

npu-smi info

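1.3 Installing CANN

CANN download page: https://www.hiascend.com/developer/download/community

Select the product series and product model, then click "Find Matching Resources" to jump to the download page for the matching packages.

Download Ascend-cann-toolkit_x.x.x_linux-aarch64.run.

After the download finishes, install it with the following commands:

chmod +x Ascend-cann-toolkit_8.2.RC2_linux-aarch64.run
./Ascend-cann-toolkit_8.2.RC2_linux-aarch64.run --install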
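The package file name encodes the target architecture (linux-aarch64 here), so a quick comparison with the host architecture can catch a wrong download before installation. This is only a sketch: pkg_matches_arch is a hypothetical helper, and it assumes the package name ends with "linux-<arch>.run", where <arch> is what uname -m prints.

```shell
# Succeed if a .run package name ends with "linux-<arch>.run" for the given arch.
pkg_matches_arch() {
    case "$1" in
        *"linux-$2.run") return 0 ;;
        *) return 1 ;;
    esac
}

pkg=Ascend-cann-toolkit_8.2.RC2_linux-aarch64.run
if pkg_matches_arch "$pkg" "$(uname -m)"; then
    echo "$pkg matches this host"
else
    echo "$pkg does not match host architecture $(uname -m)"
fi
```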

To configure the CANN environment variables, add the following command to ~/.bashrc:

source /usr/local/Ascend/ascend-toolkit/set_env.sh
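If the setup is scripted or repeated, appending the line blindly leaves duplicates in ~/.bashrc. One possible approach is an idempotent append; append_once below is a hypothetical helper, not part of CANN.

```shell
# Append a line to a file only if that exact line is not already present.
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (path taken from the article):
# append_once 'source /usr/local/Ascend/ascend-toolkit/set_env.sh' "$HOME/.bashrc"
```

Running the usage line twice leaves a single entry in the file, so the step is safe to repeat.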

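1.4 Uninstalling the driver and firmware

These steps are recorded for reference only; do not run them unless necessary.

Uninstall the firmware first and then the driver (the reverse of the installation order).

1.4.1 Method 1: uninstall with the scripts

# Uninstall the firmware (the firmware must be removed before the driver)
cd /usr/local/Ascend/firmware/script
bash uninstall.sh

# Uninstall the driver
cd /usr/local/Ascend/driver/script
bash uninstall.sh

# Uninstall Ascend-toolkit; replace "<version>" with the version installed in your environment
cd /usr/local/Ascend/ascend-toolkit/<version>
bash cann_uninstall.sh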

1.4.2 Method 2: uninstall with the installer packages

# Uninstall the firmware
./Ascend-hdk-910b-npu-firmware_7.7.0.6.236.run --uninstall

# Uninstall the driver
./Ascend-hdk-910b-npu-driver_25.2.0_linux-aarch64.run --uninstall

# Uninstall the CANN toolkit
./Ascend-cann-toolkit_8.2.RC2_linux-aarch64.run --uninstall

1.5 Installing nputop (optional)

npu-smi info is not very convenient for viewing card information; [nputop](https://github.com/youyve/nputop) shows the resource usage of the cards in real time.

# Install
pip install ascend-nputop

# Run
nputop

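2. Configuring the NPU network interfaces

2.1 Configuring the device IP address and netmask

Example configuration that assigns an IP address and subnet mask to each of the 8 cards in one machine (-i selects the card number).

Choose the actual IP addresses according to your environment.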

hccn_tool -i 0 -ip -s address 192.168.190.101 netmask 255.255.255.0
hccn_tool -i 1 -ip -s address 192.168.190.102 netmask 255.255.255.0
hccn_tool -i 2 -ip -s address 192.168.190.103 netmask 255.255.255.0
hccn_tool -i 3 -ip -s address 192.168.190.104 netmask 255.255.255.0
hccn_tool -i 4 -ip -s address 192.168.190.105 netmask 255.255.255.0
hccn_tool -i 5 -ip -s address 192.168.190.106 netmask 255.255.255.0
hccn_tool -i 6 -ip -s address 192.168.190.107 netmask 255.255.255.0
hccn_tool -i 7 -ip -s address 192.168.190.108 netmask 255.255.255.0

Check the configuration:

for i in {0..7}; do hccn_tool -i ${i} -ip -g; done
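The eight address commands in section 2.1 follow a simple pattern: card N gets host number 101 + N in the same subnet. A small generator can print them for review instead of typing each one. This is a sketch under that assumption; gen_ip_cmds is a hypothetical helper, and it only prints commands rather than running hccn_tool.

```shell
# Print one hccn_tool command per card, with consecutive host addresses.
# Arguments: card count, first three octets, first host number, netmask.
gen_ip_cmds() {
    count=$1 prefix=$2 first_host=$3 netmask=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "hccn_tool -i $i -ip -s address $prefix.$((first_host + i)) netmask $netmask"
        i=$((i + 1))
    done
}

# Dry run: review the printed commands, then pipe them to "sh" to apply them.
gen_ip_cmds 8 192.168.190 101 255.255.255.0
```

Printing first and applying second keeps a typo in the subnet from being written to all eight cards at once.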

2.2 Configuring the route

Choose the actual gateway address according to your environment.

hccn_tool -i 0 -gateway -s gateway 192.168.190.1
hccn_tool -i 1 -gateway -s gateway 192.168.190.1
hccn_tool -i 2 -gateway -s gateway 192.168.190.1
hccn_tool -i 3 -gateway -s gateway 192.168.190.1
hccn_tool -i 4 -gateway -s gateway 192.168.190.1
hccn_tool -i 5 -gateway -s gateway 192.168.190.1
hccn_tool -i 6 -gateway -s gateway 192.168.190.1
hccn_tool -i 7 -gateway -s gateway 192.168.190.1

Check the configuration:

for i in {0..7}; do hccn_tool -i ${i} -gateway -g; done

2.3 Configuring the detection target IP

Choose the actual IP address according to your environment.

hccn_tool -i 0 -netdetect -s address 192.168.190.1
hccn_tool -i 1 -netdetect -s address 192.168.190.1
hccn_tool -i 2 -netdetect -s address 192.168.190.1
hccn_tool -i 3 -netdetect -s address 192.168.190.1
hccn_tool -i 4 -netdetect -s address 192.168.190.1
hccn_tool -i 5 -netdetect -s address 192.168.190.1
hccn_tool -i 6 -netdetect -s address 192.168.190.1
hccn_tool -i 7 -netdetect -s address 192.168.190.1

Check the configuration:

for i in {0..7}; do hccn_tool -i ${i} -netdetect -g; done
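In the examples for sections 2.2 and 2.3, every card uses the same gateway and the same detection address, so both settings can be applied in one loop. The sketch below assumes that is also true in your environment. The helper names run and set_gateway_and_detect and the DRY_RUN variable are hypothetical; with DRY_RUN=1 (the default here) the commands are only printed.

```shell
# Run a command, or only print it when DRY_RUN=1 (the default in this sketch).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Point every card at the same gateway and detection address.
# Arguments: card count, gateway / detection IP.
set_gateway_and_detect() {
    count=$1 target=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        run hccn_tool -i "$i" -gateway -s gateway "$target"
        run hccn_tool -i "$i" -netdetect -s address "$target"
        i=$((i + 1))
    done
}

set_gateway_and_detect 8 192.168.190.1
```

Set DRY_RUN=0 in the environment to execute the commands instead of printing them.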

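3. Using GPU resources in Docker

To use Ascend GPU resources in Docker or k8s, first install Ascend-docker-runtime for the container runtime.

If you later need to run Pods that use Ascend GPUs in K8S, or have other requirements, you also need to install Ascend Device Plugin, Ascend operator, npu-exporter, and so on. Keeping these components at the same version is recommended. The Ascend mind-cluster repository provides the installation packages for these components (for newer versions the mind-cluster repository has moved from gitee to gitcode; packages for older versions can still be found on gitee).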

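3.1 Downloading Ascend-docker-runtime

Download the required version from the releases page of the mind-cluster repository.

3.2 Installing Ascend-docker-runtime

The installation differs between the Docker scenario and the Containerd scenario.

3.2.1 Installation for Docker

The installer modifies /etc/docker/daemon.json automatically.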

# Add execute permission
chmod u+x Ascend-docker-runtime_{version}_linux-{arch}.run

# Verify the consistency and integrity of the package
./Ascend-docker-runtime_{version}_linux-{arch}.run --check

# Install to the default path; add --install-path=<path> to install to a custom path
./Ascend-docker-runtime_{version}_linux-{arch}.run --install [--install-path=<path>]

# Restart Docker so that Ascend Docker Runtime takes effect
systemctl daemon-reload && systemctl restart docker
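Since the installer edits /etc/docker/daemon.json, one quick sanity check after the restart is to look for the runtime binary name in that file. This is a rough sketch: has_ascend_runtime and the DAEMON_JSON variable are hypothetical. It also assumes the installer writes the ascend-docker-runtime path into the file, which the article does not show.

```shell
# Succeed if the given Docker daemon config mentions the Ascend runtime binary.
has_ascend_runtime() {
    grep -q 'ascend-docker-runtime' "$1" 2>/dev/null
}

# DAEMON_JSON can override the default location for testing.
config=${DAEMON_JSON:-/etc/docker/daemon.json}
if has_ascend_runtime "$config"; then
    echo "Ascend runtime found in $config"
else
    echo "Ascend runtime not found in $config"
fi
```

A text match only shows that the file was modified; running a container and calling npu-smi info inside it remains the real test.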

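3.2.2 Installation for Containerd

Modify the Containerd configuration file as follows.

If Containerd has no configuration file yet, run the following commands in order to create and then edit it:

mkdir -p /etc/containerd
containerd config default > /etc/containerd/config.toml
vim /etc/containerd/config.toml

If Containerd already has a configuration file, open it and point the runtime entry at ascend-docker-runtime: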

# vim /etc/containerd/config.toml
...
 [plugins."io.containerd.monitor.v1.cgroups"] 
   no_prometheus = false 
 [plugins."io.containerd.runtime.v1.linux"] 
   shim = "containerd-shim" 
   runtime = "/usr/local/Ascend/Ascend-Docker-Runtime/ascend-docker-runtime" 
   runtime_root = "" 
   no_shim = false 
   shim_debug = false 
 [plugins."io.containerd.runtime.v2.task"] 
   platforms = ["linux/amd64"] 
...

Restart Containerd:

systemctl daemon-reload && systemctl restart containerd

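3.3 Running a container that uses GPU resources (Docker)

Create a container with a command like the one below.

--device maps a device into the container; /dev/davinci0 is card 0.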

docker run -itd --privileged --net=host --name mindie --shm-size=50g \
    -e ASCEND_RT_VISIBLE_DEVICES="0,1" \
    --device=/dev/davinci0 \
    --device=/dev/davinci1 \
    --device=/dev/davinci2 \
    --device=/dev/davinci3 \
    --device=/dev/davinci4 \
    --device=/dev/davinci5 \
    --device=/dev/davinci6 \
    --device=/dev/davinci7 \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
    -v /usr/local/sbin/:/usr/local/sbin/ \
    swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-800I-A2-py311-openeuler24.03-lts \
    /bin/bash

# View the card information inside the container
docker exec -it mindie npu-smi info
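The command above maps all eight cards even though ASCEND_RT_VISIBLE_DEVICES selects only cards 0 and 1. If you prefer to map only the cards a container needs, the --device flags can be generated from a list of card numbers. This is a sketch: device_flags is a hypothetical helper, and the three shared device nodes are the ones mapped in the command above.

```shell
# Print --device flags for the given card numbers, followed by the shared
# management devices that the article's docker run command maps.
device_flags() {
    flags=""
    for card in "$@"; do
        flags="$flags --device=/dev/davinci$card"
    done
    for dev in davinci_manager devmm_svm hisi_hdc; do
        flags="$flags --device=/dev/$dev"
    done
    # Drop the leading space before printing.
    echo "${flags# }"
}

device_flags 0 1
```

In a real invocation the output would be spliced into the command unquoted, for example docker run ... $(device_flags 0 1) ..., so that the shell splits it into separate arguments.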

4. Installing Ascend Device Plugin

4.1 Preparing the files

Device Plugin k8s yaml files: https://gitee.com/ascend/mind-cluster/tree/master/component/ascend-device-plugin/build

Device Plugin manifest for 910B cards: https://gitee.com/ascend/mind-cluster/blob/master/component/ascend-device-plugin/build/ascendplugin-910.yaml

Ascend Device Plugin image repository: https://www.hiascend.com/developer/ascendhub/detail/a592da7bd2ab4dffa8864abd4eac5068

After pulling the image, you can re-tag it:

docker pull swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v6.0.0.SPC3

docker tag swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v6.0.0.SPC3 ascend-k8sdeviceplugin:v6.0.0.SPC3

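The short tag used above is simply the last path component of the full image reference. When several images need the same treatment, that name can be derived instead of typed. This is a small sketch; local_name is a hypothetical helper.

```shell
# Strip the registry and namespace from an image reference, keeping name:tag.
local_name() {
    echo "${1##*/}"
}

image=swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-k8sdeviceplugin:v6.0.0.SPC3
echo "docker tag $image $(local_name "$image")"
```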
4.2 Installing Ascend Device Plugin

# Label the 910B GPU node
kubectl label nodes <node> accelerator=huawei-Ascend910

# Install Ascend Device Plugin (prepare the image and update the image tag in the file beforehand)
kubectl apply -f ascendplugin-910.yaml

# Check the device plugin Pod
kubectl get pod -n kube-system -o wide | grep ascend-device-plugin

4.3 Post-installation check

After a successful installation, describe the node to see its allocatable resources.

kubectl get pods -n kube-system -o wide | grep ascend-device-plugin

kubectl describe node <node> | grep -i ascend910
#   huawei.com/Ascend910:     8  # K8s sees 8 allocatable NPUs on this node
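For scripted checks it is handier to extract the number than to read the grep output. The sketch below assumes the kubectl describe node output contains lines of the form "huawei.com/Ascend910: 8", as in the sample above; ascend_count is a hypothetical helper.

```shell
# Read "kubectl describe node" output on stdin and print the quantity from the
# first huawei.com/Ascend910 line.
ascend_count() {
    awk '$1 == "huawei.com/Ascend910:" { print $2; exit }'
}

# Usage: kubectl describe node <node> | ascend_count
```

An empty result means the resource was not found, which usually indicates that the device plugin Pod is not running on that node.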

4.4 Running a Pod that uses GPU resources

# Add the Ascend GPU entries to the resources section of any Pod that uses Ascend GPUs
resources:
  requests:
    huawei.com/Ascend910: 8                 # Number of required NPUs. The maximum value is 8. You can add lines below to configure resources such as memory and CPU.
  limits:
    huawei.com/Ascend910: 8
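To show where the resources fragment sits in a full manifest, the following sketch prints a minimal Pod that requests NPUs. The helper render_npu_pod, the Pod name npu-demo, and the sleep command are all illustrative assumptions. The nodeSelector reuses the accelerator=huawei-Ascend910 label from section 4.2, and the image is the MindIE image from section 3.3.

```shell
# Print a Pod manifest that requests the given number of Ascend 910 NPUs.
# Arguments: pod name, NPU count, container image.
render_npu_pod() {
    name=$1 count=$2 image=$3
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $name
spec:
  nodeSelector:
    accelerator: huawei-Ascend910
  containers:
    - name: $name
      image: $image
      command: ["sleep", "infinity"]   # placeholder workload
      resources:
        requests:
          huawei.com/Ascend910: $count
        limits:
          huawei.com/Ascend910: $count
EOF
}

render_npu_pod npu-demo 8 swr.cn-south-1.myhuaweicloud.com/ascendhub/mindie:2.1.RC1-800I-A2-py311-openeuler24.03-lts
```

The output can be reviewed and then piped to kubectl apply -f - to create the Pod.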
