Ascend AscendCL Model Inference in Practice: From Code Implementation to Operator Integration

梧桐ty · Published 2025-11-27 23:57:43

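The previous article broke down the AscendCL inference flow in theory and identified its core steps. This article implements the full AscendCL model inference flow in complete code, using a model that contains a custom `Add` operator and going step by step from environment initialization to resource release, so that you can get hands-on with Ascend inference development.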

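1. Prerequisites

1.1 Environment dependencies

CANN toolkit installed (version ≥ 6.0)
Ascend hardware (or Ascend Simulator)
A compiled .om model file (containing the custom Add operator)

1.2 Core headers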

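#include <cstdint>
#include <iostream>
#include <string>
#include <vector>
#include <acl/acl.h>
#include <acl/acl_mdl.h>
// Header for the custom operator. It is only needed if you call the operator
// directly; the .om inference flow in this article does not use it.
// #include "add_kernel.h"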

2. Full AscendCL inference flow in code

2.1 Step 1: AscendCL initialization and resource setup

// Global state
aclrtContext g_context = nullptr;
aclrtStream g_stream = nullptr;
int32_t g_device_id = 0;
// AscendCL identifies a loaded model by a uint32_t model ID,
// which aclmdlLoadFromFile fills in
uint32_t g_model_handle = 0;

// Initialize AscendCL
bool InitAscendCL() {
    // 1. Initialize AscendCL (nullptr means no JSON configuration file)
    aclError ret = aclInit(nullptr);
    if (ret != ACL_SUCCESS) {
        std::cerr << "AscendCL initialization failed, error code: " << ret << std::endl;
        return false;
    }

    // 2. Select the device
    ret = aclrtSetDevice(g_device_id);
    if (ret != ACL_SUCCESS) {
        std::cerr << "Failed to set device, error code: " << ret << std::endl;
        return false;
    }

    // 3. Create a context
    ret = aclrtCreateContext(&g_context, g_device_id);
    if (ret != ACL_SUCCESS) {
        std::cerr << "Failed to create context, error code: " << ret << std::endl;
        return false;
    }

    // 4. Create a stream
    ret = aclrtCreateStream(&g_stream);
    if (ret != ACL_SUCCESS) {
        std::cerr << "Failed to create stream, error code: " << ret << std::endl;
        return false;
    }

    std::cout << "AscendCL initialized successfully!" << std::endl;
    return true;
}

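Code explanation:

aclInit: initializes the AscendCL environment;
aclrtSetDevice: binds the current process to the specified Ascend device;
aclrtCreateContext: creates a context, which isolates device resources;
aclrtCreateStream: creates an execution stream, which manages asynchronous task execution on the device side.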
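
InitAscendCL returns at the first failing step, so steps that already succeeded stay in place until the caller cleans up. One way to make the unwinding explicit is to pair each setup step with its undo action and roll back in reverse order on failure. The following is a minimal sketch using only the standard library; `Step`, `RunWithRollback`, and `DemoFailingAtContext` are hypothetical names, and the lambdas are stand-ins for the real AscendCL calls.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical pairing of a setup action with the action that undoes it.
struct Step {
    std::string name;
    std::function<bool()> setup;     // returns true on success
    std::function<void()> teardown;  // undoes a successful setup
};

// Runs the steps in order. If one fails, the steps that already succeeded
// are undone in reverse order. `log` records what happened.
bool RunWithRollback(const std::vector<Step>& steps, std::vector<std::string>& log) {
    std::vector<const Step*> done;
    for (const Step& step : steps) {
        if (!step.setup()) {
            log.push_back("fail:" + step.name);
            for (auto it = done.rbegin(); it != done.rend(); ++it) {
                (*it)->teardown();
                log.push_back("undo:" + (*it)->name);
            }
            return false;
        }
        log.push_back("ok:" + step.name);
        done.push_back(&step);
    }
    return true;
}

// Stand-in wiring: in real code each lambda would wrap the matching pair,
// e.g. aclInit/aclFinalize or aclrtSetDevice/aclrtResetDevice.
std::vector<std::string> DemoFailingAtContext() {
    std::vector<Step> steps = {
        {"init", [] { return true; }, [] {}},
        {"device", [] { return true; }, [] {}},
        {"context", [] { return false; }, [] {}},  // simulated failure
        {"stream", [] { return true; }, [] {}},
    };
    std::vector<std::string> log;
    RunWithRollback(steps, log);
    return log;
}
```

In this demo the simulated failure at the context step undoes the device step and then the init step, and the stream step is never attempted.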

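2.2 Step 2: Load the .om model file

aclmdlDesc* g_model_desc = nullptr;  // model description: input/output counts and sizes
bool g_model_loaded = false;         // true once aclmdlLoadFromFile has succeeded

bool LoadModel(const std::string& model_path) {
    // 1. Load the model; AscendCL writes the model ID into g_model_handle
    aclError ret = aclmdlLoadFromFile(model_path.c_str(), &g_model_handle);
    if (ret != ACL_SUCCESS) {
        std::cerr << "Failed to load model, error code: " << ret << std::endl;
        return false;
    }
    g_model_loaded = true;

    // 2. Create the model description and fill it in from the loaded model
    g_model_desc = aclmdlCreateDesc();
    if (g_model_desc == nullptr || aclmdlGetDesc(g_model_desc, g_model_handle) != ACL_SUCCESS) {
        std::cerr << "Failed to get the model description" << std::endl;
        return false;
    }

    // 3. Query the input/output counts (the Add model here takes two float32 vectors and returns one)
    size_t input_num = aclmdlGetNumInputs(g_model_desc);
    size_t output_num = aclmdlGetNumOutputs(g_model_desc);
    std::cout << "Model inputs: " << input_num << ", outputs: " << output_num << std::endl;

    return true;
}

Code explanation:

aclmdlLoadFromFile: loads the compiled .om model (containing the custom Add operator) from a file and returns a model ID;
aclmdlCreateDesc / aclmdlGetDesc: create a model description and populate it from the loaded model;
aclmdlGetNumInputs / aclmdlGetNumOutputs: return the number of input and output tensors of the model, used for the memory allocation that follows.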
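
A load failure only yields an error code, so a cheap host-side check of the path before loading can make a wrong or missing file easier to spot. The sketch below is an optional pre-check using only the standard library; `LooksLikeModelFile` is a hypothetical helper, and the `.om` suffix test reflects the naming convention used in this article rather than anything AscendCL enforces.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical pre-check: true if the path ends in ".om" and the file can be opened.
bool LooksLikeModelFile(const std::string& path) {
    const std::string ext = ".om";
    // Require at least one character before the extension.
    if (path.size() <= ext.size() ||
        path.compare(path.size() - ext.size(), ext.size(), ext) != 0) {
        return false;
    }
    std::ifstream file(path, std::ios::binary);
    return file.good();
}
```

A caller could run this before LoadModel and print the offending path when it returns false.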

2.3 Step 3: Memory allocation and data transfer

The example takes two float32 vectors of length 1024 as input and outputs their element-wise sum:

// Input/output buffers
const int DATA_LEN = 1024;
const size_t DATA_BYTES = DATA_LEN * sizeof(float);
float* g_host_input1 = nullptr;
float* g_host_input2 = nullptr;
float* g_host_output = nullptr;
void* g_device_input1 = nullptr;
void* g_device_input2 = nullptr;
void* g_device_output = nullptr;

bool AllocMemoryAndCopyData() {
    // 1. Allocate host memory
    g_host_input1 = new float[DATA_LEN];
    g_host_input2 = new float[DATA_LEN];
    g_host_output = new float[DATA_LEN];

    // Initialize the inputs (example: input1[i] = i, input2[i] = 2*i)
    for (int i = 0; i < DATA_LEN; i++) {
        g_host_input1[i] = static_cast<float>(i);
        g_host_input2[i] = static_cast<float>(2 * i);
    }

    // 2. Query the buffer sizes (in bytes) that the model expects
    size_t input1_size = aclmdlGetInputSizeByIndex(g_model_desc, 0);
    size_t input2_size = aclmdlGetInputSizeByIndex(g_model_desc, 1);
    size_t output_size = aclmdlGetOutputSizeByIndex(g_model_desc, 0);
    // The host arrays hold DATA_LEN floats; stop if copying the model's sizes would overrun them
    if (input1_size > DATA_BYTES || input2_size > DATA_BYTES || output_size > DATA_BYTES) {
        std::cerr << "Model buffer sizes exceed the host buffers (" << DATA_BYTES << " bytes)" << std::endl;
        return false;
    }

    // 3. Allocate device memory, stopping at the first failure so ret keeps its error code
    aclError ret = aclrtMalloc(&g_device_input1, input1_size, ACL_MEM_MALLOC_NORMAL_ONLY);
    if (ret == ACL_SUCCESS) {
        ret = aclrtMalloc(&g_device_input2, input2_size, ACL_MEM_MALLOC_NORMAL_ONLY);
    }
    if (ret == ACL_SUCCESS) {
        ret = aclrtMalloc(&g_device_output, output_size, ACL_MEM_MALLOC_NORMAL_ONLY);
    }
    if (ret != ACL_SUCCESS) {
        std::cerr << "Device memory allocation failed, error code: " << ret << std::endl;
        return false;
    }

    // 4. Copy the inputs from host to device
    ret = aclrtMemcpy(g_device_input1, input1_size, g_host_input1, input1_size, ACL_MEMCPY_HOST_TO_DEVICE);
    if (ret == ACL_SUCCESS) {
        ret = aclrtMemcpy(g_device_input2, input2_size, g_host_input2, input2_size, ACL_MEMCPY_HOST_TO_DEVICE);
    }
    if (ret != ACL_SUCCESS) {
        std::cerr << "Host-to-device copy failed, error code: " << ret << std::endl;
        return false;
    }

    std::cout << "Memory allocation and data copy succeeded!" << std::endl;
    return true;
}

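Code explanation:

aclmdlGetInputSizeByIndex / aclmdlGetOutputSizeByIndex: return the size in bytes of a model input or output tensor, looked up by index;
aclrtMalloc: allocates memory on the device (ACL_MEM_MALLOC_NORMAL_ONLY requests normal pages only, rather than huge pages);
aclrtMemcpy: transfers data between host and device (ACL_MEMCPY_HOST_TO_DEVICE means host to device).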
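
The copy length passed to aclrtMemcpy comes from the model, while the host arrays are sized from DATA_LEN, so the two can disagree if the model was built for a different shape. A small host-side guard can catch that before any copy runs. This is a minimal sketch using only the standard library; `BytesForElements` and `HostBufferCovers` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Bytes needed for `count` elements of `elem_size` bytes each.
// Returns false instead of letting the multiplication overflow size_t.
bool BytesForElements(std::size_t count, std::size_t elem_size, std::size_t& out_bytes) {
    if (elem_size != 0 && count > std::numeric_limits<std::size_t>::max() / elem_size) {
        return false;
    }
    out_bytes = count * elem_size;
    return true;
}

// True if a copy of `model_bytes` stays inside a host buffer that holds
// `count` elements of `elem_size` bytes.
bool HostBufferCovers(std::size_t count, std::size_t elem_size, std::size_t model_bytes) {
    std::size_t host_bytes = 0;
    return BytesForElements(count, elem_size, host_bytes) && model_bytes <= host_bytes;
}
```

For the example in this section, `HostBufferCovers(DATA_LEN, sizeof(float), input1_size)` would be the check to run before the first copy.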

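2.4 Step 4: Run model inference

// Wrap one device buffer in an aclDataBuffer and append it to a dataset
bool AddBuffer(aclmdlDataset* dataset, void* data, size_t size) {
    aclDataBuffer* buffer = aclCreateDataBuffer(data, size);
    if (buffer != nullptr && aclmdlAddDatasetBuffer(dataset, buffer) == ACL_SUCCESS) {
        return true;
    }
    if (buffer != nullptr) {
        aclDestroyDataBuffer(buffer);
    }
    return false;
}

// Destroy a dataset and its aclDataBuffer wrappers (device memory is freed in step 5)
void DestroyDataset(aclmdlDataset* dataset) {
    if (dataset == nullptr) {
        return;
    }
    for (size_t i = 0; i < aclmdlGetDatasetNumBuffers(dataset); ++i) {
        aclDestroyDataBuffer(aclmdlGetDatasetBuffer(dataset, i));
    }
    aclmdlDestroyDataset(dataset);
}

bool ExecuteModelInference() {
    size_t input1_size = aclmdlGetInputSizeByIndex(g_model_desc, 0);
    size_t input2_size = aclmdlGetInputSizeByIndex(g_model_desc, 1);
    size_t output_size = aclmdlGetOutputSizeByIndex(g_model_desc, 0);
    // 1. Wrap the device buffers in the aclmdlDataset structures the model API expects
    aclmdlDataset* input = aclmdlCreateDataset();
    aclmdlDataset* output = aclmdlCreateDataset();
    bool ok = input != nullptr && output != nullptr &&
              AddBuffer(input, g_device_input1, input1_size) &&
              AddBuffer(input, g_device_input2, input2_size) &&
              AddBuffer(output, g_device_output, output_size);
    if (!ok) {
        std::cerr << "Failed to build the input/output datasets" << std::endl;
    }

    // 2. Submit inference to the stream (asynchronous), then 3. wait for the stream to finish
    if (ok) {
        aclError ret = aclmdlExecuteAsync(g_model_handle, input, output, g_stream);
        if (ret == ACL_SUCCESS) {
            ret = aclrtSynchronizeStream(g_stream);
        }
        ok = (ret == ACL_SUCCESS);
        if (!ok) {
            std::cerr << "Inference or stream synchronization failed, error code: " << ret << std::endl;
        }
    }
    // The datasets are only wrappers around the device buffers; release them now
    DestroyDataset(input);
    DestroyDataset(output);
    if (!ok) {
        return false;
    }

    // 4. Copy the result from device to host
    aclError ret = aclrtMemcpy(g_host_output, output_size, g_device_output, output_size, ACL_MEMCPY_DEVICE_TO_HOST);
    if (ret != ACL_SUCCESS) {
        std::cerr << "Device-to-host copy failed, error code: " << ret << std::endl;
        return false;
    }

    // Print the first 10 results (checks the Add operator: output[i] = input1[i] + input2[i])
    std::cout << "Inference results (first 10):" << std::endl;
    for (int i = 0; i < 10; i++) {
        std::cout << "input1[" << i << "] + input2[" << i << "] = " << g_host_output[i] << std::endl;
    }
    return true;
}

Code explanation:

aclmdlCreateDataset / aclCreateDataBuffer / aclmdlAddDatasetBuffer: wrap the device-side input and output buffers in the aclmdlDataset structures that the model execution API takes;
aclmdlExecuteAsync: submits model inference to the stream and returns without waiting for the result (aclmdlExecute is the synchronous variant and takes no stream);
aclrtSynchronizeStream: waits for all tasks in the stream to finish (guaranteeing the inference result has been produced);
aclrtMemcpy (ACL_MEMCPY_DEVICE_TO_HOST): copies the inference result from the device back to the host.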
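
Printing the first ten values is a quick visual check; comparing the whole output against a CPU reference is a stricter one. The sketch below is one way to do that on the host after the device-to-host copy, using only the standard library. `MatchesCpuAdd` and `DemoCheck` are hypothetical names, and the relative tolerance of 1e-5 is an assumed value, not something taken from the article.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// True if out[i] matches in1[i] + in2[i] for every i, within a relative tolerance.
bool MatchesCpuAdd(const float* in1, const float* in2, const float* out,
                   std::size_t len, float rel_tol = 1e-5f) {
    for (std::size_t i = 0; i < len; ++i) {
        const float expected = in1[i] + in2[i];
        const float allowed = rel_tol * std::max(1.0f, std::fabs(expected));
        // Written as !(diff <= allowed) so that a NaN output counts as a mismatch.
        if (!(std::fabs(out[i] - expected) <= allowed)) {
            return false;
        }
    }
    return true;
}

// Rebuilds the article's inputs (input1[i] = i, input2[i] = 2*i) and checks a
// candidate output buffer. A negative corrupt_index leaves the output untouched.
bool DemoCheck(int corrupt_index) {
    const int len = 1024;
    std::vector<float> in1(len), in2(len), out(len);
    for (int i = 0; i < len; ++i) {
        in1[i] = static_cast<float>(i);
        in2[i] = static_cast<float>(2 * i);
        out[i] = static_cast<float>(3 * i);  // stands in for the copied-back device result
    }
    if (corrupt_index >= 0 && corrupt_index < len) {
        out[corrupt_index] += 1.0f;
    }
    return MatchesCpuAdd(in1.data(), in2.data(), out.data(), out.size());
}
```

In the inference program, the equivalent call would be `MatchesCpuAdd(g_host_input1, g_host_input2, g_host_output, DATA_LEN)`.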

2.5 Step 5: Release resources

void ReleaseAllResources() {
    // 1. Destroy the model description and unload the model
    if (g_model_desc != nullptr) {
        aclmdlDestroyDesc(g_model_desc);
        g_model_desc = nullptr;
    }
    if (g_model_loaded) {
        aclmdlUnload(g_model_handle);
        g_model_loaded = false;
    }

    // 2. Free device memory
    if (g_device_input1 != nullptr) {
        aclrtFree(g_device_input1);
        g_device_input1 = nullptr;
    }
    if (g_device_input2 != nullptr) {
        aclrtFree(g_device_input2);
        g_device_input2 = nullptr;
    }
    if (g_device_output != nullptr) {
        aclrtFree(g_device_output);
        g_device_output = nullptr;
    }

    // 3. Free host memory
    if (g_host_input1 != nullptr) {
        delete[] g_host_input1;
        g_host_input1 = nullptr;
    }
    if (g_host_input2 != nullptr) {
        delete[] g_host_input2;
        g_host_input2 = nullptr;
    }
    if (g_host_output != nullptr) {
        delete[] g_host_output;
        g_host_output = nullptr;
    }

    // 4. Destroy the stream and the context
    if (g_stream != nullptr) {
        aclrtDestroyStream(g_stream);
        g_stream = nullptr;
    }
    if (g_context != nullptr) {
        aclrtDestroyContext(g_context);
        g_context = nullptr;
    }

    // 5. Reset the device and finalize AscendCL
    aclrtResetDevice(g_device_id);
    aclFinalize();

    std::cout << "All resources released!" << std::endl;
}

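Code explanation:

Resources are released in roughly the reverse order of their creation: model, then device memory, then host memory, then stream, then context, then device, then AscendCL. The part of the order that matters most is the tail: destroy the stream before the context, destroy the context before resetting the device, and call aclFinalize last.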
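
C++ destroys local objects in reverse order of declaration, which makes scope guards a natural fit for this release order. The following is a minimal sketch of that idea using only the standard library; `ScopedRelease` and `DemoReleaseOrder` are hypothetical names, and each lambda only logs the name of the AscendCL call it would make in real code.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical guard: runs the given release action when it goes out of scope.
class ScopedRelease {
public:
    explicit ScopedRelease(std::function<void()> release) : release_(std::move(release)) {}
    ~ScopedRelease() {
        if (release_) {
            release_();
        }
    }
    ScopedRelease(const ScopedRelease&) = delete;
    ScopedRelease& operator=(const ScopedRelease&) = delete;

private:
    std::function<void()> release_;
};

// Declares guards in creation order and returns the order in which they fired.
std::vector<std::string> DemoReleaseOrder() {
    std::vector<std::string> log;
    {
        ScopedRelease acl([&log] { log.push_back("aclFinalize"); });
        ScopedRelease device([&log] { log.push_back("aclrtResetDevice"); });
        ScopedRelease context([&log] { log.push_back("aclrtDestroyContext"); });
        ScopedRelease stream([&log] { log.push_back("aclrtDestroyStream"); });
        ScopedRelease model([&log] { log.push_back("aclmdlUnload"); });
    }  // guards fire here, last declared first
    return log;
}
```

With guards declared right after each successful setup call, an early return releases exactly what was acquired, with no need to call a cleanup function on every error path.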

2.6 Main entry point

int main() {
    // 1. Initialize AscendCL
    if (!InitAscendCL()) {
        return -1;
    }

    // 2. Load the model (replace with the path to your .om model)
    std::string model_path = "./add_model.om";
    if (!LoadModel(model_path)) {
        ReleaseAllResources();
        return -1;
    }

    // 3. Allocate memory and transfer data
    if (!AllocMemoryAndCopyData()) {
        ReleaseAllResources();
        return -1;
    }

    // 4. Run inference
    if (!ExecuteModelInference()) {
        ReleaseAllResources();
        return -1;
    }

    // 5. Release resources
    ReleaseAllResources();
    return 0;
}

3. Build and run

3.1 Build command

g++ -std=c++11 -o ascend_cl_infer ascend_cl_infer.cpp -I${ASCEND_TOOLKIT_HOME}/include -L${ASCEND_TOOLKIT_HOME}/lib64 -lascendcl -lpthread -fPIC

3.2 Pre-run configuration

export LD_LIBRARY_PATH=${ASCEND_TOOLKIT_HOME}/lib64:$LD_LIBRARY_PATH
# Note: the sample code hard-codes g_device_id = 0 and does not read this variable
export ASCEND_DEVICE_ID=0

3.3 Expected output

AscendCL initialized successfully!
Model inputs: 2, outputs: 1
Memory allocation and data copy succeeded!
Inference results (first 10):
input1[0] + input2[0] = 0
input1[1] + input2[1] = 3
input1[2] + input2[2] = 6
input1[3] + input2[3] = 9
input1[4] + input2[4] = 12
input1[5] + input2[5] = 15
input1[6] + input2[6] = 18
input1[7] + input2[7] = 21
input1[8] + input2[8] = 24
input1[9] + input2[9] = 27
All resources released!

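4. Key points to watch

Model build: the custom Add operator must first be implemented and compiled with Ascend C and deployed into the CANN operator library so that ATC can resolve it during model conversion; the .om model is then produced with the ATC tool (for example: atc --model=add_model.onnx --framework=5 --output=add_model --soc_version=Ascend310);
Memory alignment: device memory must satisfy the alignment requirements of Ascend hardware (aclrtMalloc handles this for the buffers it allocates, but custom operators need to take care of it themselves);
Error diagnosis: when a call fails, aclGetRecentErrMsg returns a description of the most recent error (for example: const char* err_msg = aclGetRecentErrMsg();).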
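
The error branches in this article all print the failing step and the error code by hand. A small formatting helper can keep those messages uniform and append the detailed message when one is available. This is a sketch using only the standard library; `FormatAclError` is a hypothetical helper, and the assumption is that in real code `detail` would be whatever string the AscendCL error-message call returns, which may be null.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: builds one log line for a failed call.
// `detail` is optional and may be null or empty.
std::string FormatAclError(const std::string& call, int code, const char* detail) {
    std::ostringstream oss;
    oss << call << " failed, error code: " << code;
    if (detail != nullptr && detail[0] != '\0') {
        oss << ", detail: " << detail;
    }
    return oss.str();
}
```

A call site would then look like `std::cerr << FormatAclError("aclrtMalloc", ret, err_msg) << std::endl;`, where `err_msg` is the detail string obtained right after the failure.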

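Summary

This article implemented the full AscendCL model inference flow in complete code, from environment initialization through resource release, including running a model that contains a custom Add operator, and covered the core practical points of Ascend inference development. With this code skeleton in hand, you can quickly adapt it to inference for other models (such as ResNet or BERT) and put the Ascend AI full stack to practical use.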

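The 2025 Ascend CANN Training Camp, Season 2, builds on CANN's open-source, all-scenario platform and offers themed courses including a zero-to-one beginner series, a coding-focused special series, and developer case studies, helping developers at every stage quickly improve their operator development skills. Earn the Ascend C Operator Intermediate Certification to receive a certificate; completing community tasks also gives you a chance to win prizes such as Huawei phones, tablets, and development boards.

Registration link: https://www.hiascend.com/developer/activities/cann20252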
