CANN cann-samples: Understanding the AscendCL Call Flow from Sample Code

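1. Introduction
Ascend CANN (Compute Architecture for Neural Networks) is Huawei's heterogeneous computing architecture for AI workloads; it provides the development and runtime stack for Ascend NPUs. cann-samples is the official CANN sample repository. It covers model inference, operator development, media processing, and collective communication, which makes it a practical reference for learning the AscendCL (Ascend Computing Language) programming interface. This article starts from the cann-samples code and walks through the complete AscendCL call flow to help developers get productive on Ascend NPUs quickly.
2. cann-samples project structure
2.1 Sample categories
cann-samples groups its samples by application scenario:
| Category | Description | Typical samples |
|---|---|---|
| inference | Model inference | Inference with ResNet, YOLO, BERT, and similar models |
| operator | Operator development | Custom operator samples |
| media | Media processing | Image/video encoding and decoding, preprocessing |
| communication | Collective communication | Multi-device distributed communication |
| misc | Other features | Performance tuning, tool usage, and so on |
2.2 Directory layout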
cann-samples/
├── samples/
│   ├── inference/          # model inference samples
│   │   ├── modelInference/
│   │   ├── modelInferenceV2/
│   │   └── ...
│   ├── operator/           # operator development samples
│   │   ├── AddCustom/
│   │   └── ...
│   ├── media/              # media processing samples
│   │   ├── dvpp/
│   │   └── ...
│   └── communication/      # collective communication samples
│       ├── broadcast/
│       └── ...
├── scripts/                # build and run scripts
├── docs/                   # documentation
└── README.md
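2.3 Build and run scripts
cann-samples provides a standardized build-and-run workflow:
# 1. Set up the environment variables
source /usr/local/Ascend/ascend-toolkit/set_env.sh
# 2. Build the sample
cd samples/inference/modelInference
bash build.sh
# 3. Run the sample
./out/main
The build.sh script takes care of CMake configuration and dependency linking, so developers only need to focus on the application code.
3. The complete AscendCL call flow
An AscendCL program follows a strict lifecycle. From initialization to resource release it breaks down into seven key steps.
3.1 Lifecycle overview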
aclInit
  -> aclrtSetDevice
  -> aclrtCreateContext
  -> aclrtCreateStream
  -> load and execute the model (or operator)
  -> synchronize and read the results
  -> release resources in reverse order
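3.2 Step by step
Step 1: Initialize the AscendCL runtime
// Initialize the AscendCL runtime environment
aclError ret = aclInit(nullptr);
if (ret != ACL_SUCCESS) {
    std::cerr << "aclInit failed, ret = " << ret << std::endl;
    return -1;
}
aclInit is the entry point of every AscendCL program: it initializes the runtime's internal resources and must be called once per process before any other AscendCL interface. Its argument is the path to a configuration file; nullptr selects the default configuration.
Step 2: Select the compute device
// Select the NPU device to use
int32_t deviceId = 0;
ret = aclrtSetDevice(deviceId);
if (ret != ACL_SUCCESS) {
    std::cerr << "aclrtSetDevice failed, ret = " << ret << std::endl;
    aclFinalize();
    return -1;
}
On multi-device systems, deviceId selects the target NPU.
Step 3: Create the compute context
// Create a Context to manage compute resources
aclrtContext context = nullptr;
ret = aclrtCreateContext(&context, deviceId);
if (ret != ACL_SUCCESS) {
    std::cerr << "aclrtCreateContext failed, ret = " << ret << std::endl;
    aclrtResetDevice(deviceId);
    aclFinalize();
    return -1;
}
A Context is a container for compute resources. One device can host several Contexts, which isolates their resources from each other. aclrtSetDevice already creates a default Context implicitly, so an explicit Context is only needed when you want to manage its lifetime yourself.
Step 4: Create the execution stream
// Create a Stream to queue tasks
aclrtStream stream = nullptr;
ret = aclrtCreateStream(&stream);
if (ret != ACL_SUCCESS) {
    std::cerr << "aclrtCreateStream failed, ret = " << ret << std::endl;
    aclrtDestroyContext(context);
    aclrtResetDevice(deviceId);
    aclFinalize();
    return -1;
}
A Stream is a task queue for asynchronous execution. Asynchronous compute and memory-copy tasks submitted to the same Stream run in submission order.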
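Step 5: Load and execute the model
// Load the offline model; modelId identifies it in all later calls
uint32_t modelId = 0;
ret = aclmdlLoadFromFile("model.om", &modelId);
if (ret != ACL_SUCCESS) {
    std::cerr << "aclmdlLoadFromFile failed, ret = " << ret << std::endl;
    // Release the stream, context, and device as in step 4, then return
}
// Build the input and output datasets
aclmdlDataset* inputDataset = nullptr;
aclmdlDataset* outputDataset = nullptr;
// ... (see the full example later in this article)
// Run inference; aclmdlExecute is synchronous and returns once the result is ready
ret = aclmdlExecute(modelId, inputDataset, outputDataset);
Step 6: Synchronize and read the results
// Required only when work was submitted asynchronously, for example with
// aclmdlExecuteAsync(modelId, inputDataset, outputDataset, stream)
ret = aclrtSynchronizeStream(stream);
if (ret != ACL_SUCCESS) {
    std::cerr << "aclrtSynchronizeStream failed, ret = " << ret << std::endl;
}
// Read the results from the output dataset
// ...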
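Step 7: Release resources (reverse order)
// Release resources in the reverse order of creation
aclmdlUnload(modelId);           // unload the model
aclrtDestroyStream(stream);      // destroy the Stream
aclrtDestroyContext(context);    // destroy the Context
aclrtResetDevice(deviceId);      // reset the device
aclFinalize();                   // finalize AscendCL
Resources are released last-in, first-out, the reverse of the creation order, so that nothing is destroyed while another resource still depends on it.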
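The last-in, first-out rule can be made mechanical instead of being repeated in every error branch. The following is a minimal host-only sketch, assuming a hypothetical `CleanupStack` helper that is not part of AscendCL: each release action is registered right after its create call succeeds, and everything unwinds newest-first. The lambdas only record names here; in a real program they would call `aclrtDestroyStream`, `aclrtDestroyContext`, and so on.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of AscendCL): remembers one cleanup action per
// acquired resource and runs them newest-first.
class CleanupStack {
public:
    CleanupStack() = default;
    CleanupStack(const CleanupStack&) = delete;
    CleanupStack& operator=(const CleanupStack&) = delete;

    ~CleanupStack() { RunAll(); }

    // Register the release action right after the matching create call succeeds.
    void Push(std::function<void()> action) {
        assert(action && "cleanup action must be callable");
        actions_.push_back(std::move(action));
    }

    // Run and discard every pending action, last registered first.
    void RunAll() {
        while (!actions_.empty()) {
            std::function<void()> action = std::move(actions_.back());
            actions_.pop_back();
            action();
        }
    }

private:
    std::vector<std::function<void()>> actions_;
};

// Stand-in for the real sequence: each Push would wrap aclFinalize,
// aclrtResetDevice, aclrtDestroyContext, and aclrtDestroyStream.
std::vector<std::string> DemoReleaseOrder() {
    std::vector<std::string> log;
    {
        CleanupStack cleanup;
        cleanup.Push([&log] { log.push_back("finalize"); });
        cleanup.Push([&log] { log.push_back("reset device"); });
        cleanup.Push([&log] { log.push_back("destroy context"); });
        cleanup.Push([&log] { log.push_back("destroy stream"); });
    }  // cleanup leaves scope here and unwinds in reverse
    return log;
}
```

With this shape, an early return after a failed create call needs no hand-written release list: whatever was registered so far is released in the right order.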
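4. Key API parameters explained
4.1 aclrtMalloc: memory allocation
aclrtMalloc is the core device memory allocation interface in AscendCL. It supports several allocation policies:
aclError aclrtMalloc(
    void** devPtr,               // output: pointer to the device memory
    size_t size,                 // allocation size in bytes
    aclrtMemMallocPolicy policy  // allocation policy
);
Allocation policies (aclrtMemMallocPolicy):
typedef enum aclrtMemMallocPolicy {
    ACL_MEM_MALLOC_HUGE_FIRST = 0,   // prefer huge pages, fall back to normal pages
    ACL_MEM_MALLOC_HUGE_ONLY,        // huge pages only
    ACL_MEM_MALLOC_NORMAL_ONLY,      // normal pages only
    ACL_MEM_MALLOC_HUGE_FIRST_P2P,   // P2P scenario, prefer huge pages
    ACL_MEM_MALLOC_HUGE_ONLY_P2P,    // P2P scenario, huge pages only
    ACL_MEM_MALLOC_NORMAL_ONLY_P2P   // P2P scenario, normal pages only
} aclrtMemMallocPolicy;
Usage example:
void* deviceMem = nullptr;
size_t memSize = 1024 * 1024;  // 1 MB
// Prefer huge pages, which usually perform better for large buffers
aclError ret = aclrtMalloc(&deviceMem, memSize, ACL_MEM_MALLOC_HUGE_FIRST);
if (ret != ACL_SUCCESS) {
    std::cerr << "aclrtMalloc failed, ret = " << ret << std::endl;
} else {
    // ... use deviceMem ...
    // Every successful allocation must be freed
    aclrtFree(deviceMem);
}
Huge pages vs. normal pages:
- Huge pages: fewer TLB misses; suited to large, contiguous memory access
- Normal pages: more flexible allocation; suited to small buffers or frequent allocate/free cycles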
4.2 aclrtMemcpy: memory copy
aclrtMemcpy is a synchronous copy: it returns after the data has been transferred. It moves data between host and device, and also within each side:
aclError aclrtMemcpy(
    void* dst,            // destination address
    size_t dstMax,        // capacity of the destination buffer in bytes
    const void* src,      // source address
    size_t count,         // number of bytes to copy
    aclrtMemcpyKind kind  // copy direction
);
Copy directions (aclrtMemcpyKind):
typedef enum aclrtMemcpyKind {
    ACL_MEMCPY_HOST_TO_HOST = 0,   // Host -> Host
    ACL_MEMCPY_HOST_TO_DEVICE,     // Host -> Device (H2D)
    ACL_MEMCPY_DEVICE_TO_HOST,     // Device -> Host (D2H)
    ACL_MEMCPY_DEVICE_TO_DEVICE    // Device -> Device (D2D)
} aclrtMemcpyKind;
Typical usage (return values should be checked as in section 3; checks are omitted here for brevity):
// 1. Copy host data to the device
float hostData[1024] = {};
void* deviceData = nullptr;
aclrtMalloc(&deviceData, sizeof(hostData), ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMemcpy(deviceData, sizeof(hostData), hostData, sizeof(hostData),
            ACL_MEMCPY_HOST_TO_DEVICE);
// 2. Copy device data back to the host
float resultData[1024] = {};
aclrtMemcpy(resultData, sizeof(resultData), deviceData, sizeof(resultData),
            ACL_MEMCPY_DEVICE_TO_HOST);
// 3. Copy within the device (for example, in multi-input scenarios)
void* deviceData2 = nullptr;
aclrtMalloc(&deviceData2, sizeof(hostData), ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMemcpy(deviceData2, sizeof(hostData), deviceData, sizeof(hostData),
            ACL_MEMCPY_DEVICE_TO_DEVICE);
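The `(dst, dstMax, src, count)` parameter order follows the pattern of a bounds-checked copy, where the destination capacity is passed separately from the number of bytes to move. The sketch below illustrates that pattern on host memory only. `BoundedCopy` is a hypothetical stand-in, not an AscendCL function, and the assumption that a copy with `count > dstMax` should be rejected is this sketch's, so check the AscendCL reference for the exact behavior of aclrtMemcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical host-only stand-in that mirrors the (dst, dstMax, src, count)
// parameter order; it is not an AscendCL function.
bool BoundedCopy(void* dst, std::size_t dstMax, const void* src, std::size_t count) {
    assert(dst != nullptr && src != nullptr);
    if (count > dstMax) {
        return false;  // refuse the copy instead of overrunning the destination
    }
    std::memcpy(dst, src, count);
    return true;
}
```

Passing `sizeof(destination)` as `dstMax` and the payload size as `count`, rather than the same expression twice, is what lets a check like this catch a mismatch.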
4.3 aclmdlExecute: model execution
aclmdlExecute is the core inference interface. It is synchronous; the asynchronous variant is aclmdlExecuteAsync, which takes an additional Stream argument.
aclError aclmdlExecute(
    uint32_t modelId,            // model ID
    const aclmdlDataset* input,  // input dataset
    aclmdlDataset* output        // output dataset
);
Building the input and output datasets:
// Create the input dataset from a device buffer
aclmdlDataset* CreateInputDataset(void* inputData, size_t inputSize) {
    aclmdlDataset* dataset = aclmdlCreateDataset();
    // Wrap the buffer in a data buffer descriptor
    aclDataBuffer* dataBuffer = aclCreateDataBuffer(inputData, inputSize);
    // Add the data buffer to the dataset
    aclmdlAddDatasetBuffer(dataset, dataBuffer);
    return dataset;
}
// Create the output dataset (the output size must be known in advance)
aclmdlDataset* CreateOutputDataset(size_t outputSize) {
    void* outputData = nullptr;
    aclrtMalloc(&outputData, outputSize, ACL_MEM_MALLOC_HUGE_FIRST);
    aclmdlDataset* dataset = aclmdlCreateDataset();
    aclDataBuffer* dataBuffer = aclCreateDataBuffer(outputData, outputSize);
    aclmdlAddDatasetBuffer(dataset, dataBuffer);
    return dataset;
}
// Run inference
aclmdlDataset* input = CreateInputDataset(inputBuffer, inputSize);
aclmdlDataset* output = CreateOutputDataset(outputSize);
aclError ret = aclmdlExecute(modelId, input, output);
// aclmdlExecute has already completed here; no stream synchronization is needed
5. From sample code to production code
cann-samples demonstrates that each feature works, but production environments need sturdier implementations.
5.1 Error handling
Sample code (simplified):
aclError ret = aclInit(nullptr);
if (ret != ACL_SUCCESS) {
    return -1;
}
Production code (detailed error information):
// Error-checking macro. It executes `return ret;`, so use it only inside
// functions that return aclError (or int).
#define ACL_CHECK(call)                                              \
    do {                                                             \
        aclError ret = (call);                                       \
        if (ret != ACL_SUCCESS) {                                    \
            const char* errorStr = aclGetRecentErrMsg();             \
            std::cerr << "[ERROR] " << #call << " failed: "          \
                      << "code=" << ret                              \
                      << ", msg=" << (errorStr ? errorStr : "N/A")   \
                      << " at " << __FILE__ << ":" << __LINE__       \
                      << std::endl;                                  \
            return ret;                                              \
        }                                                            \
    } while (0)
// The macro keeps call sites short
ACL_CHECK(aclInit(nullptr));
ACL_CHECK(aclrtSetDevice(0));
ACL_CHECK(aclrtCreateContext(&context, 0));
5.2 Resource management with RAII
Manual resource management is easy to get wrong, so the RAII (Resource Acquisition Is Initialization) pattern is recommended:
// RAII wrapper (ACL_CHECK_THROW is the throwing macro defined in section 8.2)
class AscendContext {
public:
    explicit AscendContext(int32_t deviceId)
        : deviceId_(deviceId), context_(nullptr), stream_(nullptr) {
        ACL_CHECK_THROW(aclrtSetDevice(deviceId_));
        ACL_CHECK_THROW(aclrtCreateContext(&context_, deviceId_));
        ACL_CHECK_THROW(aclrtCreateStream(&stream_));
    }
    ~AscendContext() {
        if (stream_) {
            aclrtDestroyStream(stream_);
        }
        if (context_) {
            aclrtDestroyContext(context_);
        }
        aclrtResetDevice(deviceId_);
    }
    // Non-copyable
    AscendContext(const AscendContext&) = delete;
    AscendContext& operator=(const AscendContext&) = delete;
    // Movable (declarations only; the definitions must null out the source's handles)
    AscendContext(AscendContext&& other) noexcept;
    AscendContext& operator=(AscendContext&& other) noexcept;
    aclrtStream GetStream() const { return stream_; }
    aclrtContext GetContext() const { return context_; }
private:
    int32_t deviceId_;
    aclrtContext context_;
    aclrtStream stream_;
};
// Using the RAII wrapper
void ProcessModel() {
    AscendContext ctx(0);  // resources are acquired in the constructor
    // Application logic...
    // The destructor runs when the function exits and releases the resources
}
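A move-enabled RAII wrapper has to leave the moved-from object empty, otherwise both objects release the same handle. Here is a minimal sketch of that rule, assuming a hypothetical `UniqueHandle` template in which `Handle` stands in for a type such as `aclrtStream` and `Release` for the matching destroy function. The demo at the bottom uses an `int*` handle and a counting release function so it runs without an NPU.

```cpp
#include <cassert>
#include <utility>

// Hypothetical move-only owner of a raw handle. `Handle` stands in for
// aclrtStream or aclrtContext, and `Release` for the matching destroy call.
template <typename Handle, void (*Release)(Handle)>
class UniqueHandle {
public:
    explicit UniqueHandle(Handle handle) : handle_(handle) {}
    ~UniqueHandle() { Reset(); }

    UniqueHandle(const UniqueHandle&) = delete;
    UniqueHandle& operator=(const UniqueHandle&) = delete;

    // Take over the handle and leave the source empty so it releases nothing.
    UniqueHandle(UniqueHandle&& other) noexcept
        : handle_(std::exchange(other.handle_, Handle{})) {}

    UniqueHandle& operator=(UniqueHandle&& other) noexcept {
        if (this != &other) {
            Reset();  // release what this object currently owns
            handle_ = std::exchange(other.handle_, Handle{});
        }
        return *this;
    }

    Handle Get() const { return handle_; }

private:
    void Reset() {
        if (handle_ != Handle{}) {
            Release(handle_);
            handle_ = Handle{};
        }
    }

    Handle handle_{};
};

// Demo plumbing: an int* "handle" whose release function only counts calls.
inline int& ReleaseCount() {
    static int count = 0;
    return count;
}
inline void CountingRelease(int* handle) {
    assert(handle != nullptr);
    ++ReleaseCount();
}
using DemoHandle = UniqueHandle<int*, &CountingRelease>;
```

Moving a `DemoHandle` into another and letting both go out of scope calls the release function exactly once, which is the property the move operations of a wrapper like `AscendContext` need to preserve.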
5.3 Multi-Stream parallelism
Production code often uses several Streams to run tasks in parallel:
// Parallel inference on multiple Streams.
// Returns aclError because ACL_CHECK returns the error code on failure.
// PrepareInput and PrepareOutput are application-defined helpers.
aclError ParallelInference(uint32_t modelId, int streamCount) {
    std::vector<aclrtStream> streams(streamCount);
    std::vector<aclmdlDataset*> inputs(streamCount);
    std::vector<aclmdlDataset*> outputs(streamCount);
    // Create the Streams and their datasets
    for (int i = 0; i < streamCount; i++) {
        ACL_CHECK(aclrtCreateStream(&streams[i]));
        inputs[i] = PrepareInput(i);
        outputs[i] = PrepareOutput(i);
    }
    // Submit the tasks asynchronously, one per Stream
    for (int i = 0; i < streamCount; i++) {
        ACL_CHECK(aclmdlExecuteAsync(modelId, inputs[i], outputs[i], streams[i]));
    }
    // Wait for every Stream to finish
    for (int i = 0; i < streamCount; i++) {
        ACL_CHECK(aclrtSynchronizeStream(streams[i]));
    }
    // Release the Streams
    for (int i = 0; i < streamCount; i++) {
        aclrtDestroyStream(streams[i]);
    }
    return ACL_SUCCESS;
}
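5.4 Log levels
Production code needs leveled logging. AscendCL exposes application logging through the ACL_APP_LOG macro with the levels ACL_DEBUG, ACL_INFO, ACL_WARNING, and ACL_ERROR. Which levels are emitted, and whether logs are also printed to stdout, is controlled through the CANN logging environment variables rather than from code:
# Shell: emit INFO and above, and also print logs to stdout
export ASCEND_GLOBAL_LOG_LEVEL=1
export ASCEND_SLOG_PRINT_TO_STDOUT=1
// Logging example: report the number of model inputs and outputs
void LogModelInfo(uint32_t modelId) {
    aclmdlDesc* desc = aclmdlCreateDesc();
    if (aclmdlGetDesc(desc, modelId) == ACL_SUCCESS) {
        size_t inputNum = aclmdlGetNumInputs(desc);
        size_t outputNum = aclmdlGetNumOutputs(desc);
        ACL_APP_LOG(ACL_INFO, "Model loaded: inputs=%zu, outputs=%zu", inputNum, outputNum);
    }
    aclmdlDestroyDesc(desc);
}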
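6. Side-by-side comparison of the sample types
6.1 Model inference samples (inference)
Characteristics:
- Use offline models (.om files)
- Focus on the preprocessing, inference, and postprocessing pipeline
- Sample path:
samples/inference/modelInference
Core flow:
// Core model inference calls
aclmdlLoadFromFile("resnet50.om", &modelId);
aclmdlExecute(modelId, input, output);
aclmdlUnload(modelId);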
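6.2 Operator development samples (operator)
Characteristics:
- Develop custom operators in Ascend C
- Focus on operator input/output design and kernel function implementation
- Sample path:
samples/operator/AddCustom
Core flow:
// Ascend C operator development (outline)
// Kernel function definition
extern "C" __global__ __aicore__ void add_custom(
    GM_ADDR x, GM_ADDR y, GM_ADDR z) {
    // Operator implementation
    // ...
}
// Host side: launch the kernel on a Stream with the <<<...>>> kernel call syntax.
// blockDim is the number of cores to use; x, y, and z are device buffers.
add_custom<<<blockDim, nullptr, stream>>>(x, y, z);
aclrtSynchronizeStream(stream);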
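6.3 Media processing samples (media)
Characteristics:
- Use DVPP (Digital Vision Pre-Processing) hardware acceleration
- Focus on image/video encoding and decoding and on format conversion
- Sample path:
samples/media/dvpp
Core flow:
// DVPP image processing
acldvppChannelDesc* channelDesc = acldvppCreateChannelDesc();
acldvppCreateChannel(channelDesc);
// Describe the decoded output picture (its buffer, format, and size are set on this descriptor)
acldvppPicDesc* outputPic = acldvppCreatePicDesc();
// Decode a JPEG image. jpegData and jpegSize are placeholders for the encoded
// image buffer and its length in bytes.
acldvppJpegDecodeAsync(channelDesc, jpegData, jpegSize, outputPic, stream);
aclrtSynchronizeStream(stream);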
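6.4 Collective communication samples (communication)
Characteristics:
- Target multi-device distributed training
- Focus on communication primitives: Broadcast, Reduce, AllReduce, and so on
- Sample path:
samples/communication/broadcast
Core flow:
// Collective communication with HCCL (#include "hccl/hccl.h").
// comm is an HcclComm created beforehand, for example with HcclCommInitRootInfo.
const uint32_t rankNum = 8;
void* sendBuf = nullptr;
void* recvBuf = nullptr;
aclrtMalloc(&sendBuf, dataSize, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc(&recvBuf, dataSize * rankNum, ACL_MEM_MALLOC_HUGE_FIRST);
// AllGather: every rank sends dataCount elements and receives rankNum * dataCount
HcclAllGather(sendBuf, recvBuf, dataCount, dataType, comm, stream);
aclrtSynchronizeStream(stream);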
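6.5 Comparison summary
| Sample type | Core capability | Typical use case | Difficulty |
|---|---|---|---|
| Model inference | Loading and executing offline models | Deep learning inference deployment | ★★☆ |
| Operator development | Custom operators in Ascend C | Operator optimization and extension | ★★★★ |
| Media processing | DVPP hardware acceleration | Image/video preprocessing | ★★★ |
| Collective communication | Multi-device communication primitives | Distributed training and inference | ★★★★ |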
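7. Key pitfalls and how to avoid them
7.1 Pitfall 1: insufficiently aligned aclrtMalloc memory hurts performance
Problem:
aclrtMalloc guarantees only a baseline address alignment (64 bytes is assumed here), while some high-performance operators, such as matrix multiplication, want 256-byte or stricter alignment. Buffers that miss the required alignment can make memory access noticeably less efficient.
Problematic example:
void* mem = nullptr;
aclrtMalloc(&mem, 1024, ACL_MEM_MALLOC_HUGE_FIRST);
// May return an address such as 0x7f000040 (64-byte aligned, not 256-byte aligned)
// The cost depends on the operator; a slowdown of 20%-30% is possible and should be confirmed by profiling
Solutions:
// Option 1: over-allocate and align manually
size_t requiredSize = 1024;
size_t alignedSize = requiredSize + 256;  // reserve room for the alignment shift
void* rawMem = nullptr;
aclrtMalloc(&rawMem, alignedSize, ACL_MEM_MALLOC_HUGE_FIRST);
// Round the address up to the next 256-byte boundary
uintptr_t addr = reinterpret_cast<uintptr_t>(rawMem);
uintptr_t alignedAddr = (addr + 255) & ~static_cast<uintptr_t>(255);
void* alignedMem = reinterpret_cast<void*>(alignedAddr);
// Use alignedMem for computation...
// Always free the original pointer, not the aligned one
aclrtFree(rawMem);
// Option 2: a helper class for aligned device memory
// (alignment must be a power of two; ACL_CHECK_THROW is defined in section 8.2)
class AlignedDeviceMemory {
public:
    explicit AlignedDeviceMemory(size_t size, size_t alignment = 256) {
        totalSize_ = size + alignment;
        ACL_CHECK_THROW(aclrtMalloc(&rawPtr_, totalSize_, ACL_MEM_MALLOC_HUGE_FIRST));
        uintptr_t addr = reinterpret_cast<uintptr_t>(rawPtr_);
        uintptr_t alignedAddr = (addr + alignment - 1) & ~(static_cast<uintptr_t>(alignment) - 1);
        alignedPtr_ = reinterpret_cast<void*>(alignedAddr);
    }
    ~AlignedDeviceMemory() {
        if (rawPtr_) {
            aclrtFree(rawPtr_);
        }
    }
    // Non-copyable: two owners would free the same pointer twice
    AlignedDeviceMemory(const AlignedDeviceMemory&) = delete;
    AlignedDeviceMemory& operator=(const AlignedDeviceMemory&) = delete;
    void* Get() const { return alignedPtr_; }
private:
    void* rawPtr_ = nullptr;
    void* alignedPtr_ = nullptr;
    size_t totalSize_ = 0;
};
// Usage
AlignedDeviceMemory mem(1024, 256);  // 256-byte aligned
void* devicePtr = mem.Get();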
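The round-up arithmetic used in both options can be checked on the host without an NPU. The sketch below isolates it in two small helpers; `AlignUp`, `PaddedSize`, and `IsPowerOfTwo` are names chosen for this illustration. The mask trick is only valid when the alignment is a power of two, which the helpers assert.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when `value` is a non-zero power of two.
constexpr bool IsPowerOfTwo(std::size_t value) {
    return value != 0 && (value & (value - 1)) == 0;
}

// Round `address` up to the next multiple of `alignment` (a power of two).
inline std::uintptr_t AlignUp(std::uintptr_t address, std::size_t alignment) {
    assert(IsPowerOfTwo(alignment));
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (address + mask) & ~mask;
}

// Bytes to request so that an aligned block of `size` bytes always fits.
inline std::size_t PaddedSize(std::size_t size, std::size_t alignment) {
    assert(IsPowerOfTwo(alignment));
    return size + alignment - 1;
}
```

For the address from the example, `AlignUp(0x7f000040, 256)` adds 0xff to get 0x7f00013f and then clears the low eight bits, giving 0x7f000100. An address that is already aligned is returned unchanged, and `alignment - 1` extra bytes are enough to cover the worst-case shift.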
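7.2 Pitfall 2: Contexts that are not released leak NPU resources
Problem:
A Context is the unit that manages NPU resources. If it is not released, device memory and compute resources stay occupied and the device eventually runs out of resources.
Problematic example:
void ProcessModel() {
    aclrtContext context = nullptr;
    aclrtCreateContext(&context, 0);
    // An early return in the application logic skips the release
    if (someError) {
        return;  // the Context leaks here
    }
    aclrtDestroyContext(context);
}
Consequences:
- Device memory usage keeps growing
- Later Context creation fails
- In severe cases the NPU becomes unusable until the device is restarted
Solutions:
// Option 1: automatic management with RAII
class ContextGuard {
public:
    explicit ContextGuard(int32_t deviceId) : deviceId_(deviceId) {
        ACL_CHECK_THROW(aclrtCreateContext(&context_, deviceId_));
    }
    ~ContextGuard() {
        if (context_) {
            aclrtDestroyContext(context_);
        }
    }
    ContextGuard(const ContextGuard&) = delete;
    ContextGuard& operator=(const ContextGuard&) = delete;
    aclrtContext Get() const { return context_; }
private:
    int32_t deviceId_;
    aclrtContext context_ = nullptr;
};
// Using the guard
void ProcessModel() {
    ContextGuard guard(0);  // lifetime is managed automatically
    if (someError) {
        return;  // the destructor runs and releases the Context
    }
    // Normal flow...
}
// Option 2: smart pointer (requires <memory> and <type_traits>).
// aclrtContext is itself a pointer type (void*), so the unique_ptr element type
// must be the pointee type; otherwise the unique_ptr would hold an aclrtContext*.
using ContextPtr = std::unique_ptr<std::remove_pointer_t<aclrtContext>,
                                   decltype(&aclrtDestroyContext)>;
ContextPtr CreateContextRAII(int32_t deviceId) {
    aclrtContext context = nullptr;
    ACL_CHECK_THROW(aclrtCreateContext(&context, deviceId));
    return ContextPtr(context, &aclrtDestroyContext);
}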
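The smart-pointer approach hinges on one detail: `std::unique_ptr<T, D>` stores a `T*`, so an opaque handle that is already a pointer has to be "unwrapped" with `std::remove_pointer_t` first. The following host-only sketch shows this with stand-ins: `FakeContext` is a `void*` typedef shaped like `aclrtContext`, and `FakeDestroyContext` is a counting substitute for the real destroy call. Neither name exists in AscendCL.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Stand-ins for this sketch: an opaque handle typedef'd to void* (the shape
// aclrtContext has in the headers) and a destroy function that counts calls.
using FakeContext = void*;

inline int& DestroyCalls() {
    static int calls = 0;
    return calls;
}

inline int FakeDestroyContext(FakeContext context) {
    assert(context != nullptr);
    ++DestroyCalls();
    return 0;  // mimics a status-code return value, which unique_ptr ignores
}

// unique_ptr<T, D> stores a T*, so T must be the pointee type (void here),
// not the handle type itself.
using ContextPtr = std::unique_ptr<std::remove_pointer_t<FakeContext>,
                                   decltype(&FakeDestroyContext)>;

// Wrap an existing raw handle; the deleter runs when the ContextPtr dies.
inline ContextPtr AdoptContext(FakeContext raw) {
    return ContextPtr(raw, &FakeDestroyContext);
}
```

A `ContextPtr` holding a null handle never invokes its deleter, so a failed creation that leaves the handle null does not trigger a spurious destroy call.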
Resource leak detection script:
#!/bin/bash
# npu_memory_monitor.sh - watch NPU memory usage to spot resource leaks
echo "=== NPU Memory Monitor ==="
while true; do
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    # Query the memory usage of each NPU. The "Memory Usage" field name and
    # column positions are assumptions and may differ between npu-smi versions.
    for device_id in 0 1 2 3 4 5 6 7; do
        if mem_info=$(npu-smi info -t memory -i "$device_id" 2>/dev/null); then
            used=$(echo "$mem_info" | grep "Memory Usage" | awk '{print $3}')
            total=$(echo "$mem_info" | grep "Memory Usage" | awk '{print $5}')
            if [ -n "$used" ]; then
                echo "[$timestamp] Device $device_id: $used / $total"
            fi
        fi
    done
    echo "---"
    sleep 60
done
Run the monitor:
chmod +x npu_memory_monitor.sh
./npu_memory_monitor.sh > npu_memory.log 2>&1 &
8. Practical code templates
8.1 Complete AscendCL call template
#include <iostream>
#include <string>
#include <vector>
#include "acl/acl.h"
// Error-checking macro; use it only in functions that return aclError or int
#define ACL_CHECK(call)                                          \
    do {                                                         \
        aclError ret = (call);                                   \
        if (ret != ACL_SUCCESS) {                                \
            std::cerr << "[ERROR] " << #call                     \
                      << " failed, code=" << ret                 \
                      << " at " << __FILE__ << ":" << __LINE__   \
                      << std::endl;                              \
            return ret;                                          \
        }                                                        \
    } while (0)
// Runtime resource manager
class AscendRuntime {
public:
    explicit AscendRuntime(int32_t deviceId = 0) : deviceId_(deviceId) {}
    aclError Init() {
        ACL_CHECK(aclInit(nullptr));
        ACL_CHECK(aclrtSetDevice(deviceId_));
        ACL_CHECK(aclrtCreateContext(&context_, deviceId_));
        ACL_CHECK(aclrtCreateStream(&stream_));
        initialized_ = true;
        return ACL_SUCCESS;
    }
    ~AscendRuntime() {
        // Resources are released only after a fully successful Init()
        if (!initialized_) return;
        if (stream_) aclrtDestroyStream(stream_);
        if (context_) aclrtDestroyContext(context_);
        aclrtResetDevice(deviceId_);
        aclFinalize();
    }
    aclrtStream GetStream() const { return stream_; }
    aclrtContext GetContext() const { return context_; }
private:
    int32_t deviceId_;
    aclrtContext context_ = nullptr;
    aclrtStream stream_ = nullptr;
    bool initialized_ = false;
};
// Model inference wrapper
class ModelRunner {
public:
    ModelRunner(const std::string& modelPath, aclrtStream stream)
        : stream_(stream), modelPath_(modelPath) {}
    aclError Load() {
        ACL_CHECK(aclmdlLoadFromFile(modelPath_.c_str(), &modelId_));
        modelDesc_ = aclmdlCreateDesc();
        ACL_CHECK(aclmdlGetDesc(modelDesc_, modelId_));
        return ACL_SUCCESS;
    }
    ~ModelRunner() {
        if (modelDesc_) aclmdlDestroyDesc(modelDesc_);
        if (modelId_ != 0xFFFFFFFF) aclmdlUnload(modelId_);
    }
    // inputData and outputData are device buffers
    aclError Inference(void* inputData, size_t inputSize,
                       void* outputData, size_t outputSize) {
        // Build the input dataset
        aclDataBuffer* inputBuf = aclCreateDataBuffer(inputData, inputSize);
        aclmdlDataset* inputDataset = aclmdlCreateDataset();
        aclmdlAddDatasetBuffer(inputDataset, inputBuf);
        // Build the output dataset
        aclDataBuffer* outputBuf = aclCreateDataBuffer(outputData, outputSize);
        aclmdlDataset* outputDataset = aclmdlCreateDataset();
        aclmdlAddDatasetBuffer(outputDataset, outputBuf);
        // Run inference; aclmdlExecute is synchronous, so no stream
        // synchronization is needed afterwards
        aclError ret = aclmdlExecute(modelId_, inputDataset, outputDataset);
        if (ret != ACL_SUCCESS) {
            std::cerr << "[ERROR] aclmdlExecute failed, code=" << ret << std::endl;
        }
        // Clean up on both the success and the failure path
        aclDestroyDataBuffer(inputBuf);
        aclDestroyDataBuffer(outputBuf);
        aclmdlDestroyDataset(inputDataset);
        aclmdlDestroyDataset(outputDataset);
        return ret;
    }
private:
    uint32_t modelId_ = 0xFFFFFFFF;  // sentinel: no model loaded
    aclmdlDesc* modelDesc_ = nullptr;
    aclrtStream stream_;             // kept for an aclmdlExecuteAsync variant
    std::string modelPath_;
};
// Entry point
int main() {
    AscendRuntime runtime;
    ACL_CHECK(runtime.Init());
    ModelRunner runner("model.om", runtime.GetStream());
    ACL_CHECK(runner.Load());
    // Prepare the host and device buffers
    float inputData[1024] = {};
    float outputData[1024] = {};
    void* dInput = nullptr;
    void* dOutput = nullptr;
    ACL_CHECK(aclrtMalloc(&dInput, sizeof(inputData), ACL_MEM_MALLOC_HUGE_FIRST));
    ACL_CHECK(aclrtMalloc(&dOutput, sizeof(outputData), ACL_MEM_MALLOC_HUGE_FIRST));
    // Copy the input to the device
    ACL_CHECK(aclrtMemcpy(dInput, sizeof(inputData), inputData, sizeof(inputData),
                          ACL_MEMCPY_HOST_TO_DEVICE));
    // Run inference
    ACL_CHECK(runner.Inference(dInput, sizeof(inputData), dOutput, sizeof(outputData)));
    // Copy the output back to the host
    ACL_CHECK(aclrtMemcpy(outputData, sizeof(outputData), dOutput, sizeof(outputData),
                          ACL_MEMCPY_DEVICE_TO_HOST));
    // Clean up
    aclrtFree(dInput);
    aclrtFree(dOutput);
    std::cout << "Inference completed successfully!" << std::endl;
    return 0;
}
8.2 Error-checking macros (extended)
// error_check.h
#ifndef ERROR_CHECK_H
#define ERROR_CHECK_H
#include <cstring>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include "acl/acl.h"
namespace ascend {
// Build a readable message: failed call, error code, recent ACL message, location
inline std::string FormatError(const char* call, aclError code,
                               const char* file, int line) {
    std::ostringstream oss;
    oss << "[ACL_ERROR] " << call
        << " failed with code " << code;
    const char* errMsg = aclGetRecentErrMsg();
    if (errMsg && std::strlen(errMsg) > 0) {
        oss << ": " << errMsg;
    }
    oss << " (at " << file << ":" << line << ")";
    return oss.str();
}
}  // namespace ascend
// Basic check: log and return the error code (for functions returning aclError or int)
#define ACL_CHECK(call)                                              \
    do {                                                             \
        aclError aclRet_ = (call);                                   \
        if (aclRet_ != ACL_SUCCESS) {                                \
            std::cerr << ascend::FormatError(#call, aclRet_,         \
                                             __FILE__, __LINE__)     \
                      << std::endl;                                  \
            return aclRet_;                                          \
        }                                                            \
    } while (0)
// Check and throw (for constructors and functions that cannot return a code)
#define ACL_CHECK_THROW(call)                                        \
    do {                                                             \
        aclError aclRet_ = (call);                                   \
        if (aclRet_ != ACL_SUCCESS) {                                \
            throw std::runtime_error(                                \
                ascend::FormatError(#call, aclRet_, __FILE__, __LINE__)); \
        }                                                            \
    } while (0)
// Check and continue: log the failure but do not return
#define ACL_CHECK_CONTINUE(call)                                     \
    do {                                                             \
        aclError aclRet_ = (call);                                   \
        if (aclRet_ != ACL_SUCCESS) {                                \
            std::cerr << ascend::FormatError(#call, aclRet_,         \
                                             __FILE__, __LINE__)     \
                      << std::endl;                                  \
        }                                                            \
    } while (0)
#endif  // ERROR_CHECK_H
8.3 RAII memory management class
// device_memory.h
#ifndef DEVICE_MEMORY_H
#define DEVICE_MEMORY_H
#include <cstddef>
#include <stdexcept>
#include "acl/acl.h"
#include "error_check.h"
namespace ascend {
class DeviceMemory {
public:
explicit DeviceMemory(size_t size,
aclrtMemMallocPolicy policy = ACL_MEM_MALLOC_HUGE_FIRST)
: size_(size), policy_(policy) {
ACL_CHECK_THROW(aclrtMalloc(&ptr_, size, policy));
}
~DeviceMemory() {
if (ptr_) {
aclrtFree(ptr_);
}
}
// Non-copyable
DeviceMemory(const DeviceMemory&) = delete;
DeviceMemory& operator=(const DeviceMemory&) = delete;
// Movable
DeviceMemory(DeviceMemory&& other) noexcept
: ptr_(other.ptr_), size_(other.size_), policy_(other.policy_) {
other.ptr_ = nullptr;
other.size_ = 0;
}
DeviceMemory& operator=(DeviceMemory&& other) noexcept {
if (this != &other) {
if (ptr_) aclrtFree(ptr_);
ptr_ = other.ptr_;
size_ = other.size_;
policy_ = other.policy_;
other.ptr_ = nullptr;
other.size_ = 0;
}
return *this;
}
void* Get() const { return ptr_; }
size_t Size() const { return size_; }
// Copy host data to the device
void CopyFromHost(const void* src, size_t size) {
if (size > size_) {
throw std::runtime_error("Copy size exceeds allocated memory");
}
ACL_CHECK_THROW(aclrtMemcpy(ptr_, size_, src, size,
ACL_MEMCPY_HOST_TO_DEVICE));
}
// Copy device data to the host
void CopyToHost(void* dst, size_t size) const {
if (size > size_) {
throw std::runtime_error("Copy size exceeds allocated memory");
}
ACL_CHECK_THROW(aclrtMemcpy(dst, size, ptr_, size,
ACL_MEMCPY_DEVICE_TO_HOST));
}
private:
void* ptr_ = nullptr;
size_t size_ = 0;
aclrtMemMallocPolicy policy_;
};
} // namespace ascend
#endif // DEVICE_MEMORY_H
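8.4 Multi-Stream parallel inference
// multi_stream_inference.cpp
#include <vector>
#include "acl/acl.h"
#include "error_check.h"  // ACL_CHECK from section 8.2
class MultiStreamInference {
public:
    MultiStreamInference(int numStreams, uint32_t modelId)
        : numStreams_(numStreams), modelId_(modelId) {}
    aclError Init() {
        streams_.assign(numStreams_, nullptr);
        for (int i = 0; i < numStreams_; i++) {
            ACL_CHECK(aclrtCreateStream(&streams_[i]));
        }
        return ACL_SUCCESS;
    }
    // inputs and outputs hold device buffers, one entry per Stream
    aclError Run(const std::vector<void*>& inputs,
                 const std::vector<void*>& outputs,
                 const std::vector<size_t>& inputSizes,
                 const std::vector<size_t>& outputSizes) {
        const size_t n = streams_.size();
        if (inputs.size() < n || outputs.size() < n ||
            inputSizes.size() < n || outputSizes.size() < n) {
            return ACL_ERROR_INVALID_PARAM;
        }
        std::vector<aclmdlDataset*> inputDatasets;
        std::vector<aclmdlDataset*> outputDatasets;
        aclError status = ACL_SUCCESS;
        // Submit one asynchronous task per Stream
        for (size_t i = 0; i < n; i++) {
            inputDatasets.push_back(MakeDataset(inputs[i], inputSizes[i]));
            outputDatasets.push_back(MakeDataset(outputs[i], outputSizes[i]));
            aclError ret = aclmdlExecuteAsync(modelId_, inputDatasets[i],
                                              outputDatasets[i], streams_[i]);
            if (ret != ACL_SUCCESS) status = ret;
        }
        // Wait for every Stream to finish
        for (aclrtStream stream : streams_) {
            aclError ret = aclrtSynchronizeStream(stream);
            if (ret != ACL_SUCCESS) status = ret;
        }
        // The datasets must outlive the asynchronous tasks, so free them only now
        for (size_t i = 0; i < n; i++) {
            FreeDataset(inputDatasets[i]);
            FreeDataset(outputDatasets[i]);
        }
        return status;
    }
    ~MultiStreamInference() {
        for (auto stream : streams_) {
            if (stream) aclrtDestroyStream(stream);
        }
    }
private:
    // Wrap one device buffer in a single-buffer dataset
    static aclmdlDataset* MakeDataset(void* data, size_t size) {
        aclmdlDataset* dataset = aclmdlCreateDataset();
        aclmdlAddDatasetBuffer(dataset, aclCreateDataBuffer(data, size));
        return dataset;
    }
    // Destroy the dataset and its buffer descriptors (not the device memory)
    static void FreeDataset(aclmdlDataset* dataset) {
        for (size_t i = 0; i < aclmdlGetDatasetNumBuffers(dataset); i++) {
            aclDestroyDataBuffer(aclmdlGetDatasetBuffer(dataset, i));
        }
        aclmdlDestroyDataset(dataset);
    }
    int numStreams_;
    uint32_t modelId_;
    std::vector<aclrtStream> streams_;
};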
8.5 Performance timing utilities
// perf_timer.h
#ifndef PERF_TIMER_H
#define PERF_TIMER_H
#include <chrono>
#include "acl/acl.h"
namespace ascend {
class PerfTimer {
public:
PerfTimer() : start_(std::chrono::high_resolution_clock::now()) {}
void Reset() {
start_ = std::chrono::high_resolution_clock::now();
}
double ElapsedMs() const {
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration<double, std::milli>(end - start_).count();
}
private:
std::chrono::high_resolution_clock::time_point start_;
};
// NPU event timing (measures device-side time on the Stream, so it is more precise)
class NpuEventTimer {
public:
NpuEventTimer(aclrtStream stream) : stream_(stream) {
aclrtCreateEvent(&startEvent_);
aclrtCreateEvent(&endEvent_);
}
~NpuEventTimer() {
if (startEvent_) aclrtDestroyEvent(startEvent_);
if (endEvent_) aclrtDestroyEvent(endEvent_);
}
void Start() {
aclrtRecordEvent(startEvent_, stream_);
}
void Stop() {
aclrtRecordEvent(endEvent_, stream_);
aclrtSynchronizeStream(stream_);
}
float ElapsedMs() const {
    float ms = 0.0f;
    // The output pointer comes first, followed by the start and end events
    aclrtEventElapsedTime(&ms, startEvent_, endEvent_);
    return ms;
}
private:
aclrtStream stream_;
aclrtEvent startEvent_ = nullptr;
aclrtEvent endEvent_ = nullptr;
};
} // namespace ascend
#endif // PERF_TIMER_H
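Single measurements are noisy, especially the first call, which often pays for lazy initialization. A common way to use a host-side timer is to discard a few warm-up runs and average over several timed runs. The sketch below assumes a hypothetical `MeasureAverageMs` helper built only on `std::chrono`; for device work, the callable passed in should include the synchronization call so that the measured interval covers the whole task.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: run `work` a few times untimed, then report the mean
// wall-clock time per call over `iterations` timed runs.
inline double MeasureAverageMs(const std::function<void()>& work,
                               int warmupRuns, int iterations) {
    assert(work && warmupRuns >= 0 && iterations > 0);
    for (int i = 0; i < warmupRuns; ++i) {
        work();  // warm-up runs are executed but not timed
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> total = end - start;
    return total.count() / iterations;
}
```

`steady_clock` is used because it never jumps backwards, which makes it a safer choice for interval measurements than a clock that can be adjusted while the program runs.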
8.6 Printing model description information
// model_info_printer.cpp
#include <iostream>
#include "acl/acl.h"
void PrintModelInfo(aclmdlDesc* modelDesc) {
std::cout << "=== Model Information ===" << std::endl;
// Input information
size_t inputNum = aclmdlGetNumInputs(modelDesc);
std::cout << "Number of inputs: " << inputNum << std::endl;
for (size_t i = 0; i < inputNum; i++) {
std::cout << "\nInput " << i << ":" << std::endl;
// Name
const char* name = aclmdlGetInputNameByIndex(modelDesc, i);
std::cout << " Name: " << (name ? name : "N/A") << std::endl;
// Data type
aclDataType dataType = aclmdlGetInputDataType(modelDesc, i);
std::cout << " Data type: " << static_cast<int>(dataType) << std::endl;
// Shape
aclmdlIODims dims;
aclmdlGetInputDims(modelDesc, i, &dims);
std::cout << " Shape: [";
for (size_t j = 0; j < dims.dimCount; j++) {
std::cout << dims.dims[j];
if (j < dims.dimCount - 1) std::cout << ", ";
}
std::cout << "]" << std::endl;
// Size in bytes
size_t size = aclmdlGetInputSizeByIndex(modelDesc, i);
std::cout << " Size: " << size << " bytes" << std::endl;
}
// Output information
size_t outputNum = aclmdlGetNumOutputs(modelDesc);
std::cout << "\nNumber of outputs: " << outputNum << std::endl;
for (size_t i = 0; i < outputNum; i++) {
std::cout << "\nOutput " << i << ":" << std::endl;
const char* name = aclmdlGetOutputNameByIndex(modelDesc, i);
std::cout << " Name: " << (name ? name : "N/A") << std::endl;
aclDataType dataType = aclmdlGetOutputDataType(modelDesc, i);
std::cout << " Data type: " << static_cast<int>(dataType) << std::endl;
aclmdlIODims dims;
aclmdlGetOutputDims(modelDesc, i, &dims);
std::cout << " Shape: [";
for (size_t j = 0; j < dims.dimCount; j++) {
std::cout << dims.dims[j];
if (j < dims.dimCount - 1) std::cout << ", ";
}
std::cout << "]" << std::endl;
size_t size = aclmdlGetOutputSizeByIndex(modelDesc, i);
std::cout << " Size: " << size << " bytes" << std::endl;
}
}
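8.7 Batch inference
// batch_inference.cpp
#include <vector>
#include "acl/acl.h"
#include "error_check.h"  // ACL_CHECK from section 8.2
class BatchInference {
public:
    BatchInference(uint32_t modelId, aclrtStream stream, size_t batchSize)
        : modelId_(modelId), stream_(stream), batchSize_(batchSize) {}
    // singleInputs are device buffers of singleInputSize bytes each
    aclError PrepareBatchInput(const std::vector<void*>& singleInputs,
                               size_t singleInputSize) {
        batchInputSize_ = singleInputSize * batchSize_;
        ACL_CHECK(aclrtMalloc(&batchInputMem_, batchInputSize_,
                              ACL_MEM_MALLOC_HUGE_FIRST));
        // Concatenate the individual inputs into one batch buffer
        for (size_t i = 0; i < singleInputs.size() && i < batchSize_; i++) {
            size_t offset = i * singleInputSize;
            ACL_CHECK(aclrtMemcpy(
                static_cast<char*>(batchInputMem_) + offset,
                batchInputSize_ - offset,
                singleInputs[i],
                singleInputSize,
                ACL_MEMCPY_DEVICE_TO_DEVICE));
        }
        return ACL_SUCCESS;
    }
    // Allocate the batch output buffer; call this before Run()
    aclError PrepareBatchOutput(size_t singleOutputSize) {
        batchOutputSize_ = singleOutputSize * batchSize_;
        ACL_CHECK(aclrtMalloc(&batchOutputMem_, batchOutputSize_,
                              ACL_MEM_MALLOC_HUGE_FIRST));
        return ACL_SUCCESS;
    }
    aclError Run() {
        aclDataBuffer* inputBuf = aclCreateDataBuffer(batchInputMem_, batchInputSize_);
        aclDataBuffer* outputBuf = aclCreateDataBuffer(batchOutputMem_, batchOutputSize_);
        aclmdlDataset* inputDataset = aclmdlCreateDataset();
        aclmdlDataset* outputDataset = aclmdlCreateDataset();
        aclmdlAddDatasetBuffer(inputDataset, inputBuf);
        aclmdlAddDatasetBuffer(outputDataset, outputBuf);
        // Synchronous execution: the batch output is ready when this returns
        aclError ret = aclmdlExecute(modelId_, inputDataset, outputDataset);
        // Release the descriptors on both the success and the failure path
        aclDestroyDataBuffer(inputBuf);
        aclDestroyDataBuffer(outputBuf);
        aclmdlDestroyDataset(inputDataset);
        aclmdlDestroyDataset(outputDataset);
        return ret;
    }
    ~BatchInference() {
        if (batchInputMem_) aclrtFree(batchInputMem_);
        if (batchOutputMem_) aclrtFree(batchOutputMem_);
    }
private:
    uint32_t modelId_;
    aclrtStream stream_;  // kept for an aclmdlExecuteAsync variant
    size_t batchSize_;
    size_t batchInputSize_ = 0;
    size_t batchOutputSize_ = 0;
    void* batchInputMem_ = nullptr;
    void* batchOutputMem_ = nullptr;
};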
8.8 Resource leak detection script (Python)
#!/usr/bin/env python3
# resource_leak_detector.py
import re
import subprocess
import time
from collections import defaultdict


class ResourceMonitor:
    def __init__(self, device_ids=None):
        # Avoid a mutable default argument; monitor device 0 unless told otherwise
        self.device_ids = list(device_ids) if device_ids else [0]
        self.history = defaultdict(list)

    def get_memory_usage(self, device_id):
        """Return the memory usage of one device, or None if it cannot be read."""
        cmd = ["npu-smi", "info", "-t", "memory", "-i", str(device_id)]
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            return None  # npu-smi is not installed or not on PATH
        if result.returncode != 0:
            return None
        # Parse the output. The "Memory Usage" field name is an assumption and
        # may differ between npu-smi versions.
        match = re.search(r"Memory Usage\s*:\s*(\d+)\s*/\s*(\d+)\s*MB", result.stdout)
        if not match:
            return None
        used = int(match.group(1))
        total = int(match.group(2))
        usage_pct = used * 100.0 / total if total else 0.0
        return {"used_mb": used, "total_mb": total, "usage_pct": usage_pct}

    def get_context_count(self, device_id):
        """Return the number of Contexts on a device (placeholder)."""
        # Note: this needs root privileges or a dedicated tool.
        # It is left unimplemented here and always returns None.
        return None

    def monitor(self, interval_seconds=60, max_records=1440):
        """Poll resource usage until max_records samples have been taken."""
        print(f"Starting resource monitor (interval: {interval_seconds}s)")
        print("-" * 60)
        record_count = 0
        while record_count < max_records:
            timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for device_id in self.device_ids:
                mem_info = self.get_memory_usage(device_id)
                if not mem_info:
                    continue
                self.history[device_id].append({"timestamp": timestamp, **mem_info})
                # Look for a leak trend once ten samples are available
                if len(self.history[device_id]) >= 10:
                    recent = self.history[device_id][-10:]
                    if self._analyze_trend(recent) == "increasing":
                        print(f"[WARNING] Device {device_id}: Memory usage increasing, "
                              f"possible leak! Current: {mem_info['used_mb']} MB")
                print(f"[{timestamp}] Device {device_id}: "
                      f"{mem_info['used_mb']}/{mem_info['total_mb']} MB "
                      f"({mem_info['usage_pct']:.1f}%)")
            time.sleep(interval_seconds)
            record_count += 1

    def _analyze_trend(self, records):
        """Classify the usage trend by comparing the two halves of the samples."""
        values = [r["used_mb"] for r in records]
        half = len(values) // 2
        if half == 0:
            return "stable"
        # Simple heuristic: is the mean of the later half clearly above the earlier half?
        first_half = sum(values[:half]) / half
        second_half = sum(values[-half:]) / half
        if second_half > first_half * 1.2:  # grew by more than 20%
            return "increasing"
        if second_half < first_half * 0.8:
            return "decreasing"
        return "stable"

    def report(self):
        """Print a summary of the collected samples."""
        print("\n" + "=" * 60)
        print("Resource Monitor Report")
        print("=" * 60)
        for device_id, records in self.history.items():
            if not records:
                continue
            print(f"\nDevice {device_id}:")
            print(f"  Total records: {len(records)}")
            used_values = [r["used_mb"] for r in records]
            print(f"  Min memory: {min(used_values)} MB")
            print(f"  Max memory: {max(used_values)} MB")
            print(f"  Avg memory: {sum(used_values) / len(used_values):.1f} MB")
            print(f"  Trend: {self._analyze_trend(records)}")


if __name__ == "__main__":
    monitor = ResourceMonitor(device_ids=[0, 1, 2, 3])
    try:
        monitor.monitor(interval_seconds=60)
    except KeyboardInterrupt:
        monitor.report()
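9. Summary and recommendations
Starting from the cann-samples code, this article walked through the complete AscendCL call flow: initialization, device management, memory operations, model execution, and resource release. It examined the parameter semantics of the key APIs and the techniques needed to move from sample code to production code.
Two pitfalls received special attention: memory alignment and Context resource leaks, each with solutions and a detection script. Practices such as RAII, error-checking macros, and multi-Stream parallelism help build robust and efficient Ascend applications.
Recommended resources:
- ops-transformer FlashAttention: a high-performance attention operator implementation
- cann-samples official repository: https://atomgit.com/cann/cann-samples
Getting the most out of Ascend NPUs takes continued practice and a solid understanding of the design ideas behind CANN.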