手把手上手 CANN 算子模板：catlass 让你少写90%的算子代码

本文介绍了如何利用昇腾NPU的算子模板库catlass快速实现自定义稀疏矩阵乘法算子。作者通过实际案例展示了从环境准备到算子开发的完整流程，重点解析了catlass的核心组件：计算核(ComputeUnit)、分块策略(Tile)和流水线编排(Pipeline)。文章详细说明了如何扩展标准GEMM算子，加入mask处理逻辑实现稀疏计算，并提供了关键的性能优化建议，如mask数据的预取策略。相比传统

雨季666

299人浏览 · 2026-05-20 21:35:17

雨季666 · 2026-05-20 21:35:17 发布

上次接了个新模型，发现里面有个自定义算子要移植到昇腾NPU 上——一个带稀疏模式的矩阵乘，比标准 GEMM 多一个 mask 输入，框架里没有现成算子。翻了翻 CANN 生态，找到 catlass 仓库，试了一周之后发现：之前手写一个算子要三天，用 catlass 一天就收工了。

这篇就是踩坑实录，步骤全部可复现。

catlass 是什么

catlass 是昇腾算子模板库，定位是给 ops-nn / ops-math / ops-blas 这几个算子仓提供底层模板。说白了，catlass 不直接暴露给应用开发者，它是一层"母版"——其他算子仓基于 catlass 的模板生成各自的高性能算子实现。

对终端用户来说，你其实已经在用 catlass 了：ops-nn 里的 MatMul、ops-blas 里的 GEMM，这些背后全是 catlass 模板生成的。

环境准备

# 1. 确认 CANN 已安装（需要 8.0+）
cat /usr/local/Ascend/ascend-toolkit/version.info
# 确认 CANN_VERSION >= 8.0.0

# 2. 克隆 catlass 仓库
git clone https://atomgit.com/cann/catlass
cd catlass

# 3. 查看目录结构，找到模板目录
ls catlass/
# core/  templates/  examples/  scripts/  README.md

# 4. 看一个现成例子，从简单入手
ls examples/
# gemm/  matmul/  attention/  layernorm/  ...

⚠️ 踩坑预警：catlass 依赖 opbase（算子基础组件），克隆前先确认 opbase 在同级目录，或者直接：

git clone https://atomgit.com/cann/opbase
git clone https://atomgit.com/cann/catlass
# 两个目录平级放，catlass 的 CMakeLists 会自动找 opbase

核心概念：模板三剑客

catlass 模板体系里，搞清楚三个东西就够了：

1. Core 计算核（ComputeUnit）

这是最底层的计算逻辑定义。一个 Core 描述的是：给定一组输入数据块，在昇腾 Cube/Vector 单元上做一次完整计算。

// catlass/core/gemm/compute_unit.h
// 简化版，计算核的声明

template <typename T, LAYOUT Layout>
class GemmComputeUnit {
    // T: 数据类型，half/float/int8
    // Layout: 行主序/列主序
    
    // 输入：左矩阵块、右矩阵块、输出块
    // bias 在这里可选加
    __aicore____inline__ void Process(
        LocalTensor<half>& tensorC,
        const LocalTensor<half>& tensorA,
        const LocalTensor<half>& tensorB,
        const LocalTensor<half>& tensorBias,  // 可选
        const GemmParam& param
    ) {
        // 底层调用昇腾 Cube 矩阵乘指令
        // UB -> GM 写入在调用侧处理，这里只管算
        Mmad(tensorC, tensorA, tensorB, param);
    }
};

2. Tile 策略（分块大小）

昇腾NPU SRAM 容量有限（几十 MB），一个大矩阵塞不进去。Tile 策略定义的是：怎么把大矩阵切成小块，每块分别送入 ComputeUnit 计算。

// catlass/core/gemm/tile_strategy.h
// tile 策略定义了 Br、Bc、-inner 块大小

struct GemmTileStrategy {
    static constexpr uint32_t Br = 64;   // A 的行块大小
    static constexpr uint32_t Bc = 64;   // B 的列块大小
    static constexpr uint32_t Kr = 16;   // 累加寄存器复用大小
    
    // 对于 Atlas A2，BR=64/Bc=64 在 L1 缓存下效率最高
    // 这个参数是经验值，来自昇腾官方 benchmark
};

⚠️ 踩坑预警：Atlas 910 和 Atlas A2 的 L1 缓存大小不一样，用同一个 tile 策略性能可能差 20%。catlass 的 CMakeLists 里按硬件自动选 tile 策略，一般不用改，但如果你追求极致性能，这里就是主战场。

3. 流水线编排（Pipeline）

计算和数据搬运要并行。catlass 用 Tiling + DoubleBuffer 编排流水线，把数据从 HBM 搬到 SRAM 的过程和 Cube 计算过程重叠起来。

// catlass/core/gemm/pipeline.h
// 流水线编排的核心逻辑

template <typename ComputeUnit, typename TileStrategy>
void GemmPipeline<ComputeUnit, TileStrategy>::Execute() {
    // 双缓冲：ping-pong 两个 buffer
    // 等 buffer0 在算的时候，下一块数据已经往 buffer1 搬了
    // 等 buffer1 算完，buffer0 的下一块又到位了
    // 计算和搬运时间完全重叠，HBM 带宽不再是瓶颈
    
    for (uint32_t m = 0; m < M; m += Br) {
        for (uint32_t n = 0; n < N; n += Bc) {
            // 计算当前块 C[m:m+Br, n:n+Bc]
            compute_unit.Process(C_tile, A_tile, B_tile, bias_tile, param);
            
            // 写回 HBM，这里才是真正的 HBM 写操作
            DataCopyDesc desc = {Br, Bc};
            GMtensorC.template CopyFrom(C_tile, desc);
        }
    }
}

手写一个自定义算子

光看不练假把式。接下来基于 catlass 模板，实际生成一个带 mask 的稀疏 GEMM 算子。

第一步：声明算子参数

// my_masked_gemm/op_impl.h
// 定义稀疏 GEMM 的参数结构

#include "catlass/core/gemm/compute_unit.h"

struct MaskedGemmParam {
    // 标准 GEMM 参数
    uint32_t M, N, K;         // 矩阵维度
    uint32_t lda, ldb, ldc;   // 跨距
    half alpha, beta;         // 缩放因子
    
    // 稀疏 GEMM 专用参数
    uint32_t* mask_ptr;       // 指向 mask 的 HBM 地址
    uint32_t mask_row;        // mask 矩阵的行数
    uint32_t mask_col;        // mask 矩阵的列数
};

第二步：扩展计算核，加入 mask 逻辑

// my_masked_gemm/masked_compute_unit.h
// 在标准 GEMM 计算核基础上，加入 mask 处理

template <typename T>
class MaskedGemmComputeUnit : public catlass::GemmComputeUnit<T, ROW_COL> {
    using Base = catlass::GemmComputeUnit<T, ROW_COL>;
    
    __aicore____inline__ void Process(
        LocalTensor<T>& C,
        const LocalTensor<T>& A,
        const LocalTensor<T>& B,
        const LocalTensor<T>& Bias,
        const LocalTensor<uint32_t>& mask,  // mask 张量
        const MaskedGemmParam& param
    ) {
        // 先做标准 GEMM 计算
        Base::Process(C, A, B, Bias, param);
        
        // 再做 mask 乘：把 mask=0 的位置清零
        // 这是稀疏 GEMM 的关键：被 mask 遮住的输出直接置零
        // 用 Vector 单元做 element-wise 乘法，开销很小
        for (uint32_t i = 0; i < Br; ++i) {
            for (uint32_t j = 0; j < Bc; ++j) {
                uint32_t mask_val = mask.GetValue(i, j);
                if (mask_val == 0) {
                    C.SetValue(i * ldc + j, T(0.0f));
                }
            }
        }
    }
};

⚠️ 踩坑预警：mask 张量怎么进 SRAM 是关键。如果 mask 在 HBM 里，每次 mask 判断都要访问 HBM，那就把双缓冲省出来的带宽又吃回去了。正确做法是把 mask 预取到 L1，或者用更聪明的 bitmask 压缩表示——catlass 的 examples 里有一个 mask_attention 参考实现，可以直接抄。

第三步：注册到 AscendCL

// my_masked_gemm/op_register.cpp
// 把算子注册到 CANN，算子名和 PyTorch 适配层对应

#include "acl/acl.h"
#include "operator.h"

class MaskedGemmOp : public Op {
    bool InferShape() override {
        // 输出 shape = [M, N]
        this->outputDesc(0).dims = {param.M, param.N};
        return true;
    }
    
    bool Execute() override {
        // 核心逻辑：加载数据 → 调用流水线 → 写回
        MaskedGemmComputeUnit<half> compute_unit;
        GemmTileStrategy tile_strategy;
        GemmPipeline pipeline(compute_unit, tile_strategy);
        pipeline.Execute();
    }
};

// 注册算子
REGISTER_OP("MaskedGemm")
    .Input(0, "A", DTYPE_HALF)
    .Input(1, "B", DTYPE_HALF)
    .Input(2, "mask", DTYPE_UINT32)
    .Output(0, "C", DTYPE_HALF)
    .Attr("M", VALUE_TYPE_UINT32)
    .Attr("N", VALUE_TYPE_UINT32)
    .Attr("K", VALUE_TYPE_UINT32);

性能对比

实现方式	开发时间	性能（512×512 GEMM）	可维护性
纯手写 Ascend C	~3天	100%	低，代码难读
基于 catlass 模板	~1天	~97%（自动优化）	高，模板清晰
纯手写 TBE（过时）	~5天	约95%	极低，已不推荐