CANN ops-tensor: Slicing and Indexing Optimizations in the Tensor Operator Library

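This article covers optimization strategies for PyTorch on Ascend NPUs, focusing on the performance problems caused by operations on non-contiguous tensors. The ops-tensor repository optimizes three classes of tensor operations: slicing/indexing, shape transformation, and broadcasting/concatenation. For the gather operator it proposes index pre-sorting and batched reads, using UB memory to reduce HBM accesses. It also analyzes the difference between reshape and view and recommends avoiding unnecessary contiguous calls. Finally, it introduces the JIT compilation optimizations of the Blaze backend, which rely on precompilation and caching.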

布丁爱吃鱼ya · Published 2026-05-23 22:21:06

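Introduction

PyTorch's advanced indexing looks simple: x[:, [1, 3, 5]] does it in one line. Underneath it is anything but. Non-contiguous memory access, stride computation, and dimension broadcasting are each a potential performance trap. On Ascend NPUs, ops-tensor reimplements these operations with one core goal: reducing the HBM read amplification caused by non-contiguous memory access.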

The Performance Trap of Non-Contiguous Tensors

First, let's pin down the problem.

PyTorch tensors can be non-contiguous. Typical cases:

import torch
x = torch.randn(1000, 1000)
y = x[:, ::2]      # stride 2 along dim 1, non-contiguous
z = x.t()          # transpose, non-contiguous

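y and z are not stored contiguously in memory. When you run matmul, add, or similar operations on them, the backend has two options:

1. Call contiguous() first to copy the data into contiguous memory (slow, and costs device memory)
2. Use a stride-aware kernel that reads the non-contiguous memory directly (fast, but the kernel is more complex)

ops-tensor takes route 2: it supports non-contiguous memory access directly, with no extra copy.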
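To make "non-contiguous" concrete, here is a minimal pure-Python sketch of the stride bookkeeping behind the two examples above. The helper names `contiguous_strides` and `is_contiguous` are made up for illustration. They mirror the idea behind PyTorch's `Tensor.stride()` and `Tensor.is_contiguous()` but are not its implementation.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) of a densely packed tensor."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True when the strides match the dense row-major layout."""
    expected = contiguous_strides(shape)
    # Dimensions of size 1 never affect addressing, so they are skipped.
    return all(s == e for size, s, e in zip(shape, strides, expected) if size != 1)


# x = torch.randn(1000, 1000)
x_shape, x_strides = (1000, 1000), contiguous_strides((1000, 1000))
# y = x[:, ::2] keeps the row stride and doubles the column stride.
y_shape, y_strides = (1000, 500), (x_strides[0], x_strides[1] * 2)
# z = x.t() swaps the two strides.
z_shape, z_strides = (1000, 1000), (x_strides[1], x_strides[0])

print(is_contiguous(x_shape, x_strides))  # True
print(is_contiguous(y_shape, y_strides))  # False
print(is_contiguous(z_shape, z_strides))  # False
```

A stride-aware kernel receives the shape and strides and computes each address itself. The contiguous() route instead rewrites the data so that the strides become the dense ones again.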

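The ops-tensor Operator List

The tensor operations covered by the ops-tensor repository fall into three categories:

Category	Operators	Sensitive to non-contiguity?
Slicing/indexing	`slice` / `index_select` / `gather` / `take`	✅ Highly sensitive
Shape transformation	`reshape` / `permute` / `transpose` / `squeeze`	⚠️ Partially sensitive
Broadcasting/concatenation	`broadcast_to` / `cat` / `stack` / `chunk`	✅ Highly sensitive

Sensitive to non-contiguity means: if the input tensor is non-contiguous, the operator's HBM access pattern changes from sequential reads to strided reads, and bandwidth utilization drops by 30-50%.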
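The effect of strided reads on bandwidth can be illustrated with a toy model. The sketch below assumes memory is fetched in aligned bursts of a fixed size. The 64-byte burst and the function name `burst_efficiency` are assumptions chosen for illustration, not measurements or parameters of any Ascend device.

```python
def burst_efficiency(element_offsets, elem_bytes=2, burst_bytes=64):
    """Fraction of fetched bytes that the operator actually uses.

    Toy model: every access pulls in the whole aligned burst that contains
    the element, and each burst is fetched only once.
    """
    offsets = set(element_offsets)
    touched_bursts = {(off * elem_bytes) // burst_bytes for off in offsets}
    useful = len(offsets) * elem_bytes
    fetched = len(touched_bursts) * burst_bytes
    return useful / fetched


sequential = range(0, 1024)      # 1024 consecutive float16 values
strided = range(0, 2048, 2)      # every other element, as in x[:, ::2]

print(burst_efficiency(sequential))  # 1.0
print(burst_efficiency(strided))     # 0.5
```

In this model a stride of 2 wastes half of every burst. Real hardware adds caching and prefetching effects, so treat the numbers as a way to see the mechanism rather than as a prediction.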

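The gather Operator: Index Pre-Sorting + Batched Reads

gather is the most common operator with non-contiguous access. Its semantics are:

# Along dim=1, pick elements using index
output[i][j] = input[i][index[i][j]]

The problem: index is arbitrary, so the addresses touched in input are scattered. The cache hit rate is low and HBM read amplification is severe.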
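As a reference for the semantics only, the formula above can be written out for 2-D nested lists. This is a plain-Python sketch, and `gather_dim1` is an illustrative name; it says nothing about how the NPU kernel is organized.

```python
def gather_dim1(inp, index):
    """output[i][j] = inp[i][index[i][j]] for 2-D nested lists."""
    return [[row[j] for j in idx_row] for row, idx_row in zip(inp, index)]


inp = [[10, 11, 12, 13],
       [20, 21, 22, 23]]
index = [[3, 0, 0],
         [1, 2, 1]]

print(gather_dim1(inp, index))  # [[13, 10, 10], [21, 22, 21]]
```

Note that the index may repeat and may jump backwards, which is exactly what makes the memory access pattern irregular.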

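How ops-tensor Optimizes It

1. Index pre-sorting

Sort index so that accesses to input are as close to sequential as possible. For example:

original index = [7, 1, 9, 2, 8]     # scattered access order
sorted         = [1, 2, 7, 8, 9]     # ascending access order

After sorting, read input in batches, then scatter the results back to output in the original order.

2. Sort inside UB, without touching HBM

The sort runs in UB (on-chip memory) and writes nothing to HBM. UB is small (an AI Core's UB is roughly a few hundred KB), so only 1024 indices are sorted at a time. That is enough, because a single HBM burst read can also pull in 1024 float16 values.

3. The Vector core sorts while the Cube core computes something else

The gather sort is element-wise, which suits the Vector core. The Cube core can run another Matrix Multiply at the same time, so the two cores do not compete for resources.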
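The sort-then-scatter idea can be sketched in plain Python. This is a host-side illustration under simple assumptions (row gather, tiles of 1024 indices, as the text describes), and the names `gather_rows_sorted` and `trace` are hypothetical. It only shows that reading in sorted order and writing back through the saved positions gives the same result as a naive gather.

```python
def gather_rows_sorted(rows, index, tile_size=1024, trace=None):
    """Return [rows[i] for i in index], reading each tile in ascending order.

    If trace is a list, every row number is appended to it in the order it
    is read, so the access pattern can be inspected.
    """
    output = [None] * len(index)
    for start in range(0, len(index), tile_size):
        tile = index[start:start + tile_size]
        # Positions inside the tile, ordered by the index value they hold.
        original_order = sorted(range(len(tile)), key=tile.__getitem__)
        for pos in original_order:
            if trace is not None:
                trace.append(tile[pos])
            # Read in ascending address order, write to the original slot.
            output[start + pos] = rows[tile[pos]]
    return output


rows = [[r] * 4 for r in range(10)]
index = [7, 1, 9, 2, 8]
reads = []
result = gather_rows_sorted(rows, index, trace=reads)

print(reads)                                 # [1, 2, 7, 8, 9]
print(result == [rows[i] for i in index])    # True
```

Because the sort is done per tile, the saved order only needs to hold one tile's worth of positions at a time, which is what keeps it small enough for on-chip memory in the scheme described above.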

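Code example (Ascend C-style pseudocode)

__aicore__ void GatherKernel(AscendC::GlobalTensor<float> &input,
                            AscendC::GlobalTensor<int> &index,
                            AscendC::GlobalTensor<float> &output,
                            int batch, int seqLen, int hiddenDim) {
    // Pseudocode: QUEUE_UB, SortIndicesInUB, ComputeGather and ScatterToOutput
    // are placeholder names used to show the flow of one tile.
    const int indexTileSize = 1024;  // number of indices sorted per tile

    // Allocate UB buffers
    AscendC::LocalTensor<float> ubInput = QUEUE_UB.AllocTensor<float>();
    AscendC::LocalTensor<int>   ubIndex = QUEUE_UB.AllocTensor<int>();
    AscendC::LocalTensor<int>   originalOrder = QUEUE_UB.AllocTensor<int>();
    AscendC::LocalTensor<float> ubOutput = QUEUE_UB.AllocTensor<float>();

    // 0. Load one tile of indices from HBM into UB
    DataCopy(ubIndex, index, indexTileSize);

    // 1. Pre-sort the indices inside UB (Vector core);
    //    originalOrder keeps each index's original position
    SortIndicesInUB(ubIndex, originalOrder, indexTileSize);

    for (int i = 0; i < indexTileSize; ++i) {
        // 2. Read input in sorted index order (near-sequential reads, high bandwidth utilization)
        int sortedIdx = ubIndex.GetValue(i);
        DataCopy(ubInput, input[sortedIdx * hiddenDim], hiddenDim);

        // 3. Store the gathered result in ubOutput
        ComputeGather(ubOutput, ubInput, ubIndex, hiddenDim);
    }

    // 4. Scatter back to output in the original order (using the mapping kept in originalOrder)
    ScatterToOutput(output, ubOutput, originalOrder);

    QUEUE_UB.FreeTensor(ubInput);
    QUEUE_UB.FreeTensor(ubIndex);
    QUEUE_UB.FreeTensor(originalOrder);
    QUEUE_UB.FreeTensor(ubOutput);
}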

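Key point: is SortIndicesInUB the bottleneck? No. The index is usually far shorter than the input is large, so the sorting cost is offset by the bandwidth saved through near-sequential reads.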

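reshape vs. view: When a Data Copy Is Triggered

This is the trap beginners fall into most often.

reshape and view behave differently in PyTorch:

Operation	Triggers a copy?	Requirement
`x.reshape(shape)`	Possibly	New shape has the same total number of elements as the old shape
`x.view(shape)`	Never (returns a view)	Strides must be compatible with the new shape (always true for contiguous memory); otherwise it raises an error

On Ascend NPUs, a reshape that triggers a copy reads the whole tensor from HBM once and writes it back once, so an operation that should cost no HBM traffic costs two full passes.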
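Whether a reshape can stay a view is decided purely from shapes and strides. The sketch below is a simplified pure-Python version of that compatibility check, written for illustration. The function name `view_strides` is made up, it is not PyTorch's or ops-tensor's code, and it does not handle zero-element tensors.

```python
from math import prod


def view_strides(shape, strides, new_shape):
    """Return strides for viewing the tensor as new_shape, or None if a copy is needed.

    Strides are in elements. Zero-element tensors are not handled in this sketch.
    """
    if prod(shape) != prod(new_shape):
        raise ValueError("total number of elements must match")
    if not shape:  # 0-d tensor: every dim of new_shape has size 1
        return [1] * len(new_shape)
    new_strides = [0] * len(new_shape)
    view_d = len(new_shape) - 1
    base = strides[-1]   # innermost stride of the current chunk of old dims
    chunk_numel = 1      # elements in the current chunk of old dims
    view_numel = 1       # elements covered so far by new dims
    for d in range(len(shape) - 1, -1, -1):
        chunk_numel *= shape[d]
        # A chunk ends at dim 0, or where the next outer dim is not densely
        # stacked on top of the dims collected so far.
        ends_here = d == 0 or (shape[d - 1] != 1 and strides[d - 1] != chunk_numel * base)
        if not ends_here:
            continue
        while view_d >= 0 and (view_numel < chunk_numel or new_shape[view_d] == 1):
            new_strides[view_d] = view_numel * base
            view_numel *= new_shape[view_d]
            view_d -= 1
        if view_numel != chunk_numel:
            return None  # a new dim would straddle two chunks
        if d > 0:
            base = strides[d - 1]
            chunk_numel = 1
            view_numel = 1
    return new_strides if view_d == -1 else None


print(view_strides((4, 6), (6, 1), (24,)))      # contiguous: [1]
print(view_strides((6, 4), (1, 6), (24,)))      # transposed: None
print(view_strides((3, 4), (8, 1), (3, 2, 2)))  # column slice: [8, 2, 1]
```

A `None` result corresponds to the case where `view` raises an error and `reshape` falls back to a copy.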

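How ops-tensor Handles It

ops-tensor's reshape implementation:

1. Check contiguity first: if the tensor is contiguous, only the shape in TensorDesc changes and the data is not touched
2. If it is not contiguous, force contiguous: call the contiguous kernel to make a copy

The contiguous kernel is simple: it walks the elements in logical order, reads each one from its old memory address, and writes it to the new address. This kernel runs purely on the Vector core (element-wise copy); the Cube core cannot help.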
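The two-branch logic above can be sketched in plain Python. The `TensorDesc` class here is a hypothetical stand-in with only a flat storage list, a shape, and strides (no storage offset). It is not the real ops-tensor TensorDesc, and the element-wise loop stands in for the Vector-core copy.

```python
from dataclasses import dataclass
from itertools import product
from math import prod


@dataclass
class TensorDesc:
    data: list      # flat storage, shared between views
    shape: tuple
    strides: tuple  # in elements


def dense_strides(shape):
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def contiguous(t):
    """Copy every element into a new dense buffer, in logical row-major order."""
    out = []
    for idx in product(*(range(n) for n in t.shape)):
        out.append(t.data[sum(i * s for i, s in zip(idx, t.strides))])
    return TensorDesc(out, t.shape, dense_strides(t.shape))


def reshape(t, new_shape):
    new_shape = tuple(new_shape)
    if prod(new_shape) != prod(t.shape):
        raise ValueError("total number of elements must match")
    if t.strides != dense_strides(t.shape):
        t = contiguous(t)  # copy path: one full read and one full write
    # Metadata-only path: same storage, new shape and strides.
    return TensorDesc(t.data, new_shape, dense_strides(new_shape))


base = TensorDesc(list(range(6)), (2, 3), (3, 1))
transposed = TensorDesc(base.data, (3, 2), (1, 3))

print(reshape(base, (3, 2)).data is base.data)  # True: no copy
print(reshape(transposed, (6,)).data)           # [0, 3, 1, 4, 2, 5]: copied
```

The first call only rewrites metadata, while the second has to materialize a new buffer, which is the cost the section warns about.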

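Performance Advice

If you know the next step is a Matrix Multiply (a Cube core operation), calling contiguous after reshape is a net loss: Matrix Multiply itself supports stride-aware access and does not need contiguous memory.

Compare the two:

# Loss: contiguous first, then matmul (one extra HBM read and write)
x = x.reshape(-1, hiddenDim).contiguous()
y = torch.matmul(x, weight)   # Cube core

# Win: matmul directly; the Cube core handles the non-contiguous layout itself (it takes stride parameters)
y = torch.matmul(x.reshape(-1, hiddenDim), weight)  # no explicit contiguous(); reshape stays a view when the strides allow it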

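The Blaze Backend: JIT Compilation Optimization

Underneath ops-tensor sits the Blaze backend, a JIT (Just-In-Time) compilation engine optimized specifically for dynamic shapes.

The Problem with Dynamic Shapes

During inference, seq_len varies (user queries can be long or short). A statically compiled kernel only runs one fixed seq_len, so a different length means recompiling. One compile takes several seconds, which an inference service cannot absorb.

What Blaze does:

1. Capture the pattern of the dynamic shapes, for example seq_len jumping between 128/256/512/1024
2. Compile kernels for those lengths ahead of time and cache them
3. At runtime, fetch the cached kernel directly instead of recompiling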
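Step 1 can be pictured as simple bookkeeping over the shapes a service has seen. The sketch below is a hypothetical illustration of that idea in plain Python. The class name `ShapeProfile` and its methods are made up and are not part of Blaze.

```python
from collections import Counter


class ShapeProfile:
    """Counts observed sequence lengths and reports the most frequent ones."""

    def __init__(self):
        self.seen = Counter()

    def record(self, seq_len):
        self.seen[seq_len] += 1

    def hot_shapes(self, top_k=4):
        # Candidates worth compiling ahead of time.
        return [seq_len for seq_len, _ in self.seen.most_common(top_k)]


profile = ShapeProfile()
for seq_len in [128, 256, 128, 512, 128, 256, 1024, 777]:
    profile.record(seq_len)

print(profile.hot_shapes(top_k=2))  # [128, 256]
```

The lengths returned by `hot_shapes` would be the ones handed to an ahead-of-time compile step, while rare lengths are left to on-demand compilation.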

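Blaze's Compilation Cache Management

The cache lives in host memory and takes no Device memory. The cache key is (op_type, shape_signature, dtype).

For example, the cache key for the gather operator:

("gather", "batch=32,seqLen=512,hiddenDim=4096", "float16")

When Blaze meets a shape it has not seen before, it compiles on the spot (a few seconds) and puts the result into the cache.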
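A cache with this key structure is easy to sketch. The code below is a minimal host-side illustration. `KernelCache`, `warm_up`, and `fake_compile` are hypothetical names, and the "kernel" is just a string standing in for a compiled binary; none of this is Blaze's actual interface.

```python
class KernelCache:
    """Maps (op_type, shape_signature, dtype) to a compiled kernel."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._kernels = {}
        self.misses = 0

    def get(self, op_type, shape_signature, dtype):
        key = (op_type, shape_signature, dtype)
        if key not in self._kernels:
            # Unseen shape: compile now and remember the result.
            self.misses += 1
            self._kernels[key] = self._compile(op_type, shape_signature, dtype)
        return self._kernels[key]

    def warm_up(self, op_type, shape_signatures, dtype):
        # Ahead-of-time compilation for shapes expected at runtime.
        for signature in shape_signatures:
            self.get(op_type, signature, dtype)


def fake_compile(op_type, shape_signature, dtype):
    # Stand-in for the expensive compile step.
    return f"kernel<{op_type}|{shape_signature}|{dtype}>"


cache = KernelCache(fake_compile)
signatures = [f"batch=32,seqLen={n},hiddenDim=4096" for n in (128, 256, 512, 1024)]
cache.warm_up("gather", signatures, "float16")
misses_after_warm_up = cache.misses

cache.get("gather", "batch=32,seqLen=512,hiddenDim=4096", "float16")  # hit
cache.get("gather", "batch=32,seqLen=300,hiddenDim=4096", "float16")  # miss
print(misses_after_warm_up, cache.misses)  # 4 5
```

Keeping the dictionary on the host matches the description above: only the lookup table lives in host memory, and a miss pays the compile cost exactly once per key.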

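torch_npu Usage Example

import torch
import torch_npu

# 1. gather example
input_tensor = torch.randn(32, 512, 4096).npu()
index = torch.tensor([0, 10, 50, 100, 200]).npu()
# Expand the 1-D index to (32, 5, 4096) so every batch and hidden position picks the same 5 rows
output = torch.gather(input_tensor, dim=1, index=index.unsqueeze(0).unsqueeze(-1).expand(32, 5, 4096))
# Underneath, this goes through ops-tensor's optimized gather kernel (index pre-sorting + batched reads)

# 2. reshape vs contiguous (performance comparison)
weight = torch.randn(4096, 4096).npu()
x = torch.randn(32, 4, 128, 8192).npu()[..., :4096]   # non-contiguous (a column slice of a wider tensor)

# Option 1: contiguous first (loss)
x_contig = x.contiguous()                     # <- triggers an HBM copy
y1 = torch.matmul(x_contig.reshape(-1, 4096), weight)  # Cube core

# Option 2: reshape directly (win)
y2 = torch.matmul(x.reshape(-1, 4096), weight)  # reshape is a view here; the Cube core handles the strides

# y1 and y2 are numerically identical, but y1 pays for one extra HBM copy

# 3. Dynamic shapes (Blaze JIT)
for seq_len in [128, 256, 512, 1024]:
    x = torch.randn(1, seq_len, 4096).npu()
    idx = torch.arange(seq_len // 2).unsqueeze(0).unsqueeze(-1).expand(1, seq_len // 2, 4096).npu()
    y = torch.gather(x, dim=1, index=idx)
    # First time a given seq_len is seen (and it was not precompiled): Blaze compiles on the spot (~2 s)
    # Every later call with the same seq_len: Blaze fetches the cached kernel (~0 s)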

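An Easily Overlooked Detail

The difference between index_select and gather:

Operator	Index semantics	Typical use
`gather`	One index per output element (the index tensor has the same rank as the input); arbitrary, may repeat	Sparse sampling, imbalanced sampling
`index_select`	1-D index; each entry selects a whole slice along one dimension	Picking a few heads, picking a few layers

index_select has a more regular access pattern than gather: every index selects a whole slice, so the kernel can deduplicate the indices and prefetch contiguous ranges. ops-tensor has a dedicated optimized path for index_select, which is 20-30% faster than gather.

If you are selecting whole slices with a 1-D index, use index_select rather than gather.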
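The semantic difference between the two operators can be shown on nested lists. This is a plain-Python sketch along dim 0 with illustrative function names (`index_select_rows`, `gather_rows`). It mirrors the documented PyTorch semantics but is not how either kernel is implemented.

```python
def index_select_rows(rows, index):
    """index is 1-D; each entry picks a whole row."""
    return [list(rows[i]) for i in index]


def gather_rows(rows, index):
    """index is 2-D; output[i][j] = rows[index[i][j]][j]."""
    return [[rows[r][j] for j, r in enumerate(idx_row)] for idx_row in index]


rows = [[0, 1, 2],
        [10, 11, 12],
        [20, 21, 22]]

picked = index_select_rows(rows, [2, 0])
# The same result through gather needs the index repeated for every column.
expanded = [[i] * len(rows[0]) for i in [2, 0]]

print(picked)                                   # [[20, 21, 22], [0, 1, 2]]
print(gather_rows(rows, expanded) == picked)    # True
print(gather_rows(rows, [[0, 1, 2], [2, 2, 0]]))  # [[0, 11, 22], [20, 21, 2]]
```

The expanded index carries one entry per output element, which is redundant when whole rows are wanted. That redundancy is why the 1-D form is the better fit for slice selection.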

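Debugging Tool: Operator-Level Timeline

ops-tensor integrates the opbase Profiler, which can emit a timeline for every tensor operation:

import torch
import torch_npu

# input_tensor and index are the tensors from the gather example above
with torch_npu.profiler.profile(
    activities=[torch_npu.profiler.ProfilerActivity.NPU],
    record_shapes=True
) as prof:
    output = torch.gather(input_tensor, dim=1, index=index.unsqueeze(0).unsqueeze(-1).expand(32, 5, 4096))
prof.export_chrome_trace("gather_trace.json")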