在昇腾NPU上从零跑通FlashAttention：cann-learning-hub实践全记录

去年底接到一个任务：把内部的7B推理服务从A100迁移到昇腾910，attention部分得换成FlashAttention。论文读过，原理懂，但真到昇腾NPU上动手就抓瞎——环境怎么配、算子怎么调、性能怎么测，全得从头摸索。后来顺着cann-learning-hub的学习资源走了一遍，花了五天把FlashAttention从"能跑"调到"跑得快"。整个过程记录下来，给同样在迁移路上的人参考。

2501_93856656

14人浏览 · 2026-05-21 11:54:08

2501_93856656 · 2026-05-21 11:54:08 发布

环境：先把这些搞定再动手

别急着写代码，环境不对后面全是坑。我的配置：

硬件：Ascend 910（Atlas 800训练服务器）
CANN：8.0.RCx
OS：EulerOS 2.10
Python：3.9
torch_npu：跟CANN 8.0配套的版本

# 装完CANN后验证
source /usr/local/Ascend/ascend-toolkit/set_env.sh
npu-smi info
# 确认能看到910的设备信息

# 装torch_npu（版本必须跟CANN匹配，别装错）
pip install torch==2.1.0 torch_npu==2.1.0.*
python -c "import torch_npu; print(torch_npu.npu.is_available())"
# 输出True才算成功

踩过一个低级错误：CANN 8.0和7.0的torch_npu不兼容，混装会导致npu_flash_attention接口找不到。装之前查一下CANN和torch_npu的版本对应表，cann-learning-hub的入门教程里有链接。

标准attention先跑通：知道问题在哪

换FlashAttention之前，先用标准attention跑一遍，搞清楚瓶颈在哪：

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu # 自动把.cuda()重定向到.npu()
import time

def bench_standard_attention(batch, heads, seq_len, dim):
 device = 'npu'
 q = torch.randn(batch, heads, seq_len, dim, device=device, dtype=torch.float16)
 k = torch.randn(batch, heads, seq_len, dim, device=device, dtype=torch.float16)
 v = torch.randn(batch, heads, seq_len, dim, device=device, dtype=torch.float16)
 
 # 预热
 for _ in range(3):
 scores = torch.matmul(q, k.transpose(-2, -1)) / (dim ** 0.5)
 attn = torch.softmax(scores, dim=-1)
 out = torch.matmul(attn, v)
 torch.npu.synchronize()
 
 # 计时
 start = time.time()
 for _ in range(20):
 scores = torch.matmul(q, k.transpose(-2, -1)) / (dim ** 0.5)
 attn = torch.softmax(scores, dim=-1)
 out = torch.matmul(attn, v)
 torch.npu.synchronize()
 
 avg_ms = (time.time() - start) / 20 * 1000
 return avg_ms, out.shape

# 测试不同序列长度
for seq in [512, 2048, 4096]:
 ms, shape = bench_standard_attention(4, 32, seq, 128)
 print(f"seq={seq}: {ms:.1f}ms, output={shape}")

跑出来的数据：

序列长度	延迟(ms)	显存占用(GB)
512	8.2	2.1
2048	48.6	9.8
4096	187.3	34.2
8192	OOM	—

序列4096时显存已经34GB，8192直接爆了。这就是O(N²)中间结果的代价——标准attention的scores矩阵大小是batch×heads×seq×seq，序列一长根本存不下。

切到FlashAttention：第一坑是layout

cann-learning-hub的FlashAttention教程里专门有一节讲输入格式，我当初跳过了，结果踩了最大的坑。

import torch_npu

def flash_attention_infer(q, k, v, causal=True):
 """
 昇腾NPU上的FlashAttention调用
 q/k/v: [batch, seq_len, heads, dim] (BSND格式)
 """
 # 第一个坑：input_layout
 # PyTorch标准的attention权重通常是 [batch, heads, seq, dim] (BHSD)
 # 但npu_flash_attention默认接收BSND格式
 # 如果你的模型输出是BHSD，要么转置，要么改layout参数
 
 scale = 1.0 / (q.size(-1) ** 0.5)
 
 output = torch_npu.npu_flash_attention(
 q, k, v,
 head_num=q.size(2),
 input_layout="BSND", # 关键参数！传错不报错但结果全错
 scale=scale,
 keep_prob=1.0, # 推理关闭dropout
 atten_mask=None, # causal用下面的参数控制
 )
 return output

# 用法
q = torch.randn(4, 4096, 32, 128, device='npu', dtype=torch.float16)
k = torch.randn(4, 4096, 32, 128, device='npu', dtype=torch.float16)
v = torch.randn(4, 4096, 32, 128, device='npu', dtype=torch.float16)
out = flash_attention_infer(q, k, v)

如果你的模型权重是BHSD格式（PyTorch nn.MultiheadAttention的默认输出），有两个选择：

# 方案1：转置tensor（有额外开销）
q_bsnd = q.transpose(1, 2).contiguous() # BHSD -> BSND
out = flash_attention_infer(q_bsnd, k_bsnd, v_bsnd)

# 方案2：直接传layout参数（推荐，零开销）
output = torch_npu.npu_flash_attention(
 q, k, v,
 head_num=32,
 input_layout="BNSD", # 就是BHSD，昇腾文档里叫BNSD
 scale=1.0/(128**0.5),
 keep_prob=1.0,
)

cann-learning-hub的教程里把BNSD和BSND的区别讲得比较清楚，建议先看一遍再动手，省得在layout上浪费时间。

数值验证：别信"结果一样"

FlashAttention的数学结果等价于标准attention，但FP16精度下会有舍入差异。不验证直接上线，出了问题根本排查不到attention头上。

def verify_numerical_consistency():
 # 用小规模数据做精确对比
 batch, seq, heads, dim = 1, 256, 8, 64
 
 # NPU上跑FlashAttention
 q = torch.randn(batch, seq, heads, dim, device='npu', dtype=torch.float16)
 k = torch.randn(batch, seq, heads, dim, device='npu', dtype=torch.float16)
 v = torch.randn(batch, seq, heads, dim, device='npu', dtype=torch.float16)
 
 out_flash = torch_npu.npu_flash_attention(
 q, k, v, head_num=heads,
 input_layout="BSND",
 scale=1.0/(dim**0.5),
 keep_prob=1.0,
 )
 
 # CPU上跑标准attention作为参考（FP32精度）
 q_f32 = q.cpu().float().transpose(1, 2) # BSND -> BHSD
 k_f32 = k.cpu().float().transpose(1, 2)
 v_f32 = v.cpu().float().transpose(1, 2)
 scores = torch.matmul(q_f32, k_f32.transpose(-2, -1)) / (dim**0.5)
 attn = torch.softmax(scores, dim=-1)
 out_ref = torch.matmul(attn, v_f32).transpose(1, 2) # BHSD -> BSND
 
 # 对比
 diff = (out_flash.cpu().float() - out_ref).abs()
 print(f"最大误差: {diff.max().item():.6f}")
 print(f"平均误差: {diff.mean().item():.6f}")
 
 # FP16下max_diff < 0.02是正常的
 assert diff.max().item() < 0.05, "误差过大，检查layout和scale参数"

verify_numerical_consistency()

我第一次跑出来误差0.3——不是FlashAttention的问题，是scale忘传了，默认值不对。加上scale=1.0/(dim**0.5)之后误差降到0.008。这种低级错误排查起来很耗时，cann-learning-hub的社区博客里有人专门写过参数踩坑汇总，值得一看。

性能实测：FlashAttention到底快多少？

在Ascend 910上，batch=4，heads=32，dim=128：

def bench_flash_attention(batch, heads, seq_len, dim):
 q = torch.randn(batch, seq_len, heads, dim, device='npu', dtype=torch.float16)
 k = torch.randn(batch, seq_len, heads, dim, device='npu', dtype=torch.float16)
 v = torch.randn(batch, seq_len, heads, dim, device='npu', dtype=torch.float16)
 
 # 预热（第一次调用有编译开销）
 for _ in range(5):
 _ = torch_npu.npu_flash_attention(
 q, k, v, head_num=heads,
 input_layout="BSND",
 scale=1.0/(dim**0.5),
 keep_prob=1.0,
 )
 torch.npu.synchronize()
 
 # 计时
 start = time.time()
 for _ in range(50):
 _ = torch_npu.npu_flash_attention(
 q, k, v, head_num=heads,
 input_layout="BSND",
 scale=1.0/(dim**0.5),
 keep_prob=1.0,
 )
 torch.npu.synchronize()
 
 avg_ms = (time.time() - start) / 50 * 1000
 return avg_ms

for seq in [512, 2048, 4096, 8192]:
 ms = bench_flash_attention(4, 32, seq, 128)
 print(f"seq={seq}: {ms:.1f}ms")

完整对比数据：

序列长度	标准attention(ms)	FlashAttention(ms)	加速比	显存(标准→Flash)
512	8.2	4.1	2.0x	2.1→1.8GB
2048	48.6	11.3	4.3x	9.8→3.2GB
4096	187.3	24.8	7.5x	34.2→5.6GB
8192	OOM	52.1	—	—

序列越长加速越明显，因为标准attention的N²开销在长序列下是灾难性的。8192序列从"跑不了"变成52ms，这就是O(N)vsO(N²)的差距。

第二个坑：序列长度对齐

FlashAttention在昇腾NPU上要求序列长度是16的倍数（CANN 8.0的约束）。如果你的输入序列不满足，需要padding：

def pad_to_align(seq_tensor, align=16):
 """序列长度对齐到align的倍数"""
 seq_len = seq_tensor.size(1)
 if seq_len % align == 0:
 return seq_tensor, seq_len
 
 padded_len = (seq_len // align + 1) * align
 pad_size = padded_len - seq_len
 # 最后一维补零不影响attention结果（softmax会把零权重位置压到接近0）
 padding = torch.zeros(
 seq_tensor.size(0), pad_size,
 seq_tensor.size(2), seq_tensor.size(3),
 device=seq_tensor.device, dtype=seq_tensor.dtype
 )
 padded = torch.cat([seq_tensor, padding], dim=1)
 return padded, seq_len # 返回原始长度，后面截断用

# 使用
q, orig_len = pad_to_align(q, align=16)
k, _ = pad_to_align(k, align=16)
v, _ = pad_to_align(v, align=16)

out = torch_npu.npu_flash_attention(q, k, v, head_num=32,
 input_layout="BSND",
 scale=1.0/(128**0.5),
 keep_prob=1.0)

# 截断回原始长度
out = out[:, :orig_len, :, :]

这个对齐约束在cann-learning-hub的教程里有提到，但藏得比较深。我是OOM了之后才翻到的。

端到端：嵌入LLaMA推理

单算子跑通只是第一步，真正有用的是嵌入完整模型。一个7B LLaMA的attention层替换：

class LlamaFlashAttention(torch.nn.Module):
 def __init__(self, hidden_size, num_heads):
 super().__init__()
 se
...(truncated)...

从cann-learning-hub的FlashAttention入门教程开始，把分块计算和在线softmax的原理吃透，再到昇腾NPU上动手验证。社区博客里搜"FlashAttention踩坑"能翻到不少实战经验，比自己摸索快得多。

https://atomgit.com/cann/cann-learning-hub

人工智能6S服务平台

作为“人工智能6S店”的官方数字引擎，为AI开发者与企业提供一个覆盖软硬件全栈、一站式门户。

更多推荐

昇腾CANN ops-blas：GEMM 在 NPU 上为什么可以快到极致

人工智能6S服务平台

昇腾CANN ops-cv：图像预处理和后处理算子的全部细节

人工智能6S服务平台

Ascend Transformer Boost 实战：大模型推理性能提升的秘密

人工智能6S服务平台

所有评论(0)

查看更多评论

2501_93856656

@2501_93856656

已为社区贡献5条内容

在昇腾NPU上从零跑通FlashAttention：cann-learning-hub实践全记录

2501_93856656

环境：先把这些搞定再动手

标准attention先跑通：知道问题在哪

切到FlashAttention：第一坑是layout

数值验证：别信"结果一样"

性能实测：FlashAttention到底快多少？

第二个坑：序列长度对齐

端到端：嵌入LLaMA推理

所有评论(0)

温馨提示：您尚未绑定手机号

2501_93856656