CANN TensorFlow Adapter: What Happens Under the Hood When tf.matmul Runs on an Ascend NPU

2501_94551709

Published 2026-05-23 16:48:53

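Late last year I helped an internet company migrate its TensorFlow workloads. Their recommendation system made heavy use of tf.sparse.split() and tf.strings operations. I assumed it would be like PyTorch, where changing .cuda() to .npu() is all it takes. On the first run, a sparse.split op failed outright, and the tf.strings preprocessing layers were not supported at all. It took two days to work out why: the TensorFlow adapter's op coverage follows different logic from the PyTorch adapter's. It maps ops at the graph level instead of rewriting ops in eager mode.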

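The CANN TensorFlow adapter for Ascend is the bridge between the TensorFlow frontend and the CANN backend. It makes tf.device("NPU") and tf.config.list_physical_devices("NPU") work, so TensorFlow code can use an Ascend NPU transparently. Unlike the PyTorch adapter, though, it has to handle static computation graphs, XLA compilation, and TensorFlow's Graph mode, which makes it more complex.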

How tf.device("NPU") works under the hood

When you write with tf.device("NPU:0") in TensorFlow code, the request goes through TensorFlow's Pluggable Device plugin mechanism:

import tensorflow as tf

# Check whether NPU devices are visible
gpus = tf.config.list_physical_devices("GPU")   # conventional GPUs
npus = tf.config.list_physical_devices("NPU")  # Ascend NPUs

print(f"GPU devices: {gpus}")
print(f"NPU devices: {npus}")
# Output: NPU devices: [PhysicalDevice(name='/physical_device:NPU:0', device_type='NPU')]

# Run on the NPU
with tf.device("NPU:0"):
    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    w = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    y = tf.matmul(x, w)

print(y)
# tf.Tensor([[1. 3.] [3. 7.]], shape=(2, 2), dtype=float32)

What happens under the hood (simplified):

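1. TensorFlow encounters the with tf.device("NPU:0") block.
2. It looks through the registered Pluggable Devices and finds the plugin for "NPU" (the CANN TensorFlow adapter).
3. The ops in the block are registered in the NPU device's op mapping table.
4. At execution time, each TensorFlow op is resolved through the mapping table to its CANN implementation (in ops-nn/ops-transformer).
5. Memory is allocated on the NPU, the kernel runs, and the result is returned.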
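
The table lookup described in the steps above can be pictured with a toy dispatcher. This is a minimal sketch in plain Python, not the adapter's code: `OP_TABLE`, `dispatch`, `UnsupportedOpError`, and the kernel name `ops_nn::Relu` are hypothetical, and ordinary Python functions stand in for NPU kernels.

```python
import math

# Hypothetical mapping table: TensorFlow op name -> (backend kernel name, implementation).
# Real kernels run on the NPU; plain Python functions stand in for them here.
OP_TABLE = {
    "Exp": ("ops_math::Exp", lambda xs: [math.exp(x) for x in xs]),
    "Relu": ("ops_nn::Relu", lambda xs: [max(0.0, x) for x in xs]),
}


class UnsupportedOpError(Exception):
    """Raised when a TensorFlow op has no entry in the mapping table."""


def dispatch(tf_op_name, inputs):
    """Look up the backend kernel for a TF op and run it on the inputs."""
    try:
        kernel_name, kernel = OP_TABLE[tf_op_name]
    except KeyError:
        raise UnsupportedOpError(f"{tf_op_name} has no NPU mapping") from None
    return kernel_name, kernel(inputs)


name, out = dispatch("Relu", [-1.0, 2.0])
print(name, out)
```

An op that is missing from the table, such as a string op, fails at lookup time, before any kernel runs. That is the kind of failure described in the introduction.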

import logging

import tensorflow as tf

# Turn on DEBUG logging to watch the op mapping
logging.getLogger("tensorflow").setLevel(logging.DEBUG)

with tf.device("NPU:0"):
    x = tf.constant([1.0, 2.0, 3.0])
    y = tf.math.exp(x)  # this op maps to the Exp op in ops-math

# The log then shows lines like:
# DEBUG: Mapping TF Op Exp to CANN Op ops_math::Exp
# DEBUG: Allocating NPU memory: 12 bytes (3 x fp32)
# DEBUG: Launching NPU kernel: ops_math::Exp (grid=1, block=3)

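Static graph mode vs Eager mode: the adapter's two paths

TensorFlow has two execution modes: Eager mode (runs line by line, easy to debug) and Graph mode (builds a graph first and then runs it, better performance). The TensorFlow adapter handles the two along different paths.

Eager mode (similar to PyTorch):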

import tensorflow as tf

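# Eager execution is already the default in TF 2.x; this call additionally
# forces @tf.function bodies to run eagerly
tf.config.run_functions_eagerly(True)

with tf.device("NPU:0"):
    for i in range(100):
        x = tf.random.normal((1024, 1024))
        w = tf.random.normal((1024, 1024))
        y = tf.matmul(x, w)  # the op is dispatched again on every iteration

# Problem: every matmul goes through op mapping, memory allocation, and kernel launch,
# so the overhead is high (the same issue as PyTorch's eager mode)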

Graph mode (better performance):

import tensorflow as tf

# Option 1: the @tf.function decorator (recommended)
@tf.function
def my_matmul(x, w):
    return tf.matmul(x, w)

with tf.device("NPU:0"):
    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    w = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    y = my_matmul(x, w)  # the first call traces a Graph; later calls run the graph directly

# Option 2: build the Graph directly (legacy TF1 style)
# Note: disable_eager_execution() must run at program startup, before any ops
# or tensors are created, so put this in a separate script from Option 1
import tensorflow.compat.v1 as tf_v1
tf_v1.disable_eager_execution()

x = tf_v1.placeholder(tf.float32, shape=(None, 1024))
w = tf_v1.placeholder(tf.float32, shape=(1024, None))
y = tf.matmul(x, w)

with tf_v1.Session() as sess:
    result = sess.run(y, feed_dict={x: ..., w: ...})

How the TensorFlow adapter optimizes Graph mode:
When you use @tf.function or a legacy Graph, the TensorFlow adapter performs a one-time graph conversion:

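import tensorflow as tf

@tf.function
def transformer_layer(x, w_q, w_k, w_v, w_o):
    # This function is traced into a Graph on its first call
    q = tf.matmul(x, w_q)  # MatMul
    k = tf.matmul(x, w_k)
    v = tf.matmul(x, w_v)

    # Attention score
    score = tf.matmul(q, k, transpose_b=True)
    score = score / tf.math.sqrt(64.0)
    score = tf.nn.softmax(score, axis=-1)

    # Attention output
    attn_out = tf.matmul(score, v)
    output = tf.matmul(attn_out, w_o)
    return output

with tf.device("NPU:0"):
    # Example inputs: 16 tokens with hidden size 64 (matches the sqrt(64.0) scale above)
    x = tf.random.normal((16, 64))
    w_q, w_k, w_v, w_o = (tf.random.normal((64, 64)) for _ in range(4))

    # First call: trace + op mapping + memory planning
    y1 = transformer_layer(x, w_q, w_k, w_v, w_o)
    # Later calls: run the optimized graph directly (ops are already mapped)
    y2 = transformer_layer(x, w_q, w_k, w_v, w_o)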
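
To make the "trace once, reuse afterwards" behavior concrete, here is a minimal sketch of a trace cache in plain Python. It is a toy stand-in for @tf.function, not how TensorFlow or the adapter implements it: `traced`, `shape_of`, and the `stats` counters are hypothetical, and the "plan" that gets cached is just the original function.

```python
import functools


def shape_of(x):
    """Shape of a rectangular nested list, e.g. [[1, 2], [3, 4]] -> (2, 2)."""
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0] if x else None
    return tuple(shape)


def traced(fn):
    """Toy stand-in for @tf.function: 'trace' once per input signature, then reuse."""
    cache = {}                          # input signature -> prepared plan
    stats = {"traces": 0, "calls": 0}

    @functools.wraps(fn)
    def wrapper(*args):
        signature = tuple(shape_of(a) for a in args)
        if signature not in cache:
            stats["traces"] += 1        # the expensive step: op mapping, memory planning
            cache[signature] = fn
        stats["calls"] += 1
        return cache[signature](*args)

    wrapper.stats = stats
    return wrapper


@traced
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


x = [[1.0, 2.0], [3.0, 4.0]]
w = [[1.0, 1.0], [0.0, 1.0]]
matmul(x, w)               # first call: traces
matmul(x, w)               # same shapes: reuses the cached plan
matmul([[1.0, 2.0]], w)    # new input shape: traces again
print(matmul.stats)        # {'traces': 2, 'calls': 3}
```

The third call shows the flip side of caching: a new input shape triggers another trace, so feeding a graph-mode function constantly changing shapes can erase the benefit.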

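Graph conversion optimizations (performed automatically by the TensorFlow adapter):

Operator fusion: MatMul → Add → ReLU is fused into a single op (using the fused implementation in ops-nn)
Memory planning: memory for all tensors is allocated statically (no per-call allocation)
Kernel selection: the best kernel is chosen based on shape (via opscene)
XLA fusion: with XLA enabled, cross-op loop fusion is applied as well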
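
The operator fusion item can be sketched as a pattern-matching pass. This is an illustration only: it works on a linear list of op names, whereas real graphs are DAGs, and `fuse_matmul_add_relu` and the fused op name `FusedMatMulAddRelu` are hypothetical, not identifiers from CANN.

```python
def fuse_matmul_add_relu(ops):
    """Replace every consecutive MatMul, Add, Relu run with one fused op.

    `ops` is a linear list of op names, which is enough to show the idea.
    """
    pattern = ["MatMul", "Add", "Relu"]
    fused, i = [], 0
    while i < len(ops):
        if ops[i:i + 3] == pattern:
            fused.append("FusedMatMulAddRelu")   # hypothetical fused kernel name
            i += 3
        else:
            fused.append(ops[i])
            i += 1
    return fused


graph = ["MatMul", "Add", "Relu", "MatMul", "Softmax", "MatMul", "Add", "Relu"]
optimized = fuse_matmul_add_relu(graph)
print(optimized)
# ['FusedMatMulAddRelu', 'MatMul', 'Softmax', 'FusedMatMulAddRelu']
print(f"kernel launches: {len(graph)} -> {len(optimized)}")   # 8 -> 4
```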

XLA compilation: TensorFlow's JIT compiler

The TensorFlow adapter supports XLA (Accelerated Linear Algebra) compilation. XLA is TensorFlow's JIT compiler and can fuse multiple ops into a single optimized kernel.

import tensorflow as tf

# Enable XLA JIT (auto-clustering) globally
tf.config.optimizer.set_jit(True)

@tf.function(jit_compile=True)  # this function is compiled with XLA
def optimized_matmul(x, w):
    y = tf.matmul(x, w)
    y = tf.nn.relu(y)
    y = tf.matmul(y, w)  # reuses w
    return y

with tf.device("NPU:0"):
    x = tf.random.normal((1024, 1024))
    w = tf.random.normal((1024, 1024))

    # First call: XLA compilation (slow, can take a few seconds)
    y = optimized_matmul(x, w)

    # Later calls: run the compiled XLA kernel directly (fast)
    y = optimized_matmul(x, w)

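What XLA compilation buys you on the NPU:
XLA can fuse multiple ops into one kernel, which cuts the number of HBM accesses (similar to the pipeline fusion covered in the ATB post). XLA's fusion is generic, though (it works on XLA's HLO IR), and is less aggressive than ATB's specialized optimizations.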
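
A back-of-the-envelope count shows why fusion reduces HBM traffic for a matmul, relu, matmul chain on 1024x1024 fp32 tensors. This is an idealized model rather than a measurement: it assumes each kernel reads every input and writes every output exactly once, and it ignores caches and tiling. The helper `traffic_mib` is hypothetical.

```python
BYTES_FP32 = 4
N = 1024
tensor_mib = N * N * BYTES_FP32 / 2**20   # one 1024x1024 fp32 tensor = 4 MiB

# Each entry: (kernel, tensors read from HBM, tensors written to HBM)
unfused = [("matmul", 2, 1), ("relu", 1, 1), ("matmul", 2, 1)]
# Fused: relu is applied before the first matmul writes its result back
fused = [("matmul+relu", 2, 1), ("matmul", 2, 1)]


def traffic_mib(kernels):
    """Total HBM traffic in MiB under the read-once/write-once assumption."""
    return sum((reads + writes) * tensor_mib for _, reads, writes in kernels)


print(f"unfused: {traffic_mib(unfused):.0f} MiB")   # 32 MiB
print(f"fused:   {traffic_mib(fused):.0f} MiB")     # 24 MiB
```

Under this model, fusing the relu removes one 4 MiB write and one 4 MiB read of the intermediate tensor, a 25% cut in traffic for this chain.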

# Performance comparison: with XLA vs without XLA
import time

import tensorflow as tf

@tf.function
def without_xla(x, w):
    y = tf.matmul(x, w)
    y = tf.nn.relu(y)
    y = tf.matmul(y, w)
    return y

@tf.function(jit_compile=True)
def with_xla(x, w):
    y = tf.matmul(x, w)
    y = tf.nn.relu(y)
    y = tf.matmul(y, w)
    return y

with tf.device("NPU:0"):
    x = tf.random.normal((1024, 1024))
    w = tf.random.normal((1024, 1024))

    # Warmup (also triggers tracing and XLA compilation)
    _ = without_xla(x, w).numpy()
    _ = with_xla(x, w).numpy()

    # Time without XLA
    t0 = time.perf_counter()
    for _ in range(100):
        y = without_xla(x, w)
    _ = y.numpy()  # wait for the device to finish before stopping the timer
    t_no_xla = (time.perf_counter() - t0) / 100 * 1000

    # Time with XLA
    t0 = time.perf_counter()
    for _ in range(100):
        y = with_xla(x, w)
    _ = y.numpy()  # wait for the device to finish before stopping the timer
    t_xla = (time.perf_counter() - t0) / 100 * 1000

    print(f"Without XLA: {t_no_xla:.3f}ms")
    print(f"With XLA:    {t_xla:.3f}ms")
    print(f"XLA speedup: {t_no_xla/t_xla:.2f}x")

# Measured (Ascend 910):
# Without XLA: 4.821ms
# With XLA:    3.214ms
# XLA speedup: 1.50x
#
# Note: the XLA speedup depends on the model structure.
# If the model is already heavily optimized (uses ATB), the gain from XLA is much smaller.

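Comparison with the PyTorch adapter

Feature	PyTorch adapter	TensorFlow adapter
Execution mode	Eager (default)	Eager by default in TF 2.x; Graph via @tf.function is the optimized path
Op mapping	Dynamic (mapped on every execution)	Static (mapped when the graph is built)
JIT compilation	Supported (via torch.compile)	Supported (XLA via @tf.function(jit_compile=True))
Debugging difficulty	Easy (step through line by line)	Hard (intermediate results are not visible in Graph mode)
Performance	Medium (eager) to high (torch.compile)	High (Graph+XLA)
Op coverage	Broad (2000+ ops)	Moderate (main ops covered, limited sparse op support)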

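The sparse tensor pitfall

TensorFlow's sparse tensors (tf.sparse.SparseTensor) have limited support in the CANN TensorFlow adapter. If your model relies heavily on sparse operations (embedding lookups in a recommendation system, for example), you may need to change code.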

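import tensorflow as tf

# A sparse op that may not be supported
sp_input = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 2]],
    values=[1.0, 2.0],
    dense_shape=[3, 4]
)
dense_matrix = tf.ones((4, 2))  # example dense operand; its shape is chosen to match sp_input

# This op may not be supported by the TensorFlow adapter
sp_output = tf.sparse.sparse_dense_matmul(sp_input, dense_matrix)
# Error: Operator SparseTensorDenseMatMul is not supported on NPU

# Workaround 1: convert to a dense tensor (wastes memory)
dense_input = tf.sparse.to_dense(sp_input)
output = tf.matmul(dense_input, dense_matrix)

# Workaround 2: use CANN's sparse ops (requires calling the sparse variants in ops-nn by hand)
# This means changing model code, so it does not suit a quick migration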
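
How much memory the dense workaround wastes depends on sparsity, and a quick estimate makes that visible. The sketch below assumes the COO layout that tf.sparse.SparseTensor uses (int64 indices, one value per nonzero, an int64 dense_shape) with fp32 values. The helpers `dense_bytes` and `coo_bytes` and the large example shape are hypothetical.

```python
def dense_bytes(shape, value_bytes=4):
    """Bytes for a dense tensor of the given shape (fp32 by default)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * value_bytes


def coo_bytes(shape, nnz, value_bytes=4, index_bytes=8):
    """Bytes for a COO sparse tensor: indices [nnz, rank] + values [nnz] + dense_shape [rank]."""
    rank = len(shape)
    return nnz * (rank * index_bytes + value_bytes) + rank * index_bytes


# The 3x4 tensor with 2 nonzeros from the example above: dense is actually smaller
print(dense_bytes((3, 4)), coo_bytes((3, 4), nnz=2))            # 48 56

# A recommendation-style matrix with 0.1% nonzeros: dense is 200x larger
shape, nnz = (100_000, 10_000), 1_000_000
print(dense_bytes(shape), coo_bytes(shape, nnz))                # 4000000000 20000016
```

For small or fairly dense tensors, converting with tf.sparse.to_dense costs little. For large, very sparse inputs, the dense copy can dominate device memory.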

Distributed training: TF's MultiWorkerMirroredStrategy

TensorFlow does distributed training through the tf.distribute.Strategy API. The TensorFlow adapter supports MultiWorkerMirroredStrategy (synchronous training, similar to PyTorch's DDP).

import tensorflow as tf

# Set up the multi-device strategy
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Build the model inside strategy.scope()
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train (parallelized across devices automatically)
# train_dataset is assumed to be a tf.data.Dataset defined elsewhere
model.fit(train_dataset, epochs=10, steps_per_epoch=1000)

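The communication mechanism behind it:
On the NPU, MultiWorkerMirroredStrategy uses HCCL for allreduce (the same communication library PyTorch's DDP uses). The TensorFlow adapter's implementation differs from PyTorch's, though: it aggregates gradients at the graph level, so extra AllReduce nodes show up in the Graph.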
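
What an allreduce computes can be shown in a few lines of plain Python. This sketch covers only the arithmetic (an elementwise sum that every worker receives). It does not model HCCL's communication pattern, and `allreduce_sum` is a hypothetical helper.

```python
def allreduce_sum(per_worker_grads):
    """Sum gradients elementwise across workers; every worker gets the same result."""
    summed = [sum(vals) for vals in zip(*per_worker_grads)]
    return [list(summed) for _ in per_worker_grads]


# Two workers, each holding a local gradient for the same two parameters
grads = [[1.0, 2.0], [3.0, 6.0]]
synced = allreduce_sum(grads)
print(synced)   # [[4.0, 8.0], [4.0, 8.0]]
```

After the allreduce, all workers apply identical updates, which is what keeps the model replicas in sync. Whether the summed gradient also needs to be divided by the number of workers depends on how the loss was scaled.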

# Inspect the computation graph with TensorBoard
import datetime
log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.fit(train_dataset, epochs=10, callbacks=[tensorboard_callback])

# In TensorBoard you can see:
#   - the device placement of each op (which run on the NPU, which on the CPU)
#   - where the AllReduce nodes sit and how much data they communicate
#   - how the gradients flow

Workarounds for unsupported ops

Like the PyTorch adapter, the TensorFlow adapter also runs into unsupported ops. The approach is similar:

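import tensorflow as tf

# Problem: a TensorFlow op is not supported on the NPU
try:
    with tf.device("NPU:0"):
        y = tf.strings.split("hello world", " ")  # string op, most likely unsupported
except (tf.errors.OpError, RuntimeError) as e:  # TF op errors derive from tf.errors.OpError
    print(f"Op not supported: {e}")

    # Workaround: run this op on the CPU, then pass the result back to the NPU
    with tf.device("CPU:0"):
        y_cpu = tf.strings.split("hello world", " ")
    # Note: this cross-device transfer adds performance overhead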
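
The cost of CPU fallback depends on where the unsupported ops sit in the pipeline. The toy model below assigns each op to the NPU or the CPU and counts device switches between neighboring ops. It is a sketch under stated assumptions: `NPU_SUPPORTED` is a made-up support set, and `place` and `count_transfers` are hypothetical helpers, not adapter APIs.

```python
# Hypothetical set of ops with an NPU implementation
NPU_SUPPORTED = {"MatMul", "Relu", "Softmax"}


def place(ops):
    """Assign each op to the NPU if supported, otherwise fall back to the CPU."""
    return [(op, "NPU" if op in NPU_SUPPORTED else "CPU") for op in ops]


def count_transfers(placement):
    """Count device switches between neighboring ops in a linear pipeline."""
    devices = [device for _, device in placement]
    return sum(1 for a, b in zip(devices, devices[1:]) if a != b)


# String preprocessing grouped at the front: one CPU -> NPU hop
grouped = ["StringSplit", "StringToNumber", "MatMul", "Relu", "MatMul", "Softmax"]
# String ops interleaved with compute: a hop at every boundary
interleaved = ["MatMul", "StringSplit", "MatMul", "StringSplit", "MatMul"]

print(count_transfers(place(grouped)))       # 1
print(count_transfers(place(interleaved)))   # 4
```

In this model, keeping CPU-only ops together at the start of the pipeline, for example in input preprocessing, gives a single transfer, while interleaving them with NPU ops gives one per boundary.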