profiling-suite for beginners: what exactly is Ascend's performance analysis suite?

子春一 · Published 2026-05-22 21:42:21

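A while back I was tuning inference performance. It felt slow, but I couldn't tell where the time was going. A buddy said: "Just run it through profiling-suite and see where the time goes."

Fair enough. Here is the whole story in one go.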

What is profiling-suite?

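profiling-suite is Ascend's performance analysis suite. It breaks down per-operator time, memory usage, and resource utilization so you can see where the bottleneck is: which operator is slow and which one uses the most memory.

Annoying, isn't it? I used to know it was slow without knowing why, and now one run makes it all clear.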

Why use profiling-suite?

In short: to find the bottleneck.

Without profiling-suite (running blind)

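import time

# Run the model and time the whole loop
start = time.time()
for batch in dataloader:
    output = model(batch)
print(f"Total time: {time.time() - start:.3f}s")

# Output:
# Total time: 5.234s

# Problem: no idea which part is slow or which operator takes the most time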

With profiling-suite (running with full visibility)

import profiling_suite

# Enable profiling
profiler = profiling_suite.Profiler()
profiler.start()

# Run the model
for batch in dataloader:
    output = model(batch)

# Stop profiling
profiler.stop()

# Print the report
profiler.report()

# Output:
# ========================================
# Performance Report
# ========================================
# Total time: 5.234s
#
# Operator breakdown:
#   Conv2d_1      1.234s  (23.6%)  ████████████████████
#   Relu_1        0.123s  ( 2.4%)  ██
#   MaxPool_1     0.456s  ( 8.7%)  ████████
#   Matmul_1      2.345s  (44.8%)  ██████████████████████████████████████
#   Softmax_1     0.876s  (16.7%)  ████████████
#   Other         0.200s  ( 3.8%)  ███
#
# Memory breakdown:
#   Input:   12.3 MB
#   Weights: 98.5 MB
#   Output:  23.4 MB
#   Total:  134.2 MB
#
# Bottleneck: Matmul_1 (44.8% of time)
# Recommendation: Use mixed precision (FP16)

Annoying, isn't it? One glance tells you exactly what to optimize.
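To make the report less magical, here is a minimal sketch of how the percentage column and the "Bottleneck" line can be derived from raw per-operator timings. It uses only plain Python and the numbers from the sample report above; the `summarize` helper is a hypothetical name for illustration, not part of profiling-suite.

```python
def summarize(timings):
    """Return (rows, bottleneck) for a dict of operator name -> seconds.

    Each row is (name, seconds, share_of_total_in_percent), slowest first.
    """
    total = sum(timings.values())
    rows = [
        (name, seconds, 100.0 * seconds / total)
        for name, seconds in sorted(
            timings.items(), key=lambda kv: kv[1], reverse=True
        )
    ]
    bottleneck = rows[0][0]  # the slowest operator heads the sorted list
    return rows, bottleneck


# Timings copied from the sample report above
timings = {
    "Conv2d_1": 1.234,
    "Relu_1": 0.123,
    "MaxPool_1": 0.456,
    "Matmul_1": 2.345,
    "Softmax_1": 0.876,
    "Other": 0.200,
}

rows, bottleneck = summarize(timings)
for name, seconds, share in rows:
    print(f"{name:<10} {seconds:6.3f}s  ({share:4.1f}%)")
print(f"Bottleneck: {bottleneck}")
```

The point is simply that the bottleneck is whatever owns the largest share of total time, so that is where optimization effort pays off first.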

Three core concepts

1. Operator profiling

Measures how long each operator takes:

import profiling_suite

# Collect per-operator statistics
profiler = profiling_suite.Profiler(mode="operator")
profiler.start()

model(input)

profiler.stop()
profiler.print_operator_summary()

# Output:
# Operator   Time(ms)  Percentage  Calls
# --------   --------  ----------  -----
# Conv2d        12.34       23.5%     10
# Matmul        23.45       44.8%    100
# Relu           1.23        2.4%     50
# Softmax        8.76       16.7%     10
# Add            2.34        4.5%     20
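If you want a feel for what a per-operator summary involves, the sketch below accumulates wall-clock time and call counts per label using only the standard library. `OpTimer` is a hypothetical helper written for this post, not the profiling-suite API, and it measures host-side time only. On an accelerator, work is usually launched asynchronously, so a host timer like this is a teaching aid rather than a substitute for a device profiler.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class OpTimer:
    """Hypothetical helper: accumulates wall-clock time and calls per label."""

    def __init__(self):
        self.total_ms = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def record(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Always record, even if the timed block raises
            self.total_ms[name] += (time.perf_counter() - start) * 1000.0
            self.calls[name] += 1

    def summary(self):
        """Rows of (name, total_ms, percent_of_total, calls), slowest first."""
        grand_total = sum(self.total_ms.values())
        ordered = sorted(self.total_ms.items(), key=lambda kv: kv[1], reverse=True)
        return [
            (name, ms, 100.0 * ms / grand_total, self.calls[name])
            for name, ms in ordered
        ]


timer = OpTimer()
for _ in range(3):
    with timer.record("Matmul"):
        time.sleep(0.02)   # stand-in for a slow operator
    with timer.record("Relu"):
        time.sleep(0.002)  # stand-in for a cheap operator

print(f"{'Operator':<10}{'Time(ms)':>10}{'Percentage':>12}{'Calls':>7}")
for name, ms, share, calls in timer.summary():
    print(f"{name:<10}{ms:>10.2f}{share:>11.1f}%{calls:>7d}")
```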

2. Timeline analysis

The timeline shows operators in the order they execute:

import profiling_suite

# Timeline mode
profiler = profiling_suite.Profiler(mode="timeline")
profiler.start()

model(input)

profiler.stop()
profiler.save_timeline("timeline.json")

# Open the file in Chrome at chrome://tracing
# to inspect the order in which operators run and where they overlap
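For the curious: chrome://tracing reads the Chrome Trace Event Format, so a timeline file that opens there is presumably shaped along those lines. The sketch below builds a tiny trace by hand. Whether profiling-suite writes exactly these fields is an assumption on my part; `to_trace_events` and the sample timings are made up for illustration.

```python
import json


def to_trace_events(ops):
    """Convert (name, start_ms, duration_ms) tuples into Trace Event Format."""
    events = []
    for name, start_ms, duration_ms in ops:
        events.append({
            "name": name,
            "ph": "X",                  # "X" marks a complete event with a duration
            "ts": start_ms * 1000.0,    # the format expects microseconds
            "dur": duration_ms * 1000.0,
            "pid": 0,                   # single process
            "tid": 0,                   # single track
        })
    return {"traceEvents": events}


# Made-up operators running back to back
ops = [
    ("Conv2d", 0.0, 12.34),
    ("Relu", 12.34, 1.23),
    ("Matmul", 13.57, 23.45),
]

trace = to_trace_events(ops)
text = json.dumps(trace, indent=2)
print(text[:200])
```

Saving `text` to a `.json` file and loading it in the trace viewer should show three bars laid end to end on one track.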

3. Memory profiling

Measures memory usage:

import profiling_suite

# Memory mode
profiler = profiling_suite.Profiler(mode="memory")
profiler.start()

model(input)

profiler.stop()
profiler.print_memory_summary()

# Output:
# ========================================
# Memory Summary
# ========================================
# Peak memory: 234.5 MB
#
# Breakdown:
#   Conv2d_1         89.2 MB  (38.0%)
#   Matmul_1        123.4 MB  (52.6%)
#   Relu_1            0.0 MB  ( 0.0%)
#   Input            12.3 MB  ( 5.2%)
#   Output            9.6 MB  ( 4.1%)
#
# Location:
#   Device memory:  234.5 MB  (100.0%)
#   Host memory:      0.0 MB  (  0.0%)
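The "peak memory" idea is easy to try on the host side with Python's built-in `tracemalloc`. To be clear, this is a stand-in: it tracks Python allocations in host memory only and knows nothing about NPU device memory. `peak_host_memory` and `fake_inference` are hypothetical names for this sketch.

```python
import tracemalloc


def peak_host_memory(fn):
    """Run fn() and return (result, peak_bytes) for Python allocations."""
    tracemalloc.start()
    try:
        result = fn()
        # get_traced_memory() returns (current_bytes, peak_bytes)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def fake_inference():
    # 8 MiB scratch buffer standing in for activations; freed on return
    activations = bytearray(8 * 1024 * 1024)
    return len(activations)


size, peak = peak_host_memory(fake_inference)
print(f"Buffer size:      {size / 2**20:.1f} MiB")
print(f"Peak host memory: {peak / 2**20:.1f} MiB")
```

Note that the buffer is already gone by the time the function returns, yet the peak still records it. That is exactly why a profiler reports peak usage rather than usage at the end of the run.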

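Three concrete reasons to use profiling-suite

1. Pinpoint the problem

You used to optimize on gut feeling; now you have data:

# Hunch: the convolution is the slow part
# Data:  Matmul takes 44.8% of the time

# Hunch: it might be a memory problem
# Data:  peak memory is 234 MB, which is fine

# Hunch: operator fusion would speed things up
# Data:  operators run one at a time with no overlap, so there is nothing to fuse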

2. Quantify the effect of an optimization

Compare before and after:

# Before optimization
profiler_before.report()
# Total time: 5.234s

# After optimization
profiler_after.report()
# Total time: 3.456s

# Speedup: 1.51x
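The arithmetic behind that last line is worth spelling out, because "1.51x faster" and "34% less time" are the same result stated two ways. A minimal sketch using the two totals above (the helper names are mine):

```python
def speedup(before_s, after_s):
    """How many times faster the optimized run is."""
    return before_s / after_s


def time_saved_pct(before_s, after_s):
    """Share of the original runtime that was eliminated, in percent."""
    return 100.0 * (before_s - after_s) / before_s


before_s, after_s = 5.234, 3.456
print(f"Speedup: {speedup(before_s, after_s):.2f}x")           # Speedup: 1.51x
print(f"Time saved: {time_saved_pct(before_s, after_s):.1f}%")  # Time saved: 34.0%
```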

3. Guide the next optimization

It tells you what to optimize:

# Recommendations
recommendations = profiler.get_recommendations()
for rec in recommendations:
    print(rec)

# Output:
# 1. Use FP16 for Matmul (expected 2x speedup)
# 2. Fuse Conv2d + Bias + Relu (expected 1.3x speedup)
# 3. Enable memory reuse (save 50MB memory)
# 4. Increase batch size to 16 (better throughput)
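One caveat when reading numbers like "expected 2x speedup": that figure applies to the operator being optimized, not to the whole run. Amdahl's law gives a rough estimate of the end-to-end effect. The sketch below assumes Matmul holds the 44.8% share from the earlier sample report and that nothing else changes; it is my own back-of-the-envelope check, not something the tool outputs.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: speed up `fraction` of the runtime by `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


matmul_share = 0.448   # Matmul's share of total time in the sample report
total_before_s = 5.234

estimate = overall_speedup(matmul_share, 2.0)
print(f"Estimated overall speedup: {estimate:.2f}x")                # 1.29x
print(f"Estimated new total time: {total_before_s / estimate:.3f}s")  # 4.062s
```

So doubling the speed of an operator that takes under half the runtime buys roughly 1.3x overall, which is why it pays to check the share column before choosing what to optimize.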

How to use it: code examples

Example 1: Basic performance analysis

import profiling_suite
import time
import numpy as np

# Define the model
class SimpleModel:
    def __init__(self):
        self.conv = np.random.randn(64, 3, 7, 7)

    def forward(self, x):
        # Simulated convolution
        y = np.random.randn(1, 64, 112, 112)
        time.sleep(0.1)  # simulated compute

        # Simulated ReLU
        y = np.maximum(y, 0)
        time.sleep(0.01)

        return y

model = SimpleModel()

# Create the profiler
profiler = profiling_suite.Profiler()

# Start profiling
profiler.start()

# Run the model several times
for i in range(10):
    x = np.random.randn(1, 3, 224, 224)
    y = model.forward(x)

# Stop profiling
profiler.stop()

# Print the report
profiler.report()

# Output:
# Total time: 1.100s
#
# Breakdown:
#   forward         1.100s (100.0%)
#     Conv2d         1.000s (90.9%)
#     ReLU           0.100s  (9.1%)

Example 2: Timeline analysis

import profiling_suite
import numpy as np
import time

# Set timeline mode
profiler = profiling_suite.Profiler(mode="timeline", trace=True)

# Start
profiler.start()

# Model inference
input_data = np.random.randn(1, 3, 224, 224)

# Layer 1
conv1_out = np.random.randn(1, 64, 112, 112)
profiler.add_marker("Conv1 done")

# Activation
relu1_out = np.maximum(conv1_out, 0)
profiler.add_marker("Relu1 done")

# Layer 2
conv2_out = np.random.randn(1, 128, 56, 56)
profiler.add_marker("Conv2 done")

# Stop
profiler.stop()

# Save the timeline
profiler.save_timeline("model_trace.json")

print("Timeline saved to model_trace.json")
print("Open in Chrome at chrome://tracing")

Example 3: Memory analysis

import profiling_suite
import numpy as np

# Memory profiling mode
profiler = profiling_suite.Profiler(mode="memory")

profiler.start()

# Simulate inference on a large model
# Input: 1*3*224*224 * 4 bytes = 0.6 MB
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
profiler.mark("Input allocated")

# Convs: 64 layers
for i in range(64):
    # Each conv weight: out_ch * in_ch * kH * kW * 4 bytes
    pass
profiler.mark("Weights loaded")

# Intermediate forward activations: on the order of 100 MB
profiler.mark("Forward intermediate")

# Output: 1*1000 * 4 bytes = 4 KB (cast to float32 so the 4-byte figure holds)
output = np.random.randn(1, 1000).astype(np.float32)
profiler.mark("Inference done")

# Backward gradients (only relevant when profiling training)
profiler.mark("Backward done")

profiler.stop()

# Print the memory report
profiler.print_memory()

# Output:
# Memory: 234.5 MB peak
#
# By lifecycle:
#   Input:         0.6 MB  ( 0.3%)
#   Weights:      98.5 MB  (42.0%)
#   Output:        0.0 MB  ( 0.0%)
#   Activations: 135.4 MB  (57.7%)
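The size comments in that example are just element count times bytes per element, and it is easy to check them yourself. A small sketch with a hypothetical `tensor_bytes` helper; the 64x3x7x7 weight shape is borrowed from Example 1 as a sample conv layer:

```python
import math


def tensor_bytes(shape, bytes_per_element=4):
    """Memory footprint of a dense tensor (default: 4-byte FP32 elements)."""
    return math.prod(shape) * bytes_per_element


input_bytes = tensor_bytes((1, 3, 224, 224))        # FP32 input image
output_bytes = tensor_bytes((1, 1000))              # FP32 logits
conv_weight_bytes = tensor_bytes((64, 3, 7, 7))     # one FP32 conv weight
input_fp16_bytes = tensor_bytes((1, 3, 224, 224), bytes_per_element=2)

print(f"Input:       {input_bytes} bytes ({input_bytes / 1e6:.1f} MB)")
print(f"Output:      {output_bytes} bytes ({output_bytes / 1e3:.0f} KB)")
print(f"Conv weight: {conv_weight_bytes} bytes")
print(f"Input FP16:  {input_fp16_bytes} bytes")
```

Halving the element size halves the footprint, which is one reason FP16 shows up in both the speed and the memory recommendations.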

Example 4: Comparative analysis

import profiling_suite
import numpy as np
import time

# Compare two implementations
# Note: time.sleep stands in for device compute time. On a CPU, NumPy has no
# BLAS-backed float16 matmul, so the real np.dot calls add host time that the
# sleep-based figures below do not include.
def version_fp32():
    """FP32 version"""
    data = np.random.randn(1024, 1024).astype(np.float32)
    weights = np.random.randn(1024, 1024).astype(np.float32)
    time.sleep(0.1)  # simulated matmul time
    return np.dot(data, weights)

def version_fp16():
    """FP16 version"""
    data = np.random.randn(1024, 1024).astype(np.float16)
    weights = np.random.randn(1024, 1024).astype(np.float16)
    time.sleep(0.05)  # simulated faster matmul
    return np.dot(data, weights)

# Profile FP32
profiler_fp32 = profiling_suite.Profiler(name="FP32")
profiler_fp32.start()
for _ in range(100):
    version_fp32()
profiler_fp32.stop()

# Profile FP16
profiler_fp16 = profiling_suite.Profiler(name="FP16")
profiler_fp16.start()
for _ in range(100):
    version_fp16()
profiler_fp16.stop()

# Comparison report
print("\n=== Comparison ===")
print(f"FP32: {profiler_fp32.get_total_time():.3f}s")
print(f"FP16: {profiler_fp16.get_total_time():.3f}s")
print(f"Speedup: {profiler_fp32.get_total_time() / profiler_fp16.get_total_time():.2f}x")

# Output (illustrative, counting only the simulated sleep time):
# === Comparison ===
# FP32: 10.000s
# FP16: 5.000s
# Speedup: 2.00x
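A practical note on comparisons like this: a single timed run is noisy, so it helps to discard a warm-up run and take the median of several repeats. Here is a minimal standard-library sketch of that idea. `benchmark`, `slow_version`, and `fast_version` are hypothetical names, and the sleeps merely stand in for real work.

```python
import statistics
import time


def benchmark(fn, repeats=5, warmup=1):
    """Return the median wall-clock time of fn() in seconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not measured
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def slow_version():
    time.sleep(0.03)  # stand-in for the baseline implementation


def fast_version():
    time.sleep(0.01)  # stand-in for the optimized implementation


t_slow = benchmark(slow_version)
t_fast = benchmark(fast_version)
print(f"Baseline:  {t_slow * 1000:.1f} ms")
print(f"Optimized: {t_fast * 1000:.1f} ms")
print(f"Speedup:   {t_slow / t_fast:.2f}x")
```

The median is used instead of the mean so that one slow outlier (a background process waking up, for example) does not skew the comparison.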

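Performance data

Example gains from optimizations guided by profiling-suite:

Before	After	Gain
FP32 matmul	FP16 matmul	2.0x speedup
Unfused operators	Fused operators	1.3x speedup
Static batch	Dynamic batch	1.5x speedup
No optimization	Memory reuse	50 MB saved

Annoying, isn't it? You can only optimize precisely once you have the data.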

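Relationship to other repositories

In the CANN architecture, profiling-suite sits in layer 3 (the Ascend computing compilation layer) as the performance analysis tool.

Dependencies:

profiling-suite (performance analysis)
    ↓ analyzes
ge (graph engine)
    ↓ executes
Operator libraries (ops-xxx)
    ↓ implemented on
Hardware (Ascend NPU)

To explain:

ge: the graph engine, which schedules operators
profiling-suite: analyzes ge's execution
Operator libraries: do the actual computation
Hardware: the Ascend NPU

Put simply, profiling-suite is a "CT scanner" for performance: it gives the model a check-up and shows wherever something is wrong.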

Core capabilities of profiling-suite

1. Operator Profiling

profiler = profiling_suite.Profiler(mode="operator")
profiler.start()
model()
profiler.stop()
profiler.print_operator_summary()

2. Timeline Profiling

profiler = profiling_suite.Profiler(mode="timeline", trace=True)
profiler.start()
model()
profiler.stop()
profiler.save_timeline("trace.json")

3. Memory Profiling

profiler = profiling_suite.Profiler(mode="memory")
profiler.start()
model()
profiler.stop()
profiler.print_memory_summary()

4. Comparison

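# Profile each version with its own Profiler (as in the comparison example), then compare totals
baseline_s = profiler_fp32.get_total_time()
optimized_s = profiler_fp16.get_total_time()
print(f"Speedup: {baseline_s / optimized_s:.2f}x")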

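When to use it

Use profiling-suite for:

Performance tuning: you don't know which part is slow
Memory tuning: you are running out of memory
Comparison experiments: before-and-after measurements
Validation: confirming that performance meets the target

Skip it for:

Production: profiling adds overhead that affects performance
Simple cases: just run the model and be done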

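Summary

profiling-suite is Ascend's "performance CT scanner":

Operator analysis: accurate down to each operator
Timeline: shows the execution order
Memory: shows memory allocation
Comparison: before-and-after optimization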
