Production Monitoring for Ascend NPUs: Making Your Model "Visible"

2501_94551709 · Published 2026-05-24 01:36:50

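A model that runs fine in the training environment is not guaranteed to behave in production. Once it is deployed, you need to know when it is busy, when it is idle, and when it is about to fail.

This article walks step by step through building a complete monitoring stack for an Ascend NPU inference service, covering metric collection, visualization, alert configuration, and automatic detection of common problems.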

I. Monitoring Architecture Design

A robust monitoring system has three layers: infrastructure, performance, and business.

Core Monitoring Metrics

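Layer	Metric	Meaning	Suggested alert threshold
Infrastructure	`npu_utilization`	Overall NPU utilization	< 30% (sustained 5 min): wasted resources; > 95% (sustained 1 min): bottleneck
	`npu_temperature`	Temperature	> 80°C: risk of frequency throttling; > 85°C: emergency shutdown
	`npu_power_watts`	Power draw	> TDP limit: protection triggered
	`memory_used_gb`	Device memory usage	> 90% of total: OOM warning
Performance	`inference_latency_p99`	P99 latency	Exceeds the SLA target (e.g. 200 ms)
	`throughput_qps`	Requests per second	Sudden drop: possible service failure
	`error_rate`	Error rate	> 0.1%
Business	`queue_length`	Queue length	> 100: responses slow down
	`success_rate`	Success rate	< 99.9%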
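
Several thresholds in the table only count when the condition persists (for example, utilization below 30% for 5 minutes). The sketch below shows one way to express such a "sustained breach" check in plain Python. The class name `SustainedRule` and its fields are made up for illustration; in a real deployment this logic usually lives in the alerting system rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SustainedRule:
    name: str
    breached: Callable[[float], bool]  # True when a sample violates the threshold
    hold_seconds: float                # how long the breach must persist before firing
    _since: Optional[float] = None     # time at which the current breach started

    def update(self, value: float, now: float) -> bool:
        """Feed one sample; return True once the breach has lasted hold_seconds."""
        if not self.breached(value):
            self._since = None         # condition cleared, reset the timer
            return False
        if self._since is None:
            self._since = now          # first violating sample of this streak
        return now - self._since >= self.hold_seconds

# Thresholds taken from the utilization row of the table
low_util = SustainedRule("npu_underutilized", lambda v: v < 30.0, hold_seconds=300)
high_util = SustainedRule("npu_saturated", lambda v: v > 95.0, hold_seconds=60)
```

Call `update(value, time.time())` on every sample. A single good sample resets the timer, which keeps short dips or spikes from raising alerts.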

II. Hands-On: NPU Metric Collector

Option A: Lightweight Collection (`npu-smi`)

Suitable for most scenarios: good compatibility and no extra dependencies.

import subprocess
import re
import time
import threading
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class NPUMetrics:
    device_id: int
    timestamp: float
    utilization: float = 0.0
    power_watts: float = 0.0
    temperature_c: float = 0.0
    memory_used_mb: float = 0.0
    memory_total_mb: float = 0.0

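class NPUCollector:
    """
    Lightweight collector built on npu-smi.

    Characteristics:
      - Non-invasive (no changes to the inference code)
      - Broadly compatible (npu-smi is available across CANN versions, though its output format may differ)
      - Low overhead (one subprocess call per interval, 1 second by default)
    """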
    
    def __init__(self, device_ids: Optional[List[int]] = None, interval: float = 1.0):
        self.device_ids = device_ids or [0]
        self.interval = interval
        self.running = False
        self._thread: Optional[threading.Thread] = None  # created by start()
        self._latest_metrics: Dict[int, NPUMetrics] = {}
        self._lock = threading.Lock()

    def start(self):
        if self.running:
            return
        self.running = True
        self._thread = threading.Thread(target=self._collect_loop, daemon=True)
        self._thread.start()
        print(f"NPU collector started: interval={self.interval}s, devices={self.device_ids}")

    def stop(self):
        self.running = False
        # _thread is still None if stop() is called before start()
        if self._thread:
            self._thread.join(timeout=2)
            
    def get_latest(self, device_id: int) -> Optional[NPUMetrics]:
        with self._lock:
            return self._latest_metrics.get(device_id)

    def _collect_loop(self):
        while self.running:
            try:
                metrics = self._collect_all()
                with self._lock:
                    self._latest_metrics.update(metrics)
            except Exception as e:
                print(f"[ERROR] Collection failed: {e}")
            time.sleep(self.interval)

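    def _collect_all(self) -> Dict[int, NPUMetrics]:
        # Build the command: query only the fields we need so the output is easier to parse.
        # NOTE: these flags are the ones used in this article. npu-smi options differ between
        # driver versions, so confirm them with `npu-smi info -h` on your machine first.
        cmd = [
            "npu-smi", "info",
            "-t", "utilization,power,temperature,memory",
            "-o", "table"
        ]

        result = subprocess.run(cmd, capture_output=True, text=True, timeout=5)
        if result.returncode != 0:
            raise RuntimeError(f"npu-smi failed: {result.stderr}")

        return self._parse_table(result.stdout)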

    def _parse_table(self, output: str) -> Dict[int, NPUMetrics]:
        """Parse table rows by each cell's unit suffix rather than by column position."""
        metrics = {}

        for line in output.strip().split('\n'):
            # Skip header rows, separator lines, and anything that is not a table row
            if '===' in line or '|' not in line or 'NPU' in line:
                continue

            # Strip the outer '|' so the first cell is the device id
            parts = [p.strip() for p in line.strip().strip('|').split('|')]
            if len(parts) < 5:
                continue

            try:
                # Assumed row format (verify against your npu-smi version):
                # | 0 | 910B | Ok | ... | 45% | 180W | 52C | 8192MB / 32768MB |
                device_id = int(parts[0])
            except ValueError:
                continue

            m = NPUMetrics(device_id=device_id, timestamp=time.time())
            for cell in parts[1:]:
                mem = re.fullmatch(r'(\d+)\s*MB\s*/\s*(\d+)\s*MB', cell)
                if mem:
                    # Memory: "used MB / total MB"
                    m.memory_used_mb = float(mem.group(1))
                    m.memory_total_mb = float(mem.group(2))
                elif re.fullmatch(r'\d+(\.\d+)?\s*%', cell):
                    m.utilization = float(cell.rstrip('%'))
                elif re.fullmatch(r'\d+(\.\d+)?\s*W', cell):
                    m.power_watts = float(cell.rstrip('W'))
                elif re.fullmatch(r'\d+(\.\d+)?\s*C', cell):
                    m.temperature_c = float(cell.rstrip('C'))
            metrics[device_id] = m

        return metrics

# ===== Usage example =====
collector = NPUCollector(interval=1.0)
collector.start()

try:
    while True:
        m = collector.get_latest(0)
        if m:
            print(f"[{m.timestamp:.0f}] Util:{m.utilization:.1f}% Temp:{m.temperature_c:.1f}C Mem:{m.memory_used_mb:.0f}/{m.memory_total_mb:.0f}MB")
        time.sleep(2)
except KeyboardInterrupt:
    collector.stop()

Option B: Deep Collection (CANN API / ACL)

If you need finer-grained data (such as AI Core versus Vector Unit utilization, or ECC error counts), you have to call the CANN API directly.

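# Pseudocode only: the `cann` module, `get_device_statistic`, and the `stats` fields are
# illustrative placeholders, not a confirmed CANN API. (The Python binding shipped with CANN,
# pyACL, is imported as `acl`.) Check the CANN documentation for your version for the real calls.
def collect_deep_metrics():
    import cann
    from cann import acl

    # Initialize the runtime and select the device
    acl.init()
    acl.set_device(0)

    # Fetch detailed statistics
    stats = acl.get_device_statistic(0)

    return {
        "ai_core_util": stats.ai_core_util,
        "vector_core_util": stats.vector_core_util,
        "ecc_corrected": stats.ecc_corrected_count,
        "ecc_uncorrected": stats.ecc_uncorrected_count,
        # ... more low-level metrics
    }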

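III. Instrumenting Inference Performance and Business Metrics

Hardware metrics are not enough; you also have to monitor how the model itself behaves.

1. Latency Breakdown

Do not look only at the total elapsed time; break it down to see where the bottleneck is.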

import time
from contextlib import contextmanager

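@contextmanager
def measure_latency(name=""):
    t0 = time.perf_counter()
    try:
        yield
    finally:
        # Record the latency even if the wrapped stage raises
        latency_ms = (time.perf_counter() - t0) * 1000
        # Report to Prometheus; record_metric stands in for your own reporting helper
        record_metric(f"inference_latency_{name}_ms", latency_ms)

# Use inside the inference loop
with measure_latency("preprocess"):
    data = preprocess(image)

with measure_latency("inference"):
    # NPU kernels may run asynchronously; if so, synchronize the device before
    # leaving this block so the timing covers actual execution
    output = model(data.npu())

with measure_latency("postprocess"):
    result = postprocess(output)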
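
The snippet above calls `record_metric` without defining it. As a minimal sketch, assuming you only want to inspect the breakdown locally before wiring up Prometheus, it could be an in-memory store like the one below. The names `_SAMPLES`, `percentile`, and `stage_shares` are hypothetical helpers, not part of any library.

```python
import statistics
from collections import defaultdict, deque
from typing import Deque, Dict

# Hypothetical in-memory store: metric name -> most recent samples (bounded)
_SAMPLES: Dict[str, Deque[float]] = defaultdict(lambda: deque(maxlen=10_000))

def record_metric(name: str, value: float) -> None:
    """Keep the latest samples of a metric so percentiles can be computed later."""
    _SAMPLES[name].append(value)

def percentile(name: str, pct: int) -> float:
    """Return the pct-th percentile (1..99) of the stored samples for a metric."""
    samples = list(_SAMPLES[name])
    if len(samples) < 2:
        raise ValueError(f"need at least 2 samples for {name!r}")
    # quantiles(n=100) returns the 99 cut points P1..P99
    return statistics.quantiles(samples, n=100, method="inclusive")[pct - 1]

def stage_shares(prefix: str = "inference_latency_") -> Dict[str, float]:
    """Fraction of the summed mean latency that each stage accounts for."""
    means = {
        name: statistics.fmean(samples)
        for name, samples in _SAMPLES.items()
        if name.startswith(prefix) and samples
    }
    total = sum(means.values())
    return {name: mean / total for name, mean in means.items()} if total else {}
```

After some traffic, `stage_shares()` shows which stage dominates, and `percentile("inference_latency_inference_ms", 99)` gives a local P99 to compare against the SLA.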

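2. Key Metric Definitions

Total Latency: end-to-end time, as perceived by the user.
TTFT (Time To First Token): LLM-specific; the time until the first token is generated.
TPOT (Time Per Output Token): the generation time of each subsequent token.
Queue Wait Time: the time a request spends waiting for a free slot.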
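
To make these definitions concrete, here is a small sketch that derives all four values from timestamps recorded for one request. The `RequestTiming` structure is hypothetical. It also assumes TTFT is measured from the moment a worker starts on the request, so that queue wait is reported separately; some teams measure TTFT from arrival instead.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RequestTiming:
    enqueued_at: float        # request accepted by the server
    started_at: float         # request picked up by a worker slot
    token_times: List[float]  # time at which each output token was emitted

def llm_latency_metrics(t: RequestTiming) -> dict:
    """Derive queue wait, TTFT, TPOT and total latency (all in seconds)."""
    if not t.token_times:
        raise ValueError("no tokens were generated")
    first, last = t.token_times[0], t.token_times[-1]
    n = len(t.token_times)
    return {
        "queue_wait_s": t.started_at - t.enqueued_at,
        "ttft_s": first - t.started_at,
        # TPOT averages over the tokens after the first; undefined for a single token
        "tpot_s": (last - first) / (n - 1) if n > 1 else None,
        "total_latency_s": last - t.enqueued_at,
    }

timing = RequestTiming(enqueued_at=0.0, started_at=0.5,
                       token_times=[1.0, 1.25, 1.5, 1.75, 2.0])
metrics = llm_latency_metrics(timing)
```

Under this convention, total latency is queue wait plus TTFT plus TPOT multiplied by the number of tokens after the first.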

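3. Integrating the Prometheus Client

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Define metrics
REQUEST_LATENCY = Histogram('inference_latency_seconds', 'Inference latency distribution', ['model_name'])
REQUEST_COUNT = Counter('inference_requests_total', 'Total requests', ['status'])
# NPU gauges carry a device_id label so queries can group by (device_id)
NPU_UTIL_GAUGE = Gauge('npu_utilization_percent', 'Current NPU utilization', ['device_id'])
NPU_TEMP_GAUGE = Gauge('npu_temperature_celsius', 'Current NPU temperature', ['device_id'])
NPU_MEM_USED = Gauge('npu_memory_used_bytes', 'NPU memory in use', ['device_id'])
NPU_MEM_TOTAL = Gauge('npu_memory_total_bytes', 'Total NPU memory', ['device_id'])

def record_inference(latency_s, model_name, status='success'):
    # A metric declared with labels must be addressed through .labels() first
    REQUEST_LATENCY.labels(model_name=model_name).observe(latency_s)
    REQUEST_COUNT.labels(status=status).inc()

def publish_npu_metrics(m):
    # m is an NPUMetrics sample from the NPUCollector above
    dev = str(m.device_id)
    NPU_UTIL_GAUGE.labels(device_id=dev).set(m.utilization)
    NPU_TEMP_GAUGE.labels(device_id=dev).set(m.temperature_c)
    NPU_MEM_USED.labels(device_id=dev).set(m.memory_used_mb * 1024 * 1024)
    NPU_MEM_TOTAL.labels(device_id=dev).set(m.memory_total_mb * 1024 * 1024)

# Start an HTTP server for Prometheus to scrape
start_http_server(8000)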

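IV. Visualization and Alerting (Grafana + Alertmanager)

1. Grafana Dashboard Template

A practical starting point is the Node Exporter Full dashboard, extended with custom NPU panels:

Top Panel: real-time NPU status (Util, Temp, Power, Memory), shown in large type.
Middle Panel: inference latency distribution (P50/P90/P99) as a line chart.
Bottom Panel: QPS trend as a bar chart.
Alert Box: list of currently active alerts.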

2. PromQL Query Examples

# 1. Detect low NPU utilization (the device may be idle or faulty)
avg(npu_utilization_percent) by (device_id) < 20

# 2. Detect high temperature
npu_temperature_celsius > 80

# 3. Detect device memory close to exhaustion
sum(npu_memory_used_bytes) by (device_id) / sum(npu_memory_total_bytes) by (device_id) > 0.9

# 4. Detect P99 latency above the target
histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m])) > 0.2
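
Query 4 does not compute P99 from raw latencies. `histogram_quantile` estimates it from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The function below is a simplified Python re-implementation of that idea, for intuition only: real Prometheus operates on per-second rates and handles more edge cases, and the bucket values in the example are made up.

```python
import math
from typing import List, Tuple

def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The last bucket must have upper bound +Inf, as in Prometheus histograms.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        raise ValueError("the last bucket must have an upper bound of +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Target falls in the +Inf bucket: report the highest finite bound
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            if count == prev_count:
                return prev_bound
            # Linear interpolation inside the bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan

# Made-up example: 100 requests, cumulative counts per upper bound (seconds)
example = [(0.1, 50), (0.2, 90), (0.5, 99), (math.inf, 100)]
p99 = histogram_quantile(0.99, example)
```

The estimate can never be sharper than the bucket edges. If the alert threshold is 0.2 s, it helps to define a bucket boundary at or near 0.2 when creating the `Histogram`; to my knowledge the Python client's default buckets jump from 0.1 to 0.25.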

3. Alert Rule Configuration (`alerting_rules.yml`)

groups:
- name: ascend_npu_alerts
  rules:
  - alert: HighNPUTemperature
    expr: npu_temperature_celsius > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "NPU {{ $labels.device_id }} temperature is too high"
      description: "Temperature reached {{ $value }}°C; frequency throttling is possible."

  - alert: NPUOutOfMemory
    expr: (npu_memory_used_bytes / npu_memory_total_bytes) > 0.95
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "NPU {{ $labels.device_id }} device memory is about to run out"
      description: "In use: {{ $value | humanizePercentage }}"

  - alert: HighInferenceLatency
    expr: histogram_quantile(0.99, rate(inference_latency_seconds_bucket[5m])) > 0.5
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "Inference P99 latency is too high"
      description: "P99 latency is {{ $value }}s, above the 0.5s threshold."