Ascend NPU Monitoring and Observability: Making AI Infrastructure "Visible" (Full Version)

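I. Monitoring System Design: Three-Layer Architecture
A production Ascend NPU deployment needs a full-stack monitoring system that covers the whole chain, from the hardware at the bottom to the business metrics at the top.
1. Architecture overview
The three layers are hardware (NPU device metrics), inference (latency, throughput, and errors), and business (click-through rate and cache hit rate).
2. Core metric definitions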
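| Layer | Metric | Unit | Suggested threshold | Meaning |
|---|---|---|---|---|
| Hardware | npu_compute_util | % | > 80% sustained for 5 min | Whether compute is saturated |
| Hardware | npu_memory_used | MB | > 90% of total | Memory leak risk |
| Hardware | npu_temperature | °C | > 75°C | Thermal throttling risk |
| Hardware | npu_power | W | > TDP | Power cap triggered |
| Inference | inference_latency_p99 | ms | < SLA threshold | Upper bound on user-perceived latency |
| Inference | inference_qps | req/s | - | System throughput |
| Inference | inference_error_rate | % | < 0.1% | Stability |
| Business | recommendation_ctr | % | - | Business effectiveness |
| Business | embedding_cache_hit | % | > 95% | Cache efficiency |

For the hardware rows the threshold is the alert condition. For the inference and business rows it is the target the metric should stay within.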
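A threshold such as "> 80% sustained for 5 min" needs state: a single high sample should not fire, but an unbroken run of them should. The sketch below shows one way to express that, assuming samples arrive in time order. The class name `SustainedThreshold` and its parameters are hypothetical and not part of any Ascend tool.

```python
from typing import Optional


class SustainedThreshold:
    """Fires once a value has stayed above `limit` for `duration_s` seconds."""

    def __init__(self, limit: float, duration_s: float):
        self.limit = limit
        self.duration_s = duration_s
        # Timestamp of the first sample in the current unbroken breach.
        self.breach_start: Optional[float] = None

    def update(self, timestamp: float, value: float) -> bool:
        if value <= self.limit:
            # Any sample at or below the limit resets the breach.
            self.breach_start = None
            return False
        if self.breach_start is None:
            self.breach_start = timestamp
        return timestamp - self.breach_start >= self.duration_s


check = SustainedThreshold(limit=80.0, duration_s=300.0)
# One sample every 15 s, utilization pinned at 92% for five minutes.
results = [check.update(t, 92.0) for t in range(0, 301, 15)]
print(results[-1])  # True only for the sample taken at t=300
```

In a Prometheus setup the `for:` clause of an alerting rule plays the same role, so code like this is mainly useful for local checks or for scripts that run without Prometheus.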
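II. Implementation: A Complete Monitoring Collector
The NPUCollector and InferenceMetricsCollector you provided are a starting point, but production needs something more robust, for example asynchronous collection, error retries, and multi-device aggregation. The version below covers multi-device aggregation and per-device error handling.
1. Enhanced NPU collector (npumonitor.py)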
import json
import subprocess
import time
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional

import torch
import torch_npu  # noqa: F401  (importing it registers the torch.npu backend)


@dataclass
class NPUMetrics:
    """One sample of NPU metrics for a single device."""
    timestamp: float
    device_id: int
    compute_util: float      # AI Core utilization, %
    hbm_util: float          # HBM bandwidth utilization, %
    memory_used_mb: float
    memory_total_mb: float
    temperature_c: float
    power_w: float
    ecc_errors: int = 0      # ECC error count
    status: str = "Normal"   # health status


class AscendMonitor:
    """
    Ascend NPU metrics collector.

    Features:
    1. Two collection modes: npu-smi first, PyTorch API as a fallback
    2. Per-device error isolation, so one failing card does not abort the sweep
    3. Parses npu-smi output automatically
    4. Renders metrics in Prometheus text exposition format

    Collection is synchronous. Run collect_all() from a background thread
    if it must stay off the serving path.
    """

    def __init__(self, device_ids: Optional[List[int]] = None):
        if device_ids is None:
            device_ids = list(range(torch.npu.device_count()))
        self.device_ids = device_ids
        self.exporter_port = 9090

    def collect_all(self) -> List[Dict]:
        """Collect metrics from every device and return JSON-serializable dicts."""
        metrics_list = []
        for dev_id in self.device_ids:
            try:
                metrics_list.append(self._collect_device(dev_id))
            except Exception as e:
                print(f"[Error] Device {dev_id} collection failed: {e}")
                # Error entries carry no measurements, only the failure reason.
                metrics_list.append({
                    "device_id": dev_id,
                    "timestamp": time.time(),
                    "status": "Error",
                    "error": str(e),
                })
        return metrics_list

    def _collect_device(self, device_id: int) -> Dict:
        """Collect metrics for a single card."""
        # Preferred path: npu-smi.
        # Assumption carried over from the original code: this npu-smi build
        # accepts "-o json" and returns the field names used below. Verify the
        # flags and field names against your driver version before relying on it.
        try:
            cmd = ["npu-smi", "info", "-t", "all", "-i", str(device_id), "-o", "json"]
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=5)
            if result.returncode == 0:
                data = json.loads(result.stdout)
                info = data.get("devices", [{}])[0].get("device_info", {})
                return asdict(NPUMetrics(
                    timestamp=time.time(),
                    device_id=device_id,
                    compute_util=float(info.get("ai_core_util", 0)),
                    hbm_util=float(info.get("hbm_bandwidth_util", 0)),
                    # Dividing by 1024 assumes npu-smi reports memory in KB.
                    memory_used_mb=float(info.get("memory_used", 0)) / 1024,
                    memory_total_mb=float(info.get("memory_total", 0)) / 1024,
                    temperature_c=float(info.get("temperature", 0)),
                    power_w=float(info.get("power_consumption", 0)),
                    ecc_errors=int(info.get("ecc_error_count", 0)),
                    status="Normal",
                ))
            print(f"[Warning] npu-smi exited with code {result.returncode} "
                  f"for device {device_id}")
        except Exception as e:
            print(f"[Warning] npu-smi failed for device {device_id}: {e}")

        # Fallback: PyTorch API. It only exposes memory figures, and
        # memory_allocated() covers this process's tensors, not the whole card.
        try:
            torch.npu.set_device(device_id)
            props = torch.npu.get_device_properties(device_id)
            mem_alloc = torch.npu.memory_allocated() / 1024**2
            return asdict(NPUMetrics(
                timestamp=time.time(),
                device_id=device_id,
                compute_util=0.0,   # not available through PyTorch
                hbm_util=0.0,
                memory_used_mb=mem_alloc,
                memory_total_mb=props.total_memory / 1024**2,
                temperature_c=0.0,
                power_w=0.0,
                ecc_errors=0,
                status="Unknown",
            ))
        except Exception as e:
            raise RuntimeError(
                f"Failed to collect metrics for device {device_id}: {e}"
            ) from e

    def prometheus_metrics(self) -> str:
        """Render all device metrics as Prometheus exposition text."""
        # Exported metric name suffix -> key in the collected dict.
        gauges = {
            "compute_util": "compute_util",
            "hbm_util": "hbm_util",
            "memory_used": "memory_used_mb",
            "memory_total": "memory_total_mb",
            "temperature": "temperature_c",
            "power": "power_w",
            "ecc_errors": "ecc_errors",
        }
        prefix = "ascend_npu_"
        lines = []
        for m in self.collect_all():
            label = f'device="{m["device_id"]}"'
            for name, key in gauges.items():
                if key in m:  # skipped for error entries
                    lines.append(f"{prefix}{name}{{{label}}} {m[key]}")
            # 1 only when the device reported a Normal status, otherwise 0.
            healthy = 1 if m.get("status") == "Normal" else 0
            lines.append(f'{prefix}status{{{label},state="normal"}} {healthy}')
        return "\n".join(lines) + "\n"


# Usage example: print one scrape's worth of metrics
if __name__ == "__main__":
    monitor = AscendMonitor()
    print(monitor.prometheus_metrics())
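The collector stores an `exporter_port` but only prints the metrics text, so something still has to serve it for Prometheus to scrape. The sketch below is one minimal way to do that with the standard library alone. `make_handler`, `serve`, and `demo_render` are hypothetical names, and `demo_render` stands in for `AscendMonitor().prometheus_metrics` so the example runs without NPU hardware.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def make_handler(render: Callable[[], str]):
    """Build a request handler that serves render() at /metrics."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/metrics":
                self.send_error(404)
                return
            body = render().encode("utf-8")
            self.send_response(200)
            # Content type of the Prometheus text exposition format.
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep periodic scrape requests out of stderr

    return MetricsHandler


def serve(render: Callable[[], str], port: int = 9090) -> None:
    """Block forever, answering scrapes on the given port."""
    HTTPServer(("0.0.0.0", port), make_handler(render)).serve_forever()


def demo_render() -> str:
    # Stand-in for AscendMonitor().prometheus_metrics (hypothetical sample value).
    return 'ascend_npu_temperature{device="0"} 55.0\n'


# serve(demo_render)  # blocking call; run it under systemd or in a DaemonSet pod
```

Each scrape triggers a fresh collection. With a 15 s scrape interval and a 5 s timeout per `npu-smi` call that is usually acceptable, but on nodes with many cards it may be worth caching the last result.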
2. Inference performance monitor (inference_monitor.py)
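import time
from collections import deque
from typing import Deque

import numpy as np


class InferenceMonitor:
    """
    Real-time inference performance monitor.

    Keeps a sliding window of the most recent `window_size` seconds in
    bounded deques, so memory use cannot grow without limit.
    """

    def __init__(self, window_size: int = 60, max_samples: int = 10000):
        self.window_size = window_size
        # If traffic exceeds max_samples per window, the oldest samples are
        # dropped early and QPS is underestimated; raise max_samples if needed.
        self.latencies: Deque[float] = deque(maxlen=max_samples)
        self.timestamps: Deque[float] = deque(maxlen=max_samples)
        self.outcomes: Deque[bool] = deque(maxlen=max_samples)
        self.success_count = 0   # lifetime totals, not windowed
        self.error_count = 0
        self.start_time = time.time()

    def record(self, latency_ms: float, success: bool = True):
        """Record one inference result."""
        now = time.time()
        self.latencies.append(latency_ms)
        self.timestamps.append(now)
        self.outcomes.append(success)
        if success:
            self.success_count += 1
        else:
            self.error_count += 1
        self._evict(now)

    def _evict(self, now: float):
        """Drop samples that have fallen out of the window."""
        cutoff = now - self.window_size
        while self.timestamps and self.timestamps[0] < cutoff:
            self.latencies.popleft()
            self.timestamps.popleft()
            self.outcomes.popleft()

    def get_stats(self) -> dict:
        """Return statistics for the current window."""
        now = time.time()
        self._evict(now)
        if not self.latencies:
            return {"status": "no_data"}
        lats = np.array(self.latencies)
        total_req = len(lats)
        # Errors are counted inside the window, so each request counts once.
        window_errors = total_req - sum(self.outcomes)
        error_rate = window_errors / total_req
        # Until a full window has elapsed, divide by the actual uptime.
        elapsed = min(self.window_size, max(now - self.start_time, 1e-9))
        return {
            "qps": round(total_req / elapsed, 2),
            "latency_p50": round(float(np.percentile(lats, 50)), 2),
            "latency_p90": round(float(np.percentile(lats, 90)), 2),
            "latency_p95": round(float(np.percentile(lats, 95)), 2),
            "latency_p99": round(float(np.percentile(lats, 99)), 2),
            "latency_mean": round(float(lats.mean()), 2),
            "latency_std": round(float(lats.std()), 2),
            "error_rate": round(error_rate * 100, 2),
            "uptime_sec": round(now - self.start_time, 0),
        }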
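The dashboard and alert queries later in this article read `inference_latency_bucket`, which is a Prometheus histogram series, while `get_stats()` returns precomputed percentiles. To feed those queries, the inference service has to publish cumulative bucket counts. The sketch below shows the bookkeeping in plain Python. The class name `LatencyHistogram` and the bucket bounds are assumptions chosen for illustration, and a real service would more likely use the `Histogram` type from the official Prometheus client library.

```python
from bisect import bisect_left
from typing import List, Sequence

# Hypothetical bucket upper bounds in milliseconds.
BUCKETS_MS = (5, 10, 20, 50, 100, 200)


class LatencyHistogram:
    """Counts observations per bucket and renders cumulative `le` series."""

    def __init__(self, bounds: Sequence[float] = BUCKETS_MS):
        self.bounds = tuple(bounds)
        # One counter per finite bucket plus a final slot for +Inf.
        self.counts: List[int] = [0] * (len(self.bounds) + 1)
        self.total_ms = 0.0

    def observe(self, latency_ms: float) -> None:
        # bisect_left returns the first bound >= latency_ms, which is the
        # smallest bucket whose inclusive upper bound contains the value.
        self.counts[bisect_left(self.bounds, latency_ms)] += 1
        self.total_ms += latency_ms

    def render(self, name: str = "inference_latency") -> str:
        lines = []
        cumulative = 0
        for bound, count in zip(self.bounds, self.counts):
            cumulative += count
            lines.append(f'{name}_bucket{{le="{bound}"}} {cumulative}')
        cumulative += self.counts[-1]
        lines.append(f'{name}_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"{name}_sum {self.total_ms}")
        lines.append(f"{name}_count {cumulative}")
        return "\n".join(lines) + "\n"


hist = LatencyHistogram()
for latency in (3, 5, 7, 60, 500):
    hist.observe(latency)
print(hist.render())
```

Bucket bounds deserve some thought: `histogram_quantile` can only interpolate between bounds, so placing one near the SLA threshold keeps the P99 estimate meaningful around the value that triggers the alert.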
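III. Grafana Dashboard Configuration
Import the following JSON into Grafana to get an Ascend NPU monitoring dashboard. The outer "dashboard" key is the shape used by Grafana's HTTP API (POST /api/dashboards/db); when importing through the UI, paste the inner object. The memory gauge's maximum of 4096 assumes a card with 4 GB of device memory, so adjust it and the two thresholds to match your hardware.
1. Grafana Dashboard JSON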
{
  "dashboard": {
    "title": "Ascend NPU Inference Monitoring Dashboard",
    "tags": ["ascend", "npu", "ai"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "NPU Real-Time Load",
        "type": "timeseries",
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "ascend_npu_compute_util{job=\"ascend\"}",
            "legendFormat": "{{device}} AI Core Util",
            "color": "#E0B400"
          },
          {
            "expr": "ascend_npu_hbm_util{job=\"ascend\"}",
            "legendFormat": "{{device}} HBM Bandwidth",
            "color": "#FF9830"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "max": 100,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 80},
                {"color": "red", "value": 95}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Device Memory Usage",
        "type": "gauge",
        "gridPos": {"x": 12, "y": 0, "w": 12, "h": 8},
        "targets": [
          {
            "expr": "ascend_npu_memory_used{job=\"ascend\"}",
            "legendFormat": "{{device}} Used",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "mbytes",
            "decimals": 1,
            "max": 4096,
            "thresholds": {
              "mode": "absolute",
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 3000},
                {"color": "red", "value": 3800}
              ]
            }
          }
        }
      },
      {
        "id": 3,
        "title": "Inference Latency (P99)",
        "type": "stat",
        "gridPos": {"x": 0, "y": 8, "w": 8, "h": 6},
        "targets": [
          {
            "expr": "histogram_quantile(0.99, rate(inference_latency_bucket[5m]))",
            "legendFormat": "P99 Latency",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "ms",
            "color": {"mode": "thresholds"},
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "orange", "value": 20},
                {"color": "red", "value": 50}
              ]
            }
          }
        }
      },
      {
        "id": 4,
        "title": "Temperature and Power Trends",
        "type": "timeseries",
        "gridPos": {"x": 8, "y": 8, "w": 16, "h": 6},
        "targets": [
          {
            "expr": "ascend_npu_temperature{job=\"ascend\"}",
            "legendFormat": "{{device}} Temp (°C)",
            "color": "#F2495C"
          },
          {
            "expr": "ascend_npu_power{job=\"ascend\"}",
            "legendFormat": "{{device}} Power (W)",
            "color": "#5794F2"
          }
        ]
      },
      {
        "id": 5,
        "title": "System Health Status",
        "type": "table",
        "gridPos": {"x": 0, "y": 14, "w": 24, "h": 6},
        "targets": [
          {
            "expr": "ascend_npu_status{state=\"normal\"}",
            "format": "table",
            "instant": true
          }
        ],
        "transformations": [
          {
            "id": "organize",
            "options": {
              "excludeByName": {},
              "indexByName": {}
            }
          }
        ]
      }
    ],
    "refresh": "5s",
    "time": {
      "from": "now-1h",
      "to": "now"
    }
  }
}
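IV. Alert Rule Configuration (Alertmanager)
When a metric goes abnormal, someone must be notified right away. The following are example Prometheus alerting rules: Prometheus evaluates them and hands firing alerts to Alertmanager for routing and notification. The rules assume that the exporter exposes ascend_npu_memory_total and ascend_npu_ecc_errors, and that the inference service publishes an inference_latency histogram.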
groups:
  - name: ascend_npu_alerts
    rules:
      # 1. NPU temperature too high
      #    (Warning tier; Section V reserves Critical for > 80°C)
      - alert: NpuTemperatureHigh
        expr: ascend_npu_temperature > 75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NPU {{ $labels.device }} temperature is too high"
          description: "Device {{ $labels.device }} reached {{ $value }}°C, above the 75°C threshold"

      # 2. Device memory usage too high
      - alert: NpuMemoryUsageHigh
        expr: ascend_npu_memory_used / ascend_npu_memory_total > 0.9
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "NPU {{ $labels.device }} device memory is nearly exhausted"
          description: "Device memory usage is {{ $value | humanizePercentage }}; risk of OOM"

      # 3. Inference latency over the limit
      - alert: InferenceLatencyP99High
        expr: histogram_quantile(0.99, rate(inference_latency_bucket[5m])) > 50
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Inference P99 latency exceeds 50ms"
          description: "Current P99 latency is {{ $value }}ms; the SLA requires < 50ms"

      # 4. ECC error count increased
      - alert: NpuEccError
        expr: increase(ascend_npu_ecc_errors[1h]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "NPU {{ $labels.device }} reported ECC errors"
          description: "{{ $value }} new ECC errors in the past hour; possible hardware fault"
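The latency rule relies on `histogram_quantile`, which estimates a quantile from bucket counts instead of reading it from raw samples. The sketch below is a simplified re-implementation for intuition only, assuming cumulative counts sorted by upper bound and a lowest bucket that starts at 0. The function name mirrors PromQL, but the code and the sample counts are illustrative and are not Prometheus's actual implementation.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in +Inf: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for 1000 requests (bounds in ms).
sample = [(10, 900), (20, 980), (50, 995), (100, 1000), (math.inf, 1000)]
p99 = histogram_quantile(0.99, sample)
print(p99)  # 40.0: rank 990 lies 10/15 of the way through the 20-50 ms bucket
```

The estimate can never be finer than the bucket layout. If the highest finite bound sat at 50 ms, the rule's `> 50` condition could never fire, so bucket bounds should extend well past the SLA.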
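V. Deployment and Best Practices
1. Deployment options
- Kubernetes:
  - Deploy node-exporter and ascend-monitor as a DaemonSet.
  - Use Prometheus Operator for automatic service discovery.
  - Manage versions with a Helm Chart.
- Bare metal:
  - Run prometheus and grafana under systemd.
  - Configure a cron job that calls npu-smi and writes the output to a log file.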
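2. Best practices
- Sampling frequency: collect NPU metrics every 15s and inference metrics every 1s.
- Data retention: keep short-term data (7 days) in Prometheus and long-term data (1 year) in Thanos or S3.
- Alert tiers:
  - Critical (phone/SMS): temperature > 80°C, ECC errors, service unavailable.
  - Warning (DingTalk/WeCom): temperature > 75°C, device memory > 90%, latency above SLA.
  - Info (log): normal fluctuations, version updates.
- Automated remediation:
  - Temperature too high -> trigger a frequency-reduction script automatically.
  - OOM detected -> restart the inference service automatically.
  - Deadlock detected -> kill the process and recover automatically.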
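The alert tiers above can be written down as a small decision function, which is handy for unit-testing the policy before encoding it in Alertmanager routes. This is a minimal sketch: the `Sample` fields, the `severity` function, and the `CHANNELS` mapping are hypothetical names, and the thresholds are the ones listed in the tiers.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One evaluation snapshot (field names are illustrative)."""
    temperature_c: float
    memory_ratio: float      # used / total, 0.0 to 1.0
    new_ecc_errors: int      # ECC errors added since the last check
    p99_ms: float
    sla_ms: float
    service_up: bool = True


# Notification channel per tier, as listed in the best practices.
CHANNELS = {
    "critical": "phone/SMS",
    "warning": "DingTalk/WeCom",
    "info": "log",
}


def severity(s: Sample) -> str:
    """Return the highest tier whose condition the sample meets."""
    if s.temperature_c > 80 or s.new_ecc_errors > 0 or not s.service_up:
        return "critical"
    if s.temperature_c > 75 or s.memory_ratio > 0.9 or s.p99_ms > s.sla_ms:
        return "warning"
    return "info"


hot = Sample(temperature_c=78.0, memory_ratio=0.5, new_ecc_errors=0,
             p99_ms=30.0, sla_ms=50.0)
print(severity(hot), "->", CHANNELS[severity(hot)])
```

Checking the critical conditions first means a sample that meets both tiers is reported once, at the higher tier.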
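Conclusion:
Monitoring is not the end goal; it is the starting point for continuous optimization. With the data visualized, you can:
- Find performance bottlenecks (for example, a card with low utilization).
- Predict hardware faults (for example, a temperature that keeps climbing).
- Optimize resource cost (for example, adjusting batch size dynamically based on QPS).
Only when AI infrastructure is "visible" can an AI production environment be truly stable, efficient, and low-cost.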