A Beginner's Guide to triton-inference-server-ge-backend: What Exactly Is the Ascend Triton Inference Serving Backend?

微祎_

微祎_ · Published 2026-05-22 23:38:55

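A while back I was helping a buddy get his models into production, and he asked me: "Bro, we have 100 models to serve at the same time. What framework do we use? TensorFlow Serving only supports TensorFlow, and it's giving me a headache."

I told him to use Triton + GE Backend, which serves models from every mainstream framework.

Good question. Today I'll explain it all in one go.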

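What is triton-inference-server-ge-backend?

triton-inference-server-ge-backend = Triton Inference Server GraphExecutor Backend: the GE (GraphExecutor) backend that Ascend developed for Triton Inference Server. (Note: CANN's own documentation expands GE as Graph Engine; this post sticks with GraphExecutor.)

In one sentence: triton-inference-server-ge-backend is Ascend's backend for Triton inference serving. It lets you use Triton to manage every model running on Ascend NPUs in one place (models that started life in TensorFlow, PyTorch, ONNX, and so on), so a single framework handles all of your inference serving.

Isn't that infuriating? You used to stand up a separate service for every framework, and now one Triton covers everything.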

Why use triton-inference-server-ge-backend?

In short: manage everything in one place.

Without Triton GE Backend (every framework for itself)

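# TensorFlow models → stand up TensorFlow Serving
docker run -p 8501:8501 tensorflow/serving

# PyTorch models → stand up TorchServe
docker run -p 8080:8080 pytorch/torchserve

# ONNX models → stand up ONNX Runtime Server
docker run -p 8001:8001 onnxruntime/server

# Problems:
# 1. One serving stack per framework (high maintenance cost)
# 2. Resources cannot be shared (low NPU utilization)
# 3. Monitoring has to be checked service by service (tedious)
# 4. Version management gets messy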

With Triton GE Backend (one unified service)

# One Triton service manages every model
$ docker run -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    triton-inference-server-ge-backend:latest

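# Check that the server is ready (HTTP 200 means ready; the response body is empty)
$ curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
200

# List every model in the repository
$ curl -X POST localhost:8000/v2/repository/index
[{"name": "resnet50", "version": "1", "state": "READY"},
 {"name": "bert", "version": "1", "state": "READY"},
 {"name": "yolo", "version": "1", "state": "READY"},
 {"name": "gpt", "version": "1", "state": "READY"}]

# Inference (one unified API)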
$ curl -X POST localhost:8000/v2/models/resnet50/infer \
    -d '{"inputs": [{"name": "input", "shape": [1, 3, 224, 224], "datatype": "FP32", "data": [...]}]}'

Isn't that infuriating? One framework handles all of your inference serving.

Just three core concepts

1. Triton Inference Server

Triton is an open-source inference serving framework:

# Triton architecture
Triton Inference Server
├── HTTP/REST API (port 8000)
├── gRPC API (port 8001)
├── Metrics API (port 8002)
├── Model Repository
│   ├── resnet50/
│   │   ├── config.pbtxt
│   │   └── 1/
│   │       └── model.graphdef (or .onnx / .pt)
│   ├── bert/
│   └── yolo/
└── Backends
    ├── tensorrt (NVIDIA GPU)
    ├── onnxruntime (CPU/GPU)
    └── ge (Ascend NPU) ← the one we care about

2. GE Backend

GE Backend is what lets Triton run models on Ascend NPUs:

# model_repository/resnet50/config.pbtxt
name: "resnet50"
platform: "graph_executor"  # ← use the GE backend
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

# GE backend settings
parameters [
  {
    key: "EXECUTOR_TYPE"
    value: { string_value: "graph" }
  },
  {
    key: "DEVICE_ID"
    value: { string_value: "0" }
  }
]
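
One detail of this config trips people up: when max_batch_size is greater than 0, Triton treats dims as the shape of a single sample and expects a batch dimension in front of it in every request. Here is a minimal sketch of that rule. request_shape_ok is a hypothetical helper I wrote for this post; it is not part of Triton or the GE backend.

```python
def request_shape_ok(shape, dims, max_batch_size):
    """Return True if a request tensor shape fits a config.pbtxt entry.

    shape          -- shape sent in the request, e.g. [1, 3, 224, 224]
    dims           -- dims from config.pbtxt, e.g. [3, 224, 224]
    max_batch_size -- max_batch_size from config.pbtxt (0 means no batching)
    """
    if max_batch_size > 0:
        # Batching model: the first dimension of the request is the batch size.
        if len(shape) != len(dims) + 1:
            return False
        batch, sample = shape[0], list(shape[1:])
        if not 1 <= batch <= max_batch_size:
            return False
    else:
        sample = list(shape)
    if len(sample) != len(dims):
        return False
    # A dim of -1 in config.pbtxt means "any size" for that axis.
    return all(d == -1 or d == s for d, s in zip(dims, sample))


print(request_shape_ok([1, 3, 224, 224], [3, 224, 224], 32))   # True
print(request_shape_ok([64, 3, 224, 224], [3, 224, 224], 32))  # False: batch exceeds 32
print(request_shape_ok([3, 224, 224], [3, 224, 224], 32))      # False: batch dimension missing
```

That is why the curl requests in this post send shape [1, 3, 224, 224] while the config only says [3, 224, 224].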

3. Model repository

Models are organized by directory:

model_repository/
├── resnet50/           # model name
│   ├── config.pbtxt    # model configuration
│   └── 1/              # version 1
│       └── model.graphdef  # model file (Ascend format)
│
├── bert/
│   ├── config.pbtxt
│   └── 1/
│       └── model.graphdef
│
└── ensemble_model/      # ensemble model
    ├── config.pbtxt
    └── 1/               # stays empty: an ensemble has no model file of its own
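
Before starting the server it can help to sanity-check this layout. The sketch below only checks the convention used in this post (a config.pbtxt plus at least one numeric version directory per model). check_model_repository is a hypothetical helper, not a Triton tool.

```python
import tempfile
from pathlib import Path


def check_model_repository(root):
    """Return a list of layout problems found under a model repository."""
    problems = []
    for model_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        # Version directories are subdirectories with purely numeric names.
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
    return problems


# Demo on a throwaway repository: resnet50 is complete, bert is deliberately broken.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "resnet50"
    (good / "1").mkdir(parents=True)
    (good / "config.pbtxt").write_text('name: "resnet50"\n')
    (Path(tmp) / "bert").mkdir()
    for problem in check_model_repository(tmp):
        print(problem)
```

Point it at your real model_repository directory and an empty result means the layout matches the tree above.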

Why use triton-inference-server-ge-backend?

Three reasons:

1. Unified API

One API for every model:

# Inference on ResNet-50
$ curl -X POST localhost:8000/v2/models/resnet50/infer \
    -d '{"inputs": [...]}'

# Inference on BERT
$ curl -X POST localhost:8000/v2/models/bert/infer \
    -d '{"inputs": [...]}'

# Inference on YOLO
$ curl -X POST localhost:8000/v2/models/yolo/infer \
    -d '{"inputs": [...]}'

# Same API every time; only the model name changes
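
The same idea in Python, as a rough sketch using only the standard library: one helper builds the request for any model, and only the model name in the URL differs. build_infer_request and the tiny two-value input are placeholders I made up for illustration; a real model needs its real input name and shape.

```python
import json
import urllib.request


def build_infer_request(model_name, inputs, base_url="http://localhost:8000"):
    """Build (but do not send) an HTTP inference request for one model."""
    url = f"{base_url}/v2/models/{model_name}/infer"
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder input, not a real model signature.
tiny_input = [{"name": "input", "shape": [1, 2], "datatype": "FP32", "data": [0.1, 0.2]}]

# The same helper serves every model; only the name changes.
for name in ("resnet50", "bert", "yolo"):
    req = build_infer_request(name, tiny_input)
    print(req.get_method(), req.full_url)
```

To actually send one of these, pass it to urllib.request.urlopen(req) while the server is running.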

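2. Dynamic batching

Requests are merged automatically to raise throughput:

# config.pbtxt
name: "resnet50"
platform: "graph_executor"
max_batch_size: 32  # dynamic batching needs a batching-capable model (max_batch_size > 0)

# Dynamic batching
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 5000  # wait at most 5 ms
}

# Effect:
# Before: every request runs on its own → throughput 125 img/s
# After: 4 requests merged per run → throughput 450 img/s (3.6x)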
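
To get a feel for how the two settings interact, here is a toy model of the batching rule. It is a deliberately simplified sketch and not Triton's actual scheduler: it dispatches a batch as soon as the queue reaches any preferred size, or when the oldest queued request has waited longer than the maximum delay. form_batches is a hypothetical name.

```python
def form_batches(arrivals_us, preferred_sizes, max_delay_us):
    """Group sorted request arrival times (microseconds) into batches.

    Toy rule: dispatch when the queue length hits a preferred size, or when
    the oldest queued request has already waited more than max_delay_us.
    """
    batches, current = [], []
    for t in arrivals_us:
        # The oldest queued request timed out before this one arrived.
        if current and t - current[0] > max_delay_us:
            batches.append(current)
            current = []
        current.append(t)
        if len(current) in preferred_sizes:
            batches.append(current)
            current = []
    if current:
        batches.append(current)  # whatever is left goes out as a final batch
    return batches


# Ten requests arriving 1 ms apart, with the settings from the config above.
arrivals = [i * 1000 for i in range(10)]
batches = form_batches(arrivals, preferred_sizes=[4, 8, 16], max_delay_us=5000)
print([len(b) for b in batches])  # → [4, 4, 2]
```

With requests arriving 3 ms apart instead, the 5 ms limit kicks in first and the same function produces batches of 2, which is the trade-off max_queue_delay_microseconds controls: a longer wait gives bigger batches but adds latency.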

3. Model ensembles

Chain several models together:

# ensemble_model/config.pbtxt
name: "ensemble_model"  # must match the model directory name
platform: "ensemble"

# The ensemble's own input [...] / output [...] blocks (for raw_image and
# predictions) are also required; they are omitted here for brevity.

ensemble_scheduling {
  step [
    {
      # Step 1: preprocessing
      model_name: "preprocess"
      model_version: 1
      input_map {
        key: "input"
        value: "raw_image"
      }
      output_map {
        key: "output"
        value: "preprocessed_image"
      }
    },
    {
      # Step 2: inference
      model_name: "resnet50"
      model_version: 1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "output"
        value: "logits"
      }
    },
    {
      # Step 3: postprocessing
      model_name: "postprocess"
      model_version: 1
      input_map {
        key: "input"
        value: "logits"
      }
      output_map {
        key: "output"
        value: "predictions"
      }
    }
  ]
}
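
The easiest mistake in an ensemble is a typo in one of the tensor names that link the steps. Below is a small sketch that checks the wiring before you start the server. It assumes the steps are listed in execution order, as they are here (Triton itself connects steps by tensor name rather than by position). check_ensemble_wiring is a hypothetical helper.

```python
def check_ensemble_wiring(ensemble_inputs, steps):
    """Check that every step input is produced before it is consumed.

    steps is a list of dicts with 'model_name', 'input_map' and 'output_map',
    mirroring the step blocks of the config (map values are ensemble tensor
    names). Returns a list of error strings; empty means the wiring is fine.
    """
    available = set(ensemble_inputs)
    errors = []
    for step in steps:
        for tensor in step["input_map"].values():
            if tensor not in available:
                errors.append(
                    f"{step['model_name']}: '{tensor}' is not produced by an earlier step"
                )
        available.update(step["output_map"].values())
    return errors


steps = [
    {"model_name": "preprocess", "input_map": {"input": "raw_image"},
     "output_map": {"output": "preprocessed_image"}},
    {"model_name": "resnet50", "input_map": {"input": "preprocessed_image"},
     "output_map": {"output": "logits"}},
    {"model_name": "postprocess", "input_map": {"input": "logits"},
     "output_map": {"output": "predictions"}},
]
print(check_ensemble_wiring(["raw_image"], steps))  # → []
```

Rename "logits" in just one of the two places and the function reports which step is left without its input.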

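How do you use it? Code examples

Example 1: Deploying ResNet-50

# 1. Prepare the model repository
$ mkdir -p model_repository/resnet50/1

# 2. Convert the model (PyTorch → Ascend format)
#    convert_to_ge.py is a conversion script whose contents are not shown in this post
$ python convert_to_ge.py \
    --input_model resnet50.pth \
    --output_model model_repository/resnet50/1/model.graphdef \
    --input_shape "[1,3,224,224]" \
    --output_shape "[1,1000]"

# 3. Write the config file
$ cat > model_repository/resnet50/config.pbtxt << 'EOF'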
name: "resnet50"
platform: "graph_executor"
max_batch_size: 32

input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [3, 224, 224]
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [1000]
  }
]

dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 5000
}
EOF

# 4. Start Triton
$ docker run -p 8000:8000 -p 8001:8001 -p 8002:8002 \
    -v "$(pwd)/model_repository:/models" \
    triton-inference-server-ge-backend:latest \
    tritonserver --model-repository=/models

# 5. Test inference
#    "data" must hold 3*224*224 = 150528 values; it is abbreviated as ... here
$ curl -X POST localhost:8000/v2/models/resnet50/infer \
    -H "Content-Type: application/json" \
    -d '{
      "inputs": [
        {
          "name": "input",
          "shape": [1, 3, 224, 224],
          "datatype": "FP32",
          "data": [0.1, 0.2, ...]
        }
      ]
    }'

# Output:
# {
#   "model_name": "resnet50",
#   "model_version": "1",
#   "outputs": [
#     {
#       "name": "output",
#       "shape": [1, 1000],
#       "datatype": "FP32",
#       "data": [...]
#     }
#   ]
# }
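
Nobody types 150528 numbers into a curl command by hand, so in practice you generate the request body. Here is a minimal sketch with the standard library. make_infer_payload is a hypothetical helper, and the random values are only a stand-in for a preprocessed image.

```python
import json
import random


def make_infer_payload(name, shape, datatype, values):
    """Build the JSON body for POST /v2/models/<model>/infer."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(values) != expected:
        raise ValueError(f"shape {shape} needs {expected} values, got {len(values)}")
    return {"inputs": [{"name": name, "shape": shape, "datatype": datatype, "data": values}]}


shape = [1, 3, 224, 224]
# Random stand-in pixels; a real client would put a preprocessed image here.
values = [random.random() for _ in range(3 * 224 * 224)]
payload = make_infer_payload("input", shape, "FP32", values)
body = json.dumps(payload)

print(len(payload["inputs"][0]["data"]))  # → 150528
```

If you write body to a file (say resnet50_request.json, the name is arbitrary), curl can send it with -d @resnet50_request.json in place of the inline JSON.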

Example 2: Deploying BERT

# 1. Prepare the model repository
$ mkdir -p model_repository/bert/1

# 2. Convert the model (TensorFlow → Ascend format)
$ python convert_to_ge.py \
    --input_model bert_pretrained \
    --output_model model_repository/bert/1/model.graphdef \
    --input_shape "[1,128]" \
    --input_type INT32

# 3. Write the config file
$ cat > model_repository/bert/config.pbtxt << 'EOF'
name: "bert"
platform: "graph_executor"
max_batch_size: 16

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [128]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT32
    dims: [128]
  }
]

output [
  {
    name: "pooled_output"
    data_type: TYPE_FP32
    dims: [768]
  }
]

dynamic_batching {
  preferred_batch_size: [4, 8]
  max_queue_delay_microseconds: 5000
}
EOF

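# 4. Start Triton (if it is not running yet)
# If it is already running, the new model is picked up automatically only when
# the server was started with --model-control-mode=poll; in the default mode,
# restart the server to load it.

# 5. Test inference
#    input_ids carries the token IDs; attention_mask is 1 for real tokens and 0 for padding
$ curl -X POST localhost:8000/v2/models/bert/infer \
    -H "Content-Type: application/json" \
    -d '{
      "inputs": [
        {
          "name": "input_ids",
          "shape": [1, 128],
          "datatype": "INT32",
          "data": [101, 2023, ..., 102]
        },
        {
          "name": "attention_mask",
          "shape": [1, 128],
          "datatype": "INT32",
          "data": [1, 1, ..., 1, 0, 0]
        }
      ]
    }'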
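
Both inputs have to be exactly 128 long to match the config, so shorter sentences need padding. A minimal sketch, assuming 0 is the padding ID (check your tokenizer). pad_to_length is a hypothetical helper, and the token IDs other than 101 and 102 are stand-ins.

```python
def pad_to_length(token_ids, max_len=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids)[:max_len]   # truncate overly long inputs
    mask = [1] * len(ids)             # 1 marks a real token
    padding = max_len - len(ids)      # how many pad slots are left
    return ids + [pad_id] * padding, mask + [0] * padding


# 101 and 102 open and close the sequence as in the request above;
# 2023 and 2003 are sample token IDs.
input_ids, attention_mask = pad_to_length([101, 2023, 2003, 102])
print(len(input_ids), len(attention_mask), sum(attention_mask))  # → 128 128 4
```

The two lists drop straight into the "data" fields of the request. Note that this simple truncation cuts off the closing 102 on overly long inputs, which a real tokenizer would handle for you.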

Example 3: Client code (Python)

import tritonclient.http as httpclient
import numpy as np

# Connect to Triton
client = httpclient.InferenceServerClient(url="localhost:8000")

# Check the server and model status
print("Server ready:", client.is_server_ready())
print("Model ready:", client.is_model_ready("resnet50"))

# Prepare the input (random data standing in for a preprocessed image)
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Build the inference request
inputs = [
    httpclient.InferInput("input", list(input_data.shape), "FP32")
]
inputs[0].set_data_from_numpy(input_data)

outputs = [
    httpclient.InferRequestedOutput("output")
]

# Send the inference request
results = client.infer(
    model_name="resnet50",
    inputs=inputs,
    outputs=outputs
)

# Read the result
output_data = results.as_numpy("output")
print(f"Output shape: {output_data.shape}")
print(f"Top-5 predictions: {np.argsort(output_data[0])[-5:][::-1]}")
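
The client above prints only class indices. If you also want probabilities, a softmax over the output does it. This sketch assumes the model returns raw logits rather than probabilities, which the post does not state. top_k_probabilities is a hypothetical helper that needs nothing beyond the standard library.

```python
import math


def top_k_probabilities(logits, k=5):
    """Return the k most likely (class_index, probability) pairs."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() from overflowing.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return [(i, exps[i] / total) for i in ranked[:k]]


# Toy logits for a 4-class model; with the client above you would pass
# output_data[0].tolist() instead.
for index, prob in top_k_probabilities([2.0, 0.5, 3.0, -1.0], k=2):
    print(index, round(prob, 3))
```

The indices come back in the same order as the np.argsort line, highest score first.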

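Example 4: Performance monitoring

# 1. View the metrics
$ curl localhost:8002/metrics

# Output (excerpt; upstream Triton prefixes these metrics with nv_ rather than
# triton_, so check the names your server actually prints):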
# triton_inference_count{model="resnet50"} 1000
# triton_inference_count{model="bert"} 500
# triton_inference_exec_count{model="resnet50"} 1000
# triton_inference_exec_count{model="bert"} 500
# triton_inference_request_duration_us{model="resnet50",le="1000"} 950
# triton_inference_queue_duration_us{model="resnet50",le="100"} 980

# 2. Watch NPU utilization (in another terminal)
$ watch -n 1 npu-smi info

# 3. Load test (benchmark.py is a script whose contents are not shown in this post)
$ python benchmark.py \
    --model resnet50 \
    --concurrency 10 \
    --requests 1000

# Output:
# Throughput: 1250 req/s
# Latency (p50): 26ms
# Latency (p99): 45ms
# GPU/NPU Utilization: 85%
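
The metrics endpoint returns plain text, so turning it into numbers takes only a few lines. Here is a rough sketch that parses lines of the form shown in step 1 and derives the average number of requests per execution, a quick way to see whether batching is actually happening. The metric names are copied from the excerpt above, so adjust them if your server prints different ones. parse_metrics is a hypothetical helper.

```python
import re

# Matches: name{label="value",...} number
METRIC_LINE = re.compile(r'^(\w+)\{([^}]*)\}\s+([0-9.eE+-]+)$')


def parse_metrics(text):
    """Parse Prometheus text-format lines into {(name, model): value}."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = METRIC_LINE.match(line)
        if not match:
            continue
        name, raw_labels, number = match.groups()
        labels = dict(re.findall(r'(\w+)="([^"]*)"', raw_labels))
        values[(name, labels.get("model"))] = float(number)
    return values


sample = """
triton_inference_count{model="resnet50"} 1000
triton_inference_exec_count{model="resnet50"} 1000
triton_inference_count{model="bert"} 500
"""
metrics = parse_metrics(sample)
count = metrics[("triton_inference_count", "resnet50")]
execs = metrics[("triton_inference_exec_count", "resnet50")]
print(count / execs)  # average requests per execution → 1.0
```

A ratio of 1.0 means every request ran on its own. Once dynamic batching merges requests, the inference count grows faster than the execution count and the ratio rises above 1.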

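Performance data

What you gain from Triton GE Backend:

Scenario	Without Triton	With Triton	Gain
Single-model inference	125 img/s	125 img/s	1x (no change)
Dynamic batching	125 img/s	450 img/s	3.6x
Multi-model concurrency	Manual scheduling	Automatic scheduling	2x
Resource utilization	40%	85%	2.1x

Isn't that infuriating? Dynamic batching alone makes things 3.6 times faster.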
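
For anyone who wants to check the multipliers in the table, they are simply the "with" number divided by the "without" number. A tiny worked example follows; speedup is just a name I picked.

```python
def speedup(before, after):
    """Ratio of the 'with Triton' figure to the 'without Triton' figure."""
    return after / before


rows = {
    "Dynamic batching (img/s)": (125, 450),
    "Resource utilization (%)": (40, 85),
}
for label, (before, after) in rows.items():
    print(f"{label}: {speedup(before, after):.1f}x")
```

This prints 3.6x for dynamic batching and 2.1x for utilization (2.125 before rounding).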

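How it relates to the other repositories

In the CANN architecture, triton-inference-server-ge-backend belongs to layer 4 (the Ascend compute execution layer) and acts as the inference serving backend.

Dependencies:

triton-inference-server-ge-backend (Triton backend)
    ↓ calls
GE / GraphExecutor (graph executor)
    ↓ calls
Runtime
    ↓ calls
Hardware (Ascend NPU)

To unpack that:

Triton: the open-source inference serving framework
GE Backend: Triton's backend for Ascend
GE / GraphExecutor: the Ascend graph executor
Runtime: the runtime layer that GE calls into
Hardware: the Ascend NPU

Put simply, triton-inference-server-ge-backend is the bridge between Triton and Ascend. If you want Triton to manage your Ascend models, this is what you use.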

What's inside triton-inference-server-ge-backend

1. Backend implementation

// src/graph_executor_backend.cc
// Schematic outline only: it shows the flow, not code that compiles as-is.
#include "triton/backend/backend_model.h"
#include "ge/ge_api.h"

class GraphExecutorBackend : public triton::backend::BackendModel {
 public:
  void Infer(...) override {
    // 1. Prepare the inputs
    std::vector<ge::Tensor> inputs = PrepareInputs(...);

    // 2. Run inference
    std::vector<ge::Tensor> outputs;
    ge::GraphExecutor executor;
    executor.Run(inputs, &outputs);

    // 3. Return the outputs
    ProcessOutputs(outputs, ...);
  }
};

2. Model configuration

# config.pbtxt
name: "..."
platform: "graph_executor"
max_batch_size: 32

input [...]
output [...]

dynamic_batching {...}

3. Client SDK

# Python client
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
results = client.infer(model_name="resnet50", inputs=[...], outputs=[...])

4. Monitoring

# Metrics endpoint
curl localhost:8002/metrics

# Key metrics (upstream Triton prefixes these with nv_ rather than triton_):
# - triton_inference_count
# - triton_inference_request_duration_us
# - triton_inference_queue_duration_us