Small-Model Inference Deployment on Ascend NPU: [Serving SenseVoice with Triton Server]



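ღ温酒叙余生ღ᭄ꦿ࿐ · Published 2026-05-20 15:42:35

Author: 昇腾实战派

Small-Model Inference Deployment on NPU: [Knowledge Map]

Introduction

Triton Inference Server is NVIDIA's open-source, high-performance inference serving framework. It supports multiple backends and model formats and is widely used in production. This article walks through deploying the SenseVoice speech recognition model with Triton Server on Ascend AI processors. SenseVoice is a multilingual speech recognition model: its input is FBank features extracted from the audio waveform, and its outputs are the CTC log-probabilities and the encoder output lengths. We use Triton's Python backend to call the Ascend inference tool ais_bench, which loads the offline model (.om) and serves it. The article covers the model repository layout, the configuration file, the server-side code, the client, and the server start command, so that developers can get started with Triton on Ascend devices quickly.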

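Environment Setup

Hardware: an Ascend AI processor (for example, the Atlas 300 series)
Software:
- Driver and firmware: install the versions that match your hardware, following the Ascend community guide
- Image: the Triton Server image provided by the Ascend community is recommended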

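Model Repository Directory Structure

Triton Server requires models to be organized in a specific directory layout. A typical layout for the SenseVoice model looks like this:

models/
└── sensevoice                  # model name, must match "name" in config.pbtxt
    ├── 1                       # model version (must be numeric)
    │   └── model.py             # core code for the Python backend
    ├── client_sensevoice.py     # client test script (optional)
    ├── config.pbtxt             # model configuration file
    └── model.om                 # Ascend offline model

The 1/ directory is the version number and must contain model.py, the entry file of the Python backend.
config.pbtxt describes the model's inputs and outputs, the backend type, and custom parameters.
The .om file is the offline model produced by the Ascend ATC conversion tool; here it is the SenseVoice model.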
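
Before starting the server, it can help to check the layout programmatically. The following is a minimal sketch that only tests the rules stated above (a config.pbtxt next to a numeric version directory that contains model.py). The function name check_model_repo is a hypothetical helper for illustration, not part of Triton.

```python
from pathlib import Path


def check_model_repo(repo_root, model_name):
    """Return a list of layout problems for one model (empty list if none)."""
    problems = []
    model_dir = Path(repo_root) / model_name
    if not (model_dir / "config.pbtxt").is_file():
        problems.append("missing config.pbtxt")
    # Version directories must have purely numeric names, e.g. "1".
    if model_dir.is_dir():
        versions = sorted(
            d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()
        )
    else:
        versions = []
    if not versions:
        problems.append("no numeric version directory")
    for version in versions:
        # The Python backend expects model.py inside each version directory.
        if not (version / "model.py").is_file():
            problems.append(f"{version.name}/model.py is missing")
    return problems
```

Calling check_model_repo("/path/to/your/models", "sensevoice") on a correct repository should return an empty list. The check does not validate the contents of config.pbtxt or the .om file.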

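`config.pbtxt` in Detail

# Model name, normally the same as the directory that holds this file
name: "sensevoice"

# Backend that runs the model; here, the Python backend
backend: "python"

# Maximum batch size. 0 disables Triton-side batching, so the dims below
# describe the full tensor shape, batch axis included
max_batch_size: 0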

# Inputs (SenseVoice has four)
input [
  {
    name: "speech"               # audio feature tensor
    data_type: TYPE_FP32
    dims: [ -1, -1, 560 ]        # [batch, frames, feature dim]; -1 means variable
  },
  {
    name: "speech_lengths"       # number of valid frames per sample
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "language"             # language ID (e.g. auto, zh, en)
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "textnorm"             # text normalization type ID (withitn/woitn)
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]

# Outputs (SenseVoice has two)
output [
  {
    name: "ctc_logits"           # CTC log-probabilities
    data_type: TYPE_FP32
    dims: [ -1, -1, 25055 ]      # [batch, frames, vocabulary size]
  },
  {
    name: "encoder_out_lens"     # encoder output lengths
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]

# Instance group (optional): how many model instances run in parallel
instance_group [
  {
    count: 1                      # number of instances
  }
]

# Custom parameters, read in the Python backend's initialize
parameters: [
  {
    key: "batch_size",
    value: { string_value: "1" }
  },
  {
    key: "model_path",
    value: { string_value: "/home/users/models/sensevoice/model.om" }
  },
  {
    key: "device_id",
    value: { string_value: "0" }
  }
]

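Notes:

A -1 in the input or output dims marks a variable-length dimension; Triton accepts whatever size the request actually carries.
max_batch_size: 0 disables Triton-side batching. Triton does not add an implicit batch dimension, so the batch axis is written out in dims and each request can carry tensors of a different shape.
The parameters section passes custom values to the backend; model.py reads them in initialize via model_config.get('parameters', {}).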
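
To make the parameter lookup concrete, here is a small standalone sketch of reading the three custom parameters from the JSON string that Triton passes as args['model_config']. The BackendParams class and the parse_params function are hypothetical names introduced for illustration. The explicit check for missing keys is an addition; the server code in this article indexes the dictionary directly.

```python
import json
from dataclasses import dataclass


@dataclass
class BackendParams:
    model_path: str
    device_id: int
    batch_size: int


def parse_params(model_config_json):
    """Extract the custom parameters from the model config JSON string."""
    params = json.loads(model_config_json).get("parameters", {})
    required = ("model_path", "device_id", "batch_size")
    missing = [key for key in required if key not in params]
    if missing:
        raise ValueError(f"missing parameters in config.pbtxt: {missing}")
    # Every parameter value arrives as a string under "string_value".
    return BackendParams(
        model_path=params["model_path"]["string_value"],
        device_id=int(params["device_id"]["string_value"]),
        batch_size=int(params["batch_size"]["string_value"]),
    )
```

Failing early with a clear message in initialize tends to be easier to debug than a KeyError raised while the model is loading.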

Server-Side Code (Python Backend)

File path: models/sensevoice/1/model.py

import json
import numpy as np
import triton_python_backend_utils as pb_utils
from ais_bench.infer.interface import InferSession

class TritonPythonModel:
    """Triton Python backend model class."""
    def load_model(self, model_path, device_id):
        """Load OM model using ais_bench InferSession."""
        return InferSession(int(device_id), model_path)

    def initialize(self, args):
        """Initialize the model; called once when the model is loaded."""
        model_config = json.loads(args['model_config'])
        self.input_config = model_config['input']
        self.output_config = model_config['output']
        print(f"Model initialized with input: {self.input_config}, output: {self.output_config}")

        # Custom parameters from config.pbtxt (all values arrive as strings)
        params = model_config.get('parameters', {})
        self.batch_size = int(params['batch_size']['string_value'])
        self.model_path = params['model_path']['string_value']
        self.device_id = int(params['device_id']['string_value'])

        self.model = self.load_model(self.model_path, self.device_id)

        # Output tensor names (the model has two outputs)
        self.output0_name = model_config['output'][0]['name']   # "ctc_logits"
        self.output1_name = model_config['output'][1]['name']   # "encoder_out_lens"

    def execute(self, requests):
        """Run inference for each request and return one response per request."""
        responses = []
        for request in requests:
            # Fetch the four input tensors
            speech_tensor = pb_utils.get_input_tensor_by_name(request, "speech")
            speech_lengths_tensor = pb_utils.get_input_tensor_by_name(request, "speech_lengths")
            language_tensor = pb_utils.get_input_tensor_by_name(request, "language")
            textnorm_tensor = pb_utils.get_input_tensor_by_name(request, "textnorm")

            speech = speech_tensor.as_numpy()
            speech_lengths = speech_lengths_tensor.as_numpy()
            language = language_tensor.as_numpy()
            textnorm = textnorm_tensor.as_numpy()

            # Run the OM model; the input order must match the signature used at conversion time.
            # The order assumed here is [speech, speech_lengths, language, textnorm].
            ctc_logits, encoder_out_lens = self.model.infer(
                [speech, speech_lengths, language, textnorm],
                mode='dymshape',          # dynamic-shape mode
                custom_sizes=100000000     # custom buffer size (adjust for your model)
            )

            # Wrap the outputs (dtypes must match config.pbtxt)
            out0_tensor = pb_utils.Tensor(self.output0_name, ctc_logits.astype(np.float32))
            out1_tensor = pb_utils.Tensor(self.output1_name, encoder_out_lens.astype(np.int32))

            response = pb_utils.InferenceResponse(output_tensors=[out0_tensor, out1_tensor])
            responses.append(response)

        return responses

    def finalize(self):
        """Clean up resources (optional)."""
        print("Cleaning up resources...")

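Key points:

Standard interface: the Python backend requires a TritonPythonModel class with initialize, execute, and finalize methods.
Model loading: ais_bench.infer.interface.InferSession loads the OM model, using the custom parameters from the config (model path, device ID).
Multiple inputs: execute fetches the four input tensors with pb_utils.get_input_tensor_by_name and converts each to a NumPy array.
Inference call: self.model.infer takes a list of inputs whose order must match the OM model's signature exactly. mode='dymshape' enables dynamic-shape input, which SenseVoice needs because audio length varies.
Multiple outputs: the model returns ctc_logits and encoder_out_lens; each is wrapped in a pb_utils.Tensor and placed in the response.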
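
The custom_sizes value passed to infer is worth a quick sanity check. The sketch below rests on two assumptions that should be verified against the ais_bench documentation for your version: that custom_sizes is an output buffer size in bytes, and that the OM model emits ctc_logits as float32 (a model converted to FP16 would need half as much). Under those assumptions, the arithmetic for the largest output that fits is:

```python
VOCAB_SIZE = 25055        # last dimension of ctc_logits in config.pbtxt
BYTES_PER_FP32 = 4        # assumption: the model emits float32 logits


def ctc_logits_bytes(batch, frames, vocab=VOCAB_SIZE):
    """Size in bytes of a float32 tensor of shape [batch, frames, vocab]."""
    return batch * frames * vocab * BYTES_PER_FP32


def max_frames(buffer_bytes, batch=1, vocab=VOCAB_SIZE):
    """Largest frame count whose logits tensor still fits in buffer_bytes."""
    return buffer_bytes // (batch * vocab * BYTES_PER_FP32)


print(ctc_logits_bytes(1, 500))   # 50110000
print(max_frames(100_000_000))    # 997
```

If the assumptions hold, the value 100000000 used above leaves room for roughly 997 output frames at batch size 1, so longer audio would call for a larger value.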

Client Code Example

File path: models/sensevoice/client_sensevoice.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
SenseVoice Triton client (functional version)
Usage: python client_sensevoice.py --audio /path/to/audio.wav --url localhost:9000
"""

import argparse
import numpy as np
import tritonclient.http as httpclient
from funasr import AutoModel
from funasr.utils.load_utils import load_audio_text_image_video, extract_fbank
from funasr.utils.postprocess_utils import rich_transcription_postprocess
import torch

# ---------- Global configuration ----------
BLANK_ID = 0
LID_DICT = {"auto": 0, "zh": 3, "en": 4, "yue": 7, "ja": 11, "ko": 12, "nospeech": 13}
TEXTNORM_DICT = {"withitn": 14, "woitn": 15}

def init_frontend_tokenizer(funasr_model="iic/SenseVoiceSmall", device="cpu"):
    """Build the frontend (feature extractor) and tokenizer; the model object returned by build_model is discarded."""
    _, kwargs = AutoModel.build_model(
        model=funasr_model,
        trust_remote_code=True,
        device=device,
        disable_pbar=True,
        disable_log=True
    )
    return kwargs["frontend"], kwargs["tokenizer"]

def preprocess(audio_path, frontend, tokenizer, language="auto", text_norm="withitn", fs=16000):
    """Preprocess the audio and return the dict of inputs Triton expects."""
    audio_list = load_audio_text_image_video(
        audio_path, fs=frontend.fs, audio_fs=fs,
        data_type="sound", tokenizer=tokenizer
    )
    speech, speech_lengths = extract_fbank(audio_list, data_type="sound", frontend=frontend)

    lang_id = LID_DICT.get(language, 0)
    language_arr = np.array([lang_id], dtype=np.int32)
    norm_id = TEXTNORM_DICT.get(text_norm, 14)
    textnorm_arr = np.array([norm_id], dtype=np.int32)

    return {
        "speech": speech.cpu().numpy().astype(np.float32),
        "speech_lengths": speech_lengths.cpu().numpy().astype(np.int32),
        "language": language_arr,
        "textnorm": textnorm_arr
    }

def infer(triton_client, model_name, inputs):
    """Call the Triton inference service."""
    triton_inputs = []
    for name, data in inputs.items():
        # Only two dtypes are used by this model: float32 and int32
        dtype = "FP32" if data.dtype == np.float32 else "INT32"
        triton_input = httpclient.InferInput(name, list(data.shape), dtype)
        triton_input.set_data_from_numpy(data)
        triton_inputs.append(triton_input)

    outputs = [
        httpclient.InferRequestedOutput("ctc_logits"),
        httpclient.InferRequestedOutput("encoder_out_lens")
    ]
    response = triton_client.infer(model_name, triton_inputs, outputs=outputs)
    return response.as_numpy("ctc_logits"), response.as_numpy("encoder_out_lens")

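def postprocess(ctc_logits, encoder_out_lens, tokenizer):
    """Greedy CTC decoding and postprocessing (first sample in the batch only)."""
    ctc_logits = torch.from_numpy(ctc_logits)
    encoder_out_lens = torch.from_numpy(encoder_out_lens)

    # Keep only the valid frames, then take the best token per frame
    x = ctc_logits[0, : encoder_out_lens[0].item(), :]
    yseq = x.argmax(dim=-1)
    # Collapse consecutive repeats, then drop blanks
    yseq = torch.unique_consecutive(yseq, dim=-1)
    mask = yseq != BLANK_ID
    token_int = yseq[mask].tolist()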

    raw_text = tokenizer.decode(token_int)
    return rich_transcription_postprocess(raw_text)

def recognize(audio_path, server_url="localhost:8000", model_name="sensevoice",
              funasr_model="iic/SenseVoiceSmall", language="auto", text_norm="withitn",
              device="cpu"):
    """Full recognition pipeline."""
    triton_client = httpclient.InferenceServerClient(url=server_url)
    frontend, tokenizer = init_frontend_tokenizer(funasr_model, device)

    print(f"[1/3] Preprocessing audio: {audio_path}")
    inputs = preprocess(audio_path, frontend, tokenizer, language, text_norm)

    print(f"[2/3] Calling Triton inference service ({model_name})...")
    ctc_logits, enc_lens = infer(triton_client, model_name, inputs)

    print("[3/3] CTC decoding and postprocessing...")
    result = postprocess(ctc_logits, enc_lens, tokenizer)
    return result

def main():
    parser = argparse.ArgumentParser(description="SenseVoice Triton client")
    parser.add_argument("--audio", type=str, required=True, help="path to the audio file")
    parser.add_argument("--url", type=str, default="localhost:8000", help="Triton server address")
    parser.add_argument("--model", type=str, default="sensevoice", help="model name in Triton")
    parser.add_argument("--language", type=str, default="auto", help="recognition language")
    parser.add_argument("--text_norm", type=str, default="withitn", help="text normalization type")
    args = parser.parse_args()

    result = recognize(
        audio_path=args.audio,
        server_url=args.url,
        model_name=args.model,
        language=args.language,
        text_norm=args.text_norm
    )
    print("\nRecognition result:", result)

if __name__ == "__main__":
    main()

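Notes:

The client uses FunASR utilities for preprocessing (audio loading, FBank feature extraction) and for the final text postprocessing; CTC decoding is done greedily with torch.
It sends the four inputs and receives the two outputs through the Triton HTTP client.
The language and text normalization type are configurable and default to auto and withitn.
The --url option defaults to localhost:8000, Triton's default HTTP port; pass the port your server actually listens on.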
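
The greedy CTC decoding step in postprocess can be illustrated without torch. The sketch below is a pure-Python equivalent on a toy 3-token vocabulary: pick the best token per frame, collapse consecutive repeats, then drop blanks. The function name ctc_greedy_decode and the toy scores are made up for illustration.

```python
def ctc_greedy_decode(frame_scores, valid_len, blank_id=0):
    """Greedy CTC decode: argmax per frame, merge repeats, remove blanks."""
    # Best token index for each valid frame (frames past valid_len are padding).
    best = [
        max(range(len(frame)), key=frame.__getitem__)
        for frame in frame_scores[:valid_len]
    ]
    tokens = []
    prev = None
    for tok in best:
        # Keep a token only when it starts a new run and is not the blank.
        if tok != prev and tok != blank_id:
            tokens.append(tok)
        prev = tok
    return tokens


scores = [
    [0.1, 0.8, 0.1],    # token 1
    [0.2, 0.7, 0.1],    # token 1 (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.6, 0.3],    # token 1 again (kept, a blank separates the runs)
    [0.1, 0.2, 0.7],    # token 2
    [0.1, 0.2, 0.7],    # token 2 (repeat, merged)
    [0.1, 0.1, 0.8],    # padding frame, ignored because valid_len is 6
]
print(ctc_greedy_decode(scores, valid_len=6))  # [1, 1, 2]
```

The blank between the two runs of token 1 is what lets CTC emit the same token twice in a row, which is why repeats are merged before blanks are removed and not after.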

Starting Triton Server

Inside the Ascend image, start Triton Server with the following command:

/opt/tritonserver/bin/tritonserver \
    --model-repository=/path/to/your/models \
    --http-port=9000 \
    --grpc-port=9002

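--model-repository: root directory of the model repository (absolute path), for example /home/user/models.
--http-port: HTTP port that clients send requests to (set to 9000 here; Triton's default is 8000).
--grpc-port: gRPC port (set to 9002 here; Triton's default is 8001).

After startup, Triton Server loads every valid model under models/ and logs the result. You can then test with the client script: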

python client_sensevoice.py --audio test.wav --url localhost:9000
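
Before sending audio, it is often useful to confirm that the server and the model are ready. The sketch below uses only the standard library and Triton's HTTP readiness endpoints (/v2/health/ready for the server, /v2/models/<name>/ready for one model). The helper names ready_url and is_ready are hypothetical, and the address assumes the port chosen in the start command above.

```python
import urllib.error
import urllib.request


def ready_url(server, model_name=None):
    """Build the readiness URL for the server, or for one model if given."""
    if model_name:
        return f"http://{server}/v2/models/{model_name}/ready"
    return f"http://{server}/v2/health/ready"


def is_ready(url, timeout=2.0):
    """Return True if the endpoint answers HTTP 200, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

For example, is_ready(ready_url("localhost:9000", "sensevoice")) should return True once the sensevoice model has finished loading, and False while the server is still starting or if the model failed to load.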

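Summary and Next Steps

This article provided a complete example of deploying the SenseVoice speech recognition model with Triton Server on Ascend devices, covering:

The model repository directory structure
A detailed walkthrough of config.pbtxt (multiple inputs and outputs, dynamic shapes)
The Python backend server code (built on the ais_bench inference interface, with dynamic-shape support)
A client example (preprocessing, inference, postprocessing)
The Triton Server start command

Note: this article focuses on serving. Preprocessing (FBank extraction) and postprocessing (CTC decoding) run on the client. You can also move part of either into the Triton Python backend, depending on your performance requirements and architecture.

For converting SenseVoice to OM and for further optimization tips, see the official Ascend documentation:

🔗 Ascend/modelzoo - SenseVoice model adaptation example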
