LayoutLMv3 模型迁移指导书

文档说明

1.1 文档目的

本指导书详细阐述 LayoutLMv3 多模态文档智能模型从 GPU 环境迁移至昇腾 NPU(910B)环境的全流程,包括环境配置、迁移操作、常见错误及解决方案,旨在帮助技术人员完成模型的高效迁移与验证。

1.2 适用范围

本指导书适用于基于以下软件配置的昇腾 NPU 环境,若配置不同,需适配对应版本的依赖包与工具:

配置项

版本 / 信息

镜像名称

8.3.rc1-910b-ubuntu22.04-py3.11

镜像 OS

Ubuntu 22.04

Python

3.11.13

CANN

8.3RC1

PyTorch

2.7.1+cu130

Torch-npu

7.2.0

一、LayoutLMv3 模型概述

1.1 模型描述

LayoutLMv3 是面向文档智能领域的预训练多模态 Transformer 模型,采用统一的文本与图像掩码策略,通过简洁的架构和训练目标实现通用型文档理解能力。可微调应用于文本核心任务(表格理解、票据解析、文档视觉问答)和图像核心任务(文档图像分类、文档版面分析)。

1.2 网络架构

1.2.1 统一的多模态 Transformer 架构

核心为单一 Transformer 编码器,文本和图像数据均转换为同维度向量序列后拼接输入,通过自注意力机制学习文本、视觉特征及二维位置关系的深层关联,无需多阶段处理。

1.2.2 输入与嵌入层:文本与图像的早期统一

文本嵌入:沿用 BERT 分词与向量化方法,融入一维位置嵌入;图像嵌入(核心改进):摒弃 Faster R-CNN 预训练目标检测器,采用 “直接分块线性投影”—— 将文档图像分割为固定大小图像块,通过线性投影层映射为视觉 Token,并融入图像块二维坐标的位置嵌入。

1.2.3 预训练任务

LayoutLMv3 通过三类联合预训练任务实现多模态理解:

遮盖语言建模:遮盖文本 Token,基于上下文文本和视觉信息预测;遮盖图像建模:遮盖图像块 Token,基于文本和剩余视觉信息重建视觉特征;词 - 块对齐(核心创新):随机选择文本词,判断对应空间位置的图像块,强化文本与图像的细粒度对应关系。

二、模型迁移环境准备

2.1 前置条件

宿主机已安装昇腾 910B NPU 驱动,且可正常识别davinci设备;宿主机已配置 Docker 环境,可拉取 / 使用指定镜像(ID:c1855ae355cb);具备外网访问权限,用于下载依赖包、模型仓库及权重。

2.2 容器部署(挂载 NPU 到容器)

步骤 1:获取宿主机 HwHiAiUser 用户 GID

id HwHiAiUser  # 记录输出中的gid值(示例:1000)

步骤 2:启动容器并挂载 NPU 设备 / 依赖文件

docker run -itd \

--name zxp \

--privileged=true \

--device=/dev/davinci5 \

--device=/dev/davinci_manager \

--device=/dev/devmm_svm \

--device=/dev/hisi_hdc \

-v /usr/local/dcmi:/usr/local/dcmi \

-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \

-v /usr/bin/hccn_tool:/usr/bin/hccn_tool \

-v /home/00938648:/root/ \

-v /usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/common \

-v /usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/driver \

-v /etc/ascend_install.info:/etc/ascend_install.info \

-v /etc/hccn.conf:/etc/hccn.conf \

-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \

c1855ae355cb \

/bin/bash

步骤 3:验证容器内 NPU 设备

# 进入容器docker exec -it zxp /bin/bash

#查看可用davinci设备ls /dev/ | grep davinci*# 预期输出:davinci5.davinci_manager等

2.3 容器内驱动与用户配置

步骤 1:创建 HwHiAiUser 用户

groupadd -g 1000 HwHiAiUser && useradd -g HwHiAiUser -d /home/HwHiAiUser -m HwHiAiUser && echo ok

步骤 2:配置驱动环境变量

export LD_LIBRARY_PATH=/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:${LD_LIBRARY_PATH}

说明:该方式仅对当前终端有效,若需持久化,需写入~/.bashrc或/etc/profile.

2.4 软件及依赖安装

步骤 1:CANN 安装Toolkit开发套件包

步骤 2:配置环境变量及安装Kernels算子包

# 下载PyTorch包

Wget https://download.pytorch.org/whl/cpu/torch-2.7.1%2Bcpu-cp311-cp311-manylinux_2_28_aarch64.whl# 安装PyTorch

pip3 install torch-2.7.1+cpu-cp311-cp311-manylinux_2_28_aarch64.whl# 下载Torch-npu插件包

Wget https://gitcode.com/Ascend/pytorch/releases/download/v7.2.0-pytorch2.7.1/torch_npu-2.7.1-cp311-cp311-manylinux_2_28_aarch64.whl# 安装Torch-npu

pip3 install torch_npu-2.7.1-cp311-cp311-manylinux_2_28_aarch64.whl

三、模型迁移核心流程
通过分析代码,代码依赖detectron2框架,要对模型迁移适配需要对detectron2进行迁移适配并安装。

3.1 依赖包完整性检查与安装

步骤 1:安装 Git 工具

apt-get update && apt-get install -y git# 克隆detectron2仓库(模型依赖)git clone https://github.com/facebookresearch/detectron2.git -b v0.6

步骤 2:检查并安装缺失依赖

(1)依赖检查脚本

python

运行

import pkg_resources

required_packages = [

('pandas', '1.2.4'),

('libcst', None),

('prettytable', None),

('jedi', None)]for pkg, min_version in required_packages:

try:

installed_version = pkg_resources.get_distribution(pkg).version

status = f"✓ 已安装 (版本 {installed_version})"

if min_version and pkg_resources.parse_version(installed_version) < pkg_resources.parse_version(min_version):

status = f"⚠ 版本过低 (当前: {installed_version}, 需要: >={min_version})"

print(f"{pkg:20} {status}")

except pkg_resources.DistributionNotFound:

print(f"{pkg:20} ✗ 未安装")

print("\n提示:如果缺少任何'必选'包,请使用 'pip3 install <包名>' 安装。")

依赖包

是否必选

说明

最低版本要求

pandas

必选

数据处理

>= 1.2.4

libcst

必选

Python语法树解析器

-

prettytable

必选

数据可视化表格

-

jedi

可选(建议)

用于跨文件解析

-

(2)依赖安装命令

bash

运行

# 修复setuptools版本过低问题

apt-get update

apt-get install -y python3-dev build-essential libopenblas-dev liblapack-dev gfortran

python3 -m pip install --upgrade pip

python3 -m pip install --upgrade setuptools==68.2.0 wheel

python3 -m pip install cython numpy==1.23.5 -i https://mirrors.aliyun.com/pypi/simple/

# 安装依赖

pip3 install pandas==2.0.3 scipy==1.10.1 scikit-learn==1.3.0 tqdm==4.65.0 openpyxl==3.1.2 xlrd==2.0.1 -i https://mirrors.aliyun.com/pypi/simple/# 安装语法解析/可视化依赖

pip3 install pandas libcst prettytable jedi -i https://mirrors.aliyun.com/pypi/simple/

(阿里云镜像源速度下载较快)

3.2 模型代码自动转换(GPU→NPU)

步骤 1:获取detectron2源码及执行昇腾迁移工具

bash

运行

Git clone https://github.com/facecbookresearch/detectrob2.git -b v0.6

/usr/local/Ascend/ascend-toolkit/8.3.RC1/tools/ms_fmk_transplt/pytorch_gpu2npu.sh -i /detectron2/detectron2 -o /detectron2/detectron2-npu/ -v  2.1.0

步骤 2:转换后执行

将目录移动到opt目录,方便打包镜像

bash

运行

python -m pip install -e detectron2

# 创建新虚拟环境(避免依赖冲突)

python3 -m venv /opt/ascend_venv

source /opt/ascend_venv/bin/activate# 安装detectron2依赖

pip install torch -i https://mirrors.aliyun.com/pypi/simple/

pip install -i https://mirrors.aliyun.com/pypi/simple/ matplotlib

cd /opt/detectron2

pip install -e . --no-build-isolation

步骤 3:修改 detectron2 代码(可选)

bash

运行

vi ./detectron2/engine/defaults.py  # 根据迁移工具提示修改适配NPU的代码

3.3 LayoutLMv3 模型迁移与运行

步骤 1:克隆模型仓库与权重

bash

运行

# 进入opt目录(统一管理)

cd /opt

# 克隆unilm仓库(含LayoutLMv3代码)

git clone https://github.com/microsoft/unilm.gitcd unilm# 克隆模型权重

git clone https://gitcode.com/hf_mirrors/microsoft/layoutlmv3-base.git

步骤 2:修改运行脚本(run_funsd.py)

添加 NPU 设备指定环境变量;

替换datasets库的load_metric导入方式;

替换真实数据集为虚拟数据集(解决网络获取失败问题);

自定义 Trainer 类适配 NPU 设备;

确保模型 / 数据张量迁移至 NPU.

完整修改后的脚本见附录 A.

步骤 3:安装 LayoutLMv3 专属依赖

bash

运行

# 激活专属虚拟环境(避免版本冲突)

python3 -m venv /opt/layoutlmv3_npu_env

source /opt/layoutlmv3_npu_env/bin/activate# 安装兼容版本依赖

pip install --upgrade pip

pip install numpy==1.23.5 pandas scipy -i https://mirrors.aliyun.com/pypi/simple/# 安装核心依赖

pip install torch torch_npu transformers datasets evaluate -i https://mirrors.aliyun.com/pypi/simple/# 安装layoutlmft包cd /opt/unilm/layoutlmft

pip install -e .

步骤 4:执行模型训练 / 推理

bash

运行

cd /opt/unilm/layoutlmft/examples

python run_funsd.py \

--model_name_or_path /opt/detectron2/layoutlmv3-base \

--output_dir ./output_funsd_npu \

--do_train \

--do_eval \

--num_train_epochs 10 \

--per_device_train_batch_size 8 \

--learning_rate 5e-5 \

--overwrite_output_dir

步骤5:创建推理配置文件
cat > inference_config.json << 'EOF'

{

  "model_name_or_path": "/opt/unilm/layoutlmft/examples/output_funsd_npu",

  "dataset_name": "funsd",

  "do_train": false,

  "do_eval": false,

  "do_predict": true,

  "output_dir": "./inference_results",

  "per_device_eval_batch_size": 1,

  "overwrite_output_dir": true,

  "label_all_tokens": false,

  "max_test_samples": 2,

  "task_name": "ner"

}

EOF

步骤六:实行推理
# 强制离线模式,避免下载

export TRANSFORMERS_OFFLINE=1

export HF_HUB_OFFLINE=1

export HF_DATASETS_OFFLINE=1

# 运行推理

python run_funsd.py inference_config.json

四、常见错误与解决方案

错误阶段

错误现象

根因分析

解决方案

依赖安装阶段

提示 setuptools 版本过低,无法安装依赖包

setuptools 版本不兼容昇腾迁移工具

执行:python3 -m pip install --upgrade setuptools==68.2.0 wheel

工具转换阶段

jedi 库未安装,迁移工具分析 Python 文件失败

缺少语法解析依赖 jedi

pip3 install jedi -i https://mirrors.aliyun.com/pypi/simple/

工具转换阶段

pip install -e detectron2 失败,提示缺少 setup.py

detectron2 目录结构不完整 / 未用隔离构建

cd /opt/detectron2 && pip install -e . --no-build-isolation

脚本运行阶段

提示 datasets 模块缺失

未安装 LayoutLMv3 运行依赖

pip install datasets -i https://mirrors.aliyun.com/pypi/simple/

脚本运行阶段

NumPy 版本冲突,SciPy/Pandas 报版本不兼容错误

依赖版本不匹配

创建专属虚拟环境,安装 numpy==1.23.5,再重装 pandas/scipy

脚本运行阶段

提示 from datasets import load_metric 导入失败

datasets 库 API 更新,metric 功能迁移至 evaluate

修改导入语句:from evaluate import load as load_metric,并安装 evaluate

脚本运行阶段

提示找不到 layoutlmft 包

未安装项目自身代码包

cd /opt/unilm/layoutlmft && pip install -e .

脚本运行阶段

数据集加载失败,无法从网络获取 funsd 数据集

网络限制 / 数据集 API 变更

替换为虚拟数据集(见附录 A 脚本中 dummy_data 部分)

脚本运行阶段

PIL.Image 模块无 LINEAR 属性

图像处理依赖版本问题

升级 Pillow:pip install --upgrade Pillow -i https://mirrors.aliyun.com/pypi/simple/

脚本运行阶段

NPU 设备未被识别,提示 cuda 设备不可用

未指定 NPU 设备 / 未安装 torch-npu

添加环境变量os.environ["NPU_VISIBLE_DEVICES"] = "5",确保 torch-npu 安装正确

五、迁移验证与输出说明

5.1 预期输出

脚本运行后,核心输出如下(示例):

plaintext

INFO: 检测到NPU设备,使用 npu:5

INFO: 正在创建符合LayoutLMv3格式的虚拟数据集。..

INFO: 虚拟数据集创建成功,开始训练流程。

INFO: 正在创建随机初始化的LayoutLMv3模型,用于验证NPU训练流程。

INFO: 模型已移动到设备: npu:5

5.2 验证要点

脚本无报错终止,且打印 “检测到 NPU 设备”;

训练 / 评估阶段无设备不兼容错误,损失值正常下降;

output_funsd_npu 目录生成模型权重、metrics.json 等文件。

六、注意事项

虚拟环境建议:不同阶段使用独立虚拟环境(detectron2/LayoutLMv3),避免依赖冲突;

环境变量持久化:若需长期使用 NPU,将LD_LIBRARY_PATH和NPU_VISIBLE_DEVICES写入~/.bashrc;

设备索引:脚本中使用 NPU 索引 5(对应 /dev/davinci5),需根据实际设备调整;

模型权重:若需使用真实权重,确保 layoutlmv3-base 目录下包含 pytorch_model.bin、config.json 等文件;

性能优化:若训练速度慢,可调整per_device_train_batch_size、dataloader_num_workers等参数(需匹配 NPU 算力)。

附录 A:完整 run_funsd.py 脚本

python

运行

#!/usr/bin/env python# coding=utf-8import os# 使用第6个NPU设备(索引5)

os.environ["NPU_VISIBLE_DEVICES"] = "5"

os.environ["CUDA_VISIBLE_DEVICES"] = ""

os.environ["MASTER_ADDR"] = "localhost"

os.environ["MASTER_PORT"] = "29500"

os.environ["WORLD_SIZE"] = "1"

os.environ["RANK"] = "0"

os.environ["LOCAL_RANK"] = "-1"

import torch_npufrom torch_npu.contrib import transfer_to_npuimport loggingimport osimport sysfrom dataclasses import dataclass, fieldfrom typing import Optional

import numpy as npfrom datasets import ClassLabel, load_datasetfrom evaluate import load as load_metric

import layoutlmft.data.datasets.funsdimport transformersfrom layoutlmft.data import DataCollatorForKeyValueExtractionfrom layoutlmft.data.data_args import DataTrainingArgumentsfrom layoutlmft.models.model_args import ModelArgumentsfrom layoutlmft.trainers import FunsdTrainer as Trainerfrom transformers import (

AutoConfig,

AutoModelForTokenClassification,

AutoTokenizer,

HfArgumentParser,

PreTrainedTokenizerFast,

TrainingArguments,

set_seed,)from transformers.trainer_utils import get_last_checkpoint, is_main_processfrom transformers.utils import check_min_version

# Will error if the minimal version of Transformers is not installed. Remove at your own risks.

check_min_version("4.5.0")

logger = logging.getLogger(__name__)

def main():

# See all possible arguments in layoutlmft/transformers/training_args.py

# or by passing the --help flag to this script.

# We now keep distinct sets of args, for a cleaner separation of concerns.

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))

if len(sys.argv) == 2 and sys.argv[1].endswith(".json"):

# If we pass only one argument to the script and it's the path to a json file,

# let's parse it to get our arguments.

model_args, data_args, training_args = parser.parse_json_file(json_file=os.path.abspath(sys.argv[1]))

else:

model_args, data_args, training_args = parser.parse_args_into_dataclasses()

# 打印调试信息

print(f"DEBUG: 训练参数初始化 - local_rank={training_args.local_rank}, n_gpu={training_args.n_gpu}, world_size={training_args.world_size}")

# 修改训练参数,确保使用单设备

training_args.dataloader_num_workers = 0

# 设置单进程训练,避免数据并行

training_args.local_rank = -1

# 设置设备

import torch

if hasattr(torch, 'npu') and torch.npu.is_available():

# 使用第6张NPU卡(索引5)

device = torch.device("npu:5")

torch.npu.set_device(5)

print(f"INFO: 检测到NPU设备,使用 {device}")

else:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

print(f"INFO: 使用设备: {device}")

# Detecting last checkpoint.

last_checkpoint = None

if os.path.isdir(training_args.output_dir) and training_args.do_train and not training_args.overwrite_output_dir:

last_checkpoint = get_last_checkpoint(training_args.output_dir)

if last_checkpoint is None and len(os.listdir(training_args.output_dir)) > 0:

raise ValueError(

f"Output directory ({training_args.output_dir}) already exists and is not empty. "

"Use --overwrite_output_dir to overcome."

)

elif last_checkpoint is not None:

logger.info(

f"Checkpoint detected, resuming training at {last_checkpoint}. To avoid this behavior, change "

"the `--output_dir` or add `--overwrite_output_dir` to train from scratch."

)

# Setup logging

logging.basicConfig(

format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",

datefmt="%m/%d/%Y %H:%M:%S",

handlers=[logging.StreamHandler(sys.stdout)],

)

logger.setLevel(logging.INFO if is_main_process(training_args.local_rank) else logging.WARN)

# Log on each process the small summary:

logger.warning(

f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"

+ f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"

)

# Set the verbosity to info of the Transformers logger (on main process only):

if is_main_process(training_args.local_rank):

transformers.utils.logging.set_verbosity_info()

transformers.utils.logging.enable_default_handler()

transformers.utils.logging.enable_explicit_format()

logger.info(f"Training/evaluation parameters {training_args}")

# Set seed before initializing model.

set_seed(training_args.seed)

# 使用虚拟数据集,立即验证NPU训练流程

# 创建符合LayoutLMv3要求的虚拟数据集

print("INFO: 正在创建符合LayoutLMv3格式的虚拟数据集。..")

from datasets import Dataset, DatasetDict, Features, Value, Sequence, ClassLabel

import numpy as np

# 1. 定义数据特征(结构)

features = Features({

"id": Value("string"),

"words": Sequence(Value("string")),

"bboxes": Sequence(Sequence(Value("int64"))),  # 嵌套序列:每个单词对应一个四元组[x0, y0, x1, y1]

"ner_tags": Sequence(ClassLabel(num_classes=3, names=["O", "B-HEADER", "I-HEADER"])), # 示例标签

"image": Value("string"),  # LayoutLMv3可能需要图像路径占位符

})

# 2. 创建虚拟数据(确保维度匹配:2个样本,每个样本3个单词)

dummy_data = {

"id": ["sample-1", "sample-2"],

"words": [["Hello", "world", "!"], ["Test", "document", "."]],

# 每个单词一个边界框 [x0, y0, x1, y1]

"bboxes": [

[[0, 0, 100, 100], [110, 0, 210, 100], [220, 0, 250, 100]],   # 样本1

[[0, 0, 80, 80], [90, 0, 180, 80], [190, 0, 210, 80]]         # 样本2

],

"ner_tags": [[0, 1, 2], [0, 1, 2]],  # 标签ID,对应ClassLabel

"image": ["dummy.jpg", "dummy.jpg"]  # 虚拟图像路径

}

# 3. 创建DatasetDict

train_dataset = Dataset.from_dict(dummy_data, features=features)

datasets = DatasetDict({

"train": train_dataset,

"validation": train_dataset,  # 添加验证集

"test": train_dataset,

})

print("INFO: 虚拟数据集创建成功,开始训练流程。")

if training_args.do_train:

column_names = datasets["train"].column_names

features = datasets["train"].features

else:

column_names = datasets["validation"].column_names

features = datasets["validation"].features

text_column_name = "tokens" if "tokens" in column_names else column_names[0]

label_column_name = (

f"{data_args.task_name}_tags" if f"{data_args.task_name}_tags" in column_names else column_names[1]

)

remove_columns = column_names

# In the event the labels are not a `Sequence[ClassLabel]`, we will need to go through the dataset to get the

# unique labels.

def get_label_list(labels):

unique_labels = set()

for label in labels:

unique_labels = unique_labels | set(label)

label_list = list(unique_labels)

label_list.sort()

return label_list

if isinstance(features[label_column_name].feature, ClassLabel):

label_list = features[label_column_name].feature.names

# No need to convert the labels since they are already ints.

label_to_id = {i: i for i in range(len(label_list))}

else:

label_list = get_label_list(datasets["train"][label_column_name])

label_to_id = {l: i for i, l in enumerate(label_list)}

num_labels = len(label_list)

# Load pretrained model and tokenizer

#

# Distributed training:

# The .from_pretrained methods guarantee that only one local process can concurrently

# download model & vocab.

config = AutoConfig.from_pretrained(

model_args.config_name if model_args.config_name else model_args.model_name_or_path,

num_labels=num_labels,

finetuning_task=data_args.task_name,

cache_dir=model_args.cache_dir,

revision=model_args.model_revision,

use_auth_token=True if model_args.use_auth_token else None,

)

tokenizer = AutoTokenizer.from_pretrained(

model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,

cache_dir=model_args.cache_dir,

use_fast=True,

revision=model_args.model_revision,

use_auth_token=True if model_args.use_auth_token else None,

)

# 加载模型 - 使用随机初始化的模型验证流程

print("INFO: 正在创建随机初始化的LayoutLMv3模型,用于验证NPU训练流程。")

from transformers import LayoutLMv3ForTokenClassification

model = LayoutLMv3ForTokenClassification(config)

# 将模型移动到NPU设备

model.to(device)

print(f"INFO: 模型已移动到设备: {device}")

# Tokenizer check: this script requires a fast tokenizer.

if not isinstance(tokenizer, PreTrainedTokenizerFast):

raise ValueError(

"This example script only works for models that have a fast tokenizer. Checkout the big table of models "

"at https://huggingface.co/transformers/index.html#bigtable to find the model types that meet this "

"requirement"

)

# Preprocessing the dataset

# Padding strategy

padding = "max_length" if data_args.pad_to_max_length else False

# Tokenize all texts and align the labels with them.

def tokenize_and_align_labels(examples):

# 提取文本和边界框

texts = examples[text_column_name]

boxes = examples["bboxes"]

# 确保边界框格式正确

processed_boxes = []

for bbox_list in boxes:

# 每个边界框应该是4个整数

normalized_bboxes = []

for bbox in bbox_list:

if len(bbox) == 4:

normalized_bboxes.append([int(coord) for coord in bbox])

else:

# 如果格式不对,使用默认值

normalized_bboxes.append([0, 0, 1000, 1000])

processed_boxes.append(normalized_bboxes)

# 调用tokenizer

tokenized_inputs = tokenizer(

text=texts,

boxes=processed_boxes,

padding=padding,

truncation=True,

return_overflowing_tokens=True,

)

labels = []

bboxes = []

images = []

for batch_index in range(len(tokenized_inputs["input_ids"])):

word_ids = tokenized_inputs.word_ids(batch_index=batch_index)

org_batch_index = tokenized_inputs["overflow_to_sample_mapping"][batch_index]

label = examples[label_column_name][org_batch_index]

bbox = examples["bboxes"][org_batch_index]

image = examples["image"][org_batch_index]

previous_word_idx = None

label_ids = []

bbox_inputs = []

for word_idx in word_ids:

# Special tokens have a word id that is None. We set the label to -100 so they are automatically

# ignored in the loss function.

if word_idx is None:

label_ids.append(-100)

bbox_inputs.append([0, 0, 0, 0])

# We set the label for the first token of each word.

elif word_idx != previous_word_idx:

label_ids.append(label_to_id[label[word_idx]])

bbox_inputs.append(bbox[word_idx])

# For the other tokens in a word, we set the label to either the current label or -100, depending on

# the label_all_tokens flag.

else:

label_ids.append(label_to_id[label[word_idx]] if data_args.label_all_tokens else -100)

bbox_inputs.append(bbox[word_idx])

previous_word_idx = word_idx

labels.append(label_ids)

bboxes.append(bbox_inputs)

images.append(image)

tokenized_inputs["labels"] = labels

tokenized_inputs["bbox"] = bboxes

tokenized_inputs["image"] = images

return tokenized_inputs

if training_args.do_train:

if "train" not in datasets:

raise ValueError("--do_train requires a train dataset")

train_dataset = datasets["train"]

if data_args.max_train_samples is not None:

train_dataset = train_dataset.select(range(data_args.max_train_samples))

train_dataset = train_dataset.map(

tokenize_and_align_labels,

batched=True,

remove_columns=remove_columns,

num_proc=data_args.preprocessing_num_workers,

load_from_cache_file=not data_args.overwrite_cache,

)

if training_args.do_eval:

if "validation" not in datasets:

raise ValueError("--do_eval requires a validation dataset")

eval_dataset = datasets["validation"]

if data_args.max_val_samples is not None:

eval_dataset = eval_dataset.select(range(data_args.max_val_samples))

eval_dataset = eval_dataset.map(

tokenize_and_align_labels,

batched=True,

remove_columns=remove_columns,

num_proc=data_args.preprocessing_num_workers,

load_from_cache_file=not data_args.overwrite_cache,

)

if training_args.do_predict:

if "test" not in datasets:

raise ValueError("--do_predict requires a test dataset")

test_dataset = datasets["test"]

if data_args.max_test_samples is not None:

test_dataset = test_dataset.select(range(data_args.max_test_samples))

test_dataset = test_dataset.map(

tokenize_and_align_labels,

batched=True,

remove_columns=remove_columns,

num_proc=data_args.preprocessing_num_workers,

load_from_cache_file=not data_args.overwrite_cache,

)

# Data collator

data_collator = DataCollatorForKeyValueExtraction(

tokenizer,

pad_to_multiple_of=8 if training_args.fp16 else None,

padding=padding,

max_length=512,

)

# Metrics - 使用简化的评估函数

def compute_metrics(p):

predictions, labels = p

predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)

true_predictions = [

[label_list[p] for (p, l) in zip(prediction, label) if l != -100]

for prediction, label in zip(predictions, labels)

]

true_labels = [

[label_list[l] for (p, l) in zip(prediction, label) if l != -100]

for prediction, label in zip(predictions, labels)

]

# 计算简单准确率

correct = 0

total = 0

for pred_line, label_line in zip(true_predictions, true_labels):

for p_token, l_token in zip(pred_line, label_line):

total += 1

if p_token == l_token:

correct += 1

accuracy = correct / total if total > 0 else 0.0

return {

"accuracy": accuracy,

"precision": accuracy,

"recall": accuracy,

"f1": accuracy,

}

# 创建自定义Trainer来确保数据在正确的设备上

class NPUTrainer(Trainer):

def _wrap_model(self, model, training=True, dataloader=None):

# 直接返回模型,不进行数据并行包装

# 接受training参数以适应新版本的transformers

return model.to(device) if device else model

def _prepare_inputs(self, inputs):

# 确保所有输入张量都在正确的设备上

prepared_inputs = {}

for k, v in inputs.items():

if isinstance(v, torch.Tensor):

prepared_inputs[k] = v.to(device)

else:

prepared_inputs[k] = v

return prepared_inputs

# Initialize our Trainer

trainer = NPUTrainer(

model=model,

args=training_args,

train_dataset=train_dataset if training_args.do_train else None,

eval_dataset=eval_dataset if training_args.do_eval else None,

tokenizer=tokenizer,

data_collator=data_collator,

compute_metrics=compute_metrics,

)

# Training

if training_args.do_train:

checkpoint = last_checkpoint if last_checkpoint else None

train_result = trainer.train(resume_from_checkpoint=checkpoint)

metrics = train_result.metrics

trainer.save_model()  # Saves the tokenizer too for easy upload

max_train_samples = (

data_args.max_train_samples if data_args.max_train_samples is not None else len(train_dataset)

)

metrics["train_samples"] = min(max_train_samples, len(train_dataset))

trainer.log_metrics("train", metrics)

trainer.save_metrics("train", metrics)

trainer.save_state()

# Evaluation

if training_args.do_eval:

logger.info("*** Evaluate ***")

metrics = trainer.evaluate()

max_val_samples = data_args.max_val_samples if data_args.max_val_samples is not None else len(eval_dataset)

metrics["eval_samples"] = min(max_val_samples, len(eval_dataset))

trainer.log_metrics("eval", metrics)

trainer.save_metrics("eval", metrics)

# Predict

if training_args.do_predict:

logger.info("*** Predict ***")

predictions, labels, metrics = trainer.predict(test_dataset)

predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)

true_predictions = [

[label_list[p] for (p, l) in zip(prediction, label) if l != -100]

for prediction, label in zip(predictions, labels)

]

trainer.log_metrics("test", metrics)

trainer.save_metrics("test", metrics)

# Save predictions

output_test_predictions_file = os.path.join(training_args.output_dir, "test_predictions.txt")

if trainer.is_world_process_zero():

with open(output_test_predictions_file, "w") as writer:

for prediction in true_predictions:

writer.write(" ".join(prediction) + "\n")

def _mp_fn(index):

# For xla_spawn (TPUs)

main()

if __name__ == "__main__":

main()

Logo

作为“人工智能6S店”的官方数字引擎,为AI开发者与企业提供一个覆盖软硬件全栈、一站式门户。

更多推荐