8.6 RuntimeError: call aclnnCast failed (bf16 data type support)

RuntimeError: call aclnnCast failed, detail:EZ1001: 2024-11-25-14:47:16.347.323 self not implemented for DT_BFLOAT16, should be in dtype support list [DT_FLOAT16,DT_FLOAT,DT_DOUBLE,DT_INT8,DT_UINT8,DT_INT16,DT_INT32,DT_INT64,DT_UINT16,DT_UINT32,DT_UINT64,DT_BOOL,DT_COMPLEX64,DT_COMPLEX128,].

[rank2]: [ERROR] 2024-11-25-14:47:16 (PID:20733, Device:2, RankID:2) ERR01005 OPS internal error
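bf16 (bfloat16) is a deep-learning data format developed at Google Brain (not Facebook). Huawei machines do not necessarily support it, as the dtype support list in the error above shows, so set bf16 to false.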
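As a rough illustration of the fix, the sketch below reads the rejected dtype and the support list out of the error text and switches bf16 off in a training config. The config keys `bf16` and `fp16` and both function names are assumptions for illustration; use whatever option names your training framework actually defines.

```python
import re


def parse_dtype_error(message):
    """Return (rejected_dtype, supported_dtypes) from an aclnnCast error, or None."""
    match = re.search(
        r"not implemented for (DT_\w+), should be in dtype support list \[([^\]]*)\]",
        message,
    )
    if match is None:
        return None
    # The list in the log ends with a trailing comma, so drop empty items.
    supported = [name.strip() for name in match.group(2).split(",") if name.strip()]
    return match.group(1), supported


def disable_bf16(config, message):
    """Return a copy of config with bf16 off if the error rejects DT_BFLOAT16."""
    updated = dict(config)
    parsed = parse_dtype_error(message)
    if parsed is None or parsed[0] != "DT_BFLOAT16":
        return updated
    _, supported = parsed
    updated["bf16"] = False  # assumed key name
    # Fall back to fp16 only if the device lists DT_FLOAT16 as supported.
    updated["fp16"] = "DT_FLOAT16" in supported  # assumed key name
    return updated
```

If the error does not mention DT_BFLOAT16, the config is returned unchanged, so the helper can be called on any error text.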

8.7 The aclnnCast operator

[rank0]: File "/root/anaconda3/lib/python3.9/site-packages/peft/peft_model.py", line 159, in __init__
[rank0]: self.base_model._cast_adapter_dtype(
[rank0]: File "/root/anaconda3/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 357, in _cast_adapter_dtype
[rank0]: param.data = param.data.to(torch.float32)
[rank0]: RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnCast.
[rank0]: Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, pleace set the environment variable ASCEND_LAUNCH_BLOCKING=1.
[rank0]: [ERROR] 2024-11-26-10:53:07 (PID:1580674, Device:0, RankID:0) ERR00100 PTA call acl api failed
[rank5]:[E compiler_depend.ts:270] call aclnnCast failed, detail:EZ9999: Inner Error!
EZ9999: 2024-11-26-10:53:07.228.835 Op Cast does not has any binary.
TraceBack (most recent call last):
Kernel Run failed. opType: 3, Cast
launch failed for Cast, errno:561000.
[ERROR] 2024-11-26-10:53:07 (PID:1580685, Device:5, RankID:5) ERR01005 OPS internal error
This problem goes away on the latest A2-series servers; use driver version 24 or later. Only the A2 series supports it.
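The log above notes that operators are launched asynchronously and suggests setting ASCEND_LAUNCH_BLOCKING=1 to get an accurate stack trace. A minimal sketch of doing that from Python follows; the helper name `blocking_env` and the script name `train.py` in the comment are hypothetical.

```python
import os


def blocking_env(base=None):
    """Return a copy of the environment with synchronous operator launch enabled."""
    env = dict(os.environ if base is None else base)
    # Variable name taken from the error log itself.
    env["ASCEND_LAUNCH_BLOCKING"] = "1"
    return env


# Example use (train.py is a placeholder for your own entry point):
#   import subprocess, sys
#   subprocess.run([sys.executable, "train.py"], env=blocking_env())
```

Synchronous launch is useful for locating the failing call, but it is generally something to turn on only while debugging.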

Reinstall the kernel operator package, opp_kernel.

That is indeed the case, as running ls under the ascend-toolkit directory shows.
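The same check can be scripted. The sketch below looks in a toolkit directory, and one level below it, for entries whose name contains `opp_kernel`. The function name, the search depth, and the assumption that the installed package shows up under that name are all illustrative; pass the ascend-toolkit path from your own installation.

```python
from pathlib import Path


def find_entries(toolkit_root, needle="opp_kernel"):
    """List entries at depth 1 or 2 under toolkit_root whose name contains needle."""
    root = Path(toolkit_root)
    if not root.is_dir():
        return []
    candidates = list(root.glob("*")) + list(root.glob("*/*"))
    return sorted(
        path.relative_to(root).as_posix()
        for path in candidates
        if needle in path.name
    )
```

An empty result suggests the kernel package is missing from that directory and needs to be reinstalled.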
