[Big tensor] fix nan in cross_entropy #74070

hxzd5568 · 2025-07-16T08:13:59Z

PR Category

Operator Mechanism

PR Types

Bug fixes

Description

Pcard-67164
Fix NaN produced by cross_entropy(reduction="sum") under float16.

Cause

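The values being summed are too large, so the result of the sum becomes inf; once that inf takes part in later computation, the result becomes NaN.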
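The overflow described above can be reproduced outside Paddle. The following is a minimal sketch in NumPy, used here only as a stand-in; the array `losses` and its values are hypothetical, and `inf - inf` is just one example of an operation that turns inf into NaN (the PR does not say which operation does so inside cross_entropy).

```python
import numpy as np

# Hypothetical per-element losses: 100 values of 1000.0,
# each exactly representable in float16.
losses = np.full(100, 1000.0, dtype=np.float16)

# Largest finite float16 value (65504.0).
fp16_max = float(np.finfo(np.float16).max)

with np.errstate(over="ignore", invalid="ignore"):
    # The true sum is 100000, which exceeds fp16_max, so the float16 result is inf.
    total_fp16 = np.sum(losses)
    # One example of a follow-up operation: inf - inf is nan.
    after_op = total_fp16 - total_fp16

# Accumulating and returning in float32 keeps the sum finite.
total_fp32 = np.sum(losses, dtype=np.float32)

print(total_fp16, after_op, total_fp32)
```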

Fix

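Because the problem is that the result of sum overflows, it does not seem fixable at the kernel level: the output type is generally the same as the input type, so even if intermediate values are promoted to higher precision, the result is eventually converted back to the original type and overflows. Changing the data type at the upper level, where the output is allocated, is more convenient.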
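The argument above (promoting only the intermediate precision does not help if the output is narrowed back to float16) can be illustrated with a small NumPy sketch. This is an analogy under assumed values, not Paddle's kernel code; `losses` is a hypothetical input.

```python
import numpy as np

# Hypothetical float16 input whose true sum (100000) exceeds the float16 range.
losses = np.full(100, 1000.0, dtype=np.float16)

with np.errstate(over="ignore"):
    # Accumulate in float32, then convert the result back to the input dtype.
    # The accumulation itself is fine, but the final narrowing overflows to inf.
    promoted_then_narrowed = np.sum(losses, dtype=np.float32).astype(np.float16)

# Cast first and keep the float32 result: the sum stays finite.
kept_wide = np.sum(losses.astype(np.float32))

print(promoted_then_narrowed, kept_wide, kept_wide.dtype)
```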

paddle-bot · 2025-07-16T08:14:04Z

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

hxzd5568 · 2025-07-16T16:02:59Z

/re-run coverage test

codecov-commenter · 2025-07-16T17:31:05Z

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Please upload report for BASE (develop@1107fe4). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
python/paddle/nn/functional/loss.py	0.00%	4 Missing ⚠️

❌ Your patch status has failed because the patch coverage (0.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             develop   #74070   +/-   ##
==========================================
  Coverage           ?    0.00%           
==========================================
  Files              ?        1           
  Lines              ?        4           
  Branches           ?        0           
==========================================
  Hits               ?        0           
  Misses             ?        4           
  Partials           ?        0


wanghuancoder

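I think getting NaN in this case is acceptable: fp16 simply cannot hold the value. There is no need to cast to float32; doing so could cause performance problems in some scenarios. It may be worth having Shuhao evaluate this as well.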

lshpku · 2025-07-18T02:24:58Z

python/paddle/nn/functional/loss.py

+                out_type = out.dtype
+                if out_type == paddle.float16:
+                    out = paddle.cast(out, dtype=paddle.float32)
+
                out_sum = _C_ops.sum(out, [], None, False)


Can't this be changed at the sum itself? Change it to sum(out, [], paddle.float32, False); that fuses the cast and the sum into a single kernel.
Compare the nsys traces of the following two snippets:

x = paddle.randn([32, 32], dtype='bfloat16')
y = paddle.sum(x, axis=1, dtype='float32')
print(y.numpy().dtype)

x = paddle.randn([32, 32], dtype='bfloat16')
y = paddle.sum(x.cast('float32'), axis=1)
print(y.numpy().dtype)
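That the two variants above agree numerically can be sketched with a NumPy analogue. This is only an illustration under assumptions: NumPy has no bfloat16, so float16 is substituted, and the sketch says nothing about kernel fusion or performance, which is what the nsys comparison is meant to show.

```python
import numpy as np

# Random float16 input (float16 substitutes for bfloat16, which NumPy lacks).
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32)).astype(np.float16)

# Variant 1: ask the reduction itself for a float32 accumulator and result.
y_dtype_arg = np.sum(x, axis=1, dtype=np.float32)

# Variant 2: cast the whole input to float32 first, then reduce.
y_cast_first = np.sum(x.astype(np.float32), axis=1)

print(y_dtype_arg.dtype, y_cast_first.dtype)
print(np.allclose(y_dtype_arg, y_cast_first, rtol=1e-5, atol=1e-4))
```

Both variants return float32 values; the difference the reviewer points at is whether the cast is a separate pass over the data.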

hxzd5568 · 2025-07-21T15:31:46Z

/re-run all-failed

hxzd5568 marked this pull request as draft July 16, 2025 16:01

hxzd5568 marked this pull request as ready for review July 16, 2025 16:02

XieYunshen added the skip-ci: coverage label Jul 17, 2025

wanghuancoder previously approved these changes Jul 17, 2025


lshpku reviewed Jul 18, 2025


[BIG tensor] fix nan in cross_entropy

a113c3a

hxzd5568 dismissed wanghuancoder’s stale review via a113c3a July 21, 2025 06:08

hxzd5568 force-pushed the fix_nan_cross branch from 1addbc2 to a113c3a Compare July 21, 2025 06:08
