[Auto Parallel] fix recompute reentrant:false bugs in auto parallel&dynamic graph #74075

Merged
merged 2 commits into from
Jul 18, 2025

@GITD245 GITD245 commented Jul 16, 2025

PR Category

Auto Parallel

PR Types

Bug fixes

Description

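Issues that occur when dynamic-graph auto parallel uses recompute with reentrant:false:

  1. With the original tmp_tensor = core.eager.Tensor construction, a tensor stored in ctx could be read while still uninitialized during backward (in fused_layers.py).
  2. If a tensor saved in ctx has a grad, the original tmp_tensor construction clears that grad.

This PR fixes both cases.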

NOTE: The original code was:

# TODO(jeff41404): it seems better to use `tmp_tensor = core.eager.Tensor(inner_x)`,
# but other errors will be triggered during the current period, and can be modified after resolution
tmp_tensor = core.eager.Tensor(
    inner_x.dtype,
    inner_x.shape,
    inner_x.name + "cpy",
    core.VarDesc.VarType.DENSE_TENSOR,
    inner_x.persistable,
    inner_x.process_mesh,
    inner_x.placements,
)

This form was needed to work around certain bugs that existed in the framework earlier. It is now replaced with:

tmp_tensor = core.eager.Tensor(inner_x)
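
The difference between the two constructions can be illustrated with a toy model in plain Python. This is only a sketch of the behavior described in this PR, not Paddle's implementation: `ToyTensor`, `from_metadata`, and `from_tensor` are hypothetical names, and the assumption that constructing from an existing tensor keeps its data and grad visible is taken from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTensor:
    """Hypothetical stand-in for an eager tensor (not a Paddle class)."""
    name: str
    shape: tuple
    data: Optional[list] = None  # None models an uninitialized buffer
    grad: Optional[list] = None  # None models a missing gradient

    def is_initialized(self) -> bool:
        return self.data is not None


def from_metadata(src: ToyTensor) -> ToyTensor:
    """Models the old workaround: a new tensor built from metadata only."""
    return ToyTensor(name=src.name + "cpy", shape=src.shape)


def from_tensor(src: ToyTensor) -> ToyTensor:
    """Models construction from an existing tensor (assumed to keep data and grad)."""
    return ToyTensor(name=src.name, shape=src.shape, data=src.data, grad=src.grad)


inner_x = ToyTensor("x", (2,), data=[1.0, 2.0], grad=[0.1, 0.2])
old_style = from_metadata(inner_x)  # no data, no grad
new_style = from_tensor(inner_x)    # data and grad still reachable
```

In this model, `old_style` reproduces both reported symptoms (an uninitialized tensor and a lost grad), while `new_style` shows neither.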

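After the change, testing was completed on qwen, baichuan, gpt2, and llama7b under dynamic-graph semi-auto parallel; the fix takes effect as expected and the losses align.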
pcard-86802
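
A loss-alignment check of the kind mentioned above could be sketched as follows. The helper name `losses_aligned` and the tolerances are assumptions for illustration; the PR does not state how alignment was measured or which tolerance was used.

```python
import math
from typing import Sequence


def losses_aligned(
    baseline: Sequence[float],
    candidate: Sequence[float],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-8,
) -> bool:
    """Return True if both runs logged the same number of steps and every
    per-step loss matches within the given tolerances (hypothetical helper)."""
    if len(baseline) != len(candidate):
        return False
    return all(
        math.isclose(b, c, rel_tol=rel_tol, abs_tol=abs_tol)
        for b, c in zip(baseline, candidate)
    )
```

A typical use would compare the per-step losses of a run without recompute against a run with recompute and reentrant:false, using the same seed and data order.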


paddle-bot bot commented Jul 16, 2025

Your PR has been submitted. Thanks for your contribution to the open-source project!
Please wait for the CI results first. See the Paddle CI Manual for details.

liym27 previously approved these changes Jul 17, 2025

@liym27 liym27 left a comment

LGTM

@codecov-commenter

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Please upload report for BASE (develop@726b769). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...on/paddle/distributed/fleet/recompute/recompute.py 50.00% 1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (50.00%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #74075   +/-   ##
==========================================
  Coverage           ?   50.00%           
==========================================
  Files              ?        1           
  Lines              ?        2           
  Branches           ?        0           
==========================================
  Hits               ?        1           
  Misses             ?        1           
  Partials           ?        0           

@liym27 liym27 left a comment

LGTM

@liym27 liym27 merged commit 1f27316 into PaddlePaddle:develop Jul 18, 2025
58 of 60 checks passed
@GITD245 GITD245 deleted the recompute branch July 18, 2025 07:41