Fix issue(s) preventing conus13km_restart and conus13km_decomp regression tests from running #2647

Open
MatthewPyle-NOAA opened this issue Mar 12, 2025 · 14 comments · May be fixed by NOAA-EMC/fv3atm#974 or #2753
Assignees
Labels
bug Something isn't working

Comments

@MatthewPyle-NOAA
Collaborator

MatthewPyle-NOAA commented Mar 12, 2025

Description

We have reason to believe that the RRFS system isn't bit-reproducible across restarts: a restarted run does not match a continuous (uninterrupted) forecast integration. We wanted to confirm this with the regression tests, which provide a simpler framework, and it looks like the restart problem is a known issue, since the conus13km_restart test is purposely not run.

To Reproduce:

Numerous tests within tests/rt.conf in the regional "conus13km" realm are commented out under "# Expected to fail:" headings.
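One way to list the disabled tests is to scan rt.conf for commented-out entries that follow an "# Expected to fail:" heading. The Python sketch below assumes, hypothetically, that test entries are pipe-separated lines beginning with RUN and that disabled entries are the same lines prefixed with #. The sample text and the function name are made up for illustration and are not copied from the repository.

```python
def disabled_tests(conf_text, heading="# Expected to fail:"):
    """Return names of commented-out RUN entries found under `heading`."""
    names = []
    in_section = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith(heading):
            in_section = True
            continue
        if not in_section:
            continue
        if not line.startswith("#"):
            # A blank or active line ends the "expected to fail" section.
            in_section = False
            continue
        body = line.lstrip("#").strip()
        fields = [f.strip() for f in body.split("|")]
        if len(fields) >= 2 and fields[0] == "RUN":
            names.append(fields[1])
    return names


# Illustrative sample only; the real rt.conf layout may differ.
SAMPLE = """\
RUN | conus13km_control | | baseline |
# Expected to fail:
#RUN | conus13km_restart | | | conus13km_control
#RUN | conus13km_decomp | | |
RUN | conus13km_2threads | | |
"""

print(disabled_tests(SAMPLE))
```

If the real file uses a different column layout, only the field-splitting step should need to change.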

Additional context

Ideally, these conus13km tests would use the RRFS_sas physics suite used for RRFS rather than the FV3_HRRR suite.

@MatthewPyle-NOAA MatthewPyle-NOAA added the bug Something isn't working label Mar 12, 2025
@DusanJovic-NOAA
Collaborator

@MatthewPyle-NOAA I think I fixed two issues that prevented bit-reproducible results for conus13km restart and decomp tests. Please take a look at this branch (https://github.com/DusanJovic-NOAA/ufs-weather-model/tree/rrfs_conus13km_tests) and try to run your test with it. Once you confirm that it works for you, I can open a PR.

To fix the restart tests, I had to add one variable ('ebu_smoke') to the restart file. The decomp tests were fixed by reducing the number of MPI tasks in both the I and J directions, so that the size of each MPI subdomain is greater than 'nrows_blend'.
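The decomposition constraint can be checked before choosing a layout. The Python sketch below assumes the grid is split as evenly as possible across tasks, so the smallest subdomain in each direction is the integer quotient of grid points by task count. The grid size, the layouts, the blending width, and all function names are hypothetical and chosen only to illustrate the check.

```python
def min_subdomain_size(npoints, ntasks):
    """Smallest block when `npoints` are split as evenly as possible over `ntasks`."""
    return npoints // ntasks


def layout_ok(nx, ny, layout_x, layout_y, nrows_blend):
    """True if every MPI subdomain is larger than nrows_blend in both I and J."""
    return (min_subdomain_size(nx, layout_x) > nrows_blend
            and min_subdomain_size(ny, layout_y) > nrows_blend)


# Hypothetical grid and blending width, chosen only to illustrate the check.
NX, NY, NROWS_BLEND = 400, 240, 10

for layout in [(16, 30), (16, 20), (8, 12)]:
    print(layout, layout_ok(NX, NY, *layout, NROWS_BLEND))
```

With these made-up numbers, the first layout fails in the J direction (240 // 30 = 8, which is not greater than 10), while the two layouts with fewer tasks pass.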

@MatthewPyle-NOAA
Collaborator Author

Thanks @DusanJovic-NOAA for this quick fix. I'll try to take a look soonish and let you know.

@MatthewPyle-NOAA
Collaborator Author

@DusanJovic-NOAA Could you also make these changes on top of the production/RRFS.v1 branch? That would simplify testing it more fully for RRFS. Thanks!

@DusanJovic-NOAA
Collaborator

@MatthewPyle-NOAA When I try to run one of the regression tests from the production/RRFS.v1 branch, compilation fails with this error:

/lfs/h2/emc/eib/noscrub/dusan.jovic/ufs/prod_rrfsv1/ufs-weather-model/FV3/atmos_cubed_sphere/tools/fv_io.F90(98): error #6580: Name in only-list does not exist or is not accessible.   [MPP_COMM_NULL]
                                     mpp_npes, MPP_COMM_NULL
-----------------------------------------------^
/lfs/h2/emc/eib/noscrub/dusan.jovic/ufs/prod_rrfsv1/ufs-weather-model/FV3/atmos_cubed_sphere/tools/fv_io.F90(106): error #6580: Name in only-list does not exist or is not accessible.   [MPP_GET_DOMAIN_TILE_COMMID]
                                     mpp_get_global_domain, mpp_get_domain_tile_commid, mpp_define_io_domain
------------------------------------------------------------^

How do you run the tests in this branch? Standard ./rt.sh -e?

@MatthewPyle-NOAA
Collaborator Author

@DusanJovic-NOAA We run the tests within rt.conf_rrfs - not sure if that explains the difference, though.

@DusanJovic-NOAA
Collaborator

@MatthewPyle-NOAA Please check this branch, which is based on the production/RRFS.v1 branch: https://github.com/DusanJovic-NOAA/ufs-weather-model/tree/rrfs_v1_conus13km_tests

@JiliDong-NOAA
Contributor

The compilation failure is caused by the compile option

-DENABLE_PARALLELRESTART

and is discussed here:
#2488

I wrote these at the time:

It appears the above option is set to True or Yes by default. When compiling explicitly, or using "rt.sh -n", without setting it to NO, compilation will fail. Most entries in rt.conf have "-DENABLE_PARALLELRESTART=NO", which I guess is why there is no problem running with the original rt.conf files.
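A quick way to spot build configurations that rely on the default is to check whether an option string sets the flag explicitly. The Python sketch below does plain string parsing on a space-separated list of -D options. The helper names and the example option strings are hypothetical; only the -DENABLE_PARALLELRESTART flag name comes from this thread.

```python
def cmake_flag_value(options, name):
    """Return the value of -D<name>=<value> in an options string, or None if unset."""
    prefix = "-D" + name
    value = None
    for token in options.split():
        if token == prefix:
            value = ""  # flag given without an explicit value
        elif token.startswith(prefix + "="):
            value = token[len(prefix) + 1:]
    return value


def relies_on_default(options, name="ENABLE_PARALLELRESTART"):
    """True if the option string never sets `name`, so the build default applies."""
    return cmake_flag_value(options, name) is None


# Hypothetical option strings for illustration.
explicit = "-DAPP=ATM -D32BIT=ON -DENABLE_PARALLELRESTART=NO"
implicit = "-DAPP=ATM -D32BIT=ON"

print(cmake_flag_value(explicit, "ENABLE_PARALLELRESTART"))
print(relies_on_default(implicit))
```

An option string for which `relies_on_default` returns True would pick up the default described above and, on this branch, would be expected to hit the compile error.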

@MatthewPyle-NOAA
Collaborator Author

Thanks @JiliDong-NOAA I figured it was related to the PARALLELRESTART material.

@junwang-noaa
Collaborator

@MatthewPyle-NOAA Would you please confirm whether Dusan's branch (rrfs_v1_conus13km_tests, based on production/RRFS.v1) resolved the issue?

@MatthewPyle-NOAA
Collaborator Author

@junwang-noaa I'm waiting on somebody to run a full test with the RRFS, but they are occupied with a couple of other higher priority items right now. Hopefully it isn't a huge problem for it to linger for a bit longer?

@MatthewPyle-NOAA
Collaborator Author

@JiliDong-NOAA has finally started running it with a full RRFS configuration case this week. Initial tests comparing a 2 h continuous forecast with a run restarted at 1 h showed differences. Since a major difference between the regression test configuration that works properly and the RRFS configuration that failed is the choice of convection scheme (RRFS uses saSAS), Jili did a sensitivity test with convection disabled. This worked somewhat better, with only a small number of diagnostic fields differing.

Summary from Jili:

After turning off deep convection, the prognostic and most diagnostic variables are reproducible in the restart run, except for three diagnostic variables: hailcast_dhail, accswe_land and accswe_ice. In summary, the current rrfsv1 deterministic forecast has the following restart reproducibility issues:

saSAS deep convection
hailcast_dhail (it is likely hailcast in current rrfsv1 doesn't account for restart reproducibility)
accswe_land/accswe_ice or water equivalent snow accumulation over land/ice from RUC LSM (it looks like there is no water equivalent snow accumulation carried over from the previous checkpoint)
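The comparison behind this kind of summary can be sketched as a field-by-field bitwise check between the continuous run and the restarted run. The Python example below uses only the standard library and made-up values standing in for fields read from the two runs' output files. The field name t2m, the numbers, and the function names are hypothetical.

```python
import struct


def bit_identical(a, b):
    """True if two float sequences have the same length and identical 64-bit patterns."""
    if len(a) != len(b):
        return False
    return struct.pack(f"{len(a)}d", *a) == struct.pack(f"{len(b)}d", *b)


def differing_fields(continuous, restarted):
    """Names of fields that are missing from one run or not bit-identical."""
    names = sorted(set(continuous) | set(restarted))
    return [n for n in names
            if n not in continuous or n not in restarted
            or not bit_identical(continuous[n], restarted[n])]


# Made-up values standing in for fields read from the two runs' output files.
continuous = {"t2m": [280.15, 281.0], "accswe_land": [0.1 + 0.2, 1.5]}
restarted = {"t2m": [280.15, 281.0], "accswe_land": [0.3, 1.5]}

print(differing_fields(continuous, restarted))
```

Comparing packed bytes rather than using `==` on the values is a deliberate choice here: it flags a sign difference between 0.0 and -0.0 and treats NaNs with identical bit patterns as equal, which matches what "bit-reproducible" means.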

@JiliDong-NOAA
Contributor

@MatthewPyle-NOAA @DusanJovic-NOAA To fix the saSAS-related restart reproducibility of rrfsv1, I just submitted an fv3atm PR to this branch:

DusanJovic-NOAA/fv3atm#10

The changes are only in ccpp/physics though.

@DusanJovic-NOAA
Collaborator

@JiliDong-NOAA I merged your PR into my rrfs_v1_conus13km_tests branch. Is there a regression test in the develop branch that uses the saSAS scheme and can be used to test restart functionality? Will this change need to be made to the develop branch as well?

@JiliDong-NOAA
Contributor

Thanks @DusanJovic-NOAA! There should be a number of global regression tests using saSAS in the develop branch. But the restart reproducibility issue in rrfsv1 only happens when sigmab_coldstart is turned on, which I don't believe any regression test does. @MatthewPyle-NOAA suggested adding a regression test with a configuration similar to rrfsv1, which I think is a great idea.

As for whether the same changes are needed in the develop branch, the logic there looks fine as long as sigmab_coldstart is not enabled:

      if (flag_init .and. .not. flag_restart) then
         sigmind_new = 0.0
      else
         sigmind_new = sigmind
      end if
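
To make the branch behavior explicit, here is a Python transliteration of the Fortran snippet above, evaluated for all four flag combinations. It is only a sketch: the function name and the carried-over value 0.25 are hypothetical, and it does not model sigmab_coldstart.

```python
def initial_sigmind(flag_init, flag_restart, sigmind):
    """Mirror of the Fortran branch: zero on a cold start, carry over otherwise."""
    if flag_init and not flag_restart:
        return 0.0
    return sigmind


# 0.25 is an arbitrary stand-in for a value read from a restart file.
for flag_init in (True, False):
    for flag_restart in (True, False):
        print(flag_init, flag_restart, initial_sigmind(flag_init, flag_restart, 0.25))
```

Only the combination of an initial step without a restart resets the value to zero. On the first step of a restarted run, the value carried in from the restart file is kept.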

Projects
Status: In Progress
4 participants