Upgrade UFS-WM to spack-stack/1.9.1 #2619

Open
Tracked by #2984
ulmononian opened this issue Feb 25, 2025 · 146 comments · May be fixed by #2650
Labels
enhancement New feature or request

Comments

@ulmononian
Collaborator

Description

spack-stack/1.9.1 will be released in the near term. The UFS-WM should upgrade to this stack.

Solution

Make the necessary changes to module files, test scripts, etc. to use the spack-stack/1.9.1 release.
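As a rough, hedged sketch only (not the actual change set), pointing a Hera test environment at the new stack would look something like the following; the 1.9.1 install prefix and the stack metamodule names are assumptions patterned after the 1.9.0 Hera path referenced later in this thread.

# Sketch only: the 1.9.1 prefix and metamodule names below are assumptions,
# modeled on the spack-stack-1.9.0 Hera installation quoted later in this thread.
module purge
module use /contrib/spack-stack/spack-stack-1.9.1/envs/ue-oneapi-2024.2.1/install/modulefiles/Core
module load stack-intel stack-intel-oneapi-mpi   # metamodule names assumed
module load cmake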

@RatkoVasic-NOAA @rickgrubin-tomorrow @jkbk2004 @AlexanderRichert-NOAA @BrianCurtis-NOAA

@ulmononian ulmononian added the enhancement New feature or request label Feb 25, 2025
@ulmononian
Collaborator Author

@BrianCurtis-NOAA @AlexanderRichert-NOAA @jkbk2004 Do any of you know the status of NCO approvals on the operational machines for the new library versions to be introduced with spack-stack/1.9.1?

@AlexanderRichert-NOAA
Collaborator

AlexanderRichert-NOAA commented Feb 25, 2025

@BrianCurtis-NOAA @AlexanderRichert-NOAA @jkbk2004 Do any of you know the status of NCO approvals on the operational machines for the new library versions to be introduced with spack-stack/1.9.1?

Not offhand-- @Hang-Lei-NOAA can you comment?

@BrianCurtis-NOAA
Collaborator

@Hang-Lei-NOAA might know

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Feb 25, 2025 via email

@jkbk2004
Collaborator

jkbk2004 commented Mar 4, 2025

@NickSzapiro-NOAA tested hera/gnu but hit some issues. @NickSzapiro-NOAA, what is the error message?

@NickSzapiro-NOAA
Collaborator

Like in #2263

The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this release.
Workarounds are to run on a single node, or to use a system with an RDMA
capable network such as Infiniband.

@DusanJovic-NOAA
Collaborator

I was also testing the spack-stack/1.9.0 installation on Hera/intel last week (/contrib/spack-stack/spack-stack-1.9.0/envs/ue-oneapi-2024.2.1/install/modulefiles/Core), and a few tests failed with this error:

198: [h17c45:159296:0:159296] Caught signal 8 (Floating point exception: floating-point invalid operation)
198: ==== backtrace (tid: 159296) ====
198:  0 0x0000000000053519 ucs_debug_print_backtrace()  ???:0
198:  1 0x0000000000012d10 __funlockfile()  :0
198:  2 0x0000000002ceface _SCOTCHbdgraphInit()  ???:0
198:  3 0x0000000002cef4db kdgraphMapRbPart2()  kdgraph_map_rb_part.c:0
198:  4 0x0000000002cef636 kdgraphMapRbPart2()  kdgraph_map_rb_part.c:0
198:  5 0x0000000002cef15c _SCOTCHkdgraphMapRbPart()  ???:0
198:  6 0x0000000002cebc68 SCOTCH_dgraphMapCompute()  ???:0
198:  7 0x0000000002ceb5ce SCOTCH_ParMETIS_V3_PartKway()  ???:0
198:  8 0x0000000002ceb6f6 SCOTCH_ParMETIS_V3_PartGeomKway()  ???:0
198:  9 0x0000000002cea987 scotch_parmetis_v3_partgeomkway_()  ???:0
198: 10 0x000000000283ccee yowpdlibmain_mp_runparmetis_()  /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/esmf880_mapl2530/ufs-weather-model/WW3/model/src/PDLIB/yowpdlibmain.F90:632
198: 11 0x000000000282a3e8 yowpdlibmain_mp_initfromgriddim_()  /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/esmf880_mapl2530/ufs-weather-model/WW3/model/src/PDLIB/yowpdlibmain.F90:127
198: 12 0x0000000002671dfb pdlib_w3profsmd_mp_pdlib_init_()  /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/esmf880_mapl2530/ufs-weather-model/WW3/model/src/w3profsmd_pdlib.F90:263
198: 13 0x000000000195b3c5 w3initmd_mp_w3init_()  /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/esmf880_mapl2530/ufs-weather-model/WW3/model/src/w3initmd.F90:765
198: 14 0x000000000141951b wav_comp_nuopc_mp_waveinit_ufs_()  /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/esmf880_mapl2530/ufs-weather-model/WW3/model/src/wav_comp_nuopc.F90:1741
198: 15 0x00000000013fa0c8 wav_comp_nuopc_mp_initializerealize_()  /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/ufs/esmf880_mapl2530/ufs-weather-model/WW3/model/src/wav_comp_nuopc.F90:785
198: 16 0x00000000009f2fc0 ESMCI::FTable::callVFuncPtr()  /contrib/spack-stack/spack-stack-1.9.0/cache/build_stage/role.epic/spack-stage-esmf-8.8.0-6egfjcmjlfy35fyjclznkm34g63ekvtj/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2187

@JessicaMeixner-NOAA
Collaborator

Which version of SCOTCH is being used?

@DusanJovic-NOAA
Collaborator

Which version of SCOTCH is being used?

7.0.4, same as currently used.

@RatkoVasic-NOAA
Collaborator

Can you test with the current spack-stack-1.9.1 branch:
https://github.com/RatkoVasic-NOAA/ufs-weather-model/tree/ss-191
On Hera, I got these failures:

fail_compile_rrfs_dyn32_phy32_intelllvm
fail_compile_rrfs_dyn64_phy32_intelllvm
fail_test_cpld_debug_gfsv17_intel
fail_test_cpld_debug_p8_intel
fail_test_cpld_debug_pdlib_p8_intel

@DeniseWorthen
Collaborator

@RatkoVasic-NOAA Can you check the err log for your cpld_debug_gfsv17_intel test?

@RatkoVasic-NOAA
Collaborator

@DeniseWorthen here:
/scratch1/NCEPDEV/stmp2/Ratko.Vasic/FV3_RT/rt_1773941/cpld_debug_gfsv17_intel/err
I see SCOTCH (as you mentioned at the meeting):

246:  5 0x0000000002d53b6c _SCOTCHkdgraphMapRbPart()  ???:0
246:  6 0x0000000002d50678 SCOTCH_dgraphMapCompute()  ???:0
246:  7 0x0000000002d4ffde SCOTCH_ParMETIS_V3_PartKway()  ???:0
246:  8 0x0000000002d50106 SCOTCH_ParMETIS_V3_PartGeomKway()  ???:0
246:  9 0x0000000002d4f397 scotch_parmetis_v3_partgeomkway_()  ???:0
246: 10 0x00000000028a12c2 yowpdlibmain_mp_runparmetis_()  /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/WM-SS-191/WW3/model/src/PDLIB/yowpdlibmain.F90:632
246: 11 0x000000000288e9bc yowpdlibmain_mp_initfromgriddim_()  /scratch2/NCEPDEV/fv3-cam/Ratko.Vasic/WM-SS-191/WW3/model/src/PDLIB/yowpdlibmain.F90:127

@MatthewMasarik-NOAA
Collaborator

@DeniseWorthen here: /scratch1/NCEPDEV/stmp2/Ratko.Vasic/FV3_RT/rt_1773941/cpld_debug_gfsv17_intel/err I see SCOTCH (as you mentioned at the meeting); backtrace quoted above.

@RatkoVasic-NOAA I was curious whether you have the branch you used for this testing available on your fork?

@RatkoVasic-NOAA
Collaborator

RatkoVasic-NOAA commented Apr 7, 2025 via email

@MatthewMasarik-NOAA
Collaborator

MatthewMasarik-NOAA commented Apr 7, 2025

https://github.com/RatkoVasic-NOAA/ufs-weather-model.git

Branch: ss-191

Awesome, thank you!

@MatthewMasarik-NOAA
Collaborator

MatthewMasarik-NOAA commented Apr 8, 2025

@RatkoVasic-NOAA could I ask if you're using intel or the intelllvm modules when scotch fails?

@RatkoVasic-NOAA
Collaborator

@RatkoVasic-NOAA could I ask if you're using intel or the intelllvm modules when scotch fails?

It uses modulefiles/ufs_hera.intel.lua modulefile.
And this is from rt.conf:

COMPILE | s2swa_32bit_pdlib_debug | intel | -DAPP=S2SWA -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DPDLIB=ON -DDEBUG=ON | - noaacloud jet | fv3 |
RUN | cpld_debug_gfsv17                                 | - noaacloud jet                      | baseline |

@MatthewMasarik-NOAA
Collaborator

@RatkoVasic-NOAA could I ask if you're using intel or the intelllvm modules when scotch fails?

It uses modulefiles/ufs_hera.intel.lua modulefile. And this is from rt.conf:

COMPILE | s2swa_32bit_pdlib_debug | intel | -DAPP=S2SWA -D32BIT=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DPDLIB=ON -DDEBUG=ON | - noaacloud jet | fv3 |
RUN | cpld_debug_gfsv17                                 | - noaacloud jet                      | baseline |

Perfect. Thank you for that info.

@MatthewMasarik-NOAA
Collaborator

Good news from @GeorgeVandenberghe-NOAA who tested scotch v7.0.7 on hera/intel:

I was able to replicate a floating point error with scotch/7.0.4, and it went away when I rebuilt with scotch/7.0.7.

To get that scotch build on Hera, after loading ufs_hera.intel, add the following line to the build script before running cmake:

export CMAKE_PREFIX_PATH=/scratch1/NCEPDEV/global/gwv/simstacks/simstack.1115/netcdf140.492.460.mapl241.fms2301.crtm240/scotch.707:$CMAKE_PREFIX_PATH
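For context, here is a hedged sketch of where that override fits into a full build sequence. Only the export line above is George's actual instruction; the module and cmake commands are illustrative, with the compile flags mirroring the s2swa_32bit_pdlib_debug rt.conf entry quoted earlier in this thread.

# Sketch only: module/cmake commands are illustrative; the flags mirror the
# s2swa_32bit_pdlib_debug rt.conf entry quoted earlier in this thread.
module use modulefiles
module load ufs_hera.intel

# Prepend George's scotch 7.0.7 build so CMake finds it ahead of the stack's 7.0.4
export CMAKE_PREFIX_PATH=/scratch1/NCEPDEV/global/gwv/simstacks/simstack.1115/netcdf140.492.460.mapl241.fms2301.crtm240/scotch.707:$CMAKE_PREFIX_PATH

cmake -S . -B build -DAPP=S2SWA -D32BIT=ON \
      -DCCPP_SUITES=FV3_GFS_v17_coupled_p8_ugwpv1 -DPDLIB=ON -DDEBUG=ON
cmake --build build -j 8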

@BrianCurtis-NOAA
Collaborator

@GeorgeVandenberghe-NOAA if this scotch fix needs to make it into WCOSS2, please get that process started ASAP.

@ulmononian
Collaborator Author

Great news. @MatthewMasarik-NOAA @JessicaMeixner-NOAA @junwang-noaa, will the plan be to go to scotch 7.0.7? We'd have to get this into spack-stack/1.9.1 to move this forward, so I just want to see if there is a consensus.

@JessicaMeixner-NOAA
Collaborator

If we can confirm this fixes all the issues, then yes, we'd want to move to SCOTCH 7.0.7. We'll need to get a request in for WCOSS2 ASAP.

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Apr 9, 2025

Just noting for awareness that requests for WCOSS2 library installations should be opened here: https://github.com/NOAA-EMC/WCOSS2-requests.

@MatthewMasarik-NOAA
Collaborator

If we can confirm this fixes all the issues, then yes, we'd want to move to SCOTCH 7.0.7. We'll need to get a request in for WCOSS2 ASAP.

I made several attempts running the RTs with George's scotch build, but am getting compile errors in the WW3 source. I'm working on troubleshooting.

@mathomp4

and if you'd like to see what cray-mpich/8.1.30 reports via module show, please ask.

@rickgrubin-noaa Can you provide what cray-mpich is showing? We are still a bit baffled here.

@rickgrubin-tomorrow
Contributor

@mathomp4 this is a bit long, but hopefully provides appropriate context and info.

Cray MPICH 8.1.30:
=======================================

Release Date:
-------------
  June  1, 2024


Purpose:
--------
  Cray MPICH 8.1.30 is based upon ANL MPICH 3.4a2 with support for libfabric
  and is optimized for the Cray Programming Environment.
    
  Major Differences Cray MPICH 8.1.30 from the XC Cray MPICH include:

      - Uses the new ANL MPICH CH4 code path and libfabric for network
        support.

      - Does not support -default64 mode for Fortran

      - Does not support C++ language bindings

  New Cray MPICH features for HPE Cray EX and Apollo systems:
      - Starting from the 8.1.26 release, Cray MPICH supports the Intel Sapphire Rapids CPU HBM 
        processor architecture.

      - On systems with AMD GPUs, Cray MPICH 8.1.26 supports all ROCm
        versions starting from ROCm 5.0, including the latest ROCm 5.5.0 
        release. 
 
        The Cray MPICH 8.1.25 release and prior versions of 
        Cray MPICH are only compatible with ROCm versions up to (and 
        including) the ROCm 5.4.0 release.

      - Cray MPICH uses the libfabric "verbs;ofi_rxm" provider by default.
        This is the supported and optimized OFI libfabric provider for
        Slingshot-10 and Apollo systems.

      - Cray MPICH offers support for multiple NICs per node. Starting with
        version 8.0.8, by default Cray MPICH will use all available NICs on
        a node. Several rank-to-NIC assignment policies are supported. For
        details on choosing a policy for assigning ranks to NICS, or for
        selecting a subset of available NICs, please see the following
        environment variables documented in the mpi man page.

        MPICH_OFI_NIC_VERBOSE
        MPICH_OFI_NIC_POLICY
        MPICH_OFI_NIC_MAPPING
        MPICH_OFI_NUM_NICS

      - Enhancements to the MPICH_OFI_NIC_POLICY NUMA mode have been added.
        Starting with version 8.0.14, if the user selects the NUMA policy,
        the NIC closest to the rank is selected. A NIC no longer needs to
        reside in the same numa node as the rank. If multiple NICs are
        assigned to the same numa node, the local ranks will round-robin
        between them. Numa distances are analyzed to select the closest NIC.

      - Cray MPICH supports creating a full connection grid during MPI_Init.
        By default, OFI connections between ranks are set up on demand. This
        allows for optimal performance while minimizing memory requirements.
        However, for jobs requiring an all-to-all communication pattern, it
        may be beneficial to create all OFI connections in a coordinated
        manner at startup. See the MPICH_OFI_STARTUP_CONNECT description in
        the mpi man page.

      - Cray MPICH supports runtime switching to the UCX netmod starting
        with version 8.0.14. To do this load the craype-network-ucx module
        and module swap between Cray-MPICH and Cray-MPICH-UCX modules.  For
        more information including relevant environment variables reference
        the intro_mpi man page with the Cray-MPICH-UCX module loaded.

      - Lmod support for HPE Cray EX starting with Cray MPICH 8.0.16.


Key Changes and Bugs Closed:
----------------------------

   Changes in Cray MPICH 8.1.30

     New Features:

        - Cray MPICH 8.1.30 for aarch64 is in early stage of support and
          has some issues. Setting these variables will help circumvent most
          known issues.

            export MPICH_SMP_SINGLE_COPY_MODE=CMA
            export MPICH_MALLOC_FALLBACK=1

Product and OS Dependencies:
----------------------------
  The Cray MPICH 8.1.30 release is supported on the following HPE systems:
  * HPE Cray EX systems with CLE
  * HPE Apollo systems as part of the Cray Programming Environment

  Product and OS Dependencies by network type:
  --------------------------------------------------+
                              |       Shasta        |
  ----------------------------+---------------------+
        craype                | >= 2.7.6            |
  ----------------------------+---------------------+
        cray-pals             | >= 1.0.6            |
  ----------------------------+---------------------+
        cray-pmi              | >= 6.0.1            |
  ----------------------------+---------------------+
        libfabric             | >= 1.9.0            |
  ----------------------------+---------------------+

  One or more compilers:
  * AMD ROCM 6.0 or later
  * AOCC 4.1 or later
  * CCE 17.0 or later
  * GNU 12.3 or later
  * Intel 2022.1 or later
  * Nvidia 23.3 or later

whatis("cray-mpich - Cray MPICH Message Passing Interface")
setenv("CRAY_MPICH_VER","8.1.30")
setenv("CRAY_MPICH_VERSION","8.1.30")
setenv("CRAY_MPICH_ROOTDIR","/opt/cray/pe/mpich/8.1.30")
setenv("CRAY_MPICH_BASEDIR","/opt/cray/pe/mpich/8.1.30/ofi")
setenv("PE_MPICH_MODULE_NAME","cray-mpich")
setenv("CRAY_MPICH_DIR","/opt/cray/pe/mpich/8.1.30/ofi/intel/2022.1")
setenv("CRAY_MPICH_PREFIX","/opt/cray/pe/mpich/8.1.30/ofi/intel/2022.1")
setenv("MPICH_DIR","/opt/cray/pe/mpich/8.1.30/ofi/intel/2022.1")
setenv("PE_MPICH_PKGCONFIG_VARIABLES","PE_MPICH_GTL_DIR_@accelerator@:PE_MPICH_GTL_LIBS_@accelerator@")
setenv("PE_MPICH_GTL_DIR_amd_gfx906","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_DIR_amd_gfx908","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_DIR_amd_gfx90a","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_DIR_amd_gfx940","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_DIR_amd_gfx942","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_DIR_nvidia70","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_DIR_nvidia80","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_DIR_nvidia90","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_DIR_ponteVecchio","-L/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("PE_MPICH_GTL_LIBS_amd_gfx906","-lmpi_gtl_hsa")
setenv("PE_MPICH_GTL_LIBS_amd_gfx908","-lmpi_gtl_hsa")
setenv("PE_MPICH_GTL_LIBS_amd_gfx90a","-lmpi_gtl_hsa")
setenv("PE_MPICH_GTL_LIBS_amd_gfx940","-lmpi_gtl_hsa")
setenv("PE_MPICH_GTL_LIBS_amd_gfx942","-lmpi_gtl_hsa")
setenv("PE_MPICH_GTL_LIBS_nvidia70","-lmpi_gtl_cuda")
setenv("PE_MPICH_GTL_LIBS_nvidia80","-lmpi_gtl_cuda")
setenv("PE_MPICH_GTL_LIBS_nvidia90","-lmpi_gtl_cuda")
setenv("PE_MPICH_GTL_LIBS_ponteVecchio","-lmpi_gtl_ze")
prepend_path("PE_MPICH_GENCOMPILERS_INTEL","2022.1")
prepend_path("PE_MPICH_FIXED_PRGENV","INTEL")
prepend_path("PE_INTEL_FIXED_PKGCONFIG_PATH","/opt/cray/pe/mpich/8.1.30/ofi/intel/2022.1/lib/pkgconfig")
prepend_path("PE_PKGCONFIG_PRODUCTS","PE_MPICH")
prepend_path("PE_MPICH_PKGCONFIG_LIBS","mpich")
prepend_path("PE_MPICH_FORTRAN_PKGCONFIG_LIBS","mpichf90")
prepend_path("PE_PKGCONFIG_LIBS","mpich")
prepend_path("PE_FORTRAN_PKGCONFIG_LIBS","mpichf90")
setenv("PE_PERFTOOLS_MPICH_LIBDIR","/opt/cray/pe/mpich/8.1.30/ofi/intel/2022.1/lib")
prepend_path("MANPATH","/opt/cray/pe/mpich/8.1.30/man/mpich")
prepend_path("MANPATH","/opt/cray/pe/mpich/8.1.30/ofi/man")
prepend_path("PATH","/opt/cray/pe/mpich/8.1.30/bin")
prepend_path("PATH","/opt/cray/pe/mpich/8.1.30/ofi/intel/2022.1/bin")
prepend_path("CRAY_LD_LIBRARY_PATH","/opt/cray/pe/mpich/8.1.30/ofi/intel/2022.1/lib:/opt/cray/pe/mpich/8.1.30/gtl/lib")
setenv("CRAY_LMOD_MPI","cray-mpich/8.0")
prepend_path("MODULEPATH","/opt/cray/pe/lmod/modulefiles/mpi/intel/2023.2/ofi/1.0/cray-mpich/8.0")

@mathomp4

mathomp4 commented Apr 29, 2025

@RatkoVasic-NOAA If you can, can you try the MAPL branch:

feature/v2.53.0-disable-ssi

in your tests? That brings over some SSI-related changes (essentially disabling SSI) from @atrayano.

ETA: Note: this cannot run with the hybrid MPI+OpenMP GOCART you all run; it should run as pure MPI.
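For reference only, one common way to run an executable as pure MPI in a Slurm job script is sketched below; this is an assumption about how one might do it, not the UFS-WM run scripts' actual mechanism, and the task count is illustrative.

# Sketch only: force pure MPI by disabling OpenMP threading.
# Task count and launcher invocation are illustrative, not from this thread.
export OMP_NUM_THREADS=1
srun -n ${TOTAL_TASKS:-480} ./fv3.exe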

@RatkoVasic-NOAA
Collaborator

@mathomp4 Great news! I ran one test on Gaea-C6 (cpld_control_p8_lnd_intel) using the new MAPL branch (feature/v2.53.0-disable-ssi) and it passed successfully.
Tomorrow I'll continue with other cases and see if that fix solves them as well... fingers crossed!

@mathomp4

mathomp4 commented May 1, 2025

@mathomp4 Great news! I ran one test on Gaea-C6 (cpld_control_p8_lnd_intel) using the new MAPL branch (feature/v2.53.0-disable-ssi) and it passed successfully. Tomorrow I'll continue with other cases and see if that fix solves them as well... fingers crossed!

@RatkoVasic-NOAA Encouraging news! I'll talk with my team and see if we can make this a run-time switch (right now we sort of "bulk replaced" some code others might need).

Once we can figure that out, we can look to release a MAPL v2.53.3.

@mathomp4

mathomp4 commented May 1, 2025

@RatkoVasic-NOAA I hate to ask it, but we might have another branch for you to try:

feature/v2.53.0-hardcode-pet

This one is a one-line change that, if it works for you, would be much easier for us to turn into a run-time option for your situation.

@RatkoVasic-NOAA
Collaborator

@mathomp4 sure, I'll try it today

@mathomp4

mathomp4 commented May 1, 2025

CC @atrayano so he can monitor

@RatkoVasic-NOAA
Collaborator

@mathomp4 , @atrayano , @ulmononian
I ran tests with feature/v2.53.0-hardcode-pet branch on Gaea-C6.
All but one test passed, and that failure is not a problem in MAPL; rather, it was in GOCART in a debug case. There is another division by zero in ./GOCART/Process_Library/GOCART2G_Process.F90, which the debug build caught.

I'm going to test the new MAPL (using the feature/v2.53.0-hardcode-pet branch) on some other machines (like Hera) where we had no problem with [email protected].

@mathomp4 do you plan on tagging this version (feature/v2.53.0-hardcode-pet)?

@mathomp4

mathomp4 commented May 1, 2025

@RatkoVasic-NOAA We'll need to make it flexible with a run-time option. This branch hardcodes something that is non-default (for us), so we need to make it an option that can be set in an RC file. We should have something tomorrow, I hope, and then we can make a v2.53.3 tag.

@RatkoVasic-NOAA
Collaborator

@mathomp4 @atrayano @ulmononian
Just FYI - I got bit-identical UFS-WM results on Hera using [email protected] and the feature/v2.53.0-hardcode-pet branch. Once that branch is tagged, we can switch to the new MAPL version everywhere.

@ulmononian
Collaborator Author

@mathomp4 @atrayano @ulmononian
Just FYI - I got bit-identical UFS-WM results on Hera using [email protected] and the feature/v2.53.0-hardcode-pet branch. Once that branch is tagged, we can switch to the new MAPL version everywhere.

This is great news. Thanks @RatkoVasic-NOAA, @mathomp4, and @atrayano!

@mathomp4

mathomp4 commented May 2, 2025

@ulmononian @RatkoVasic-NOAA We are working on our mods now. The hope is we can get a tag made soon (later today we hope, Monday if we find weird issues) and submitted to spack.

We are targeting having you activate the ESMF_PIN behavior you want by adding an option to CAP.rc. Once we have it finalized on our end, we'll let you know the config option to use.

@mathomp4

mathomp4 commented May 2, 2025

@RatkoVasic-NOAA One more test please. Can you build this branch:

feature/add-esmf-pin-2.53.3

and in your top-level configuration file, e.g. CAP.rc, the place where you set:

ROOT_CF: AGCM.rc

then add:

ESMF_PINFLAG: PET

That should select at run time the same option that worked in feature/v2.53.0-hardcode-pet.

Once you confirm it works, I will issue a v2.53.3 for you.
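A minimal shell sketch of applying that change, assuming CAP.rc sits in the run directory; the file name and key are taken from the instructions above, and the duplicate guard is just illustrative.

# Sketch only: append the run-time pin flag to CAP.rc if it is not already set.
cd /path/to/run/directory        # assumed location of CAP.rc
grep -q '^ESMF_PINFLAG:' CAP.rc || printf 'ESMF_PINFLAG: PET\n' >> CAP.rc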

@RatkoVasic-NOAA
Collaborator

RatkoVasic-NOAA commented May 3, 2025

@mathomp4 - bad news. It doesn't work with the feature/add-esmf-pin-2.53.3 branch.
This is what the top of my CAP.rc file looks like:

MAPLROOT_COMPNAME: AERO
ROOT_NAME: AERO
ROOT_CF:    AERO.rc
ESMF_PINFLAG: PET
HIST_CF:    AERO_HISTORY.rc
EXTDATA_CF: AERO_ExtData.rc
REPORT_THROUGHPUT: .false.
USE_SHMEM: 0

And the error message is:

 53: forrtl: severe (174): SIGSEGV, segmentation fault occurred
 53: Image              PC                Routine            Line        Source
 53: libmpi_intel.so.1  00007F52E85BD460  PMPI_Win_allocate     Unknown  Unknown
 53: fv3.exe            000000000295A83C  mapl_base_mp_mapl         286  Base_Base_implementation.F90
 53: fv3.exe            00000000025683A6  mapl_genericmod_m        6784  MAPL_Generic.F90
 53: fv3.exe            000000000258EB11  mapl_genericmod_m        6413  MAPL_Generic.F90
 53: fv3.exe            00000000025BB912  enericmodmapl_gen        1560  MAPL_Generic.F90
 53: fv3.exe            00000000025B3F72  mapl_genericmod_m        1126  MAPL_Generic.F90
 53: fv3.exe            000000000206B538  du2g_gridcompmod_         477  DU2G_GridCompMod.F90

When I run with the feature/v2.53.0-hardcode-pet branch, everything works.

@mathomp4

mathomp4 commented May 4, 2025

@RatkoVasic-NOAA Dang. Okay. That seems to say somehow the flag isn't being set. We'll take a look...

@mathomp4

mathomp4 commented May 4, 2025

@RatkoVasic-NOAA I just decided to put in some prints, and it should be working. Everything I see shows MAPL uses ESMF_PIN_DE_TO_PET if you set ESMF_PINFLAG: PET. Indeed, I printed the value right before we actually pass pinflag to ESMF, and it's ESMF_PIN_DE_TO_PET.

I can also confirm the flag does do something. I tried setting it to ESMF_PIN_DE_TO_VAS and our model crashes. So, the runtime use is valid.

Can you now try:

feature/add-esmf-pin-2.53.3-prints

and see what happens. I just want to see what it prints out in your log. NOTE: This will be one print per process, so, well, it'll be voluminous!

@mathomp4

mathomp4 commented May 5, 2025

@RatkoVasic-NOAA We think we found the issue. The prints branch will probably fail much the same. Hopefully I can push a fix soon, once we test.

ETA: Yes. We think we have it. @RatkoVasic-NOAA Please pull feature/add-esmf-pin-2.53.3 and try again. I've pushed the change to it. (We mistakenly put the "set the pin" code in a bit of MAPL that we call, but not a bit of MAPL you call.)

@RatkoVasic-NOAA
Collaborator

Shoot, this was my message from yesterday, but I didn't press the button to submit :-)
Were you expecting this output:

143:  enter get_nggps_ic is=  65 ie=  96 js=  85 je=  96 isd=  62 ied=  99 jsd=  82 jed=  99
143:  Base_Base.F90         840 From Get: pinflag_global: ESMF_PIN_DE_TO_SSI_CONTIG
143:  Base_Base.F90         841 From Get: pinflag: ESMF_PIN_DE_TO_SSI_CONTIG
143:  Base_Base_implementation.F90         175 pinflag where used:
143:  ESMF_PIN_DE_TO_SSI_CONTIG

Sure, I'm going to try the new branch.

@mathomp4

mathomp4 commented May 5, 2025

@RatkoVasic-NOAA Yeah, that pretty much shows you were never calling the setter in your workflow (as SSI_CONTIG is the default). The updated branch should now fix that.

@RatkoVasic-NOAA
Collaborator

@mathomp4 - good news. First test passed. I'm now running all tests that call MAPL. Fingers crossed.

@mathomp4

mathomp4 commented May 5, 2025

@mathomp4 - good news. First test passed. I'm now running all tests that call MAPL. Fingers crossed.

Huzzah! If all pass, we'll get out a v2.53.3 for you ASAP.

@RatkoVasic-NOAA
Collaborator

It passed all tests on Gaea-C6!!!
Let me test one more machine (Hera), with one test, just to make sure that we still have bit-identical results with [email protected].

@RatkoVasic-NOAA
Collaborator

@mathomp4 I can confirm that Gaea-C6 passed all tests with the feature/add-esmf-pin-2.53.3 branch. Also, on Hera, I got bit-identical results comparing [email protected] vs. feature/add-esmf-pin-2.53.3.
This includes the necessary change in CAP.rc: ESMF_PINFLAG: PET
Looking forward to the new tag!

@mathomp4

mathomp4 commented May 5, 2025

@RatkoVasic-NOAA The tag has been issued:

https://github.com/GEOS-ESM/MAPL/releases/tag/v2.53.3

and I have a PR into spack:

spack/spack#50306

We'll also get this into develop so that MAPL v2.56.0 will have it. Let us know if it needs to be ported to any other MAPL minor versions.

@RatkoVasic-NOAA
Collaborator

@mathomp4 thank you!
