Skip to content

Improvement of delayed allocation mechanism #105

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 50 commits into
base: main
Choose a base branch
from

Conversation

dareg
Copy link
Collaborator

@dareg dareg commented Aug 1, 2025

In the current field api version, the delayed allocation mechanism is very basic. One can ask to delay the allocation of the pointer to the CPU data and the pointer to the GPU data but the two are still allocated at the same time.
With this PR the delayed allocation mechanism is improved, and the the allocations of the CPU pointer and the GPU pointer are entirely separated. If the data is never accessed on CPU, then the CPU pointer is never allocated, same on the GPU side.

Main changes:

  • delayed mechanism rewritten
  • new way of managing field api status. No more get_status with bit comparison. Can now use proper add_status, remove_status, and is_status subroutines
  • store lbound and ubound in wrapper field (was necessary and simplify the code since it reduce the difference between wrapper and owner)
  • get_view is now callable even when the data are not on CPU or not allocated. They will be moved or allocated if necessary by get_view
  • update NVHPC compiler version used in CI because of a bug in test_gang. Was 23.5, is now updated to 24.5, first version to compile it properly.

dareg added 30 commits June 6, 2025 15:57
It's necessary to call set_device_dirty in get_view, because when called
from a child of a parent (in a gang), the child will call the overloaded
set_device_dirty which also call set_device_dirty on the parent.
Before this modification, the delayed option let the user allocate the
data on CPU and GPU after the call of field_new, but the two were still
happening at the same time.
With this modification, the allocation on CPU and the allocation on GPU
are separated. To do so, the satus system had to be slighty modified
since the whole library used to assume that the two data pointers were
allocated at the same time. Finally the usege of the status system has
been simplified with the addition of three helper functions (add, remove,
is) that let us modify or query the status without having to do bits
comparisons in the middle of the code.
The checks in these test-cases assumed the data pointers werea allocated
in the field_new, but with implicit delayed it is not necessary true.
Add a call to get_host_data to force the allocation of the pointers.
@awnawab awnawab added the approved-for-ci Approved to run hpc-ci label Aug 4, 2025
@github-actions github-actions bot removed the approved-for-ci Approved to run hpc-ci label Aug 4, 2025
@awnawab awnawab added the approved-for-ci Approved to run hpc-ci label Aug 4, 2025
Copy link
Collaborator

@awnawab awnawab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks @dareg for this contribution 🙏 Most of the changes look fine to me, and some things like the more explicit ADD_STATUS/REMOVE_STATUS are big and very welcome improvements. I've left a few minor comments, mostly about redundant constants or code paths now being removed.

The only major disagreement I have is about the changes to %GET_VIEW. The abort there is an incredibly useful safety mechanism, and being forced to synchronise explicitly before calls to %GET_VIEW has saved us from many many painful debugging sessions. In a codebase as complex as the IFS, I think enforcing explicit synchronisation is a better, safer and simpler design choice.

As far as I understand your work, reverting the changes to %GET_VIEW shouldn't affect any of the new delayed allocation functionality, and you would just have to do the synchronisation explicitly before calling %GET_VIEW.

@@ -277,7 +271,7 @@ CONTAINS
SELF%UBOUNDS(${ft.rank}$) = OML_MAX_THREADS ()
ENDIF

CALL SELF%SET_STATUS (UNALLOCATED)
CALL SELF%SET_STATUS (UNINITIALIZED)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the UNALLOCATED status is now defunct? If so, we should delete it and all the other constants can be shifted right by 1.

CALL FIELD_ABORT ("GET_VIEW WAS CALLED, BUT DATA IS NOT PRESENT ON HOST")
${ft.type}$, POINTER :: ZPTR(${ft.shape}$)

!When the first thread reach this line it checks if the data are available
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%GET_VIEW calling abort has been a super useful safety feature, and it's often highlighted examples where we've forgotten to synchronise a field. The abort in %GET_VIEW has also helped us catch several OpenMP firstprivatization errors, which are notoriously hard to track down.

I think an explicit API is always better than an implicit API, and rather than %GET_VIEW implicitly synchronising or allocating the field, we should force the user to explicitly SYNCHRONISE. With this kind of implicit synchronisation, data transfers and allocations become much harder to track. Based on the above, I'm not in favour of the changes to %GET_VIEW.

@@ -483,12 +491,14 @@ $:offload_macros.detach(ptr=['SELF%DEVPTR'], indent=6)
${ft.type}$, POINTER, INTENT(INOUT) :: PTR(${ft.shape}$)
INTEGER (KIND=JPIM), OPTIONAL, INTENT(IN) :: QUEUE

IF(SELF%GET_STATUS ()==UNALLOCATED)THEN
IF(.NOT. ASSOCIATED(SELF%PTR))THEN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This confirms doesn't it that the UNALLOCATED status is now redundant and can be removed?

CALL SELF%COPY_DATA (ND2H, QUEUE)
CALL SELF%SET_STATUS (IOR (SELF%GET_STATUS (), NHSTFRESH))
CALL SELF%ADD_STATUS (NHSTFRESH)
CALL SELF%REMOVE_STATUS (UNINITIALIZED_CPU)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The removal of the UNINITIALIZED_CPU status should be unconditional, since we are always returning a pointer to access the host data.

@@ -533,7 +543,8 @@ $:offload_macros.detach(ptr=['SELF%DEVPTR'], indent=6)
SUBROUTINE ${ftn}$_CREATE_DEVICE_DATA (SELF)
CLASS(${ftn}$) :: SELF

CALL DEV_ALLOCATE_HST (DEV=SELF%DEVPTR, HST=SELF%PTR, MAP_DEVPTR=SELF%MAP_DEVPTR)
CALL DEV_ALLOCATE_DIM (DEV=SELF%DEVPTR, LBOUNDS=SELF%LBOUNDS, UBOUNDS=SELF%UBOUNDS, MAP_DEVPTR=SELF%MAP_DEVPTR)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEV_ALLOCATE_HST is now redundant and should therefore be removed.

ENDIF
PTR (${ft.lbptr}$) => SELF%DEVPTR (${','.join(':' for _ in range(ft.rank))}$)
IF (IAND (MODE, NWR) /= 0) THEN
CALL SELF%SET_STATUS (IAND (SELF%GET_STATUS (), NOT (NHSTFRESH)))
CALL SELF%REMOVE_STATUS (NHSTFRESH)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do really like the more explicit ADD_STATUS and REMOVE_STATUS calls, much more intuitive and readable than the binary operators. Nice!

@classmethod
def reinit_gpu_context(cls):
"""
Used to force reinitialization of GPU device.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really appreciate the detailed documentation 🙏


POOL_BLOCK_SIZE = 1024_INT64*SIZEOF(REAL(1., KIND=JPRD))

!... Check creation of memory pool
CALL FIELD_NEW(F1, UBOUNDS=[256], PERSISTENT=.TRUE., POOLED=.TRUE.)
CALL F1%GET_HOST_DATA_RDWR(PTR1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really necessary? Since the default behaviour is to allocate, the host-side allocation should be performed regardless of the call to %GET_HOST_DATA_RDWR?

@dareg
Copy link
Collaborator Author

dareg commented Aug 5, 2025

Hello @awnawab, thanks for your feedbacks. I will address them in the next days.

Regarding the new get_view behaviour, Philippe would really like to keep it. So would it be fine for you if we keep the new behaviour, but make it optional? By default get_view will abort when the data are not on CPU.

@awnawab
Copy link
Collaborator

awnawab commented Aug 5, 2025

Hi @dareg. Sure that works for me, as long as we keep the current behaviour as the default. Thanks!

@github-actions github-actions bot removed the approved-for-ci Approved to run hpc-ci label Aug 6, 2025
@awnawab awnawab added the approved-for-ci Approved to run hpc-ci label Aug 7, 2025
! - if GET_VIEW_ABORT is TRUE then GET_VIEW will abort the program
! - otherwise it then reachs a critical area, and check again in case another
! thread has moved them meanwhile. And if they are still not, then they are moved.
IF(.NOT. ASSOCIATED(SELF%PTR) &
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A wrapped 0-sized field can also return .FALSE. for this association test, and we can't tell such a field apart from a delayed allocation owned field. In this case we'll wrongly call FIELD_ABORT. We have quite a few wrappers around 0-sized fields, and we should be able to safely call GET_VIEW on them, as typically the block dimension is not the one with zero size.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cleanest workaround I can think of is to reinstate an UNALLOCATED status, and mark a delayed allocation field as such. Then the condition here would be SELF%IS_STATUS(UNALLOCATED) .OR. (SELF%IS_STATUS(NDEVFRESH) .AND. .NOT. SELF%IS_STATUS(NHSTFRESH)). All the checks in the new delayed allocation logic should then be based on the UNALLOCATED status rather than checking the association status of SELF%PTR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved-for-ci Approved to run hpc-ci contributor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants