-
Notifications
You must be signed in to change notification settings - Fork 12
Improvement of delayed allocation mechanism #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
It's necessary to call set_device_dirty in get_view, because when called from a child of a parent (in a gang), the child will call the overloaded set_device_dirty which also call set_device_dirty on the parent.
Before this modification, the delayed option let the user allocate the data on CPU and GPU after the call of field_new, but the two were still happening at the same time. With this modification, the allocation on CPU and the allocation on GPU are separated. To do so, the satus system had to be slighty modified since the whole library used to assume that the two data pointers were allocated at the same time. Finally the usege of the status system has been simplified with the addition of three helper functions (add, remove, is) that let us modify or query the status without having to do bits comparisons in the middle of the code.
The checks in these test-cases assumed the data pointers werea allocated in the field_new, but with implicit delayed it is not necessary true. Add a call to get_host_data to force the allocation of the pointers.
This subroutine was testing that the program did not work when calling get_view on data recently updated on GPU. But get_view is now able to handle this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks @dareg for this contribution 🙏 Most of the changes look fine to me, and some things like the more explicit ADD_STATUS/REMOVE_STATUS
are big and very welcome improvements. I've left a few minor comments, mostly about redundant constants or code paths now being removed.
The only major disagreement I have is about the changes to %GET_VIEW
. The abort there is an incredibly useful safety mechanism, and being forced to synchronise explicitly before calls to %GET_VIEW
has saved us from many many painful debugging sessions. In a codebase as complex as the IFS, I think enforcing explicit synchronisation is a better, safer and simpler design choice.
As far as I understand your work, reverting the changes to %GET_VIEW
shouldn't affect any of the new delayed allocation functionality, and you would just have to do the synchronisation explicitly before calling %GET_VIEW
.
@@ -277,7 +271,7 @@ CONTAINS | |||
SELF%UBOUNDS(${ft.rank}$) = OML_MAX_THREADS () | |||
ENDIF | |||
|
|||
CALL SELF%SET_STATUS (UNALLOCATED) | |||
CALL SELF%SET_STATUS (UNINITIALIZED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean the UNALLOCATED
status is now defunct? If so, we should delete it and all the other constants can be shifted right by 1.
src/core/field_RANKSUFF_module.fypp
Outdated
CALL FIELD_ABORT ("GET_VIEW WAS CALLED, BUT DATA IS NOT PRESENT ON HOST") | ||
${ft.type}$, POINTER :: ZPTR(${ft.shape}$) | ||
|
||
!When the first thread reach this line it checks if the data are available |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%GET_VIEW
calling abort has been a super useful safety feature, and it's often highlighted examples where we've forgotten to synchronise a field. The abort in %GET_VIEW
has also helped us catch several OpenMP firstprivatization errors, which are notoriously hard to track down.
I think an explicit API is always better than an implicit API, and rather than %GET_VIEW
implicitly synchronising or allocating the field, we should force the user to explicitly SYNCHRONISE
. With this kind of implicit synchronisation, data transfers and allocations become much harder to track. Based on the above, I'm not in favour of the changes to %GET_VIEW
.
@@ -483,12 +491,14 @@ $:offload_macros.detach(ptr=['SELF%DEVPTR'], indent=6) | |||
${ft.type}$, POINTER, INTENT(INOUT) :: PTR(${ft.shape}$) | |||
INTEGER (KIND=JPIM), OPTIONAL, INTENT(IN) :: QUEUE | |||
|
|||
IF(SELF%GET_STATUS ()==UNALLOCATED)THEN | |||
IF(.NOT. ASSOCIATED(SELF%PTR))THEN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This confirms doesn't it that the UNALLOCATED
status is now redundant and can be removed?
CALL SELF%COPY_DATA (ND2H, QUEUE) | ||
CALL SELF%SET_STATUS (IOR (SELF%GET_STATUS (), NHSTFRESH)) | ||
CALL SELF%ADD_STATUS (NHSTFRESH) | ||
CALL SELF%REMOVE_STATUS (UNINITIALIZED_CPU) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The removal of the UNINITIALIZED_CPU
status should be unconditional, since we are always returning a pointer to access the host data.
@@ -533,7 +543,8 @@ $:offload_macros.detach(ptr=['SELF%DEVPTR'], indent=6) | |||
SUBROUTINE ${ftn}$_CREATE_DEVICE_DATA (SELF) | |||
CLASS(${ftn}$) :: SELF | |||
|
|||
CALL DEV_ALLOCATE_HST (DEV=SELF%DEVPTR, HST=SELF%PTR, MAP_DEVPTR=SELF%MAP_DEVPTR) | |||
CALL DEV_ALLOCATE_DIM (DEV=SELF%DEVPTR, LBOUNDS=SELF%LBOUNDS, UBOUNDS=SELF%UBOUNDS, MAP_DEVPTR=SELF%MAP_DEVPTR) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DEV_ALLOCATE_HST
is now redundant and should therefore be removed.
ENDIF | ||
PTR (${ft.lbptr}$) => SELF%DEVPTR (${','.join(':' for _ in range(ft.rank))}$) | ||
IF (IAND (MODE, NWR) /= 0) THEN | ||
CALL SELF%SET_STATUS (IAND (SELF%GET_STATUS (), NOT (NHSTFRESH))) | ||
CALL SELF%REMOVE_STATUS (NHSTFRESH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do really like the more explicit ADD_STATUS
and REMOVE_STATUS
calls, much more intuitive and readable than the binary operators. Nice!
@classmethod | ||
def reinit_gpu_context(cls): | ||
""" | ||
Used to force reinitialization of GPU device. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really appreciate the detailed documentation 🙏
|
||
POOL_BLOCK_SIZE = 1024_INT64*SIZEOF(REAL(1., KIND=JPRD)) | ||
|
||
!... Check creation of memory pool | ||
CALL FIELD_NEW(F1, UBOUNDS=[256], PERSISTENT=.TRUE., POOLED=.TRUE.) | ||
CALL F1%GET_HOST_DATA_RDWR(PTR1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really necessary? Since the default behaviour is to allocate, the host-side allocation should be performed regardless of the call to %GET_HOST_DATA_RDWR
?
Hello @awnawab, thanks for your feedbacks. I will address them in the next days. Regarding the new get_view behaviour, Philippe would really like to keep it. So would it be fine for you if we keep the new behaviour, but make it optional? By default get_view will abort when the data are not on CPU. |
Hi @dareg. Sure that works for me, as long as we keep the current behaviour as the default. Thanks! |
It helps to keep the test-case functional whatever the option is.
! - if GET_VIEW_ABORT is TRUE then GET_VIEW will abort the program | ||
! - otherwise it then reachs a critical area, and check again in case another | ||
! thread has moved them meanwhile. And if they are still not, then they are moved. | ||
IF(.NOT. ASSOCIATED(SELF%PTR) & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A wrapped 0-sized field can also return .FALSE.
for this association test, and we can't tell such a field apart from a delayed allocation owned field. In this case we'll wrongly call FIELD_ABORT
. We have quite a few wrappers around 0-sized fields, and we should be able to safely call GET_VIEW
on them, as typically the block dimension is not the one with zero size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cleanest workaround I can think of is to reinstate an UNALLOCATED
status, and mark a delayed allocation field as such. Then the condition here would be SELF%IS_STATUS(UNALLOCATED) .OR. (SELF%IS_STATUS(NDEVFRESH) .AND. .NOT. SELF%IS_STATUS(NHSTFRESH))
. All the checks in the new delayed allocation logic should then be based on the UNALLOCATED
status rather than checking the association status of SELF%PTR
.
In the current field api version, the delayed allocation mechanism is very basic. One can ask to delay the allocation of the pointer to the CPU data and the pointer to the GPU data but the two are still allocated at the same time.
With this PR the delayed allocation mechanism is improved, and the the allocations of the CPU pointer and the GPU pointer are entirely separated. If the data is never accessed on CPU, then the CPU pointer is never allocated, same on the GPU side.
Main changes: