
Threads assume they will record leave events in LIFO order (can be violated for tasks) #12

Open
@adamtuft

Limitation

Each thread maintains a stack of OTF2 region definitions for the regions it visits, so it assumes that it will always record leave events for those regions in LIFO order. Any callback that corresponds to entering or leaving a region invokes trace_event_enter or trace_event_leave, respectively.

Signatures:

void trace_event_enter(trace_location_def_t *self, trace_region_def_t *region);
void trace_event_leave(trace_location_def_t *self);

In trace_event_enter:

/* Push region onto location's region stack */
stack_push(self->rgn_stack, (data_item_t) {.ptr = region});

In trace_event_leave:

/* For the region-end event, the region was previously pushed onto the 
   location's region stack so should now be at the top (as long as regions
   are correctly nested) */
trace_region_def_t *region = NULL;
stack_pop(self->rgn_stack, (data_item_t*) &region);

Problem

This is a problem because threads can switch between partially complete tasks. For example, suppose thread x executes the untied task p, which enters a task-scheduling region. Thread x records a region-enter event and pushes the region definition onto its own stack, and then p is suspended. If thread y then resumes and completes p, y records the leave event for the task-scheduling region that x entered, but pops the region definition from y's stack rather than x's. The region-leave event is therefore not recorded by the thread that recorded the region-enter event, nor against the correct region definition, and both threads appear to have entered a different number of regions than they left.
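The scenario above can be replayed with a minimal, single-process sketch. The types and functions here (region_t, thread_loc_t, enter, leave, replay_untied_switch) are simplified, hypothetical stand-ins for illustration only, not Otter's actual definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a region definition (not Otter's real type) */
typedef struct { const char *name; } region_t;

#define MAX_DEPTH 16

/* Hypothetical stand-in for a thread's location: one region stack per thread */
typedef struct {
    region_t *stack[MAX_DEPTH];
    size_t    depth;
} thread_loc_t;

/* Mirrors trace_event_enter: push the region onto the thread's own stack */
static void enter(thread_loc_t *self, region_t *region) {
    assert(self->depth < MAX_DEPTH);
    self->stack[self->depth++] = region;
}

/* Mirrors trace_event_leave: pop whatever is on top of the thread's own
   stack (NULL if the stack is empty) */
static region_t *leave(thread_loc_t *self) {
    return self->depth > 0 ? self->stack[--self->depth] : NULL;
}

/* Replays the untied-task scenario and returns the region that thread y
   records its leave event against */
static region_t *replay_untied_switch(thread_loc_t *x, thread_loc_t *y,
                                      region_t *y_outer, region_t *task_sched) {
    enter(y, y_outer);     /* y is already inside some unrelated region      */
    enter(x, task_sched);  /* x runs p, enters the region, then p suspends   */
    return leave(y);       /* y resumes and completes p: pops y's own top    */
}
```

With two empty locations x and y, replay_untied_switch returns y_outer rather than task_sched, leaves task_sched stranded on x's stack, and leaves y's stack empty even though y never left its own region.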

A similar error is possible with tied tasks, where region-leave events can be matched to the wrong region-enter events in the trace. A thread will eventually record a region-leave event for every region-enter event (since it must eventually complete all the tied tasks it started), but task scheduling means the order of these events is not fixed. I suspect a workaround is possible for tied tasks during post-processing: break up each thread's event sequence at task-switch events, then stitch each task's sub-sequences back together in the correct order.
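One possible reading of that post-processing workaround is sketched below. It assumes the trace for a single thread contains task-switch events that identify the task being switched to; the event layout (event_t), the small integer task and region ids, and match_leaves are all hypothetical and only illustrate the idea:

```c
#include <assert.h>
#include <stddef.h>

typedef enum { EV_TASK_SWITCH, EV_ENTER, EV_LEAVE } ev_kind_t;

/* Hypothetical per-thread trace event */
typedef struct {
    ev_kind_t kind;
    int       task;    /* EV_TASK_SWITCH: id of the task switched to */
    int       region;  /* EV_ENTER: id of the region entered         */
} event_t;

#define MAX_TASKS 8
#define MAX_DEPTH 16

/* Walk one thread's events in order, attributing each enter/leave to the
   task that was running at the time. For every EV_LEAVE at index i,
   leave_region[i] receives the region it should be matched with; all other
   entries are set to -1. Returns 0 on success, -1 on malformed input. */
static int match_leaves(const event_t *events, size_t n, int *leave_region) {
    int    stack[MAX_TASKS][MAX_DEPTH];
    size_t depth[MAX_TASKS] = {0};
    int    current = 0;  /* assumption: the thread starts in task 0 */

    assert(events != NULL && leave_region != NULL);
    for (size_t i = 0; i < n; i++) {
        leave_region[i] = -1;
        switch (events[i].kind) {
        case EV_TASK_SWITCH:
            if (events[i].task < 0 || events[i].task >= MAX_TASKS) return -1;
            current = events[i].task;
            break;
        case EV_ENTER:
            if (depth[current] == MAX_DEPTH) return -1;
            stack[current][depth[current]++] = events[i].region;
            break;
        case EV_LEAVE:
            if (depth[current] == 0) return -1;  /* leave without enter */
            leave_region[i] = stack[current][--depth[current]];
            break;
        }
    }
    return 0;
}
```

For a sequence in which task 0 enters region 10, the thread switches to task 1 which enters region 20, and the thread then switches back to task 0 which leaves, a single per-thread stack would pair that leave with region 20, whereas this per-task bookkeeping pairs it with region 10.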

Possible Fixes

As this limitation stems from a low-level design decision, I think fixing it will need a fairly significant rewrite of Otter. Ideas include:

  • Have tasks, rather than threads, maintain the stack of regions encountered. This should be possible because there is always a task being executed (an implicit task if not an explicit one), so a thread can query the current task's stack to record events against the correct region definition.
  • Represent all regions by singleton definitions, except for those which can be given persistent definitions (parallel and task regions only, AFAIK). I don't like this idea, as it might look strange in the trace if a program appears to have exactly one instance of each region.
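A rough sketch of the first idea (per-task region stacks) follows. It is not a proposal for Otter's actual data layout: task_t, thread_loc_t, task_switch, enter and leave are hypothetical names, and the sketch assumes each thread always has a current task (implicit or explicit) and that a task runs on at most one thread at a time:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a region definition */
typedef struct { const char *name; } region_t;

#define MAX_DEPTH 16

/* The region stack now lives in the task, not the thread */
typedef struct {
    region_t *stack[MAX_DEPTH];
    size_t    depth;
} task_t;

/* A thread only tracks which task it is currently executing */
typedef struct {
    task_t *current_task;  /* assumed never NULL: implicit task otherwise */
} thread_loc_t;

/* Called when the thread begins or resumes a task */
static void task_switch(thread_loc_t *self, task_t *task) {
    assert(task != NULL);
    self->current_task = task;
}

/* Region-enter: push onto the current task's stack */
static void enter(thread_loc_t *self, region_t *region) {
    task_t *t = self->current_task;
    assert(t != NULL && t->depth < MAX_DEPTH);
    t->stack[t->depth++] = region;
}

/* Region-leave: pop from the current task's stack (NULL if empty) */
static region_t *leave(thread_loc_t *self) {
    task_t *t = self->current_task;
    assert(t != NULL);
    return t->depth > 0 ? t->stack[--t->depth] : NULL;
}
```

In the untied-task scenario, thread x would call task_switch to p and enter the task-scheduling region, and thread y would later call task_switch to p before calling leave, so the leave event is recorded against the region that p entered regardless of which thread resumes it. The enter and leave events still land on different threads' locations, so this addresses the region-definition mismatch only.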

    Labels

    limitation: This issue is a consequence of a deliberate design choice
