Issue
Following up from #19249 (comment):
There is a race condition in the stack-walker, specifically when getting a method's jitInfo/artifact/metadata.
The AVLTree holding the jitInfo hash-tables is synchronized with VMAccess, and the hash-tables themselves are synchronized with a monitor.
When a stack-walker accesses a hash-table to retrieve a jitInfo (at every JIT frame during a walk, on a cache miss), it does not hold the monitor. So if the hash-table is being modified concurrently (e.g. the JIT inserting a newly compiled method's jitInfo), the stack-walker can retrieve incorrect info.
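A minimal sketch of the hazard (all type and helper names here are hypothetical, not the actual OpenJ9 code): the JIT mutates the hash-table under its monitor, while the stack-walker reads the same table with no lock at all.

```c
/* Sketch of the race; type and helper names are hypothetical. */

/* JIT compilation thread: inserts under the hash-table monitor. */
void jit_insert_artifact(JitHashTable *table, JitArtifact *artifact)
{
    omrthread_monitor_enter(table->monitor);
    hash_table_add(table, artifact);    /* may rehash/relink buckets */
    omrthread_monitor_exit(table->monitor);
}

/* Stack-walker: looks up with NO synchronization (the bug). */
JitArtifact *walker_find_artifact(JitHashTable *table, uintptr_t pc)
{
    /* If a concurrent insert is relinking buckets, this unlocked read can
     * follow a stale or partially-written link and return incorrect info. */
    return hash_table_find(table, pc);
}
```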
Train of thought
Current state
- CodeCache changes acquire VMExclusiveAccess (modifies the AVLTree)
- jitInfo changes acquire the monitor (modifies the hash-table)
- GC gets VMAccess (not exclusive)
- Stack-walkers don't use any synchronization when getting jitInfo
- No need to synchronize AVLTree access when adding/accessing/removing jitInfo, only the hash-table (as sketched below).
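A hedged sketch of the two-level structure and who currently holds what (again with hypothetical names):

```c
/* Sketch of the two-level structure and current locking; names are hypothetical. */
typedef struct CodeCacheEntry {
    uintptr_t startPC;        /* code range start (AVLTree key) */
    uintptr_t endPC;          /* code range end */
    JitHashTable *artifacts;  /* per-code-cache jitInfo hash-table */
} CodeCacheEntry;

/* The AVLTree is only mutated under VMExclusiveAccess, so traversing it
 * while holding VMAccess is already safe. */
JitArtifact *find_artifact(AVLTree *tree, uintptr_t pc)
{
    CodeCacheEntry *entry = avl_search(tree, pc);  /* safe under VMAccess */
    if (NULL == entry) {
        return NULL;
    }
    /* Only this hash-table lookup needs extra synchronization; the
     * stack-walker currently performs it without the monitor. */
    return hash_table_find(entry->artifacts, pc);
}
```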
Solutions
1. Use VMExclusiveAccess/VMAccess for jitInfo changes/access (hash-table)
   - Uses the existing reader/writer synchronization infrastructure
   - No need for the monitor if hash-table access uses VMAccess for synchronization
   - Using VMAccess for hash-table synchronization intrudes on/may hold up other VMExclusiveAccess actions unnecessarily
   - Least intrusive solution(?)
2. Add the monitor to the stack-walker
   - "Obvious" solution, not performant (serializes different stack-walkers getting jitInfo)
   - Might be better than expected if the cache hits most of the time and the monitor is rarely needed when stack walking(?)
   - Walking a parked virtual thread's continuation does not use the cache.
3. Use reader/writer synchronization (instead of the normal monitor) for artifact modification/access (sketched after this list)
   - "Obvious" solution, improved
   - Change the monitor to a reader/writer mutex
   - Overhead of reader/writer synchronization(?)
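A minimal sketch of solution 3, assuming the omrthread reader/writer mutex API (omrthread_rwmutex_enter_read and friends); the table and helper names are hypothetical:

```c
/* Sketch of solution 3; table and helper names are hypothetical. */

/* JIT insert/remove: exclusive (writer) access. */
void jit_insert_artifact(JitHashTable *table, JitArtifact *artifact)
{
    omrthread_rwmutex_enter_write(table->rwmutex);
    hash_table_add(table, artifact);
    omrthread_rwmutex_exit_write(table->rwmutex);
}

/* Stack-walker lookup: shared (reader) access, so concurrent stack-walkers
 * only block on writers, never on each other. */
JitArtifact *walker_find_artifact(JitHashTable *table, uintptr_t pc)
{
    JitArtifact *artifact = NULL;
    omrthread_rwmutex_enter_read(table->rwmutex);
    artifact = hash_table_find(table, pc);
    omrthread_rwmutex_exit_read(table->rwmutex);
    return artifact;
}
```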
I ended up prototyping solution 3.
Experiments
Drivers
Testing five different drivers:
- A: Base/Nightly build
- B: Replace the hash-table monitor with a reader/writer mutex
  - Insert/remove artifact takes the writer mutex
  - Retrieve artifact takes the reader mutex
  - Does not solve the race condition; the stack-walker does not take the mutex
  - Evaluates the reader/writer mutex overhead
- C: Stack-walker takes the reader mutex when needed (contrasted with D in the sketch after this list)
  - Only takes the mutex when getting a jitInfo/artifact, on a cache miss when going to the hash-table
  - Possibly many short mutex acquire/release pairs
- D: Stack-walker always holds the reader mutex
  - Acquires the reader mutex for the whole stack-walk
  - Few long mutex acquire/release pairs
  - Holds reader access longer than needed
- E: Solution 2 (use the same monitor in the stack-walker)
  - Serializes stack-walkers getting jitInfo
  - Only when accessing the hash-table (cache miss)
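To make the difference between C and D concrete, a hedged sketch of both locking patterns (walker state, cache, and helper names are hypothetical):

```c
/* Sketch of drivers C and D; walker state and helpers are hypothetical. */

/* Driver C: take the reader mutex only on a cache miss, per lookup. */
JitArtifact *lookup_driver_c(WalkState *state, uintptr_t pc)
{
    JitArtifact *artifact = cache_lookup(state->cache, pc);
    if (NULL == artifact) {                        /* cache miss */
        omrthread_rwmutex_enter_read(state->rwmutex);
        artifact = hash_table_find(state->artifacts, pc);
        omrthread_rwmutex_exit_read(state->rwmutex);
        cache_insert(state->cache, pc, artifact);
    }
    return artifact;    /* many short acquire/release pairs */
}

/* Driver D: hold the reader mutex for the entire walk. */
void walk_driver_d(WalkState *state)
{
    omrthread_rwmutex_enter_read(state->rwmutex);  /* one long hold */
    while (walk_next_frame(state)) {
        /* jitInfo lookups here need no per-lookup locking,
         * but writers are blocked for the whole walk. */
    }
    omrthread_rwmutex_exit_read(state->rwmutex);
}
```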
Results (UPDATED)
| SPECjbb2015 driver | max-jOPS | Diff vs. A |
|---|---|---|
| A | 67801 | 100.00% |
| B | 68292 | 100.73% |
| C | 67699 | 99.85% |
| D | 36185 | 53.37% |
| E | 67894 | 100.14% |
Currently looking at profiles; the only immediately apparent observation is that in D, 25% of the time is spent in omrthread_spinlock_acquire.
Conclusion
It seems like C (solution 3) and E (solution 2) are valid solutions.
The best solution is C: taking the mutex only when needed, even if that means more frequent acquires/releases.
The overhead of using the reader/writer mutex instead of the monitor alone is 4.5%, and using the mutex to actually solve the race condition adds another 1%.
It might be worth prototyping and measuring solution 2, as the difference between C and D might indicate a high cache-hit rate, which would make solution 2 less problematic. But with virtual threads the cache is not used, so having the GC stack-walk many vThreads while synchronizing on the monitor could be problematic, depending on the number of parallel GC threads/stack-walkers.
I'll analyze the profiles of the different drivers to see if anything might help reduce the 4.5% overhead introduced by the reader/writer mutex change.