[Performance] Solve high memory usage issue during model compilation using OpenVINO backend on Keras 3 #31482
Conversation
// the order is important
const char* enable_einsum = std::getenv("OV_ENABLE_EINSUM_DECOMPOSITION");
if (enable_einsum) {
    REGISTER_PASS(manager, EinsumDecomposition)
}
I don't think this is a good way to fix this. Doing this in MOC means we will have decomposed einsum in the IR.
As I understand it, this is really only needed for einsums that have constant inputs, so they can be constant-folded before reaching the plugin. Can we do it differently? Maybe modify this transformation to work only on constant inputs for the offline step? @CuriousPanCake
@mvafin
I updated it to check if at least one of the inputs is a constant, and it worked too.
With this change, memory usage went from:
================================================================================
FIXED MEMORY TEST: KERAS GPT2 + OPENVINO
================================================================================
[STAGE] 0_INITIAL: 775.24 MB (swap: 0.00 MB) - Initial state after imports
>>> Loading GPT2 model from preset...
[STAGE] 1_MODEL_LOADED: 2314.67 MB (swap: 0.00 MB) - gpt2_medium_en model loaded (10.0s)
[STAGE] 2_BEFORE_INFERENCE: 2314.67 MB (swap: 0.00 MB) - Before first inference
>>> Running first inference (compilation + execution)...
⏳ Converting Keras -> OPENVINO and compiling...
[STAGE] 3_FIRST_INFERENCE: 4512.82 MB (swap: 0.00 MB) - First inference completed via generate (7.7s)
>>> Second inference (no compilation)...
[STAGE] 4_SECOND_INFERENCE: 4510.38 MB (swap: 0.00 MB) - Second inference (2.0s)
[STAGE] 5_FINAL: 4510.38 MB (swap: 0.00 MB) - Final state
================================================================================
PERFORMANCE RESULTS
================================================================================
✅ Generated text: 'Hello everyone,
We've been busy'
✅ Second generation: 'Testimony before the House Judiciary Committee on April'
Backend: openvino
First inference latency: 7.69s
Second inference latency: 2.045s
Throughput: 0.65 tokens/sec
Speedup: 3.8x
📊 DETAILED MEMORY ANALYSIS:
+---------------------+------------+-------------+--------------+---------------+
| STAGE | RAM (MB) | SWAP (MB) | RAM CHANGE | SWAP CHANGE |
+=====================+============+=============+==============+===============+
| Initial | 775.2 | 0 | - | - |
+---------------------+------------+-------------+--------------+---------------+
| After model load | 2314.7 | 0 | +1539.4 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Before inference | 2314.7 | 0 | +0.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| After 1st inference | 4512.8 | 0 | +2198.1 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| After 2nd inference | 4510.4 | 0 | -2.4 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Final | 4510.4 | 0 | +0.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Peak recorded | 4522.9 | 0 | +3747.7 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
🔍 MAIN MEMORY CONSUMERS:
📚 Model loading: +1539.4 MB RAM +0.0 MB swap (41.2% of total)
⚡ Compilation/inference: +2198.1 MB RAM +0.0 MB swap (58.9% of total)
📈 SUMMARY:
💾 Total RAM growth: +3735.1 MB
💿 Total swap change: +0.0 MB
📊 Peak RAM consumption: +3747.7 MB above initial
🔥 Highest RAM recorded: 4522.9 MB
💿 Peak swap consumption: +0.0 MB above initial
🔥 Highest swap recorded: 0.0 MB
🎯 MEMORY HEALTH CHECK:
❌ CRITICAL: RAM usage 3748 MB is very high (target <1GB)
✅ GOOD: Low peak swap usage 0 MB
🚨 ALERT: Combined memory impact 4523 MB is very high
🎯 Test completed: {'success': True, 'model_loading_mb': 1539.4296875, 'compilation_mb': 2198.1484375, 'total_mb': 3735.13671875, 'peak_mb': 3747.6640625, 'peak_swap_mb': 0.0}
to:
[STAGE] 0_INITIAL: 781.90 MB (swap: 0.00 MB) - Initial state after imports
>>> Loading GPT2 model from preset...
[STAGE] 1_MODEL_LOADED: 2321.91 MB (swap: 0.00 MB) - gpt2_medium_en model loaded (13.4s)
[STAGE] 2_BEFORE_INFERENCE: 2321.91 MB (swap: 0.00 MB) - Before first inference
>>> Running first inference (compilation + execution)...
⏳ Converting Keras -> OPENVINO and compiling...
[STAGE] 3_FIRST_INFERENCE: 3548.79 MB (swap: 0.00 MB) - First inference completed via generate (7.6s)
>>> Second inference (no compilation)...
[STAGE] 4_SECOND_INFERENCE: 3546.42 MB (swap: 0.00 MB) - Second inference (2.7s)
[STAGE] 5_FINAL: 3546.42 MB (swap: 0.00 MB) - Final state
================================================================================
PERFORMANCE RESULTS
================================================================================
✅ Generated text: 'Hello! I'm a student studying computer programming'
✅ Second generation: 'Testimonials
I was a new'
Backend: openvino
First inference latency: 7.62s
Second inference latency: 2.673s
Throughput: 0.92 tokens/sec
Speedup: 2.9x
📊 DETAILED MEMORY ANALYSIS:
+---------------------+------------+-------------+--------------+---------------+
| STAGE | RAM (MB) | SWAP (MB) | RAM CHANGE | SWAP CHANGE |
+=====================+============+=============+==============+===============+
| Initial | 781.9 | 0 | - | - |
+---------------------+------------+-------------+--------------+---------------+
| After model load | 2321.9 | 0 | +1540.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Before inference | 2321.9 | 0 | +0.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| After 1st inference | 3548.8 | 0 | +1226.9 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| After 2nd inference | 3546.4 | 0 | -2.4 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Final | 3546.4 | 0 | +0.0 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
| Peak recorded | 3567.8 | 0 | +2785.9 | +0.0 |
+---------------------+------------+-------------+--------------+---------------+
🔍 MAIN MEMORY CONSUMERS:
📚 Model loading: +1540.0 MB RAM +0.0 MB swap (55.7% of total)
⚡ Compilation/inference: +1226.9 MB RAM +0.0 MB swap (44.4% of total)
📈 SUMMARY:
💾 Total RAM growth: +2764.5 MB
💿 Total swap change: +0.0 MB
📊 Peak RAM consumption: +2785.9 MB above initial
🔥 Highest RAM recorded: 3567.8 MB
💿 Peak swap consumption: +0.0 MB above initial
🔥 Highest swap recorded: 0.0 MB
🎯 MEMORY HEALTH CHECK:
❌ CRITICAL: RAM usage 2786 MB is very high (target <1GB)
✅ GOOD: Low peak swap usage 0 MB
🎯 Test completed: {'success': True, 'model_loading_mb': 1540.0078125, 'compilation_mb': 1226.88671875, 'total_mb': 2764.5234375, 'peak_mb': 2785.86328125, 'peak_swap_mb': 0.0}
@@ -163,7 +164,8 @@ bool ov::pass::MOCTransformations::run_on_model(const std::shared_ptr<ov::Model>
    REGISTER_PASS(manager, PushConstantToSubgraph)
    REGISTER_PASS(manager, ConstantFolding)
    REGISTER_PASS(manager, Validate)

    // the order is important
Please add a better comment explaining before which transformation this should be called.
Done!
if (m_check_const) {
    // Only decompose einsums that have at least one constant input,
    // so the decomposed subgraph can be constant-folded afterwards.
    bool has_const = false;
    for (auto& input : einsum_node->input_values()) {
        auto node_ptr = input.get_node_shared_ptr();
        auto constant_ptr = ov::as_type_ptr<ov::op::v0::Constant>(node_ptr);
        if (constant_ptr) {
            has_const = true;
            break;
        }
    }
    if (!has_const)
        return false;
}
Could you provide more details about the einsum operation you want to optimize? Maybe link to the model code or a picture of the subgraph.
This optimization targets specific Einsum operations in transformer models like GPT-2, where at least one input is a constant tensor. After ConstantFolding, weight matrices become constants, enabling more efficient decomposition patterns.
Specific Einsum Operations Being Optimized:
1. Query-Key Attention Scores Computation:
   - Location: https://github.com/keras-team/keras/blob/master/keras/src/layers/attention/multi_head_attention.py#L493
   - Pattern: `einsum("aecd,abcd->acbe", key, query)`
   - Code: `attention_scores = ops.einsum(self._dot_product_equation, key, query)`
2. Attention-Value Combination:
   - Location: https://github.com/keras-team/keras/blob/master/keras/src/layers/attention/multi_head_attention.py#L509-L511
   - Pattern: `einsum("acbe,aecd->abcd", attention_scores, value)`
   - Code: `attention_output = ops.einsum(self._combine_equation, final_attn_scores, value)`
3. Weight Matrix Projections (Q/K/V Transformations):
   - Location: https://github.com/keras-team/keras/blob/master/keras/src/layers/core/einsum_dense.py#L214
   - Pattern: `einsum("abc,cd->abd", input, weight_matrix)`
   - Code: `x = ops.einsum(self.equation, inputs, self.kernel)`
Optimization Application:
Note: the optimization is only applied when at least one einsum input is constant. In the examples above:
- ✅ Weight Matrix Projections (example 3): `weight_matrix` becomes constant after ConstantFolding → optimization applied
- ❌ Attention Scores (examples 1 & 2): both `key` and `query` are variable tensors → no optimization
For more details and examples, see: https://gist.github.com/Mohamed-Ashraf273/59eddcd120918cb0761ffa5020800d5d
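To make the constant-input case concrete, below is a minimal, self-contained sketch (illustrative code, not part of this PR) that builds the example-3 pattern with a constant weight and runs `EinsumDecomposition` followed by `ConstantFolding`. The header path for the decomposition pass is assumed to be the one used in the current transformations library; verify it against your OpenVINO checkout:

```cpp
#include <openvino/core/model.hpp>
#include <openvino/op/constant.hpp>
#include <openvino/op/einsum.hpp>
#include <openvino/op/parameter.hpp>
#include <openvino/pass/constant_folding.hpp>
#include <openvino/pass/manager.hpp>
#include <transformations/op_conversions/einsum_decomposition.hpp>

int main() {
    // einsum("abc,cd->abd", input, weight): the weight is a Constant,
    // mirroring the EinsumDense projection pattern (example 3 above).
    auto input = std::make_shared<ov::op::v0::Parameter>(ov::element::f32, ov::Shape{2, 3, 4});
    auto weight = ov::op::v0::Constant::create(ov::element::f32, ov::Shape{4, 5},
                                               std::vector<float>(4 * 5, 0.1f));
    auto einsum = std::make_shared<ov::op::v7::Einsum>(ov::OutputVector{input, weight},
                                                       "abc,cd->abd");
    auto model = std::make_shared<ov::Model>(ov::OutputVector{einsum},
                                             ov::ParameterVector{input});

    // Decompose the einsum into primitive ops, then fold whatever
    // depends only on the constant weight.
    ov::pass::Manager manager;
    manager.register_pass<ov::pass::EinsumDecomposition>();
    manager.register_pass<ov::pass::ConstantFolding>();
    manager.run_passes(model);
    return 0;
}
```

With a variable-only einsum (examples 1 and 2), the same pipeline would decompose the op but leave nothing for `ConstantFolding` to fold, which is why the PR gates the decomposition on having at least one constant input.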
Solves issue #31390; see also #30934.
Adding `EinsumDecomposition` to MOC transformations helped reduce memory usage during model compilation.
To reproduce, run this script with the memory profiling from #31516:
- Use the Keras source: https://github.com/keras-team/keras.git
- Also use this PR from keras_hub: keras-team/keras-hub#2350
- Then run the following script.
- Then enable `os.environ["OV_ENABLE_MEMORY_PROFILING"] = "1"` by uncommenting it.

The memory profiles without and with the fix are shown earlier in this conversation (the "from"/"to" outputs). The fix itself adds `EinsumDecomposition` to the MOC pipeline, as sketched below.
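A reconstruction of the addition, based on the MOC diff shown earlier in this conversation (the exact placement and surrounding passes in the final PR may differ):

```cpp
// In ov::pass::MOCTransformations::run_on_model:
REGISTER_PASS(manager, PushConstantToSubgraph)
REGISTER_PASS(manager, ConstantFolding)
REGISTER_PASS(manager, Validate)

// the order is important: decompose einsums that have constant inputs
// here, so the subsequent constant-folding passes can fold them before
// the model reaches the plugin.
REGISTER_PASS(manager, EinsumDecomposition)
```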
Note: the position of this pass in the pipeline is important.
I am still exploring what else can help reduce memory usage further and would appreciate any suggestions or recommendations.