[Web] WebGPU Incorrect predictions in ONNX model when using Electron on Intel devices #24442
Labels
ep:WebGPU (ort-web webgpu provider)
platform:web (issues related to ONNX Runtime web; typically submitted using template)
Describe the issue
We're using `onnxruntime-web` with the WebGPU backend on several platforms, and Electron is one of them. We observe unstable/inaccurate predictions from an ONNX segmentation model when running inference via ONNX Runtime Web in Electron on specific Intel integrated GPUs (Gen-12LP, Gen-9, Gen-11). The issue does not occur in Chrome on the same devices. The problem manifests as significant tensor value mismatches (large abs/rel errors) in convolution layers, leading to invalid segmentation masks.
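For context, here is a minimal sketch of how we set up the session; the model path and the helper name are illustrative placeholders, not our exact code:

```ts
import * as ort from 'onnxruntime-web/webgpu';

// Minimal sketch, assuming a model file shipped with the app;
// 'model.onnx' is a placeholder path, not our real model.
async function createSession(): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create('model.onnx', {
    executionProviders: ['webgpu'],
  });
}
```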
On `1.20.1` we faced this problem mostly on Intel Gen-12LP devices: i5-12400, i7-13700H, i7-11850H, i7-12700, i5-1235U, and a lot of others. I tried to bisect versions to find where the problem was fixed for the devices above. I found out that it is broken up to `1.21.0-dev.20241107-6a295eb75b` and fixed as of `1.21.0-dev.20241109-d3ad76b2cf`. After that I decided to use commit d27fecd3d3837864a268bc96f00f2b8dce294697, because everything seemed stable and the problem was solved for the devices above.
But after that we've faced the problem on various other devices. Examples:

- Gen-12LP: i7-12700H (breaks after model reinitialization), i3-1215U, i5-1035G1
- Gen-11: i5-11320H
- Gen-9: i3-7100U, i5-7200U, i7-8565U

I noticed similar problems elsewhere, for example model predictions that differ too much from the reference (atol > 0.1) on Ampere and Turing GPUs in Chrome, and also on many devices for fp16, but we face those problems much less often. I also tried later versions, but faced similar problems, for example on i7-13700H. To help sort out this problem I can provide more info such as WebGPU reports and more device examples, and I can try more commits on these devices.
To reproduce
I can reproduce on my own devices.

I attach some Convs from my model on which the tests fail: test_examples.zip

1. Check out master (4d03aeff0ef86a62dacf02d67624cf26050125fd).
2. Move the test cases from above into `onnxruntime/js/web/test/data/node/opset_20` (`opset_20` is an arbitrary name that the testing scripts work with).
3. Change `onnxruntime/js/web/test/suite-test-list.jsonc` to include the new cases (a hedged sketch of such an entry follows these steps).
4. Run the tests for these ops on all of my devices.
5. Check out 6a295eb75b and rerun: on Gen-12LP I see a clear mismatch in the output tensors (already visible in the first 10 tensor values).
6. Check out d3ad76b2cf and rerun: the mismatch is gone.

So this fixes the issue for my device, but I assume that on the devices with the incorrect predictions listed above we would face the same errors.
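For reference, a hypothetical sketch of what the `suite-test-list.jsonc` addition might look like; the test names are placeholders for the cases in test_examples.zip, and the exact key layout of that file may differ by commit:

```jsonc
// Hypothetical sketch; test names are placeholders and the exact
// structure of this file may differ by commit.
{
  "webgpu": {
    "node": [
      // ...existing entries...
      "test_conv_example_0",
      "test_conv_example_1"
    ]
  }
}
```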
So it seems that convolutions are unstable on Electron for a lot of Intel devices.
Urgency
I'm working on a segmentation model, and on some devices I see weird model predictions, so this problem is very important, and I face it a lot. As a workaround I developed some tests that I run at initialization, so I can turn the model off if it produces incorrect predictions (see the sketch below).
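A minimal sketch of that init-time check, assuming a fixed input and a reference output captured on a known-good device; the function name, tolerances, and data plumbing are illustrative, not our exact implementation:

```ts
import * as ort from 'onnxruntime-web/webgpu';

// Hedged sketch: run the model on a fixed input and compare against a
// reference output with abs/rel tolerances; the tolerances and the way
// the reference data is obtained are placeholders.
async function passesSanityCheck(
  session: ort.InferenceSession,
  input: ort.Tensor,        // fixed input captured offline
  expected: Float32Array,   // reference output from a known-good device
  atol = 1e-3,
  rtol = 1e-2,
): Promise<boolean> {
  const results = await session.run({ [session.inputNames[0]]: input });
  const actual = results[session.outputNames[0]].data as Float32Array;
  if (actual.length !== expected.length) return false;
  for (let i = 0; i < actual.length; i++) {
    const diff = Math.abs(actual[i] - expected[i]);
    if (diff > atol + rtol * Math.abs(expected[i])) return false;
  }
  return true;
}
```

If the check fails on a given device, we disable the WebGPU-backed model there instead of serving broken masks.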
Here is a picture of the incorrect convolution behaviour (it's not because the model was trained badly; it's 100% because of incorrect predictions):
So I think this problem is critical for `onnxruntime-web` usage on Electron.

ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
d27fecd
Execution Provider
'webgpu' (WebGPU)