[Web] WebGPU Incorrect predictions in ONNX model when using Electron on Intel devices #24442

Open
grazder opened this issue Apr 16, 2025 · 1 comment
Assignees
fs-eire

Labels
ep:WebGPU (ort-web webgpu provider), platform:web (issues related to ONNX Runtime web; typically submitted using template)

Comments

grazder commented Apr 16, 2025

Describe the issue

We're using onnxruntime-web with the WebGPU backend on several platforms, and Electron is one of them.

We observe unstable/inaccurate predictions from an ONNX segmentation model when running inference via ONNX Runtime Web in Electron on specific Intel integrated GPUs (Gen-12LP, Gen-9, Gen-11). The issue does not occur in Chrome on the same devices. The problem manifests as significant tensor value mismatches (e.g., abs/rel errors) in convolution layers, leading to invalid segmentation masks.
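
For context, here is a minimal sketch of how we create the session and run inference with the WebGPU EP (the model path, input name, and shape are illustrative, not our real model):

import * as ort from 'onnxruntime-web/webgpu';

// Create the session with the WebGPU execution provider (same code path in Chrome and in Electron).
const session = await ort.InferenceSession.create('./segmentation.onnx', {
  executionProviders: ['webgpu'],
});

// Run one frame through the model.
const input = new ort.Tensor('float32', new Float32Array(1 * 3 * 320 * 512), [1, 3, 320, 512]);
const outputs = await session.run({ input });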

On 1.20.1 we hit this problem mostly on Intel Gen-12LP devices: i5-12400, i7-13700H, i7-11850H, i7-12700, i5-1235U, and many others.

I tried to bisect versions to find a fix for the devices above. I found that the behaviour is broken up to 1.21.0-dev.20241107-6a295eb75b and fixed starting from 1.21.0-dev.20241109-d3ad76b2cf.

After that I decided to use commit d27fecd3d3837864a268bc96f00f2b8dce294697, because everything seemed stable and the problem was solved for the devices above.

But after that we've hit the problem on various other devices. Examples:

  • gen-12lp: i7-12700H (breaks after model reinitialization), i3-1215U, i5-1035G1
  • gen-11: i5-11320H
  • gen-9: i3-7100U, i5-7200U, i7-8565U

I've noticed similar problems elsewhere, for example model predictions that differ too much from the reference (atol > 0.1) on Ampere and Turing GPUs in Chrome, and on many devices with fp16, but we run into those much less often.
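
For reference, the kind of abs/rel check I mean is roughly the following (the tolerances here are example values, not the ones the ORT web test runner uses):

// An element fails only if it is outside both the absolute and the relative tolerance.
function tensorsMatch(actual: Float32Array, expected: Float32Array, atol = 1e-2, rtol = 1e-2): boolean {
  if (actual.length !== expected.length) return false;
  for (let i = 0; i < actual.length; i++) {
    const diff = Math.abs(actual[i] - expected[i]);
    if (diff > atol && diff > rtol * Math.abs(expected[i])) return false;
  }
  return true;
}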

I also tried later versions, but hit similar-looking problems, on an i7-13700H for example.

To help sort out this problem I can provide more info (e.g. WebGPU reports), give more device examples, or try more commits on these devices.
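
For example, this is roughly how I can collect adapter info for such reports (depending on the Chromium version, the info is exposed as adapter.info or via the older requestAdapterInfo() call; treat this as a sketch):

async function dumpWebGpuAdapterInfo(): Promise<void> {
  const adapter = await (navigator as any).gpu?.requestAdapter();
  if (!adapter) {
    console.log('WebGPU is not available on this device');
    return;
  }
  // Newer Chromium exposes adapter.info directly; older builds only had requestAdapterInfo().
  const info = adapter.info ?? (await adapter.requestAdapterInfo?.());
  console.log('vendor:', info?.vendor, 'architecture:', info?.architecture, 'description:', info?.description);
}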

To reproduce

I can reproduce on my own devices:

  • Mac M1 (metal-3)
  • NVIDIA GeForce RTX 3060 (ampere)
  • i5-12400 (gen-12lp) - this is where I can see the problems

I've attached some Convs from my model on which the tests fail - test_examples.zip

Master - 4d03aeff0ef86a62dacf02d67624cf26050125fd

git checkout 4d03aeff0ef86a62dacf02d67624cf26050125fd
cd onnxruntime/js
npm ci
cd common
npm ci
cd ../web
npm ci
npm run pull:wasm
npm run build

Move the test cases from above into onnxruntime/js/web/test/data/node/opset_20 (opset_20 is an arbitrary name that the testing scripts work with).

Change onnxruntime/js/web/test/suite-test-list.jsonc to:

{
  "webgpu": {
    "onnx": [],
    "node": ["8_Conv", "21_Conv", "31_Conv"],
    "ops": []
  }
}

After that I run the tests for these ops on all of my devices:

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

After that I check out 6a295eb75b:

git checkout 6a295eb75b
js\build_jsep.bat r

// building etc

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// FAIL
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

I see the following mismatch on gen-12lp (I print only the first 10 tensor values here):

LOG: 'e Validator 2025-04-16T13:01:25.774Z|abs/rel check failed-- index:163839: actual=1.9458719491958618,expected=3.159862518310547'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|Tensor mismatch:
ACTUAL: type=float32; dims=[1,16,80,128]; data=[-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426]
EXPECT: type=float32; dims=[1,16,80,128]; data=[0.6060669422149658,0.5686113834381104,0.5930850505828857,0.5984766483306885,0.5964930057525635,0.5918130874633789,0.5929081439971924,0.6105263233184814,0.6307907104492188,0.6446692943572998]'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|  Result: FAILED'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|Failed to run test data from folder: test_data_set_0. Error: [AssertionError: tensor data should match: expected false to be true]'

After that I check out d3ad76b2cf:

git checkout d3ad76b2cf
js\build_jsep.bat r

// building etc

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

So this fixes the issue on my device, but I assume the devices listed above with incorrect predictions will show the same errors.

So it seems that convolutions are unstable in Electron on a lot of Intel devices.

Urgency

I'm working on a segmentation model, and on some devices I see weird model predictions, so this problem is very important for us, and I run into it a lot. As a workaround I developed some tests that I run at initialization, so I can turn the model off if it produces incorrect predictions.
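
Roughly, the workaround looks like the sketch below (REFERENCE_INPUT, REFERENCE_OUTPUT, the tensor names/shapes, and the wasm fallback are specific to our app; tensorsMatch is the abs/rel helper sketched above):

// Run a known input through the freshly created session and compare the result with a
// reference output computed offline; if the mismatch is too large, fall back to the wasm EP.
async function createSessionWithSanityCheck(modelUrl: string): Promise<ort.InferenceSession> {
  const session = await ort.InferenceSession.create(modelUrl, { executionProviders: ['webgpu'] });
  const probe = new ort.Tensor('float32', REFERENCE_INPUT, [1, 3, 320, 512]);
  const { output } = await session.run({ input: probe });
  if (!tensorsMatch(output.data as Float32Array, REFERENCE_OUTPUT)) {
    await session.release();
    return ort.InferenceSession.create(modelUrl, { executionProviders: ['wasm'] });
  }
  return session;
}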

Here is a picture of the incorrect convolution behaviour (it's not because the model was trained badly, it's 100% because of incorrect predictions):

Image

So I think this problem is critical for onnxruntime-web usage on Electron.

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

d27fecd

Execution Provider

'webgpu' (WebGPU)

@grazder grazder added the platform:web issues related to ONNX Runtime web; typically submitted using template label Apr 16, 2025
@github-actions github-actions bot added the ep:WebGPU ort-web webgpu provider label Apr 16, 2025
@grazder grazder changed the title [Web] Incorrect predictions in ONNX model when using Electron on Intel devices [Web] WebGPU Incorrect predictions in ONNX model when using Electron on Intel devices Apr 16, 2025
grazder (Author) commented Apr 16, 2025

cc @fs-eire

@fs-eire fs-eire self-assigned this Apr 16, 2025