[Web] WebGPU Incorrect predictions in ONNX model when using Electron on Intel devices #24442

Open
grazder opened this issue Apr 16, 2025 · 1 comment
Assignees
fs-eire

Labels
ep:WebGPU (ort-web webgpu provider), platform:web (issues related to ONNX Runtime web; typically submitted using template)

Comments

grazder commented Apr 16, 2025

Describe the issue

We're using onnxruntime-web with the WebGPU backend on several platforms, and Electron is one of them.

We observe unstable/inaccurate predictions from an ONNX segmentation model when running inference via ONNX Runtime Web in Electron on specific Intel integrated GPUs (Gen-12LP, Gen-9, Gen-11). The issue does not occur in Chrome on the same devices. The problem manifests as significant tensor value mismatches (e.g., abs/rel errors) in convolution layers, leading to invalid segmentation masks.
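
For context, here is a minimal sketch of how we create the session and run inference with the WebGPU EP (the model path, input name, and shape are illustrative, not our real model):

import * as ort from 'onnxruntime-web/webgpu';

// Create the session with the WebGPU execution provider (same code path in Chrome and in Electron).
const session = await ort.InferenceSession.create('./segmentation.onnx', {
  executionProviders: ['webgpu'],
});

// Run one frame through the model.
const input = new ort.Tensor('float32', new Float32Array(1 * 3 * 320 * 512), [1, 3, 320, 512]);
const outputs = await session.run({ input });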

On 1.20.1 we hit this problem mostly on Intel Gen-12LP devices: i5-12400, i7-13700H, i7-11850H, i7-12700, i5-1235U, and many others.

I tried to bisect versions to find a fix for the devices above. I found that the behaviour is broken up to 1.21.0-dev.20241107-6a295eb75b and fixed starting from 1.21.0-dev.20241109-d3ad76b2cf.

After that I decided to use commit d27fecd3d3837864a268bc96f00f2b8dce294697, because everything seemed stable and the problem was solved for the devices above.

But after that we've hit the problem on various other devices. Examples:

  • gen-12lp: i7-12700H (breaks after model reinitialization), i3-1215U, i5-1035G1
  • gen-11: i5-11320H
  • gen-9: i3-7100U, i5-7200U, i7-8565U

I've noticed similar problems elsewhere, for example model predictions that differ too much from the reference (atol > 0.1) on Ampere and Turing GPUs in Chrome, and on many devices with fp16, but we run into those much less often.
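
For reference, the kind of abs/rel check I mean is roughly the following (the tolerances here are example values, not the ones the ORT web test runner uses):

// An element fails only if it is outside both the absolute and the relative tolerance.
function tensorsMatch(actual: Float32Array, expected: Float32Array, atol = 1e-2, rtol = 1e-2): boolean {
  if (actual.length !== expected.length) return false;
  for (let i = 0; i < actual.length; i++) {
    const diff = Math.abs(actual[i] - expected[i]);
    if (diff > atol && diff > rtol * Math.abs(expected[i])) return false;
  }
  return true;
}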

I also tried later versions, but hit similar-looking problems, on an i7-13700H for example.

To help sort out this problem I can provide more info (e.g. WebGPU reports), give more device examples, or try more commits on these devices.
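
For example, this is roughly how I can collect adapter info for such reports (depending on the Chromium version, the info is exposed as adapter.info or via the older requestAdapterInfo() call; treat this as a sketch):

async function dumpWebGpuAdapterInfo(): Promise<void> {
  const adapter = await (navigator as any).gpu?.requestAdapter();
  if (!adapter) {
    console.log('WebGPU is not available on this device');
    return;
  }
  // Newer Chromium exposes adapter.info directly; older builds only had requestAdapterInfo().
  const info = adapter.info ?? (await adapter.requestAdapterInfo?.());
  console.log('vendor:', info?.vendor, 'architecture:', info?.architecture, 'description:', info?.description);
}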

To reproduce

I can reproduce on my own devices:

  • Mac M1 (metal-3)
  • NVIDIA GeForce RTX 3060 (ampere)
  • i5-12400 (gen-12lp) - this is where I can see the problems

I've attached some Convs from my model on which the tests fail - test_examples.zip

Master - 4d03aeff0ef86a62dacf02d67624cf26050125fd

git checkout 4d03aeff0ef86a62dacf02d67624cf26050125fd
cd onnxruntime/js
npm ci
cd common
npm ci
cd ../web
npm ci
npm run pull:wasm
npm run build

Move the test cases from above into onnxruntime/js/web/test/data/node/opset_20 (opset_20 is an arbitrary name that the testing scripts work with).

Change onnxruntime/js/web/test/suite-test-list.jsonc to:

{
  "webgpu": {
    "onnx": [],
    "node": ["8_Conv", "21_Conv", "31_Conv"],
    "ops": []
  }
}

After that I run the tests for these ops on all of my devices:

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

After that I check out 6a295eb75b:

git checkout 6a295eb75b
js\build_jsep.bat r

// building etc

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// FAIL
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

I see the following mismatch on gen-12lp (I print only the first 10 tensor values here):

LOG: 'e Validator 2025-04-16T13:01:25.774Z|abs/rel check failed-- index:163839: actual=1.9458719491958618,expected=3.159862518310547'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|Tensor mismatch:
ACTUAL: type=float32; dims=[1,16,80,128]; data=[-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426,-1.5702481269836426]
EXPECT: type=float32; dims=[1,16,80,128]; data=[0.6060669422149658,0.5686113834381104,0.5930850505828857,0.5984766483306885,0.5964930057525635,0.5918130874633789,0.5929081439971924,0.6105263233184814,0.6307907104492188,0.6446692943572998]'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|  Result: FAILED'
LOG: 'e TestRunner 2025-04-16T13:01:25.774Z|Failed to run test data from folder: test_data_set_0. Error: [AssertionError: tensor data should match: expected false to be true]'

After that I check out d3ad76b2cf:

git checkout d3ad76b2cf
js\build_jsep.bat r

// building etc

// metal-3
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// gen-12lp
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

// ampere
npm run test -- suite1 --backend webgpu --env electron 
// SUCCESS
npm run test -- suite1 --backend webgpu --env chrome
// SUCCESS

So this fixes the issue on my device, but I assume the devices listed above with incorrect predictions will show the same errors.

So it seems that convolutions are unstable in Electron on a lot of Intel devices.

Urgency

I'm working on a segmentation model, and on some devices I see weird model predictions, so this problem is very important for us, and I run into it a lot. As a workaround I developed some tests that I run at initialization, so I can turn the model off if it produces incorrect predictions.
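
Roughly, the workaround looks like the sketch below (REFERENCE_INPUT, REFERENCE_OUTPUT, the tensor names/shapes, and the wasm fallback are specific to our app; tensorsMatch is the abs/rel helper sketched above):

// Run a known input through the freshly created session and compare the result with a
// reference output computed offline; if the mismatch is too large, fall back to the wasm EP.
async function createSessionWithSanityCheck(modelUrl: string): Promise<ort.InferenceSession> {
  const session = await ort.InferenceSession.create(modelUrl, { executionProviders: ['webgpu'] });
  const probe = new ort.Tensor('float32', REFERENCE_INPUT, [1, 3, 320, 512]);
  const { output } = await session.run({ input: probe });
  if (!tensorsMatch(output.data as Float32Array, REFERENCE_OUTPUT)) {
    await session.release();
    return ort.InferenceSession.create(modelUrl, { executionProviders: ['wasm'] });
  }
  return session;
}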

Here is a picture of the incorrect convolution behaviour (it's not because the model was trained badly, it's 100% because of incorrect predictions):

Image

So I think this problem is critical for onnxruntime-web usage on Electron.

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

d27fecd

Execution Provider

'webgpu' (WebGPU)

@grazder grazder added the platform:web issues related to ONNX Runtime web; typically submitted using template label Apr 16, 2025
@github-actions github-actions bot added the ep:WebGPU ort-web webgpu provider label Apr 16, 2025
@grazder grazder changed the title [Web] Incorrect predictions in ONNX model when using Electron on Intel devices [Web] WebGPU Incorrect predictions in ONNX model when using Electron on Intel devices Apr 16, 2025
grazder (Author) commented Apr 16, 2025

cc @fs-eire

@fs-eire fs-eire self-assigned this Apr 16, 2025