Performance impact of disabling MKLDNN in Tensorflow #47991

makortel · 2025-04-30T15:28:53Z

This issue is a follow-up to @gartung's talk in the Core Software meeting on 2025-04-29 (https://indico.cern.ch/event/1543134/#17-profiling-and-tensorflow), to record the performance impact and discuss how to proceed.

makortel · 2025-04-30T15:28:59Z

assign core, ml

cmsbuild · 2025-04-30T15:29:08Z

New categories assigned: core,ml

@Dr15Jones,@makortel,@smuzaffar,@valsdav,@y19y19 you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2025-04-30T15:29:10Z

cms-bot internal usage

cmsbuild · 2025-04-30T15:29:10Z

A new Issue was created by @makortel.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

gartung · 2025-04-30T15:44:45Z

Workflow 29834.21, 80 events, single-threaded

CMSSW_15_1_X_2025-04-21-2300
-- Tensorflow with MKLDNN

CMSSW_15_1_MKLDNN0_X_2025-04-21-2300
-- Tensorflow without MKLDNN

RECO Step

            Without MKLDNN   With MKLDNN
            0.0271875        0.0270157
            0.0274473        0.027512
            0.0274606        0.0275002
            0.0275526        0.0274082
            0.0274737        0.0276156
            0.027548         0.0274871
            0.0275574        0.0276823
            0.0274444        0.0269425
            0.0273316        0.0266684
            0.0266689        0.0272566

Mean        0.0273672        0.0273089        (relative difference: 0.2136303 %)
Std. dev.   0.0002702        0.0003310

PAT step

            Without MKLDNN   With MKLDNN
            0.293232         0.28897
            0.284163         0.290577
            0.288738         0.293765
            0.262822         0.289624
            0.291147         0.291533
            0.290975         0.29029
            0.292249         0.288425
            0.292383         0.285111
            0.29239          0.28795
            0.291838         0.290082

Mean        0.2879937        0.2896327        (relative difference: -0.5658891 %)
Std. dev.   0.0092291        0.0022946
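
For reference, the two summary rows under each table are consistent with the per-column mean and the sample (n-1) standard deviation, and the percentage with (without - with) / with. The sketch below recomputes them from the numbers listed above; the function name `summarize` and the variable names are only illustrative, and the interpretation of the summary rows is inferred from the values rather than stated in the thread.

```python
from statistics import mean, stdev

# Values copied from the RECO and PAT tables above (units as reported there).
reco_without = [0.0271875, 0.0274473, 0.0274606, 0.0275526, 0.0274737,
                0.027548, 0.0275574, 0.0274444, 0.0273316, 0.0266689]
reco_with = [0.0270157, 0.027512, 0.0275002, 0.0274082, 0.0276156,
             0.0274871, 0.0276823, 0.0269425, 0.0266684, 0.0272566]
pat_without = [0.293232, 0.284163, 0.288738, 0.262822, 0.291147,
               0.290975, 0.292249, 0.292383, 0.29239, 0.291838]
pat_with = [0.28897, 0.290577, 0.293765, 0.289624, 0.291533,
            0.29029, 0.288425, 0.285111, 0.28795, 0.290082]


def summarize(without, with_):
    """Return (mean_without, mean_with, std_without, std_with, rel_diff_percent).

    The relative difference is assumed to be (without - with) / with,
    which is what reproduces the percentages quoted in the tables.
    """
    m_without, m_with = mean(without), mean(with_)
    rel_diff = 100.0 * (m_without - m_with) / m_with
    return m_without, m_with, stdev(without), stdev(with_), rel_diff


reco = summarize(reco_without, reco_with)
pat = summarize(pat_without, pat_with)

for step, s in (("RECO", reco), ("PAT", pat)):
    print(f"{step}: mean without={s[0]:.7f} with={s[1]:.7f} "
          f"std without={s[2]:.7f} with={s[3]:.7f} diff={s[4]:.4f} %")
```

In both steps the difference between the two means is smaller than the sample standard deviation of the "Without MKLDNN" column, and in the PAT step that standard deviation is driven largely by the single 0.262822 run.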

makortel · 2025-04-30T15:56:36Z

@gartung What CPU did the node have?

gartung · 2025-04-30T15:58:37Z

AMD EPYC-Genoa Processor
cmsdev31

makortel · 2025-04-30T17:59:04Z

I'm a bit surprised that the JIT for AVX512 ("With MKLDNN") does not result in a bigger difference wrt. the AVX2 binaries ("Without MKLDNN").

gartung · 2025-04-30T18:20:22Z

The Genoa cores have AVX-512 instructions.
The Eigen matrix operations use the available instruction set.

gartung · 2025-04-30T18:22:21Z

On an Intel CPU with deep learning instructions, OneDNN might be faster.

gartung · 2025-04-30T18:23:39Z

That is, the VNNI extension to AVX-512:
https://www.intel.com/content/www/us/en/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html
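
As a rough way to check which of these instruction sets a given node advertises, the sketch below parses the `flags` line of `/proc/cpuinfo`. It assumes an x86 Linux host, where the kernel reports the features as `avx2`, `avx512f` and `avx512_vnni`; the helper names `parse_cpu_flags` and `report` are hypothetical. It only shows what the CPU exposes, not which code path Tensorflow or OneDNN actually selects at run time.

```python
import os

# Kernel flag names as they appear in /proc/cpuinfo on x86 Linux.
FLAGS_OF_INTEREST = ("avx2", "avx512f", "avx512_vnni")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line, or an empty set."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def report(cpuinfo_text):
    """Map each flag of interest to True/False depending on its presence."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: name in flags for name in FLAGS_OF_INTEREST}


# Only attempt to read the real file where it exists (i.e. on Linux).
if os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        print(report(f.read()))
```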

gartung · 2025-04-30T18:25:22Z

The inclusion of VNNI optimization only occurs in the OneDNN included in Tensorflow 2.17.0.

makortel · 2025-04-30T18:26:42Z

> The Genoa cores have avx-512 instructions. The eigen matrix operations use the available instruction set.

Right, but only when JITted in the "with MKLDNN" setup, right? The "without MKLDNN" build should use only the x86-64-v3 instructions.

gartung · 2025-04-30T18:28:36Z

If I recall correctly, yes.

gartung · 2025-05-16T15:31:02Z

@makortel Since compiling the stack with frame pointers seems to resolve the segfaults in libunwind when profiling workflows with Tensorflow/OneDNN JITting, should this issue be closed?

makortel · 2025-05-16T16:09:14Z

Yes, let's close

gartung · 2025-05-20T15:08:47Z

After changing the profiling script to NOT configure the build with the Tensorflow from the MKLDNN0 IBs, the profiling jobs are segfaulting again. I was mistaken in assuming that enabling frame pointers was enough to prevent the segfaults.
@makortel you would need to reopen this issue.

cmsbuild added core-pending pending-signatures ml-pending labels Apr 30, 2025

makortel closed this as completed May 16, 2025

cmsbuild added core-rejected and removed core-pending labels May 16, 2025

makortel reopened this May 20, 2025

cmsbuild added core-pending and removed core-rejected labels May 20, 2025
