Compression options ignored on output when no change is made on input data #48182

Open
rsreds opened this issue May 27, 2025 · 8 comments
@rsreds
Contributor

rsreds commented May 27, 2025

When opening a file with PoolSource and then using a PoolOutputModule to write it back with a different compression method, the file does not seem to be recompressed, even though the compression setting is correctly assigned.

The minimal config:

import FWCore.ParameterSet.Config as cms

process = cms.Process("reCompress")

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring('file:RAW_TTree_1000.root'),
)

process.maxEvents = cms.untracked.PSet(
    input = cms.untracked.int32(1000),
    output = cms.optional.untracked.allowed(cms.int32,cms.PSet)
)

process.output = cms.OutputModule("PoolOutputModule",
    compressionAlgorithm = cms.untracked.string("ZSTD"),
    compressionLevel = cms.untracked.int32(3),
    fileName = cms.untracked.string('file:RAW_TTree_1000_ZSTD3.root')
)

process.end_step = cms.EndPath(process.output)

Running it on a TTBar sample with 1000 events containing only FEDRawDataCollection (the content of the input file, as printed by edmFileUtil, is shown here):

$ edmFileUtil -P  RAW_TTree_1000.root 
RAW_TTree_1000.root
RAW_TTree_1000.root (1 runs, 1 lumis, 1000 events, 1969175500 bytes)
Branch 0 of Events tree: EventAuxiliary Total size = 123446 Compressed size = 121281
Branch 1 of Events tree: EventProductProvenance Total size = 35206 Compressed size = 33009
Branch 2 of Events tree: EventSelections Total size = 66541 Compressed size = 64372
Branch 3 of Events tree: BranchListIndexes Total size = 28731 Compressed size = 26554
Branch 4 of Events tree: FEDRawDataCollection_rawDataCollector__HLT. Total size = 1961011549 Compressed size = 1960965125

results in this file:

$ edmFileUtil -P  RAW_TTree_1000_ZSTD3.root 
RAW_TTree_1000_ZSTD3.root
RAW_TTree_1000_ZSTD3.root (1 runs, 1 lumis, 1000 events, 1962327487 bytes)
Branch 0 of Events tree: EventAuxiliary Total size = 115787 Compressed size = 14233
Branch 1 of Events tree: EventProductProvenance Total size = 25151 Compressed size = 3060
Branch 2 of Events tree: EventSelections Total size = 57581 Compressed size = 4494
Branch 3 of Events tree: BranchListIndexes Total size = 18913 Compressed size = 2776
Branch 4 of Events tree: FEDRawDataCollection_rawDataCollector__HLT. Total size = 1961011549 Compressed size = 1960965125

When checking the compression settings with root:

$ root -l RAW_TTree_1000_ZSTD3.root 
root [0] 
Attaching file RAW_TTree_1000_ZSTD3.root as _file0...
(TFile *) 0x47da050
root [1] _file0->GetCompressionSettings()
(int) 503
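
For reference, ROOT packs the compression setting as 100 * algorithm + level, so 503 reads as algorithm 5 (ZSTD) at level 3, which matches the requested configuration. A minimal sketch of the decoding follows; the helper name `decode_compression_settings` is made up for illustration, and the algorithm table lists only the codes from ROOT's `RCompressionSetting::EAlgorithm` that are relevant here.

```python
# Algorithm codes as defined in ROOT's RCompressionSetting::EAlgorithm.
ROOT_ALGORITHMS = {1: "ZLIB", 2: "LZMA", 4: "LZ4", 5: "ZSTD"}


def decode_compression_settings(settings: int) -> tuple[str, int]:
    """Split a ROOT compression setting into (algorithm name, level).

    Hypothetical helper for illustration; not part of ROOT or CMSSW.
    """
    algorithm, level = divmod(settings, 100)
    return ROOT_ALGORITHMS.get(algorithm, f"unknown({algorithm})"), level


print(decode_compression_settings(503))
```

So the file-level setting says ZSTD level 3, even though the large branch was evidently not rewritten with it.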

Meanwhile, when doing the compression with hadd:

$ hadd -f503 RAW_TTRee_1000_ZSTD3_hadd.root RAW_TTree_1000.root 

The edmFileUtil print is:

$ edmFileUtil -P  RAW_TTRee_1000_ZSTD3_hadd.root 
RAW_TTRee_1000_ZSTD3_hadd.root
RAW_TTRee_1000_ZSTD3_hadd.root (1 runs, 1 lumis, 1000 events, 1313913576 bytes)
Branch 0 of Events tree: EventAuxiliary Total size = 123446 Compressed size = 29314
Branch 1 of Events tree: EventProductProvenance Total size = 35206 Compressed size = 15103
Branch 2 of Events tree: EventSelections Total size = 66541 Compressed size = 18565
Branch 3 of Events tree: BranchListIndexes Total size = 28731 Compressed size = 13764
Branch 4 of Events tree: FEDRawDataCollection_rawDataCollector__HLT. Total size = 1961011549 Compressed size = 1312499390

This is also confirmed by the compression settings reported by ROOT.
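
To make the comparison explicit, the per-branch compression ratio (total size divided by compressed size) can be computed from the edmFileUtil lines quoted above. This is only a sketch: the regular expression assumes the exact line layout shown in this report, and `compression_ratios` is a hypothetical helper, not an existing tool.

```python
import re

# Matches lines such as:
# "Branch 4 of Events tree: <name> Total size = <N> Compressed size = <M>"
BRANCH_RE = re.compile(
    r"Branch \d+ of Events tree: (\S+) Total size = (\d+) Compressed size = (\d+)"
)


def compression_ratios(report: str) -> dict[str, float]:
    """Map each branch name to total size / compressed size."""
    return {
        name: int(total) / int(compressed)
        for name, total, compressed in BRANCH_RE.findall(report)
    }


# Lines copied from the two edmFileUtil printouts above.
pool_output = (
    "Branch 4 of Events tree: FEDRawDataCollection_rawDataCollector__HLT. "
    "Total size = 1961011549 Compressed size = 1960965125"
)
hadd_output = (
    "Branch 4 of Events tree: FEDRawDataCollection_rawDataCollector__HLT. "
    "Total size = 1961011549 Compressed size = 1312499390"
)

branch = "FEDRawDataCollection_rawDataCollector__HLT."
print(f"PoolOutputModule: {compression_ratios(pool_output)[branch]:.2f}")
print(f"hadd -f503:       {compression_ratios(hadd_output)[branch]:.2f}")
```

For the raw-data branch this gives a ratio of about 1.00 for the PoolOutputModule file and about 1.49 for the hadd file.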

@cmsbuild
Contributor

cmsbuild commented May 27, 2025

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @rsreds.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@rsreds
Contributor Author

rsreds commented May 27, 2025

Additionally, I tested the same thing in RNTUPLE_X using the RNTupleSource and RNTupleOutputModule and, in this case, the compression is correctly performed with a similar minimal config.

@Dr15Jones
Contributor

Dr15Jones commented May 27, 2025

What is happening is that your job is using 'fast cloning'. If the job can guarantee that the order in which the events are stored in the input will be exactly the order in which they will be stored in the output (i.e. only 1 thread is running and there are no EDFilters in the job), then the default is to 'fast clone'. A fast clone takes the raw bytes from the input file and stores them in the output, without doing any decompressing/recompressing. This makes the job substantially faster, but it also means any changes to compression settings are ignored.

You can stop this behavior by doing

process.output.fastCloning = cms.untracked.bool(False)
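
The effect can be reproduced in miniature with Python's standard `zlib` module. This is only a toy analogy, not how ROOT or CMSSW copy baskets: the `fast_clone` and `slow_copy` functions and the payload are invented for illustration. The "input" is stored at level 0 (effectively uncompressed, similar to the input file in this report), and the two copy strategies are compared.

```python
import zlib

# Stand-in for the event data; highly repetitive so it compresses well.
payload = b"FED raw data " * 10_000

# "Input file": stored at level 0, i.e. effectively uncompressed.
input_basket = zlib.compress(payload, 0)


def fast_clone(compressed: bytes, requested_level: int) -> bytes:
    """Copy the stored bytes verbatim; the requested level is never applied."""
    return compressed


def slow_copy(compressed: bytes, requested_level: int) -> bytes:
    """Decompress, then recompress with the requested level."""
    return zlib.compress(zlib.decompress(compressed), requested_level)


cloned = fast_clone(input_basket, 6)
recompressed = slow_copy(input_basket, 6)

print("input       :", len(input_basket), "bytes")
print("fast clone  :", len(cloned), "bytes")
print("recompressed:", len(recompressed), "bytes")
```

The fast clone is byte-for-byte identical to the input regardless of the requested level, while the slow copy pays for a decompress/recompress cycle and actually honours the new setting.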

@rsreds
Contributor Author

rsreds commented May 27, 2025

I understand. I would argue, though, that fast cloning should be performed only if the compression settings stay the same. Otherwise the data actually written does not match the compression settings recorded in the ROOT file.
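
The suggestion amounts to adding one more condition to the fast-cloning decision. A rough sketch of that idea follows; `should_fast_clone` and its arguments are hypothetical and do not correspond to actual CMSSW code, and the settings values in the usage lines are invented examples rather than values read from the files in this report.

```python
def should_fast_clone(
    order_preserved: bool,
    input_settings: int,
    output_settings: int,
) -> bool:
    """Hypothetical rule: fast clone only when event order is preserved
    AND the requested compression setting equals the input's setting."""
    return order_preserved and input_settings == output_settings


# Invented example values in ROOT's 100 * algorithm + level encoding.
print(should_fast_clone(True, 503, 503))   # same settings: byte copy is safe
print(should_fast_clone(True, 207, 503))   # settings differ: recompress instead
print(should_fast_clone(False, 503, 503))  # order not guaranteed: no fast clone
```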

@makortel
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

Unfortunately this behavior follows from a decision made a decade (or two) ago, and if it were changed, there is a high chance of breaking something in the offline computing system. Yes, the setup is brittle, but it doesn't seem worth touching at this point.

We should do better with RNTuple.
