
Edge case for processing 1 file when >1 workers are provided #773

Open
@pweigel

Description

Describe the bug
There is an edge case when processing a single-file "dataset" with more than one worker. The cause appears to be that

```python
n_workers = min(self._num_workers, nb_files)
if n_workers > 1:
    self.info(
        f"Starting pool of {n_workers} workers to process"
        f" {nb_files} {unit}"
    )
    manager = Manager()
    index = Value("i", 0)
    output_files = manager.list()
    pool = Pool(
        processes=n_workers,
        initializer=init_global_index,
        initargs=(index, output_files),
    )
    map_fn = pool.imap
```
sets n_workers = 1 when there is only one file and therefore skips multiprocessing, while
```python
if self._num_workers > 1:
    with global_index.get_lock():  # type: ignore[name-defined]
        start_idx = global_index.value  # type: ignore[name-defined]
        event_nos = np.arange(start_idx, start_idx + n_ids, 1).tolist()
        global_index.value += n_ids  # type: ignore[name-defined]
```
still checks self._num_workers and tries to access the global variables that only exist when multiprocessing is used.
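The mismatch can be reproduced in isolation. Below is a minimal sketch (the `Converter` class and its methods are hypothetical stand-ins, not graphnet's actual code): the pool, and hence the `global_index` module global set by its initializer, only exists when the *effective* worker count exceeds one, but the event-number request checks the *configured* worker count.

```python
from multiprocessing import Pool, Value


def init_global_index(index):
    """Pool initializer: publish the shared counter as a module global."""
    global global_index
    global_index = index


class Converter:
    def __init__(self, num_workers):
        self._num_workers = num_workers

    def process(self, files):
        n_workers = min(self._num_workers, len(files))
        if n_workers > 1:  # pool (and global_index) are only set up here
            index = Value("i", 0)
            with Pool(n_workers, initializer=init_global_index,
                      initargs=(index,)) as pool:
                return list(pool.imap(self._request_event_nos,
                                      [1] * len(files)))
        # Single-file path: no pool, so global_index is never defined.
        return [self._request_event_nos(1) for _ in files]

    def _request_event_nos(self, n_ids):
        if self._num_workers > 1:  # mismatched guard: configured count
            with global_index.get_lock():
                start_idx = global_index.value
                global_index.value += n_ids
            return start_idx
        return 0
```

With two configured workers but one file, the serial path is taken, yet the guard still routes the call to the shared counter and raises the `NameError` shown in the traceback below.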

To Reproduce
Steps to reproduce the behavior:

  1. Process i3 files using more than one worker with only one file in the input folder

Expected behavior
It should allocate just one worker and process the file normally.

Full traceback

```
File "<path>/graphnet/src/graphnet/data/dataconverter.py", line 260, in _request_event_nos

    with global_index.get_lock():  # type: ignore[name-defined]
         ^^^^^^^^^^^^
NameError: name 'global_index' is not defined. Did you mean: 'init_global_index'?
```
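One possible fix (an assumption on my part, not a vetted patch) is to guard on whether the shared counter actually exists rather than on the configured worker count, so the serial path never touches the multiprocessing globals. A sketch, with a hypothetical module-level `_serial_index` as the fallback counter:

```python
import numpy as np

_serial_index = 0  # fallback counter for the serial path (hypothetical)


def request_event_nos(n_ids):
    """Allocate a contiguous block of event numbers.

    `global_index` only exists after the pool initializer has run, so its
    presence, not the configured worker count, decides which path to take.
    """
    global _serial_index
    if "global_index" in globals():  # multiprocessing path
        with global_index.get_lock():
            start_idx = global_index.value
            global_index.value += n_ids
    else:  # serial path: plain counter, no shared state needed
        start_idx = _serial_index
        _serial_index += n_ids
    return np.arange(start_idx, start_idx + n_ids, 1).tolist()
```

Equivalently, the effective `n_workers` computed in `execute` could be stored on the instance and checked in `_request_event_nos` instead of `self._num_workers`.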

Labels: bug