How to optimize TorchIO's read operation? #568
-
Hi, how does TorchIO read the underlying data? Does it read all the data only once, save it in memory, and apply the requisite transforms per epoch, or does it read the data from the supplied files every epoch and then apply the transforms (if any)? I am asking because I am running on systems that share a network filesystem, so any I/O is a premium operation and takes a long time. For some cases, an epoch takes ~10 hours on this cluster but <1 hour on my local machine(s). Cheers,
Replies: 2 comments 9 replies
-
Normally, only once.
-
Hi, @sarthakpati. Good question!

It depends. When you instantiate an image, the data is not loaded. If you put it into a subject and you put that subject into a dataset, nothing is loaded until you actually need the data (e.g., for a transform). In the `SubjectsDataset`, the loaded data belongs to a deep-copied version of the subject, so the original instance is untouched. If you do load the data from the original subject instances, you won't need to read from disk every single time. I suppose you could use that approach if you have slow I/O and a lot of RAM.

```python
import torch
import psutil  # third-party; installed with pip
from tqdm import tqdm
import torchio as tio


def print_used_memory():
    gib = psutil.virtual_memory().used / 2**30
    print(f'RAM used: {gib:.1f} GiB')


colin = tio.datasets.Colin27()
subjects = [
    tio.Subject(
        t1=tio.ScalarImage(colin.t1.path),
        label=tio.LabelMap(colin.brain.path),
    )
    for _ in range(100)
]
dataset = tio.SubjectsDataset(subjects)
loader = torch.utils.data.DataLoader(dataset, batch_size=2, num_workers=4)

print_used_memory()
for batch in tqdm(loader):  # each subject is deep-copied; images are read from disk
    pass
print_used_memory()

print('Loading data...')
for subject in tqdm(subjects):
    subject.load()  # load images, caching the voxel data in RAM
print_used_memory()

for batch in tqdm(loader):  # images were already loaded before; no disk reads
    pass
print_used_memory()
```

Output:
As you can see, when the images were already preloaded, each iteration was about 6x faster, at the cost of some more RAM usage. Does that make sense?
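To see the caching mechanism in isolation, here is a minimal, TorchIO-free sketch of the same idea: a dataset that deep-copies each item on access, so loading the originals up front turns every subsequent access into a cheap in-memory copy. All class and function names here are illustrative stand-ins, not part of TorchIO's API.

```python
import copy


class LazyImage:
    """Illustrative stand-in for an image that defers disk reads."""

    def __init__(self, path):
        self.path = path
        self.data = None  # nothing read from disk yet

    def load(self):
        if self.data is None:
            # Pretend this is an expensive disk read
            self.data = f'voxels from {self.path}'
        return self.data


class CopyingDataset:
    """Returns a deep copy of each item, mimicking SubjectsDataset."""

    def __init__(self, images):
        self.images = images

    def __getitem__(self, index):
        image = copy.deepcopy(self.images[index])
        image.load()  # only "reads" if the copied image carries no data
        return image


images = [LazyImage(f'scan_{i}.nii') for i in range(3)]
dataset = CopyingDataset(images)

first = dataset[0]             # copy of an unloaded image: triggers a "read"
assert images[0].data is None  # the original instance is untouched

images[0].load()               # preload the original once, caching it in RAM
cached = dataset[0]            # the copy already carries the data: no "read"
assert cached.data == 'voxels from scan_0.nii'
```

Because `__getitem__` copies before loading, the expensive read repeats on every access until the original is preloaded, which is exactly the trade-off between I/O time and RAM described above.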