Atlas Dataset

A map-style dataset format for PyTorch, designed for storing large amounts of data. You can read the format specification here.

Features

Map style (__getitem__ and __len__ methods to support random access)
Sharding
Compression (optional, with several supported strategies; more info here)
Fast in-memory decompression (only a small section of a shard containing the requested index is decompressed)
Fast random access and even faster sequential access
Store examples in any format you want (uses pickle to serialize examples)
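To make the compression and serialization features above more concrete, here is a minimal sketch of block-wise compression using only the standard library. It is not the Atlas Dataset implementation or its on-disk format: the helpers `write_blocks` and `read_example` are hypothetical, and `zlib` is used only as a stand-in compression strategy. The point is that a lookup decompresses one block rather than the whole collection.

```python
import pickle
import zlib


def write_blocks(examples, block_size):
    """Pickle each example, group them into blocks, compress each block on its own."""
    blocks = []
    for start in range(0, len(examples), block_size):
        chunk = [pickle.dumps(ex) for ex in examples[start:start + block_size]]
        blocks.append(zlib.compress(pickle.dumps(chunk)))
    return blocks


def read_example(blocks, block_size, index):
    """Decompress only the block that holds `index`, then unpickle that one example."""
    block_id, offset = divmod(index, block_size)
    chunk = pickle.loads(zlib.decompress(blocks[block_id]))
    return pickle.loads(chunk[offset])


# Hypothetical data: any picklable Python object would work here.
examples = [{"id": i, "text": f"example {i}"} for i in range(250)]
blocks = write_blocks(examples, block_size=100)

item = read_example(blocks, block_size=100, index=142)
print(len(blocks), item)
```

Smaller blocks mean less data to decompress per lookup, while larger blocks usually compress better, so the block size is a trade-off between random access speed and storage size.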

Limitations

Streaming is not supported. This is a deliberate trade-off to keep random access efficient, so the dataset must be stored on the machine / cluster you use for training
Currently, you cannot modify / append / delete examples in an existing Atlas Dataset (unless you write your own script), but I plan to improve this in the near future

Installation

Currently the project is hosted only on GitHub. To install it, use:

pip install git+https://github.com/EIDOSLAB/torch-atlas-ds.git

or, if you use poetry:

poetry add git+https://github.com/EIDOSLAB/torch-atlas-ds.git

Example Workflows

Creating a Dataset

from torch_atlas_ds import AtlasDatasetWriter

# Initialize the writer
with AtlasDatasetWriter("dataset_root", shard_size=1000, block_size=100) as writer:
    for example in examples:  # `examples` is your data source
        writer.add_example(example)
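The following sketch shows how a global index could map to a shard, a block, and an offset for the values used above. It assumes that `shard_size` is the number of examples per shard and `block_size` is the number of examples per block; this README does not state those semantics, so check the format specification. The helper `locate` is hypothetical and not part of `torch_atlas_ds`.

```python
def locate(index, shard_size, block_size):
    """Map a global example index to (shard_id, block_id, offset_in_block).

    Assumes both sizes are counted in examples, which is an assumption
    made for illustration only.
    """
    if index < 0:
        raise IndexError("index must be non-negative")
    shard_id, index_in_shard = divmod(index, shard_size)
    block_id, offset = divmod(index_in_shard, block_size)
    return shard_id, block_id, offset


print(locate(42, shard_size=1000, block_size=100))    # (0, 0, 42)
print(locate(2345, shard_size=1000, block_size=100))  # (2, 3, 45)
```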

Reading a Dataset

from torch_atlas_ds import AtlasDataset

# Load the dataset
dataset = AtlasDataset("dataset_root")

# Access an example
example = dataset[42]
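Because the dataset is map-style, code that only relies on `len(dataset)` and `dataset[i]` can visit examples in any order. The sketch below illustrates this with a hypothetical stand-in class, `ToyMapDataset`, so that it runs without any files on disk. The helper `iter_batches` is also hypothetical and only mimics, in a simplified way, what a shuffling data loader does.

```python
import random


class ToyMapDataset:
    """Stand-in for any map-style dataset: it only needs __len__ and __getitem__."""

    def __init__(self, n):
        self._items = [i * i for i in range(n)]

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        return self._items[index]


def iter_batches(dataset, batch_size, seed=0):
    """Yield batches of examples in a shuffled order using random access."""
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [dataset[i] for i in indices[start:start + batch_size]]


dataset = ToyMapDataset(10)
batches = list(iter_batches(dataset, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

Shuffled access like this is the pattern that benefits from fast random access; reading indices in increasing order corresponds to the sequential case.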

Warnings

Because the dataset uses pickle, only load datasets in Atlas Dataset format that come from trusted parties: unpickling can execute arbitrary code. If you are the author of the dataset, you do not need to worry about this.
The format of the dataset may change in the future if the need arises, since this project is still young (currently no change is planned)
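The pickle warning above can be illustrated with a deliberately harmless sketch. The class `Payload` is hypothetical; it uses the standard `__reduce__` hook to make `pickle.loads` call a function chosen by whoever created the pickle. Here that function is the built-in `sorted`, but it could be any importable callable.

```python
import pickle


class Payload:
    def __reduce__(self):
        # Tell pickle to store "call sorted([3, 1, 2])" instead of the object's state.
        return (sorted, ([3, 1, 2],))


data = pickle.dumps(Payload())

# Loading runs the stored call; the result is not a Payload instance at all.
loaded = pickle.loads(data)
print(type(loaded).__name__, loaded)
```

The loader has no say in which callable runs, which is why pickled data from an untrusted source should never be loaded.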

Author

Luca Molinaro, PhD Student @ UniTO
