For converting image annotations datasets from one format into another. Filters can be supplied as well, e.g., for cleaning up the data.
Additional libraries further extend the available plugins with additional readers, filters and/or writers.
Via PyPI:
pip install image_dataset_converter
The latest code straight from the repository:
pip install git+https://github.com/waikato-datamining/image-dataset-converter.git
Docker images are available as well. Please see the following page por more information:
https://github.com/waikato-datamining/image-dataset-converter-all/tree/main/docker
The following dataset formats are supported:
Domain | Format | Read | Write |
---|---|---|---|
Depth data | CSV | Y | Y |
Depth data | Grayscale | Y | Y |
Depth data | Numpy | Y | Y |
Depth data | PFM | Y | Y |
Image classification | ADAMS | Y | Y |
Image classification | subdir | Y | Y |
Image segmentation | Blue-channel | Y | Y |
Image segmentation | Grayscale | Y | Y |
Image segmentation | Indexed PNG | Y | Y |
Image segmentation | Layer segments | Y | Y |
Object detection | ADAMS | Y | Y |
Object detection | COCO | Y | Y |
Object detection | OPEX | Y | Y |
Object detection | ROI CSV | Y | Y |
Object detection | VOC | Y | Y |
Object detection | YOLO | Y | Y |
Readers and writers typically have a suffix that denotes their data domain:
-dp
- Depth data-ic
- Image classification-is
- Image segmentation-od
- Object detection
usage: idc-convert [-h|--help|--help-all|--help-plugin NAME]
[-u INTERVAL] [-b|--force_batch] [--placeholders FILE] [--dump_pipeline FILE]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
reader
[filter [filter [...]]]
[writer]
Tool for converting between image annotation dataset formats.
readers (20):
from-adams-ic, from-adams-od, from-blue-channel-is, from-coco-od,
from-csv-dp, from-data, from-grayscale-dp, from-grayscale-is,
from-indexed-png-is, from-layer-segments-is, from-multi,
from-numpy-dp, from-opex-od, from-pfm-dp, from-pyfunc,
from-roicsv-od, from-subdir-ic, from-voc-od, from-yolo-od, poll-dir
filters (49):
any-to-rgb, apply-ext-mask, apply-label-mask,
check-duplicate-filenames, coerce-box, coerce-mask,
convert-image-format, depth-to-grayscale, depth-to-greyscale*,
dimension-discarder, discard-by-name, discard-invalid-images,
discard-negatives, filter-labels, grayscale-to-binary,
greyscale-to-binary*, inspect, label-from-name, label-present*,
label-present-ic, label-present-is, label-present-od,
label-to-metadata, map-labels, max-records, metadata,
metadata-from-name, metadata-od, metadata-to-placeholder, od-to-ic,
od-to-is, passthrough, polygon-discarder, polygon-simplifier,
pyfunc-filter, randomize-records, record-window, remove-classes,
rename, rgb-to-grayscale, rgb-to-greyscale*, sample, set-placeholder,
sort-pixels, split-records, strip-annotations, tee, use-mask,
write-labels
writers (19):
to-adams-ic, to-adams-od, to-blue-channel-is, to-coco-od, to-csv-dp,
to-data, to-grayscale-dp, to-grayscale-is, to-indexed-png-is,
to-layer-segments-is, to-multi, to-numpy-dp, to-opex-od, to-pfm-dp,
to-pyfunc, to-roicsv-od, to-subdir-ic, to-voc-od, to-yolo-od
optional arguments:
-h, --help show basic help message and exit
--help-all show basic help message plus help on all plugins and exit
--help-plugin NAME show help message for plugin NAME and exit
-u INTERVAL, --update_interval INTERVAL
outputs the progress every INTERVAL records (default: 1000)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
the logging level to use (default: WARN)
-b, --force_batch processes the data in batches
--placeholders FILE
The file with custom placeholders to load (format: key=value).
--dump_pipeline FILE
The file to dump the pipeline command in.
usage: idc-exec [-h] -p PIPELINE -g GENERATOR [-n] [-P PREFIX]
[--placeholders FILE] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for executing a pipeline multiple times, each time with a different set
of variables expanded. A variable is surrounded by curly quotes (e.g.,
variable 'i' gets referenced with '{i}'). Available generators: dirs, list,
null, range
options:
-h, --help show this help message and exit
-p PIPELINE, --pipeline PIPELINE
The pipeline template with variables to expand and
then execute. (default: None)
-g GENERATOR, --generator GENERATOR
The generator plugin to use. (default: None)
-n, --dry_run Applies the generator to the pipeline template and
only outputs it on stdout. (default: False)
-P PREFIX, --prefix PREFIX
The string to prefix the pipeline with when in dry-run
mode. (default: None)
--placeholders FILE The file with custom placeholders to load (format:
key=value). (default: None)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
Readers tend to support input via file lists. The idc-find
tool can generate
these.
usage: idc-find [-h] -i DIR [DIR ...] [-r] -o FILE [-m [REGEXP ...]]
[-n [REGEXP ...]] [--split_ratios [SPLIT_RATIOS ...]]
[--split_names [SPLIT_NAMES ...]]
[--split_name_separator SPLIT_NAME_SEPARATOR]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for locating files in directories that match certain patterns and store
them in files.
options:
-h, --help show this help message and exit
-i DIR [DIR ...], --input DIR [DIR ...]
The dir(s) to scan for files. (default: None)
-r, --recursive Whether to search the directories recursively
(default: False)
-o FILE, --output FILE
The file to store the located file names in (default:
None)
-m [REGEXP ...], --match [REGEXP ...]
The regular expression that the (full) file names must
match to be included (default: None)
-n [REGEXP ...], --not-match [REGEXP ...]
The regular expression that the (full) file names must
match to be excluded (default: None)
--split_ratios [SPLIT_RATIOS ...]
The split ratios to use for generating the splits
(int; must sum up to 100) (default: None)
--split_names [SPLIT_NAMES ...]
The split names to use as filename suffixes for the
generated splits (before .ext) (default: None)
--split_name_separator SPLIT_NAME_SEPARATOR
The separator to use between file name and split name
(default: -)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
usage: idc-help [-h] [-c [PACKAGE ...]] [-e EXCLUDED_CLASS_LISTERS]
[-T {pipeline,generator}] [-p NAME] [-f {text,markdown}]
[-L INT] [-o PATH] [-i FILE] [-t TITLE]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for outputting help for plugins in various formats.
options:
-h, --help show this help message and exit
-c [PACKAGE ...], --custom_class_listers [PACKAGE ...]
The custom class listers to use, uses the default ones
if not provided. (default: None)
-e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
The comma-separated list of class listers to exclude.
(default: None)
-T {pipeline,generator}, --plugin_type {pipeline,generator}
The types of plugins to generate the help for.
(default: pipeline)
-p NAME, --plugin_name NAME
The name of the plugin to generate the help for,
generates it for all if not specified (default: None)
-f {text,markdown}, --help_format {text,markdown}
The output format to generate (default: text)
-L INT, --heading_level INT
The level to use for the heading (default: 1)
-o PATH, --output PATH
The directory or file to store the help in; outputs it
to stdout if not supplied; if pointing to a directory,
automatically generates file name from plugin name and
help format (default: None)
-i FILE, --index_file FILE
The file in the output directory to generate with an
overview of all plugins, grouped by type (in markdown
format, links them to the other generated files)
(default: None)
-t TITLE, --index_title TITLE
The title to use in the index file (default: image-
dataset-converter plugins)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
usage: idc-registry [-h] [-c CUSTOM_CLASS_LISTERS] [-e EXCLUDED_CLASS_LISTERS]
[-l {plugins,pipeline,custom-class-listers,env-class-listers,readers,filters,writers,generators}]
For inspecting/querying the registry.
options:
-h, --help show this help message and exit
-c CUSTOM_CLASS_LISTERS, --custom_class_listers CUSTOM_CLASS_LISTERS
The comma-separated list of custom class listers to
use. (default: None)
-e EXCLUDED_CLASS_LISTERS, --excluded_class_listers EXCLUDED_CLASS_LISTERS
The comma-separated list of class listers to exclude.
(default: None)
-l {plugins,pipeline,custom-class-listers,env-class-listers,readers,filters,writers,generators}, --list {plugins,pipeline,custom-class-listers,env-class-listers,readers,filters,writers,generators}
For outputting various lists on stdout. (default:
None)
usage: idc-test-generator [-h] -g GENERATOR
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Tool for testing generators by outputting the generated variables and their
associatd values. Available generators: dirs, list, null, range
options:
-h, --help show this help message and exit
-g GENERATOR, --generator GENERATOR
The generator plugin to use. (default: None)
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --logging_level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
The logging level to use. (default: WARN)
You can find help screens for the plugins here:
- Pipeline plugins (readers/filters/writers)
- Generator plugins (used by
idc-exec
)
Examples can be found on the image-dataset-converter-examples website.
The image-dataset-converter uses the class lister registry provided by the seppl library.
Each module defines a function, typically called list_classes
that returns
a dictionary of names of superclasses associated with a list of modules that
should be scanned for derived classes. Here is an example:
from typing import List, Dict
def list_classes() -> Dict[str, List[str]]:
return {
"seppl.io.Reader": [
"mod.ule1",
"mod.ule2",
],
"seppl.io.Filter": [
"mod.ule3",
"mod.ule4",
],
"seppl.io.Writer": [
"mod.ule5",
],
}
Such a class lister gets referenced in the entry_points
section of the setup.py
file:
entry_points={
"class_lister": [
"unique_string=module_name:function_name",
],
},
:function_name
can be omitted if :list_classes
.
The following environment variables can be used to influence the class listers:
IDC_CLASS_LISTERS
IDC_CLASS_LISTERS_EXCL
Each variable is a comma-separated list of module_name:function_name
, defining the class listers.
Whenever possible, images get copied rather than read and then rewritten, to avoid loss of quality by subsequent read/write operations. To minimize introduction of artifacts in JPEG images, a default quality of 90% is used, which is a good compromise between size and quality.
However, this quality parameter can be overridden globally with the following environment variable:
IDC_JPEG_QUALITY
For instance, the following setting would use a quality of 100%:
IDC_JPEG_QUALITY=100