[Feature]: Document and Add Control Flag for Automatic Object Identity-Based Deduplication #1303

@bendichter

Summary

HDMF currently performs automatic object identity-based deduplication when the same Python object is referenced multiple times in a hierarchical data structure. This behavior creates soft links instead of duplicating data, but it is undocumented and cannot be controlled by users.

Current Behavior

When the same Python Data or Container object is used in multiple locations within a hierarchical structure:

  1. The first occurrence is stored as a full dataset/group
  2. Subsequent references become soft links to the original location
  3. This happens automatically based on Python object identity (id(obj))

This becomes a problem when the user wants to edit the data at one location but not the other: both paths resolve to the same dataset, so a change made through one is visible through both.

Example:

import numpy as np
from hdmf import Container, Data

# Same data object used in multiple places
shared_data = Data(name="shared", data=np.array([1, 2, 3, 4, 5]))

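# SomeContainer is a placeholder for any Container subclass that holds a Data object
container1 = SomeContainer(name="container1", data=shared_data)
container2 = SomeContainer(name="container2", data=shared_data)  # Same object: will become a soft link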

# Results in HDF5 file:
# /container1/data -> actual dataset
# /container2/data -> soft link to /container1/data

Issues with Current Implementation

1. Undocumented Behavior

  • This behavior is not documented in user guides
  • Users may be surprised by soft links appearing in their files
  • The distinction between object identity vs. content equality is not clear
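
To make the identity vs. equality distinction concrete, here is a minimal sketch in plain Python (no HDMF involved). Two objects with equal content are still different objects, so identity-based deduplication would presumably link only references to the very same object, not equal-valued copies.

```python
a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 5]   # equal content, but a separate object
c = a                 # a second name for the same object

same_content = (a == b)      # True: compares values
same_object_ab = (a is b)    # False: compares identity, i.e. id(a) == id(b)
same_object_ac = (a is c)    # True: c and a are one object
```

Under the behavior described in this issue, only the `a`/`c` case would produce a soft link.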

2. No User Control

  • Users cannot disable this behavior if they want separate copies
  • No way to force duplication even when using the same object
  • Behavior is always implicit based on object identity

3. Schema Documentation Gap

  • HDMF schema documentation doesn't mention this linking behavior

Downstream tool developers need to know that a soft link may appear anywhere an identical dataset would otherwise be stored, and that their tools must handle it.
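
As a rough illustration of what a downstream tool would have to deal with, the sketch below uses h5py directly (not HDMF) to recreate the file layout from the example, detect which path is a soft link, and show that a write through the link changes the original dataset. The file name `demo.h5` and the helper `link_kind` are made up for this sketch, and the layout is built by hand rather than by HDMF.

```python
import h5py

def link_kind(group, name):
    """Return 'soft', 'external', or 'hard' for the link stored at `name`."""
    link = group.get(name, getlink=True)
    if isinstance(link, h5py.SoftLink):
        return "soft"
    if isinstance(link, h5py.ExternalLink):
        return "external"
    return "hard"

# In-memory file: the "core" driver with backing_store=False writes nothing to disk
with h5py.File("demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("container1/data", data=[1, 2, 3, 4, 5])
    f.create_group("container2")
    f["container2/data"] = h5py.SoftLink("/container1/data")

    kinds = (link_kind(f, "container1/data"), link_kind(f, "container2/data"))
    target = f.get("container2/data", getlink=True).path

    # Writing through the soft link modifies the one underlying dataset
    f["container2/data"][0] = 99
    first_value = int(f["container1/data"][0])
```

A tool that assumes every path holds its own dataset would miss that `/container2/data` and `/container1/data` share storage.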

What solution would you like?

  1. I would like to update documentation
  2. I would like control over whether this happens:
# Proposed flag (does not exist yet) in HDF5IO.write() and related methods
io.write(container, deduplicate_objects=True)   # Same as the current default behavior
io.write(container, deduplicate_objects=False)  # Force separate copies

# Or in BuildManager
manager.build(container, deduplicate_objects=False)
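
To pin down the intended semantics of the flag, here is a toy sketch in plain Python. It is not HDMF's implementation: `plan_write` is a hypothetical function that walks a nested dict and decides, per path, whether to write a dataset or a soft link, keyed on `id(obj)` as described under Current Behavior.

```python
def plan_write(tree, deduplicate_objects=True):
    """Map each leaf path to ('dataset', value) or ('softlink', target_path)."""
    plan = {}
    first_path_by_id = {}  # id(obj) -> path where the object was first written

    def visit(node, path):
        for name, child in node.items():
            child_path = f"{path}/{name}"
            if isinstance(child, dict):
                visit(child, child_path)  # a group: recurse into it
                continue
            key = id(child)
            if deduplicate_objects and key in first_path_by_id:
                # Same Python object seen before: link to the first location
                plan[child_path] = ("softlink", first_path_by_id[key])
            else:
                first_path_by_id.setdefault(key, child_path)
                plan[child_path] = ("dataset", child)

    visit(tree, "")
    return plan

shared = [1, 2, 3, 4, 5]
tree = {"container1": {"data": shared}, "container2": {"data": shared}}

default_plan = plan_write(tree)                              # current behavior
separate_plan = plan_write(tree, deduplicate_objects=False)  # proposed opt-out
```

With the default, `/container2/data` is planned as a soft link to `/container1/data`; with `deduplicate_objects=False`, both paths are planned as full datasets, so the data would be written twice.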

Do you have any interest in helping implement the feature?

Yes.
