Open
Description
Summary
HDMF currently performs automatic object identity-based deduplication when the same Python object is referenced multiple times in a hierarchical data structure. This behavior creates soft links instead of duplicating data, but it is undocumented and cannot be controlled by users.
Current Behavior
When the same Python Data or Container object is used in multiple locations within a hierarchical structure:
- The first occurrence is stored as a full dataset/group
- Subsequent references become soft links to the original location
- This happens automatically based on Python object identity (id(obj))
This can become a problem if the user later wants to edit one of the stored locations but not the other, because both paths resolve to the same underlying data.
Example:
import numpy as np
from hdmf import Container, Data
# SomeContainer is a placeholder for any Container subclass with a "data" field
# Same data object used in multiple places
shared_data = Data(name="shared", data=np.array([1, 2, 3, 4, 5]))
container1 = SomeContainer(name="container1", data=shared_data)
container2 = SomeContainer(name="container2", data=shared_data) # Will become soft link
# Results in HDF5 file:
# /container1/data -> actual dataset
# /container2/data -> soft link to /container1/data
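The identity-based rule described above can be modeled with a small stand-alone sketch. This is not HDMF's implementation: `plan_writes` and the tuple layout are hypothetical names chosen for illustration, and plain lists stand in for Data objects. It only shows why two references to one object yield a link, while an equal but distinct object is still written in full.

```python
def plan_writes(items):
    """Toy model: map each (path, obj) pair to a dataset or a soft link.

    The first time an object is seen (by identity) it is recorded as a
    dataset; later references to the same object become links to that path.
    """
    first_path = {}  # id(obj) -> path where obj was first written
    plan = []
    for path, obj in items:
        key = id(obj)
        if key in first_path:
            plan.append((path, "softlink", first_path[key]))
        else:
            first_path[key] = path
            plan.append((path, "dataset", None))
    return plan


shared = [1, 2, 3, 4, 5]
equal_copy = list(shared)  # equal content, but a different object

plan = plan_writes([
    ("/container1/data", shared),
    ("/container2/data", shared),      # same object: becomes a link
    ("/container3/data", equal_copy),  # equal content: still a full dataset
])
for entry in plan:
    print(entry)
```

The third entry is the case that tends to surprise: content equality plays no role, only whether the very same Python object was passed twice.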
Issues with Current Implementation
1. Undocumented Behavior
- This behavior is not documented in user guides
- Users may be surprised by soft links appearing in their files
- The distinction between object identity vs. content equality is not clear
2. No User Control
- Users cannot disable this behavior if they want separate copies
- No way to force duplication even when using the same object
- Behavior is always implicit based on object identity
3. Schema Documentation Gap
- HDMF schema documentation doesn't mention this linking behavior
It is important for downstream tool developers to know that they need to handle soft links anywhere an identical dataset might be stored.
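For downstream tools, one way to detect such links is h5py's `getlink=True` lookup, which returns the link object instead of the dataset it points to. The following is a minimal sketch, assuming h5py and NumPy are installed. The file is built in memory by hand using the paths from the example above; it is not produced by HDMF, and the file name is arbitrary.

```python
import h5py
import numpy as np

# Build a small in-memory file that mimics the layout from the example.
with h5py.File("dedup_demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("/container1/data", data=np.array([1, 2, 3, 4, 5]))
    f.create_group("/container2")
    f["/container2/data"] = h5py.SoftLink("/container1/data")

    link_kinds = {}
    for path in ("/container1/data", "/container2/data"):
        link = f.get(path, getlink=True)  # link object, not the dataset itself
        if isinstance(link, h5py.SoftLink):
            link_kinds[path] = ("soft", link.path)
        else:
            link_kinds[path] = ("hard", None)

    # Both paths resolve to the same stored values.
    same_values = np.array_equal(
        f["/container1/data"][()], f["/container2/data"][()]
    )

print(link_kinds)
```

A tool that copies or edits datasets path by path could use a check like this to decide whether it is about to modify shared data.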
What solution would you like?
- I would like to update documentation
- I would like control over whether this happens:
# In HDF5IO.write() and related methods
io.write(container, deduplicate_objects=True) # Current default behavior
io.write(container, deduplicate_objects=False) # Force separate copies
# Or in BuildManager
manager.build(container, deduplicate_objects=False)
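To make the requested semantics concrete, here is a toy sketch of how such a flag could behave. Everything in it is an assumption for illustration: `write_tree` is a hypothetical stand-in, not HDF5IO.write() or BuildManager.build(), and `deduplicate_objects` is only the parameter name proposed in this issue.

```python
def write_tree(items, deduplicate_objects=True):
    """Hypothetical writer: return {path: action} for (path, obj) pairs."""
    seen = {}  # id(obj) -> first path written (used only when deduplicating)
    actions = {}
    for path, obj in items:
        prior = seen.get(id(obj)) if deduplicate_objects else None
        if prior is not None:
            actions[path] = "softlink -> " + prior
        else:
            seen.setdefault(id(obj), path)
            actions[path] = "dataset"
    return actions


shared = [1, 2, 3, 4, 5]
tree = [("/container1/data", shared), ("/container2/data", shared)]

with_dedup = write_tree(tree)                                # proposed default
without_dedup = write_tree(tree, deduplicate_objects=False)  # separate copies
print(with_dedup)
print(without_dedup)
```

Keeping True as the default would preserve the current file layout for existing users, while False would give each location its own independent copy.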
Do you have any interest in helping implement the feature?
Yes.