A high-performance Python tool for automatically compressing subfolders based on size criteria. This tool recursively scans directories, calculates folder sizes, and creates compressed archives for folders that meet specified size thresholds.
- Size-based compression: Only compress folders smaller than a specified size threshold
- Parallel processing: Multi-threaded processing for optimal performance on large datasets
- Non-destructive: Original folders remain unchanged, archives are stored in separate subdirectories
- Configurable: Customizable archive names, folder names, and size thresholds
- Robust error handling: Graceful handling of permission errors and inaccessible files
- Detailed logging: Comprehensive logging with progress tracking and statistics
- Cross-platform: Works on Windows, macOS, and Linux
- Single directory mode: Option to process only the specified directory (not subdirectories)
- Resume functionality: Log file tracking prevents reprocessing already compressed folders
- Compression statistics: Detailed reporting of original vs compressed sizes and ratios
- Plan mode: Preview what would be processed without creating any archives (dry run)
- Python 3.6 or higher (Python 3.7+ recommended for optimal compression)
- No external dependencies (uses only Python standard library)
- Python 3.6+: Basic functionality (default compression)
- Python 3.7+: Optimal compression with `compresslevel` parameter support
- Python 3.8+: Enhanced pathlib support for better performance
The tool automatically detects your Python version and uses the appropriate compression method.
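For illustration, a version check along these lines selects the right `zipfile` call; this is a minimal sketch, not the tool's actual source:

```python
import sys
import zipfile


def open_archive(path):
    """Open a ZIP archive for writing, using compresslevel where the runtime supports it."""
    if sys.version_info >= (3, 7):
        # Python 3.7+ accepts compresslevel; level 6 balances speed and size
        return zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED, compresslevel=6)
    # Python 3.6 falls back to the default deflate level
    return zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED)
```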
- Clone or download this repository
- Make the script executable (optional): `chmod +x mdf_zipper.py`
- Test compatibility (optional): `python test_python_compatibility.py`
Basic usage:
```bash
python mdf_zipper.py /path/to/datasets
```
This will:
- Scan all subfolders in `/path/to/datasets`
- Compress folders smaller than 10 GB (default threshold)
- Create `dataset.zip` files in `.mdf` subdirectories
Additional options:
```bash
# Set custom size threshold (5 GB)
python mdf_zipper.py ~/datasets/abcd --max-size 5.0

# Use custom archive name and folder
python mdf_zipper.py ~/datasets/abcd --archive-name "backup.zip" --archive-folder "archives"

# Increase parallel processing (8 workers)
python mdf_zipper.py ~/datasets/abcd --workers 8

# Enable verbose logging
python mdf_zipper.py ~/datasets/abcd --verbose

# Process only a single directory (not subdirectories)
python mdf_zipper.py ~/datasets/specific_folder --single-directory

# Use log file for resume functionality
python mdf_zipper.py ~/datasets/abcd --log-file "processing.log"

# Preview what would be processed (plan mode)
python mdf_zipper.py ~/datasets/abcd --plan

# Combine features for comprehensive processing
python mdf_zipper.py ~/datasets/abcd --max-size 5.0 --workers 8 --log-file "~/logs/processing.json" --verbose
```
| Option | Description | Default |
|---|---|---|
| `directory` | Root directory to process | Required |
| `--max-size` | Maximum size in GB for compression | 10.0 |
| `--archive-name` | Name of the zip file to create | `dataset.zip` |
| `--archive-folder` | Folder name to store archives | `.mdf` |
| `--workers` | Number of parallel worker threads | 4 |
| `--verbose` | Enable verbose logging | False |
| `--single-directory` | Process only the specified directory | False |
| `--log-file` | Path to log file for resume functionality | None |
| `--plan` | Show what would be processed (dry run mode) | False |
Use `--plan` to preview what would be processed without creating any archives:
```bash
# See what would be compressed with current settings
python mdf_zipper.py ~/datasets/abcd --plan

# Preview with different threshold
python mdf_zipper.py ~/datasets/abcd --plan --max-size 2.0

# Plan single directory processing
python mdf_zipper.py ~/datasets/experiment1 --plan --single-directory

# Detailed plan with verbose output
python mdf_zipper.py ~/datasets/abcd --plan --verbose
```
Plan mode shows:
- Which folders would be compressed
- Which folders would be skipped (too large)
- Estimated compression ratios and space savings
- Total data size that would be processed
- Archive locations that would be created
Benefits:
- Preview operations before running on large datasets
- Estimate storage requirements for compressed archives
- Validate settings like size thresholds and folder selection
- Safe exploration of directory structures without modifications
Use `--single-directory` to process only the specified directory itself, rather than its subdirectories:
```bash
# Process only the contents of ~/datasets/experiment1 (not subdirectories)
python mdf_zipper.py ~/datasets/experiment1 --single-directory
```
The `--log-file` option enables resume functionality by tracking processed folders:
```bash
# First run - processes all folders and saves log
python mdf_zipper.py ~/datasets/abcd --log-file "processing.log"

# Second run - skips already processed folders
python mdf_zipper.py ~/datasets/abcd --log-file "processing.log"
```
The log file contains detailed information about each processed folder:
- Processing timestamp
- Original and compressed sizes
- File counts and compression ratios
- Processing status (compressed/skipped/failed)
The tool intelligently determines when to reprocess folders:
- Archive missing: If the ZIP file was deleted, the folder is reprocessed
- Content changed: If folder size changed since last processing, it's reprocessed
- Settings changed: Different archive names or folders trigger reprocessing
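A simplified sketch of that decision logic, assuming a log entry shaped like the JSON example later in this README (the function name and exact checks are illustrative):

```python
import os


def needs_reprocessing(folder_path, current_size_bytes, log_entry,
                       archive_name="dataset.zip", archive_folder=".mdf"):
    """Return True if a previously logged folder should be compressed again."""
    if log_entry is None:
        return True  # never processed before
    archive_path = os.path.join(folder_path, archive_folder, archive_name)
    if not os.path.exists(archive_path):
        return True  # archive missing: the ZIP file was deleted
    if log_entry.get("original_size_bytes") != current_size_bytes:
        return True  # content changed since last processing
    if log_entry.get("archive_path") != archive_path:
        return True  # settings changed: different archive name or folder
    return False
```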
- Directory Scanning: The tool scans the specified root directory for immediate subfolders
- Size Calculation: For each subfolder, it recursively calculates the total size of all files
- Size Filtering: Folders larger than the specified threshold are skipped
- Compression: Qualifying folders are compressed into ZIP archives
- Archive Storage: ZIP files are stored in a subdirectory within each processed folder
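The heart of the size calculation, filtering, and compression steps is a recursive size check followed by a recursive ZIP. Here is a minimal standard-library sketch of that core; the real mdf_zipper.py layers parallelism, logging, resume tracking, and error handling on top:

```python
import os
import zipfile
from pathlib import Path


def folder_size_bytes(folder, skip_dir=".mdf"):
    """Recursively sum file sizes, skipping the archive subdirectory."""
    total = 0
    for root, dirs, files in os.walk(folder):
        dirs[:] = [d for d in dirs if d != skip_dir]  # don't count our own archives
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def compress_folder(folder, max_size_gb=10.0, archive_folder=".mdf", archive_name="dataset.zip"):
    """Compress folder into <folder>/<archive_folder>/<archive_name> if it is small enough."""
    folder = Path(folder)
    if folder_size_bytes(folder, archive_folder) > max_size_gb * 1024 ** 3:
        return  # too large: skip
    target_dir = folder / archive_folder
    target_dir.mkdir(exist_ok=True)
    with zipfile.ZipFile(target_dir / archive_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(folder):
            dirs[:] = [d for d in dirs if d != archive_folder]  # never zip the archive itself
            for name in files:
                path = Path(root) / name
                zf.write(path, arcname=str(path.relative_to(folder)))
```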
Before processing:
```
datasets/
├── small_dataset/              # 2 GB
│   ├── data1.txt
│   ├── data2.txt
│   └── subfolder/
│       └── data3.txt
├── medium_dataset/             # 8 GB
│   ├── images/
│   └── annotations/
└── large_dataset/              # 15 GB (exceeds 10 GB threshold)
    ├── videos/
    └── metadata/
```
After processing:
```
datasets/
├── small_dataset/              # 2 GB
│   ├── data1.txt
│   ├── data2.txt
│   ├── subfolder/
│   │   └── data3.txt
│   └── .mdf/
│       └── dataset.zip         # Contains all files from small_dataset/
├── medium_dataset/             # 8 GB
│   ├── images/
│   ├── annotations/
│   └── .mdf/
│       └── dataset.zip         # Contains all files from medium_dataset/
└── large_dataset/              # 15 GB (unchanged - too large)
    ├── videos/
    └── metadata/
```
The tool is optimized for large datasets with the following features:
- Parallel Processing: Multiple folders are processed simultaneously using ThreadPoolExecutor
- Efficient Size Calculation: Uses `os.walk()` for fast directory traversal
- Memory Efficient: Processes files one at a time during compression
- Skip Logic: Avoids processing archive folders to prevent infinite loops
- Compression Level: Uses balanced compression (level 6) for good speed/size ratio
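A rough sketch of that parallel dispatch, assuming a `compress_folder` function like the one sketched in the how-it-works example above:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def process_root(root, workers=4):
    """Compress each immediate subfolder of root, several folders at a time."""
    subfolders = [p for p in Path(root).iterdir() if p.is_dir()]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(compress_folder, folder): folder for folder in subfolders}
        for future in as_completed(futures):
            folder = futures[future]
            try:
                future.result()
            except Exception as exc:  # one failed folder should not stop the rest
                print(f"Failed to compress {folder}: {exc}")
```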
The tool gracefully handles various error conditions:
- Permission Errors: Logs warnings for inaccessible files/folders and continues
- Missing Directories: Validates directory existence before processing
- Disk Space: ZIP creation failures are logged and don't stop other operations
- Interruption: Supports Ctrl+C for clean cancellation
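For example, permission problems during size calculation can be logged and skipped rather than allowed to abort the run; a sketch of the general pattern:

```python
import logging
import os


def safe_folder_size(folder):
    """Sum file sizes under folder, logging and skipping anything unreadable."""

    def warn(err):
        logging.warning("Skipping %s: %s", err.filename, err)

    total = 0
    for root, _dirs, files in os.walk(folder, onerror=warn):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError as exc:  # PermissionError is a subclass of OSError
                logging.warning("Skipping unreadable file %s: %s", name, exc)
    return total
```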
The tool provides detailed logging information:
- Folder processing progress
- Size calculations and file counts
- Compression status and results
- Error messages and warnings
- Final summary statistics
Example workflows:
```bash
# See what would happen with 2 GB threshold
python mdf_zipper.py ~/research/datasets --plan --max-size 2.0
python mdf_zipper.py ~/research/datasets --max-size 2.0

python mdf_zipper.py /data/experiments --workers 8 --verbose --log-file "processing.json"

python mdf_zipper.py ~/projects --archive-name "project_backup.zip" --archive-folder "backups" --log-file "~/logs/projects.json"

# First preview what would happen
python mdf_zipper.py ~/datasets/specific_experiment --plan --single-directory --max-size 1.0

# Then execute if satisfied with the plan
python mdf_zipper.py ~/datasets/specific_experiment --single-directory --max-size 1.0
```
The tool provides a comprehensive summary after processing, including detailed compression statistics:
Normal Mode:
```
============================================================
PROCESSING SUMMARY
============================================================
Total folders processed: 25
Folders compressed: 18
Folders skipped (too large): 5
Folders failed: 2
Folders already processed: 12
Total original data size: 127.45 GB
Total compressed data size: 32.18 GB
Overall compression ratio: 25.2%
Space saved: 95.27 GB
============================================================
```
The tool now tracks and displays:
- Original data size: Total size of all processed folders
- Compressed data size: Total size of all created ZIP archives
- Compression ratio: Percentage of original size after compression
- Space saved: Amount of storage space saved through compression
- Per-folder statistics: Individual compression ratios for each processed folder
- Already processed: Count of folders skipped due to previous processing
This information helps you understand the effectiveness of compression for your specific datasets and make informed decisions about storage optimization.
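For reference, the ratio and savings figures follow directly from the logged byte counts; an illustrative calculation (not the tool's code), using the same 1024³-bytes-per-GB convention as the log file:

```python
def summarize(original_bytes, compressed_bytes):
    """Reproduce the summary figures: sizes in GB, compression ratio, and space saved."""
    gib = 1024 ** 3
    return {
        "original_gb": original_bytes / gib,
        "compressed_gb": compressed_bytes / gib,
        # ratio = compressed size as a percentage of the original size (lower is better)
        "compression_ratio_pct": 100.0 * compressed_bytes / original_bytes,
        "space_saved_gb": (original_bytes - compressed_bytes) / gib,
    }
```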
The log file is stored in JSON format with detailed information:
```json
{
  "/path/to/folder": {
    "folder_name": "experiment_data",
    "processed_date": "2025-05-29T13:38:35.934567",
    "original_size_bytes": 1073741824,
    "original_size_gb": 1.0,
    "file_count": 1500,
    "compressed_size_bytes": 268435456,
    "compressed_size_gb": 0.25,
    "compression_ratio": 25.0,
    "status": "compressed",
    "archive_path": "/path/to/folder/.mdf/dataset.zip"
  }
}
```
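Because the log is plain JSON, it can also be inspected directly, for example to review per-folder compression ratios (hypothetical filename, structure as shown above):

```python
import json

# Inspect a processing log produced with --log-file
with open("processing.log") as fh:
    log = json.load(fh)

for folder, entry in log.items():
    if entry["status"] == "compressed":
        print(f"{entry['folder_name']}: {entry['compression_ratio']:.1f}% of original size")
```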
The MDF Zipper includes an extensive test suite to ensure absolute safety for high-value datasets:
```bash
# Run critical safety tests (RECOMMENDED before processing valuable data)
python run_tests.py --critical-safety

# Run all tests
python run_tests.py --all

# Run platform-specific tests (UNIX/Linux only)
python run_tests.py --unix-linux

# Run quick tests (excludes slow tests)
python run_tests.py --quick

# Run with coverage report
python run_tests.py --coverage
```
- Critical Safety Tests: Atomic operations, data integrity, failure recovery
- Data Integrity Tests: Original file protection, archive validation
- Stress Tests: Large datasets, concurrent access, memory limits
- Edge Cases: Unicode files, permissions, symlinks, special file types
- UNIX/Linux Specific: File permissions, signals, special files (FIFOs, device files), extended attributes
- Cross-Platform: Windows, macOS, Linux compatibility
✅ Original files NEVER modified - SHA256 checksum verification
✅ Original files NEVER moved - Absolute path tracking
✅ Atomic archive creation - Complete success or complete cleanup
✅ Power failure protection - Data integrity across interruptions
✅ Memory exhaustion protection - Graceful error handling
✅ Concurrent access safety - Files remain accessible during compression
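One common way to get the "atomic archive creation" behaviour listed above is to write the ZIP to a temporary file and rename it into place only on success; a sketch of that general pattern, not necessarily the tool's exact mechanism:

```python
import os
import tempfile
import zipfile


def write_archive_atomically(final_path, files):
    """Write a ZIP to a temp file in the target directory, then rename it into place.

    `files` is a list of (source_path, arcname) pairs.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".zip.part", dir=os.path.dirname(final_path))
    os.close(fd)
    try:
        with zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for src_path, arcname in files:
                zf.write(src_path, arcname)
        os.replace(tmp_path, final_path)  # atomic rename on the same filesystem
    except BaseException:
        os.remove(tmp_path)  # complete cleanup: never leave a partial archive behind
        raise
```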
For high-value datasets, always run the critical safety tests first:
```bash
python run_tests.py --critical-safety --verbose
```
This project is open source and available under the MIT License.
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
If you encounter any issues or have questions, please create an issue in the repository.