This repository provides a configurable data processing pipeline for the VQM24 (Vector Quantum Mechanical Property Prediction) dataset, designed for use with PyTorch Geometric (PyG). The pipeline covers the full workflow, from raw data acquisition to building torch_geometric.data.Data objects enriched with molecular features and properties.
- Features
- Quick Start
- Installation
- Usage
- Configuration
- Project Structure
- Examples
- Contributing
- Citation
- License
- Acknowledgements
- Automated Data Acquisition: Seamlessly downloads VQM24 dataset from Zenodo
- Memory-Efficient Processing: Chunked data processing for handling large datasets
- Flexible Filtering: Configurable pre-filtering based on atom counts and heavy atom types
- RDKit Integration: Converts raw molecular data (SMILES, coordinates, atomic numbers) to RDKit objects
- PyG Data Creation: Transforms molecules into torch_geometric.data.Data objects
- Rich Property Enrichment: Adds comprehensive molecular properties including:
  - Scalar targets: HOMO/LUMO energies, dipole moments, total energies
  - Node features: Atom types, partial charges
  - Graph properties: Eigenvalues, vibrational frequencies and modes
  - Derived properties: Atomization energies
- Vibrational Data Refinement: Cleans frequencies and modes, handles invalid entries
- Structural Feature Engineering: Configurable atom and bond-level features
- PyG Transformations: Built-in support for standard PyG pre-transforms
- Robust Error Handling: Comprehensive exception hierarchy for graceful error management
- Centralized Logging: Configurable logging with console and file output
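As a rough illustration of the vibrational data refinement step, the sketch below drops non-finite frequencies together with their corresponding modes. The function name clean_vibrations, the plain-list data layout, and the rule for what counts as an invalid entry are all assumptions made for illustration; the actual logic lives in data_refining.py and may differ.

```python
import math


def clean_vibrations(freqs, modes):
    """Keep only (frequency, mode) pairs whose frequency is a finite number.

    Hypothetical helper: `freqs` is a list of floats and `modes` is a list of
    per-atom displacement vectors, with one mode per frequency.
    """
    if len(freqs) != len(modes):
        raise ValueError("freqs and modes must have the same length")
    kept = [(f, m) for f, m in zip(freqs, modes) if math.isfinite(f)]
    clean_freqs = [f for f, _ in kept]
    clean_modes = [m for _, m in kept]
    return clean_freqs, clean_modes


# Toy input: the second frequency is NaN, so it is dropped with its mode.
freqs = [1200.5, float("nan"), 3100.0]
modes = [[[0.1, 0.0, 0.0]], [[0.0, 0.2, 0.0]], [[0.0, 0.0, 0.3]]]
clean_freqs, clean_modes = clean_vibrations(freqs, modes)
print(clean_freqs)
```

Filtering frequencies and modes in a single pass keeps the two lists aligned, which matters because each mode is only meaningful next to its own frequency.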
# Clone the repository
git clone https://github.com/shahram-boshra/vqm24-pytorch-geometric.git
cd vqm24-pytorch-geometric
# Create conda environment
conda env create -f environment.yml
conda activate shah_env
# Run the processing pipeline
python main.py
- Python 3.11+
- Conda or Miniconda
- Clone the repository:
  git clone https://github.com/shahram-boshra/vqm24-pytorch-geometric.git
  cd vqm24-pytorch-geometric
- Create conda environment:
  conda env create -f environment.yml
  conda activate shah_env
Alternative manual installation:
conda create -n shah_env python=3.11 \
    numpy pytorch cpuonly rdkit pyyaml scipy \
    torch-geometric pandas matplotlib tqdm requests \
    -c pytorch -c pyg -c conda-forge -c defaults
conda activate shah_env
Note: For GPU support, replace cpuonly with cudatoolkit=X.X matching your CUDA version. Recent PyTorch conda packages use pytorch-cuda=X.X from the nvidia channel instead.
# Build Docker image
docker build -t vqm24-processor .
# Run container
docker run -v $(pwd)/data:/app/data vqm24-processor
Run the complete processing pipeline:
python main.py
The pipeline will:
- Initialize logging and load configurations
- Download raw data (DFT_all.npz) to ~/Chem_Data/VQM24_PyG_Dataset/raw/
- Process data in chunks with filtering and feature engineering
- Save processed dataset as data.pt in ~/Chem_Data/VQM24_PyG_Dataset/processed/
- Perform integrity tests on the processed data
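The chunked processing step above can be sketched in a few lines. This is a minimal illustration, not the pipeline's implementation: the helper names iter_chunks and process_in_chunks, the toy records, and the filtering predicate are all hypothetical.

```python
def iter_chunks(n_items, chunk_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


def process_in_chunks(records, chunk_size, keep):
    """Apply the `keep` predicate chunk by chunk and collect the survivors."""
    processed = []
    for start, stop in iter_chunks(len(records), chunk_size):
        chunk = records[start:stop]  # only this slice is handled at a time
        processed.extend(r for r in chunk if keep(r))
    return processed


# Toy records: atom counts per molecule; keep molecules with at least 3 atoms.
atom_counts = [2, 5, 3, 1, 8, 4, 2]
chunk_bounds = list(iter_chunks(len(atom_counts), 3))
kept = process_in_chunks(atom_counts, chunk_size=3, keep=lambda n: n >= 3)
print(chunk_bounds)
print(kept)
```

In a real run the per-chunk work would be feature engineering and conversion to PyG objects rather than a simple predicate, but the index bookkeeping is the same.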
import torch
from torch_geometric.data import InMemoryDataset
from pathlib import Path
# Define dataset path
dataset_root = Path.home() / "Chem_Data" / "VQM24_PyG_Dataset"
class VQM24Dataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        # PyTorch >= 2.6 defaults to weights_only=True, which rejects pickled PyG objects
        self.data, self.slices = torch.load(self.processed_paths[0], weights_only=False)

    @property
    def raw_file_names(self):
        return ['DFT_all.npz']

    @property
    def processed_file_names(self):
        return ['data.pt']
# Load dataset
dataset = VQM24Dataset(root=str(dataset_root))
print(f"Loaded {len(dataset)} molecular graphs")
# Access sample data
sample = dataset[0]
print(f"Atoms: {sample.z.shape[0]}")
print(f"Features: {sample.x.shape if hasattr(sample, 'x') else 'None'}")
The config.yaml file controls all aspects of the processing pipeline:
# Global constants and conversions
global_constants:
  har2ev: 27.211386245988

# Atomic energies for atomization calculations
atomic_energies_hartree:
  H: -0.500607632585
  C: -37.8302333826
  # ... more elements

# Data properties to extract
data_properties_to_include:
  scalar_graph_targets:
    - homo_hartree
    - lumo_hartree
    - gap_hartree

# Filtering configuration
filter_config:
  max_atoms: 50
  min_atoms: 3
  heavy_atom_filter:
    mode: "include_only"
    atoms: ["C", "N", "O", "F"]

# Structural features
structural_features:
  atom_features:
    - degree
    - hybridization
    - formal_charge
  bond_features:
    - bond_type
    - bond_dir

# PyG transformations
transformations:
  - name: "OneHotDegree"
    kwargs:
      max_degree: 10
For detailed configuration options, see the commented config.yaml file.
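To show how the har2ev constant and the atomic energies from the sample configuration could combine into an atomization energy, here is a minimal sketch. The function name atomization_energy_ev and the sign convention (total energy minus the sum of isolated-atom energies) are assumptions; the total energy used in the example is made up and is not a VQM24 value.

```python
HAR2EV = 27.211386245988  # global_constants.har2ev from the sample config

# Only the two elements shown in the sample config; the others are elided there.
ATOMIC_ENERGIES_HARTREE = {"H": -0.500607632585, "C": -37.8302333826}


def atomization_energy_ev(total_energy_hartree, symbols):
    """Return E_total - sum(E_atom) in eV (assumed sign convention)."""
    atoms_sum = sum(ATOMIC_ENERGIES_HARTREE[s] for s in symbols)
    return (total_energy_hartree - atoms_sum) * HAR2EV


# Methane-like composition with a made-up total energy in Hartree.
e_total = -40.5
e_atom = atomization_energy_ev(e_total, ["C", "H", "H", "H", "H"])
print(f"{e_atom:.3f} eV")
```

An element missing from the table raises a KeyError in this sketch, which makes an incomplete atomic_energies_hartree section easy to spot.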
vqm24-pytorch-geometric/
├── config.py # Configuration management
├── config.yaml # Main configuration file
├── data_refining.py # Vibrational data cleaning
├── data_utils.py # Data validation utilities
├── exceptions.py # Custom exception classes
├── logging_config.py # Logging configuration
├── main.py # Main processing script
├── mol_conversion.py # Molecule conversion orchestration
├── mol_conversion_utils.py # RDKit and PyG utilities
├── molecule_filters.py # Pre-filtering logic
├── mol_structural_features.py # Structural feature extraction
├── property_enrichment.py # Property addition to PyG objects
├── vqm24_dataset.py # Main PyG dataset class
├── environment.yml # Conda environment
├── Dockerfile # Docker configuration
└── README.md # This file
# Modify config.yaml for custom filtering
filter_config:
  max_atoms: 30
  min_atoms: 5
  heavy_atom_filter:
    mode: "exclude"
    atoms: ["Br", "I"]  # Exclude bromine and iodine
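One way such a filter_config could be interpreted is sketched below. The function name passes_filter and the exact semantics are assumptions: atom-count bounds are taken as inclusive, "include_only" is read as "every heavy atom must be in the list", and "exclude" as "no heavy atom may be in the list". The real behaviour is defined in molecule_filters.py.

```python
def passes_filter(symbols, filter_config):
    """Return True if a molecule, given as element symbols, passes the filter."""
    n_atoms = len(symbols)
    if not (filter_config["min_atoms"] <= n_atoms <= filter_config["max_atoms"]):
        return False
    rule = filter_config.get("heavy_atom_filter")
    if rule is None:
        return True
    heavy = {s for s in symbols if s != "H"}  # heavy atoms = everything but H
    listed = set(rule["atoms"])
    if rule["mode"] == "include_only":
        return heavy <= listed
    if rule["mode"] == "exclude":
        return heavy.isdisjoint(listed)
    raise ValueError(f"unknown heavy_atom_filter mode: {rule['mode']!r}")


# Python mirror of the YAML example above.
config = {
    "max_atoms": 30,
    "min_atoms": 5,
    "heavy_atom_filter": {"mode": "exclude", "atoms": ["Br", "I"]},
}

ethanol = ["C", "C", "O"] + ["H"] * 6   # 9 atoms, no excluded elements
bromomethane = ["C", "Br"] + ["H"] * 3  # 5 atoms, contains Br
water = ["O", "H", "H"]                 # 3 atoms, below min_atoms

print(passes_filter(ethanol, config))
print(passes_filter(bromomethane, config))
print(passes_filter(water, config))
```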
# In mol_structural_features.py, add custom atom features
def get_custom_atom_feature(atom):
    """Custom feature: atom electronegativity"""
    electronegativity_map = {
        'H': 2.20, 'C': 2.55, 'N': 3.04, 'O': 3.44, 'F': 3.98
    }
    return electronegativity_map.get(atom.GetSymbol(), 0.0)
from torch_geometric.loader import DataLoader
# Create data loader for batch processing
dataset = VQM24Dataset(root="path/to/dataset")
loader = DataLoader(dataset, batch_size=32, shuffle=True)
for batch in loader:
    # Process batch of molecular graphs
    print(f"Batch size: {batch.num_graphs}")
    print(f"Total atoms: {batch.x.shape[0] if hasattr(batch, 'x') else 'N/A'}")
We welcome contributions! Please see our Contributing Guidelines for details.
# Fork and clone the repository
git clone https://github.com/shahram-boshra/vqm24-pytorch-geometric.git
# Create development environment
conda env create -f environment.yml
conda activate shah_env
# Install in development mode
pip install -e .
# Run tests
python -m pytest tests/
- Follow PEP 8 guidelines
- Use type hints where appropriate
- Add docstrings for all public functions
- Include unit tests for new features
If you use this processing pipeline or the VQM24 dataset in your research, please cite:
@article{li2024vqm24,
title={VQM24: A New Dataset for Virtual Quantum Mechanical Property Prediction},
author={Li, Xiaocheng and Guo, Yuzhi and Li, Jiacai and Cai, Jianfeng and Li, Minghao and Peng, Bo and Li, Jie and Yang, Fan and Li, Guangfu and Yang, Zeyi and Li, Jianan and Shao, Bin},
journal={arXiv preprint arXiv:2402.04631},
year={2024}
}
Paper: arXiv:2402.04631
This project is licensed under the MIT License - see the LICENSE file for details.
We thank the developers and maintainers of the following essential libraries:
- PyTorch & PyTorch Geometric - Deep learning and graph neural network frameworks
- RDKit - Cheminformatics toolkit for molecular processing
- NumPy & SciPy - Fundamental scientific computing
- Docker - Containerization platform
- Tqdm - Progress bar utilities
- 📖 Documentation: Check the inline code documentation and configuration comments
- 🐛 Issues: Report bugs and request features via GitHub Issues
- 💬 Discussions: Join our GitHub Discussions
- 📧 Contact: For research collaborations, contact [email protected]
Made with ❤️ for the molecular machine learning community