VQM24 PyTorch Geometric Dataset Processing Pipeline

Overview

This repository provides a configurable data processing pipeline for the VQM24 (Vector-QM24) dataset, designed for use with PyTorch Geometric (PyG). The pipeline covers the whole workflow, from downloading the raw data to building ready-to-use torch_geometric.data.Data objects that carry molecular features and properties.

Features

🚀 Core Capabilities

Automated Data Acquisition: Downloads the VQM24 dataset from Zenodo
Memory-Efficient Processing: Chunked data processing for handling large datasets
Flexible Filtering: Configurable pre-filtering based on atom counts and heavy atom types
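
The chunked processing mentioned above can be pictured with a small sketch. The helper name `iter_chunks` and the chunk size are hypothetical and not taken from the repository; the point is only that yielding index ranges lets one slice of the raw arrays be converted at a time instead of the whole dataset.

```python
from typing import Iterator


def iter_chunks(n_items: int, chunk_size: int) -> Iterator[range]:
    """Yield consecutive index ranges that together cover 0..n_items-1."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_items, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield range(start, min(start + chunk_size, n_items))


# Hypothetical usage: 10 molecules processed 4 at a time.
chunks = list(iter_chunks(10, 4))
print([(c.start, c.stop) for c in chunks])  # [(0, 4), (4, 8), (8, 10)]
```

Each range can then be used to slice the raw arrays, so peak memory scales with the chunk size rather than with the dataset size.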

🔬 Molecular Data Processing

RDKit Integration: Converts raw molecular data (SMILES, coordinates, atomic numbers) to RDKit objects
PyG Data Creation: Transforms molecules into torch_geometric.data.Data objects
Rich Property Enrichment: Adds comprehensive molecular properties including:
- Scalar targets: HOMO/LUMO energies, dipole moments, total energies
- Node features: Atom types, partial charges
- Graph properties: Eigenvalues, vibrational frequencies and modes
- Derived properties: Atomization energies

⚙️ Advanced Features

Vibrational Data Refinement: Cleans frequencies and modes, handles invalid entries
Structural Feature Engineering: Configurable atom and bond-level features
PyG Transformations: Built-in support for standard PyG pre-transforms
Robust Error Handling: Comprehensive exception hierarchy for graceful error management
Centralized Logging: Configurable logging with console and file output
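
As a rough illustration of the vibrational refinement step, the sketch below drops invalid frequency entries together with their modes. It is not the code in data_refining.py: the function name `clean_frequencies` and the rule that non-finite or non-positive frequencies count as invalid are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def clean_frequencies(
    freqs: Sequence[float], modes: Sequence[Sequence[float]]
) -> Tuple[List[float], List[List[float]]]:
    """Keep only (frequency, mode) pairs whose frequency is finite and positive."""
    if len(freqs) != len(modes):
        raise ValueError("each frequency needs exactly one mode")
    kept = [
        (f, list(m))
        for f, m in zip(freqs, modes)
        if math.isfinite(f) and f > 0.0  # assumed validity rule
    ]
    return [f for f, _ in kept], [m for _, m in kept]


# Made-up values: one NaN and one negative frequency are removed.
freqs, modes = clean_frequencies(
    [1650.2, float("nan"), -35.0, 3100.7],
    [[0.1, 0.0], [0.0, 0.0], [0.2, 0.1], [0.0, 0.3]],
)
print(freqs)  # [1650.2, 3100.7]
```

Filtering frequencies and modes in a single pass keeps the two lists aligned, which matters because each mode is only meaningful next to its frequency.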

Quick Start

# Clone the repository
git clone https://github.com/shahram-boshra/vqm24_database_process.git
cd vqm24_database_process

# Create conda environment
conda env create -f environment.yml
conda activate shah_env

# Run the processing pipeline
python main.py

Installation

Prerequisites

Python 3.11+
Conda or Miniconda

Environment Setup

Clone the repository:

git clone git@github.com:shahram-boshra/vqm24_database_process.git
cd vqm24_database_process

Create conda environment:

conda env create -f environment.yml
conda activate shah_env

Alternative manual installation:

conda create -n shah_env python=3.11 \
  numpy pytorch cpuonly rdkit pyyaml scipy \
  pyg pandas matplotlib tqdm requests \
  -c pytorch -c pyg -c conda-forge -c defaults
conda activate shah_env

Note: For GPU support, replace cpuonly with pytorch-cuda=X.X (matching your CUDA version) and add -c nvidia to the channel list.

Docker Support

# Build Docker image
docker build -t vqm24-processor .

# Run container
docker run -v "$(pwd)/data:/app/data" vqm24-processor

Usage

Basic Usage

Run the complete processing pipeline:

python main.py

The pipeline will:

1. Initialize logging and load configurations
2. Download raw data (DFT_all.npz) to ~/Chem_Data/VQM24_PyG_Dataset/raw/
3. Process data in chunks with filtering and feature engineering
4. Save processed dataset as data.pt in ~/Chem_Data/VQM24_PyG_Dataset/processed/
5. Perform integrity tests on the processed data
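
The kind of integrity test run in the last step might look like the sketch below. The function `check_record` and the specific checks are assumptions made for illustration; the repository's own tests may verify different things.

```python
import math
from typing import List, Sequence


def check_record(z: Sequence[int], pos: Sequence[Sequence[float]]) -> List[str]:
    """Return a list of problems found in one molecule record (empty if none)."""
    problems = []
    if len(z) != len(pos):
        problems.append("atom count mismatch between z and pos")
    if any(n <= 0 for n in z):
        problems.append("non-positive atomic number")
    if any(len(p) != 3 for p in pos):
        problems.append("coordinate is not 3-dimensional")
    if any(not math.isfinite(c) for p in pos for c in p):
        problems.append("non-finite coordinate")
    return problems


# Illustrative water-like record (coordinates are made up).
ok = check_record([8, 1, 1], [[0.0, 0.0, 0.12], [0.0, 0.76, -0.47], [0.0, -0.76, -0.47]])
bad = check_record([8, 1], [[0.0, 0.0, float("nan")]])
print(ok)   # []
print(bad)  # ['atom count mismatch between z and pos', 'non-finite coordinate']
```

Returning a list of problems instead of raising on the first one makes it easy to log every issue found in a record.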

Loading Processed Dataset

import torch
from torch_geometric.data import InMemoryDataset
from pathlib import Path

# Define dataset path
dataset_root = Path.home() / "Chem_Data" / "VQM24_PyG_Dataset"

class VQM24Dataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        # weights_only=False lets PyTorch >= 2.6 unpickle the stored PyG objects
        self.data, self.slices = torch.load(self.processed_paths[0], weights_only=False)
    
    @property
    def raw_file_names(self):
        return ['DFT_all.npz']
    
    @property
    def processed_file_names(self):
        return ['data.pt']

# Load dataset
dataset = VQM24Dataset(root=str(dataset_root))
print(f"Loaded {len(dataset)} molecular graphs")

# Access sample data
sample = dataset[0]
print(f"Atoms: {sample.z.shape[0]}")
print(f"Features: {sample.x.shape if sample.x is not None else 'None'}")

Configuration

The config.yaml file controls all aspects of the processing pipeline:

Key Configuration Sections

# Global constants and conversions
global_constants:
  har2ev: 27.211386245988

# Atomic energies for atomization calculations
atomic_energies_hartree:
  H: -0.500607632585
  C: -37.8302333826
  # ... more elements

# Data properties to extract
data_properties_to_include:
  scalar_graph_targets:
    - homo_hartree
    - lumo_hartree
    - gap_hartree
  
# Filtering configuration
filter_config:
  max_atoms: 50
  min_atoms: 3
  heavy_atom_filter:
    mode: "include_only"
    atoms: ["C", "N", "O", "F"]

# Structural features
structural_features:
  atom_features:
    - degree
    - hybridization
    - formal_charge
  bond_features:
    - bond_type
    - bond_dir

# PyG transformations
transformations:
  - name: "OneHotDegree"
    kwargs:
      max_degree: 10

For detailed configuration options, see the commented config.yaml file.
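
To show how the `atomic_energies_hartree` and `har2ev` entries might be used together, here is a minimal sketch of an atomization-energy calculation. It assumes the convention "total energy minus the sum of free-atom energies"; the function name is hypothetical, the total energy is a made-up number, and the actual logic in property_enrichment.py may differ.

```python
from collections import Counter
from typing import Iterable, Mapping

HAR2EV = 27.211386245988  # global_constants.har2ev from config.yaml

# Subset of atomic_energies_hartree from config.yaml
ATOMIC_ENERGIES_HARTREE = {"H": -0.500607632585, "C": -37.8302333826}


def atomization_energy_hartree(
    total_energy: float,
    symbols: Iterable[str],
    atomic_energies: Mapping[str, float] = ATOMIC_ENERGIES_HARTREE,
) -> float:
    """Total energy minus the sum of free-atom energies (assumed sign convention)."""
    counts = Counter(symbols)
    # A missing element raises KeyError, which surfaces config gaps early.
    reference = sum(atomic_energies[s] * n for s, n in counts.items())
    return total_energy - reference


# Made-up total energy for a CH4-like composition, for illustration only.
e_ha = atomization_energy_hartree(-40.5, ["C", "H", "H", "H", "H"])
e_ev = e_ha * HAR2EV
print(f"{e_ha:.6f} Ha = {e_ev:.3f} eV")
```

With this convention a bound molecule has a negative atomization energy; some references flip the sign, so check which convention a downstream model expects.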

Project Structure

vqm24_database_process/
├── config.py                    # Configuration management
├── config.yaml                  # Main configuration file
├── data_refining.py             # Vibrational data cleaning
├── data_utils.py                # Data validation utilities
├── exceptions.py                # Custom exception classes
├── logging_config.py            # Logging configuration
├── main.py                      # Main processing script
├── mol_conversion.py            # Molecule conversion orchestration
├── mol_conversion_utils.py      # RDKit and PyG utilities
├── molecule_filters.py          # Pre-filtering logic
├── mol_structural_features.py   # Structural feature extraction
├── property_enrichment.py       # Property addition to PyG objects
├── vqm24_dataset.py             # Main PyG dataset class
├── environment.yml              # Conda environment
├── Dockerfile                   # Docker configuration
└── README.md                    # This file

Examples

Custom Filtering Example

# Modify config.yaml for custom filtering
filter_config:
  max_atoms: 30
  min_atoms: 5
  heavy_atom_filter:
    mode: "exclude"
    atoms: ["Br", "I"]  # Exclude bromine and iodine
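
How such a `filter_config` might be applied to a single molecule is sketched below. The function `passes_filter` is hypothetical, and two points are assumptions rather than documented behaviour: that `min_atoms`/`max_atoms` count all atoms including hydrogens, and that "include_only" means every heavy atom must be in the list while "exclude" means none may be.

```python
from typing import Iterable, Sequence


def passes_filter(
    symbols: Sequence[str],
    min_atoms: int,
    max_atoms: int,
    mode: str,
    atoms: Iterable[str],
) -> bool:
    """Return True if a molecule (given as element symbols) survives the filter."""
    if not (min_atoms <= len(symbols) <= max_atoms):
        return False
    heavy = {s for s in symbols if s != "H"}  # heavy atoms = everything but H
    listed = set(atoms)
    if mode == "include_only":
        return heavy <= listed            # all heavy atoms must be listed
    if mode == "exclude":
        return heavy.isdisjoint(listed)   # no heavy atom may be listed
    raise ValueError(f"unknown heavy_atom_filter mode: {mode!r}")


bromomethane = ["C", "H", "H", "H", "Br"]
ethanol = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
print(passes_filter(bromomethane, 5, 30, "exclude", ["Br", "I"]))  # False
print(passes_filter(ethanol, 5, 30, "exclude", ["Br", "I"]))       # True
```

Working on sets of heavy-atom symbols keeps both modes to a single set operation each.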

Adding Custom Features

# In mol_structural_features.py, add custom atom features
def get_custom_atom_feature(atom):
    """Custom feature: atom electronegativity"""
    electronegativity_map = {
        'H': 2.20, 'C': 2.55, 'N': 3.04, 'O': 3.44, 'F': 3.98
    }
    return electronegativity_map.get(atom.GetSymbol(), 0.0)

Batch Processing Example

from torch_geometric.loader import DataLoader

# Create data loader for batch processing
# (VQM24Dataset is the class defined under "Loading Processed Dataset")
dataset = VQM24Dataset(root="path/to/dataset")
loader = DataLoader(dataset, batch_size=32, shuffle=True)

for batch in loader:
    # Process batch of molecular graphs
    print(f"Batch size: {batch.num_graphs}")
    print(f"Total atoms: {batch.num_nodes}")

Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

Development Setup

# Fork and clone the repository
git clone https://github.com/shahram-boshra/vqm24_database_process.git
cd vqm24_database_process

# Create development environment
conda env create -f environment.yml
conda activate shah_env

# Install in development mode
pip install -e .

# Run tests
python -m pytest tests/

Code Style

Follow PEP 8 guidelines
Use type hints where appropriate
Add docstrings for all public functions
Include unit tests for new features

Citation

If you use this processing pipeline or the VQM24 dataset in your research, please cite:

@article{li2024vqm24,
  title={VQM24: A New Dataset for Virtual Quantum Mechanical Property Prediction},
  author={Li, Xiaocheng and Guo, Yuzhi and Li, Jiacai and Cai, Jianfeng and Li, Minghao and Peng, Bo and Li, Jie and Yang, Fan and Li, Guangfu and Yang, Zeyi and Li, Jianan and Shao, Bin},
  journal={arXiv preprint arXiv:2402.04631},
  year={2024}
}

Paper: arXiv:2402.04631

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

We thank the developers and maintainers of the following essential libraries:

PyTorch & PyTorch Geometric - Deep learning and graph neural network frameworks
RDKit - Cheminformatics toolkit for molecular processing
NumPy & SciPy - Fundamental scientific computing
Docker - Containerization platform
Tqdm - Progress bar utilities

Support

📖 Documentation: Check the inline code documentation and configuration comments
🐛 Issues: Report bugs and request features via GitHub Issues
💬 Discussions: Join our GitHub Discussions
📧 Contact: For research collaborations, contact [email protected]

Made with ❤️ for the molecular machine learning community
