
Astrotime

Machine learning methods for irregularly spaced time series

Project Description

This project contains the implementation of a time-aware neural network (TAN) and workflows for testing its performance on the task of predicting the periods of the sinusoidal timeseries dataset (STD) provided by Brian Powell. Its performance on this dataset (and a 50%-reduced version) was compared with that of the baseline CNN (BCN), also provided by Brian Powell. The BCN operates directly on the timeseries values (without the time information). The TAN uses the same network as the BCN but operates on a weighted projection of the timeseries onto a set of sinusoidal basis functions, which enfolds both the value and time components.

When tested on the unmodified STD, the BCN achieved a mean absolute error (MAE) of 0.03 and the TAN achieved an MAE of 0.01. Because the STD is close to being regularly sampled, the BCN (which implicitly assumes regularly sampled data) performs reasonably well, and the addition of time information in the TAN yields a relatively small improvement. To compare the performance of these models on a (more) irregularly sampled dataset, we subsampled the STD by randomly removing 50% of the observations. On the sparse STD the TAN still achieved an MAE of 0.02, but the BCN's performance was greatly degraded, with an MAE of 0.25. These results verify that the TAN effectively uses the time information of the dataset, whereas the BCN operates on the shape of the value curve under the assumption of regularly sampled observations.

Analysis vs. Synthesis

  • This project implements two forms of wavelet transform: an analysis transform and a synthesis transform.
  • The analysis coefficients represent the projection of a signal onto a set of basis functions, implemented as a weighted inner product between the signal and the basis functions (evaluated at the time points).
  • The synthesis coefficients represent the optimal representation of the signal as a weighted sum of the basis functions (i.e. the minimum error projection).
  • If the basis functions are orthogonal, then the analysis and synthesis coefficients are the same (as in the FFT). However, when the time points are irregular, then the basis functions (evaluated at the time points) are never orthogonal, and additional computation is required to generate the synthesis coefficients.
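
For concreteness, here is a minimal numpy sketch of the analysis/synthesis distinction described above (illustrative only; the project's actual transforms and weighting scheme differ in detail):

    # Illustrative sketch: analysis vs. synthesis coefficients on an irregular grid.
    import numpy as np

    t = np.sort(np.random.uniform(0, 10, 200))   # irregular time points
    y = np.sin(2 * np.pi * 0.5 * t)              # signal values
    w = np.ones_like(t) / t.size                 # observation weights (uniform here)

    freqs = np.linspace(0.1, 2.0, 16)
    # Basis functions evaluated at the irregular time points: shape (nbasis, ntime)
    B = np.concatenate([np.cos(2 * np.pi * np.outer(freqs, t)),
                        np.sin(2 * np.pi * np.outer(freqs, t))])

    # Analysis: weighted inner products of the signal with each basis function.
    analysis = B @ (w * y)

    # Synthesis: coefficients of the minimum-error weighted reconstruction,
    # obtained from the normal equations G c = analysis. For an orthogonal
    # basis G is diagonal and the two coefficient sets coincide; on an
    # irregular grid it is not, so an extra solve is required.
    G = B @ (w[:, None] * B.T)                   # Gram matrix of the basis
    synthesis = np.linalg.lstsq(G, analysis, rcond=None)[0]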

Model Equations

  • There is a good summary of the equations implemented in this project in the appendix of Witt & Schumann (2005).
  • The wavelet synthesis transform generates two features described by equations A10 and A11.
  • The wavelet analysis transform generates three features by computing weighted scalar products (equation A3) between the signal values and the sinusoid basis functions described by equation A5.
  • Equation A7 shows the relationship between the analysis and synthesis coefficients.
  • Further mathematical detail can be found in Foster (1996).
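
For orientation, the weighted scalar product and sinusoid basis used in these references take roughly the following form (a hedged reconstruction from Foster (1996); consult the cited appendices for the exact expressions):

    \langle f, g \rangle = \frac{\sum_\alpha w_\alpha f(t_\alpha)\, g(t_\alpha)}{\sum_\alpha w_\alpha},
    \qquad w_\alpha = \exp\!\left( -c\, \omega^2 (t_\alpha - \tau)^2 \right)

where the basis functions are \phi_1(t) = 1, \phi_2(t) = \cos\omega(t - \tau), and \phi_3(t) = \sin\omega(t - \tau), with trial frequency \omega and time shift \tau.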

Conda environment

  • On Adapt, load the modules gcc/12.1.0 and nvidia/12.1.
  • If mamba is not available, install miniforge (or load the mamba module).
  • Execute the following to set up a conda environment for astrotime:

Torch Environment (Current)

>   * mamba create -n astrotime.pt ninja python=3.10
>   * mamba activate astrotime.pt
>   * pip install torch jupyterlab==4.0.13 ipywidgets==7.8.4 cuda-python jupyterlab_widgets ipykernel==6.29 ipympl ipython==8.26 xarray netCDF4 pygam wotan astropy statsmodels transitleastsquares scikit-learn hydra-core rich 
>   * pip install lightkurve --upgrade
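
After installation, a quick sanity check (not part of the original setup notes) confirms that the torch build can see the GPU:

>   * python -c "import torch; print(torch.__version__, torch.cuda.is_available())"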

Dataset Preparation

  • The project data directory on explore is: /explore/nobackup/projects/ilab/data/astrotime.
  • This project uses a baseline dataset of artificially generated sinusoids, downloadable from a SharePoint folder.
  • The raw dataset has been downloaded to explore at: {datadir}/sinusoids/npz.
  • The script workflow/npz2nc.py has been used to convert the .npz files to netCDF format (a sketch of this conversion appears below).
  • The netCDF files, which are used in this project's ML workflows, can be found at: {datadir}/sinusoids/nc.
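
The following is a hypothetical sketch of that conversion (the key names stored in the .npz files are assumptions, not taken from the actual script, and the output directory is assumed to exist):

    # Hypothetical sketch of workflow/npz2nc.py: convert .npz sinusoid files to netCDF.
    import glob
    import numpy as np
    import xarray as xr

    datadir = "/explore/nobackup/projects/ilab/data/astrotime"
    for npz_path in glob.glob(f"{datadir}/sinusoids/npz/*.npz"):
        data = np.load(npz_path)   # assumed keys: "t" (times), "y" (values), "period"
        ds = xr.Dataset(
            {"y": (("sinusoid", "obs"), data["y"])},
            coords={"t": (("sinusoid", "obs"), data["t"]),
                    "period": ("sinusoid", data["period"])},
        )
        ds.to_netcdf(npz_path.replace("/npz/", "/nc/").replace(".npz", ".nc"))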

Workflows

This project provides three ML workflows:

  • Baseline (workflow/train-baseline-cnn.py): This workflow runs the baseline CNN (developed by Brian Powell), which takes only the timeseries value data as input.
  • Wavelet Synthesis (workflow/wavelet-synthesis-cnn.py): This workflow runs the same baseline CNN operating on a weighted wavelet z-transform, which enfolds both the time and value data from the timeseries.
  • Wavelet Analysis (workflow/wavelet-analysis-cnn.py): This workflow runs the same baseline CNN operating on a projection of the timeseries onto a set of sinusoid basis functions, which enfolds both the time and value data from the timeseries.

The *_small versions execute the workflows on a subset (1/10) of the full training dataset. The workflows save checkpoint files at the end of each epoch. By default the model is initialized from any existing checkpoint file at the beginning of script execution. To execute the script with a new set of checkpoints (while keeping the old ones), create a new script with a different value of the version parameter (and a new hydra defaults yaml file with the same name in the config dir).
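
A minimal sketch of this resume-or-refresh checkpoint pattern, assuming PyTorch (illustrative, not the project's actual code):

    import os
    import torch

    def init_model_state(model, ckpt_path: str, refresh_state: bool):
        # Resume from an existing checkpoint unless a fresh start was requested.
        if os.path.exists(ckpt_path) and not refresh_state:
            model.load_state_dict(torch.load(ckpt_path))
        return model

    # At the end of each epoch the workflow would save the current weights:
    # torch.save(model.state_dict(), ckpt_path)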

Configuration

The workflows are configured using hydra.

  • All hydra yaml configuration files are found under the config directory.
  • The workflow configurations can be modified at runtime as supported by hydra.
  • For example, the following command runs the baseline workflow on gpu 3 with random initialization (i.e. ignoring & overwriting any existing checkpoints):

    python workflow/train-baseline-cnn.py platform.gpu=3 train.refresh_state=True

  • To run validation (no training), execute:

    python workflow/train-baseline-cnn.py train.mode=valid platform.gpu=0
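
For orientation, here is a minimal sketch of a hydra entry point that would accept such overrides (illustrative; the config name and fields are assumptions, not the project's actual script):

    import hydra
    from omegaconf import DictConfig

    @hydra.main(version_base=None, config_path="config", config_name="baseline")
    def main(cfg: DictConfig) -> None:
        # Command-line overrides such as platform.gpu=3 are merged into cfg by hydra.
        print(cfg.platform.gpu, cfg.train.mode)

    if __name__ == "__main__":
        main()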

Configuration Parameters

Here is a partial list of configuration parameters with typical default values. These are set in the hydra yaml files and can be overridden on the command line:

   platform.project_root: "/explore/nobackup/projects/ilab/data/astrotime"  # Base directory for all saved files
   platform.gpu: 0                                                          # Index of the gpu to execute on
   platform.log_level: "info"                                               # Log level: typically debug or info
   data.source: sinusoid                                                    # Dataset type (currently only sinusoid is supported)
   data.dataset_root: "${platform.project_root}/sinusoids/nc"               # Location of the processed netCDF files
   data.dataset_files: "padded_sinusoids_*.nc"                              # Glob pattern for file names
   data.file_size: 1000                                                     # Number of sinusoids in a single nc file
   data.batch_size: 50                                                      # Batch size for training
   data.validation_fraction: 0.1                                            # Fraction of the training dataset used for validation
   data.dset_reduction: 1.0                                                 # Fraction of the full dataset used for training/validation
   transform.nfeatures: 1                                                   # Number of features passed to the network
   transform.sparsity: 0.0                                                  # Fraction of observations to drop (randomly)
   model.cnn_channels: 64                                                   # Number of channels in the first CNN layer
   model.dense_channels: 64                                                 # Number of channels in the dense layer
   model.out_channels: 1                                                    # Number of network output channels
   model.num_cnn_layers: 3                                                  # Number of CNN layers in a CNN block
   model.num_blocks: 7                                                      # Number of CNN blocks in the network
   model.pool_size: 2                                                       # Max pool size for every block
   model.stride: 1                                                          # Stride value for every CNN layer
   model.kernel_size: 3                                                     # Kernel size for every CNN layer
   model.cnn_expansion_factor: 4                                            # Increase in the number of channels from one CNN layer to the next
   train.optim: rms                                                         # Optimizer
   train.lr: 1e-3                                                           # Learning rate
   train.nepochs: 5000                                                      # Number of training epochs
   train.refresh_state: False                                               # Start from random weights (ignore & overwrite existing checkpoints)
   train.overwrite_log: True                                                # Start a new log file
   train.results_path: "${platform.project_root}/results"                   # Checkpoint and log files are saved under this directory
   train.weight_decay: 0.0                                                  # Weight decay parameter for the optimizer
   train.mode: train                                                        # Execution mode: 'train' or 'valid'
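
As a usage example (the values shown are illustrative), several of these parameters can be combined on one command line, e.g. to train on 10% of the dataset with half of the observations randomly dropped:

    python workflow/train-baseline-cnn.py data.dset_reduction=0.1 transform.sparsity=0.5 train.lr=1e-4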

Working from the container

In addition to the conda environment, the software can be run from a container. This project provides a Docker container that can be converted to Singularity (or another container engine) according to the user's needs. The instructions below are geared towards Singularity, since that is the default engine available in the NCCS supercomputing facility.

Container Download

To create a sandbox out of the container:

singularity build --sandbox /lscratch/$USER/container/astrotime docker://nasanccs/astrotime:latest

Note: /lscratch is only available on the gpu### nodes.

An already downloaded version of this sandbox is available under:

/explore/nobackup/projects/ilab/containers/astrotime-latest

Working from the container with a shell session

To get a shell session inside the container:

singularity shell -B $NOBACKUP,/explore/nobackup/projects,/explore/nobackup/people --nv /explore/nobackup/projects/ilab/containers/astrotime-latest

An example training run:

python /explore/nobackup/projects/ilab/ilab_testing/astrotime/workflow/baseline-cnn.py platform.project_root=/explore/nobackup/projects/ilab/ilab_testing/astrotime data.dataset_root=/explore/nobackup/projects/ilab/data/astrotime/sinusoids/nc

Expected training output files:

/explore/nobackup/projects/ilab/ilab_testing/astrotime/results/checkpoints/sinusoid_period.baseline.pt
/explore/nobackup/projects/ilab/ilab_testing/astrotime/results/checkpoints/sinusoid_period.baseline.backup.pt

An example validation run:

python /explore/nobackup/projects/ilab/ilab_testing/astrotime/workflow/baseline-cnn.py platform.project_root=/explore/nobackup/projects/ilab/ilab_testing/astrotime data.dataset_root=/explore/nobackup/projects/ilab/data/astrotime/sinusoids/nc train.mode=valid

Expected validation output:

      Loading checkpoint from /explore/nobackup/projects/ilab/ilab_testing/astrotime/results/checkpoints/sinusoid_period.baseline.pt: epoch=122, batch=0

SignalTrainer[TSet.Validation]: 2000 batches, 1 epochs, nelements = 100000, device=cuda:0
 Validation Loss: mean=0.021, median=0.021, range=(0.012 -> 0.043)
98.04user 8.85system 2:00.79elapsed 88%CPU (0avgtext+0avgdata 1080416maxresident)k
2059752inputs+1120outputs (1677major+582379minor)pagefaults 0swaps

Submitting a slurm job using the container (training example):

From gpulogin1:

sbatch --mem-per-cpu=10240 -G1 -c10 -t01:00:00 -J astrotime --wrap="time singularity exec -B $NOBACKUP,/explore/nobackup/projects,/explore/nobackup/people --nv /explore/nobackup/projects/ilab/containers/astrotime-latest python /explore/nobackup/projects/ilab/ilab_testing/astrotime/workflow/baseline-cnn.py platform.project_root=/explore/nobackup/projects/ilab/ilab_testing/astrotime data.dataset_root=/explore/nobackup/projects/ilab/data/astrotime/sinusoids/nc"

References

  • Foster, G. Wavelets for period analysis of unevenly sampled time series. The Astronomical Journal 112, 1709 (1996).
  • Witt, A. & Schumann, A. Y. Holocene climate variability on millennial scales recorded in Greenland ice cores. Nonlinear Processes in Geophysics 12, 345–352 (2005).
