Group 13's project for the WASP course: Scalable Data Science and Distributed Machine Learning

Time series anomaly detection using an ensemble of autoencoders

Authors: Kasper Bågmark (Chalmers University), Michele Di Sabato (Umeå University), Erik Jansson (Chalmers University), Peng Kuang (Lund University) and Selma Tabakovic (Chalmers University)

  • Project title: Time series anomaly detection using autoencoders
  • Description: This project addresses scalable anomaly detection in time series data, focusing on electrocardiograms (ECGs). The goal is to identify anomalous heartbeats that deviate from normal patterns using autoencoders, neural networks that compress data into a latent representation and reconstruct it. Anomalous signals reconstruct poorly compared to normal ones, and this gap forms the basis for detection. Unlike traditional time-series models, the approach is fully data-driven and avoids explicit modeling of time-series dynamics. A distributed ensemble method ensures scalability for large datasets, using PySpark and TorchDistributor to train models across multiple nodes; this setup also supports privacy preservation by keeping data local to the healthcare institutions that hold it. In the experiments, a single ECG dataset is partitioned and training is distributed across CPU cores. The distribution of reconstruction losses then allows threshold-based classification of anomalies (minimal code sketches of the autoencoder and the distributed training appear at the end of this README). Preliminary results demonstrate efficient hardware utilization and reliable anomaly detection, though challenges remain in handling unseen anomaly types and node imbalance. This work demonstrates the potential of distributed deep learning for large-scale medical anomaly detection, highlighting privacy preservation, scalability, and effective resource usage.
  • Links: The file report.md contains the main part of this project, where we explain the purpose, idea, method, scalable aspects and implementation. To run this project yourself, see docker.md for instructions on how to use the appropriate Docker image. The notebook model_pipeline.ipynb contains the main code for this project, implementing the methods and analysis explained in report.md. The file utils.py defines some of the key functions used by TorchDistributor, which are imported in model_pipeline.ipynb. To view the presentation, download the repository and open the presentation.html file. The live recording of our presentation is available here.

Preview of the presentation.

  • Authors' contributions:
    • Kasper Bågmark: Responsible for the ensemble idea and for creating the scalable setup with TorchDistributor. Investigated different options for scalability. Finalized the code and the repository.
    • Michele Di Sabato: Proposed the analysis approach and contributed to the implementation of the scalable training and inference pipelines. Authored parts of the report and most of the results.
    • Erik Jansson: Conducted a thorough literature review of methods for applying autoencoders to time series. Provided the finalized project idea, created the figures in the report, and integrated the content into coherent slides.
    • Peng Kuang: Organized most of the meetings and suggested branching the scalability work to divide tasks among team members. Co-investigated scalability, especially federated learning in general. Provided the Docker image and made sure that collaboration through the repository ran smoothly.
    • Selma Tabakovic: Carried out most of the initial investigation of autoencoders on different datasets and of different analysis techniques. Had main responsibility for the data; finalized the code and the repository.
    • Everyone: Contributed to the report and experimented with the setup, autoencoders, and time-series analysis in various ways before the project took its final form.
  • Acknowledgement: This project was partially supported by the Wallenberg AI, Autonomous Systems and Software Program, funded by the Knut and Alice Wallenberg Foundation, to fulfill the requirements to pass the WASP Graduate School Course Scalable Data Science and Distributed Machine Learning - ScaDaMaLe-WASP-UU-2024 at https://lamastex.github.io/ScaDaMaLe. Computing infrastructure for learning was supported by Databricks Inc.'s Community Edition. The course was industrially sponsored by Jim Dowling of Logical Clocks AB, Stockholm, Sweden; Reza Zadeh of Matroid Inc., Palo Alto, California, USA; and Andreas Hellander & Salman Toor of Scaleout Systems AB, Uppsala, Sweden.
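
To make the reconstruction-based detection described above concrete, here is a minimal, self-contained sketch: an autoencoder is trained on normal heartbeats only, and a beat is flagged as anomalous when its reconstruction error exceeds a threshold taken from the distribution of errors on normal data. The architecture, the 140-sample window length, and the 95th-percentile threshold are illustrative assumptions, not the exact choices made in model_pipeline.ipynb.

```python
import torch
import torch.nn as nn


class ECGAutoencoder(nn.Module):
    """Small fully connected autoencoder for fixed-length ECG windows."""

    def __init__(self, window: int = 140, latent: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(window, 64), nn.ReLU(),
            nn.Linear(64, latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, 64), nn.ReLU(),
            nn.Linear(64, window),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


def fit(model: nn.Module, normal_beats: torch.Tensor,
        epochs: int = 20, lr: float = 1e-3) -> nn.Module:
    """Train on normal heartbeats only, so that anomalies reconstruct poorly."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(normal_beats), normal_beats)
        loss.backward()
        opt.step()
    return model


@torch.no_grad()
def anomaly_scores(model: nn.Module, beats: torch.Tensor) -> torch.Tensor:
    """Per-beat reconstruction error (mean squared error), used as the anomaly score."""
    return ((model(beats) - beats) ** 2).mean(dim=1)


# Threshold-based classification: flag beats whose reconstruction error
# exceeds a high quantile of the errors on normal training data.
# (The 95th percentile is an assumed, illustrative choice.)
# threshold = torch.quantile(anomaly_scores(model, normal_beats), 0.95)
# is_anomaly = anomaly_scores(model, test_beats) > threshold
```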
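The distributed part can be sketched with PySpark's TorchDistributor, which launches one training process per ensemble member. The sketch below is hypothetical: load_shard is an assumed helper standing in for the real training functions in utils.py, and the number of processes is arbitrary; it only illustrates the pattern of training independent ensemble members on separate data shards.

```python
from pyspark.ml.torch.distributor import TorchDistributor


def train_one_member():
    # Hypothetical per-process entry point. TorchDistributor sets up the
    # usual torch.distributed environment variables, so each process can
    # read its rank and train one ensemble member on its own data shard.
    import os
    import torch
    rank = int(os.environ.get("RANK", "0"))
    beats = load_shard(rank)              # assumed helper: tensor of normal beats for this shard
    model = fit(ECGAutoencoder(), beats)  # reuses the sketch above
    torch.save(model.state_dict(), f"member_{rank}.pt")


# One process per CPU core; local_mode=True runs everything on the driver,
# matching the single-machine, multi-core experiments described above.
TorchDistributor(num_processes=4, local_mode=True, use_gpu=False).run(train_one_member)
```

At inference time, the members' reconstruction errors can be aggregated, for example by averaging scores or voting against per-member thresholds; see report.md for the scheme actually used.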
