KCGallagher/periodic-sampling

Periodic Sampling (COVID-19 Data)

This repository explores periodic trends in reported case and death data for multiple diseases. This work supports the paper 'Identification and Attribution of Weekly Periodic Biases in Global Epidemiological Time Series Data', currently available as a preprint on medRxiv.

We also provide a package of Bayesian inference and Gibbs sampling methods to explore these periodic trends in real or synthetic Covid-19 case data.

Periodic Data Trends

Covid-19 Data

We import Covid-19 case and death data from the Johns Hopkins database. This data is uploaded as separate .csv files on a daily basis, so routines in the analysis module are provided to generate location-specific files covering the history of the pandemic.

For example:

from analysis import generate_location_df

# Directory containing the daily report .csv files
input_dir = "COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/"
location_key = "England, United Kingdom"

# Combine the daily files into a single location-specific file
country_df = generate_location_df(input_dir, location_key)
country_df.to_csv("data/England_data.csv")

More detailed examples (along with cleaning procedures for the data) are given in data_trends.ipynb. Currently these procedures are not packaged into a separate method, but this may be updated in the future.

Further information about this data (such as collection methods) can be found in a dedicated README. Pre-generated example data files are also available.

Other Diseases

We also provide daily case data from the 1918 Spanish Flu and 2022 Haitian Cholera epidemics, in other_diseases.

Periodic Reporting Trends

In this data we typically observe a strong oscillatory trend, as depicted in both the case and death data from England, UK. The raw daily data is given in grey, with a 7-day moving average (as used in most publications) superimposed in colour.

UK Covid Data

There are consistent over- and under-reporting trends on particular weekdays across the duration of the pandemic. These may be quantified through a reporting factor, given by the ratio of observed cases on a given day to the 7-day average centred on that day. The distribution of reporting factors for each dataset is given below:

Weekday Bias Violin Plot
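The reporting factor defined above can be sketched in a few lines of pandas. This is a minimal illustration rather than the repository's own routine: the function name `reporting_factors` and the toy data are assumptions made for this example, and the 7-day average is taken as a centred rolling mean.

```python
import numpy as np
import pandas as pd


def reporting_factors(daily_counts: pd.Series) -> pd.Series:
    """Ratio of each day's count to the centred 7-day mean about that day."""
    # Centred window: the day itself plus three days either side.
    weekly_mean = daily_counts.rolling(window=7, center=True).mean()
    # Days without a full window, or with a zero mean, give NaN.
    return daily_counts / weekly_mean.replace(0, np.nan)


# Illustrative data only: a flat 100 cases/day with a fixed weekday multiplier.
dates = pd.date_range("2021-01-04", periods=28, freq="D")  # starts on a Monday
weekday_bias = np.array([0.5, 1.4, 1.2, 1.1, 1.1, 1.1, 0.6])
cases = pd.Series(100 * weekday_bias[dates.dayofweek.to_numpy()], index=dates)

factors = reporting_factors(cases)
# Mean reporting factor for each weekday (0 = Monday, ..., 6 = Sunday).
by_weekday = factors.groupby(factors.index.dayofweek).mean()
print(by_weekday.round(2))
```

Because the toy multipliers average to one over a week, the mean factor recovered for each weekday matches the multiplier that was applied. Real data would show a distribution of factors per weekday rather than a single value.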

A global analysis of these trends is further provided in global_pca.ipynb.

Origin of Bias

We further use a dataset from Public Health England (PHE) that distinguishes between the true date of death and the date to which the death was attributed on online reporting systems. From the analysis in periodicity_analysis.ipynb, we identify a weekly oscillation in the death data grouped by reporting date that is not present in the data grouped by true event date, suggesting that this weekly trend is fully attributable to biases in the reporting process.

Power Spectrum of UK PHE Data
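To illustrate how a power spectrum separates a weekly reporting artefact from the underlying epidemic curve, the sketch below compares a smooth synthetic series with the same series multiplied by a weekday bias. It uses only NumPy's FFT routines. The data, the bias values and the helper `power_spectrum` are assumptions for this example and are not taken from periodicity_analysis.ipynb.

```python
import numpy as np


def power_spectrum(series):
    """Return (frequencies in cycles/day, power) for a daily series."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()  # remove the zero-frequency component
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(x.size, d=1.0)  # one day between samples
    return freqs, power


# Illustrative data only: a smooth epidemic curve, with and without a
# multiplicative weekday bias applied.
t = np.arange(140)  # 20 whole weeks, so 1/7 cycles/day falls on an FFT bin
smooth = 1000 * np.exp(-(((t - 70) / 30.0) ** 2))
weekday_bias = np.array([0.5, 1.4, 1.2, 1.1, 1.1, 1.1, 0.6])
biased = smooth * weekday_bias[t % 7]

shares = {}
for label, series in [("event date", smooth), ("report date", biased)]:
    freqs, power = power_spectrum(series)
    weekly_bin = np.argmin(np.abs(freqs - 1 / 7))
    # Fraction of total (non-constant) power at a 7-day period.
    shares[label] = power[weekly_bin] / power.sum()
    print(f"{label}: share of power at a 7-day period = {shares[label]:.2e}")
```

In this toy setting the biased series carries a clear peak at a frequency of 1/7 cycles per day, while the smooth series has essentially no power there.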

$R_{t}$ Inference

Synthetic Data

To benchmark inference approaches against a known ground truth, we generate synthetic pandemic data using a renewal model framework. We also provide various reporter functions, which can return or save this data in .csv format and can apply various reporting biases to replicate the trends described above.

An example of this process is given below:

from synthetic_data import RenewalModel, Reporter

# Simulate synthetic case data with a renewal model
model = RenewalModel(R0=0.99)
model.simulate(T=200, N_0=500)

# Report the data without bias, then with a fixed bias for each weekday
rep = Reporter(model.case_data)
truth_df = rep.unbiased_report()
bias_df = rep.fixed_bias_report(bias=[0.5, 1.4, 1.2, 1.1, 1.1, 1.1, 0.6],
                                multinomial_dist=True)

This would generate the following data:

Synthetic Data Example

All functions have complete docstrings to record their functionality and expected arguments. Further detail is also given in the README for the periodic_sampling module.
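For readers unfamiliar with renewal models, the sketch below shows the general idea behind the synthetic data described above. It is a standalone illustration and not the code in the periodic_sampling module. The function names, the serial interval weights and the multinomial redistribution within each week are all assumptions chosen for this example, and `fixed_bias_report` may apply its bias differently.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def simulate_renewal(T, N_0, R, serial_interval):
    """Daily incidence with I_t ~ Poisson(R * sum_s w_s * I_{t-s})."""
    w = np.asarray(serial_interval, dtype=float)
    cases = np.zeros(T, dtype=int)
    cases[0] = N_0
    for t in range(1, T):
        lags = min(t, w.size)
        # Most recent day first, so that w[0] weights a one-day lag.
        recent = cases[t - lags:t][::-1]
        # Renormalise over the lags available early in the simulation.
        force = np.dot(w[:lags], recent) / w[:lags].sum()
        cases[t] = rng.poisson(R * force)
    return cases


def apply_weekday_bias(cases, bias):
    """Redistribute each week's cases across its days in proportion to bias."""
    bias = np.asarray(bias, dtype=float)
    reported = np.zeros_like(cases)
    for start in range(0, cases.size, 7):
        week = cases[start:start + 7]
        weights = week * bias[:week.size]
        if weights.sum() > 0:
            # The multinomial draw conserves the weekly total.
            reported[start:start + 7] = rng.multinomial(
                week.sum(), weights / weights.sum()
            )
    return reported


# Hypothetical serial interval weights, for illustration only.
serial_interval = [0.1, 0.2, 0.3, 0.2, 0.1, 0.05, 0.05]
truth = simulate_renewal(T=200, N_0=500, R=0.99, serial_interval=serial_interval)
reported = apply_weekday_bias(truth, [0.5, 1.4, 1.2, 1.1, 1.1, 1.1, 0.6])
print(truth[:7], reported[:7])
```

Redistributing within each week keeps the total number of cases unchanged, so any difference between `truth` and `reported` is purely a matter of which day a case is assigned to.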

Inference Methods

Both Metropolis-Hastings and Gibbs sampling methods are implemented for use in Bayesian inference. These have separate parameter and sampling classes, but a combined ('mixed') sampling method is also implemented to allow inference on multiple parameters of different types. We also utilise independent sampling for the discrete case values in inference of the ground truth time series.

This flexible implementation is applicable to a wide range of problems, with some examples from Ben Lambert's "A Student's Guide to Bayesian Statistics" given in exampler.ipynb. These methods are then applied to the inference of the true time series from the biased time series, under various assumptions described in a separate README.
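As a reminder of what a Metropolis-Hastings update involves, the sketch below samples the rate of a Poisson model with a Gamma prior using a random-walk proposal. It is a generic textbook-style example and does not use the parameter or sampling classes from this repository. The data, prior values and function names are assumptions for illustration. The conjugate Gamma posterior provides an exact mean to compare against.

```python
import numpy as np


def log_posterior(rate, counts, prior_shape=1.0, prior_rate=1.0):
    """Unnormalised log posterior: Poisson likelihood, Gamma prior on the rate."""
    if rate <= 0:
        return -np.inf  # the rate must be positive
    log_lik = np.sum(counts) * np.log(rate) - counts.size * rate
    log_prior = (prior_shape - 1) * np.log(rate) - prior_rate * rate
    return log_lik + log_prior


def metropolis_hastings(counts, n_samples=20000, step=0.5, start=1.0, seed=0):
    """Random-walk Metropolis-Hastings sampler for the Poisson rate."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    current, current_lp = start, log_posterior(start, counts)
    for i in range(n_samples):
        proposal = current + step * rng.normal()  # symmetric proposal
        proposal_lp = log_posterior(proposal, counts)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        samples[i] = current
    return samples


# Illustrative counts only.
counts = np.array([3, 5, 4, 6, 2, 5, 4, 3])
samples = metropolis_hastings(counts)[2000:]  # discard burn-in
# Conjugate result: the posterior is Gamma(1 + sum(counts), 1 + n).
exact_mean = (1.0 + counts.sum()) / (1.0 + counts.size)
print(f"MH mean {samples.mean():.2f}, exact mean {exact_mean:.2f}")
```

A Gibbs sampler replaces the accept/reject step with a direct draw from each parameter's conditional distribution, which is possible when that conditional has a known form, as the Gamma posterior does here.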

We also introduce a number of methods in Stan using a No-U-Turn Sampler, to handle larger populations without the computational limits imposed on our mixed sampler by the use of independent sampling on the time series. An example of predictions for the time series and reproduction number profile (based on the posterior mean) is given below:

Stan Example
