This repository provides a comprehensive Nextstrain analysis of "your virus". You can choose to perform either a shorter run with specific proteins or a full genome run.
For those unfamiliar with Nextstrain or needing installation guidance, please refer to the Nextstrain documentation.
- Prerequisites
- Nextstrain Environment
- Repository Organization
- Usage Examples
- Ingest
- Acknowledgments
- Contact
Ensure you have the following installed:
- Python 3.8 or higher
- Micromamba or Conda
- Snakemake 7
- Nextstrain CLI
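A quick way to confirm that the prerequisites are on your PATH is a small check like the one below. This is a minimal sketch, not part of this repository: the executable names (`python3`, `snakemake`, `nextstrain`, and either `micromamba` or `conda`) are assumptions about a typical installation, and `find_missing_tools` is a hypothetical helper. It checks presence only, not versions.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED = ["python3", "snakemake", "nextstrain"]
# At least one of these package managers should be available.
PACKAGE_MANAGERS = ["micromamba", "conda"]


def find_missing_tools(required=REQUIRED, managers=PACKAGE_MANAGERS, which=shutil.which):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    missing = [tool for tool in required if which(tool) is None]
    # Only one package manager is needed, so report them as a group.
    if not any(which(manager) is not None for manager in managers):
        missing.append(" or ".join(managers))
    return missing


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All prerequisites found.")
```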
Install the Nextstrain environment by following these instructions.
- Clone the repository:
    git clone [email protected]:hodcroftlab/template_nextstrain.git
    cd template_nextstrain
- Install the Nextstrain environment:
    micromamba create -n nextstrain \
      --override-channels --strict-channel-priority \
      -c conda-forge -c bioconda --yes \
      augur auspice nextclade \
      snakemake=7 git ncbi-datasets-cli
    micromamba activate nextstrain
- Update/install additional dependencies:
    sudo apt-get update
    sudo apt-get install -y unzip
    micromamba install -c conda-forge -c bioconda csvtk seqkit tsv-utils ipdb entrez-direct
    micromamba install -c conda-forge fuzzywuzzy python-dotenv ipykernel
The data for this analysis is available from NCBI Virus. Instructions for downloading sequences are provided under Sequences.
This repository includes the following directories and files:
- scripts: Custom Python scripts called by the snakefile.
- snakefile: The entire computational pipeline, managed using Snakemake (see the Snakemake documentation).
- ingest: Contains Python scripts and the snakefile for automatic downloading of <your_virus> sequences and metadata.
- <protein_xy>: Sequences and configuration files for the specific protein_xy run.
- whole_genome: Sequences and configuration files for the whole genome run.
The config, protein_xy/config, and whole_genome/config directories contain necessary configuration files:
- config.yaml: Configuration file for setting parameters and options for the analysis
- colors.tsv: Color scheme
- geo_regions.tsv: Geographical locations
- lat_longs.tsv: Latitude and longitude data
- dropped_strains.txt: Accessions listed here are excluded during augur filter
- clades_genome.tsv: Manually defined clade labels for the Nextstrain tree (see the Nextstrain documentation)
- reference_sequence.gb: Reference sequence (add manually)
- auspice_config.json: Auspice configuration file; it must be present in all data folders
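Before starting a build, it can help to confirm that these files are in place. The following is a minimal sketch under stated assumptions: `missing_config_files` is a hypothetical helper, not part of this repository, and it assumes every listed file is expected in every config directory, which may not hold for your setup.

```python
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = [
    "config.yaml",
    "colors.tsv",
    "geo_regions.tsv",
    "lat_longs.tsv",
    "dropped_strains.txt",
    "clades_genome.tsv",
    "reference_sequence.gb",
    "auspice_config.json",
]


def missing_config_files(config_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from config_dir."""
    config_dir = Path(config_dir)
    return [name for name in expected if not (config_dir / name).is_file()]


if __name__ == "__main__":
    # Directory names as given in this README.
    for directory in ["config", "protein_xy/config", "whole_genome/config"]:
        missing = missing_config_files(directory)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{directory}: {status}")
```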
The reference sequence used is XYZ, accession number, sampled in 19XX.
Activate the Nextstrain environment:
micromamba activate nextstrain
To perform a build, run:
snakemake --cores 9 all
For specific builds:
- protein_xy build:
snakemake auspice/<your_virus>_protein_xy.json --cores 9
- Whole genome build:
snakemake auspice/<your_virus>_whole-genome.json --cores 9
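The target file names above follow the pattern auspice/<your_virus>_<build>.json. If you script your builds, the command can be assembled as in this sketch. The helper names and the example virus name `my_virus` are hypothetical; only the `snakemake <target> --cores N` form comes from the commands above.

```python
def build_target(virus, build):
    """Return the Auspice JSON target for a build, e.g. 'whole-genome'."""
    return f"auspice/{virus}_{build}.json"


def snakemake_command(virus, build, cores=9):
    """Return the snakemake invocation as an argument list."""
    return ["snakemake", build_target(virus, build), "--cores", str(cores)]


# Example usage (assumes snakemake is installed and the environment is active):
#   import subprocess
#   subprocess.run(snakemake_command("my_virus", "protein_xy"), check=True)
```

Passing the command as a list rather than a single string avoids shell quoting issues if the virus name contains unusual characters.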
To visualize the build, use Auspice:
auspice view --datasetDir auspice
To run two visualizations simultaneously, you may need to set the port:
export PORT=4001
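If you start Auspice from a script, the port can be set per process instead of exporting it in your shell. This is a minimal sketch: `auspice_env` and `launch_auspice` are hypothetical helpers, and the only assumptions taken from this README are the `auspice view --datasetDir auspice` command and the `PORT` environment variable.

```python
import os
import subprocess


def auspice_env(port, base_env=None):
    """Return a copy of the environment with PORT set for Auspice."""
    env = dict(os.environ if base_env is None else base_env)
    env["PORT"] = str(port)
    return env


def launch_auspice(dataset_dir="auspice", port=4001):
    """Start 'auspice view' in the background on the given port."""
    return subprocess.Popen(
        ["auspice", "view", "--datasetDir", dataset_dir],
        env=auspice_env(port),
    )


# Example usage (assumes auspice is installed):
#   first = launch_auspice("auspice", port=4000)
#   second = launch_auspice("auspice", port=4001)
```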
For more information on how to run the ingest, please refer to the README in the ingest folder.
Sequences can be downloaded manually or automatically.
- Manual Download: Visit NCBI Virus, search for <your_virus> or TaxidXXXXXX, and download the sequences.
- Automated Download: The ingest functionality, included in the main snakefile, handles automatic downloading.
The ingest pipeline is based on the Nextstrain RSV ingest workflow. Running the ingest pipeline produces data/metadata.tsv and data/sequences.fasta.
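After an ingest run, a quick sanity check is to confirm that the two output files describe the same set of records. The sketch below is not part of the ingest workflow. It assumes the metadata has a column named `accession` and that each FASTA header begins with that same identifier; both are assumptions you may need to adjust.

```python
import csv


def fasta_ids(fasta_path):
    """Collect the first whitespace-delimited token of each FASTA header."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def metadata_ids(metadata_path, id_column="accession"):
    """Collect identifiers from the (assumed) id column of a TSV file."""
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[id_column] for row in reader}


def compare_ingest_outputs(
    metadata_path="data/metadata.tsv",
    fasta_path="data/sequences.fasta",
    id_column="accession",
):
    """Return (ids only in sequences, ids only in metadata), each sorted."""
    seq_ids = fasta_ids(fasta_path)
    meta_ids = metadata_ids(metadata_path, id_column)
    return sorted(seq_ids - meta_ids), sorted(meta_ids - seq_ids)
```

Two empty lists mean every sequence has a metadata row and every metadata row has a sequence.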
For questions or support, please contact [[email protected]].
Feel free to adjust the content according to your project's specifics.