EMBL_Hinxton_Coding_Challenge

A MinHash implementation for estimating the Jaccard distance between pairs of Streptococcus pneumoniae genome sequences.

Sections:

STEP 1: READ FASTA FILES AND CREATE K-MER FREQUENCY DICTIONARIES (K = 14)

MurmurHash3 package (mmh3): https://github.com/hajimes/mmh3

Provides a set of fast, robust hash functions
Non-cryptographic, so it is suited to general hash-based lookup rather than security
Produces a 32-bit hash value
Widely used in bioinformatics and other fields for hashing string sequences because of its efficiency, low collision rate, and good distribution properties
Highly optimized for speed and faster than many other hash functions, which matters for the large datasets common in bioinformatics, where millions of sequences (such as k-mers in genomics) need to be hashed quickly
The low collision rate makes it suitable for indexing and partitioning DNA sequence data, where many datasets and sequences need to be handled quickly with few collisions

Sequences: R6, TIGR4, 14412_3#82.contigs_velvet, 14412_3#84.contigs_velvet

Note: The contigs in each of the two contigs_velvet files have been stitched together to create two new full-length sequences. This makes all four sequence lengths comparable and suitable for Jaccard distance computation.
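As a rough illustration of STEP 1, the sketch below reads a FASTA file and counts overlapping k-mers with K = 14. The function names `read_fasta` and `count_kmers` are hypothetical and are not taken from the notebook; skipping k-mers that contain characters other than A, C, G, T is also an assumption.

```python
from collections import Counter

K = 14  # k-mer length stated in the README
VALID_BASES = set("ACGT")


def read_fasta(path):
    """Return {record_id: sequence} for every record in a FASTA file."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:].split()
                name = header[0] if header else ""
                records[name] = []
            elif name is not None:
                records[name].append(line.upper())
    return {rec: "".join(parts) for rec, parts in records.items()}


def count_kmers(sequence, k=K):
    """Count every overlapping k-mer that consists only of A, C, G, T."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= VALID_BASES:  # assumption: skip k-mers with N or other symbols
            counts[kmer] += 1
    return counts
```

To mirror the stitched contigs described in the note above, the record sequences can be joined before counting, for example `count_kmers("".join(read_fasta(path).values()))`.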

STEP 2: FULL JACCARD DISTANCE CALCULATION BETWEEN INPUT DICTIONARY PAIRS

Jaccard distance: J(A, B) = 1 - (|A ∩ B| / |A ∪ B|), where A and B are the two sets being compared
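The formula can be sketched directly on the k-mer dictionaries from STEP 1, using only their keys as the sets A and B. The function name `jaccard_distance` is hypothetical, and returning 0.0 for two empty inputs is a convention chosen here, not something the README specifies.

```python
def jaccard_distance(kmers_a, kmers_b):
    """Jaccard distance 1 - |A ∩ B| / |A ∪ B| between two k-mer collections.

    Dictionaries are accepted as well as sets; only the keys are used,
    so k-mer frequencies do not affect the result.
    """
    set_a, set_b = set(kmers_a), set(kmers_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumed convention: two empty sets are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)
```

For example, two dictionaries that share one k-mer out of three distinct k-mers in total give a distance of 1 - 1/3.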

STEP 3: SAMPLE HASH FUNCTION CODE IMPLEMENTATION (MURMURHASH3)

Validate the hash function with test k-mers (to check that each k-mer is mapped to the same integer on every run of the code)
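A minimal sketch of such a check is shown below. It calls `mmh3.hash` with a fixed seed and `signed=False`; the helper names `hash_kmer` and `check_determinism` are hypothetical. If mmh3 is not installed, the sketch falls back to CRC-32 from the standard library purely so that it still runs; that fallback is a stand-in and is not MurmurHash3.

```python
try:
    import mmh3  # third-party package: pip install mmh3

    def hash_kmer(kmer, seed=0):
        """32-bit MurmurHash3 of a k-mer, returned as an unsigned integer."""
        return mmh3.hash(kmer, seed, signed=False)

except ImportError:
    import zlib

    def hash_kmer(kmer, seed=0):
        """Stand-in used only when mmh3 is missing: CRC-32, not MurmurHash3."""
        return zlib.crc32(kmer.encode("ascii"), seed) & 0xFFFFFFFF


def check_determinism(kmers, seed=0, repeats=3):
    """Return True if every k-mer hashes to the same value on each repeat."""
    reference = [hash_kmer(kmer, seed) for kmer in kmers]
    return all(
        [hash_kmer(kmer, seed) for kmer in kmers] == reference
        for _ in range(repeats)
    )


if __name__ == "__main__":
    test_kmers = ["ACGTACGTACGTAC", "TTTTTTTTTTTTTT", "GATTACAGATTACA"]
    for kmer in test_kmers:
        print(kmer, hash_kmer(kmer))
    print("stable within this run:", check_determinism(test_kmers))
```

This only checks stability within one process. To confirm stability across runs, the printed values can be compared between separate executions. Python's built-in `hash()` would fail that comparison for strings, because it is salted per process unless `PYTHONHASHSEED` is set.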

STEP 4: CREATING INPUT GENOME SEQUENCE SKETCHES BASED ON MURMURHASH3 IMPLEMENTATION (MINHASH)
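One common way to build a MinHash sketch is the bottom-s approach: hash every distinct k-mer once and keep the s smallest hash values. The sketch below assumes that variant; the notebook may instead use several hash functions or seeds. The names `hash_kmer` and `minhash_sketch`, the default sketch size of 1000, and the CRC-32 fallback (a stand-in that is not MurmurHash3) are all assumptions made for illustration.

```python
import heapq
import zlib

try:
    import mmh3  # third-party package: pip install mmh3
except ImportError:
    mmh3 = None


def hash_kmer(kmer, seed=0):
    """Unsigned 32-bit hash of a k-mer (MurmurHash3 when mmh3 is available)."""
    if mmh3 is not None:
        return mmh3.hash(kmer, seed, signed=False)
    # Stand-in so the sketch still runs without mmh3; this is CRC-32.
    return zlib.crc32(kmer.encode("ascii"), seed) & 0xFFFFFFFF


def minhash_sketch(kmers, sketch_size=1000, seed=0, hash_fn=hash_kmer):
    """Return the sketch_size smallest distinct hash values, sorted ascending."""
    hashes = {hash_fn(kmer, seed) for kmer in kmers}  # a set removes duplicates
    return heapq.nsmallest(sketch_size, hashes)
```

Passing a k-mer frequency dictionary works as well as a set, since iterating over a dictionary yields its keys. Both genomes must be sketched with the same seed and hash function for their sketches to be comparable.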

STEP 5: CALCULATE MINHASH JACCARD DISTANCES FROM SKETCHES
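Assuming bottom-s sketches (the s smallest hash values of each genome), the Jaccard distance can be estimated by merging the two sketches, keeping the smallest values of the merged set, and counting how many of them occur in both sketches. This is a sketch of that estimator under those assumptions, not necessarily the calculation used in the notebook; the function name `minhash_jaccard_distance` is hypothetical.

```python
def minhash_jaccard_distance(sketch_a, sketch_b):
    """Estimate the Jaccard distance from two bottom-s MinHash sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    if not set_a and not set_b:
        return 0.0  # assumed convention: two empty sketches are identical
    if not set_a or not set_b:
        return 1.0  # nothing can be shared with an empty sketch
    # Use the smaller sketch length so both sketches fully cover the range
    # of hash values being compared.
    size = min(len(set_a), len(set_b))
    merged = sorted(set_a | set_b)[:size]
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return 1.0 - shared / size
```

With sketches `[1, 2, 3, 4]` and `[2, 3, 5, 6]`, the four smallest merged values are 1, 2, 3, 4, of which 2 and 3 appear in both sketches, so the estimated distance is 1 - 2/4.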

BONUS

Comparison of MinHash and Full Jaccard distances
Create a neighbour-joining tree for genome sequence pairs based on MinHash Jaccard Distances
Create a neighbour-joining tree for genome sequence pairs based on Full Jaccard Distances
Sketch size variation: change the sketch size variable (STEP 4, Line 33) and rerun the code to see how the MinHash Jaccard distances change
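The comparison and the sketch size variation can be combined in one loop instead of rerunning the code by hand. The sketch below assumes a bottom-s MinHash and non-empty k-mer sets; all function names are hypothetical, the demo sequences are random synthetic data rather than the genomes in this repository, and the CRC-32 fallback is a stand-in that is not MurmurHash3.

```python
import heapq
import random
import zlib

try:
    import mmh3  # third-party package: pip install mmh3
except ImportError:
    mmh3 = None


def hash_kmer(kmer, seed=0):
    """Unsigned 32-bit hash (MurmurHash3 if mmh3 is installed, else CRC-32)."""
    if mmh3 is not None:
        return mmh3.hash(kmer, seed, signed=False)
    return zlib.crc32(kmer.encode("ascii"), seed) & 0xFFFFFFFF


def kmer_set(sequence, k=14):
    """Set of all overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def full_distance(set_a, set_b):
    """Exact Jaccard distance between two non-empty sets."""
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def minhash_distance(set_a, set_b, sketch_size, hash_fn=hash_kmer):
    """Bottom-s MinHash estimate of the Jaccard distance (non-empty sets)."""
    sk_a = set(heapq.nsmallest(sketch_size, {hash_fn(x) for x in set_a}))
    sk_b = set(heapq.nsmallest(sketch_size, {hash_fn(x) for x in set_b}))
    size = min(len(sk_a), len(sk_b))
    merged = heapq.nsmallest(size, sk_a | sk_b)
    shared = sum(1 for h in merged if h in sk_a and h in sk_b)
    return 1.0 - shared / size


def sweep(set_a, set_b, sizes, hash_fn=hash_kmer):
    """Map each sketch size to (MinHash distance, exact distance)."""
    exact = full_distance(set_a, set_b)
    return {s: (minhash_distance(set_a, set_b, s, hash_fn), exact) for s in sizes}


if __name__ == "__main__":
    rng = random.Random(0)
    base = "".join(rng.choice("ACGT") for _ in range(20000))
    # Synthetic relative: each base is redrawn at random with probability 0.01.
    mutated = "".join(rng.choice("ACGT") if rng.random() < 0.01 else c for c in base)
    results = sweep(kmer_set(base), kmer_set(mutated), [100, 1000, 5000])
    for size, (estimate, exact) in results.items():
        print(f"sketch size {size}: MinHash {estimate:.4f}, full {exact:.4f}")
```

Larger sketches generally track the full Jaccard distance more closely, at the cost of more memory and comparison time per genome pair.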

Repository files:

14412_3#82.contigs_velvet.fa
14412_3#84.contigs_velvet.fa
R6.fa
README.md
Rishabh_Kulkarni.ipynb
TIGR4.fa
