Skip to content

MinHash implementation for Jaccard Distance calculation between genome sequence pairs of Streptococcus pneumoniae

Notifications You must be signed in to change notification settings

rdk004/EMBL_Hinxton_Coding_Challenge

Repository files navigation

EMBL_Hinxton_Coding_Challenge

MinHash implementation for Jaccard Distance calculation between genome sequence pairs of Streptococcus pneumonia

Sections:

STEP 1: READ FASTA FILES AND CREATE K-MER FREQUENCY DICTIONARIES (K = 14)

Murmurhash3 package: (https://github.com/hajimes/mmh3)

  • It has a set of fast and robust hash functions
  • Non-cryptographic hash function suitable for general hash-based lookup
  • Produces a 32-bit hash value
  • Widely used in bioinformatics and other fields for hashing string sequences due to its efficiency, low collision rates, and good distribution properties
  • Highly optimized for speed and is faster than many other functions (crucial when dealing with large datasets common in bioinformatics, where millions of sequences (like k-mers in genomics) need to be hashed quickly)
  • The low collision rate makes it suitable for indexing and partitioning DNA sequence data in bioinformatics, where multiple datasets and sequences need to be handled quickly and without collisions

Sequences: R6, TIGR4, 14412_3#82.contigs_velvet, 14412_3#84.contigs_velvet Note: The contigs have been stitched together to create two new full sequences. This makes all the sequence lengths comparable and suitable for Jaccard distance computation.

STEP 2: FULL JACCARD DISTANCE CALCULATION BETWEEN INPUT DICTIONARY PAIRS

  • J(A, B) = 1 - (|A ∩ B| / |A ∪ B|); Jaccard Distance, where A and B are the two sets to compare

STEP 3: SAMPLE HASH FUNCTION CODE IMPLEMENTATION (MURMURHASH3)

  • Validate the hash function with test k-mers (to check if a kmer is getting mapped to the same integer in every run of the code)

STEP 4: CREATING INPUT GENOME SEQUENCE SKETCHES BASED ON MURMURHASH3 IMPLEMENTATION (MINHASH)

STEP 5: CALCULATE MINHASH JACCARD DISTANCES FROM SKETCHES

BONUS

  • Comparison of MinHash and Full Jaccard distances
  • Create a neighbour-joining tree for genome sequence pairs based on MinHash Jaccard Distances
  • Create a neighbour-joining tree for genome sequence pairs based on Full Jaccard Distances
  • Sketch size variation - The sketch size variable (STEP 4, Line 33) can be altered and the code can be run multiple times to check differences in the Jaccard Distances

About

MinHash implementation for Jaccard Distance calculation between genome sequence pairs of Streptococcus pneumoniae

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published