MinHash implementation for Jaccard Distance calculation between genome sequence pairs of Streptococcus pneumonia
Murmurhash3 package: (https://github.com/hajimes/mmh3)
- It has a set of fast and robust hash functions
- Non-cryptographic hash function suitable for general hash-based lookup
- Produces a 32-bit hash value
- Widely used in bioinformatics and other fields for hashing string sequences due to its efficiency, low collision rates, and good distribution properties
- Highly optimized for speed and is faster than many other functions (crucial when dealing with large datasets common in bioinformatics, where millions of sequences (like k-mers in genomics) need to be hashed quickly)
- The low collision rate makes it suitable for indexing and partitioning DNA sequence data in bioinformatics, where multiple datasets and sequences need to be handled quickly and without collisions
Sequences: R6, TIGR4, 14412_3#82.contigs_velvet, 14412_3#84.contigs_velvet Note: The contigs have been stitched together to create two new full sequences. This makes all the sequence lengths comparable and suitable for Jaccard distance computation.
- J(A, B) = 1 - (|A ∩ B| / |A ∪ B|); Jaccard Distance, where A and B are the two sets to compare
- Validate the hash function with test k-mers (to check if a kmer is getting mapped to the same integer in every run of the code)
- Comparison of MinHash and Full Jaccard distances
- Create a neighbour-joining tree for genome sequence pairs based on MinHash Jaccard Distances
- Create a neighbour-joining tree for genome sequence pairs based on Full Jaccard Distances
- Sketch size variation - The sketch size variable (STEP 4, Line 33) can be altered and the code can be run multiple times to check differences in the Jaccard Distances