Hi,

Using core variants from PyPGx, I successfully generated per-position and per-gene coverage for SNVs. However, I encountered an issue with SV genes. From what I understand, SVs are calculated based on .sav files, which are encrypted and not easy to interpret. Is there a way to create a BED file for SV genes without covering the entire gene, focusing only on relevant positions? If so, what approach or data source would you recommend?

Thanks
Replies: 1 comment 1 reply
That's a great question!

First of all, yes: all the SV detection models are saved as .sav files. These models were trained using a support vector machine (SVM)-based multiclass classifier, employing the one-vs-rest strategy for each gene and each GRCh build. You can read more about the approach here: Structural Variation Detection. If you're interested in training your own model, check out this command: train-cnv-caller.

When training these models, I provided per-base copy number data across the entire target region. This is important because we can't know in advance which genomic coordinates will be informative for detecting different SVs. For whole genome sequencing (WGS), this isn't a problem: by definition, you're generating sequencing data for the entire genome. However, it can be more challenging for targeted sequencing applications, such as whole exome sequencing (WES). That's why, in these cases, PyPGx performs imputation using forward filling to handle gaps (i.e., regions without sequencing data). You can read more about this here: predict-cnv.

Now, regarding your specific question about using per-base coverage for certain SNVs: yes, you can provide a BED file to the prepare-depth-of-coverage command and proceed from there. PyPGx will use forward filling to impute missing positions, which technically should meet your needs.

However, I do want to caution that this approach isn't ideal if you're starting with very few positions. Forward filling works best when there is still good coverage across the broader target region. If your coverage is too sparse, the imputed data may not be very informative or reliable.

That said, you're absolutely welcome to try your approach, and if you run into any issues, I'd be more than happy to help troubleshoot.

P.S. What's the source of your data? I'm assuming next-generation sequencing, but is it targeted sequencing, WES, or WGS? I'm also curious: why the focus on coverage at specific SNVs rather than across the full set of positions covered by your sequencing platform?
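To make the forward-filling idea concrete, here is a minimal sketch in plain Python. It is not PyPGx's actual implementation: the function name `forward_fill`, the toy coordinates, and the copy number values are all hypothetical, and the handling of positions before the first observation (left as `None` here) is an assumption made for illustration only.

```python
def forward_fill(region_start, region_end, observed):
    """Return per-base values for [region_start, region_end], forward filling gaps.

    observed: dict mapping 1-based position -> copy number estimate.
    Positions before the first observation stay None in this sketch,
    because there is no earlier value to carry forward.
    """
    filled = {}
    last = None
    for pos in range(region_start, region_end + 1):
        if pos in observed:
            last = observed[pos]  # a real measurement replaces the carried value
        filled[pos] = last        # otherwise reuse the most recent measurement
    return filled


# Hypothetical toy data: a 10-bp target region with only three covered positions.
observed = {102: 2.0, 105: 1.0, 109: 2.1}
filled = forward_fill(100, 109, observed)

for pos, copy_number in filled.items():
    print(pos, copy_number)
```

In this toy region, positions 106 through 108 all inherit the value measured at 105. With only a handful of covered positions, long stretches end up repeating a single measurement, which is why sparse input can make the imputed profile uninformative.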