SV coverage #153

evyamor · 2025-03-17T12:30:06Z


Hi,

Using core variants from PyPGx, I successfully generated per-position and per-gene coverage for SNVs. However, I encountered an issue with SV genes. From what I understand, SVs are calculated based on .sav files, which are encrypted and not easy to interpret.

Is there a way to create a BED file for SV genes without covering the entire gene, focusing only on relevant positions? If so, what approach or data source would you recommend?

Thanks

Answered by sbslee


sbslee · 2025-03-22T01:53:54Z

sbslee
Mar 22, 2025
Maintainer

That's a great question!

First of all, yes — all the SV detection models are saved as .sav files. These models were trained using a support vector machine (SVM)-based multiclass classifier, employing the one-vs-rest strategy for each gene and each GRCh build. You can read more about the approach here: Structural Variation Detection. If you're interested in training your own model, check out this command: train-cnv-caller.

When training these models, I provided per-base copy number data across the entire target region. This is important because we can't know in advance which genomic coordinates will be informative for detecting different SVs. For whole genome sequencing (WGS), this isn’t a problem — by definition, you're generating sequencing data for the entire genome.

However, this can be more challenging for targeted sequencing applications, such as whole exome sequencing (WES). That's why, in these cases, PyPGx performs imputation using forward filling to handle gaps (i.e., regions without sequencing data). You can read more about this here: predict-cnv.
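To make the forward-filling idea concrete, here is a minimal pure-Python sketch. It is not PyPGx's actual implementation; the function name `forward_fill` and the copy number values are made up for illustration, and `None` stands in for positions without sequencing data.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value to its left.

    Leading None entries stay None because there is nothing to copy from.
    """
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical per-base copy number; None marks positions with no data.
observed = [2.0, 2.1, None, None, 1.0, None, 1.1, None]
filled = forward_fill(observed)
print(filled)
```

Note that every gap simply repeats the last observed value, so a long gap carries a single measurement across many positions.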

Now, regarding your specific question about using per-base coverage for certain SNVs — yes, you can provide a BED file to the prepare-depth-of-coverage command, and proceed from there. PyPGx will use forward filling to impute missing positions, which technically should meet your needs.
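As a rough sketch of how such a BED file could be built from a list of positions: the helper names, the chromosome label, and the coordinates below are all hypothetical, and the only format assumption is the standard BED convention of 0-based, half-open intervals in tab-separated columns.

```python
def positions_to_bed_intervals(chrom, positions, pad=0):
    """Merge 1-based positions into sorted, non-overlapping BED intervals.

    Each position p becomes the 0-based, half-open interval
    [p - 1 - pad, p + pad). Overlapping or touching intervals are merged.
    """
    intervals = []
    for pos in sorted(set(positions)):
        start = max(0, pos - 1 - pad)
        end = pos + pad
        if intervals and start <= intervals[-1][2]:
            prev = intervals[-1]
            intervals[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            intervals.append((chrom, start, end))
    return intervals


def to_bed_text(intervals):
    """Render intervals as tab-separated BED lines."""
    return "".join(f"{c}\t{s}\t{e}\n" for c, s, e in intervals)


# Made-up 1-based positions, not real variant coordinates.
positions = [1005, 1001, 1002, 1010]
intervals = positions_to_bed_intervals("chr22", positions)
bed_text = to_bed_text(intervals)
print(bed_text, end="")
```

Increasing `pad` widens each interval, which leaves fewer and shorter gaps for forward filling to bridge.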

That said, I do want to caution that this approach isn’t ideal if you’re starting with very few positions. Forward filling works best when there is still good coverage across the broader target region. If your coverage is too sparse, the imputed data may not be very informative or reliable.

Still, you're absolutely welcome to try your approach, and if you run into any issues, I'd be more than happy to help troubleshoot.

P.S. What's the source of your data? I'm assuming next-generation sequencing — but is it targeted sequencing, WES, or WGS? I’m also curious: why the focus on coverage at specific SNVs rather than across the full set of positions covered by your sequencing platform?

1 reply

evyamor Mar 24, 2025
Author

I'm working with both WGS and WES.
My goal is to assess the coverage of core positions so I can tell whether the PyPGx ngs-pipeline algorithm assigned star alleles based on actual data or through forward filling, for example when a gene is labeled as Reference/Reference.
I hoped to achieve this by evaluating how well the core positions were covered in the first place. I was aiming to apply the same logic to structural variants (SVs), but I now understand that, for SVs, it's more appropriate to assess coverage across the full gene, from start position to end position, for each SV gene. That said, if there is a smaller region that is the only one the pipeline actually targets for these SV genes, I'd like to know so I can use it instead.
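The kind of check described here could look roughly like the sketch below. Everything in it is an assumption for illustration: the function name, the depth values, the positions, and especially the `min_depth=20` cutoff, which is an arbitrary placeholder rather than a PyPGx threshold.

```python
def summarize_core_coverage(depth_by_pos, core_positions, min_depth=20):
    """Report which core positions reach min_depth in the observed data.

    Positions absent from depth_by_pos are treated as depth 0, i.e. any
    value used for them downstream would have to come from imputation.
    """
    covered = [p for p in core_positions if depth_by_pos.get(p, 0) >= min_depth]
    covered_set = set(covered)
    missing = [p for p in core_positions if p not in covered_set]
    fraction = len(covered) / len(core_positions) if core_positions else 0.0
    return {"covered": covered, "missing": missing, "fraction": fraction}


# Made-up per-position read depths and core positions.
depth_by_pos = {1001: 35, 1002: 12, 1005: 48}
core_positions = [1001, 1002, 1005, 1010]
summary = summarize_core_coverage(depth_by_pos, core_positions)
print(summary)
```

A gene whose `missing` list is long is one where a Reference/Reference call deserves a closer look.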
