Skip to content

readgff fails with protein sequences #20

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
diegozea opened this issue Feb 28, 2025 · 2 comments
Open

readgff fails with protein sequences #20

diegozea opened this issue Feb 28, 2025 · 2 comments

Comments

@diegozea
Copy link
Member

readgff tries to read sequences as DNA sequences, therefore it fails when reading files containing protein sequences.

Input

GFF file containing a protein sequence downloaded from the ELM database: http://elm.eu.org/downloads.html
Link to the file: http://elm.eu.org/instances.gff?q=SRC_HUMAN

##gff-version 3
P12931	ELM	sequence_feature	530	534	.	.	.	ID=LIG_SH2_SFK_CTail_3
P12931	ELM	sequence_feature	252	259	.	.	.	ID=LIG_SH3_4
P12931	ELM	sequence_feature	72	78	.	.	.	ID=MOD_CDK_SPxK_1
P12931	ELM	sequence_feature	1	7	.	.	.	ID=MOD_NMyristoyl
P12931	ELM	sequence_feature	526	534	.	.	.	ID=MOD_TYR_CSK
##FASTA
>P12931
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL

Output

julia> p = readgff("elm_instances.gff")
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
ERROR: Cannot encode byte 0x50 (char 'P') at index 8 to BioSequences.DNAAlphabet{4}()
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::Base.CodeUnits{UInt8, String}, soff::Int64)
    @ BioSequences C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:164
  [3] encode_chunk
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:178 [inlined]
  [4] encode_chunks!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, startindex::Int64, src::Base.CodeUnits{UInt8, String}, soff::Int64, N::Int64)
    @ BioSequences C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:189
  [5] LongSequence
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\constructors.jl:97 [inlined]
  [6] LongSequence
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\constructors.jl:85 [inlined]
  [7] parsechromosome!(input::TranscodingStreams.NoopStream{IOStream}, record::GenomicAnnotations.Record{Gene})
    @ GenomicAnnotations.GFF C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:133
  [8] tryread!
    @ C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:49 [inlined]
  [9] iterate(reader::GenomicAnnotations.GFF.Reader{TranscodingStreams.NoopStream{IOStream}}, nextone::GenomicAnnotations.Record{Gene})
    @ GenomicAnnotations.GFF C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:42
 [10] _collect(cont::UnitRange{Int64}, itr::GenomicAnnotations.GFF.Reader{TranscodingStreams.NoopStream{IOStream}}, ::Base.HasEltype, isz::Base.SizeUnknown)
    @ Base .\array.jl:727
 [11] collect
    @ .\array.jl:716 [inlined]
 [12] readgff(input::String)
    @ GenomicAnnotations C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\utils.jl:87
 [13] top-level scope
    @ REPL[5]:1

Version

julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50 (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × 13th Gen Intel(R) Core(TM) i7-1365U
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)

(Downloads) pkg> st
Status `C:\Users\dz272503\Downloads\Project.toml`
  [4f8a0a0a] GenomicAnnotations v0.4.5
@diegozea diegozea changed the title readgff fails whith protein sequences readgff fails with protein sequences Feb 28, 2025
@cjprybol
Copy link

cjprybol commented Mar 3, 2025

I have never encountered a GFF file with protein sequences before and had always assumed it stood for "Genome Feature Format", but looks like the proper name is "General Feature Format" and the standard allows for any valid FASTA following the GFF table (link)

One possible solution would be to read some records from the file, infer the sequence type(s), and then process the remaining records using that inferred sequence type (DNA, RNA, AA), similar to how delimited file parsers might evaluate the first few lines of the file before finalizing the column types to parse all of the values into.

We could also skip the type-inference by reading the beginning of the file with the GFF3 parser and then once the FASTA section is reached, calling out to FASTX.jl and then parsing the entire FASTA into untyped records. My understanding is that the FASTX records are StringViews, so no need to decide on the sequence type.

If the GFF3 parser was split into a GFF3 parsing + a FASTX.jl callout, and if we can add type-inference to the BioSequences package (BioJulia/BioSequences.jl#269 or BioJulia/BioSequences.jl#241), would that be a suitable way to address this that everyone would be happy with?

Maybe worth discussing as a side issue, but while reviewing the packages for this I saw that this package and GFF3.jl implement distinct GFF3 readers and writers. It could be nice to also consolidate by having this package call out to GFF3.jl or deprecate that package and integrate it into this package? Not sure if this has already been discussed elsewhere - apologies if it has

@kdyrhage
Copy link
Member

kdyrhage commented Mar 3, 2025

This turned out to require more extensive changes than I initially expected, so for now I'm keeping it to an experimental branch (seqtype). It'll probably be merged in some shape or form eventually, as the change also allows the user to specify whether to use 2 or 4 bits per symbol for nucleotide sequences, which seems like a good option to have.

Currently the following syntax works:

records = readgff(filepath, AminoAcidAlphabet)
# alias for:
records = collect(open(GFF.Reader{AminoAcidAlphabet}, filepath))

# To read GenBank annotations with 2-bit nucleotide sequences:
records = readgbk(filepath, DNAAlphabet{2})

Some features may be broken, but feel free to try it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants