`readgff` fails with protein sequences #20

diegozea · 2025-02-28T10:15:52Z

readgff tries to read sequences as DNA sequences, therefore it fails when reading files containing protein sequences.

Input

GFF file containing a protein sequence downloaded from the ELM database: http://elm.eu.org/downloads.html
Link to the file: http://elm.eu.org/instances.gff?q=SRC_HUMAN

##gff-version 3
P12931	ELM	sequence_feature	530	534	.	.	.	ID=LIG_SH2_SFK_CTail_3
P12931	ELM	sequence_feature	252	259	.	.	.	ID=LIG_SH3_4
P12931	ELM	sequence_feature	72	78	.	.	.	ID=MOD_CDK_SPxK_1
P12931	ELM	sequence_feature	1	7	.	.	.	ID=MOD_NMyristoyl
P12931	ELM	sequence_feature	526	534	.	.	.	ID=MOD_TYR_CSK
##FASTA
>P12931
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL

Output

julia> p = readgff("elm_instances.gff")
MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAEPKLFGGFNSSDTVTSPQRAGPLAGGVTTFVALYDYESRTETDLSFKKGERLQIVNNTEGDWWLAHSLSTGQTGYIPSNYVAPSDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESETTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITSRTQFNSLQQLVAYYSKHADGLCHRLTTVCPTSKPQTQGLAKDAWEIPRESLRLEVKLGQGCFGEVWMGTWNGTTRVAIKTLKPGTMSPEAFLQEAQVMKKLRHEKLVQLYAVVSEEPIYIVTEYMSKGSLLDFLKGETGKYLRLPQLVDMAAQIASGMAYVERMNYVHRDLRAANILVGENLVCKVADFGLARLIEDNEYTARQGAKFPIKWTAPEAALYGRFTIKSDVWSFGILLTELTTKGRVPYPGMVNREVLDQVERGYRMPCPPECPESLHDLMCQCWRKEPEERPTFEYLQAFLEDYFTSTEPQYQPGENL
ERROR: Cannot encode byte 0x50 (char 'P') at index 8 to BioSequences.DNAAlphabet{4}()
Stacktrace:
  [1] error(s::String)
    @ Base .\error.jl:35
  [2] throw_encode_error(A::BioSequences.DNAAlphabet{4}, src::Base.CodeUnits{UInt8, String}, soff::Int64)
    @ BioSequences C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:164
  [3] encode_chunk
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:178 [inlined]
  [4] encode_chunks!(dst::BioSequences.LongSequence{BioSequences.DNAAlphabet{4}}, startindex::Int64, src::Base.CodeUnits{UInt8, String}, soff::Int64, N::Int64)
    @ BioSequences C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\copying.jl:189
  [5] LongSequence
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\constructors.jl:97 [inlined]
  [6] LongSequence
    @ C:\Users\dz272503\.julia\packages\BioSequences\Mf23T\src\longsequences\constructors.jl:85 [inlined]
  [7] parsechromosome!(input::TranscodingStreams.NoopStream{IOStream}, record::GenomicAnnotations.Record{Gene})
    @ GenomicAnnotations.GFF C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:133
  [8] tryread!
    @ C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:49 [inlined]
  [9] iterate(reader::GenomicAnnotations.GFF.Reader{TranscodingStreams.NoopStream{IOStream}}, nextone::GenomicAnnotations.Record{Gene})
    @ GenomicAnnotations.GFF C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\GFF\reader.jl:42
 [10] _collect(cont::UnitRange{Int64}, itr::GenomicAnnotations.GFF.Reader{TranscodingStreams.NoopStream{IOStream}}, ::Base.HasEltype, isz::Base.SizeUnknown)
    @ Base .\array.jl:727
 [11] collect
    @ .\array.jl:716 [inlined]
 [12] readgff(input::String)
    @ GenomicAnnotations C:\Users\dz272503\.julia\packages\GenomicAnnotations\37yeV\src\utils.jl:87
 [13] top-level scope
    @ REPL[5]:1

Version

julia> versioninfo()
Julia Version 1.11.3
Commit d63adeda50 (2025-01-21 19:42 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × 13th Gen Intel(R) Core(TM) i7-1365U
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, goldmont)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)

(Downloads) pkg> st
Status `C:\Users\dz272503\Downloads\Project.toml`
  [4f8a0a0a] GenomicAnnotations v0.4.5

The text was updated successfully, but these errors were encountered:

cjprybol · 2025-03-03T15:47:47Z

I have never encountered a GFF file with protein sequences before and had always assumed it stood for "Genome Feature Format", but looks like the proper name is "General Feature Format" and the standard allows for any valid FASTA following the GFF table (link)

One possible solution would be to read some records from the file, infer the sequence type(s), and then process the remaining records using that inferred sequence type (DNA, RNA, AA), similar to how delimited file parsers might evaluate the first few lines of the file before finalizing the column types to parse all of the values into.

We could also skip the type-inference by reading the beginning of the file with the GFF3 parser and then once the FASTA section is reached, calling out to FASTX.jl and then parsing the entire FASTA into untyped records. My understanding is that the FASTX records are StringViews, so no need to decide on the sequence type.

If the GFF3 parser was split into a GFF3 parsing + a FASTX.jl callout, and if we can add type-inference to the BioSequences package (BioJulia/BioSequences.jl#269 or BioJulia/BioSequences.jl#241), would that be a suitable way to address this that everyone would be happy with?

Maybe worth discussing as a side issue, but while reviewing the packages for this I saw that this package and GFF3.jl implement distinct GFF3 readers and writers. It could be nice to also consolidate by having this package call out to GFF3.jl or deprecate that package and integrate it into this package? Not sure if this has already been discussed elsewhere - apologies if it has

kdyrhage · 2025-03-03T16:10:24Z

This turned out to require more extensive changes than I initially expected, so for now I'm keeping it to an experimental branch (seqtype). It'll probably be merged in some shape or form eventually, as the change also allows the user to specify whether to use 2 or 4 bits per symbol for nucleotide sequences, which seems like a good option to have.

Currently the following syntax works:

records = readgff(filepath, AminoAcidAlphabet)
# alias for:
records = collect(open(GFF.Reader{AminoAcidAlphabet}, filepath))

# To read GenBank annotations with 2-bit nucleotide sequences:
records = readgbk(filepath, DNAAlphabet{2})

Some features may be broken, but feel free to try it out.

diegozea changed the title ~~readgff fails whith protein sequences~~ readgff fails with protein sequences Feb 28, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`readgff` fails with protein sequences #20

`readgff` fails with protein sequences #20

diegozea commented Feb 28, 2025

cjprybol commented Mar 3, 2025

Uh oh!

kdyrhage commented Mar 3, 2025

Uh oh!

readgff fails with protein sequences #20

readgff fails with protein sequences #20

Comments

diegozea commented Feb 28, 2025

Input

Output

Version

cjprybol commented Mar 3, 2025

Uh oh!

kdyrhage commented Mar 3, 2025

Uh oh!

`readgff` fails with protein sequences #20

`readgff` fails with protein sequences #20