readgff fails with protein sequences #20
Comments
I have never encountered a GFF file with protein sequences before and had always assumed GFF stood for "Genome Feature Format", but it looks like the proper name is "General Feature Format", and the standard allows any valid FASTA to follow the GFF table.

One possible solution would be to read some records from the file, infer the sequence type(s), and then process the remaining records using the inferred type (DNA, RNA, AA), similar to how delimited-file parsers evaluate the first few lines of a file before finalizing the column types.

We could also skip type inference: read the beginning of the file with the GFF3 parser and, once the FASTA section is reached, call out to FASTX.jl to parse the entire FASTA into untyped records. My understanding is that FASTX records are StringViews, so there is no need to decide on the sequence type.

If the GFF3 parser were split into GFF3 parsing plus a FASTX.jl callout, and if we can add type inference to the BioSequences package (BioJulia/BioSequences.jl#269 or BioJulia/BioSequences.jl#241), would that be a suitable way to address this that everyone would be happy with?

Maybe worth discussing as a side issue, but while reviewing the packages for this I saw that this package and GFF3.jl implement distinct GFF3 readers and writers. It could be nice to consolidate, either by having this package call out to GFF3.jl or by deprecating that package and integrating it into this one. Not sure if this has already been discussed elsewhere; apologies if it has.
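To make the first proposal concrete, here is a rough sketch of what "read a few records and infer the type" could look like. It uses only Base Julia; `guess_seqtype`, `fasta_sequences`, and `infer_seqtype` are hypothetical names made up for illustration and are not part of this package, FASTX.jl, or BioSequences.jl. For simplicity it scans the whole stream rather than stopping after the first few records.

```julia
# Hypothetical helpers for illustration only; not an existing BioJulia API.
const DNA_CHARS = Set("ACGTN-")
const RNA_CHARS = Set("ACGUN-")

# Guess the type of one raw sequence from the characters it contains.
function guess_seqtype(seq::AbstractString)
    chars = Set(uppercase(seq))
    isempty(chars) && return :unknown
    issubset(chars, DNA_CHARS) && return :dna
    issubset(chars, RNA_CHARS) && return :rna
    return :aa
end

# Collect the raw sequences that follow the ##FASTA directive of a GFF3 stream.
function fasta_sequences(io::IO)
    seqs = String[]
    buf = IOBuffer()
    in_fasta = false
    started = false
    for line in eachline(io)
        if !in_fasta
            in_fasta = startswith(line, "##FASTA")
            continue
        end
        if startswith(line, ">")
            # A new header closes the previous record, if there was one.
            started && push!(seqs, String(take!(buf)))
            started = true
        elseif started
            print(buf, strip(line))
        end
    end
    started && push!(seqs, String(take!(buf)))
    return seqs
end

# Infer one type for the file from at most `nrecords` sequences.
# The most permissive guess wins: any protein-like record makes the file :aa.
function infer_seqtype(io::IO; nrecords::Integer = 5)
    guesses = [guess_seqtype(s) for s in Iterators.take(fasta_sequences(io), nrecords)]
    filter!(!=(:unknown), guesses)
    isempty(guesses) && return :unknown
    :aa in guesses && return :aa
    :rna in guesses && return :rna
    return :dna
end
```

Calling `infer_seqtype(open(filepath))` would return one of `:dna`, `:rna`, `:aa`, or `:unknown`. A heuristic like this can misfire: a short peptide made up only of A, C, G, and T would be read as DNA, so letting the user pass the alphabet explicitly still seems worthwhile.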
This turned out to require more extensive changes than I initially expected, so for now I'm keeping it on an experimental branch (seqtype). It'll probably be merged in some shape or form eventually, as the change also allows the user to specify whether to use 2 or 4 bits per symbol for nucleotide sequences, which seems like a good option to have. Currently the following syntax works:

records = readgff(filepath, AminoAcidAlphabet)
# alias for:
records = collect(open(GFF.Reader{AminoAcidAlphabet}, filepath))
# To read GenBank annotations with 2-bit nucleotide sequences:
records = readgbk(filepath, DNAAlphabet{2})

Some features may be broken, but feel free to try it out.
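For context on the 2-bit versus 4-bit option, the sketch below shows the general trade-off: two bits are enough for A, C, G, and T, but ambiguity codes such as N need a wider encoding. This is a toy illustration in Base Julia, not how BioSequences.jl stores sequences internally; `TWO_BIT`, `pack2bit`, and `storage_bytes` are hypothetical names, and the bit assignment is an arbitrary choice for the example.

```julia
# Toy 2-bit packing for illustration; the code table is an assumption,
# not the encoding used by BioSequences.jl.
const TWO_BIT = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)

# Pack a sequence into 64-bit words, 32 symbols per word.
function pack2bit(seq::AbstractString)
    words = zeros(UInt64, cld(length(seq), 32))
    for (i, c) in enumerate(uppercase(seq))
        haskey(TWO_BIT, c) ||
            throw(ArgumentError("symbol $c does not fit in 2 bits; a 4-bit alphabet is needed"))
        word, offset = divrem(i - 1, 32)
        words[word + 1] |= UInt64(TWO_BIT[c]) << (2 * offset)
    end
    return words
end

# Bytes needed to store `nsymbols` symbols at `bits` bits each, in whole 64-bit words.
storage_bytes(nsymbols::Integer, bits::Integer) = 8 * cld(nsymbols * bits, 64)
```

Under these assumptions the 2-bit form takes roughly half the memory of the 4-bit form, at the cost of rejecting sequences that contain ambiguous bases.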
readgff tries to read sequences as DNA, so it fails when reading files that contain protein sequences.
Input
GFF file containing a protein sequence downloaded from the ELM database: http://elm.eu.org/downloads.html
Link to the file: http://elm.eu.org/instances.gff?q=SRC_HUMAN
Output
Version