Question about using local database #6

Open
yamn29302 opened this issue Jun 5, 2025 · 3 comments

Comments

@yamn29302

Hello again!

I have some questions about how to work with the -lbdb option.
I suppose this option takes a protein database as input, so I downloaded the desired protein sequences from Ensembl. (I'm working on plant species, and I'm also wondering whether this source is suitable for this package.)

I concatenated the sequences into a single FASTA file using cat and piped it into makeblastdb. Everything worked fine up to this point.
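For readers following along, the preparation step described above could look roughly like this in Python. It is only a sketch under assumptions: the file names are hypothetical, the helper names are made up for illustration, and the flags shown (`-in`, `-dbtype prot`, `-out`) are the standard BLAST+ `makeblastdb` options for a protein database. The command is only built here, not executed.

```python
from pathlib import Path


def concatenate_fasta(inputs, output):
    """Concatenate FASTA files into one file, like `cat a.fa b.fa > all.fa`.

    Unlike plain `cat`, a newline is added after any file that does not
    end with one, so that headers never get glued to the previous sequence.
    Each file is read fully into memory, which is fine for a sketch.
    """
    output = Path(output)
    with output.open("wb") as out_handle:
        for path in inputs:
            data = Path(path).read_bytes()
            out_handle.write(data)
            if data and not data.endswith(b"\n"):
                out_handle.write(b"\n")
    return output


def makeblastdb_command(fasta, db_name):
    """Build (but do not run) a makeblastdb call for a protein database."""
    return ["makeblastdb", "-in", str(fasta), "-dbtype", "prot", "-out", str(db_name)]
```

To actually build the database, the returned list can be passed to `subprocess.run(..., check=True)`, provided BLAST+ is installed and on the PATH.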

Then I called uorf4u, but the following message popped up.

[Image: screenshot of the error message]

So here are my questions.

  1. Should the BLAST database be made up of protein sequences?
  2. Can the BLAST database be built from a source other than GenBank, such as Ensembl?
  3. The error appears at the "retrieve upstream sequences" step, so I wonder how the software does this in local mode. The local database is a protein database, so it contains no upstream information.

BTW, I think it would be nice to add more about this option to the "Example-driven guide" section of the documentation.

Thanks

Y.Z. Zhou

@art-egorov
Collaborator

art-egorov commented Jun 5, 2025

Hello!

Yes, I believe the issue is that upstream sequences can only be retrieved for proteins whose IDs are found in the NCBI database. This is because the process relies on identifying assemblies from the identical protein database before retrieving the upstream sequences. Yep, this should be stated more clearly in the documentation; I will update it.

Best

@yamn29302
Author

Hello!

Thanks for the explanation.

I wonder whether this package will be updated with a "full local" mode. Sometimes we may want to process unpublished sequences, or genomes that have no annotation on NCBI. Here is a possible way of doing it. Sorry, I'm more of a biologist than a programmer, so I can't help much with the coding.

  1. Use a different database instead of BLAST.
    I've used a package called cblaster. It searches for the presence of multiple target genes and defines them as a gene cluster, which also works for "above gene" tasks. It stores genomic data in a different way, likely related to SQL and DIAMOND. The inputs of the makedb command are gbk/gff files, which should contain the nucleotide sequences, the amino acid sequences and the position information. I guess this approach can handle both the mORF amino acid sequences and the uORF upstream sequences properly.
    https://cblaster.readthedocs.io/en/latest/guide/makedb_module.html

  2. After searching the database, retrieve N bases upstream of each hit. (N can be specified by the user, with a default value.)

  3. Perform the uorf4u algorithm, and return the result.
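Step 2 of the proposal above can be illustrated with a small Python sketch. This is not how uorf4u or cblaster do it; it only shows the idea, assuming the contig sequence and the hit coordinates (0-based, end-exclusive, on the forward strand) are already known from an annotation file. The function names and the default of 500 bases are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_sequence(contig, start, end, strand, n=500):
    """Return up to n bases upstream of a feature, read 5' to 3'.

    contig: nucleotide sequence of the whole contig or chromosome
    start, end: 0-based, end-exclusive feature coordinates on the forward strand
    strand: "+" or "-"
    n: number of upstream bases requested (arbitrary placeholder default)
    """
    if strand == "+":
        # Upstream lies to the left of the start; clip at the contig start.
        return contig[max(0, start - n):start]
    if strand == "-":
        # Upstream lies to the right of the end on the forward strand,
        # so take that slice and reverse-complement it.
        return reverse_complement(contig[end:end + n])
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")
```

For a gene on the minus strand the upstream region sits after the feature's end coordinate on the forward strand, which is why the two strands are handled separately. If the feature is close to a contig edge, fewer than n bases are returned.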

Best

Y.Z. Zhou

@art-egorov
Collaborator

Hello again!
Sorry for the delayed reply.
You can also use uorf4u if you don't expect to find your proteins in NCBI. In that case, you can provide a set of upstream sequences, and they will be used to check whether there are conserved uORFs.
Below is a screenshot from the documentation page (https://gca-vh-lab.github.io/uorf4u/ExampleDrivenGuide/cmd_guide/)

[Image: screenshot from the documentation page]
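If the upstream sequences are already in hand, preparing the input file might look like the sketch below. It only writes a plain FASTA file; the record IDs, file name and function name are hypothetical, and the exact uorf4u option for passing such a file should be taken from the guide linked above rather than from this example.

```python
def write_upstream_fasta(records, path, width=60):
    """Write {sequence_id: nucleotide_sequence} pairs to a FASTA file.

    Sequences are wrapped at `width` characters per line.
    """
    with open(path, "w") as handle:
        for seq_id, seq in records.items():
            handle.write(f">{seq_id}\n")
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")
    return path
```

Each record then corresponds to the region upstream of one homologous gene, so the conserved uORF search can run without any lookup in NCBI.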

Best
