Recipe Advice for Annotating Unknown Genes from RNAseq Analysis Using Orthologs from Related Species #538

Open
MauriAndresMU1313 opened this issue Oct 2, 2024 · 0 comments


Hi everyone. This is not a bug report; I’m looking for advice on applying the recipes you provide in the "A few recipes" section of v2.1.5 to v2.1.12.

A little context: I ran an RNAseq analysis, and my output is the count-genes.tsv file. With reference genomes from RefSeq, the annotation is generally fine, and most genes are mapped to their corresponding gene symbol. However, some genes have no associated gene symbol and appear only as placeholders of the form LOCXXXXXXX (where each X is a digit).
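To show what I mean by "unknown genes", here is a minimal sketch of how I separate them from the named ones. It assumes the counts table has a header row and the gene symbol in the first column; that layout, the function name `split_by_symbol`, and the example rows are only illustrative, not the real contents of my file.

```python
import csv
import io
import re

# RefSeq placeholder symbols: "LOC" followed by digits only.
LOC_PATTERN = re.compile(r"^LOC\d+$")


def split_by_symbol(tsv_text, gene_column=0):
    """Return (named, unnamed) gene lists from a tab-separated counts table.

    Assumes a header row and the gene symbol in `gene_column`
    (a hypothetical layout, adjust to the real file).
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    next(reader, None)  # skip the header row
    named, unnamed = [], []
    for row in reader:
        if not row:
            continue
        gene = row[gene_column]
        if LOC_PATTERN.match(gene):
            unnamed.append(gene)
        else:
            named.append(gene)
    return named, unnamed


# Made-up rows for illustration only.
example = (
    "gene\tsample1\tsample2\n"
    "TP53\t120\t98\n"
    "LOC101234567\t15\t22\n"
    "LOCK1\t3\t4\n"
)
named, unnamed = split_by_symbol(example)
```

The pattern is anchored on both ends so that a symbol that merely starts with "LOC" (like the made-up `LOCK1` above) is not treated as a placeholder.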

I plan to find orthologs for those genes in related species to increase the number of annotated genes. With this in mind, I ran OrthoFinder with related (mammalian) species. In short, the output is one FASTA file per orthogroup, each containing the proteins assigned to that orthogroup. The sequences in these files have protein IDs in the format NP_XXXXXXXX or XP_XXXXXXXX. The plan now is to use eggNOG-mapper to obtain functional annotations for the proteins in each orthogroup.

Here’s where I’m a little confused about the next step: I will get the annotations, but I’m not sure how to trace each functional annotation back to its gene and determine whether that gene is a LOCXXXXXXX-type gene. For example, in the "A few recipes" section, you have options like:

  • Run search and annotation, using MMseqs after translating input CDS to proteins. Add the search and annotation results to the attributes of an existing GFF file (GFF decoration), using the GeneID field to link features from the GFF to the annotation results. (This seems the most appropriate to me because I can download GFF files from RefSeq-genomes.)
  • Run gene prediction using a genome to train Prodigal
  • Repeat the annotation step, using specific taxa as target and reporting the one-to-one orthologs found (This seems like another option, but I’m concerned that this depends on the number of species in the phylogeny since I don’t have too many.)

Do you think these ideas are realistic? Even once I have the functional annotation of the orthologs, I may need to trace each one back to its position on the chromosome and check whether the gene symbol there is unknown. Then, perhaps there is a parameter I can use to replace the gene symbol confidently with that of its ortholog.
In general, I’m looking for guidance on using eggnog-mapper for the workflow I have in mind. I’m posting here because some papers have used eggnog-mapper to map genes to their respective orthologs.
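
To make the tracing step concrete, here is a rough sketch of what I have in mind, not a tested pipeline. It assumes that CDS rows in the RefSeq GFF carry `gene=` and `protein_id=` attributes, and that I have already reduced the eggnog-mapper results to a dictionary from protein ID to preferred name, with "-" marking a missing name. The function names, the dictionary, and the example records are all made up for illustration.

```python
import re

# RefSeq placeholder symbols: "LOC" followed by digits only.
LOC_PATTERN = re.compile(r"^LOC\d+$")


def parse_attributes(column9):
    """Turn a GFF3 attribute string (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def protein_to_gene(gff_lines):
    """Map protein_id -> gene symbol, sequence ID and overall CDS span."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        protein_id, gene = attrs.get("protein_id"), attrs.get("gene")
        if not protein_id or not gene:
            continue
        start, end = int(cols[3]), int(cols[4])
        entry = mapping.setdefault(
            protein_id,
            {"gene": gene, "seqid": cols[0], "start": start, "end": end},
        )
        # A protein usually spans several CDS rows; keep the outer bounds.
        entry["start"] = min(entry["start"], start)
        entry["end"] = max(entry["end"], end)
    return mapping


def candidate_symbols(mapping, preferred_names):
    """Collect proposed names for LOC-type genes only."""
    candidates = {}
    for protein_id, name in preferred_names.items():
        entry = mapping.get(protein_id)
        if entry and LOC_PATTERN.match(entry["gene"]) and name not in ("", "-"):
            candidates.setdefault(entry["gene"], set()).add(name)
    return candidates


# Made-up GFF rows and annotation names, for illustration only.
gff = [
    "##gff-version 3",
    "NC_000001.1\tGnomon\tCDS\t100\t200\t.\t+\t0\t"
    "ID=cds-XP_000000001.1;Dbxref=GeneID:100000001;gene=LOC100000001;protein_id=XP_000000001.1",
    "NC_000001.1\tGnomon\tCDS\t300\t400\t.\t+\t2\t"
    "ID=cds-XP_000000001.1;Dbxref=GeneID:100000001;gene=LOC100000001;protein_id=XP_000000001.1",
    "NC_000001.1\tBestRefSeq\tCDS\t900\t990\t.\t-\t0\t"
    "ID=cds-NP_000000002.1;gene=TP53;protein_id=NP_000000002.1",
    "NC_000001.1\tGnomon\tCDS\t1500\t1600\t.\t+\t0\t"
    "ID=cds-XP_000000003.1;gene=LOC100000003;protein_id=XP_000000003.1",
]
preferred_names = {
    "XP_000000001.1": "ABCD1",
    "NP_000000002.1": "TP53",
    "XP_000000003.1": "-",
}
mapping = protein_to_gene(gff)
candidates = candidate_symbols(mapping, preferred_names)
```

Here only the LOC-type gene with a real proposed name ends up in `candidates`; genes that already have a symbol, and proteins with no preferred name, are left alone. Collecting names in a set per gene would also let me spot cases where different proteins of the same gene get conflicting names.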

Any comment, suggestion, or idea is more than welcome!
