Annie

ANNotation Information Extractor

Introduction

License

Annie is released under the MIT License.

Purpose

Annie reads genomic/transcriptomic annotation information from various sources -- IPRScan, SwissProt BLAST results, and soon Trinotate/Blast2GO -- and creates a 3-column table. Why would you want to create a 3-column table of annotations? To add functional annotations to a genome using the Genome Annotation Generator (GAG) or to a transcriptome using Transvestigator, of course!

Installation

Annie is written in Python 3. If you have Python 3 (not 2!) installed on your computer, you can download the zipped source code from the link at the top of the project page and extract it. That's it!

Citing Annie

If you'd like to flatter us, please cite us as follows:

Tate, R., Hall, B., DeRego, T., & Geib, S. (2014). Annie: the ANNotation Information Extractor (Version 1.0) [Software]. Available from http://genomeannotation.github.io/annie.

Usage

Currently, Annie handles two cases: results from InterProScan and results from a BLAST search against the UniProt SwissProt database.

InterProScan (ipr)

The ipr input file is the tab-separated (.tsv) output from InterProScan; make sure you have this format (not xml). A sample file is provided with the repository as sample_data/sample.ipr.

To convert this file to a three-column annotation table, issue the following command:

python3 annie.py -ipr sample_data/sample.ipr

The output file is a table of annotations. Each row in the ipr file contains the mrna id and the dbxref, as well as the GO and IPR values if they exist. For each row, we create up to 3 annotations where the feature id is the mrna id and the keys are "Dbxref", "InterPro" and "GO" with their respective values. Here is a sample input/output pair:

Input
m.98281    c95b0824ccd627403aa63f9e474649cc    7571    Pfam    PF00041 Fibronectin type III domain 3729    3812    4.7E-14 T   04-04-2014  IPR003961   Fibronectin, type III   GO:0005515
m.98281    c95b0824ccd627403aa63f9e474649cc    7571    Pfam    PF00041 Fibronectin type III domain 6484    6567    1.8E-12 T   04-04-2014  IPR003961   Fibronectin, type III   GO:0005515
m.42655    de17ff06d901d22dacc3f5c91510f33f    288 Pfam    PF12171 Zinc-finger double-stranded RNA-binding 45  69  1.3E-5  T   05-04-2014  IPR022755   Zinc finger, double-stranded RNA binding
m.82734    ae22a6fcc80b2ac982378f16ac022b3d    144 Pfam    PF14846 Domain of unknown function (DUF4485)    8   89  4.8E-17 T   05-04-2014  IPR027831   Domain of unknown function DUF4485
Output
m.42655    Dbxref  PFAM:PF12171
m.42655    InterPro    IPR022755
m.82734    Dbxref  PFAM:PF14846
m.82734    InterPro    IPR027831
m.98281    Dbxref  PFAM:PF00041
m.98281    GO  GO:0005515
m.98281    InterPro    IPR003961
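
To make the row-to-annotation mapping concrete, here is a minimal Python sketch. It is not Annie's actual source code: the function names are hypothetical, and the column positions (analysis in column 4, signature accession in column 5, InterPro accession in column 12, GO terms in column 14, pipe-separated when there are several) are assumptions based on the standard InterProScan TSV layout and the sample above. Because the sample output lists each annotation only once, in sorted order, the sketch collects the tuples in a set and sorts them.

```python
def ipr_row_to_annotations(line):
    """Turn one InterProScan TSV row into (feature_id, key, value) tuples."""
    cols = line.rstrip("\n").split("\t")
    feature_id, analysis, accession = cols[0], cols[3], cols[4]
    # The Dbxref is always present: upper-cased analysis name plus accession.
    annotations = [(feature_id, "Dbxref", f"{analysis.upper()}:{accession}")]
    # Column 12 holds the InterPro accession when there is one (assumption).
    if len(cols) > 11 and cols[11] not in ("", "-"):
        annotations.append((feature_id, "InterPro", cols[11]))
    # Column 14 holds GO terms, assumed pipe-separated when there are several.
    if len(cols) > 13 and cols[13] not in ("", "-"):
        for go_term in cols[13].split("|"):
            annotations.append((feature_id, "GO", go_term))
    return annotations


def ipr_to_table(lines):
    """Collect the annotations of all rows, dropping duplicates and sorting."""
    unique = set()
    for line in lines:
        if line.strip():
            unique.update(ipr_row_to_annotations(line))
    return sorted(unique)
```

Calling ipr_to_table on the lines of a tab-separated file such as sample_data/sample.ipr would yield the tuples to write out, one per line, as the three columns of the table.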

A few notes on the sample input/output:

Annie makes a few assumptions about your input files so if something doesn't work, check to see if your file meets these assumptions:

SwissProt (sprot)

SwissProt annotations require three input files: the BLAST output, a GFF3 file, and the SwissProt database in fasta format.

Examples of each of these files are provided as sample_data/sample.blastout, sample_data/sample.gff, and sample_data/sample.fasta. To create a three-column annotation table from the sample data, use this command:

python3 annie.py \
    -b sample_data/sample.blastout \
    -g sample_data/sample.gff \
    -db sample_data/sample.fasta

The sprot case returns two types of annotations. The first is a product annotation in the form <mrna_id> product <product>. We use the blast file to get the dbxref for the associated mrna and then we use the fasta file to take that dbxref and get the corresponding product. Here is a diagram of what goes on:

mrna_id ---blast_file---> dbxref ---fasta_file---> product
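
The chain in this diagram can be sketched in a few lines of Python. This is an illustration only, not Annie's implementation: the function names are hypothetical, and the sketch assumes that the first two whitespace-separated columns of the blast file are the mrna id and the subject id, that only the first hit per mrna is used, and that the product is the part of the fasta header between the sequence id and " OS=".

```python
def read_blast_hits(blast_lines):
    """Map each mrna id to the subject id (dbxref) of its first hit."""
    hits = {}
    for line in blast_lines:
        cols = line.split()
        if len(cols) >= 2:
            # setdefault keeps the first hit seen for each mrna (assumption).
            hits.setdefault(cols[0], cols[1])
    return hits


def read_fasta_products(fasta_lines):
    """Map each fasta id to the description text that precedes ' OS='."""
    products = {}
    for line in fasta_lines:
        if line.startswith(">"):
            dbxref, _, description = line[1:].strip().partition(" ")
            products[dbxref] = description.split(" OS=")[0]
    return products


def product_annotations(blast_lines, fasta_lines):
    """Follow mrna_id -> dbxref -> product and emit 3-column tuples."""
    hits = read_blast_hits(blast_lines)
    products = read_fasta_products(fasta_lines)
    return [
        (mrna_id, "product", products[dbxref])
        for mrna_id, dbxref in hits.items()
        if dbxref in products
    ]
```

Keeping the two lookups as plain dictionaries means the fasta file is read once, however many blast hits there are.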

Next, we have the name annotation. The name annotation has the form <parent_gene_id> name <parent_gene_name>. First, for each mrna in the blast results, we look it up in the gff file to get the corresponding parent gene id. That is how we obtain the <parent_gene_id> portion of the annotation. The process to obtain the <parent_gene_name> is similar to getting the product for the product annotation: we use the blast file to get the dbxref associated with the mrna, and then we use the fasta file to turn that dbxref into the gene name. Here is a diagram of how the name annotation is made:

mrna_id ---gff_file---> parent_gene_id

mrna_id ---blast_file---> dbxref ---fasta_file---> parent_gene_name
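
The two lookups in these diagrams can be sketched as follows. Again this is a hedged illustration rather than Annie's code: the function names are hypothetical, blast_hits is assumed to be a dictionary from mrna id to dbxref that has already been read from the blast file, the GFF3 file is assumed to be tab-separated with ID and Parent attributes in column 9, and the gene name is assumed to come from the GN= field of the fasta header. Headers without a GN= field are simply skipped here.

```python
import re


def read_mrna_parents(gff_lines):
    """Map each mRNA id to its parent gene id using GFF3 column 9."""
    parents = {}
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) < 9 or cols[2] != "mRNA":
            continue
        # Attributes look like "ID=m.4830;Parent=g.4830".
        attrs = dict(p.split("=", 1) for p in cols[8].split(";") if "=" in p)
        if "ID" in attrs and "Parent" in attrs:
            parents[attrs["ID"]] = attrs["Parent"]
    return parents


def read_gene_names(fasta_lines):
    """Map each fasta id to the value of its GN= field, when present."""
    names = {}
    for line in fasta_lines:
        if line.startswith(">"):
            match = re.search(r"\bGN=(\S+)", line)
            if match:
                names[line[1:].split()[0]] = match.group(1)
    return names


def name_annotations(blast_hits, gff_lines, fasta_lines):
    """Combine both lookups into (parent_gene_id, "name", gene_name) tuples."""
    parents = read_mrna_parents(gff_lines)
    names = read_gene_names(fasta_lines)
    return [
        (parents[mrna_id], "name", names[dbxref])
        for mrna_id, dbxref in blast_hits.items()
        if mrna_id in parents and dbxref in names
    ]
```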

Now that we understand what this case is doing, let's take a look at some sample input/output:

Input
Blast output
m.4830 sp|Q5AZY1|MRH4_EMENI    32.65   49  33  0   114 162 500 548 0.56    34.3
m.4831 sp|Q9TTC1|POL_KORV  32.26   155 102 3   23  174 807 961 1e-16   81.6
m.4837 sp|P05892|GAG_SIVVT 45.24   42  21  2   35  75  394 434 0.012   38.1
m.4838 sp|Q9UGP4|LIMD1_HUMAN   29.58   71  42  3   32  96  369 437 0.88    31.6
GFF3 file

comp9975_c0_seq1   .   gene    25  603 .   +   .   ID=g.4830
comp9975_c0_seq1   .   mRNA    25  603 .   +   .   ID=m.4830;Parent=g.4830
comp9975_c0_seq1   .   CDS 25  603 .   +   .   ID=cds.m.4830;Parent=m.4830
comp9975_c1_seq1   .   gene    3   533 .   +   .   ID=g.4831
comp9975_c1_seq1   .   mRNA    3   533 .   +   .   ID=m.4831;Parent=g.4831
comp9975_c1_seq1   .   CDS 3   533 .   +   .   ID=cds.m.4831;Parent=m.4831
comp9982_c0_seq1   .   gene    1   441 .   -   .   ID=g.4837
comp9982_c0_seq1   .   mRNA    1   441 .   -   .   ID=m.4837;Parent=g.4837
comp9982_c0_seq1   .   CDS 1   441 .   -   .   ID=cds.m.4837;Parent=m.4837
comp9983_c0_seq1   .   gene    1746    2060    .   +   .   ID=g.4838
comp9983_c0_seq1   .   mRNA    1746    2060    .   +   .   ID=m.4838;Parent=g.4838
comp9983_c0_seq1   .   CDS 1746    2060    .   +   .   ID=cds.m.4838;Parent=m.4838

SwissProt database in fasta format

(we removed the sequences for readability of this example)

>sp|Q5AZY1|MRH4_EMENI ATP-dependent RNA helicase mrh4, mitochondrial OS=Emericella nidulans (strain FGSC A4 / ATCC 38163 / CBS 112.46 / NRRL 194 / M139) GN=mrh4 PE=3 SV=1
...
>sp|Q9TTC1|POL_KORV Pro-Pol polyprotein OS=Koala retrovirus GN=pro-pol PE=3 SV=1
...
>sp|P05892|GAG_SIVVT Gag polyprotein OS=Simian immunodeficiency virus agm.vervet (isolate AGM TYO-1) GN=gag PE=3 SV=1
...
>sp|Q9UGP4|LIMD1_HUMAN LIM domain-containing protein 1 OS=Homo sapiens PE=1 SV=1
Output
g.4838 name    LIMD1
m.4838 product LIM domain-containing protein 1
g.4830 name    mrh4
m.4830 product ATP-dependent RNA helicase mrh4, mitochondrial
g.4831 name    pro-pol
m.4831 product Pro-Pol polyprotein
g.4837 name    gag
m.4837 product Gag polyprotein

A few notes on the sample input/output:

Annie makes a few assumptions about your input files so if something doesn't work, check to see if your file meets these assumptions:

Other Options

Annie also provides options for a blacklist of undesirable "product" annotations, a whitelist of chosen Dbxref sources, and a custom output filename. These options are invoked with the --blacklist, --whitelist, and -o flags, respectively.
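
As a rough picture of what such filtering amounts to, here is a small Python sketch. It does not reflect how Annie reads or applies its blacklist and whitelist files, which this page does not describe. The function name, the case-insensitive exact matching of product values, and the reading of the Dbxref source as the text before the first colon are all assumptions made for illustration.

```python
def filter_annotations(annotations, product_blacklist=(), dbxref_whitelist=None):
    """Drop blacklisted products and Dbxrefs from non-whitelisted sources."""
    blacklist = {entry.lower() for entry in product_blacklist}
    whitelist = None
    if dbxref_whitelist is not None:
        whitelist = {source.upper() for source in dbxref_whitelist}

    kept = []
    for feature_id, key, value in annotations:
        # Assumption: a product is dropped only on an exact, case-insensitive match.
        if key == "product" and value.lower() in blacklist:
            continue
        # Assumption: the Dbxref source is the text before the first colon.
        if key == "Dbxref" and whitelist is not None:
            if value.split(":", 1)[0].upper() not in whitelist:
                continue
        kept.append((feature_id, key, value))
    return kept
```

With no whitelist given, every Dbxref is kept; annotations with other keys, such as GO or name, pass through untouched.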