Introduction

License

Annie is released under the MIT License.

Purpose

Annie reads genomic/transcriptomic annotation information from various sources (IPRScan, SwissProt BLAST results, and soon Trinotate/Blast2GO) and creates a three-column table. Why would you want a three-column table of annotations? To add functional annotations to a genome using the Genome Annotation Generator (GAG), or to a transcriptome using Transvestigator, of course!

Installation

Annie is written in Python 3. If you have Python 3 (not 2!) installed on your computer, you can download the zipped source code from the link at the top of the page and extract it. That's it!

Citing Annie

If you'd like to flatter us, please cite us as follows:

Tate, R., Hall, B., DeRego, T., & Geib, S. (2014). Annie: the ANNotation Information Extractor (Version 1.0) [Software]. Available from http://genomeannotation.github.io/annie.

Usage

Currently, Annie handles two cases: results from InterProScan and results from a BLAST search against the UniProt SwissProt database.

InterProScan (ipr)

The ipr input file is the tab-separated (.tsv) output from InterProScan. Make sure you have this format (not xml). A sample file is provided with the repository as sample_data/sample.ipr.

To convert this file to a three-column annotation table, issue the following command:

python3 annie.py -ipr sample_data/sample.ipr

The output file is a table of annotations. Each row in the ipr file contains the mrna id and the dbxref, as well as the GO and IPR values if they exist. For each row, we create up to 3 annotations where the feature id is the mrna id, and the keys are "Dbxref", "InterPro" and "GO" with their respective values. Here is a sample input/output pair:

Input

m.98281    c95b0824ccd627403aa63f9e474649cc    7571    Pfam    PF00041 Fibronectin type III domain 3729    3812    4.7E-14 T   04-04-2014  IPR003961   Fibronectin, type III   GO:0005515
m.98281    c95b0824ccd627403aa63f9e474649cc    7571    Pfam    PF00041 Fibronectin type III domain 6484    6567    1.8E-12 T   04-04-2014  IPR003961   Fibronectin, type III   GO:0005515
m.42655    de17ff06d901d22dacc3f5c91510f33f    288 Pfam    PF12171 Zinc-finger double-stranded RNA-binding 45  69  1.3E-5  T   05-04-2014  IPR022755   Zinc finger, double-stranded RNA binding
m.82734    ae22a6fcc80b2ac982378f16ac022b3d    144 Pfam    PF14846 Domain of unknown function (DUF4485)    8   89  4.8E-17 T   05-04-2014  IPR027831   Domain of unknown function DUF4485

Output

m.42655    Dbxref  PFAM:PF12171
m.42655    InterPro    IPR022755
m.82734    Dbxref  PFAM:PF14846
m.82734    InterPro    IPR027831
m.98281    Dbxref  PFAM:PF00041
m.98281    GO  GO:0005515
m.98281    InterPro    IPR003961

A few notes on the sample input/output:

The GO annotation doesn't appear for every mrna because not every mrna has one.
Notice that the mrna, dbxref, GO, and IPR values for the first two entries in the input are the same, so they result in the same annotations. Knowing that nobody has time for duplicate annotations, Annie automatically removes them for you.
The output annotations are sorted: first by the feature id (the mrna id), then by the key (e.g. "Dbxref", "GO", "InterPro"), then by the value.
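
The duplicate removal and ordering described in these notes can be sketched in a few lines of Python. This is only an illustration, not Annie's actual code: the function name dedupe_and_sort is made up here, and we assume each annotation is held as a (feature id, key, value) tuple of strings and that the table is written tab-separated.

```python
# Hypothetical helper for illustration; not taken from Annie's source.
def dedupe_and_sort(annotations):
    """Drop duplicate (feature_id, key, value) tuples, then sort them.

    Tuples of strings sort by feature id first, then key, then value,
    which is the ordering described in the notes above.
    """
    return sorted(set(annotations))


rows = [
    ("m.98281", "Dbxref", "PFAM:PF00041"),
    ("m.98281", "Dbxref", "PFAM:PF00041"),  # duplicate from the second input row
    ("m.98281", "InterPro", "IPR003961"),
    ("m.98281", "GO", "GO:0005515"),
    ("m.42655", "InterPro", "IPR022755"),
    ("m.42655", "Dbxref", "PFAM:PF12171"),
]

for feature_id, key, value in dedupe_and_sort(rows):
    # Assumption: the three columns are written tab-separated.
    print(feature_id, key, value, sep="\t")
```

Note that the sort here is a plain string comparison, so feature ids are ordered character by character rather than numerically.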

Annie makes a few assumptions about your input files so if something doesn't work, check to see if your file meets these assumptions:

values in each row are tab-separated
the first column is the mrna id
the 4th and 5th columns correspond to your dbxref
the 12th column is the IPR column if the 12th column exists
the 14th column is the GO column if the 14th column exists
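
If you want to check a row against these assumptions yourself, here is a minimal Python sketch of how one ipr row could be turned into annotations. It is not Annie's implementation: the function name annotations_from_ipr_line is hypothetical, and the "PFAM:PF00041" style of the Dbxref value (database name upper-cased, then a colon, then the accession) is inferred from the sample output above.

```python
# Hypothetical helper for illustration; not taken from Annie's source.
def annotations_from_ipr_line(line):
    """Turn one tab-separated InterProScan row into up to 3 annotations."""
    columns = line.rstrip("\n").split("\t")
    mrna_id = columns[0]  # 1st column: mrna id
    # 4th and 5th columns: dbxref (format inferred from the sample output)
    annotations = [(mrna_id, "Dbxref", columns[3].upper() + ":" + columns[4])]
    if len(columns) >= 12 and columns[11]:  # 12th column: IPR, if present
        annotations.append((mrna_id, "InterPro", columns[11]))
    if len(columns) >= 14 and columns[13]:  # 14th column: GO, if present
        annotations.append((mrna_id, "GO", columns[13]))
    return annotations


# First row of the sample input above, rebuilt with explicit tabs.
fields = [
    "m.98281", "c95b0824ccd627403aa63f9e474649cc", "7571", "Pfam", "PF00041",
    "Fibronectin type III domain", "3729", "3812", "4.7E-14", "T",
    "04-04-2014", "IPR003961", "Fibronectin, type III", "GO:0005515",
]
sample_line = "\t".join(fields)

for annotation in annotations_from_ipr_line(sample_line):
    print("\t".join(annotation))
```

A row with fewer than 14 columns simply yields no GO annotation, which is why GO is missing for some mrnas in the sample output.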

SwissProt (sprot)

SwissProt annotations require a few input files:

The output file from BLAST in tabular form (blastall -m 8 or blast+ -outfmt 6)
A GFF3 file corresponding to your assembly
The fasta file representing the database provided to BLAST

Examples of each of these files are provided as sample_data/sample.blastout, sample_data/sample.gff, and sample_data/sample.fasta. To create a three-column annotation table from the sample data, use this command:

python3 annie.py \
    -b sample_data/sample.blastout \
    -g sample_data/sample.gff \
    -db sample_data/sample.fasta

The sprot case returns two types of annotations. The first is a product annotation in the form <mrna_id> product <product>. We use the blast file to get the dbxref for the associated mrna and then we use the fasta file to take that dbxref and get the corresponding product. Here is a diagram of what goes on:

mrna_id ---blast_file---> dbxref ---fasta_file---> product
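
The diagram above amounts to two dictionary lookups chained together. Here is a rough Python sketch of that idea, not Annie's real code: the names hits_from_blast and product_annotations are invented for this example, the products dictionary stands in for whatever is read from the fasta file, and keeping only the first hit per mrna is our assumption.

```python
# Hypothetical helpers for illustration; not taken from Annie's source.
def hits_from_blast(blast_lines):
    """Map mrna id -> dbxref using the first two tab-separated columns."""
    hits = {}
    for line in blast_lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2:
            continue
        # Assumption: if an mrna has several hits, keep the first one.
        hits.setdefault(columns[0], columns[1])
    return hits


def product_annotations(hits, products):
    """Chain mrna id -> dbxref -> product into (mrna_id, 'product', product)."""
    annotations = []
    for mrna_id, dbxref in hits.items():
        product = products.get(dbxref)
        if product is None:
            continue  # the dbxref was not found in the fasta file
        annotations.append((mrna_id, "product", product))
    return annotations


blast_lines = [
    "m.4830\tsp|Q5AZY1|MRH4_EMENI\t32.65\t49",
    "m.4837\tsp|P05892|GAG_SIVVT\t45.24\t42",
]
# Stand-in for the dbxref -> product mapping read from the fasta file.
products = {
    "sp|Q5AZY1|MRH4_EMENI": "ATP-dependent RNA helicase mrh4, mitochondrial",
    "sp|P05892|GAG_SIVVT": "Gag polyprotein",
}

for annotation in product_annotations(hits_from_blast(blast_lines), products):
    print("\t".join(annotation))
```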

Next, we have the name annotation. The name annotation has the form <parent_gene_id> name <parent_gene_name>. First, for each mrna in the blast results, we look it up in the gff file to get the corresponding parent gene id. That is how we obtain the <parent_gene_id> portion of the annotation. The process to obtain the <parent_gene_name> is similar to getting the product for the product annotation. We use the blast file to get the dbxref associated with the mrna, and then we use the fasta file to take that dbxref and give us the gene name. Here is a diagram of how the name annotation is made:

mrna_id ---gff_file---> parent_gene_id

mrna_id ---blast_file---> dbxref ---fasta_file---> parent_gene_name
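
As a rough illustration of the two diagrams above, the following Python sketch reads the mrna-to-gene mapping out of GFF lines and then combines it with the blast and fasta lookups. None of this is Annie's actual code: parents_from_gff and name_annotations are made-up names, and the blast_hits and gene_names dictionaries stand in for data read from the blast and fasta files.

```python
# Hypothetical helpers for illustration; not taken from Annie's source.
def parents_from_gff(gff_lines):
    """Map mrna id -> parent gene id using the mRNA rows of a GFF3 file."""
    parents = {}
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "mRNA":
            continue  # only mRNA rows carry the mapping we need
        # 9th column looks like ID=<mrna_id>;Parent=<parent_gene_id>
        attributes = dict(
            field.split("=", 1) for field in columns[8].split(";") if "=" in field
        )
        if "ID" in attributes and "Parent" in attributes:
            parents[attributes["ID"]] = attributes["Parent"]
    return parents


def name_annotations(blast_hits, parents, gene_names):
    """Build (parent_gene_id, 'name', parent_gene_name) tuples."""
    annotations = []
    for mrna_id, dbxref in blast_hits.items():
        gene_id = parents.get(mrna_id)       # mrna_id --gff--> parent_gene_id
        gene_name = gene_names.get(dbxref)   # dbxref --fasta--> gene name
        if gene_id is None or gene_name is None:
            continue  # a lookup failed, so skip this annotation
        annotations.append((gene_id, "name", gene_name))
    return annotations


gff_lines = [
    "comp9975_c0_seq1\t.\tgene\t25\t603\t.\t+\t.\tID=g.4830",
    "comp9975_c0_seq1\t.\tmRNA\t25\t603\t.\t+\t.\tID=m.4830;Parent=g.4830",
]
blast_hits = {"m.4830": "sp|Q5AZY1|MRH4_EMENI"}  # stand-in for the blast file
gene_names = {"sp|Q5AZY1|MRH4_EMENI": "mrh4"}    # stand-in for the fasta file

print(name_annotations(blast_hits, parents_from_gff(gff_lines), gene_names))
```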

Now that we understand what this case is doing, let's take a look at some sample input/output:

Input

Blast output

m.4830 sp|Q5AZY1|MRH4_EMENI    32.65   49  33  0   114 162 500 548 0.56    34.3
m.4831 sp|Q9TTC1|POL_KORV  32.26   155 102 3   23  174 807 961 1e-16   81.6
m.4837 sp|P05892|GAG_SIVVT 45.24   42  21  2   35  75  394 434 0.012   38.1
m.4838 sp|Q9UGP4|LIMD1_HUMAN   29.58   71  42  3   32  96  369 437 0.88    31.6

GFF3 file

comp9975_c0_seq1   .   gene    25  603 .   +   .   ID=g.4830
comp9975_c0_seq1   .   mRNA    25  603 .   +   .   ID=m.4830;Parent=g.4830
comp9975_c0_seq1   .   CDS 25  603 .   +   .   ID=cds.m.4830;Parent=m.4830
comp9975_c1_seq1   .   gene    3   533 .   +   .   ID=g.4831
comp9975_c1_seq1   .   mRNA    3   533 .   +   .   ID=m.4831;Parent=g.4831
comp9975_c1_seq1   .   CDS 3   533 .   +   .   ID=cds.m.4831;Parent=m.4831
comp9982_c0_seq1   .   gene    1   441 .   -   .   ID=g.4837
comp9982_c0_seq1   .   mRNA    1   441 .   -   .   ID=m.4837;Parent=g.4837
comp9982_c0_seq1   .   CDS 1   441 .   -   .   ID=cds.m.4837;Parent=m.4837
comp9983_c0_seq1   .   gene    1746    2060    .   +   .   ID=g.4838
comp9983_c0_seq1   .   mRNA    1746    2060    .   +   .   ID=m.4838;Parent=g.4838
comp9983_c0_seq1   .   CDS 1746    2060    .   +   .   ID=cds.m.4838;Parent=m.4838

SwissProt database in fasta format

(we removed the sequences for readability of this example)

>sp|Q5AZY1|MRH4_EMENI ATP-dependent RNA helicase mrh4, mitochondrial OS=Emericella nidulans (strain FGSC A4 / ATCC 38163 / CBS 112.46 / NRRL 194 / M139) GN=mrh4 PE=3 SV=1
...
>sp|Q9TTC1|POL_KORV Pro-Pol polyprotein OS=Koala retrovirus GN=pro-pol PE=3 SV=1
...
>sp|P05892|GAG_SIVVT Gag polyprotein OS=Simian immunodeficiency virus agm.vervet (isolate AGM TYO-1) GN=gag PE=3 SV=1
...
>sp|Q9UGP4|LIMD1_HUMAN LIM domain-containing protein 1 OS=Homo sapiens PE=1 SV=1

Output

g.4838 name    LIMD1
m.4838 product LIM domain-containing protein 1
g.4830 name    mrh4
m.4830 product ATP-dependent RNA helicase mrh4, mitochondrial
g.4831 name    pro-pol
m.4831 product Pro-Pol polyprotein
g.4837 name    gag
m.4837 product Gag polyprotein

A few notes on the sample input/output:

Not every gene id has an associated gene name. In that case, we use part of the dbxref to name it (the part after the second '|' and from that part, the part before the first '_').
Although not seen here, if two name annotations have the same value (i.e. the same parent gene name), we index them by adding a "_0" in front of the first one, a "_1" in front of the second one, etc.
Although not seen here, if Annie fails to cross-reference something, it'll skip doing that annotation and gently notify you in the command prompt/terminal.
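
The fallback naming rule from the first note is easy to try out by hand. Here is a small Python sketch of it, with the caveat that fallback_gene_name is a name we made up for this example and the code is not lifted from Annie.

```python
# Hypothetical helper for illustration; not taken from Annie's source.
def fallback_gene_name(dbxref):
    """Derive a gene name from a dbxref such as 'sp|Q9UGP4|LIMD1_HUMAN'."""
    entry_name = dbxref.split("|")[2]  # the part after the second '|'
    return entry_name.split("_")[0]    # from that, the part before the first '_'


print(fallback_gene_name("sp|Q9UGP4|LIMD1_HUMAN"))  # prints LIMD1
```

This matches the sample output above, where g.4838 is named LIMD1 even though its fasta header has no GN= field.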

Annie makes a few assumptions about your input files so if something doesn't work, check to see if your file meets these assumptions:

in the blast file, every row has tab-separated values with the mrna id in the first column and the dbxref in the second column
in the gff file, every row has tab-separated values
in the gff file, the third column has the feature type ("mRNA", etc). We also assume the third column, if it is for mrna, has exactly the value "mRNA" with no surrounding whitespace or changes in capitalization.
in the gff file, the 9th column should have the mrna id and parent gene id in exactly the following form: ID=<mrna_id>;Parent=<parent_gene_id> so an example would be ID=m.123;Parent=g.123
in the fasta file, if the line is a header line, the first character is '>'
in the fasta file, the dbxref is the first thing to appear after the '>' with no whitespace between the '>' and the dbxref
in the fasta file, the dbxref has no whitespace contained inside
in the fasta file, the product is everything between the dbxref and the string "OS=" which we assume exists
in the fasta file, if the gene name exists, it contains no whitespace within
in the fasta file, if the gene name exists, it is right before the string "PE=" which we assume exists
in the fasta file, we assume the gene name is in the following form: GN=<gene_name>. An example would be GN=gag
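
To see how the fasta assumptions fit together, here is a minimal Python sketch that pulls the dbxref, product, and gene name out of one header line. It is an illustration under the assumptions listed above, not Annie's own parser, and parse_fasta_header is a hypothetical name.

```python
# Hypothetical helper for illustration; not taken from Annie's source.
def parse_fasta_header(line):
    """Return (dbxref, product, gene_name) for a header line, else None.

    gene_name is None when the header has no GN=<gene_name> field.
    """
    if not line.startswith(">"):
        return None  # sequence line, not a header
    header = line[1:].rstrip("\n")
    # The dbxref runs from just after '>' up to the first space.
    dbxref, _, rest = header.partition(" ")
    # The product is everything between the dbxref and "OS=".
    product = rest.split("OS=", 1)[0].strip()
    gene_name = None
    for token in rest.split():
        if token.startswith("GN="):  # gene name contains no whitespace
            gene_name = token[3:]
            break
    return dbxref, product, gene_name


header = (
    ">sp|Q9TTC1|POL_KORV Pro-Pol polyprotein "
    "OS=Koala retrovirus GN=pro-pol PE=3 SV=1"
)
print(parse_fasta_header(header))
# prints ('sp|Q9TTC1|POL_KORV', 'Pro-Pol polyprotein', 'pro-pol')
```

For a header without a GN= field, such as the LIMD1_HUMAN entry in the sample, the gene name comes back as None and a fallback name would be needed.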

Other Options

Annie also provides options for a blacklist of undesirable "product" annotations, a whitelist of chosen Dbxref sources, and a custom output filename. These options are invoked with the --blacklist, --whitelist, and -o flags, respectively.