Introduction
License
Annie is released under the MIT License.
Purpose
Annie reads genomic/transcriptomic annotation information from various sources -- IPRScan, SwissProt BLAST results, and soon Trinotate/Blast2GO -- and creates a 3-column table. Why would you want to create a 3-column table of annotations? To add functional annotations to a genome using the Genome Annotation Generator (GAG), or to a transcriptome using Transvestigator, of course!
Installation
Annie is written in Python 3. If you have Python 3 (not 2!) installed on your computer, you can download the zipped source code from the link at the top of the page and extract it. That's it!
Citing Annie
If you'd like to flatter us, please cite us as follows:
Tate, R., Hall, B., DeRego, T., & Geib, S. (2014). Annie: the ANNotation Information Extractor (Version 1.0) [Software]. Available from http://genomeannotation.github.io/annie.
Usage
Currently, Annie handles two cases: results from InterProScan and results from a BLAST search against the UniProt SwissProt database.
InterProScan (ipr)
The ipr input file is the tab-separated (.tsv) output from InterProScan; make sure you have this format (not XML). A sample file is provided with the repository as sample_data/sample.ipr.
To convert this file to a three-column annotation table, issue the following command:
python3 annie.py -ipr sample_data/sample.ipr
The output file is a table of annotations. Each row in the ipr file contains the mrna id and the dbxref, as well as the GO and IPR values if they exist. For each row, we create up to 3 annotations where the feature id is the mrna id and the keys are "Dbxref", "InterPro", and "GO" with their respective values. Here is a sample input/output pair:
Input
m.98281 c95b0824ccd627403aa63f9e474649cc 7571 Pfam PF00041 Fibronectin type III domain 3729 3812 4.7E-14 T 04-04-2014 IPR003961 Fibronectin, type III GO:0005515
m.98281 c95b0824ccd627403aa63f9e474649cc 7571 Pfam PF00041 Fibronectin type III domain 6484 6567 1.8E-12 T 04-04-2014 IPR003961 Fibronectin, type III GO:0005515
m.42655 de17ff06d901d22dacc3f5c91510f33f 288 Pfam PF12171 Zinc-finger double-stranded RNA-binding 45 69 1.3E-5 T 05-04-2014 IPR022755 Zinc finger, double-stranded RNA binding
m.82734 ae22a6fcc80b2ac982378f16ac022b3d 144 Pfam PF14846 Domain of unknown function (DUF4485) 8 89 4.8E-17 T 05-04-2014 IPR027831 Domain of unknown function DUF4485
Output
m.42655 Dbxref PFAM:PF12171
m.42655 InterPro IPR022755
m.82734 Dbxref PFAM:PF14846
m.82734 InterPro IPR027831
m.98281 Dbxref PFAM:PF00041
m.98281 GO GO:0005515
m.98281 InterPro IPR003961
A few notes on the sample input/output:
- The GO annotation doesn't appear for every mrna because not every mrna has one.
- Notice that the mrna, dbxref, GO, and IPR values for the first two entries in the input are the same so they result in the same annotations. Knowing that nobody has time for duplicate annotations, Annie automatically removes them for you.
- The output has the annotations ordered. In other words, it first sorts by the feature id (the mrna id), then by the key (e.g. "Dbxref", "GO", "InterPro"), then by the value.
Annie makes a few assumptions about your input files so if something doesn't work, check to see if your file meets these assumptions:
- values in each row are tab-separated
- the first column is the mrna id
- the 4th and 5th columns correspond to your dbxref
- the 12th column is the IPR column if the 12th column exists
- the 14th column is the GO column if the 14th column exists
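The row-to-annotation logic described above can be sketched in a few lines of Python. This is an illustration built only from the assumptions listed here, not Annie's actual source. The function name ipr_rows_to_annotations is hypothetical, the Dbxref value format (upper-cased source name, a colon, then the accession) is inferred from the sample output, and splitting the GO column on "|" is an assumption about rows that carry more than one GO term.

```python
def ipr_rows_to_annotations(lines):
    """Turn InterProScan TSV rows into sorted, de-duplicated 3-column annotations."""
    annotations = set()  # a set drops duplicate annotations automatically
    for line in lines:
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 5:
            continue  # not enough columns for an mrna id and a dbxref
        mrna_id = columns[0]
        # Columns 4 and 5 hold the dbxref source and accession (e.g. Pfam, PF00041).
        annotations.add((mrna_id, "Dbxref", columns[3].upper() + ":" + columns[4]))
        # Column 12 is the IPR accession, if the column exists and is non-empty.
        if len(columns) >= 12 and columns[11]:
            annotations.add((mrna_id, "InterPro", columns[11]))
        # Column 14 is the GO column, if it exists (assumed "|"-separated).
        if len(columns) >= 14:
            for go_term in columns[13].split("|"):
                if go_term:
                    annotations.add((mrna_id, "GO", go_term))
    # Tuples sort by feature id, then key, then value.
    return sorted(annotations)


# Example with one row from the sample input (fields joined by tabs).
row = "\t".join([
    "m.42655", "de17ff06d901d22dacc3f5c91510f33f", "288", "Pfam", "PF12171",
    "Zinc-finger double-stranded RNA-binding", "45", "69", "1.3E-5", "T",
    "05-04-2014", "IPR022755", "Zinc finger, double-stranded RNA binding",
])
for annotation in ipr_rows_to_annotations([row]):
    print("\t".join(annotation))
```

Collecting the annotations in a set and sorting the tuples at the end is one simple way to get both the duplicate removal and the ordering described in the notes above.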
SwissProt (sprot)
SwissProt annotations require a few input files:
- The output file from BLAST in tabular form (blastall -m 8 or blast+ -outfmt 6)
- A GFF3 file corresponding to your assembly
- The fasta file representing the database provided to BLAST (get yours here)
Examples of each of these files are provided as sample_data/sample.blastout, sample_data/sample.gff, and sample_data/sample.fasta. To create a three-column annotation table from the sample data, use this command:
python3 annie.py \
    -b sample_data/sample.blastout \
    -g sample_data/sample.gff \
    -db sample_data/sample.fasta
The sprot case returns two types of annotations. The first is a product annotation in the form <mrna_id> product <product>. We use the blast file to get the dbxref for the associated mrna, and then we use the fasta file to look up the product for that dbxref. Here is a diagram of what goes on:
mrna_id ---blast_file---> dbxref ---fasta_file---> product
Next, we have the name annotation. The name annotation has the form <parent_gene_id> name <parent_gene_name>. First, for each mrna in the blast results, we look it up in the gff file to get the corresponding parent gene id. That is how we obtain the <parent_gene_id> portion of the annotation. The process to obtain the <parent_gene_name> is similar to getting the product for the product annotation: we use the blast file to get the dbxref associated with the mrna, and then we use the fasta file to look up the gene name for that dbxref. Here is a diagram of how the name annotation is made:
mrna_id ---gff_file---> parent_gene_id
mrna_id ---blast_file---> dbxref ---fasta_file---> parent_gene_name
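As a rough sketch of the two lookup chains above, assume the three input files have already been parsed into plain dictionaries. The function and dictionary names here are hypothetical and are not taken from Annie's source.

```python
def sprot_annotations(mrna_to_dbxref, mrna_to_gene, dbxref_to_product, dbxref_to_name):
    """Build product and name annotations by chaining dictionary lookups."""
    annotations = []
    for mrna_id, dbxref in mrna_to_dbxref.items():
        # mrna_id ---blast_file---> dbxref ---fasta_file---> product
        product = dbxref_to_product.get(dbxref)
        if product is not None:
            annotations.append((mrna_id, "product", product))
        # mrna_id ---gff_file---> parent_gene_id
        # mrna_id ---blast_file---> dbxref ---fasta_file---> parent_gene_name
        gene_id = mrna_to_gene.get(mrna_id)
        gene_name = dbxref_to_name.get(dbxref)
        if gene_id is not None and gene_name is not None:
            annotations.append((gene_id, "name", gene_name))
    return annotations


# Example using one entry from the sample data.
example = sprot_annotations(
    mrna_to_dbxref={"m.4831": "sp|Q9TTC1|POL_KORV"},      # from the blast file
    mrna_to_gene={"m.4831": "g.4831"},                    # from the gff file
    dbxref_to_product={"sp|Q9TTC1|POL_KORV": "Pro-Pol polyprotein"},  # from the fasta file
    dbxref_to_name={"sp|Q9TTC1|POL_KORV": "pro-pol"},     # from the fasta file
)
for annotation in example:
    print("\t".join(annotation))
```

In this sketch a failed lookup simply skips that annotation; how Annie itself reports such cases is not shown here.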
Now that we understand what this case is doing, let's take a look at some sample input/output:
Input
Blast output
m.4830 sp|Q5AZY1|MRH4_EMENI 32.65 49 33 0 114 162 500 548 0.56 34.3
m.4831 sp|Q9TTC1|POL_KORV 32.26 155 102 3 23 174 807 961 1e-16 81.6
m.4837 sp|P05892|GAG_SIVVT 45.24 42 21 2 35 75 394 434 0.012 38.1
m.4838 sp|Q9UGP4|LIMD1_HUMAN 29.58 71 42 3 32 96 369 437 0.88 31.6
GFF3 file
comp9975_c0_seq1 . gene 25 603 . + . ID=g.4830
comp9975_c0_seq1 . mRNA 25 603 . + . ID=m.4830;Parent=g.4830
comp9975_c0_seq1 . CDS 25 603 . + . ID=cds.m.4830;Parent=m.4830
comp9975_c1_seq1 . gene 3 533 . + . ID=g.4831
comp9975_c1_seq1 . mRNA 3 533 . + . ID=m.4831;Parent=g.4831
comp9975_c1_seq1 . CDS 3 533 . + . ID=cds.m.4831;Parent=m.4831
comp9982_c0_seq1 . gene 1 441 . - . ID=g.4837
comp9982_c0_seq1 . mRNA 1 441 . - . ID=m.4837;Parent=g.4837
comp9982_c0_seq1 . CDS 1 441 . - . ID=cds.m.4837;Parent=m.4837
comp9983_c0_seq1 . gene 1746 2060 . + . ID=g.4838
comp9983_c0_seq1 . mRNA 1746 2060 . + . ID=m.4838;Parent=g.4838
comp9983_c0_seq1 . CDS 1746 2060 . + . ID=cds.m.4838;Parent=m.4838
SwissProt database in fasta format
(we removed the sequences for readability of this example)
>sp|Q5AZY1|MRH4_EMENI ATP-dependent RNA helicase mrh4, mitochondrial OS=Emericella nidulans (strain FGSC A4 / ATCC 38163 / CBS 112.46 / NRRL 194 / M139) GN=mrh4 PE=3 SV=1
...
>sp|Q9TTC1|POL_KORV Pro-Pol polyprotein OS=Koala retrovirus GN=pro-pol PE=3 SV=1
...
>sp|P05892|GAG_SIVVT Gag polyprotein OS=Simian immunodeficiency virus agm.vervet (isolate AGM TYO-1) GN=gag PE=3 SV=1
...
>sp|Q9UGP4|LIMD1_HUMAN LIM domain-containing protein 1 OS=Homo sapiens PE=1 SV=1
Output
g.4838 name LIMD1
m.4838 product LIM domain-containing protein 1
g.4830 name mrh4
m.4830 product ATP-dependent RNA helicase mrh4, mitochondrial
g.4831 name pro-pol
m.4831 product Pro-Pol polyprotein
g.4837 name gag
m.4837 product Gag polyprotein
A few notes on the sample input/output:
- Not every gene id has an associated gene name (the LIMD1_HUMAN entry above, for example, has no GN= field). In that case, we use part of the dbxref to name it (the part after the second '|' and from that part, the part before the first '_').
- Although not seen here, if two name annotations have the same value (i.e. the same parent gene name), we index them by adding a "_0" in front of the first one, a "_1" in front of the second one, etc.
- Although not seen here, if Annie fails to cross-reference something, it'll skip doing that annotation and gently notify you in the command prompt/terminal.
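The fallback naming rule in the first note can be illustrated with a short Python sketch. The function name fallback_gene_name is hypothetical, and the code assumes the dbxref always contains at least two '|' characters, as the SwissProt identifiers in the sample do.

```python
def fallback_gene_name(dbxref):
    """Derive a gene name from a dbxref such as 'sp|Q9UGP4|LIMD1_HUMAN'."""
    entry_name = dbxref.split("|", 2)[2]  # the part after the second '|'
    return entry_name.split("_", 1)[0]    # from that part, the part before the first '_'


print(fallback_gene_name("sp|Q9UGP4|LIMD1_HUMAN"))  # prints LIMD1
```

This matches the sample output above, where g.4838 is named LIMD1 even though its fasta header has no gene name.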
Annie makes a few assumptions about your input files so if something doesn't work, check to see if your file meets these assumptions:
- in the blast file, every row has tab-separated values with the mrna id in the first column and the dbxref in the second column
- in the gff file, every row has tab-separated values
- in the gff file, the third column has the feature type ("mRNA", etc). We also assume the third column, if it is for mrna, has exactly the value "mRNA" with no surrounding whitespace or changes in capitalization.
- in the gff file, the 9th column should have the mrna id and parent gene id in exactly the following form: ID=<mrna_id>;Parent=<parent_gene_id>, so an example would be ID=m.123;Parent=g.123
- in the fasta file, if the line is a header line, the first character is '>'
- in the fasta file, the dbxref appears immediately after the '>' with no whitespace in between the '>' and the dbxref
- in the fasta file, the dbxref has no whitespace contained inside
- in the fasta file, the product is everything between the dbxref and the string "OS=" which we assume exists
- in the fasta file, if the gene name exists, it contains no whitespace within
- in the fasta file, if the gene name exists, it is right before the string "PE=" which we assume exists
- in the fasta file, we assume the gene name is in the following form: GN=<gene_name>. An example would be GN=gag
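To make these assumptions concrete, here is a minimal Python sketch of parsers that rely on exactly the layout described in the list above. It is not Annie's actual code: the function names parse_fasta_header and parse_gff_mrna are hypothetical, and the sketch additionally assumes a single space separates the dbxref from the product.

```python
import re


def parse_fasta_header(header):
    """Return (dbxref, product, gene_name); gene_name is None without a GN= field."""
    if not header.startswith(">"):
        raise ValueError("not a fasta header line")
    # The dbxref runs from just after '>' up to the first space.
    dbxref, _, rest = header[1:].strip().partition(" ")
    # The product is everything between the dbxref and the string "OS=".
    product = rest.split("OS=", 1)[0].strip()
    # The gene name, if present, follows "GN=" and contains no whitespace.
    match = re.search(r"GN=(\S+)", rest)
    gene_name = match.group(1) if match else None
    return dbxref, product, gene_name


def parse_gff_mrna(line):
    """Return (mrna_id, parent_gene_id) for an mRNA row, or None for any other row."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9 or columns[2] != "mRNA":
        return None
    # The 9th column must look exactly like ID=<mrna_id>;Parent=<parent_gene_id>.
    match = re.fullmatch(r"ID=([^;]+);Parent=([^;]+)", columns[8])
    if match is None:
        return None
    return match.group(1), match.group(2)


# Example using lines from the sample data.
print(parse_fasta_header(
    ">sp|P05892|GAG_SIVVT Gag polyprotein OS=Simian immunodeficiency virus "
    "agm.vervet (isolate AGM TYO-1) GN=gag PE=3 SV=1"
))
print(parse_gff_mrna(
    "comp9975_c0_seq1\t.\tmRNA\t25\t603\t.\t+\t.\tID=m.4830;Parent=g.4830"
))
```

If one of your files trips up Annie, running a line from it through checks like these can help show which assumption it breaks.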
Other Options
Annie also provides options for a blacklist of undesirable "product" annotations, a whitelist of chosen Dbxref sources, and a custom output filename. These options are invoked with the --blacklist, --whitelist, and -o flags, respectively.