View on GitHub

GAG - Genome Annotation Generator

UNSUPPORTED - Command line application to read, sanitize, annotate and modify genomic data. Can export an NCBI .tbl file of annotations on a genome.

Download this project as a .zip file Download this project as a tar.gz file

Introduction

License

GAG is released under the MIT License.

Purpose

GAG can read a genome and write it to the NCBI's .tbl format. This is the main purpose of the program. However, if you're writing a .tbl file, there's a good chance that your genome is destined to encounter tbl2asn further downstream. In anticipation of this meeting, GAG includes a (growing) number of options to remove questionable features or trim erroneous regions from your genome. It can also be used to add functional annotations. These options are outlined in the Advanced Usage section.

Installation

GAG is written in Python 2. Assuming you have Python 2 (not 3!) installed, you can download the zipped source code from the link at the top of this page, extract it and run it.

Basic Usage

Here's what a very simple GAG session might look like:

python gag.py --fasta organism.fasta --gff organism.gff --out gag_output

This command will create a subfolder called 'gag_output' containing, among other things, a .tbl file.

Genome

What is a genome? Great question. A genome can be many things to many people. To GAG, a genome is a .fasta file and a .gff file.

Fasta File

You must provide GAG with a .fasta file of nucleic acid sequence data. (More info here.) It is important to note that GAG is only expecting to find 'A', 'C', 'G', 'T' or 'N' (upper or lowercase) in your sequences. Mixed bases ('R', 'M', etc.) or invalid bases ('Q', 'ㅊ', etc.) are converted to 'N' for the purpose of protein translation, start/stop codon detection or whatever else we might dream up.

GAG will accept a scaffolded or unscaffolded assembly, though a scaffolded assembly is recommended.

GFF3 File

For best results, use a valid GFF3 file. The GFF3 specification can be found at the Sequence Ontology homepage, and there is a validator here. The genometools software can also be used to correct simple errors in gff3 files prior to utilizing GAG. (For unpredictable results, go ahead and try to load an invalid file. It's been known to work. Who knows?)

Your hopefully valid GFF3 file should contain features identified as 'gene', 'mRNA', 'exon', and 'CDS', along with possible companions 'start_codon' and 'stop_codon'. There are other valid features, but at the moment GAG will ignore them, writing them to a file called 'genome.ignored.gff' for posterity. (You can always reunite these features with your final result by typing cat genome.ignored.gff genome.gff > reunited.gff from a Linux prompt.) If you have a pet feature that you think belongs in your .tbl file, please visit the Issues Page and let us know.

Column 9 of a GFF3 file is the "attributes" column. A unique "ID" field is required for each feature; a "Parent" field is required for all non-gene features; everything else is gravy. For example, add a unique "Name" field to your gene and it will be included in your .tbl file. Attach "Dbxref" or "Ontology_term" fields to your mRNA feature and these annotations will be added when you write a .tbl file. See Functional Annotations for more information and examples.

Citing GAG

Please cite as follows:

Hall, B., DeRego, T., & Geib, S. (2014). GAG: the Genome Annotation Generator (Version 1.0) [Software]. Available from http://genomeannotation.github.io/GAG.

Advanced Usage

The following vignettes demonstrate some of GAG's capabilities for removing questionable features of your genome, fixing mistakes, and flagging features for further investigation.

Removing Terminal 'N's

Depending on the assembler you used, your genome may contain sequences that begin or end with unknown bases ("N" or "n"). The NCBI will not accept submissions containing such sequences. For example, take a look at the file walkthrough/terminal_ns/genome.fasta. It is crawling with unknown bases. This type of sequence will earn your genome submission a big red REJECTED stamp.

GAG can check each sequence in your .fasta for 'N's at the beginning or the end. If it finds any, it will remove them from the sequence and (if the bases were at the beginning of the sequence) update the indices of all features on the sequence accordingly. If by chance you have a gene that starts on one of these unknown bases, it will be removed. Such is life. (Note: GAG only removes 'N's at the beginning or end of a scaffold; it does not remove unknown bases within the assembly.)

To harness this unbelievable power, simply run GAG with the --fix_terminal_ns flag:

python gag.py\
    --fasta walkthrough/terminal_ns/genome.fasta\
    --gff walkthrough/terminal_ns/genome.gff\
    --fix_terminal_ns\
    --out terminal_ns_fixed

If you now open the file terminal_ns_removed/genome.fasta you can see that the 'N's at the beginning and end of the sequence are gone. (Compare this file with walkthrough/terminal_ns/genome.fasta.) Additionally, the indices of each feature in 'terminal_ns_removed/genome.gff' are updated to correspond to the new, cleaned up sequence.

Flagging and Removing Short Introns (and so much more)

Another NCBI guideline involves minimum length of introns. Ten seems to be the magic number. It's pretty common to find structural annotations coming from automated annotation pipelines with shorter introns, though. Luckily, GAG can flag (or remove) them for you. Is it possible there really are introns shorter than 10 bp in your genome? I guess so. Should you just go in and remove every mRNA with a short intron? Or every mRNA coding a protein shorter than a certain length? That is up to you. But tbl2asn and NCBI have specific minimal size limits for introns, peptide lengths, etc. So if you have something you want to keep that is outside of these limits, then take it up with them. We suggest a flagging and reviewing approach for those who are cautious.

Below is an example of flagging and removing short introns; for a full list of commands type gag.py -h

To begin, let's get some basic stats on our genome. Run GAG without any special options:

gag.py -f walkthrough/short_introns/genome.fasta -g walkthrough/short_introns/genome.gff

Now take a look at the stats file located at gag_output/genome.stats. You'll see a lot of info; here are the good bits:

Number of sequences:   1
                                 Genome     
                                 ------     
Total sequence length            105       
Number of genes                  2          
...
Shortest intron                  6          
...

So we have two genes, and at least one of them has a suspiciously short intron. Take a look at walkthrough/short_introns/genome.gff

foo_seq	maker	gene	6	35	0	-	0	ID=TEST_1
foo_seq	maker	mRNA	6	35	0	-	0	ID=TEST_1-RA;Parent=TEST_1
foo_seq	maker	exon	6	10	0.94	-	0	ID=TEST_1-RA:exon:0;Parent=TEST_1-RA
foo_seq	maker	exon	15	25	0.94	-	0	ID=TEST_1-RA:exon:1;Parent=TEST_1-RA
foo_seq	maker	CDS	6	10	0	-	0	ID=TEST_1-RA:cds:0;Parent=TEST_1-RA
foo_seq	maker	CDS	15	25	0	-	0	ID=TEST_1-RA:cds:1;Parent=TEST_1-RA

If you do the math, you'll see that the intron between those two exons is only 4bp long. We have found our culprit.

To address this, run GAG again, this time with the --flag_introns_shorter_than option:

gag.py -f walkthrough/short_introns/genome.fasta -g walkthrough/short_introns/genome.gff -o flagged_intron--flag_introns_shorter_than 10

Inspecting the results in flagged_intron/genome.gff:

foo_seq	maker	gene	6	35	.	-	.	ID=TEST_1
foo_seq	maker	mRNA	6	35	.	-	.	ID=TEST_1-RA;Parent=TEST_1
foo_seq	maker	exon	6	10	0.94	-	.	ID=TEST_1-RA:exon:0;Parent=TEST_1-RA;gag_flag=intron_min_length:10
foo_seq	maker	exon	15	25	0.94	-	.	ID=TEST_1-RA:exon:1;Parent=TEST_1-RA;gag_flag=intron_min_length:10
foo_seq	maker	CDS	6	10	.	-	0	ID=TEST_1-RA:cds:0;Parent=TEST_1-RA
foo_seq	maker	CDS	15	25	.	-	0	ID=TEST_1-RA:cds:1;Parent=TEST_1-RA

The only difference is in the very last column -- both of those exons have an added annotation. It says gag_flag=intron_min_length:10.

At this point, you can load the flagged genome into a genome browser and search for features containing the annotation 'gag_flag' in order to inspect them more closely. If you determine that they don't belong in your genome, return to GAG and replace 'flag' with 'remove' in the above command.

One final word about removing features like this: GAG's policy is to remove the parent mRNA when an offending intron, exon or CDS length is detected. If this leaves the parent gene without any mRNAs, the gene is removed as well. In other words, GAG removes empty genes and mRNAs by default. If you have empty features that you'd like to keep for some reason, stop by the Issues page and let us know.

Creating Start and Stop Codons

Certain configurations of certain annotation pipelines/software will produce lines in the GFF3 file indicating 'start_codon' and 'stop_codon' features. As mentioned above, GAG reads and stores these features. If they are missing from your annotations (which is common), you must create them. If they are present in your gff3 file and you don't run this command, GAG will assume that they are correct and translate based off of them. If you have explicit start and stop codon features in your GFF3 file and you have reason to believe they are incorrect, or if they're missing entirely, run the command below and GAG will recalculate (assuming standard eukaryotic start and stop codons).

As an example, take a look at the genome contained in the folder walkthrough/start_stop_codons. The .fasta looks like this:

>foo_seq
GATTAatgATTACAGATTACAtagTACAGATTACA
and the .gff looks like this:
foo_seq	maker	gene	6	35	0	+	0	ID=TEST_1
foo_seq	maker	mRNA	6	35	0	+	0	ID=TEST_1-RA;Parent=TEST_1
foo_seq	maker	exon	6	10	0.94	+	0	ID=TEST_1-RA:exon:0;Parent=TEST_1-RA
foo_seq	maker	exon	15	24	0.94	+	0	ID=TEST_1-RA:exon:1;Parent=TEST_1-RA
foo_seq	maker	CDS	6	10	0	+	0	ID=TEST_1-RA:cds:0;Parent=TEST_1-RA
foo_seq	maker	CDS	15	24	0	+	0	ID=TEST_1-RA:cds:1;Parent=TEST_1-RA

If you use GAG to write this genome to an output folder, the file 'genome.stats' will contain a lot of statistics; here are the highlights:


Number of sequences:   1
                                 Genome     
                                 ------     
Total sequence length            105        
Number of genes                  1          
...
Number of CDS                    1
CDS: complete                    0          
CDS: start, no stop              0          
CDS: stop, no start              0          
CDS: no stop, no start           1

So we have one CDS, and it has neither a stop nor a start. However, if you get out your reading glasses and examine the .gff file, you can see that the first three bases of that cds are at indices 6, 7, 8. If you count off the bases in the .fasta, you'll see that the corresponding nucleotides are 'atg'. Looks like a start codon to me. You can perform a similar exercise to find the last three bases, or just trust that we've rendered them in lowercase for convenience -- they are 'tag', a stop codon.

To repair this oversight, run GAG with the fix_start_stop flag:

python gag.py\
    --fasta walkthrough/start_stop_codons/genome.fasta\
    --gff walkthrough/start_stop_codons/genome.gff\
    --fix_start_stop\
    --out start_stop_fixed

Take a look at the start_stop_fixed/genome.stats and marvel at the improvement. Here are the relevant lines:

GAG> info
----------------------------------------
Number of sequences:   1
                                 Reference Genome     Modified Genome     
                                 ----------------     ---------------     
Total sequence length            105                  105                 
Number of genes                  1                    1                   
...
Number of CDS                    1                    1
CDS: complete                    0                    1
CDS: start, no stop              0                    0                   
CDS: stop, no start              0                    0                   
CDS: no stop, no start           1                    0                   

That CDS is now counted as "complete", meaning we found the start and stop. When you write it to .tbl format, this information will be carried over and will prevent those insidious "ERROR: PartialProblem" messages from tbl2asn.

Functional Annotations

GAG supports two complementary methods of adding functional annotations to your genome. The easiest way is to include them in column 9 of your .gff file. Adding a "Name=___" entry to a gene will result in this gene name being added to your .tbl file. You can then add entries to its child mRNAs using one of the following keywords: "Dbxref=___", "Ontology_term=___" or "product=___". (Note the lowercase 'p' in 'product'. This is important; attributes starting with capital letters are reserved in the GFF3 spec.)

Another option is to feed GAG a tab-separated file containing your annotations. You can either make this yourself if you have user-generated annotations, or use Annie, the Annotation Extractor. Annie focuses on using manually currated UniProt Swissprot proteins to provide a possible gene name and protein function, and InterProScan to provide cross-references to annotations from functional databases. If you decide to generate this table yourself, as with the first option, you can add names to genes or annotations to mRNAs. Each row in the table should have three rows. The first column should contain the ID for the gene or mRNA to be annotated. For genes, the format is as follows:

(gene_id)	"name"	(gene_name)

Note that the second column must contain the word "name", just to emphasize that you are giving the gene a name. Genes can't take functions, database cross-references, etc. Those belong to mRNAs, whose table entries should look like this:

(mrna_id)	("Dbxref", "Ontology_term" or "product")	(annotation)

The three options given in the second column are the only recommended terms. Technically, though, GAG will add any annotation you choose. Annotate at your own risk! If you are adding db_xref terms, make sure they are from an NCBI approved database. You can find a list here: NCBI db_xref qualifiers page

All of this is (hopefully) made clearer with an example. Take a look at the genome contained in the folder 'walkthrough/annotation'. Here is the .gff:

foo_seq	maker	gene	6	35	0	+	0	ID=TEST_1;Name=Abc123
foo_seq	maker	mRNA	6	35	0	+	0	ID=TEST_1-RA;Parent=TEST_1;Dbxref=PFAM:0001;product=amazing_protein
foo_seq	maker	exon	6	10	0.94	+	0	ID=TEST_1-RA:exon:0;Parent=TEST_1-RA
foo_seq	maker	exon	15	25	0.94	+	0	ID=TEST_1-RA:exon:1;Parent=TEST_1-RA
foo_seq	maker	CDS	6	10	0	+	0	ID=TEST_1-RA:cds:0;Parent=TEST_1-RA
foo_seq	maker	CDS	15	25	0	+	0	ID=TEST_1-RA:cds:1;Parent=TEST_1-RA
foo_seq	maker	start_codon	6	8	0.94	+	0	ID=TEST_1-RA:start:0;Parent=TEST_1-RA
foo_seq	maker	stop_codon	23	25	0.94	+	0	ID=TEST_1-RA:stop:1;Parent=TEST_1-RA
foo_seq	maker	gene	50	100	0	+	0	ID=TEST_2
foo_seq	maker	mRNA	50	100	0	+	0	ID=TEST_2-RA;Parent=TEST_2
foo_seq	maker	exon	54	70	0.94	+	0	ID=TEST_2-RA:exon:0;Parent=TEST_2-RA
foo_seq	maker	exon	81	97	0.94	+	0	ID=TEST_2-RA:exon:1;Parent=TEST_2-RA
foo_seq	maker	CDS	54	70	0	+	0	ID=TEST_2-RA:cds:0;Parent=TEST_2-RA
foo_seq	maker	CDS	81	97	0	+	0	ID=TEST_2-RA:cds:1;Parent=TEST_2-RA
foo_seq	maker	start_codon	54	56	0.94	+	0	ID=TEST_1-RB:start:0;Parent=TEST_2-RA
foo_seq	maker	stop_codon	95	97	0.94	+	0	ID=TEST_1-RB:stop:1;Parent=TEST_2-RA

As you can see, the first gene ("TEST_1") has a 'Name' and its child mRNA ("TEST_1-RA") has 'Dbxref' and 'product' annotations. Now take a look at the file called 'genome.annotations':

TEST_2   	name         	Xyz987
TEST_2-RA	Dbxref       	PFAM:1123
TEST_2-RA	Ontology_term	GO:1234

This table contains annotations for the second gene in the genome. We'll add these annotations and write the genome to a .tbl file.

python gag.py\
    --fasta walkthrough/annotation/genome.fasta\
    --gff walkthrough/annotation/genome.gff\
    --anno walkthrough/annotation/genome.annotations\
    --out annotated

Take a look at the file 'annotated/genome.tbl'. It should look like this:

>Feature foo_seq
1	105	REFERENCE
			PBARC	12345
50	100	gene
			gene	Xyz987
			locus_tag	TEST_2
54	70	mRNA
81	97
			Dbxref	PFAM:1123
			Ontology_term	GO:1234
			product	hypothetical protein
54	70	CDS
81	97
			codon_start	1
			Dbxref	PFAM:1123
			Ontology_term	GO:1234
			product	hypothetical protein
6	35	gene
			gene	Abc123
			locus_tag	TEST_1
6	10	mRNA
15	25
			Dbxref	PFAM:0001
			product	amazing_protein
6	10	CDS
15	25
			codon_start	1
			Dbxref	PFAM:0001
			product	amazing_protein

As you can see, both genes now have names and their corresponding child features have functional annotations.