GAG is released under the MIT License.
GAG can read a genome and write it to the NCBI's .tbl format. This is the main purpose of the program. However, if you're writing a .tbl file, there's a good chance that your genome is destined to encounter tbl2asn further downstream. In anticipation of this meeting, GAG includes a (growing) number of options to remove questionable features or trim erroneous regions from your genome. It can also be used to add functional annotations. These options are outlined in the Advanced Usage section.
GAG is written in Python 2. Assuming you have Python 2 (not 3!) installed, you can download the zipped source code from the link at the top of this page, extract it and run it.
Here's what a very simple GAG session might look like:
python gag.py --fasta organism.fasta --gff organism.gff --out gag_output
This command will create a subfolder called 'gag_output' containing, among other things, a .tbl file.
What is a genome? Great question. A genome can be many things to many people. To GAG, a genome is a .fasta file and a .gff file.
You must provide GAG with a .fasta file of nucleic acid sequence data. (More info here.) It is important to note that GAG is only expecting to find 'A', 'C', 'G', 'T' or 'N' (upper or lowercase) in your sequences. Mixed bases ('R', 'M', etc.) or invalid bases ('Q', 'ㅊ', etc.) are converted to 'N' for the purpose of protein translation, start/stop codon detection or whatever else we might dream up.
GAG will accept a scaffolded or unscaffolded assembly, though a scaffolded assembly is recommended.
For best results, use a valid GFF3 file. The GFF3 specification can be found at the Sequence Ontology homepage, and there is a validator here. The genometools software can also be used to correct simple errors in gff3 files prior to utilizing GAG. (For unpredictable results, go ahead and try to load an invalid file. It's been known to work. Who knows?)
Your hopefully valid GFF3 file should contain features identified as 'gene', 'mRNA', 'exon', and 'CDS', along with possible companions 'start_codon' and 'stop_codon'. There are other valid features, but at the moment GAG will ignore them, writing them to a file called 'genome.ignored.gff' for posterity. (You can always reunite these features with your final result by typing
cat genome.ignored.gff genome.gff > reunited.gff from a Linux prompt.) If you have a pet feature that you think belongs in your .tbl file, please visit the Issues Page and let us know.
Column 9 of a GFF3 file is the "attributes" column. A unique "ID" field is required for each feature; a "Parent" field is required for all non-gene features; everything else is gravy. For example, add a unique "Name" field to your gene and it will be included in your .tbl file. Attach "Dbxref" or "Ontology_term" fields to your mRNA feature and these annotations will be added when you write a .tbl file. See Functional Annotations for more information and examples.
Please cite as follows:
Hall, B., DeRego, T., & Geib, S. (2014). GAG: the Genome Annotation Generator (Version 1.0) [Software]. Available from http://genomeannotation.github.io/GAG.
The following vignettes demonstrate some of GAG's capabilities for removing questionable features of your genome, fixing mistakes, and flagging features for further investigation.
Depending on the assembler you used, your genome may contain sequences that begin or end with unknown bases ("N" or "n"). The NCBI will not accept submissions containing such sequences. For example, take a look at the file
walkthrough/terminal_ns/genome.fasta. It is crawling with unknown bases. This type of sequence will earn your genome submission a big red REJECTED stamp.
GAG can check each sequence in your .fasta for 'N's at the beginning or the end. If it finds any, it will remove them from the sequence and (if the bases were at the beginning of the sequence) update the indices of all features on the sequence accordingly. If by chance you have a gene that starts on one of these unknown bases, it will be removed. Such is life. (Note: GAG only removes 'N's at the beginning or end of a scaffold; it does not remove unknown bases within the assembly.)
To harness this unbelievable power, simply run GAG with the
python gag.py\ --fasta walkthrough/terminal_ns/genome.fasta\ --gff walkthrough/terminal_ns/genome.gff\ --fix_terminal_ns\ --out terminal_ns_fixed
If you now open the file
terminal_ns_removed/genome.fasta you can see that the 'N's at the beginning and end of the sequence are gone. (Compare this file with
walkthrough/terminal_ns/genome.fasta.) Additionally, the indices of each feature in 'terminal_ns_removed/genome.gff' are updated to correspond to the new, cleaned up sequence.
Flagging and Removing Short Introns (and so much more)
Another NCBI guideline involves minimum length of introns. Ten seems to be the magic number. It's pretty common to find structural annotations coming from automated annotation pipelines with shorter introns, though. Luckily, GAG can flag (or remove) them for you. Is it possible there really are introns shorter than 10 bp in your genome? I guess so. Should you just go in and remove every mRNA with a short intron? Or every mRNA coding a protein shorter than a certain length? That is up to you. But tbl2asn and NCBI have specific minimal size limits for introns, peptide lengths, etc. So if you have something you want to keep that is outside of these limits, then take it up with them. We suggest a flagging and reviewing approach for those who are cautious.
Below is an example of flagging and removing short introns; for a full list of commands type
To begin, let's get some basic stats on our genome. Run GAG without any special options:
gag.py -f walkthrough/short_introns/genome.fasta -g walkthrough/short_introns/genome.gff
Now take a look at the stats file located at
gag_output/genome.stats. You'll see a lot of info; here are the good bits:
Number of sequences: 1 Genome ------ Total sequence length 105 Number of genes 2 ... Shortest intron 6 ...
So we have two genes, and at least one of them has a suspiciously short intron. Take a look at
foo_seq maker gene 6 35 0 - 0 ID=TEST_1 foo_seq maker mRNA 6 35 0 - 0 ID=TEST_1-RA;Parent=TEST_1 foo_seq maker exon 6 10 0.94 - 0 ID=TEST_1-RA:exon:0;Parent=TEST_1-RA foo_seq maker exon 15 25 0.94 - 0 ID=TEST_1-RA:exon:1;Parent=TEST_1-RA foo_seq maker CDS 6 10 0 - 0 ID=TEST_1-RA:cds:0;Parent=TEST_1-RA foo_seq maker CDS 15 25 0 - 0 ID=TEST_1-RA:cds:1;Parent=TEST_1-RA
If you do the math, you'll see that the intron between those two exons is only 4bp long. We have found our culprit.
To address this, run GAG again, this time with the
gag.py -f walkthrough/short_introns/genome.fasta -g walkthrough/short_introns/genome.gff -o flagged_intron--flag_introns_shorter_than 10
Inspecting the results in
foo_seq maker gene 6 35 . - . ID=TEST_1 foo_seq maker mRNA 6 35 . - . ID=TEST_1-RA;Parent=TEST_1 foo_seq maker exon 6 10 0.94 - . ID=TEST_1-RA:exon:0;Parent=TEST_1-RA;gag_flag=intron_min_length:10 foo_seq maker exon 15 25 0.94 - . ID=TEST_1-RA:exon:1;Parent=TEST_1-RA;gag_flag=intron_min_length:10 foo_seq maker CDS 6 10 . - 0 ID=TEST_1-RA:cds:0;Parent=TEST_1-RA foo_seq maker CDS 15 25 . - 0 ID=TEST_1-RA:cds:1;Parent=TEST_1-RA
The only difference is in the very last column -- both of those exons have an added annotation. It says
At this point, you can load the flagged genome into a genome browser and search for features containing the annotation 'gag_flag' in order to inspect them more closely. If you determine that they don't belong in your genome, return to GAG and replace 'flag' with 'remove' in the above command.
One final word about removing features like this: GAG's policy is to remove the parent mRNA when an offending intron, exon or CDS length is detected. If this leaves the parent gene without any mRNAs, the gene is removed as well. In other words, GAG removes empty genes and mRNAs by default. If you have empty features that you'd like to keep for some reason, stop by the Issues page and let us know.
Creating Start and Stop Codons
Certain configurations of certain annotation pipelines/software will produce lines in the GFF3 file indicating 'start_codon' and 'stop_codon' features. As mentioned above, GAG reads and stores these features. If they are missing from your annotations (which is common), you must create them. If they are present in your gff3 file and you don't run this command, GAG will assume that they are correct and translate based off of them. If you have explicit start and stop codon features in your GFF3 file and you have reason to believe they are incorrect, or if they're missing entirely, run the command below and GAG will recalculate (assuming standard eukaryotic start and stop codons).
As an example, take a look at the genome contained in the folder
walkthrough/start_stop_codons. The .fasta looks like this:
and the .gff looks like this:
foo_seq maker gene 6 35 0 + 0 ID=TEST_1 foo_seq maker mRNA 6 35 0 + 0 ID=TEST_1-RA;Parent=TEST_1 foo_seq maker exon 6 10 0.94 + 0 ID=TEST_1-RA:exon:0;Parent=TEST_1-RA foo_seq maker exon 15 24 0.94 + 0 ID=TEST_1-RA:exon:1;Parent=TEST_1-RA foo_seq maker CDS 6 10 0 + 0 ID=TEST_1-RA:cds:0;Parent=TEST_1-RA foo_seq maker CDS 15 24 0 + 0 ID=TEST_1-RA:cds:1;Parent=TEST_1-RA
If you use GAG to write this genome to an output folder, the file 'genome.stats' will contain a lot of statistics; here are the highlights:
Number of sequences: 1 Genome ------ Total sequence length 105 Number of genes 1 ... Number of CDS 1 CDS: complete 0 CDS: start, no stop 0 CDS: stop, no start 0 CDS: no stop, no start 1
So we have one CDS, and it has neither a stop nor a start. However, if you get out your reading glasses and examine the .gff file, you can see that the first three bases of that cds are at indices 6, 7, 8. If you count off the bases in the .fasta, you'll see that the corresponding nucleotides are 'atg'. Looks like a start codon to me. You can perform a similar exercise to find the last three bases, or just trust that we've rendered them in lowercase for convenience -- they are 'tag', a stop codon.
To repair this oversight, run GAG with the
python gag.py\ --fasta walkthrough/start_stop_codons/genome.fasta\ --gff walkthrough/start_stop_codons/genome.gff\ --fix_start_stop\ --out start_stop_fixed
Take a look at the
start_stop_fixed/genome.stats and marvel at the improvement. Here are the relevant lines:
GAG> info ---------------------------------------- Number of sequences: 1 Reference Genome Modified Genome ---------------- --------------- Total sequence length 105 105 Number of genes 1 1 ... Number of CDS 1 1 CDS: complete 0 1 CDS: start, no stop 0 0 CDS: stop, no start 0 0 CDS: no stop, no start 1 0
That CDS is now counted as "complete", meaning we found the start and stop. When you write it to .tbl format, this information will be carried over and will prevent those insidious "ERROR: PartialProblem" messages from tbl2asn.
GAG supports two complementary methods of adding functional annotations to your genome. The easiest way is to include them in column 9 of your .gff file. Adding a "Name=___" entry to a gene will result in this gene name being added to your .tbl file. You can then add entries to its child mRNAs using one of the following keywords: "Dbxref=___", "Ontology_term=___" or "product=___". (Note the lowercase 'p' in 'product'. This is important; attributes starting with capital letters are reserved in the GFF3 spec.)
Another option is to feed GAG a tab-separated file containing your annotations. You can either make this yourself if you have user-generated annotations, or use Annie, the Annotation Extractor. Annie focuses on using manually currated UniProt Swissprot proteins to provide a possible gene name and protein function, and InterProScan to provide cross-references to annotations from functional databases. If you decide to generate this table yourself, as with the first option, you can add names to genes or annotations to mRNAs. Each row in the table should have three rows. The first column should contain the ID for the gene or mRNA to be annotated. For genes, the format is as follows:
(gene_id) "name" (gene_name)
Note that the second column must contain the word "name", just to emphasize that you are giving the gene a name. Genes can't take functions, database cross-references, etc. Those belong to mRNAs, whose table entries should look like this:
(mrna_id) ("Dbxref", "Ontology_term" or "product") (annotation)
The three options given in the second column are the only recommended terms. Technically, though, GAG will add any annotation you choose. Annotate at your own risk! If you are adding db_xref terms, make sure they are from an NCBI approved database. You can find a list here: NCBI db_xref qualifiers page
All of this is (hopefully) made clearer with an example. Take a look at the genome contained in the folder 'walkthrough/annotation'. Here is the .gff:
foo_seq maker gene 6 35 0 + 0 ID=TEST_1;Name=Abc123 foo_seq maker mRNA 6 35 0 + 0 ID=TEST_1-RA;Parent=TEST_1;Dbxref=PFAM:0001;product=amazing_protein foo_seq maker exon 6 10 0.94 + 0 ID=TEST_1-RA:exon:0;Parent=TEST_1-RA foo_seq maker exon 15 25 0.94 + 0 ID=TEST_1-RA:exon:1;Parent=TEST_1-RA foo_seq maker CDS 6 10 0 + 0 ID=TEST_1-RA:cds:0;Parent=TEST_1-RA foo_seq maker CDS 15 25 0 + 0 ID=TEST_1-RA:cds:1;Parent=TEST_1-RA foo_seq maker start_codon 6 8 0.94 + 0 ID=TEST_1-RA:start:0;Parent=TEST_1-RA foo_seq maker stop_codon 23 25 0.94 + 0 ID=TEST_1-RA:stop:1;Parent=TEST_1-RA foo_seq maker gene 50 100 0 + 0 ID=TEST_2 foo_seq maker mRNA 50 100 0 + 0 ID=TEST_2-RA;Parent=TEST_2 foo_seq maker exon 54 70 0.94 + 0 ID=TEST_2-RA:exon:0;Parent=TEST_2-RA foo_seq maker exon 81 97 0.94 + 0 ID=TEST_2-RA:exon:1;Parent=TEST_2-RA foo_seq maker CDS 54 70 0 + 0 ID=TEST_2-RA:cds:0;Parent=TEST_2-RA foo_seq maker CDS 81 97 0 + 0 ID=TEST_2-RA:cds:1;Parent=TEST_2-RA foo_seq maker start_codon 54 56 0.94 + 0 ID=TEST_1-RB:start:0;Parent=TEST_2-RA foo_seq maker stop_codon 95 97 0.94 + 0 ID=TEST_1-RB:stop:1;Parent=TEST_2-RA
As you can see, the first gene ("TEST_1") has a 'Name' and its child mRNA ("TEST_1-RA") has 'Dbxref' and 'product' annotations. Now take a look at the file called 'genome.annotations':
TEST_2 name Xyz987 TEST_2-RA Dbxref PFAM:1123 TEST_2-RA Ontology_term GO:1234
This table contains annotations for the second gene in the genome. We'll add these annotations and write the genome to a .tbl file.
python gag.py\ --fasta walkthrough/annotation/genome.fasta\ --gff walkthrough/annotation/genome.gff\ --anno walkthrough/annotation/genome.annotations\ --out annotated
Take a look at the file 'annotated/genome.tbl'. It should look like this:
>Feature foo_seq 1 105 REFERENCE PBARC 12345 50 100 gene gene Xyz987 locus_tag TEST_2 54 70 mRNA 81 97 Dbxref PFAM:1123 Ontology_term GO:1234 product hypothetical protein 54 70 CDS 81 97 codon_start 1 Dbxref PFAM:1123 Ontology_term GO:1234 product hypothetical protein 6 35 gene gene Abc123 locus_tag TEST_1 6 10 mRNA 15 25 Dbxref PFAM:0001 product amazing_protein 6 10 CDS 15 25 codon_start 1 Dbxref PFAM:0001 product amazing_protein
As you can see, both genes now have names and their corresponding child features have functional annotations.