
transvestigator

Validates transcriptome and prepares it for submission to the NCBI


Basic Usage

Assuming you have a transcriptome assembly consisting of a fasta and GFF3 file (creatively titled "transcriptome.fasta" and "transcriptome.gff"), type:

python3 transvestigator.py -f transcriptome.fasta -g transcriptome.gff

Optionally, you may use the -a flag to include an annotation file -- this is a tab-separated file of annotations in the format specified here.

Transcripts can be filtered according to a blacklist (using the -bl flag) and/or RSEM statistics (using the -r flag).

After the program runs, you will find a folder called "transvestigator_out", containing files called "transcriptome.new.gff", "transcriptome.new.fsa" and "transcriptome.new.tbl". The latter two are your input files for tbl2asn.

What It Does:

When you run transvestigator, it reads the transcriptome into memory. It then fixes feature lengths, creates starts and stops, removes multiple CDS features, adjusts CDS phase, places each transcript on the positive strand (if it is not there already), and writes a .gff, .fsa and .tbl file. If you included an annotation file, the .tbl file will contain functional annotations. If you provided a blacklist file, those transcripts will be excluded.
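Placing a transcript on the positive strand is not described in its own section below, so here is a minimal sketch of what such a step involves: reverse-complementing the sequence and mirroring the feature coordinates. The function names are hypothetical and this is not taken from the transvestigator source; it assumes 1-based, inclusive coordinates as used in GFF3.

```python
# Illustrative only: these helpers are assumptions, not transvestigator's code.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement; characters such as N are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]

def flip_coordinates(start, end, seq_length):
    """Mirror a 1-based, inclusive interval onto the opposite strand."""
    return seq_length - end + 1, seq_length - start + 1

print(reverse_complement("ATGC"))   # GCAT
print(flip_coordinates(2, 4, 10))   # (7, 9)
```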

Fix Feature Lengths

If the indices given for a gene, mRNA or CDS extend beyond the actual length of the sequence which contains them, the end index of the feature is adjusted to fall within the sequence boundaries.
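As a rough illustration of this fix, the sketch below clamps a feature's end index to the sequence length. The function name is hypothetical and the real implementation may differ; coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
def clamp_feature_end(start, end, seq_length):
    """Pull the end index back inside the sequence if it overshoots.

    Hypothetical helper for illustration; not transvestigator's own code.
    """
    if end > seq_length:
        end = seq_length
    return start, end

# A feature claiming to end at 1700 on a 1674-base sequence is trimmed.
print(clamp_feature_end(1, 1700, 1674))  # (1, 1674)
```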

Create Starts And Stops

The .tbl file indicates the presence of start and stop codons, but the .gff file associated with a transcriptome may not explicitly indicate the presence or absence of these features. So transvestigator inspects the first/last three bases of each CDS to determine whether they are a start/stop, then updates the transcriptome accordingly. This step is the key to eliminating the dreaded "PartialProblem" errors further downstream.
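The check described above might look something like the following sketch. It assumes the standard genetic code (ATG as the start codon; TAA, TAG and TGA as stops) and a CDS on the positive strand with 1-based, inclusive coordinates. The function name is hypothetical and the tool's actual logic may differ.

```python
# Assumption: standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def has_start_and_stop(seq, cds_start, cds_end):
    """Report whether the CDS begins with a start codon and ends with a stop."""
    cds = seq[cds_start - 1:cds_end].upper()  # convert to 0-based slicing
    has_start = cds[:3] == "ATG"
    has_stop = cds[-3:] in STOP_CODONS
    return has_start, has_stop

seq = "CCATGAAATAGGG"
print(has_start_and_stop(seq, 3, 11))  # (True, True)
print(has_start_and_stop(seq, 3, 10))  # (True, False)
```

A CDS that lacks one of the two would be marked as partial at that end in the .tbl file.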

Remove Multiple CDS Features

If a transcript contains multiple CDS features, the longest one is chosen and the others are discarded. Future work may involve more complicated algorithms for making this decision.
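A minimal sketch of the "keep the longest" rule, using the coordinates from the walkthrough data further down. Representing features as dictionaries is an assumption made for this example only.

```python
def longest_cds(cds_features):
    """Return the CDS spanning the most bases (1-based, inclusive coordinates).

    Illustrative helper; on a tie, max() keeps the first feature encountered.
    """
    return max(cds_features, key=lambda f: f["end"] - f["start"] + 1)

features = [
    {"id": "cds.m.4872", "start": 1, "end": 1674},
    {"id": "cds.m.4873", "start": 1568, "end": 1969},
]
print(longest_cds(features)["id"])  # cds.m.4872
```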

Adjust CDS Phase

A strange error arises in NCBI TSA submissions when a CDS begins at the second or third base of a sequence. The fix we apply changes the start index of the feature to 1 and adjusts its phase in order to compensate. This adds a "codon_start" annotation to the .tbl file and, more importantly, makes errors go away :)
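The arithmetic behind this adjustment can be sketched as follows. It assumes the GFF3 definition of phase (the number of bases to skip before the first complete codon) and that "codon_start" is simply phase plus one. The function name is hypothetical and this is not the tool's actual code.

```python
def shift_cds_to_first_base(start, phase=0):
    """Move a CDS starting at base 2 or 3 back to base 1, compensating via phase.

    Returns (new_start, new_phase, codon_start). Other starts are left alone.
    """
    if start not in (2, 3):
        return start, phase, phase + 1
    # Extending the feature left by (start - 1) bases adds that many
    # bases to skip before the first complete codon.
    new_phase = (start - 1 + phase) % 3
    return 1, new_phase, new_phase + 1

print(shift_cds_to_first_base(2))  # (1, 1, 2)
print(shift_cds_to_first_base(3))  # (1, 2, 3)
```

The reading frame is unchanged: a CDS that began at base 2 still has its first codon at bases 2 to 4, which is what "codon_start 2" tells tbl2asn.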

Citing transvestigator

Citation instructions and export to BibTeX, EndNote, etc. are available at the link below. Thanks for citing us!

10.5281/zenodo.10471

Example

Minimal examples of transcriptomes in need of the above fixes can be found in the 'walkthrough' directory of the repository. Here's an example of running transvestigator on the "Multiple CDS" data; the process for the other walkthroughs is similar.

Fixing a Transcriptome with Multiple CDS Features

First take a look at the file 'walkthrough/multi_cds/transcriptome.gff':


comp10026_c0_seq1	.	gene	1	1674	.	+	.	ID=g.4872
comp10026_c0_seq1	.	mRNA	1	1674	.	+	.	ID=m.4872;Parent=g.4872
comp10026_c0_seq1	.	exon	1	1674	.	+	.	ID=m.4872.exon1;Parent=m.4872
comp10026_c0_seq1	.	CDS	1	1674	.	+	.	ID=cds.m.4872;Parent=m.4872
comp10026_c0_seq1	.	gene	1568	1969	.	+	.	ID=g.4873
comp10026_c0_seq1	.	mRNA	1568	1969	.	+	.	ID=m.4873;Parent=g.4873
comp10026_c0_seq1	.	exon	1568	1969	.	+	.	ID=m.4873.exon1;Parent=m.4873
comp10026_c0_seq1	.	CDS	1568	1969	.	+	.	ID=cds.m.4873;Parent=m.4873

Note that the contig 'comp10026_c0_seq1' contains two sets of gene/mRNA/exon/CDS features. We can only have one in our final submission. Take care of this by running transvestigator.

If you're in the root folder of the repository, you can run it by typing

python3 transvestigator.py \
        -f walkthrough/multi_cds/transcriptome.fasta \
        -g walkthrough/multi_cds/transcriptome.gff

After the program runs, take a look at the file 'transvestigator_out/transcriptome.new.gff':


comp10026_c0_seq1	.	gene	1	1674	.	+	0	ID=g.4872
comp10026_c0_seq1	.	mRNA	1	1674	.	+	0	ID=m.4872;Parent=g.4872
comp10026_c0_seq1	.	exon	1	1674	.	+	0	ID=m.4872.exon1;Parent=m.4872
comp10026_c0_seq1	.	CDS	1	1674	.	+	0	ID=cds.m.4872;Parent=m.4872
comp10026_c0_seq1	.	stop_codon	1672	1	.	+	0	ID=m.4872:stop;Parent=m.4872

The shorter CDS has been removed. Of course, this may have removed a real CDS (maybe the longer one was wrong, or maybe this was a fusion of two transcripts); future improvements may provide more sophisticated approaches for dealing with multiple CDS features. This change also carries over to the .tbl file that is written. As an added bonus, we've discovered a stop codon :)