TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using StringTie (Haas 2025).
TransDecoder is designed to run on the entire transcriptome of a single organism, with thousands of transcript sequences as input. It is therefore unlikely to work well on a small number of sequences, because it has to train a species-specific model on hundreds of candidate ORFs derived from the input.
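Because the model needs enough training candidates, it can help to check the size of the input before starting. The following is a minimal sketch, not part of TransDecoder: the function names are hypothetical and the default threshold of 500 sequences is only an illustrative assumption.

```shell
# Count FASTA records by counting header lines (lines starting with '>').
count_fasta_records() {
    grep -c '^>' "$1" || true
}

# Warn when the input looks too small to train a model on.
# The default threshold (500) is an illustrative assumption, not a TransDecoder rule.
check_transcript_count() {
    local fasta="$1" min="${2:-500}" n
    if [ ! -r "$fasta" ]; then
        echo "ERROR: cannot read $fasta" >&2
        return 2
    fi
    n=$(count_fasta_records "$fasta")
    if [ "$n" -lt "$min" ]; then
        echo "WARNING: $fasta has only $n sequences (< $min)" >&2
        return 1
    fi
    echo "OK: $fasta has $n sequences"
}
```

For example, `check_transcript_count target_transcripts.fasta` prints a warning and returns a non-zero status when the file holds fewer sequences than the threshold.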
TransDecoder identifies likely coding sequences based on the following criteria:
- a minimum length open reading frame (ORF) is found in a transcript sequence
- a log-likelihood score similar to what is computed by the GeneID software is > 0.
- the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 2 forward reading frames.
- if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc).
- a PSSM is built/trained/used to refine the start codon prediction.
- optional: the putative peptide has a match to a Pfam/Uniprot domain above the noise cutoff score.
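The containment rule in the list above can be illustrated with a small filter. This is only a sketch of the idea, not TransDecoder's own code: the function name and the three-column input format (transcript ID, start, end, tab-delimited) are assumptions made for the example.

```shell
# Drop candidate ORFs whose coordinates lie fully inside a longer ORF on the
# same transcript. Input columns (tab-delimited): transcript_id, start, end.
# Illustration only; this is not how TransDecoder stores its candidates.
filter_contained_orfs() {
    awk -F'\t' '
        { id[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0 }
        END {
            for (i = 1; i <= NR; i++) {
                keep = 1
                for (j = 1; j <= NR; j++) {
                    if (i == j || id[i] != id[j]) continue
                    # ORF j spans ORF i completely and is strictly longer
                    if (s[j] <= s[i] && e[j] >= e[i] && (e[j] - s[j]) > (e[i] - s[i])) {
                        keep = 0
                        break
                    }
                }
                if (keep) print id[i] "\t" s[i] "\t" e[i]
            }
        }
    ' "$1"
}
```

Overlapping ORFs that are not fully contained in another one are kept, which is consistent with a single transcript reporting multiple ORFs.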
Important: As of March 2024, TransDecoder is no longer actively supported by the developer. You can continue to use it as it fits your needs.
If you want to install it yourself, you can run:
mamba create --name transdecoder_5.7.1 -c bioconda transdecoder=5.7.1
Basic usage
conda activate transdecoder_5.7.1
# Extract the long open reading frames
TransDecoder.LongOrfs -t target_transcripts.fasta --output_dir results
# Predict the likely coding regions
# The final set of candidate coding regions can be found as files '*.transdecoder.*' where extensions include .pep, .cds, .gff3, and .bed
TransDecoder.Predict -t target_transcripts.fasta --output_dir results
conda deactivate
Output files generated:
- target_transcripts.fasta.transdecoder.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
- target_transcripts.fasta.transdecoder.cds : nucleotide sequences for coding regions of the final candidate ORFs
- target_transcripts.fasta.transdecoder.gff3 : positions within the target transcripts of the final selected ORFs
- target_transcripts.fasta.transdecoder.bed : bed-formatted file describing ORF positions, best for viewing using GenomeView or IGV.
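A quick way to inspect the result is to tally the ORF types recorded in the peptide file. The sketch below assumes that each FASTA header in the .pep file carries a `type:<label>` token (for example `type:complete` or `type:5prime_partial`); the function name is hypothetical.

```shell
# Tally ORF types from the headers of a TransDecoder .pep file.
# Assumes each header contains a "type:<label>" token, e.g. type:complete.
# Prints one line per type: <type><TAB><count>.
summarize_orf_types() {
    grep '^>' "$1" \
        | grep -o 'type:[A-Za-z0-9_]*' \
        | sort \
        | uniq -c \
        | awk '{ print $2 "\t" $1 }'
}
```

Calling `summarize_orf_types` on the final .pep file gives a rough idea of how many predictions are complete and how many are partial.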
Useful options:
-t <string>
: Transcripts fasta file
--gene_trans_map <string>
: gene-to-transcript identifier mapping file (tab-delimited: gene_id<tab>trans_id)
-m <int>
: minimum protein length (default: 100)
-G <string>
: genetic code (default: universal). View help for all options
--complete_orfs_only
: yields only complete ORFs (peps start with Met (M), end with stop (*))
--single_best_only
: Retain only the single best ORF per transcript (prioritized by homology, then ORF length)
--retain_long_orfs_mode <string>
: 'dynamic' or 'strict' (default: dynamic). In dynamic mode, sets the range according to a 1% FDR in random sequence of the same GC content.
--retain_long_orfs_length <int>
: under 'strict' mode, retain all ORFs that are at least this many nucleotides long, even if no other evidence marks them as coding (default: 1000000, so essentially turned off by default).
--retain_pfam_hits <string>
: domain table output file from running hmmscan to search Pfam. Any ORF with a Pfam domain hit will be retained in the final output.
--retain_blastp_hits <string>
: blastp output in '-outfmt 6' format. Any ORF with a blast match will be retained in the final output.
--no_refine_starts
: disables start refinement, which identifies potential start codons for 5' partial ORFs using a PWM and is on by default
-T <int>
: number of longest ORFs used to train the Markov model (hexamer stats) (default: 500). Note that 10x this number of ORFs are selected first for redundancy removal, and the -T longest ORFs are then taken from the non-redundant set.
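The gene-to-transcript mapping file is a two-column, tab-delimited text file that TransDecoder.LongOrfs accepts via `--gene_trans_map`. As a sketch, such a file could be derived from the FASTA headers, assuming Trinity-style transcript IDs in which the gene ID is the transcript ID without its trailing `_i<N>` isoform suffix. The function name is hypothetical, and other assemblers will need a different rule.

```shell
# Build a gene-to-transcript map (gene_id<TAB>trans_id) from FASTA headers.
# Assumes Trinity-style IDs where the gene ID is the transcript ID minus the
# trailing isoform suffix, e.g. TRINITY_DN1_c0_g1_i2 -> TRINITY_DN1_c0_g1.
make_gene_trans_map() {
    awk '/^>/ {
        trans = substr($1, 2)          # first header token without ">"
        gene = trans
        sub(/_i[0-9]+$/, "", gene)     # strip the isoform suffix
        print gene "\t" trans
    }' "$1"
}
```

Usage: `make_gene_trans_map target_transcripts.fasta > gene_trans_map.tsv`, where the output file name is arbitrary.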
Starting from a genome-based transcript structure GTF file
The process here is identical to the above with the exception that we must first generate a fasta file corresponding to the transcript sequences, and in the end, we recompute a genome annotation file in GFF3 format that describes the predicted coding regions in the context of the genome.
# Construct the transcript fasta file using the genome and the transcripts.gtf stringtie.gtf genome.fasta > transcripts.fasta
# Convert gtf to gff stringtie.gtf > stringtie.gff3
# Extract open reading frames
TransDecoder.LongOrfs -t transcripts.fasta --output_dir results
# Predict the likely coding regions
TransDecoder.Predict -t transcripts.fasta --output_dir results
# Generate a genome-based coding region annotation file results/transcripts.fasta.transdecoder.gff3 \
stringtie.gff3 > results/transcripts.fasta.transdecoder.genome.gff3
# Convert to gtf as companion to the gff3 file results/transcripts.fasta.transdecoder.genome.gff3 \
> \
# Optional: Clean the gtf file
python /zfs/omics/projects/bioinformatics/scripts/ \
--input results/transcripts.fasta.transdecoder.genome.gtf \
--output results/transcripts.fasta.transdecoder.genome.clean.gtf
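The genome-based steps up to the GFF3 projection can also be wrapped in one function. The sketch below uses the helper scripts distributed in the util directory of TransDecoder (gtf_genome_to_cdna_fasta.pl, gtf_to_alignment_gff3.pl, cdna_alignment_orf_to_genome_orf.pl) and follows their upstream usage, in which the last script also takes the transcript FASTA as a third argument. It assumes these scripts are on PATH in the active environment, which may not hold for every installation. The `run` helper, the `DRY_RUN` variable, and the function name are hypothetical.

```shell
# Run a command, or only print it to stderr when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "+ $*" >&2
    else
        "$@"
    fi
}

# Sketch of the genome-guided workflow.
# Assumes the TransDecoder util scripts are on PATH (an assumption; adjust
# the names or use full paths if they live elsewhere).
# Note: output files are created (empty) even in dry-run mode.
run_genome_guided_transdecoder() {
    local gtf="$1" genome="$2" outdir="${3:-results}"
    mkdir -p "$outdir" || return 1

    # 1. Transcript sequences from the GTF and the genome
    run gtf_genome_to_cdna_fasta.pl "$gtf" "$genome" > transcripts.fasta || return 1
    # 2. GTF to alignment-style GFF3
    run gtf_to_alignment_gff3.pl "$gtf" > stringtie.gff3 || return 1
    # 3. ORF extraction and coding region prediction
    run TransDecoder.LongOrfs -t transcripts.fasta --output_dir "$outdir" || return 1
    run TransDecoder.Predict -t transcripts.fasta --output_dir "$outdir" || return 1
    # 4. Project the predicted ORFs back onto genome coordinates
    run cdna_alignment_orf_to_genome_orf.pl \
        "$outdir/transcripts.fasta.transdecoder.gff3" \
        stringtie.gff3 \
        transcripts.fasta \
        > "$outdir/transcripts.fasta.transdecoder.genome.gff3" || return 1
}
```

Running `DRY_RUN=1 run_genome_guided_transdecoder stringtie.gtf genome.fasta results` prints the five commands without executing them, which is a cheap way to check the file names before a long run.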
Include homology searches as ORF retention criteria
# Extract the long open reading frames
TransDecoder.LongOrfs -t target_transcripts.fasta --output_dir results
# Blastp against the UniProt database
blastp -query results/target_transcripts.fasta.transdecoder_dir/longest_orfs.pep \
-db /zfs/omics/projects/bioinformatics/databases/uniprot/uniprot_sprot_070524.fasta -max_target_seqs 1 \
-outfmt 6 -evalue 1e-5 -num_threads 30 > results/blastp.outfmt6
# Hmmsearch against PFAM
hmmsearch --cpu 30 -E 1e-10 --domtblout results/pfam.domtblout \
# Integrate the homology searches in the prediction step
TransDecoder.Predict -t target_transcripts.fasta --output_dir results \
--retain_pfam_hits results/pfam.domtblout \
--retain_blastp_hits results/blastp.outfmt6
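Before passing the blastp table to the prediction step, it can be useful to see how many candidate ORFs received a hit at all. The following sketch relies only on the BLAST tabular format, where column 1 is the query ID; the function names are hypothetical.

```shell
# Count distinct query ORFs with at least one hit in a BLAST tabular
# (-outfmt 6) file; column 1 holds the query ID.
count_orfs_with_blast_hits() {
    cut -f1 "$1" | sort -u | grep -c . || true
}

# Report how many of the candidate ORFs in a peptide FASTA have a blastp hit.
report_blast_coverage() {
    local pep="$1" blast="$2" total hits
    total=$(grep -c '^>' "$pep" || true)
    hits=$(count_orfs_with_blast_hits "$blast")
    echo "$hits of $total candidate ORFs have a blastp hit"
}
```

A query can appear on several rows of the table (one per alignment), which is why the IDs are de-duplicated before counting.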
If you want to generate your own databases, you can do the following:
# Get the Uniprot database
today=$(date -I)
gzip -d uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta uniprot_sprot_${today}.fasta
makeblastdb -in uniprot_sprot_${today}.fasta -parse_seqids -dbtype prot
# Get the Pfam database
gzip -d *gz