TransDecoder
Introduction
TransDecoder identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed based on RNA-Seq alignments to the genome using StringTie (Haas 2025).
TransDecoder is meant to be applied to an entire transcriptome of a single organism, i.e. thousands of transcript sequences as input. It is therefore unlikely to work if you provide only a small number of sequences, because it trains a species-specific model from hundreds of candidate ORFs derived from the input.
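A quick way to check whether your input is in that range is to count the sequences in your fasta file (the file name below is a placeholder):
grep -c ">" transcripts.fasta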
TransDecoder identifies likely coding sequences based on the following criteria:
- a minimum length open reading frame (ORF) is found in a transcript sequence
- a log-likelihood score similar to what is computed by the GeneID software is > 0.
- the above coding score is greatest when the ORF is scored in the 1st reading frame as compared to scores in the other 2 forward reading frames.
- if a candidate ORF is found fully encapsulated by the coordinates of another candidate ORF, the longer one is reported. However, a single transcript can report multiple ORFs (allowing for operons, chimeras, etc).
- a PSSM is built/trained/used to refine the start codon prediction.
- optional: the putative peptide has a match to a Pfam/Uniprot domain above the noise cutoff score.
Important: As of March 2024, TransDecoder is no longer actively supported by the developer. Please continue to use it as it fits your needs.
Installation
Installed on Crunchomics: Yes
- TransDecoder v5.7.1 is installed as part of the bioinformatics share. If you have access to Crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski, n.dombrowski@uva.nl.
- Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/
If you want to install it yourself, you can run:
mamba create --name transdecoder_5.7.1 -c bioconda transdecoder=5.7.1
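To verify that the environment works and the TransDecoder scripts are on your PATH, a minimal check (this assumes the --version flag of this release; running either command without arguments also prints its usage):
conda activate transdecoder_5.7.1
TransDecoder.LongOrfs --version
TransDecoder.Predict --version
conda deactivate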
Usage
Basic usage
conda activate transdecoder_5.7.1
# Extract the long open reading frames
TransDecoder.LongOrfs -t transcripts.fasta --output_dir results
# Predict the likely coding regions
# The final set of candidate coding regions can be found in files named 'transcripts.fasta.transdecoder.*', with extensions .pep, .cds, .gff3, and .bed
TransDecoder.Predict -t transcripts.fasta --output_dir results
conda deactivate
Output files generated:
- transcripts.fasta.transdecoder.pep : peptide sequences for the final candidate ORFs; all shorter candidates within longer ORFs were removed.
- transcripts.fasta.transdecoder.cds : nucleotide sequences for coding regions of the final candidate ORFs
- transcripts.fasta.transdecoder.gff3 : positions within the target transcripts of the final selected ORFs
- transcripts.fasta.transdecoder.bed : bed-formatted file describing ORF positions, best for viewing using GenomeView or IGV.
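To get a quick impression of the results, you can inspect the fasta headers of the .pep file, which report the ORF type (e.g. complete, 5prime_partial) and the coding score, and count the number of predicted ORFs (adjust the path if the files were written into your --output_dir):
grep ">" transcripts.fasta.transdecoder.pep | head
grep -c ">" transcripts.fasta.transdecoder.pep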
Useful options:
TransDecoder.LongOrfs:
- -t : transcripts fasta file
- --gene_trans_map : gene-to-transcript identifier mapping file (tab-delimited: gene_id <tab> trans_id)
- -m <int> : minimum protein length (default: 100)
- --genetic_code | -G <string> : genetic code (default: universal); view the help for all options
- --complete_orfs_only : yields only complete ORFs (peps start with Met (M), end with stop (*))
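For example, a minimal sketch that keeps only complete ORFs of at least 50 amino acids (the input file name is a placeholder):
TransDecoder.LongOrfs -t transcripts.fasta -m 50 --complete_orfs_only --output_dir results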
TransDecoder.Predict:
- --single_best_only : retain only the single best ORF per transcript (prioritized by homology, then ORF length)
- --retain_long_orfs_mode <string> : ‘dynamic’ or ‘strict’ (default: dynamic); in dynamic mode, the length cutoff is set according to a 1% FDR in random sequences of the same GC content
- --retain_long_orfs_length <int> : under ‘strict’ mode, retain all ORFs that are equal to or longer than this many nucleotides, even if no other evidence marks them as coding (default: 1000000, so essentially turned off by default)
- --retain_pfam_hits <string> : domain table output file from running hmmscan to search Pfam (see transdecoder.github.io for info); any ORF with a Pfam domain hit will be retained in the final output
- --retain_blastp_hits <string> : blastp output in ‘-outfmt 6’ format; any ORF with a blast match will be retained in the final output
- --no_refine_starts : skip start refinement; start refinement identifies potential start codons for 5’ partial ORFs using a PWM and is on by default
- -T <int> : top longest ORFs to train the Markov Model (hexamer stats) (default: 500); note that 10x this value are first selected for removing redundancies, and then this -T value of longest ORFs is selected from the non-redundant set
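For example, a minimal sketch that reports only the single best ORF per transcript while keeping the other defaults (the input file name is a placeholder and must match the one used for TransDecoder.LongOrfs):
TransDecoder.Predict -t transcripts.fasta --single_best_only --output_dir results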
Starting from a genome-based transcript structure GTF file
The process here is identical to the above with the exception that we must first generate a fasta file corresponding to the transcript sequences, and in the end, we recompute a genome annotation file in GFF3 format that describes the predicted coding regions in the context of the genome.
# Construct the transcript fasta file using the genome and the transcripts.gtf
gtf_genome_to_cdna_fasta.pl stringtie.gtf genome.fasta > transcripts.fasta
# Convert gtf to gff
gtf_to_alignment_gff3.pl stringtie.gtf > stringtie.gff3
# Extract open reading frames
TransDecoder.LongOrfs -t transcripts.fasta --output_dir results
# Predict the likely coding regions
TransDecoder.Predict -t transcripts.fasta --output_dir results
# Generate a genome-based coding region annotation file
cdna_alignment_orf_to_genome_orf.pl results/transcripts.fasta.transdecoder.gff3 \
    stringtie.gff3 \
    transcripts.fasta > results/transcripts.fasta.transdecoder.genome.gff3
# Convert to gtf as companion to the gff3 file
gff3_gene_to_gtf_format.pl results/transcripts.fasta.transdecoder.genome.gff3 \
    genome.fasta > results/transcripts.fasta.transdecoder.genome.gtf
# Optional: Clean the gtf file
python /zfs/omics/projects/bioinformatics/scripts/edit_transdecoder_gtf.py \
--input results/transcripts.fasta.transdecoder.genome.gtf \
--output results/transcripts.fasta.transdecoder.genome.clean.gtf
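As a quick sanity check of the genome-based annotation, you can count the predicted genes and mRNAs in the GFF3 file (a minimal sketch; adjust the path if you used different output names):
awk '$3 == "gene"' results/transcripts.fasta.transdecoder.genome.gff3 | wc -l
awk '$3 == "mRNA"' results/transcripts.fasta.transdecoder.genome.gff3 | wc -l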
Include homology searches as ORF retention criteria
# Extract the long open reading frames
TransDecoder.LongOrfs -t transcripts.fasta --output_dir results
# Blastp Against uniprot database
blastp -query results/transcripts.fasta.transdecoder_dir/longest_orfs.pep \
-db /zfs/omics/projects/bioinformatics/databases/uniprot/uniprot_sprot_070524.fasta -max_target_seqs 1 \
-outfmt 6 -evalue 1e-5 -num_threads 30 > results/blastp.outfmt6
# Hmmsearch against PFAM
hmmsearch --cpu 30 -E 1e-10 --domtblout results/pfam.domtblout \
    /zfs/omics/projects/bioinformatics/databases/pfam/release_36.0/2-Pfam-A.hmm \
    results/transcripts.fasta.transdecoder_dir/longest_orfs.pep
# Integrate the homology searches in the prediction step
TransDecoder.Predict -t transcripts.fasta --output_dir results \
--retain_pfam_hits results/pfam.domtblout \
--retain_blastp_hits results/blastp.outfmt6
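To see how many ORFs are supported by homology evidence, you can count the unique sequence identifiers in each search result (a minimal sketch; in blast tabular output the query ORF is in the first column, and in the hmmsearch domain table the target ORF is also in the first column):
# ORFs with a blastp hit against Uniprot
cut -f1 results/blastp.outfmt6 | sort -u | wc -l
# ORFs with a Pfam domain hit
grep -v "^#" results/pfam.domtblout | awk '{print $1}' | sort -u | wc -l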
If you want to generate your own databases, you can do the following:
# Get the Uniprot database
today=$(date -I)
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
mv uniprot_sprot.fasta uniprot_sprot_${today}.fasta
makeblastdb -in uniprot_sprot_${today}.fasta -parse_seqids -dbtype prot
# Get the Pfam database
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.clans.tsv.gz
gzip -d *gz
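Note that the hmmsearch command shown above works directly on the plain Pfam-A.hmm file. If you prefer to run hmmscan instead, the profile database first has to be indexed with hmmpress:
hmmpress Pfam-A.hmm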