TD2

Introduction

TD2 identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly with Trinity or constructed from RNA-Seq alignments to the genome with StringTie (Mao et al. 2025). TD2 is the successor of TransDecoder and uses a similar workflow and command syntax. TD2 is a de novo ORF finder; if a reference genome annotation is available, the authors recommend using ORFanage instead (https://github.com/alevar/ORFanage).

For all usage information, please visit the Wiki.

Installation

Installed on Crunchomics: Yes

  • TD2 v1.0.6 is installed as part of the bioinformatics share. If you have access to Crunchomics but do not yet have access to the bioinformatics share, send an email with your UvA netID to Nina Dombrowski, n.dombrowski@uva.nl.
  • Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

mamba create -n td2 python=3.11
conda activate td2 
pip install TD2
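
To confirm that a self-installed environment is usable, a quick check along the following lines may help. This is a minimal sketch: check_tool is a hypothetical helper name, and the only assumption about TD2 is that the two commands used in the Usage section (TD2.LongOrfs and TD2.Predict) are on the PATH once the environment is activated.

```shell
# check_tool is a hypothetical helper: report whether a command is on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found"
  fi
}

# Run this after `conda activate td2`.
for tool in TD2.LongOrfs TD2.Predict; do
  check_tool "$tool"
done
```

If either command is reported as not found, the environment is probably not activated or the pip install step did not complete.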

Usage

The code below can be used to predict coding regions from a transcript file and predict the likely coding genes. Optionally, you can identify ORFs with homology to known proteins via MMseqs2, blastp or HMMER3 searches as described in the tool's wiki.

conda activate TD2_1.0.6 

mkdir -p results/td2

# Predict coding regions from a transcript file
# TD2 will choose the single best ORF per transcript by default, similar to TransDecoder's --single_best_only option
# This behavior can be turned off by passing the --all-good flag to TD2.Predict
TD2.LongOrfs \
  -t transcripts.fasta \
  -O results/td2

# Predict the likely coding genes
TD2.Predict \
  -t transcripts.fasta \
  -O results/td2 

# The final outputs are generated in the current working directory 
# Move these files to the td2 folder 
mv transcripts.fasta.TD2.* results/td2
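
After the files have been moved, a quick sanity check is to count how many ORFs were reported. The sketch below assumes that the final outputs include a peptide FASTA named transcripts.fasta.TD2.pep; that file name is an assumption, so adjust it to whatever the mv step actually moved. count_fasta_records is a hypothetical helper.

```shell
# count_fasta_records is a hypothetical helper: count header lines (">") in a FASTA file.
count_fasta_records() {
  awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Assumed output name; check the contents of results/td2 and adjust if needed.
pep=results/td2/transcripts.fasta.TD2.pep

if [ -f "$pep" ]; then
  echo "Predicted proteins: $(count_fasta_records "$pep")"
else
  echo "$pep not found; list results/td2 to see the actual output names"
fi
```

With the default settings, this count should not exceed the number of input transcripts, because TD2 reports a single best ORF per transcript unless --all-good is set.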

Useful options TD2.LongOrfs:

options:
  -h, --help            show this help message and exit
  -O OUTPUT_DIR, --output-dir OUTPUT_DIR
                        path to output results, default=./{transcripts}
  --precise             set --precise to enable precise mode. Equivalent to -m 98 -M 98 for TD2.LongOrfs,
                        default=False
  -m MINIMUM_LENGTH, --min-length MINIMUM_LENGTH
                        minimum protein length for proteins in long transcripts, default=90
  -M ABSOLUTE_MIN, --absolute-min-length ABSOLUTE_MIN
                        minimum protein length for proteins in short transcripts, default=90
  -L LEN_SCALE, --length-scale LEN_SCALE
                        allow short ORFs in short transcripts if the ORF is at least a fraction of the total
                        transcript length, default=1.1 (essentially off by default). You must also specify -M
                        to a lower minimum ORF length to work with -L
  -S, --strand-specific
                        set -S for strand-specific ORFs (only analyzes top strand), default=False
  -G GENETIC_CODE, --genetic-code GENETIC_CODE
                        genetic code (NCBI integer code), default=1 (universal)
  --complete-orfs-only  ignore all ORFs without both a stop and start codon, default=False
  --alt-start           include alternative initiator codons, default=False
  --all-stopless        report stopless sequences rather than ORFs, i.e. never require a start codon,
                        default=False
  --top TOP             record the top N CDS transcripts by length, default=0
  --gene-trans-map GENE_TRANS_MAP
                        gene-to-transcript mapping file (tab-delimited)
  -v, --verbose         set -v for verbose output, default=False
  -@ THREADS, --threads THREADS
                        number of threads to use, default=64
  -% MEMORY_THRESHOLD, --memory-threshold MEMORY_THRESHOLD
                        percent of available memory to use per batch, default=None

required arguments:
  -t TRANSCRIPTS        REQUIRED path to transcripts.fasta
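
As an illustration of how these options combine, the sketch below runs TD2.LongOrfs on a strand-specific assembly and allows shorter ORFs in short transcripts. The numeric values (50, 0.5, 8) are illustrative assumptions rather than recommendations, and run_longorfs is a hypothetical wrapper. As the help text states, -L only takes effect when -M is also lowered.

```shell
# Illustrative values only; tune them for your own data.
MIN_SHORT=50     # -M: minimum protein length in short transcripts
LEN_SCALE=0.5    # -L: ORF must cover at least this fraction of the transcript
THREADS=8        # -@: number of threads (the default is 64)

# run_longorfs is a hypothetical wrapper: $1 = transcripts FASTA, $2 = output directory.
run_longorfs() {
  TD2.LongOrfs -t "$1" -O "$2" -M "$MIN_SHORT" -L "$LEN_SCALE" -S -@ "$THREADS"
}

if command -v TD2.LongOrfs >/dev/null 2>&1 && [ -f transcripts.fasta ]; then
  run_longorfs transcripts.fasta results/td2
else
  echo "TD2.LongOrfs or transcripts.fasta not available; skipping"
fi
```

Setting the thread count explicitly is worth considering on a shared cluster, where the default of 64 may exceed the resources allocated to a job.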

Useful options TD2.Predict:

options:
  -h, --help            show this help message and exit
  --precise             set --precise to enable precise mode. Equivalent to -P 0.9 and --retain-long-orfs-fdr
                        0.005 for TD2.Predict, default=False
  -P PSAURON_CUTOFF     minimum in-frame PSAURON score required to report ORF assuming no homology hits, higher
                        is less sensitive and more precise (range: [0,1]; default: 0.50)
  --all-good            report all ORFs that pass PSAURON and/or length-based false discovery filters,
                        default=False
  --retain-mmseqs-hits RETAIN_MMSEQS_HITS
                        mmseqs output in '.m8' format. Complete ORFs with a MMseqs2 match will be retained in
                        the final output.
  --retain-blastp_hits RETAIN_BLASTP_HITS
                        blastp output in '-outfmt 6' format. Complete ORFs with a blastp match will be retained
                        in the final output.
  --retain-hmmer_hits RETAIN_HMMER_HITS
                        domain table output file from running hmmer to search Pfam. Complete ORFs with a Pfam
                        domain hit will be retained in the final output.
  --retain-long-orfs-mode RETAIN_LONG_ORFS_MODE
                        dynamic: retain ORFs longer than a threshold length determined by calculating the FDR
                        for each transcript's GC percent; strict: retain ORFs with length above constant length
  --retain-long-orfs-fdr RETAIN_LONG_ORFS_FDR
                        in "--retain-long-orfs-mode dynamic" mode, set the False Discovery Rate used to
                        calculate dynamic threshold, default=0.10
  --retain-long-orfs-length RETAIN_LONG_ORFS_LENGTH
                        in "--retain-long-orfs-mode strict" mode, retain all ORFs found that are equal or
                        longer than these many nucleotides even if no other evidence marks it as coding,
                        default=100000
  --discard-encapsulated
                        retain ORFs that are fully contained within larger ORFs, default=False
  --complete-orfs-only  discard all ORFs without both a stop and start codon, default=False
  --psauron-all-frame   require ORF to have highest PSAURON score compared to all other reading frames, set
                        this argument for less sensitive and more precise ORFs, can dramatically increase
                        compute time requirements, default=False
  -G GENETIC_CODE       genetic code a.k.a. translation table, NCBI integer codes, default=1
  -O OUTPUT_DIR         same output directory from LongOrfs
  -v, --verbose         verbose output with progress bars, default=False

required arguments:
  -t TRANSCRIPTS        REQUIRED path to transcripts.fasta
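
For a more conservative prediction, the help text above describes --precise as equivalent to -P 0.9 together with --retain-long-orfs-fdr 0.005. The sketch below spells out those two options explicitly so that each threshold can be adjusted independently. run_predict_strict is a hypothetical wrapper, and the output directory is assumed to be the one already used for TD2.LongOrfs.

```shell
# Thresholds taken from the --precise description in the help text.
PSAURON_CUTOFF=0.9    # -P: minimum in-frame PSAURON score
LONG_ORF_FDR=0.005    # --retain-long-orfs-fdr: FDR for the dynamic length threshold

# run_predict_strict is a hypothetical wrapper: $1 = transcripts FASTA, $2 = LongOrfs output directory.
run_predict_strict() {
  TD2.Predict -t "$1" -O "$2" -P "$PSAURON_CUTOFF" --retain-long-orfs-fdr "$LONG_ORF_FDR"
}

if command -v TD2.Predict >/dev/null 2>&1 && [ -f transcripts.fasta ]; then
  run_predict_strict transcripts.fasta results/td2
else
  echo "TD2.Predict or transcripts.fasta not available; skipping"
fi
```

Raising the PSAURON cutoff trades sensitivity for precision, so fewer ORFs should be expected than with the default of 0.50.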

References

Mao, Alan, Hyun Joo Ji, Brian J. Haas, Steven L. Salzberg, and Markus J. Sommer. 2025. “TD2: Finding Protein Coding Regions in Transcripts.” http://dx.doi.org/10.1101/2025.04.13.648579.