TD2
Introduction
TD2 identifies candidate coding regions within transcript sequences, such as those generated by de novo RNA-Seq transcript assembly using Trinity, or constructed from RNA-Seq alignments to the genome using StringTie (Mao et al. 2025). TD2 is the successor of TransDecoder and uses a similar workflow and command syntax. TD2 is a de novo ORF finder. If a reference genome annotation is available, the authors recommend using ORFanage instead (https://github.com/alevar/ORFanage).
For all usage information, please visit the Wiki.
Installation
Installed on Crunchomics: Yes
- TD2 v1.0.6 is installed as part of the bioinformatics share. If you have access to Crunchomics but do not yet have access to the bioinformatics share, send an email with your UvA netID to Nina Dombrowski, n.dombrowski@uva.nl.
- Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/
If you want to install it yourself, you can run:
mamba create -n td2 python=3.11
conda activate td2
pip install TD2
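After installing, it can help to confirm that the TD2 entry points are actually on your PATH. The helper below is a minimal sketch; `check_td2` is a hypothetical name introduced here for illustration and is not part of TD2.

```shell
# Report whether each named command is available on PATH.
# Returns 0 if all are found, 1 if at least one is missing.
check_td2() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

With the environment activated, `check_td2 TD2.LongOrfs TD2.Predict` should list both tools as found.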
Usage
The code below can be used to predict coding regions from a transcript file and predict the likely coding genes. Optionally, you can identify ORFs with homology to known proteins via MMseqs2, blastp or HMMER3 searches as described in the tool's wiki.
conda activate TD2_1.0.6
mkdir -p results/td2
# Predict coding regions from a transcript file
# By default, TD2 reports the single best ORF per transcript, similar to TransDecoder's single-best-only mode
# This behavior can be turned off with the --all-good flag of TD2.Predict
TD2.LongOrfs \
-t transcripts.fasta \
-O results/td2
# Predict the likely coding genes
TD2.Predict \
-t transcripts.fasta \
-O results/td2
# The final outputs are generated in the current working directory
# Move these files to the td2 folder
mv transcripts.fasta.TD2.* results/td2
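The steps above can be wrapped in a small function that stops at the first failure, so that a failed TD2.LongOrfs run is not followed by TD2.Predict. This is a sketch under stated assumptions: `run_td2` and `count_fasta_records` are hypothetical helper names, and the final files are assumed to be written to the current directory as `<input file name>.TD2.*`, as in the mv step above.

```shell
# Run both TD2 steps for one transcript FASTA and collect the outputs.
# Assumption: final files appear in the current directory as
# <basename of input>.TD2.* (mirrors the manual mv step).
run_td2() {
    transcripts=$1
    outdir=$2
    # Refuse to start if the input is missing or empty.
    if [ ! -s "$transcripts" ]; then
        echo "error: '$transcripts' is missing or empty" >&2
        return 1
    fi
    mkdir -p "$outdir" || return 1
    TD2.LongOrfs -t "$transcripts" -O "$outdir" || return 1
    TD2.Predict -t "$transcripts" -O "$outdir" || return 1
    mv "$(basename "$transcripts")".TD2.* "$outdir"/
}

# Count FASTA records (header lines starting with '>') in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, `run_td2 transcripts.fasta results/td2` runs the whole workflow. If the predicted proteins are written to a file ending in `.TD2.pep` (an assumption about the output naming), `count_fasta_records results/td2/transcripts.fasta.TD2.pep` gives a quick count of reported ORFs.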
Useful options TD2.LongOrfs:
options:
  -h, --help            show this help message and exit
  -O OUTPUT_DIR, --output-dir OUTPUT_DIR
                        path to output results, default=./{transcripts}
  --precise             set --precise to enable precise mode. Equivalent to -m 98 -M 98 for TD2.LongOrfs,
                        default=False
  -m MINIMUM_LENGTH, --min-length MINIMUM_LENGTH
                        minimum protein length for proteins in long transcripts, default=90
  -M ABSOLUTE_MIN, --absolute-min-length ABSOLUTE_MIN
                        minimum protein length for proteins in short transcripts, default=90
  -L LEN_SCALE, --length-scale LEN_SCALE
                        allow short ORFs in short transcripts if the ORF is at least a fraction of the total
                        transcript length, default=1.1 (essentially off by default). You must also specify -M
                        to a lower minimum ORF length to work with -L
  -S, --strand-specific
                        set -S for strand-specific ORFs (only analyzes top strand), default=False
  -G GENETIC_CODE, --genetic-code GENETIC_CODE
                        genetic code (NCBI integer code), default=1 (universal)
  --complete-orfs-only  ignore all ORFs without both a stop and start codon, default=False
  --alt-start           include alternative initiator codons, default=False
  --all-stopless        report stopless sequences rather than ORFs, i.e. never require a start codon,
                        default=False
  --top TOP             record the top N CDS transcripts by length, default=0
  --gene-trans-map GENE_TRANS_MAP
                        gene-to-transcript mapping file (tab-delimited)
  -v, --verbose         set -v for verbose output, default=False
  -@ THREADS, --threads THREADS
                        number of threads to use, default=64
  -% MEMORY_THRESHOLD, --memory-threshold MEMORY_THRESHOLD
                        percent of available memory to use per batch, default=None
required arguments:
  -t TRANSCRIPTS        REQUIRED path to transcripts.fasta
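Because -@ defaults to 64 threads, it is worth setting it explicitly on a shared machine. The sketch below assumes the job runs under Slurm, which sets SLURM_CPUS_PER_TASK when --cpus-per-task is requested; `td2_threads` is a hypothetical helper name, and the fallback of 4 is an arbitrary illustrative value.

```shell
# Print the number of threads to pass to -@:
# the Slurm CPU allocation if set, otherwise a fallback (first argument, default 4).
td2_threads() {
    echo "${SLURM_CPUS_PER_TASK:-${1:-4}}"
}
```

It could then be used as `TD2.LongOrfs -t transcripts.fasta -O results/td2 -@ "$(td2_threads)"`, so the thread count follows the resources actually allocated to the job.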
Useful options TD2.Predict:
options:
  -h, --help            show this help message and exit
  --precise             set --precise to enable precise mode. Equivalent to -P 0.9 and --retain-long-orfs-fdr
                        0.005 for TD2.Predict, default=False
  -P PSAURON_CUTOFF     minimum in-frame PSAURON score required to report ORF assuming no homology hits, higher
                        is less sensitive and more precise (range: [0,1]; default: 0.50)
  --all-good            report all ORFs that pass PSAURON and/or length-based false discovery filters,
                        default=False
  --retain-mmseqs-hits RETAIN_MMSEQS_HITS
                        mmseqs output in '.m8' format. Complete ORFs with a MMseqs2 match will be retained in
                        the final output.
  --retain-blastp_hits RETAIN_BLASTP_HITS
                        blastp output in '-outfmt 6' format. Complete ORFs with a blastp match will be retained
                        in the final output.
  --retain-hmmer_hits RETAIN_HMMER_HITS
                        domain table output file from running hmmer to search Pfam. Complete ORFs with a Pfam
                        domain hit will be retained in the final output.
  --retain-long-orfs-mode RETAIN_LONG_ORFS_MODE
                        dynamic: retain ORFs longer than a threshold length determined by calculating the FDR
                        for each transcript's GC percent; strict: retain ORFs with length above constant length
  --retain-long-orfs-fdr RETAIN_LONG_ORFS_FDR
                        in "--retain-long-orfs-mode dynamic" mode, set the False Discovery Rate used to
                        calculate dynamic threshold, default=0.10
  --retain-long-orfs-length RETAIN_LONG_ORFS_LENGTH
                        in "--retain-long-orfs-mode strict" mode, retain all ORFs found that are equal or
                        longer than these many nucleotides even if no other evidence marks it as coding,
                        default=100000
  --discard-encapsulated
                        retain ORFs that are fully contained within larger ORFs, default=False
  --complete-orfs-only  discard all ORFs without both a stop and start codon, default=False
  --psauron-all-frame   require ORF to have highest PSAURON score compared to all other reading frames, set
                        this argument for less sensitive and more precise ORFs, can dramatically increase
                        compute time requirements, default=False
  -G GENETIC_CODE       genetic code a.k.a. translation table, NCBI integer codes, default=1
  -O OUTPUT_DIR         same output directory from LongOrfs
  -v, --verbose         verbose output with progress bars, default=False
required arguments:
  -t TRANSCRIPTS        REQUIRED path to transcripts.fasta
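When homology evidence is optional, one way to keep a script robust is to add --retain-blastp_hits only if a non-empty hits file exists. The function below is a dry-run sketch that prints the TD2.Predict command rather than running it; `td2_predict_cmd` is a hypothetical name, and the hits file is assumed to come from a separate blastp search in '-outfmt 6' format as described in the tool's wiki.

```shell
# Print a TD2.Predict command line.
# $1 = transcripts FASTA, $2 = output directory used for TD2.LongOrfs,
# $3 = optional blastp hits file ('-outfmt 6'); ignored if missing or empty.
td2_predict_cmd() {
    transcripts=$1
    outdir=$2
    blastp_hits=${3:-}
    set -- TD2.Predict -t "$transcripts" -O "$outdir"
    if [ -n "$blastp_hits" ] && [ -s "$blastp_hits" ]; then
        set -- "$@" --retain-blastp_hits "$blastp_hits"
    fi
    echo "$@"
}
```

Printing the command first makes it easy to check which flags will be used before starting a long run. The same pattern could be extended to --retain-mmseqs-hits and --retain-hmmer_hits.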