Introduction
GTDB_Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes (Chaumeil et al. 2019). It uses the GTDB database to assign your genome(s) of interest to a taxonomy (Parks et al. 2020). For more information, visit the tool's GitHub and website.
Installation
Installed on Crunchomics: Yes
- Several versions are available as part of the bioinformatics share:
- GTDB_Tk v2.6.1 and the GTDB v226 database
- GTDB_Tk v2.4.0 and the GTDB v220 database
- If you have access to Crunchomics but not yet to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski.
After you have been added to the bioinformatics share, you can add the conda environments installed in this share as follows (if you have already done this in the past, you don't need to run this command again):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/GTDB_tk
If you want to install it yourself, you can run:
GTDB_Tk v2.4.0 with r220
# Setup GTDBtk
mamba create -n gtdbtk_2.4.0 -c bioconda gtdbtk=2.4.0
# Download gtdb data, v220
cd <my_database_folder>
wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz
tar xvzf gtdbtk_r220_data.tar.gz
# Add link to gtdbtk
conda activate gtdbtk_2.4.0
conda env config vars set GTDBTK_DATA_PATH="<my_database_folder>/gtdb/release220";
conda deactivate
GTDB_Tk v2.6.1 with r226
# Setup GTDBtk
mamba create -n gtdbtk_2.6.1 -c bioconda gtdbtk=2.6.1
# Download gtdb data, v226 (the "latest" link follows the newest release, so check that it still points to r226)
cd <my_database_folder>
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
tar xvzf gtdbtk_data.tar.gz
# Link database to gtdbtk
conda activate gtdbtk_2.6.1
conda env config vars set GTDBTK_DATA_PATH="<my_database_folder>/gtdb/release226";
conda deactivate
Usage
GTDB_Tk v2.4.0 with r220
mkdir -p results/gtdb
conda activate gtdbtk_2.4.0
gtdbtk classify_wf --genome_dir genome_dir/ \
--extension fasta \
--out_dir results/gtdb \
--mash_db /zfs/omics/projects/bioinformatics/databases/gtdb/release220/gtdb_ref_sketch.msh \
--cpus 20
conda deactivate
GTDB_Tk v2.6.1 with r226
mkdir -p results/gtdb
conda activate gtdbtk_2.6.1
gtdbtk classify_wf --genome_dir genome_dir/ \
--extension fasta \
--out_dir results/gtdb \
--cpus 20
conda deactivate
The files named gtdbtk.*.summary.tsv contain the key information about the taxonomic assignment of your genome(s).
Note that the first step of the classify_wf workflow is an ANI pre-screening step. If all genomes have been classified by the ANI screening step, the Identify and Align steps will be skipped. If you still want to run these steps, you can add the --skip_ani_screen argument to your command.
For the full set of options and a description of how the tool works, please visit the manual.
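To get a quick overview of the assignments, the summary files can be reduced to the genome name and its taxonomy. The sketch below assumes that the first two tab-separated columns of each summary file hold the genome name and the classification string (check the header of your own file to confirm); the function name gtdbtk_taxonomy is a hypothetical helper, not part of GTDB_Tk.

```shell
# Hypothetical helper: print genome name and taxonomy from all summary files.
# Usage: gtdbtk_taxonomy [output_dir]   (defaults to results/gtdb)
gtdbtk_taxonomy() {
  for f in "${1:-results/gtdb}"/gtdbtk.*.summary.tsv; do
    [ -e "$f" ] || continue        # skip if the pattern matched no file
    # Drop the header line, then keep the first two columns
    # (assumed to be genome name and classification)
    tail -n +2 "$f" | cut -f1,2
  done
}
```

After a run has finished, calling gtdbtk_taxonomy results/gtdb prints one line per genome.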
Common Issues and Solutions
- Issue 1: Running out of memory as described here
- Solution 1: For Crunchomics users: Use 150G of memory (and make sure to submit the job via srun or sbatch)
- Solution 2: Use --scratch_dir and --pplacer_cpus 1
- Issue 2: Using a gtdbtk version with the incorrect database
- Solution 2: Ensure that you use the right gtdbtk-db combination, as listed here
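The memory-related solutions for Issue 1 can be combined in a single batch job. The following is a minimal sketch, not a tested Crunchomics recipe: it writes a job script that requests 150G of memory and passes --scratch_dir and --pplacer_cpus 1 to the v2.6.1 command from the Usage section. The file name run_gtdbtk.sbatch, the job name, the scratch folder name, and the way conda is initialised inside the job are assumptions that you may need to adapt.

```shell
# Write an example job script (file name is a hypothetical choice)
cat > run_gtdbtk.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=gtdbtk
#SBATCH --cpus-per-task=20
#SBATCH --mem=150G

# Assumption: conda is on PATH; this makes `conda activate` work in a batch job
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate gtdbtk_2.6.1

# "scratch" is a placeholder folder used by --scratch_dir to lower memory usage
mkdir -p results/gtdb scratch

gtdbtk classify_wf --genome_dir genome_dir/ \
  --extension fasta \
  --out_dir results/gtdb \
  --cpus "$SLURM_CPUS_PER_TASK" \
  --pplacer_cpus 1 \
  --scratch_dir scratch

conda deactivate
EOF

echo "Wrote run_gtdbtk.sbatch; submit it with: sbatch run_gtdbtk.sbatch"
```

Writing to a scratch folder and limiting pplacer to one CPU trades run time for lower memory use, so expect the job to take longer than an unrestricted run.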