GTDB-Tk

Introduction

GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes (Chaumeil et al. 2019). It uses the Genome Taxonomy Database (GTDB) to place your genome(s) of interest in the GTDB taxonomy (Parks et al. 2020). For more information, visit the tool's GitHub and website.

Installation

Installed on Crunchomics: Yes

  • Several versions are available as part of the bioinformatics share.
  • If you have access to Crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski.

Once you have been added to the bioinformatics share, you can add the conda environments installed in this share as follows (if you have already done this in the past, you don’t need to run this command again):

conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

GTDB-Tk v2.4.0 with r220
# Setup GTDBtk
mamba create -n gtdbtk_2.4.0 -c bioconda gtdbtk=2.4.0

# Download gtdb data, v220 
cd <my_database_folder>
wget https://data.gtdb.ecogenomic.org/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz
tar xvzf gtdbtk_r220_data.tar.gz

# Add link to gtdbtk
conda activate gtdbtk_2.4.0
conda env config vars set GTDBTK_DATA_PATH="<my_database_folder>/gtdb/release220";
conda deactivate
GTDB-Tk v2.6.1 with r226
# Setup GTDBtk
mamba create -n gtdbtk_2.6.1 -c bioconda gtdbtk=2.6.1

# Download gtdb data, v226 
cd <my_database_folder>
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
tar xvzf gtdbtk_data.tar.gz 

# Link database to gtdbtk
conda activate gtdbtk_2.6.1
conda env config vars set GTDBTK_DATA_PATH="<my_database_folder>/gtdb/release226";
conda deactivate
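
After setting the variable, it can help to confirm that the environment actually points at an existing database folder before starting a long run. The function below is a minimal sketch of such a check; the function name `check_gtdbtk_data_path` is made up for this example and is not part of GTDB-Tk. It only tests that the variable is set and that the directory exists, not that the database content is complete.

```shell
# Hypothetical helper: verify that GTDBTK_DATA_PATH is set and is a directory.
# Run it after "conda activate gtdbtk_<version>".
check_gtdbtk_data_path() {
  if [ -z "${GTDBTK_DATA_PATH:-}" ]; then
    echo "GTDBTK_DATA_PATH is not set" >&2
    return 1
  fi
  if [ ! -d "$GTDBTK_DATA_PATH" ]; then
    echo "GTDBTK_DATA_PATH is not a directory: $GTDBTK_DATA_PATH" >&2
    return 1
  fi
  echo "GTDBTK_DATA_PATH=$GTDBTK_DATA_PATH"
}

# Usage:
# conda activate gtdbtk_2.6.1
# check_gtdbtk_data_path
# gtdbtk check_install   # GTDB-Tk's own, more thorough check of the reference data
```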

Usage

GTDB-Tk v2.4.0 with r220
mkdir -p results/gtdb 

conda activate gtdbtk_2.4.0

gtdbtk classify_wf --genome_dir genome_dir/ \
  --extension fasta \
  --out_dir results/gtdb \
  --mash_db /zfs/omics/projects/bioinformatics/databases/gtdb/release220/gtdb_ref_sketch.msh \
  --cpus 20

conda deactivate
GTDB-Tk v2.6.1 with r226
mkdir -p results/gtdb 

conda activate gtdbtk_2.6.1

gtdbtk classify_wf --genome_dir genome_dir/ \
  --extension fasta \
  --out_dir results/gtdb \
  --cpus 20

conda deactivate

The files named gtdbtk.*.summary.tsv in the output directory contain the key information about the taxonomic assignment of your genome(s).
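
To get a quick overview of the assignments, the summary tables can be reduced to the genome name and its classification. The sketch below assumes that the summary files are tab-separated and have header columns named `user_genome` and `classification`; the function name `print_classification` is made up for this example. Columns are looked up by header name rather than by position, so the function does not depend on the column order.

```shell
# Hypothetical helper: print genome name and GTDB classification
# from one or more GTDB-Tk summary files (tab-separated, with header).
print_classification() {
  awk -F'\t' '
    FNR == 1 {
      # New file: forget the previous header and index the columns by name
      split("", col)
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    { print $(col["user_genome"]) "\t" $(col["classification"]) }
  ' "$@"
}

# Usage (after classify_wf has finished):
# print_classification results/gtdb/gtdbtk.*.summary.tsv
```

Each output line holds one genome followed by its full taxonomy string, which is convenient for piping into `sort` or `grep`.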

Note that the first step of the classify_wf workflow is an ANI pre-screening step. If all genomes are classified by the ANI screening step, the identify and align steps are skipped. If you still want to run these steps, you can add the --skip_ani_screen argument to your command.

For the full set of options and a description of how the tool works, please visit the manual.

Common Issues and Solutions

  • Issue 1: Running out of memory as described here
    • Solution 1: For Crunchomics users: Use 150G of memory (and make sure to submit the job via srun or sbatch)
    • Solution 2: Use --scratch_dir and --pplacer_cpus 1
  • Issue 2: Using a GTDB-Tk version with an incompatible database
    • Solution 1: Ensure that you use the right GTDB-Tk and database combination, as listed here
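
The two memory-related solutions for Issue 1 can be combined in a single batch job. The snippet below is a sketch, not a tested Crunchomics template: the script name `run_gtdbtk.sh`, the job name, the log file name, and the scratch folder `results/gtdb_scratch` are placeholders chosen for this example, and the `source` line assumes a standard conda installation layout. Either solution on its own may already be enough for your data set.

```shell
# Write a SLURM batch script for the GTDB-Tk v2.6.1 run.
# Placeholder names: run_gtdbtk.sh, gtdbtk_%j.log, results/gtdb_scratch.
cat > run_gtdbtk.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=gtdbtk
#SBATCH --cpus-per-task=20
#SBATCH --mem=150G
#SBATCH --output=gtdbtk_%j.log

# Make "conda activate" available in a non-interactive shell
# (assumes a standard conda installation layout).
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate gtdbtk_2.6.1

mkdir -p results/gtdb results/gtdb_scratch

# --scratch_dir lowers pplacer memory usage by writing to disk (slower);
# --pplacer_cpus 1 restricts pplacer to a single CPU.
gtdbtk classify_wf --genome_dir genome_dir/ \
  --extension fasta \
  --out_dir results/gtdb \
  --scratch_dir results/gtdb_scratch \
  --pplacer_cpus 1 \
  --cpus 20

conda deactivate
EOF

# Submit the job to the scheduler:
# sbatch run_gtdbtk.sh
```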

References

Chaumeil, Pierre-Alain, Aaron J. Mussig, Philip Hugenholtz, and Donovan H. Parks. 2019. “GTDB-Tk: A Toolkit to Classify Genomes with the Genome Taxonomy Database.” Bioinformatics (Oxford, England), November. https://doi.org/10.1093/bioinformatics/btz848.
Parks, Donovan H., Maria Chuvochina, Pierre-Alain Chaumeil, Christian Rinke, Aaron J. Mussig, and Philip Hugenholtz. 2020. “A Complete Domain-to-Species Taxonomy for Bacteria and Archaea.” Nature Biotechnology 38 (9): 1079–86. https://doi.org/10.1038/s41587-020-0501-8.