gtdb – Bioinformatics guidance page

GTDB-Tk

Introduction

GTDB-Tk is a software toolkit for assigning objective taxonomic classifications to bacterial and archaeal genomes (Chaumeil et al. 2019). It uses the GTDB database to assign your genome(s) of interest to a taxonomy (Parks et al. 2020). For more information, visit the tool's GitHub page and website.

Installation

Installed on Crunchomics: Yes

GTDB-Tk v2.4.0 and the GTDB v220 database are installed as part of the bioinformatics share. If you have access to Crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski.

After you have been added to the bioinformatics share, you can add the conda environments that are installed in this share as follows (if you have already done this in the past, you don’t need to run this command again):

conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

mamba create -n gtdbtk_2.4.0 -c bioconda gtdbtk=2.4.0

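# Download the GTDB reference data (v220 when this page was written;
# the "latest" URL will point to newer releases once they are published)
cd <my_database_folder>
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_package/full_package/gtdbtk_data.tar.gz
tar xvzf gtdbtk_data.tar.gz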

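# Point GTDB-Tk to the reference data
# (adjust the path if the archive was unpacked to a different folder)
conda activate gtdbtk_2.4.0
conda env config vars set GTDBTK_DATA_PATH="<my_database_folder>/gtdb/release220"
# The variable takes effect the next time the environment is activated
conda deactivate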
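
Before running the tool, it can help to confirm that the data path is actually visible inside the environment. The following is a minimal sketch, not part of GTDB-Tk itself: `check_gtdbtk_data_path` is a hypothetical helper name, and the check only tests that the variable is set and points to a non-empty directory. GTDB-Tk also ships its own `gtdbtk check_install` command, which performs a more thorough check of the reference data.

```shell
# Sketch: confirm GTDBTK_DATA_PATH is set and points to a non-empty directory.
# check_gtdbtk_data_path is a hypothetical helper name, not a GTDB-Tk command.
check_gtdbtk_data_path() {
  if [ -z "${GTDBTK_DATA_PATH:-}" ]; then
    echo "GTDBTK_DATA_PATH is not set" >&2
    return 1
  fi
  if [ ! -d "$GTDBTK_DATA_PATH" ]; then
    echo "GTDBTK_DATA_PATH is not a directory: $GTDBTK_DATA_PATH" >&2
    return 1
  fi
  if [ -z "$(ls -A "$GTDBTK_DATA_PATH")" ]; then
    echo "GTDBTK_DATA_PATH is empty: $GTDBTK_DATA_PATH" >&2
    return 1
  fi
  echo "GTDBTK_DATA_PATH looks usable: $GTDBTK_DATA_PATH"
}

# Run this inside the activated environment; it reports problems
# instead of aborting the shell.
check_gtdbtk_data_path || echo "Fix the data path before running gtdbtk." >&2
```

If the variable is reported as unset right after running `conda env config vars set`, deactivate and reactivate the environment first.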

Usage

mkdir -p results/gtdb

conda activate gtdbtk_2.4.0

gtdbtk classify_wf --genome_dir genome_dir/ \
  --extension fasta \
  --out_dir results/gtdb \
  --mash_db /zfs/omics/projects/bioinformatics/databases/gtdb/release220/gtdb_ref_sketch.msh \
  --cpus 20

conda deactivate

The files named gtdbtk.*.summary.tsv in the output directory contain the key information about the taxonomic assignment of your genome(s).
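
As a quick way to inspect these files, the sketch below prints only the genome name and its assigned taxonomy. It assumes the summary files are tab-separated and contain the columns `user_genome` and `classification` in their header row; `print_gtdbtk_classification` is a hypothetical helper name, and `results/gtdb` is the output directory used in the example above.

```shell
# Sketch: print genome name and GTDB classification from all summary files.
# print_gtdbtk_classification is a hypothetical helper name.
print_gtdbtk_classification() {
  out_dir="$1"
  for f in "$out_dir"/gtdbtk.*.summary.tsv; do
    [ -e "$f" ] || continue   # skip if the pattern matched no file
    awk -F '\t' '
      NR == 1 {
        # Locate the two columns by name instead of by fixed position.
        for (i = 1; i <= NF; i++) {
          if ($i == "user_genome") ug = i
          if ($i == "classification") cl = i
        }
        if (!ug || !cl) exit 1   # header does not look as assumed
        next
      }
      { print $ug "\t" $cl }
    ' "$f"
  done
}

# Prints nothing if the directory holds no summary files yet.
print_gtdbtk_classification results/gtdb
```

Looking the columns up by name keeps the sketch working if the column order differs between GTDB-Tk versions.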

For a full set of options and a description of how the tool works, please visit the manual.

Common Issues and Solutions

Issue 1: Running out of memory, as described here
- Solution 1: Use --scratch_dir (lets pplacer write to disk instead of memory, at the cost of speed) together with --pplacer_cpus 1

Issue 2: Using a GTDB-Tk version with an incompatible database release
- Solution 2: Ensure that you use the right GTDB-Tk and database combination, as listed here
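
For Issue 1, the two options can be added to the classify_wf call from the Usage section. The sketch below wraps that call in a function; `classify_low_memory`, `GTDBTK_CMD` and `MASH_DB` are hypothetical names introduced here for illustration, and the scratch location is assumed to be a directory on a disk with enough free space.

```shell
# Sketch: classify_wf with the low-memory options from Solution 1.
# classify_low_memory, GTDBTK_CMD and MASH_DB are hypothetical names.
classify_low_memory() {
  genomes="$1"   # directory with the input genomes
  out="$2"       # output directory
  scratch="$3"   # directory pplacer can use as on-disk scratch space
  mash_db="${MASH_DB:-/zfs/omics/projects/bioinformatics/databases/gtdb/release220/gtdb_ref_sketch.msh}"

  mkdir -p "$out" "$scratch"

  # Setting GTDBTK_CMD=echo prints the command instead of running it.
  "${GTDBTK_CMD:-gtdbtk}" classify_wf \
    --genome_dir "$genomes" \
    --extension fasta \
    --out_dir "$out" \
    --mash_db "$mash_db" \
    --scratch_dir "$scratch" \
    --pplacer_cpus 1 \
    --cpus 20
}

# Example call, to be run inside the activated gtdbtk_2.4.0 environment:
# classify_low_memory genome_dir/ results/gtdb results/gtdb_scratch
```

Expect this variant to run more slowly than the default call, since pplacer trades memory for disk access.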

References

Chaumeil, Pierre-Alain, Aaron J. Mussig, Philip Hugenholtz, and Donovan H. Parks. 2019. “GTDB-Tk: A Toolkit to Classify Genomes with the Genome Taxonomy Database.” Bioinformatics (Oxford, England), November. https://doi.org/10.1093/bioinformatics/btz848.

Parks, Donovan H., Maria Chuvochina, Pierre-Alain Chaumeil, Christian Rinke, Aaron J. Mussig, and Philip Hugenholtz. 2020. “A Complete Domain-to-Species Taxonomy for Bacteria and Archaea.” Nature Biotechnology 38 (9): 1079–86. https://doi.org/10.1038/s41587-020-0501-8.