CheckM2

CheckM2 is a tool for assessing the quality (completeness, contamination, coding density, etc.) of a genome assembly (Chklovski et al., n.d.). Unlike CheckM1, CheckM2 uses universally trained machine learning models, which it applies regardless of taxonomic lineage to predict the completeness and contamination of genomic bins. This allows it to include lineages in its training set that have few, or even just one, high-quality genomic representatives, by placing them in the context of all other organisms in the training set. As a result of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.

For more information, check out the tool's GitHub page.

Introduction

Installation

Installed on crunchomics: Yes

  • CheckM2 v1.0.1 is installed as part of the bioinformatics share. If you have access to crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski. Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command again):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

mamba create -p /zfs/omics/projects/bioinformatics/software/miniconda3/envs/checkm2_1.0.1 -c bioconda -c conda-forge checkm2=1.0.1

conda activate checkm2_1.0.1 

#install right python version, otherwise you get a class error
mamba install python=3.8 

#install database 
checkm2 database --download --path /zfs/omics/projects/bioinformatics/databases/checkm

#do test run 
checkm2 testrun

Usage

#run checkm2
conda activate checkm2_1.0.1

checkm2 predict --threads 30 \
  --input  folder_with_genomes_to_analyse/  \
  -x fasta \
  --output-directory results/checkm2 

conda deactivate

After running this, you will find all relevant information in the quality_report.tsv file in the output folder.
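To pick out the better bins from the report, a small awk filter can help. The following is a minimal sketch, not part of CheckM2 itself: it assumes the report is tab-separated and has columns named Name, Completeness and Contamination (check the header of your own quality_report.tsv first). The helper name filter_bins, the example rows and the thresholds (at least 90% complete, at most 5% contaminated) are hypothetical and only there for illustration.

```shell
# Hypothetical example report; the column names are an assumption
# about the layout of quality_report.tsv.
report=$(mktemp)
printf 'Name\tCompleteness\tContamination\n' > "$report"
printf 'bin1\t98.5\t1.2\n' >> "$report"
printf 'bin2\t45.0\t0.5\n' >> "$report"
printf 'bin3\t95.0\t12.3\n' >> "$report"

# filter_bins FILE MIN_COMPLETENESS MAX_CONTAMINATION
# Looks up the columns by header name, prints the header, and keeps
# only rows that pass both thresholds.
filter_bins() {
  awk -F'\t' -v min_comp="$2" -v max_cont="$3" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; print; next }
    $(col["Completeness"]) >= min_comp && $(col["Contamination"]) <= max_cont
  ' "$1"
}

filter_bins "$report" 90 5
```

For a real run, replace "$report" with the path to your own file, for example results/checkm2/quality_report.tsv. Looking the columns up by name rather than by position keeps the filter working even if the column order differs from what is assumed here.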

Useful options (for a full list, use the help function):

  • --genes : Treat input files as protein files. [Default: False]
  • -x EXTENSION, --extension EXTENSION: Extension of input files. [Default: .fna]
  • --tmpdir TMPDIR : specify an alternative directory for temporary files
  • --force: overwrite output directory [default: not set]
  • --resume: Reuse Prodigal and DIAMOND results found in output directory [default: not set]
  • --threads num_threads, -t num_threads: number of CPUs to use [default: 1]
  • --ttable ttable: Provide a specific Prodigal translation table for bins [default: automatically determine either 11 or 4]

For more information, please visit the manual.

Common Issues and Solutions

  • Issue 1: Running out of memory
    • Solution 1: If you are running CheckM2 on a device with limited RAM, you can use the --lowmem option to reduce DIAMOND RAM use by half at the expense of longer runtime.
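A low-memory run could look like the sketch below. It reuses the input folder from the Usage section; the thread count and the output directory name results/checkm2_lowmem are assumptions chosen for illustration. The small run helper is hypothetical and not part of CheckM2: with DRY_RUN=1 it only prints the command so you can check it before it is executed.

```shell
# Hypothetical helper: prints the command when DRY_RUN=1,
# otherwise executes it unchanged.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Print the command first; set DRY_RUN=0 (inside the activated
# checkm2_1.0.1 environment) to actually run it.
DRY_RUN=1

run checkm2 predict --threads 8 --lowmem \
  --input folder_with_genomes_to_analyse/ \
  -x fasta \
  --output-directory results/checkm2_lowmem
```

Writing to a separate output directory keeps the results of the earlier run intact, so no --force is needed.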

References

Chklovski, Alex, Donovan H. Parks, Ben J. Woodcroft, and Gene W. Tyson. n.d. “CheckM2: A Rapid, Scalable and Accurate Tool for Assessing Microbial Genome Quality Using Machine Learning.” https://doi.org/10.1101/2022.07.11.499243.