conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/CheckM2
Introduction
CheckM2 is a tool for assessing the quality (completeness, contamination, coding density, etc.) of a genome assembly (Chklovski et al., n.d.). Unlike CheckM1, CheckM2 applies universally trained machine learning models, regardless of taxonomic lineage, to predict the completeness and contamination of genomic bins. This allows it to include in its training set many lineages that have few, or even just one, high-quality genomic representatives, by putting them in the context of all other organisms in the training set. As a result of this machine learning framework, CheckM2 is also highly accurate on organisms with reduced genomes or unusual biology, such as the Nanoarchaeota or Patescibacteria.
For more information, check out the tool's GitHub page.
Installation
Installed on Crunchomics: Yes, the following versions of CheckM2 are installed as part of the bioinformatics share:
- CheckM2 v1.0.1
- CheckM2 v1.1.0
If you have access to Crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski. Afterwards, you can add the bioinformatics share by running the conda config --add envs_dirs command shown at the top of this page. If you have already done this in the past, you do not need to run it again.
If you want to install it yourself, you can run:
mamba create -p /zfs/omics/projects/bioinformatics/software/miniconda3/envs/checkm2_1.1.0 -c bioconda -c conda-forge checkm2=1.1.0
conda activate checkm2_1.1.0
#install database
checkm2 database --download --path path_to_db_folder
#do test run
checkm2 testrun
Usage
Important:
To generate the coding density information in the output file quality_report.tsv, the FASTA headers of your genomes must not contain whitespace. Headers with whitespace (spaces or tabs), such as >scaffold_193 total_depth=33.25, will result in only 0.0 values appearing in that column. In contrast, headers without whitespace, such as >scaffold_193, will produce values between 0.0 and 1.0.
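Before cleaning anything, it can help to see which files are affected. The following is a minimal sketch, assuming uncompressed FASTA files; find_whitespace_headers is a hypothetical helper name used only for illustration, and the path in the example call follows the folder layout used on this page.

```shell
# List every FASTA file that has at least one header containing a space or tab.
# Assumption: input files are plain-text (uncompressed) FASTA.
# The pattern matches a header line ('>' at line start) whose first
# whitespace-free token is followed by a blank character.
find_whitespace_headers() {
    grep -l -E '^>[^[:blank:]]*[[:blank:]]' "$@"
}

# Example call:
# find_whitespace_headers data/genomes/*.fa
```

If the function prints nothing, none of the given files has whitespace in its headers and the clean-up step can be skipped.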
# Optional:
## Ensure there are no spaces in the fasta headers
mkdir -p data/genomes_clean
for f in data/genomes/*.fa; do
    # Keep only the part of each header before the first space or tab
    sed 's/^\(>[^ \t]*\)[ \t].*/\1/' "$f" > "data/genomes_clean/$(basename "$f")"
done
# Run Checkm2
## If you want to run the older version of CheckM2, replace the env name below with checkm2_1.0.1
conda activate checkm2_1.1.0
checkm2 predict --threads 30 \
--input data/genomes_clean/ \
-x fa \
--output-directory results/checkm2
conda deactivate
After running this, you will find all relevant information in the quality_report.tsv file in the output folder.
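To get a quick overview of the report, the table can be filtered on the command line. This is a sketch under stated assumptions: it assumes quality_report.tsv is tab-separated with a header row containing the columns Name, Completeness and Contamination, and the cutoffs in the example call (at least 90% completeness, at most 5% contamination) are illustrative values, not a recommendation from CheckM2. filter_quality_report is a hypothetical helper name.

```shell
# Print the header plus all genomes that pass the given cutoffs.
# Usage: filter_quality_report REPORT MIN_COMPLETENESS MAX_CONTAMINATION
# Assumption: REPORT is tab-separated and its header row contains the
# columns Name, Completeness and Contamination.
filter_quality_report() {
    awk -F'\t' -v OFS='\t' -v min_comp="$2" -v max_cont="$3" '
        NR == 1 {
            # Locate the columns by name rather than by position.
            for (i = 1; i <= NF; i++) col[$i] = i
            print $(col["Name"]), $(col["Completeness"]), $(col["Contamination"])
            next
        }
        $(col["Completeness"]) + 0 >= min_comp && $(col["Contamination"]) + 0 <= max_cont {
            print $(col["Name"]), $(col["Completeness"]), $(col["Contamination"])
        }
    ' "$1"
}

# Example call, using the output folder from the command above:
# filter_quality_report results/checkm2/quality_report.tsv 90 5
```

Looking up columns by name keeps the sketch working if the column order differs between CheckM2 versions, as long as the names themselves match.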
Useful options (for a full list, use the help function):
- --genes: Treat input files as protein files. [Default: False]
- -x EXTENSION, --extension EXTENSION: Extension of input files. [Default: .fna]
- --tmpdir TMPDIR: Specify an alternative directory for temporary files
- --force: Overwrite output directory [default: not set]
- --resume: Reuse Prodigal and DIAMOND results found in output directory [default: not set]
- --threads num_threads, -t num_threads: Number of CPUs to use [default: 1]
- --ttable ttable: Provide a specific Prodigal translation table for bins [default: automatically determine either 11 or 4]
For more information, please visit the manual.
Common Issues and Solutions
- Issue 1: Running out of memory
- Solution 1: If you are running CheckM2 on a device with limited RAM, you can use the --lowmem option to reduce DIAMOND RAM use by half at the expense of longer runtime.
- Issue 2: The coding density column in the quality_report.tsv only shows 0.0 values
- Solution 1: Ensure that there are no spaces in your fasta headers
- Issue 3: CheckM2 crashes with OSError: [Errno 16] Device or resource busy errors
- Solution: Do not use --tmpdir with a path in your personal folder. Either omit the flag entirely (defaults to /tmp on the compute node) or redirect the temporary files to /scratch/$USER/
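For Issue 3, one way to keep temporary files on scratch is to create a fresh directory per run and pass it to --tmpdir. This is a minimal sketch: make_run_tmpdir is a hypothetical helper name and not part of CheckM2, and the example call assumes that /scratch/$USER is writable on the compute node.

```shell
# Create a fresh, uniquely named temporary directory under BASE and print its path.
# Usage: make_run_tmpdir BASE
# make_run_tmpdir is a hypothetical helper, not part of CheckM2.
make_run_tmpdir() {
    mkdir -p "$1" && mktemp -d "$1/checkm2_tmp.XXXXXX"
}

# Example call (assumes /scratch/$USER is writable on the compute node):
# run_tmp=$(make_run_tmpdir "/scratch/$USER")
# checkm2 predict --threads 30 \
#     --input data/genomes_clean/ \
#     -x fa \
#     --output-directory results/checkm2 \
#     --tmpdir "$run_tmp"
# rm -rf "$run_tmp"
```

Using a separate directory per run avoids two jobs writing to the same temporary folder, and removing it afterwards keeps scratch space tidy.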