conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/
MetaCerberus
Introduction
MetaCerberus transforms raw sequencing data (i.e., genomic, transcriptomic, metagenomic, and metatranscriptomic) into knowledge (Figueroa III et al. 2024). It is a start-to-finish Python tool for functional annotation with Hidden Markov Models (HMMs). It supports the Functional Ontology Assignments for Metagenomes (FOAM), KEGG, CAZy/dbCAN, VOG, pVOG, PHROG, COG, and a variety of other databases, including user-customized databases, enabling complete metabolic analysis across the tree of life (i.e., bacteria, archaea, phage, viruses, eukaryotes, and whole ecosystems). MetaCerberus also provides automatic differential statistics using DESeq2/edgeR, pathway enrichment with GAGE, and pathway visualization with Pathview in R.
For more information, please visit the software’s GitHub page.
This software contains the following databases as of August 2024:
Database | Last Update | Version | Publication | MetaCerberus Update Version |
---|---|---|---|---|
KEGG/KOfams | 2024-01-01 | Jan24 | Aramaki et al. 2020 | beta |
FOAM/KOfams | 2017 | 1 | Prestat et al. 2014 | beta |
COG | 2020 | 2020 | Galperin et al. 2020 | beta |
dbCAN/CAZy | 2023-08-02 | 12 | Yin et al. 2012 | beta |
VOG | 2017-03-03 | 80 | Website | beta |
pVOG | 2016 | 2016 | Grazziotin et al. 2017 | 1.2 |
PHROG | 2022-06-15 | 4 | Terzian et al. 2021 | 1.2 |
PFAM | 2023-09-12 | 36 | Mistry et al. 2020 | 1.3 |
TIGRfams | 2018-06-19 | 15 | Haft et al. 2003 | 1.3 |
PGAPfams | 2023-12-21 | 14 | Tatusova et al. 2016 | 1.3 |
AMRFinder-fams | 2024-02-05 | 2024-02-05 | Feldgarden et al. 2021 | 1.3 |
NFixDB | 2024-01-22 | 2 | Bellanger et al. 2024 | 1.3 |
GVDB | 2021 | 1 | Aylward et al. 2021 | 1.3 |
Pads Arsenal | 2019-09-09 | 1 | Zhang et al. 2020 | Coming soon |
efam-XC | 2021-05-21 | 1 | Zayed et al. 2021 | Coming soon |
NMPFams | 2021 | 1 | Baltoumas et al. 2024 | Coming soon |
MEROPS | 2017 | 1 | Rawlings et al. 2018 | Coming soon |
FESNov | 2024 | 1 | Rodríguez del Río et al. 2024 | Coming soon |
Installation
Installed on crunchomics: Yes
- MetaCerberus v1.3.2 is installed as part of the bioinformatics share. If you have access to crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski, n.dombrowski@uva.nl.
- Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command again):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/
If you want to install it yourself, you can run:
mamba create -n metacerberus_1.3.2 -c conda-forge -c bioconda metacerberus
conda activate metacerberus_1.3.2
metacerberus.py --setup     # Set up FragGeneScanRs
metacerberus.py --download  # Download the required databases
conda deactivate
Usage
General usage notes:
- MetaCerberus accepts three types of input:
  - raw read data from any sequencing platform (Illumina, PacBio, or Oxford Nanopore),
  - assembled contigs, such as MAGs, vMAGs, isolate genomes, or a collection of contigs,
  - amino acid FASTA files (.faa), i.e. pORFs that were called beforehand.
- In QC mode, raw reads are quality-checked with FastQC both before and after trimming. Reads are trimmed according to data type: fastp is used for Illumina and PacBio data; otherwise the data are assumed to be Oxford Nanopore and Porechop is used.
- In the formatting and gene prediction stage, contigs and genomes are checked for N repeats. These N repeats are removed by default.
- Contigs can be converted to pORFs using Prodigal, FragGeneScanRs, or Prodigal-gv, as specified by the user.
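The input rules above can be sketched as a small command builder. This is only an illustration and not part of MetaCerberus: the helper `build_command` and the file names are hypothetical, and it uses only options that appear in the help output on this page. It enforces the rule from the help text that FASTQ input needs one of `--illumina`, `--nanopore`, or `--pacbio`. Note that the help output spells the output option `--dir-out`, whereas the usage example on this page writes `--dir_out`; the sketch follows the help output, so change `OUT_FLAG` if your version differs.

```python
from pathlib import Path

FASTQ_SUFFIXES = {".fastq", ".fq"}
PLATFORMS = {"illumina", "nanopore", "pacbio"}
# Input options taken from the help output on this page.
MODES = {"prodigal", "fraggenescan", "super", "prodigalgv", "phanotate", "protein"}
# Assumption: spelling as shown in the help output; adjust if needed.
OUT_FLAG = "--dir-out"


def build_command(mode, inputs, out_dir, platform=None, hmm="ALL", cpus=4):
    """Assemble an argument list for metacerberus.py (hypothetical helper)."""
    if mode not in MODES:
        raise ValueError(f"unknown input mode: {mode}")
    has_fastq = any(Path(f).suffix.lower() in FASTQ_SUFFIXES for f in inputs)
    if has_fastq and platform not in PLATFORMS:
        raise ValueError("FASTQ input requires platform illumina, nanopore, or pacbio")
    cmd = ["metacerberus.py", f"--{mode}", *map(str, inputs),
           "--hmm", hmm, OUT_FLAG, str(out_dir), "--cpus", str(cpus)]
    if has_fastq:
        # Raw reads: tell MetaCerberus which platform produced them.
        cmd.append(f"--{platform}")
    return cmd


# Assembled contigs need no platform flag.
print(" ".join(build_command("prodigal", ["contigs.fasta"], "results")))
# Raw Illumina reads do.
print(" ".join(build_command("prodigal", ["reads.fastq"], "results", platform="illumina")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the activated conda environment.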
Example usage on a folder with two protein fasta files:
conda activate metacerberus_1.3.2
mkdir metacerberus
# Run MetaCerberus on a folder with two protein files
metacerberus.py --protein faa/ --hmm ALL --dir_out metacerberus --cpus 20
conda deactivate
When running this example, the results will be stored in a metacerberus folder. Inside this folder, the final folder will contain:
File Extension | Description Summary | MetaCerberus Update Version |
---|---|---|
.gff | General Feature Format | 1.3 |
.gbk | GenBank Format | 1.3 |
.fna | Nucleotide FASTA file of the input contig sequences. | 1.3 |
.faa | Protein FASTA file of the translated CDS/ORFs sequences. | 1.3 |
.ffn | FASTA Feature Nucleotide file, the Nucleotide sequence of translated CDS/ORFs. | 1.3 |
.html | Summary statistics and/or visualizations, in step 10 folder | 1.3 |
.txt | Statistics relating to the annotated features found. | 1.3 |
level.tsv | Tab-separated file with the individual levels of the hierarchical steps from various databases | 1.3 |
rollup.tsv | Tab-separated file with all levels of the hierarchical steps from various databases | 1.3 |
.tsv | Final Annotation summary, Tab-separated file of all features from various databases | 1.3 |
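To get a quick overview of what a run produced, the files in the final folder can be counted by extension. The sketch below is a generic helper, not something shipped with MetaCerberus; the function name `count_outputs` is hypothetical and the path `metacerberus/final` assumes the example run above.

```python
from collections import Counter
from pathlib import Path


def count_outputs(final_dir):
    """Count all files below final_dir by file extension (hypothetical helper)."""
    counts = Counter()
    for path in Path(final_dir).rglob("*"):
        if path.is_file():
            counts[path.suffix or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Assumption: output folder of the example run on this page.
    final_dir = Path("metacerberus/final")
    if final_dir.is_dir():
        for suffix, n in sorted(count_outputs(final_dir).items()):
            print(f"{suffix}\t{n}")
    else:
        print(f"{final_dir} not found; run MetaCerberus first")
```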
Since final_annotation_summary.tsv only provides the best hit across all databases, we provide two small scripts that generate a table for comparing the annotations across all databases for each protein. You can run them as follows (only tested on protein files so far):
# Combine results from individual database folders for each protein file
for i in metacerberus/final/Protein_*_protein; do
    python /zfs/omics/projects/bioinformatics/scripts/merge_metacerberus_individual_dbs.py -i "${i}" -o merged_annotations.tsv
done
# Concatenate the merged results into one document
python /zfs/omics/projects/bioinformatics/scripts/combine_metacerberus_annotations.py -b metacerberus/final/ -o combined_annotations.tsv
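Once the combined table exists, a first sanity check is to count how many rows carry a value in each column. The sketch below makes two assumptions that are not confirmed by this page: that combined_annotations.tsv is tab-separated with a header row, and that a missing annotation is an empty cell (if the scripts write a placeholder such as NA instead, adapt the check). The function name `annotated_per_column` is hypothetical.

```python
import csv
from pathlib import Path


def annotated_per_column(tsv_path):
    """Return {column name: number of rows with a non-empty value}."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        counts = {name: 0 for name in reader.fieldnames or []}
        for row in reader:
            for name in counts:
                # Short rows yield None for missing cells; treat them as empty.
                if (row.get(name) or "").strip():
                    counts[name] += 1
    return counts


if __name__ == "__main__":
    table = Path("combined_annotations.tsv")
    if table.is_file():
        for column, n in annotated_per_column(table).items():
            print(f"{column}\t{n}")
    else:
        print(f"{table} not found; run the two scripts first")
```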
Options:
usage: metacerberus.py [--setup] [--update] [--list-db] [--download [DOWNLOAD ...]] [--uninstall] [-c CONFIG] [--prodigal PRODIGAL [PRODIGAL ...]]
[--fraggenescan FRAGGENESCAN [FRAGGENESCAN ...]] [--super SUPER [SUPER ...]] [--prodigalgv PRODIGALGV [PRODIGALGV ...]]
[--phanotate PHANOTATE [PHANOTATE ...]] [--protein PROTEIN [PROTEIN ...]] [--hmmer-tsv HMMER_TSV [HMMER_TSV ...]] [--class CLASS]
[--illumina | --nanopore | --pacbio] [--dir-out DIR_OUT] [--replace] [--keep] [--tmpdir TMPDIR] [--hmm HMM [HMM ...]] [--db-path DB_PATH] [--meta]
[--scaffolds] [--minscore MINSCORE] [--evalue EVALUE] [--skip-decon] [--skip-pca] [--cpus CPUS] [--chunker CHUNKER] [--grouped] [--version] [-h]
[--adapters ADAPTERS] [--qc_seq QC_SEQ]
Setup arguments:
--setup Setup additional dependencies [False]
--update Update downloaded databases [False]
--list-db List available and downloaded databases [False]
--download [DOWNLOAD ...]
Downloads selected HMMs. Use the option --list-db for a list of available databases, default is to download all available databases
--uninstall Remove downloaded databases and FragGeneScan+ [False]
Input files
At least one sequence is required.
accepted formats: [.fastq, .fq, .fasta, .fa, .fna, .ffn, .faa]
Example:
> metacerberus.py --prodigal file1.fasta
> metacerberus.py --config file.config
*Note: If a sequence is given in [.fastq, .fq] format, one of --nanopore, --illumina, or --pacbio is required.:
-c CONFIG, --config CONFIG
Path to config file, command line takes priority
--prodigal PRODIGAL [PRODIGAL ...]
Prokaryote nucleotide sequence (includes microbes, bacteriophage)
--fraggenescan FRAGGENESCAN [FRAGGENESCAN ...]
Eukaryote nucleotide sequence (includes other viruses, works all around for everything)
--super SUPER [SUPER ...]
Run sequence in both --prodigal and --fraggenescan modes
--prodigalgv PRODIGALGV [PRODIGALGV ...]
Giant virus nucleotide sequence
--phanotate PHANOTATE [PHANOTATE ...]
Phage sequence
--protein PROTEIN [PROTEIN ...], --amino PROTEIN [PROTEIN ...]
Protein Amino Acid sequence
--hmmer-tsv HMMER_TSV [HMMER_TSV ...]
Annotations tsv file from HMMER (experimental)
--class CLASS path to a tsv file which has class information for the samples. If this file is included scripts will be included to run Pathview in R
--illumina Specifies that the given FASTQ files are from Illumina
--nanopore Specifies that the given FASTQ files are from Nanopore
--pacbio Specifies that the given FASTQ files are from PacBio
Output options:
--dir-out DIR_OUT path to output directory, defaults to "results-metacerberus" in current directory. [./results-metacerberus]
--replace Flag to replace existing files. [False]
--keep Flag to keep temporary files. [False]
--tmpdir TMPDIR temp directory for RAY (experimental) [system tmp dir]
Database options:
--hmm HMM [HMM ...] A list of databases for HMMER. Use the option --list-db for a list of available databases [KOFam_all]
--db-path DB_PATH Path to folder of databases [Default: under the library path of MetaCerberus]
optional arguments:
--meta Metagenomic nucleotide sequences (for prodigal) [False]
--scaffolds Sequences are treated as scaffolds [False]
--minscore MINSCORE Score cutoff for parsing HMMER results [60]
--evalue EVALUE E-value cutoff for parsing HMMER results [1e-09]
--remove-n-repeats Remove N repeats, splitting contigs [False]
--skip-decon Skip decontamination step. [False]
--skip-pca Skip PCA. [False]
--cpus CPUS Number of CPUs to use per task. System will try to detect available CPUs if not specified [Auto Detect]
--chunker CHUNKER Split files into smaller chunks, in Megabytes [Disabled by default]
--grouped Group multiple fasta files into a single file before processing. When used with chunker can improve speed
--version, -v show the version number and exit
-h, --help show this help message and exit
--adapters ADAPTERS FASTA File containing adapter sequences for trimming
--qc_seq QC_SEQ FASTA File containing control sequences for decontamination
Args that start with '--' can also be set in a config file (specified via -c). Config file syntax allows: key=value, flag=true, stuff=[a,b,c] (for details, see syntax at
https://goo.gl/R74nmi). In general, command-line values override config file values which override defaults.
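As the help text notes, options can also be stored in a config file using the key=value, flag=true, and stuff=[a,b,c] syntax. The sketch below renders such a file for the protein example on this page. It is an illustration under assumptions: the helper `format_config` is hypothetical, and the keys are assumed to be the long option names without the leading `--` (for example `dir-out`, following the spelling in the help output).

```python
def format_config(settings):
    """Render settings in key=value / flag=true / key=[a,b,c] syntax."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            if not value:
                continue  # unset flags are simply left out
            text = "true"
        elif isinstance(value, (list, tuple)):
            text = "[" + ",".join(str(v) for v in value) + "]"
        else:
            text = str(value)
        lines.append(f"{key}={text}")
    return "\n".join(lines) + "\n"


# Assumed keys: long option names from the help output, without "--".
settings = {
    "protein": ["faa/"],
    "hmm": ["ALL"],
    "dir-out": "metacerberus",
    "cpus": 20,
}

if __name__ == "__main__":
    print(format_config(settings), end="")
```

Saving the printed text as, for example, file.config allows a run with `metacerberus.py --config file.config`; values given on the command line still take priority.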
Common Issues and Solutions
- Issue 1: In v1.3.2, the files that report counts are faulty and only contain counts of 1.
- Solution 1: For now, do not use these files; the authors of the software are aware of the problem.