Atlas
Introduction
Metagenome-atlas is an easy-to-use metagenomic pipeline based on Snakemake. It handles all steps from QC and assembly to binning and annotation (Kieser et al. 2019).
This workflow is best suited for short-read Illumina data. Hybrid assembly of long and short reads is supported via SPAdes and metaSPAdes. Note, however, that metaSPAdes requires a paired-end short-read library.
For more information, visit the tool’s GitHub page and manual.
Installation
Installed on crunchomics: Yes
- Atlas v2.18.1 is installed as part of the bioinformatics share. If you have access to crunchomics but not yet to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski.
- Notice: This share only includes the Atlas software but NOT the databases. Due to the size of the databases, you will need to download these yourself.
- After getting added to the bioinformatics share, you can access the software by running the following command (if you have already done this in the past, you don’t need to run it again):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/
If you want to install Atlas yourself, you can run:
mamba create --name atlas -c bioconda -c conda-forge metagenome-atlas
Usage
To test the software, we provide some example data and a minimal example describing how to run Atlas. For more details on running Atlas with specific settings, please consult the manual.
To run Atlas, you will need:
- enough memory. A minimum of ~50GB is recommended, but an assembly usually requires up to 250GB of memory.
- enough space. Atlas will download some large databases. If you don’t have the space, you can omit some steps by editing the config.yaml file (more on that below). The databases downloaded are the following:
- Software needed by Atlas to run: 8G
- CheckM2: 3G
- GUNC database: 13G
- DRAM database: 66G
- GTDB r214 database: 85G
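Adding up the sizes above, a full set of databases needs roughly 175G. The sketch below compares that total against the free space at your chosen location; the `df` flags assume GNU coreutils, and the current directory stands in for your database path.

```shell
# Back-of-the-envelope disk check before downloading the Atlas databases.
# Sizes (in GB) are taken from the list above; '.' is just an example --
# point df at your intended database location instead.
required=$((8 + 3 + 13 + 66 + 85))
echo "Databases need roughly ${required}G in total"

# Free space (in GB) on the filesystem holding the current directory:
avail=$(df -BG --output=avail . | tail -n 1 | tr -dc '0-9')
echo "Available on this filesystem: ${avail}G"
```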
Download example data
To run Atlas, you will need a folder with your raw reads. If you don’t have any available, you can download some example data as outlined below.
cd <path_to_analysis_folder>
#download example data
wget https://zenodo.org/records/6518160/files/0_assembly_and_reads.tar.gz
tar -xzvf 0_assembly_and_reads.tar.gz
#remove unnecessary files
rm 0_assembly_and_reads.tar.gz
rm 0_assembly_and_reads/1_reads/*unpaired*
rm -r 0_assembly_and_reads/2_assembly/
Atlas initialization
The first step when running Atlas is to initialize the environment by telling Atlas the following:
- Where to install the databases needed to run Atlas (here: databases). This is the directory in which all databases are installed, so choose it wisely.
- Where the reads are located (here: 0_assembly_and_reads/1_reads/)
conda activate atlas_2.18.1
#prepare database from the raw reads
atlas init --db-dir databases 0_assembly_and_reads/1_reads/
This command parses the folder for fastq files (extension .fastq(.gz) or .fq(.gz), gzipped or not). Fastq files can be arranged in subfolders, in which case the subfolder name will be used as the sample name. If you have paired-end reads, the files are usually distinguishable by _R1/_R2 or simply _1/_2 in the file names. Atlas searches for these patterns and lists the paired-end files for each sample.
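As an illustration of the _R1/_R2 convention described above, the following sketch (the folder and sample names are made up) shows how a forward-read file maps to its mate and to a sample name:

```shell
# Demonstrate the _R1/_R2 naming convention Atlas searches for.
# demo_reads/ and sampleA are invented for this example.
mkdir -p demo_reads
touch demo_reads/sampleA_R1.fastq.gz demo_reads/sampleA_R2.fastq.gz

for r1 in demo_reads/*_R1.fastq.gz; do
  r2=${r1/_R1/_R2}                       # matching reverse-read file
  sample=$(basename "$r1" _R1.fastq.gz)  # name Atlas would infer
  echo "$sample: forward=$r1 reverse=$r2"
done
rm -r demo_reads
```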
The command creates a samples.tsv and a config.yaml in the working directory.
Have a look at samples.tsv and check whether the sample names were inferred correctly. The sample names are used for the naming of contigs, genes, and genomes. Therefore, the sample names should consist only of digits and letters and start with a letter (even though one - is allowed). Atlas tries to simplify the file name to obtain unique sample names; if it doesn’t succeed, it simply uses S1, S2, … as sample names.
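A quick way to spot problematic names is to grep them against the pattern just described. The sketch below runs on made-up names; for a real check, feed in the sample column of samples.tsv instead.

```shell
# Flag sample names that break the rules above: only letters and digits,
# starting with a letter, with at most one '-'. The names below are
# invented stand-ins for the first column of samples.tsv.
printf 'sampleA\n2badname\ngood-name\n' > names_demo.txt

bad=$(grep -vE '^[A-Za-z][A-Za-z0-9]*-?[A-Za-z0-9]*$' names_demo.txt || true)
echo "Names Atlas may struggle with: ${bad:-none}"
rm names_demo.txt
```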
The BinGroup parameter is used during genomic binning. In short: if you have between 5 and 150 samples, the default (putting everything in one group) is fine. If you have fewer than 5 samples, put every sample in an individual BinGroup and use metabat as the final binner. If you have more samples, see the co-binning section for more details.
You should also check the config.yaml file, especially:
- You may want to add host genomes to be removed.
- You may want to change the resources configuration, depending on the system you run Atlas on.
- To decrease the runtime and space requirements, you can omit the taxonomy assignment (and thus skip downloading the GTDB database) or the DRAM functional annotation by removing the relevant lines in the # Annotations section.
You can run atlas init with the following options:
-d, --db-dir PATH       location to store databases (need ~50GB)
-w, --working-dir PATH  location to run atlas
--assembler megahit|spades  assembler [default: spades]
--data-type metagenome|metatranscriptome  sample data type [default: metagenome]
--interleaved-fastq     fastq files are paired-end in one file (interleaved)
--threads INTEGER       number of threads to use per multi-threaded job
--skip-qc               skip QC, if reads are already pre-processed
-h, --help              show this message and exit
Installing the required software
Atlas will first download all required software via conda. As of v2.18.1 there is a small issue with one of the software scripts. To avoid running into an error further down the line, we will first install all required software without running Atlas further and then edit the problematic script:
#set conda channels to strict (recommended by snakemake)
conda config --set channel_priority strict
#set up the required conda environments
atlas run all --use-conda --conda-create-envs-only
#set conda channel priority back to the default
conda config --set channel_priority flexible
Next, we need to edit one of the scripts. In the command below, change databases to the location you passed to the --db-dir option of the atlas init command.
nano databases/conda_envs/*/lib/python3.11/site-packages/mag_annotator/database_processing.py
Press Ctrl+w and search for this line of text:
merge_files(glob(path.join(hmm_dir, 'VOG*.hmm')), vog_hmms)
Change this line to the following (don’t change the syntax while doing this, i.e. keep the 4 spaces in front of merge_files):
merge_files(glob(path.join(hmm_dir, 'hmm', 'VOG*.hmm')), vog_hmms)
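If you prefer a non-interactive edit over nano, the same change can be made with sed. The sketch below demonstrates the substitution on a stand-in file; pointing the sed command at the real database_processing.py (after making a backup) applies the same fix.

```shell
# Non-interactive version of the nano edit: add the extra 'hmm' path
# component to the merge_files(...) call. Shown on a stand-in file;
# run the sed line against database_processing.py for the real edit.
cat > db_processing_demo.py <<'EOF'
    merge_files(glob(path.join(hmm_dir, 'VOG*.hmm')), vog_hmms)
EOF

sed -i "s|path.join(hmm_dir, 'VOG|path.join(hmm_dir, 'hmm', 'VOG|" db_processing_demo.py
fixed=$(cat db_processing_demo.py)
echo "$fixed"
rm db_processing_demo.py
```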
Then:
- Press Ctrl+x
- Type Y to save
- Press Enter to save the changes without changing the file name
Running Atlas
Next, we start the actual pipeline, which runs multiple steps, including quality control, assembly, binning, and annotation.
#run atlas
srun --cpus-per-task 20 --mem=100G atlas run all -j 20 --max-mem 100
conda deactivate
The output files are described in more detail here.
You can run atlas run all with the following options:
-w, --working-dir PATH  location to run atlas
-c, --config-file PATH  config file generated with ‘atlas init’
-j, --jobs INTEGER      use at most this many jobs in parallel (see the manual for more details) [default: 64]
--max-mem FLOAT         specify maximum virtual memory to use by atlas
--profile TEXT          snakemake profile, e.g. for cluster execution
-n, --dryrun            test execution
-h, --help              show this message and exit
Common Issues and Solutions
- Issue 1: I am running out of memory/space
- Solution 1: Some steps are quite memory intensive or need large (~80GB) databases. You can edit the config.yaml file, for example using nano, to omit these steps. To do this, go to the # Annotations section and remove the lines starting with gtdb to omit the taxonomy annotation. You can also delete the line with dram if you run out of space, as this requires a larger database.
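The config.yaml trim just described can also be scripted. The snippet below uses a minimal, made-up annotations block purely for illustration; the real file is larger, so run the sed lines against your actual config.yaml only after making a backup.

```shell
# Scripted version of the config.yaml edit: drop the gtdb and dram lines
# from the # Annotations section. The yaml content is a minimal stand-in.
cat > config_demo.yaml <<'EOF'
annotations:
- gtdb_tree
- gtdb_taxonomy
- genes
- dram
EOF

sed -i '/gtdb/d' config_demo.yaml  # skip taxonomy assignment (~85G GTDB download)
sed -i '/dram/d' config_demo.yaml  # skip DRAM annotation (~66G database)
left=$(cat config_demo.yaml)
echo "$left"
rm config_demo.yaml
```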