conda config --add envs_dirs /zfs/omics/projects/amplicomics/miniconda3/envs/
Nanophase
Introduction
Nanophase is an easy-to-use pipeline to generate reference-quality MAGs using only Nanopore long reads (long-read-only strategy) or both Nanopore long and Illumina short reads (hybrid strategy) from complex metagenomes (Liu et al. 2022). Since nanophase v0.2.0, it also supports to generate reference-quality genomes from bacterial/archaeal isolates (long-read-only or hybrid strategy). If nanophase is interrupted, it will resume from the last completed stage.
Notice that nanophase does not allow for a separate assembly of multiple samples while at the same time making use of depth information from all of those samples. It is however possible to run nanophase independently on multiple samples or consider a co-assembly. However, “co-assembly of a large number of metagenomes that contain very closely related populations often hinders confident assignments of shared contigs into individual bins” and should be avoided for distinct samples (Shaiber and Eren 2019).
Some more considerations on whether your samples can be used for a co-assembly or not can be found here and here.
Nanophase uses the following tools, please consider citing them as well when using this tool:
- flye
- metabat2
- maxbin2
- SemiBin
- metawrap
- checkm
- racon
- medaka
- polypolish
- POLCA
- bwa
- seqtk
- minimap2
- BBMap
- parallel
- perl
- samtools
- gtdbtk
- fastANI
- blastp
Installation
Nanophase is installed on Crunchomics amplicomics share. If you want access please send an email to n.dombrowski@uva.nl. After you got access, you can add the amplicomics conda environments (in which nanophase is installed) as follows:
If you want to install nanophase yourself, you can follow the steps below. Beware, that the GTDB database is quite large and requires ~80GB of space.
#add necessary channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
#install environment in the amplicomics share
mamba create -p /zfs/omics/projects/amplicomics/miniconda3/envs/nanophase_0.2.3 -c nanophase nanophase -y
## download database: May skip if you have done before or GTDB and PLSDB have been downloaded in the server
#exhange the paths to where you want to download the data
cd /path_to_gtdb_folder
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_data.tar.gz && tar xvzf gtdbtk_data.tar.gz
cd /path_to_plsdb_folder
wget https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/plsdb.fna.bz2 && bunzip2 plsdb.fna.bz2
#activate env
conda activate nanophase_0.2.3
## setting locations for databases
## change /path_to_gtdb_folder/release_xxx to the real location where you stored the GTDB
## ensure that the release version number is changed to the release version number that was downloaded
echo "export GTDBTK_DATA_PATH=/path_to_gtdb_folder/release_xxx" > $(dirname $(dirname `which nanophase`))/etc/conda/activate.d/np_db.sh
## Change /path/to/plsdb.fna to the real location where you stored the PLSDB
echo "export PLSDB_PATH=/path_to_plsdb_folder/plsdb.fna" >> $(dirname $(dirname `which nanophase`))/etc/conda/activate.d/np_db.sh
#confirm that all packages have been installed
nanophase check
##restart environment for it to recognize the changes
conda deactivate && conda activate nanophase_0.2.3
Usage
Input:
- Illumina short reads and/or Nanopore long-reads in fastq, fastq.gz, fasta, fasta.qz or fq.gz format
Output:
- 01-LongAssemblies sub-folder containing information of Nanopore long-read assemblies (assembler: metaFlye)
- 02-LongBins sub-folder containing the initial bins with relatively low-accuracy quality
- 03-Polishing sub-folder containing polished bins
Example
In the example below, we download the ZymoBIOMICS Gut Microbiome Standard sequenced with Nanopore. This Standard is a mixture of 18 bacterial strains, 2 fungal strains, and 1 archaeal strain in staggered abundances to mimic a true gut microbiome.
Notice:
- You can also run the workflow if you have both Nanopore and Illumina data. If you have only Illumina-data other workflows that work with multiple samples might be more useful to explore
- The analysis was successfully run on Crunchomics with 100 GB of memory and 30 threads on a dataset with 1,679,780 reads and an average length of 4300 bp. The most memory intensive step is running pplacer when you place your genomes with gtdb_tk (one of the last steps of the pipeline). If memory is an issue on your cluster, you can start the pipeline with less memory (i.e. 50GB) to get the assembly and MAGs. Afterwards, you can restart the pipeline with more memory to go through the pplacer step as the pipeline will resume from the last completed stage.
- If your dataset is larger/smaller you might need to adjust the amount of resources but the numbers given should give you an idea on where to start
#start environment
conda activate nanophase_0.2.3
#go to wdir (exchange path to where you want to analyse your data)
cd /path/to/wdir
#prepare folders for better organization
mkdir logs
mkdir scripts
mkdir data
mkdir -p results/seqkit
mkdir -p results/nanophase
#download the test-dataset
#comes with 1,679,780 reads with an avg length of 4300 bp and avg quality of 16
fastq-dump SRR17913199 -O data
#get summary statistics for the dataset
seqkit stats -a -To results/seqkit/stats.tsv data/SRR17913199.fastq
#analyse data with only Nanopore reads
nanophase meta -l data/SRR17913199.fastq -t 30 -o results
#check memory usage for slurm job: 39147
sacct -j 39147 --format=User,JobID,Jobname,state,start,end,elapsed,MaxRss,ncpus
Useful arguments:
--long_read_only
only Nanopore long reads were involved [default: on]--hybrid
both short and long reads were required [Optional]-l
,--long
Nanopore reads: fasta/q file that basecalled by Guppy 5+ or using 20+ chemistry was recommended if only Nanopore reads were included [Mandatory]-1
Illumina short reads: fasta/q paired-end #1 file [Optional]-2
Illumina short reads: fasta/q paired-end #2 file [Optional]-m
,--medaka_model
medaka model used for medaka polishing [default: r1041_e82_400bps_sup_g615]-e
,--environment
Build-in model of SemiBin [default: wastewater]; detail see: SemiBin single_easy_bin -h. Other choices are: human_gut, dog_gut, ocean, soil, cat_gut, human_oral, mouse_gut, pig_gut, built_environment, wastewater, chicken_caecum, global-t
,--threads
number of threads that used for assembly and polishing [default: 16]-o
,--out
output directory [default: ./nanophase-out]