Nanophase

Introduction

Nanophase is an easy-to-use pipeline for generating reference-quality metagenome-assembled genomes (MAGs) from complex metagenomes, using either Nanopore long reads only (long-read-only strategy) or both Nanopore long reads and Illumina short reads (hybrid strategy) (Liu et al. 2022). Since v0.2.0, nanophase also supports generating reference-quality genomes from bacterial and archaeal isolates (long-read-only or hybrid strategy). If nanophase is interrupted, it resumes from the last completed stage.

Note that nanophase cannot assemble multiple samples separately while also using depth information from all of those samples. You can, however, run nanophase independently on each sample or consider a co-assembly. Be aware that “co-assembly of a large number of metagenomes that contain very closely related populations often hinders confident assignments of shared contigs into individual bins” and should be avoided for distinct samples (Shaiber and Eren 2019).

Some more considerations on whether your samples can be used for a co-assembly or not can be found here and here.
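
Running nanophase independently on several samples, as mentioned above, could be scripted roughly as follows. This is only a sketch: the helper name run_per_sample, the DRY_RUN switch, and the one-FASTQ-per-sample folder layout are assumptions made for this example and are not part of nanophase. It also assumes that nanophase can create the per-sample output folder itself.

```shell
#!/usr/bin/env bash
# Sketch: run nanophase separately on every FASTQ file in a folder.
# run_per_sample and DRY_RUN are hypothetical helpers for this example.
run_per_sample() {
  local in_dir=$1 out_root=$2 threads=${3:-16}
  local reads sample
  for reads in "$in_dir"/*.fastq; do
    [ -e "$reads" ] || continue   # skip if the glob matched nothing
    sample=$(basename "$reads" .fastq)
    # with DRY_RUN set, the command is printed instead of executed
    ${DRY_RUN:+echo} nanophase meta -l "$reads" -t "$threads" -o "$out_root/$sample"
  done
}

# usage (prints the commands only):
# DRY_RUN=1 run_per_sample data results/nanophase 30
```

Each sample gets its own output folder, so an interrupted run can be restarted per sample without touching the others.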

Nanophase uses the following tools. Please consider citing them as well when using nanophase:

  • flye
  • metabat2
  • maxbin2
  • SemiBin
  • metawrap
  • checkm
  • racon
  • medaka
  • polypolish
  • POLCA
  • bwa
  • seqtk
  • minimap2
  • BBMap
  • parallel
  • perl
  • samtools
  • gtdbtk
  • fastANI
  • blastp

Installation

Nanophase is installed on the Crunchomics amplicomics share. If you want access, please send an email to . Once you have access, you can add the amplicomics conda environments (in which nanophase is installed) as follows:

conda config --add envs_dirs /zfs/omics/projects/amplicomics/miniconda3/envs/

If you want to install nanophase yourself, follow the steps below. Be aware that the GTDB database is quite large and requires ~80 GB of space.

#add necessary channels
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

#install environment in the amplicomics share
mamba create -p /zfs/omics/projects/amplicomics/miniconda3/envs/nanophase_0.2.3 -c nanophase nanophase -y

## download databases: you can skip this step if GTDB and PLSDB have already been downloaded on the server
#exchange the paths to where you want to download the data
cd /path_to_gtdb_folder
wget https://data.gtdb.ecogenomic.org/releases/latest/auxillary_files/gtdbtk_data.tar.gz && tar xvzf gtdbtk_data.tar.gz

cd /path_to_plsdb_folder
wget https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/plsdb.fna.bz2 && bunzip2 plsdb.fna.bz2

#activate env
conda activate nanophase_0.2.3

## setting locations for databases
## change /path_to_gtdb_folder/release_xxx to the real location where you stored the GTDB
## ensure that the release version number is changed to the release version number that was downloaded
echo "export GTDBTK_DATA_PATH=/path_to_gtdb_folder/release_xxx" > $(dirname $(dirname `which nanophase`))/etc/conda/activate.d/np_db.sh

## change /path_to_plsdb_folder/plsdb.fna to the real location where you stored the PLSDB
echo "export PLSDB_PATH=/path_to_plsdb_folder/plsdb.fna" >> $(dirname $(dirname `which nanophase`))/etc/conda/activate.d/np_db.sh

#confirm that all packages have been installed
nanophase check

##restart environment for it to recognize the changes 
conda deactivate && conda activate nanophase_0.2.3 
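
After restarting the environment, it can be worth confirming that both database variables are visible and point to existing locations. The function below is a minimal sketch of such a check; check_np_dbs is a hypothetical helper written for this guide, not a nanophase command.

```shell
# Sketch: verify that the database variables written to np_db.sh are set.
# check_np_dbs is a hypothetical helper, not part of nanophase.
check_np_dbs() {
  local status=0
  if [ -z "${GTDBTK_DATA_PATH:-}" ] || [ ! -d "$GTDBTK_DATA_PATH" ]; then
    echo "GTDBTK_DATA_PATH is unset or not a directory" >&2
    status=1
  fi
  if [ -z "${PLSDB_PATH:-}" ] || [ ! -f "$PLSDB_PATH" ]; then
    echo "PLSDB_PATH is unset or not a file" >&2
    status=1
  fi
  return "$status"
}

# usage: check_np_dbs && echo "databases found"
```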

Usage

Input:

  • Illumina short reads and/or Nanopore long reads in fastq, fastq.gz, fasta, fasta.gz or fq.gz format

Output:

  • 01-LongAssemblies sub-folder containing information of Nanopore long-read assemblies (assembler: metaFlye)
  • 02-LongBins sub-folder containing the initial bins, which still have relatively low accuracy
  • 03-Polishing sub-folder containing polished bins
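
Because nanophase resumes from the last completed stage, a quick look at which of these sub-folders exist can hint at how far a run got. The sketch below assumes the three folder names listed above; check_np_outputs is a hypothetical helper, and a folder being present does not prove that its stage finished.

```shell
# Sketch: report which of the documented output sub-folders exist.
# check_np_outputs is a hypothetical helper, not part of nanophase.
check_np_outputs() {
  local out_dir=$1 sub missing=0
  for sub in 01-LongAssemblies 02-LongBins 03-Polishing; do
    if [ -d "$out_dir/$sub" ]; then
      echo "found:   $sub"
    else
      echo "missing: $sub"
      missing=1
    fi
  done
  return "$missing"   # non-zero if at least one sub-folder is absent
}

# usage: check_np_outputs ./nanophase-out
```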

Example

In the example below, we download the ZymoBIOMICS Gut Microbiome Standard sequenced with Nanopore. This Standard is a mixture of 18 bacterial strains, 2 fungal strains, and 1 archaeal strain in staggered abundances to mimic a true gut microbiome.

Notice:

  • You can also run the workflow if you have both Nanopore and Illumina data. If you have only Illumina data, other workflows that work with multiple samples might be more useful to explore.
  • The analysis ran successfully on Crunchomics with 100 GB of memory and 30 threads on a dataset of 1,679,780 reads with an average length of 4300 bp. The most memory-intensive step is pplacer, which runs when the genomes are placed with GTDB-Tk (one of the last steps of the pipeline). If memory is limited on your cluster, you can start the pipeline with less memory (e.g. 50 GB) to obtain the assembly and MAGs, then restart it with more memory to complete the pplacer step, since the pipeline resumes from the last completed stage.
  • If your dataset is larger or smaller, you might need to adjust the resources, but the numbers given should give you an idea of where to start.

#start environment 
conda activate nanophase_0.2.3

#go to wdir (exchange path to where you want to analyse your data)
cd /path/to/wdir

#prepare folders for better organization 
mkdir logs
mkdir scripts
mkdir data
mkdir -p results/seqkit
mkdir -p results/nanophase

#download the test-dataset 
#comes with 1,679,780 reads with an avg length of 4300 bp and avg quality of 16
fastq-dump SRR17913199 -O data

#get summary statistics for the dataset
seqkit stats -a -To results/seqkit/stats.tsv data/SRR17913199.fastq 

#analyse data with only Nanopore reads 
nanophase meta -l data/SRR17913199.fastq -t 30 -o results

#check memory usage of the SLURM job (exchange 39147 for your own job ID)
sacct -j 39147 --format=User,JobID,Jobname,state,start,end,elapsed,MaxRss,ncpus
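
The nanophase command above can also be wrapped in a SLURM batch script so that memory and threads are requested explicitly. The following is a sketch under assumptions: the script name, job name, log path and the MEM variable are made up for this example, and the way conda is initialised inside a batch job may differ on your cluster.

```shell
# Sketch: write a SLURM batch script for the run above.
# Script name, job name, log path and MEM are assumptions for this example.
MEM=${MEM:-100G}   # e.g. MEM=50G for assembly and binning, more for the pplacer step
mkdir -p scripts logs

cat > scripts/run_nanophase.sh <<EOF
#!/bin/bash
#SBATCH --job-name=nanophase
#SBATCH --cpus-per-task=30
#SBATCH --mem=${MEM}
#SBATCH --output=logs/nanophase_%j.log

source "\$(conda info --base)/etc/profile.d/conda.sh"
conda activate nanophase_0.2.3
nanophase meta -l data/SRR17913199.fastq -t "\$SLURM_CPUS_PER_TASK" -o results
EOF

# submit with: sbatch scripts/run_nanophase.sh
```

Regenerating the script with a larger MEM value and submitting it again should let the pipeline pick up at the last completed stage, as described in the notes above.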

Useful arguments:

  • --long_read_only use only Nanopore long reads [default: on]
  • --hybrid use both short and long reads [Optional]
  • -l, --long Nanopore reads in fasta/q format; reads basecalled with Guppy 5+ or generated with 20+ chemistry are recommended if only Nanopore reads are used [Mandatory]
  • -1 Illumina short reads: fasta/q paired-end #1 file [Optional]
  • -2 Illumina short reads: fasta/q paired-end #2 file [Optional]
  • -m, --medaka_model medaka model used for medaka polishing [default: r1041_e82_400bps_sup_g615]
  • -e, --environment built-in SemiBin model [default: wastewater]; for details see SemiBin single_easy_bin -h. Available choices: human_gut, dog_gut, ocean, soil, cat_gut, human_oral, mouse_gut, pig_gut, built_environment, wastewater, chicken_caecum, global
  • -t, --threads number of threads used for assembly and polishing [default: 16]
  • -o, --out output directory [default: ./nanophase-out]
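
As an illustration of how the arguments above combine, a hybrid run with paired-end Illumina reads might be started as sketched below. The function name run_hybrid, the DRY_RUN switch and the file names in the usage line are placeholders invented for this example; only the flags themselves come from the list above.

```shell
# Sketch: start a hybrid run (Nanopore + paired-end Illumina reads).
# run_hybrid and DRY_RUN are hypothetical helpers for this example.
run_hybrid() {
  local long_reads=$1 r1=$2 r2=$3 out_dir=$4 threads=${5:-16}
  # with DRY_RUN set, the command is printed instead of executed
  ${DRY_RUN:+echo} nanophase meta -l "$long_reads" --hybrid \
    -1 "$r1" -2 "$r2" -t "$threads" -o "$out_dir"
}

# usage (placeholder file names, prints the command only):
# DRY_RUN=1 run_hybrid data/ont.fastq data/sr_1.fastq.gz data/sr_2.fastq.gz results/hybrid 30
```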

References

Liu, Lei, Yu Yang, Yu Deng, and Tong Zhang. 2022. “Nanopore Long-Read-Only Metagenomics Enables Complete and High-Quality Genome Reconstruction from Mock and Complex Metagenomes.” Microbiome 10 (1). https://doi.org/10.1186/s40168-022-01415-8.
Shaiber, Alon, and A. Murat Eren. 2019. “Composite Metagenome-Assembled Genomes Reduce the Quality of Public Genome Repositories.” Edited by David A. Relman. mBio 10 (3). https://doi.org/10.1128/mbio.00725-19.