ngspeciesid

Introduction

NGSpeciesID is a tool for clustering and consensus forming of long-read amplicon sequencing data (has been used with both PacBio and Oxford Nanopore data) (Sahlin, Lim, and Prost 2021).

For more information, please visit the tools Github page.

Installation

Installed on Crunchomics: Yes,

NGSpeciesID v0.3.0 is installed as part of the bioinformatics share. If you have access to Crunchomics and have not yet access to the bioinformatics share, then you can send an email with your Uva netID to Nina Dombrowski, n.dombrowski@uva.nl.
Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):

conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

# Setup environment
mamba create -p NGSpeciesID_0.3.0 python=3.6 pip 

# Install NGSpeciesID inside the new env
mamba activate NGSpeciesID_0.3.0 
mamba install --yes -c conda-forge -c bioconda medaka==0.11.5 openblas==0.3.3 spoa racon minimap2 pysam==0.15.2
pip install NGSpeciesID
pip install absl-py==1.1.0

# Check if help is available
NGSpeciesID --help

# Test install with provided files found on the github repo 
curl -LO https://raw.githubusercontent.com/ksahlin/NGSpeciesID/master/test/sample_h1.fastq
NGSpeciesID --ont --fastq sample_h1.fastq --outfolder ./sample_h1 --consensus --medaka
conda deactivate

Usage

Basic Usage

General comments:

NGSpeciesID takes as input a fastq file generated by an Oxford Nanopore basecaller
There are different settings to control that will differ depending on your dataset (single species amplicon versus amplicons generated from a community) and quality of your data. The most critical parameters to adjust are:
- --abundance_ratio: This setting defines what clusters are refined. Only clusters larger than a fraction of number of total reads are considered (default 0.1). When working with community amplicon data it is recommended to set this value lower than the default as for example with a read depth of 10,000 reads only clusters are considered if they are covered by 1000 reads thereby removing many low abundant taxa. However, you also do not want to go too low and want clusters be represented by at least 50-100 reads in order to generate a useful consensus sequence.
- --aligned_threshold: Defines the the minimum fraction of the read that must be aligned to be included in a cluster (default 0.4). This means that at least 40 bases out of the 100 bases must be aligned to the reference, even if those 40 bases include mismatches or gaps. This threshold is affected by the quality of the reads.
- --mapped_threshold: Defines the minimum fraction of a read that must be classified as mapped to a reference to be considered part of a cluster (default: 0.7). This threshold is affected by the read length variation. Assume you work with a 1500 bp amplicon, using the default cutoff this means that 70% of the read, i.e. 1050 bp, must be mapped.

Basic usage:

conda activate NGSpeciesID_0.3.0 

NGSpeciesID --ont --fastq data/my_data.fastq \
--outfolder results/ \
--consensus --medaka --t 20 --aligned_threshold 0.8 --abundance_ratio 0.01 --mapped_threshold 0.7

conda deactivate

Useful options (for a full list of options please use the tools help function and visit the manual):

--q QUALITY_THRESHOLD: Filters reads with average phred quality value under this number (default = 7.0). (default: 7.0)
--consensus After clustering, (1) run spoa on all clusters, (2) detect reverse complements, (3) run medaka.
--abundance_ratioABUNDANCE_RATIO Threshold for –consensus algorithm. Consider only clusters larger than a fraction of number of total reads (default 0.1) (default: 0.1)
--m TARGET_LENGTH Intended amplicon length. Invoked to filter out reads with length greater than m + s or smaller than m - s (default = 0 which means no filtering) (default: 0)
--s TARGET_DEVIATION Maximum allowed amplicon-length deviation. Invoked to filter out reads with length greater than m + s or smaller than m - s (default = 0 which means no filtering) (default: 0)
--rc_identity_threshold RC_IDENTITY_THRESHOLD: Threshold for –consensus algorithm. Define a reverse complement if identity is over this threshold (default 0.9) (default: 0.9)
--sample_size SAMPLE_SIZE Use sample_size reads in the NGSpecies pipeline (default = 0 which means all reads considered). If sample size is larger than actual number of reads, all reads will be used. (default: 0)
--mapped_threshold MAPPED_THRESHOLD Minmum mapped fraction of read to be included in cluster. The density of minimizers to classify a region as mapped depends on quality of the read. (default: 0.7)
--aligned_threshold ALIGNED_THRESHOLD Minmum aligned fraction of read to be included in cluster. Aligned identity depends on the quality of the read. (default: 0.4)

Generated outputs:

Polished consensus sequence(s). A folder named “medaka_cl_id_X”[/“racon_cl_id_X”] is created for each predicted consensus. Each such folder contains a sequence “consensus.fasta” which is the final output of NGSpeciesID.
Draft spoa consensus sequences of each of the clusters are given as consensus_reference_X.fasta (where X is a number).
The final cluster information is given in a tsv file final_clusters.tsv present in the specified output folder.

General recommendations:

The ideal settings will depend on the quality of your data and it is recommended to test different thresholds for the abundance_ratio (depending on the average phred score of your data) as well as mapped_threshold and aligned_threshold (depending on the average phred score and length of your reads)
To compare results you can look at:
- How many consensus sequences were generated for each run
- How many multi-mappers map back to the consensus sequences

For multiple samples (advanced)

To run Autocycler on more than one genome you can run the code above with either a for loop or with a job scheduler, such as SLURM. To do so, we also provide a bash script. This script:

runs NGSpeciesID on every .fastq.gz file found in the input directory. Output files are named using the name of the input file, so name these accordingly
parses the output of NGSpeciesID. Specifically, the header of the consensus sequences is shortened and individual consensus sequences are combined into a single file named based on the input file

The script, ngspeciesid_pipeline_polished.sh, is available on Crunchomics at /zfs/omics/projects/bioinformatics/scripts/ or can be downloaded here. This scripts requires the following arguments to run:

-r <desired_read_nr> Total reads threshold (default: 100). Consider only clusters larger than a number of total reads. Important: This behaves a bit different than the default settings found in NGSpeciesID. Instead of using a proportion, which can result in varying thresholds when running this on samples with different total read numbers, this wrapper tries to only work with clusters that have approximately 100 reads when using default settings.
-a <aln_thres> Alignment threshold (default: 0.4). Minimum aligned fraction of read to be included in the cluster
-m <mapped_thres> Mapped threshold (default: 0.7). Minimum mapped fraction of read to be included in the cluster
-p <polishing_method> The polishing method to use: ‘medaka’ or ‘racon’.
-d <run_dir> A unique name for this run, used to organize output files. Useful when testing different input parameters.
-i <input_dir> The directory containing input FASTQ files. Workflow does not work on compressed files or files with an extension other than fastq.
-o <output_dir> The base directory where output files will be saved.
-l <log_dir> The directory where log files will be saved.
-t <threads> Number of threads to use (default: 20).

Example usage:

conda activate NGSpeciesID_0.3.0 

bash ./ngspeciesid_pipeline_polished.sh \
    -r 100 -m 0.7 -a 0.6 -t 20 -d run1 -p medaka \
    -i data/ \
    -o results/ngspeciesID -l logs

conda deactivate

References

Sahlin, Kristoffer, Marisa C. W. Lim, and Stefan Prost. 2021. “NGSpeciesID: DNA Barcode and Amplicon Consensus Generation from Long-Read Sequencing Data.” Ecology and Evolution 11 (3): 1392–98. https://doi.org/10.1002/ece3.7146.