fastplong

Introduction

A tool for ultrafast preprocessing and quality control for long reads (Nanopore, PacBio, Cyclone, etc.). For full usage information, please visit the tool’s manual.

Installation

Installed on Crunchomics: Yes,

  • fastplong v0.4.1 is installed as part of the bioinformatics share. If you have access to Crunchomics and have not yet access to the bioinformatics share, then you can send an email with your Uva netID to Nina Dombrowski, n.dombrowski@uva.nl.
  • Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

mamba create -n fastplong_0.4.1 -c bioconda fastplong=0.4.1

Usage

fastplong takes as input a FASTQ file and outputs a quality-filtered FASTQ file. Both input and output files can be gzip-compressed.

It allows filtering or trimming reads based on:

  • quality
  • sequence length
  • sequence complexity

Additionally, fastplong can be used to remove adapter sequences. If no adapter sequences are supplied, fastplong automatically detects read start and end adapters. Alternatively, adapters can be specified via:

  • -s, --start_adapter for read start adapter sequence, and -e, --end_adapter for read end adapter sequence
  • You can also specify -a, --adapter_fasta to give a FASTA file to tell fastplong to trim multiple adapters. You can find an example adapter FASTA file here. This file includes the adapter sequences used by porechop via adapters.py.

Below you find a basic usage example.

fastplong \
    -i data/barcode01_merged.fastq.gz \
    -l 1400 --length_limit 1700 \
    --mean_qual 8 --cut_front 8 \
    --thread 2 \
    -o results/fastplong/barcode01_filtered.fastq.gz

Useful options:

usage: fastplong -i <in> -o <out> [options...]
fastplong: ultra-fast FASTQ preprocessing and quality control for long reads
usage: ./fastplong [options] ... 

options:
  -i, --in                           read input file name (string [=])
  -o, --out                          read output file name (string [=])
      --failed_out                   specify the file to store reads that cannot pass the filters. (string [=])
  -z, --compression                  compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])
      --stdin                        input from STDIN.
      --stdout                       stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.
      --reads_to_process             specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])
      --dont_overwrite               don't overwrite existing files. Overwritting is allowed by default.
  -V, --verbose                      output verbose log information (i.e. when every 1M reads are processed).
  -A, --disable_adapter_trimming     adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled
  -s, --start_adapter                the adapter sequence at read start (5'). (string [=auto])
  -e, --end_adapter                  the adapter sequence at read end (3'). (string [=auto])
  -a, --adapter_fasta                specify a FASTA file to trim both read by all the sequences in this FASTA file (string [=])
  -d, --distance_threshold           threshold of sequence-adapter-distance/adapter-length (0.0 ~ 1.0), greater value means more adapters detected (double [=0.25])
      --trimming_extension           when an adapter is detected, extend the trimming to make cleaner trimming, default 10 means trimming 10 bases more (int [=10])
  -f, --trim_front                   trimming how many bases in front for read, default is 0 (int [=0])
  -t, --trim_tail                    trimming how many bases in tail for read, default is 0 (int [=0])
  -x, --trim_poly_x                  enable polyX trimming in 3' ends.
      --poly_x_min_len               the minimum length to detect polyX in the read tail. 10 by default. (int [=10])
  -5, --cut_front                    move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.
  -3, --cut_tail                     move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.
  -W, --cut_window_size              the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])
  -M, --cut_mean_quality             the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])
      --cut_front_window_size        the window size option of cut_front, default to cut_window_size if not specified (int [=4])
      --cut_front_mean_quality       the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])
      --cut_tail_window_size         the window size option of cut_tail, default to cut_window_size if not specified (int [=4])
      --cut_tail_mean_quality        the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])
  -N, --mask                         mask the low quality regions with N, these regions are detected by sliding window with mean quality < mask_mean_quality.
      --mask_window_size             the size of the sliding window to evaluate the mean quality for N masking(5~1000000), default: 50 (int [=50])
      --mask_mean_quality            the mean quality requirement for sliding window N masking (5~30), default: 10 (Q10) (int [=10])
  -b, --break                        break the reads by discarding the low quality regions, these regions are detected by sliding window with mean quality < break_mean_quality.
      --break_window_size            the size of the sliding window to evaluate the mean quality for sliding window breaking(5~1000000), default: 100 (int [=100])
      --break_mean_quality           the mean quality requirement for sliding window breaking (5~30), default: 10 (Q10) (int [=10])
  -Q, --disable_quality_filtering    quality filtering is enabled by default. If this option is specified, quality filtering is disabled
  -q, --qualified_quality_phred      the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])
  -u, --unqualified_percent_limit    how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
      --n_base_limit                 if number of N base is >n_base_limit, then this read is discarded (0~1000000). 0 means no N allowed, default 1000000 means no N limit (int [=1000000])
  -n, --n_percent_limit              if one read's N base percentage is >n_percent_limit, then this read is discarded (0~100). Default 10 means 10% (int [=10])
  -m, --mean_qual                    if one read's mean_qual quality score <mean_qual, then this read is discarded. Default 0 means no requirement (int [=0])
  -L, --disable_length_filtering     length filtering is enabled by default. If this option is specified, length filtering is disabled
  -l, --length_required              reads shorter than length_required will be discarded, default is 20. (int [=20])
      --length_limit                 reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])
  -y, --low_complexity_filter        enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).
  -Y, --complexity_threshold         the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])
  -j, --json                         the json format report file name (string [=fastplong.json])
  -h, --html                         the html format report file name (string [=fastplong.html])
  -R, --report_title                 should be quoted with ' or ", default is "fastplong report" (string [=fastplong report])
  -w, --thread                       worker thread number, default is 3 (int [=3])
      --split                        split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])
      --split_by_lines               split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])
      --split_prefix_digits          the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])
  -?, --help                         print this message