Ribodetector

Introduction

RiboDetector is a software developed to accurately yet rapidly detect and remove rRNA sequences from metagenomic, metatranscriptomic, and ncRNA sequencing data (Deng et al. 2022). It was developed based on LSTMs and optimized for both GPU and CPU usage to achieve a 10 times on CPU and 50 times on a consumer GPU faster runtime compared to the current state-of-the-art software. Moreover, it is very accurate, with ~10 times fewer false classifications. Finally, it has a low level of bias towards any GO functional groups.

For more information, check out the manual

Installation

Installed on crunchomics: Yes,

  • Ribodetector v0.3.1 is installed as part of the bioinformatics share. If you have access to crunchomics and have not yet access to the bioinformatics you can send an email with your Uva netID to Nina Dombrowski.
  • Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

mamba create -p /zfs/omics/projects/bioinformatics/software/miniconda3/envs/ribodetector_0.3.1 -c bioconda ribodetector python=3.8

Usage

In the example below, we use the CPU version of ribodetector. If you have access to GPUs, check out the manual for more information.

To run ribodetector, you need to provide it with the sequencing length, You can set the -l parameter to the mean read length if you have reads with variable length. The mean read length can be computed with seqkit stats.

conda activate ribodetector_0.3.1

ribodetector_cpu -t 20 \
  -i sample_R1_trim.fastq.gz sampleR2_trim.fastq.gz \
  -l 138 \
  -e rrna \
  --chunk_size 256 \
  -o sampleR1_nonrrna.fastq.gz sample_R2_nonrrna.fastq.gz \
  --log sample.log

conda deactivate

optional arguments:

  • -h, --help show this help message and exit
  • -lLEN, --len LEN Sequencing read length (mean length). Note: the accuracy reduces for reads shorter than 40.
  • -i [INPUT [INPUT …]],--input [INPUT [INPUT …]] Path of input sequence files (fasta and fastq), the second file will be considered as second end if two files given.
  • -o [OUTPUT [OUTPUT …]], --output [OUTPUT [OUTPUT …]] Path of the output sequence files after rRNAs removal (same number of files as input). (Note: 2 times slower to write gz files)
  • -r [RRNA [RRNA …]], --rrna [RRNA [RRNA …]] Path of the output sequence file of detected rRNAs (same number of files as input)
  • -e {rrna,norrna,both,none}, --ensure {rrna,norrna,both,none} Ensure which classificaion has high confidence for paired end reads. norrna: output only high confident non-rRNAs, the rest are clasified as rRNAs; rrna: vice versa, only high confident rRNAs are classified as rRNA and the rest output as non-rRNAs; both: both non-rRNA and rRNA prediction with high confidence; none: give label based on the mean probability of read pair. (Only applicable for paired end reads, discard the read pair when their predicitons are discordant)
  • -t THREADS, --threads THREADS number of threads to use. (default: 20)
  • --chunk_size CHUNK_SIZE chunk_size * 1024 reads to load each time. When chunk_size=1000 and threads=20, consumming ~20G memory, better to be multiples of the number of threads..
  • --log LOG Log file name
  • -v, --version Show program’s version number and exit

References

Deng, Zhi-Luo, Philipp C Münch, René Mreches, and Alice C McHardy. 2022. “Rapid and Accurate Identification of Ribosomal RNA Sequences via Deep Learning.” Nucleic Acids Research 50 (10): e60–60. https://doi.org/10.1093/nar/gkac112.