conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/
Ribodetector
Introduction
RiboDetector
is a software developed to accurately yet rapidly detect and remove rRNA sequences from metagenomic, metatranscriptomic, and ncRNA sequencing data (Deng et al. 2022). It was developed based on LSTMs and optimized for both GPU and CPU usage to achieve a 10 times on CPU and 50 times on a consumer GPU faster runtime compared to the current state-of-the-art software. Moreover, it is very accurate, with ~10 times fewer false classifications. Finally, it has a low level of bias towards any GO functional groups.
For more information, check out the manual
Installation
Installed on crunchomics: Yes,
- Ribodetector v0.3.1 is installed as part of the bioinformatics share. If you have access to crunchomics and have not yet access to the bioinformatics you can send an email with your Uva netID to Nina Dombrowski.
- Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
If you want to install it yourself, you can run:
mamba create -p /zfs/omics/projects/bioinformatics/software/miniconda3/envs/ribodetector_0.3.1 -c bioconda ribodetector python=3.8
Usage
In the example below, we use the CPU version of ribodetector. If you have access to GPUs, check out the manual for more information.
To run ribodetector, you need to provide it with the sequencing length, You can set the -l
parameter to the mean read length if you have reads with variable length. The mean read length can be computed with seqkit stats
.
conda activate ribodetector_0.3.1
ribodetector_cpu -t 20 \
-i sample_R1_trim.fastq.gz sampleR2_trim.fastq.gz \
-l 138 \
-e rrna \
--chunk_size 256 \
-o sampleR1_nonrrna.fastq.gz sample_R2_nonrrna.fastq.gz \
--log sample.log
conda deactivate
optional arguments:
-h
,--help
show this help message and exit-l
LEN,--len
LEN Sequencing read length (mean length). Note: the accuracy reduces for reads shorter than 40.-i
[INPUT [INPUT …]],--input
[INPUT [INPUT …]] Path of input sequence files (fasta and fastq), the second file will be considered as second end if two files given.-o
[OUTPUT [OUTPUT …]],--output
[OUTPUT [OUTPUT …]] Path of the output sequence files after rRNAs removal (same number of files as input). (Note: 2 times slower to write gz files)-r
[RRNA [RRNA …]],--rrna
[RRNA [RRNA …]] Path of the output sequence file of detected rRNAs (same number of files as input)-e
{rrna,norrna,both,none},--ensure
{rrna,norrna,both,none} Ensure which classificaion has high confidence for paired end reads. norrna: output only high confident non-rRNAs, the rest are clasified as rRNAs; rrna: vice versa, only high confident rRNAs are classified as rRNA and the rest output as non-rRNAs; both: both non-rRNA and rRNA prediction with high confidence; none: give label based on the mean probability of read pair. (Only applicable for paired end reads, discard the read pair when their predicitons are discordant)-t
THREADS,--threads
THREADS number of threads to use. (default: 20)--chunk_size
CHUNK_SIZE chunk_size * 1024 reads to load each time. When chunk_size=1000 and threads=20, consumming ~20G memory, better to be multiples of the number of threads..--log
LOG Log file name-v,
--version
Show program’s version number and exit