mamba create -n ncbi_tools
mamba install -n ncbi_tools \
-c conda-forge -c bioconda \
sra-tools parallel-fastq-dump ncbi-datasets-cli
mamba activate ncbi_tools
Downloading data from NCBI
Introduction
On this page you will get introduced to a set of tools that are useful to download genomes as well as sequencing data from NCBI:
- sra-tools: a collection of tools and libraries for using data in the INSDC Sequence Read Archives
- ncbi-datasets-cli: NCBI Datasets is a new resource that lets you easily gather data from across NCBI databases
- parallel-fastq-dump: is a tool that tool up the process of downloading sra data and is a multi-thread alternative to sra-tools
Installation
Since we work with a set of programs that fulfill a similar purpose we install them in one single conda environment as follows:
Usage
Downloading a single SRA archive using sra-tools
Let’s assume we want to download some sequencing data, for example some paired-end Illumina sequencing data from which we know the SRA accession, for example SRR27829729.
mkdir data
#download sra data
prefetch SRR27829729 -O data
#convert to fastq.gz
#the space you need during the conversion is approximately 17 times the size of the accession
fasterq-dump SRR27829729 -O data
#compress files to reduce file size
gzip data/*fastq
For paired-end data, fasterq-dump will automatically generate separate files for the forward and reverse reads.
Downloading several SRA archives using sra-tools
Let’s now assume we want to download two (or more) SRA archives. We can do this by first preparing a list with all the SRAs we want to download. I am doing this with echo, but you can easily prepare a list in excel.
#prepare a list of SRAs we want to work with
echo -e "SRR27829729\nSRR27829749" > sra_list
#get sra
prefetch $(<sra_list) -O data
#convert to fastq
fasterq-dump $(<sra_list) -O data
#compress files to reduce file size
gzip data/*fastq
Downloading several SRA archives using parallel-fastq-dump
If you have few data files to download, the sra-tools are good to use. However, with more data you generate a lot of intermediate files that might need a lot of space. parallel-fastq-dump
allows to combine all steps in once and make use of several threads (if needed). Assuming we need a lot of space, we can also redirect the temporary directory, which the tool uses to store the sra files, elsewhere.
mkdir temp
parallel-fastq-dump \
--sra-id $(<sra_list) \
--split-files \
--threads 4 --gzip -T temp \
--outdir data
Options:
parallel-fastq-dump
is a parallel version of fast-dump
, the predecessor of fasterq-dump
. As such you can add any option you see when typing parallel-fastq-dump -h
as well as fastq-dump -h
. In the example above, we use the option --split-files
in order to split the reads into separate files for the reverse and forward reads.
Downloading a genome from NCBI
We can use the datasets
command from the ncbi-datasets-cli
software to easily download a genome from NCBI as follows:
datasets download genome accession GCF_000385215.1 \
--include gff3,protein,genome \
--filename GCF_000385215.zip
#unzip the data directory
unzip GCF_000385215.zip
#cleanup
rm GCF_000385215.zip
Downloading multiple genomes from NCBI
mkdir data
#prepare a list of accessions we want to work with
echo -e "GCF_000385215.1\nGCA_000774145.1" > genome_list
#download data
for i in `cat genome_list`; do
datasets download genome accession ${i} \
--include gff3,protein,genome \
--filename ${i}.zip
done
#unzip the data directories
for i in `cat genome_list`; do
unzip -j ${i}.zip -d data/${i}
done
#rename the generic protein and gff file names by adding the accession (works if more than one genome is downloaded)
for file in data/*/{protein.faa,genomic.gff}
do
directory_name=$(dirname $file)
accession=$(basename $directory_name)
mv "${file}" "${directory_name}/${accession}_$(basename $file)"
done
#cleanup
rm *.zip
For more functionality of ncbi-datasets-cli
visit this NCBI website.