NCBI nr

Introduction

NCBI has several databases that can be used for BLAST searches. One of those is the non-redundant (nr) protein database. Non-redundant means that identical sequences are represented by a single entry in the database. For protein sequences, hundreds of identical sequences may sometimes be collapsed into a single entry.

The default protein database, nr, contains nearly all protein sequences available at NCBI, with the following exceptions and additions:

  • no patent sequences
  • no WGS metagenomes and no transcriptome shotgun assembly (TSA) proteins
  • includes proteins from outside protein-only sources that are also available as separate databases.

Installation

Installed on crunchomics: Yes

  • The NCBI nr database was downloaded on 30 August 2024 and converted to a diamond database, which is stored on the bioinformatics share.
  • The database can be found here: /zfs/omics/projects/bioinformatics/databases/ncbi_nr/diamond/
  • Taxonomy files that link the NCBI taxonomy ID to a taxonomy string can be found at /zfs/omics/projects/bioinformatics/databases/ncbi_tax

If you want to generate the database yourself, you can do the following:

Get the NCBI nr database

Comments:

  • Please note that the database is quite large (~450 GB as of October 2024).
  • The tool update_blastdb.pl is part of NCBI’s BLAST® Command Line Applications (Wang et al. 2003). Installation instructions can be found here. BLAST and diamond can also be installed via mamba.
# Download the nr database
update_blastdb.pl --decompress nr

# Download taxonomy information
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz.md5
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip

# Prepare NR for diamond
diamond prepdb --db nr

# Extract protein sequences (required to convert BLAST to diamond format)
# While diamond can use a BLAST database directly, it is so far not possible to extract the taxonomy information that way
# If you require taxonomy information, convert to the diamond (.dmnd) format as follows
blastdbcmd -entry 'all' -db nr >nr.faa

# Generate diamond database (the output generated here is called nr.dmnd)
diamond makedb --in nr.faa --db nr \
    --taxonmap prot.accession2taxid.gz \
    --taxonnodes nodes.dmp \
    --taxonnames names.dmp
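
The download step above also fetches prot.accession2taxid.gz.md5, but the file is not used afterwards. A minimal sketch of how the checksum could be verified before building the database is shown below. It assumes that md5sum is available and that the .md5 file uses the usual "<hash>  <filename>" layout; the function name verify_md5 is made up for illustration.

```shell
# Sketch: check a downloaded file against its .md5 companion.
# Assumption: the .md5 file lists "<hash>  <filename>" and sits in the
# same directory as the file it describes.
verify_md5() {
    md5_file=$1
    # Run inside the file's directory so the relative name in the .md5 resolves
    ( cd "$(dirname "$md5_file")" && md5sum -c "$(basename "$md5_file")" )
}

# Example call (commented out; the files come from the wget step above):
# verify_md5 prot.accession2taxid.gz.md5 || echo "checksum mismatch, download again" >&2
```

md5sum -c exits with a non-zero status when the hashes differ, so the function can be used directly in a conditional.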

Parse the taxonomy files

The code below outlines how to generate a two-column text file that maps each NCBI taxonomy ID to a taxonomy string. If you use the ncbitax2lin tool, please refer to this GitHub page in your methods.

# Install ncbitax2lin
mamba create -n ncbitax2lin_2.3.2 -c bioconda ncbitax2lin==2.3.2

# Run ncbitax2lin on the files in the taxdmp folder we unzipped earlier
conda activate ncbitax2lin_2.3.2

ncbitax2lin --nodes-file nodes.dmp --names-file names.dmp --output ncbi_lineages_31102024.csv.gz 

conda deactivate

# Parse the results
gzip -d ncbi_lineages_31102024.csv.gz

# Replace space with underscore and only print tax level until species
sed 's/ /_/g' ncbi_lineages_31102024.csv | awk -F',' -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$7,$8}' > temp1

# Add in "none" whenever a tax level is empty
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' temp1 > temp2

# Merge columns 2-8
awk ' BEGIN { FS = OFS = "\t" } {print $1,$2";"$3";"$4";"$5";"$6";"$7";"$8}' temp2 | sed '1d' > ncbi_tax_31102024.tsv

# Cleanup
gzip ncbi_lineages_31102024.csv
rm temp1 temp2
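
To see what the parsing steps produce, the sketch below condenses them into a single awk call and runs it on a small, made-up lineage file. The function name lineage_to_tsv, the toy file, and its header are assumptions for illustration; the real ncbitax2lin output has more columns, of which only the first eight are used here.

```shell
# Sketch: turn a lineage CSV into "tax_id <TAB> rank1;rank2;...;rank7".
# Assumption: column 1 is the tax ID and columns 2-8 run from
# superkingdom down to species, as in the commands above.
lineage_to_tsv() {
    sed 's/ /_/g' "$1" |
        awk -F',' -v OFS='\t' 'NR > 1 {            # NR > 1 skips the header
            lineage = ""
            for (i = 2; i <= 8; i++) {
                rank = ($i ~ /^ *$/) ? "none" : $i  # fill empty ranks
                lineage = lineage (i > 2 ? ";" : "") rank
            }
            print $1, lineage
        }'
}

# Toy input (hypothetical file, not the real ncbitax2lin output)
toy_dir=$(mktemp -d)
cat > "$toy_dir/toy_lineages.csv" <<'EOF'
tax_id,superkingdom,phylum,class,order,family,genus,species,strain
562,Bacteria,Pseudomonadota,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli,
1224,Bacteria,Pseudomonadota,,,,,,
EOF

lineage_to_tsv "$toy_dir/toy_lineages.csv"
```

On the toy input this prints one line per tax ID, with the tax ID in the first column and the semicolon-separated lineage in the second. The entry for 1224 becomes Bacteria;Pseudomonadota;none;none;none;none;none because the ranks below phylum are empty.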

Usage

The code below gives an example for using the NCBI-nr database with diamond (Buchfink, Reuter, and Drost 2021):

# Run diamond search
diamond blastp -q proteins.faa \
    --more-sensitive --evalue 1e-3 --threads 20 --include-lineage --max-target-seqs 50 \
    --db /zfs/omics/projects/bioinformatics/databases/ncbi_nr/diamond/nr \
    --outfmt 6 qseqid qtitle qlen sseqid salltitles slen qstart qend sstart send evalue bitscore length pident staxids sphylums \
    --out results.txt

For the full documentation and a list of all available options, please go here.
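
The parsing commands further down address the diamond output by column position, so it can help to number the --outfmt fields first. The snippet below is a small sketch of that; the variable name outfmt_fields and the function name number_fields are made up for illustration, and the field list is copied from the diamond call above.

```shell
# Field list copied from the --outfmt 6 argument of the diamond call
outfmt_fields="qseqid qtitle qlen sseqid salltitles slen qstart qend sstart send evalue bitscore length pident staxids sphylums"

# Print "<column number> <TAB> <field name>" for each field in the list
number_fields() {
    i=1
    for field in $1; do        # unquoted on purpose: split the list on spaces
        printf '%s\t%s\n' "$i" "$field"
        i=$((i + 1))
    done
}

number_fields "$outfmt_fields"
```

With this field list, columns 1, 5, 11, 12, 14, 15 and 16 correspond to qseqid, salltitles, evalue, bitscore, pident, staxids and sphylums.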

If you want to do some parsing and, for example, find the single-best hit per protein and integrate the taxonomy string, you could do the following:

# Select columns of interest in diamond output file
awk -F'\t' -v OFS="\t" '{ print $1, $5, $11, $12, $14, $15, $16 }'  results.txt | sed 's/ /_/g' > temp1

# Get single best hit based on bit score, and then e-value
sort -t$'\t' -k4,4gr -k3,3g temp1 | sort -t$'\t' --stable -u -k1,1  | sort -t$'\t' -k4,4gr -k3,3g >  temp2

# Add a '-' into empty columns or columns without tax assignment
awk -F"\t" '{for(i=1;i<=NF;i++) {if($i ~ /^[[:blank:]]*$/) $i="-"; else gsub(/[[:blank:]]/,"_",$i); if($i=="N/A") $i="-"}}1' OFS="\t" temp2 > temp3

# In column 2 remove everything after < (otherwise the name can get too long)
awk -F'\t' -v OFS='\t' '{split($2,a,"<"); print $1, a[1], $3, $4, $5, $6, $7}' temp3 > temp4

# Merge with the taxonomy strings, joining on the tax ID (column 6); both inputs are sorted on the join field only
LC_ALL=C join -a1 -1 6 -2 1 -e'-' -t $'\t' -o1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.2 <(LC_ALL=C sort -t $'\t' -k6,6 temp4) <(LC_ALL=C sort -t $'\t' -k1,1 /zfs/omics/projects/bioinformatics/databases/ncbi_tax/ncbi_tax_31102024.tsv) | LC_ALL=C sort > temp5

# Add in header
echo -e "accession\tTopHit\te_value\tbitscore\tperc_id\ttax_id\tphylum\tncbi_tax" | cat - temp5 > results_parsed.txt

# Cleanup
rm temp1 temp2 temp3 temp4 temp5
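
The best-hit selection and the taxonomy join above are the two least obvious steps, so the sketch below replays them on a few made-up rows. The function names best_hits and add_lineage, the query and hit names, and the scores are all hypothetical; the column layout (query, hit title, e-value, bit score, percent identity, tax ID, phylum) follows the parsing commands above, and GNU sort and join are assumed.

```shell
tab=$(printf '\t')

# Keep one row per query (column 1): highest bit score (column 4) first,
# ties broken by the lowest e-value (column 3). The second sort is stable
# and unique on column 1, so the first (best) row per query is kept.
best_hits() {
    LC_ALL=C sort -t "$tab" -k4,4gr -k3,3g "$1" |
        LC_ALL=C sort -t "$tab" -s -u -k1,1
}

# Append the lineage string (column 2 of the taxonomy table) by joining on
# the tax ID (column 6 of the hits); rows without a match get "-".
add_lineage() {
    hits_sorted=$(mktemp)
    tax_sorted=$(mktemp)
    LC_ALL=C sort -t "$tab" -k6,6 "$1" > "$hits_sorted"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$tax_sorted"
    LC_ALL=C join -a1 -1 6 -2 1 -e '-' -t "$tab" \
        -o 1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.2 "$hits_sorted" "$tax_sorted" |
        LC_ALL=C sort
    rm -f "$hits_sorted" "$tax_sorted"
}

# Toy data (hypothetical hits; q3 has no taxonomy assignment)
work=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    q1 hitA 1e-50 200 90.0 562 Pseudomonadota \
    q1 hitB 1e-80 310 95.0 562 Pseudomonadota \
    q2 hitC 1e-10 80 40.0 1423 Bacillota \
    q3 hitD 1e-5 50 30.0 - - > "$work/hits.tsv"
printf '%s\t%s\n' \
    562 'Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia_coli' \
    1423 'Bacteria;Bacillota;Bacilli;Bacillales;Bacillaceae;Bacillus;Bacillus_subtilis' > "$work/tax.tsv"

best_hits "$work/hits.tsv" > "$work/best.tsv"
add_lineage "$work/best.tsv" "$work/tax.tsv"
```

For q1 only hitB survives, because it has the higher bit score, and q3 ends with a "-" in the lineage column, because its tax ID has no entry in the taxonomy table.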

References

Buchfink, Benjamin, Klaus Reuter, and Hajk-Georg Drost. 2021. “Sensitive Protein Alignments at Tree-of-Life Scale Using DIAMOND.” Nature Methods 18 (4): 366–68. https://doi.org/10.1038/s41592-021-01101-x.
Wang, Hao, Beng Chin Ooi, Kian-Lee Tan, Twee-Hee Ong, and Lei Zhou. 2003. “BLAST++: BLASTing Queries in Batches.” Bioinformatics 19 (17): 2323–24. https://doi.org/10.1093/bioinformatics/btg310.