NCBI nr

Introduction

NCBI has several databases that can be used for BLAST searches. One of those is the the non-redundant (nr) protein database. Non-redundant means that identical sequences are represented by a single entry in the database. In the case of protein sequences, sometimes hundreds of sequences may be collapsed into a single entry.

The default protein database nr contains nearly all protein sequences available at NCBI.

  • no patent sequences
  • no wgs metagenomes and no transcriptome shotgun assembly proteins
  • includes proteins from outside protein-only sources that are also available as separate databases.

Installation

Installed on crunchomics: Yes,

  • The NCBI nr database was downloaded on 30th of August 2024 and converted to a diamond database which can be found on the bioinformatics share.
  • The database can be found here: /zfs/omics/projects/bioinformatics/databases/ncbi_nr/diamond/
  • Taxonomy files that link the NCBI taxonomy ID to a taxonomy string can be found at /zfs/omics/projects/bioinformatics/databases/ncbi_tax

If you want to generate the database yourself you can do the following:

Get the NCBI nr database

Comments:

  • Please note, the database is quite large (~450 Gb as of Oktober 2024)
  • The tool update_blastdb.pl is part of NCBI’s BLAST® Command Line Applications (Wang et al. 2003). Installation instructions can be found here. Additionally, BLAST as well as diamond can also be installed via mamba
# Download the nr database
update_blastdb.pl --decompress nr

# Download taxonomy information
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz.md5
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
wget https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip

# Prepare NR for diamond
diamond prepdb --db nr

# Extract protein sequences (required to convert BLAST to diamond format)
# While you can use a BLAST database directly with diamond it is so far not possible to extract the taxonomy information
# If you require taxonomy information, convert to dmd format as follows
blastdbcmd -entry 'all' -db nr >nr.faa

# Generate diamond database (the output generated here is called nr.dmnd)
diamond makedb --in nr.faa --db nr \
    --taxonmap prot.accession2taxid.gz \
    --taxonnodes nodes.dmp \
    --taxonnames names.dmp

Parse the taxonomy files

The code below outlines how to generate a two-column text file that lists the NCBI taxonomy ID with a taxonomy string. If you use the ncbitax2lin tool, please refer to this github page in your methods.

# Install ncbitax2lin
mamba create -n ncbitax2lin_2.3.2 -c bioconda ncbitax2lin==2.3.2

# Run ncbitax2lin on the files in the taxdmp folder we unzipped earlier
conda activate ncbitax2lin_2.3.2

ncbitax2lin --nodes-file nodes.dmp --names-file names.dmp --output ncbi_lineages_31102024.csv.gz 

conda deactivate

# Parse the results
gzip -d ncbi_lineages_31102024.csv.gz

# Replace space with underscore and only print tax level until species
sed 's/ /_/g' ncbi_lineages_31102024.csv | awk -F',' -v OFS="\t" '{print $1,$2,
$3,$4,$5,$6,$7,$8}' > temp1

# Add in “none” whenever a tax level is emtpy
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' temp1 > temp2

# Merge columns 2-8
awk ' BEGIN { FS = OFS = "\t" } {print $1,$2";"$3";"$4";"$5";"$6";"$7";"$8}' temp2 | sed '1d' > ncbi_tax_31102024.tsv

# Cleanup 
gzip ncbi_lineages_31102024.csv
rm temp*

Usage

Basic command

The code below gives an example for using the NCBI-nr database with diamond (Buchfink, Reuter, and Drost 2021):

# Run diamond search
diamond blastp -q proteins.faa \
    --more-sensitive --evalue 1e-3 --threads 20 --include-lineage --max-target-seqs 25 \
    --db /zfs/omics/projects/bioinformatics/databases/ncbi_nr/diamond/nr \
    --outfmt 6 qseqid qtitle qlen sseqid salltitles slen qstart qend sstart send evalue bitscore length pident staxids sphylums \
    --out results.txt

Basic usage for Crunchomics

When running this on larger files and Crunchomics it is recommended to send intermediate files to the scratch directories as follows. Also ensure that you give this a bit more memory.

mkdir -p /scratch/${USER}/diamond_temp 

diamond blastp -q results/prokka/all_genomes.faa \
    --more-sensitive -b 6 --evalue 1e-3 --threads $SLURM_CPUS_PER_TASK --include-lineage --max-target-seqs 25 \
    --tmpdir /scratch/${USER}/diamond_temp \
    --parallel-tmpdir /scratch/${USER}/diamond_temp \
    --db ${ncbi_nr_db_folder}/nr \
    --outfmt 6 qseqid qtitle qlen sseqid salltitles slen qstart qend sstart send evalue bitscore length pident staxids sphylums \
    --out results/annotations/ncbi_nr/ncbi_blastp_raw.txt

For a full documentation and list of all available options, please go here.

Output parsing (standard)

If you want to do some parsing and, for example, find the single-best hit per protein and integrate the taxonomy string, you could do the following:

# Select columns of interest in diamond output file
# Extracts the accession, description, e_value, bit score, percent identity, taxid, phylum
awk -F'\t' -v OFS="\t" '{ print $1, $5, $11, $12, $14, $15, $16 }'  results.txt | sed 's/ /_/g' > temp1

# Get single best hit based on the e-value, and use the bitscore for ties
sort -t$'\t' -k3,3g -k4,4gr temp1 | sort -t$'\t' --stable -u -k1,1   >  temp2

# Add an '-' into empty columns or columns without tax assignment
awk -F"\t" '{for(i=1;i<=NF;i++) {if($i ~ /^[[:blank:]]*$/) $i="_"; else gsub(/[[:blank:]]/,"_",$i); if($i=="N/A") $i="-"}}1' OFS="\t" temp2 > temp3

# In column 2 remove everything after < (otherwise the name can get too long)
awk -F'\t' -v OFS='\t' '{split($2,a,"<"); print $1, a[1], $3, $4, $5, $6, $7}' temp3 > temp4

# Merge with taxon names
LC_ALL=C join -a1 -1 6 -2 1 -e'-' -t $'\t'  -o1.1,1.2,1.3,1.4,1.5,1.6,1.7,2.2  <(LC_ALL=C sort -k6,6  temp4) <(LC_ALL=C sort -k1,1 /zfs/omics/projects/bioinformatics/databases/ncbi_tax/ncbi_tax_31102024.tsv) | LC_ALL=C  sort > temp5

# Add in header
echo -e "accession\tncbi_tophit\tncbi_e_value\tncbi_bitscore\tncbi_perc_id\tncbi_tax_id\tncbi_phylum\tncbi_ncbi_tax" | cat - temp5 > results_parsed.txt

# Cleanup 
rm temp*

Output parsing (missing tax ids)

If working with MAGs, it sometimes happens that a large proportion of taxids are empty. If that happens, the following code allows to “reverse-engineer” the full taxon string:

# Data parsing 
# Select columns of interest in diamond output file
# Accession, description, evalue, bitscore, pident
awk -F'\t' -v OFS="\t" '{ print $1, $5, $11, $12, $14}'  results/annotations/ncbi_nr/ncbi_blastp_raw.txt |\
    sed 's/ /_/g' > results/annotations/ncbi_nr/temp1

# Get single best hit based on e-value and solve ties with bitscoe
sort -t$'\t' -k3,3g -k4,4gr results/annotations/ncbi_nr/temp1 |\
    sort -t$'\t' --stable -u -k1,1  >  results/annotations/ncbi_nr/temp2

# In column 2 remove everything after > (otherwise the name can get too long whenever we have identical hits to something)
awk -F'\t' -v OFS='\t' '{split($2,a,">"); print $1, a[1], $3, $4, $5}' results/annotations/ncbi_nr/temp2 > results/annotations/ncbi_nr/temp3

# Extract the species name from field 2
awk -F'\t' -v OFS='\t' '{
    tax = $2
    # extract last [...] part from col 2
    if (match(tax, /\[([^[]+)\][^[]*$/, m)) {
        tax = m[1]
    } else {
        tax = "-"
    }
    print $1, $2, $3, $4, $5, tax
}' results/annotations/ncbi_nr/temp3 > results/annotations/ncbi_nr/temp4

# Join against a species - taxon string mapping file
LC_ALL=C join -a1 -1 6 -2 1 -e'-' -t$'\t' \
    -o1.1,1.2,1.3,1.4,1.5,2.3 \
    <(LC_ALL=C sort -k6,6 results/annotations/ncbi_nr/temp4) \
    <(LC_ALL=C sort -k1,1 /zfs/omics/projects/bioinformatics/databases/ncbi_tax/ncbi_tax_by_orgname_dedup.tsv) \
    | LC_ALL=C  sort > results/annotations/ncbi_nr/temp5
    
# Merge with taxon names
echo -e "accession\tncbi_tophit\tncbi_e_value\tncbi_bitscore\tncbi_perc_id\tncbi_lineage" | \
    cat - results/annotations/ncbi_nr/temp5 \
    > results/annotations/ncbi_nr/ncbi_blastp_parsed.txt

# Cleanup 
rm temp*

References

Buchfink, Benjamin, Klaus Reuter, and Hajk-Georg Drost. 2021. “Sensitive Protein Alignments at Tree-of-Life Scale Using DIAMOND.” Nature Methods 18 (4): 366–68. https://doi.org/10.1038/s41592-021-01101-x.
Wang, Hao, Beng Chin Ooi, Kian-Lee Tan, Twee-Hee Ong, and Lei Zhou. 2003. “BLAST++: BLASTing Queries in Batches.” Bioinformatics 19 (17): 2323–24. https://doi.org/10.1093/bioinformatics/btg310.