InterProScan

Introduction

InterPro is a database that integrates predictive information about protein function from a number of partner resources, giving an overview of the families a protein belongs to and the domains and sites it contains (Blum et al. 2021).

Users who have novel nucleotide or protein sequences that they wish to functionally characterise can use the software package InterProScan to run the scanning algorithms from the InterPro database in an integrated way. Sequences are submitted in FASTA format. Matches are then calculated against the signatures of all required member databases, and the results are output in a variety of formats (Jones et al. 2014).

Installation

Newest version

Notice: The newest version does NOT run on Crunchomics unless you take some extra steps during the installation, because it needs a newer Java version and an updated version of the tool used to analyse the PROSITE databases.

These changes require moving a few things around. If you are uncomfortable with this, feel free to install an older version (see the section "Version for Crunchomics" below) that works well with the default Java version installed on Crunchomics.

If you want to work with the newest version and are not afraid of some extra steps, do the following:

# Go into a folder in which you have all your software installed
cd software_dir

# Make a new folder for the interproscan installation and go into the folder
mkdir interproscan
cd interproscan 

# Download software
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.64-96.0/interproscan-5.64-96.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.64-96.0/interproscan-5.64-96.0-64-bit.tar.gz.md5

# Recommended checksum to confirm the download was successful
# Must return *interproscan-5.64-96.0-64-bit.tar.gz: OK*
md5sum -c interproscan-5.64-96.0-64-bit.tar.gz.md5

# Decompress the downloaded folder
tar -pxvzf interproscan-5.64-96.0-*-bit.tar.gz

# Index hmm models
cd interproscan-5.64-96.0
python3 setup.py -f interproscan.properties

# Setup a conda environment with the newest java version
conda deactivate
mamba create --name java_f_iprscan -c conda-forge openjdk

# Activate the environment and install some more dependencies
conda activate java_f_iprscan
mamba install -c conda-forge gfortran

# The default installation gives an error with pfsearchV3. To solve this,
# get a newer pftools version via conda (v3.2.12)
mamba install -c bioconda pftools

# Remove the old pftools that came with interproscan
cd bin/prosite
rm pfscanV3
rm pfsearchV3

# Replace them with the new pftools installed via mamba (found in the mamba env folder)
# !!! Replace <~/personal/mambaforge/> with the location of your conda environments !!!
cp ~/personal/mambaforge/envs/java_f_iprscan/bin/pfscan .
cp ~/personal/mambaforge/envs/java_f_iprscan/bin/pfscanV3 .
cp ~/personal/mambaforge/envs/java_f_iprscan/bin/pfsearch .
cp ~/personal/mambaforge/envs/java_f_iprscan/bin/pfsearchV3 .
cd ../..

# Do a test run (ProSiteProfiles and ProSitePatterns are not working yet, so it works best to set the applications to run manually)
#srun -n 1 --cpus-per-task 1 --mem=8G ./interproscan.sh -i test_all_appl.fasta -f tsv
srun -n 1 --cpus-per-task 1 --mem=8G ./interproscan.sh -i test_all_appl.fasta -f tsv -dp --appl TIGRFAM,SFLD,SUPERFAMILY,PANTHER,GENE3D,Hamap,Coils,SMART,CDD,PRINTS,PIRSR,AntiFam,Pfam,MobiDBLite,PIRSF,ProSiteProfiles 

conda deactivate
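
Before the test run, it can help to confirm that the activated environment really provides a recent Java. The sketch below is only an illustration: `java_major_version` is a hypothetical helper name, and the minimum of 11 used in the usage comment is an assumption based on the Java requirement of recent InterProScan 5 releases, so check the release notes of the version you installed.

```shell
# Hypothetical helper: extract the major Java version from the first line
# of `java -version` output, e.g. 'openjdk version "17.0.8" 2023-07-18'.
java_major_version() {
    # $1: the version banner line
    jv=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
    case "$jv" in
        1.*) jv=${jv#1.} ;;          # legacy scheme: 1.8.0_292 -> 8.0_292
    esac
    printf '%s\n' "${jv%%[._-]*}"    # keep only the leading number
}

# Usage inside the activated environment (java prints its version to stderr).
# The threshold of 11 is an assumption, see the text above:
# banner=$(java -version 2>&1 | head -n 1)
# [ "$(java_major_version "$banner")" -ge 11 ] || echo "Java is too old" >&2
```

Parsing the banner rather than hard-coding a path keeps the check independent of where the conda environment lives.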

Version for Crunchomics

On Crunchomics, newer versions of InterProScan do not run out of the box because they are incompatible with the installed Java version. The older version installed here works with the default Java, but the BLAST install that comes with it has a dependency issue, so a few additional steps are needed to get things working.

# Download software
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.36-75.0/interproscan-5.36-75.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.36-75.0/interproscan-5.36-75.0-64-bit.tar.gz.md5

# Recommended checksum to confirm the download was successful
# Must return *interproscan-5.36-75.0-64-bit.tar.gz: OK*
md5sum -c interproscan-5.36-75.0-64-bit.tar.gz.md5

# Decompress the folder
tar -pxvzf interproscan-5.36-75.0-64-bit.tar.gz

# Go into the installation folder
cd interproscan-5.36-75.0

# Fix a dependency issue with the blast install that comes with this version of interproscan
## Set up a new blast env (installs blast 2.16.0)
mamba create -n python3.6.8 -c conda-forge python=3.6.8
conda activate python3.6.8

## Install a working version of blast
mamba install -c bioconda blast=2.16.0

## Replace problematic blast versions via symlinking
## !!! Replace </zfs/omics/personal/ndombro/mambaforge/> with the path to your conda environments!!!
rm bin/blast/ncbi-blast-2.9.0+/rpsblast
ln -s /zfs/omics/personal/ndombro/mambaforge/envs/python3.6.8/bin/rpsblast bin/blast/ncbi-blast-2.9.0+/rpsblast

# Do a test run to check the installation
srun -n1 --cpus-per-task 8 --mem=8G ./interproscan.sh -i test_proteins.fasta

# Close the environment
conda deactivate
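
If the test run fails, one quick thing to rule out is a dangling symlink, for example because the conda path was not adapted. The following is a minimal sketch; `check_tool_link` is a hypothetical helper and not part of InterProScan.

```shell
# Hypothetical helper: succeed only if $1 is a symlink whose target
# exists and is executable ([ -x ] follows the link to its target).
check_tool_link() {
    [ -L "$1" ] && [ -x "$1" ]
}

# Usage from inside interproscan-5.36-75.0:
# check_tool_link bin/blast/ncbi-blast-2.9.0+/rpsblast && echo "rpsblast link OK"
```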

Usage

Required inputs:

  • Protein fasta file

Generated outputs:

  • TSV: a simple tab-delimited file format
  • XML: the new “IMPACT” XML format
  • GFF3: The GFF 3.0 format
  • JSON
  • SVG
  • HTML

Notice:

  • InterProScan does not like * symbols inside the protein sequence. Some tools for protein calling, like prokka, add a * at the end of a protein to indicate that the full protein was found. If your files contain such symbols, use the code below to remove them first. Beware: sed -i overwrites the content of your file. If that behaviour is not wanted, use sed 's/*//g' Proteins_of_interest.faa > Proteins_of_interest_new.faa instead.
  • If you are on Crunchomics (or most other servers): DO NOT run jobs on the head node, but add something like srun -n 1 --cpus-per-task 4 before the actual command

Example code:

#clean faa file 
#remove `*` as interproscan does not like that symbol
sed -i 's/*//g' Proteins_of_interest.faa

#run interproscan
<path_to_install>/interproscan-5.36-75.0/interproscan.sh --cpu 4 -i Proteins_of_interest.faa -d outputfolder -T outputfolder/temp --iprlookup --goterms
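
The sed command above removes every * in the file, including any that occur in FASTA header lines. If the headers should stay untouched, a variant like the following restricts the substitution to sequence lines and writes to a new file. `strip_stop_symbols` and the output file name are illustrative choices, not part of InterProScan.

```shell
# Hypothetical helper: remove `*` from sequence lines only; header lines
# (those starting with `>`) are copied unchanged. The input is not modified.
strip_stop_symbols() {
    # $1: input FASTA, $2: output FASTA
    sed '/^>/!s/\*//g' "$1" > "$2"
}

# Usage:
# strip_stop_symbols Proteins_of_interest.faa Proteins_of_interest_new.faa
```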

To check the available options, use <path_to_install>/interproscan-5.36-75.0/interproscan.sh --help. For more detailed information, see the documentation.

References

Blum, Matthias, Hsin-Yu Chang, Sara Chuguransky, Tiago Grego, Swaathi Kandasaamy, Alex Mitchell, Gift Nuka, et al. 2021. “The InterPro Protein Families and Domains Database: 20 Years On.” Nucleic Acids Research 49 (D1): D344–54. https://doi.org/10.1093/nar/gkaa977.
Jones, Philip, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam, et al. 2014. “InterProScan 5: Genome-Scale Protein Function Classification.” Bioinformatics 30 (9): 1236–40. https://doi.org/10.1093/bioinformatics/btu031.