Pseudofinder
Introduction
Pseudofinder is a bioinformatics tool that detects pseudogene candidates in annotated GenBank files of bacterial and archaeal genomes (Syberg-Olsen et al. 2022).
It has been tested mostly on GenBank (.gbf/.gbk) files annotated by Prokka with the --compliant flag (i.e. including both /gene and /CDS annotations).
There are alternative programs for pseudogene finding and annotation (e.g. the NCBI Prokaryotic Genome Annotation Pipeline), but to the best of the software developers’ knowledge, none of them are open source or allow easy fine-tuning of parameters.
For more information, please check the wiki.
Installation
Installed on crunchomics: Yes,
- Pseudofinder v1.1.0 is installed as part of the bioinformatics share. If you have access to crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski.
- Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/
If you want to install it yourself, you can run:
cd /home/ndombro/personal/software
git clone https://github.com/filip-husnik/pseudofinder.git
cd pseudofinder
bash setup.sh
conda activate pseudofinder
#install a reference database, e.g. SwissProt
cd path_to_databases/uniprot
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz
makeblastdb -in uniprot_sprot.fasta -parse_seqids -dbtype prot
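After building the database, it can be worth confirming that the index files exist before starting a long Pseudofinder run. The following is a minimal sketch, not part of Pseudofinder or BLAST+: `check_blast_protein_db` is a hypothetical helper name, and it assumes the database was built next to the FASTA file as in the commands above.

```shell
# Sketch: check that makeblastdb produced a protein BLAST index for a FASTA file.
# check_blast_protein_db is a hypothetical helper, not a Pseudofinder or BLAST+ command.
check_blast_protein_db() {
    db_prefix=$1
    # A single-volume protein database has a .pin index file next to the FASTA file;
    # databases split into several volumes have a .pal alias file instead.
    if [ -f "${db_prefix}.pin" ] || [ -f "${db_prefix}.pal" ]; then
        echo "ok: BLAST protein database found for ${db_prefix}"
        return 0
    fi
    echo "missing: no .pin or .pal file for ${db_prefix}; rerun makeblastdb" >&2
    return 1
}
```

For the database above, this would be called as `check_blast_protein_db path_to_databases/uniprot/uniprot_sprot.fasta`. For a fuller summary (title, number of sequences), BLAST+ also provides `blastdbcmd -db uniprot_sprot.fasta -info`.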
Usage
Input files:
- Pseudofinder requires the user to provide the genome in GenBank format as well as a non-redundant protein database formatted for BlastP/BlastX searches. If possible, providing a reference genome allows Pseudofinder to include dN/dS calculations to identify pseudogenes.
- The software developers recommend GenBank (.gbf/.gbk) files generated by Prokka with the --compliant and --rfam flags. Annotating rRNAs, tRNAs, and other ncRNAs in Prokka is recommended to eliminate false positive ‘pseudogene’ candidates: ORFs overlapping with non-coding RNAs such as rRNA are sometimes misannotated in databases as ‘hypothetical proteins’.
- Very strict gene length cut-offs in Pseudofinder (--length_pseudo >0.90) should be avoided, since they can lead to biased pseudogene calls in short proteins due to signal peptide presence/absence (<35 AA difference).
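To see why a strict length cut-off penalises short proteins, a small worked calculation can help. The sketch below uses a simplified model in which a gene is compared to its database hit by the ratio of their lengths; this is an assumption for illustration, not Pseudofinder's exact implementation. The helper names `length_ratio` and `below_cutoff`, the protein lengths, and the lenient cut-off of 0.65 are all hypothetical.

```shell
# Hypothetical helper: ratio of a query protein's length to its database hit's length.
# $1 = query length (AA), $2 = hit length (AA)
length_ratio() {
    awk -v q="$1" -v r="$2" 'BEGIN { printf "%.2f\n", q / r }'
}

# Hypothetical helper: succeeds (exit 0) if the ratio falls below the cut-off.
# $1 = query length (AA), $2 = hit length (AA), $3 = cut-off as a fraction
below_cutoff() {
    awk -v q="$1" -v r="$2" -v c="$3" 'BEGIN { exit !(q / r < c) }'
}

# Short protein: 120 AA query, 150 AA hits that carry a 30 AA signal peptide.
#   length_ratio 120 150    prints 0.80
# Long protein with the same 30 AA difference: 470 AA query, 500 AA hits.
#   length_ratio 470 500    prints 0.94
```

In this model, `below_cutoff 120 150 0.90` succeeds, so the short protein would be flagged under a 0.90 cut-off, while `below_cutoff 470 500 0.90` fails even though the absolute difference is the same 30 AA. With a more lenient cut-off such as 0.65, neither protein would be flagged.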
Database recommendations:
- Database selection is critical to the speed and sensitivity of Pseudofinder. Users can provide any non-redundant protein database formatted for BlastP/BlastX searches, but should keep in mind that larger databases increase runtime, while smaller databases can lose sensitivity if they lack relevant protein sequences. For those who don’t have a manually curated database tailored to their specific microbe, we recommend the NCBI-NR (non-redundant) protein database (or a similar one such as SwissProt).
- If you have access to the bioinformatics share, then SwissProt is already installed.
More on how pseudogenes are detected and what categories are identified by Pseudofinder is described here.
mkdir -p results/pseudofinder
#run pseudofinder without a reference genome
conda activate pseudofinder_1.1.0
#set a variable to access the reference protein database
pseudo_db="/zfs/omics/projects/bioinformatics/databases/uniprot/uniprot_sprot_070524.fasta"
srun --cpus-per-task 20 --mem=10G pseudofinder.py annotate \
--genome data/prokka/GCF_000005845.gbk \
--outprefix results/pseudofinder/pseudofinder \
--database "$pseudo_db" --threads 20
conda deactivate
Check out this page for more information about the individual output files.
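As a quick first look at the results, the number of reported features in a GFF output file can be counted from the shell. This is a minimal sketch: `count_gff_features` is a hypothetical helper, and the file name used in the usage note (`pseudofinder_pseudos.gff`) is an assumption about how Pseudofinder names its pseudogene GFF for the `--outprefix` used above, so check the output documentation for the exact names.

```shell
# Hypothetical helper: count feature lines in a GFF file.
# A feature line is a non-comment line with the 9 tab-separated GFF columns.
count_gff_features() {
    awk -F'\t' '!/^#/ && NF == 9 { n++ } END { print n + 0 }' "$1"
}
```

Under the naming assumption above, the call would be `count_gff_features results/pseudofinder/pseudofinder_pseudos.gff`, which prints the number of pseudogene candidates listed in that file.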