pseudofinder

Introduction

Pseudofinder is a bioinformatics tool that detects pseudogene candidates in annotated GenBank files of bacterial and archaeal genomes (Syberg-Olsen et al. 2022).

It has been tested mostly on GenBank (.gbf/.gbk) files annotated by Prokka with the --compliant flag (i.e. including both /gene and /CDS annotations).
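To check whether a GenBank file carries both annotation types before running Pseudofinder, one option is to count the gene and CDS features. The sketch below assumes the standard GenBank feature-table layout, in which feature keys are indented by five spaces. The function name count_features is made up for this example, and the path is the example input used in the usage section.

```shell
#!/usr/bin/env bash
# Count 'gene' and 'CDS' feature keys in a GenBank file.
# Assumption: feature keys start in column 6 (five leading spaces),
# as in the standard GenBank flat-file feature table.
count_features() {
    local gbk="$1"
    local genes cds
    genes=$(grep -c '^     gene  ' "$gbk")
    cds=$(grep -c '^     CDS  ' "$gbk")
    echo "gene=${genes} CDS=${cds}"
}

# Example path; adjust to your own annotation file
gbk="data/prokka/GCF_000005845.gbk"
if [ -f "$gbk" ]; then
    count_features "$gbk"
fi
```

If the gene count is zero while CDS features are present, the file was probably not annotated with --compliant.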

There are alternative programs for pseudogene finding and annotation (e.g. the NCBI Prokaryotic Genome Annotation Pipeline), but to the best of the software developers' knowledge, none of them are open source or allow easy fine-tuning of parameters.

For more information, please check the wiki.

Installation

Installed on crunchomics: Yes

Pseudofinder v1.1.0 is installed as part of the bioinformatics share. If you have access to crunchomics but not yet to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski.
Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command again):

conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/

If you want to install it yourself, you can run:

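#go to the folder in which you want to install the software (example path, adjust to your own setup)
cd /home/ndombro/personal/software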
git clone https://github.com/filip-husnik/pseudofinder.git
cd pseudofinder
bash setup.sh
conda activate pseudofinder

#install a reference database, e.g. SwissProt
cd path_to_databases/uniprot
wget https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
gzip -d uniprot_sprot.fasta.gz 
makeblastdb -in uniprot_sprot.fasta -parse_seqids -dbtype prot

Usage

Input files:

Pseudofinder requires the user to provide the genome in GenBank format as well as a non-redundant protein database formatted for BlastP/BlastX searches. If a reference genome is available, providing it allows Pseudofinder to include dN/dS calculations when identifying pseudogenes.
The software developers recommend GenBank (.gbf/.gbk) files generated by Prokka with the --compliant and --rfam flags. Annotating rRNAs, tRNAs, and other ncRNAs in Prokka is recommended to eliminate false-positive ‘pseudogene’ candidates, because ORFs overlapping with non-coding RNAs such as rRNA are sometimes misannotated in databases as ‘hypothetical proteins’.
Very strict gene length cut-offs in Pseudofinder (--length_pseudo >0.90) should be avoided, since they can lead to biased pseudogene calls in short proteins due to the presence or absence of a signal peptide (<35 AA difference).
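The effect of a strict cut-off can be illustrated with a small calculation. This is a sketch with made-up numbers. It assumes that the cut-off is applied to the ratio between a gene's protein length and the average length of its database hits; please check the Pseudofinder wiki for the exact rule. The function name check_length is made up for this example.

```shell
# Illustrative only: the lengths and the rule below are assumptions, not Pseudofinder output.
# $1 = query protein length (aa), $2 = average hit length (aa), $3 = length cut-off
check_length() {
    awk -v q="$1" -v ref="$2" -v cut="$3" 'BEGIN {
        r = q / ref
        printf "%.3f %s\n", r, (r < cut ? "flagged" : "kept")
    }'
}

# A 200 aa protein whose 25 aa signal peptide is missing in the query (175 aa)
check_length 175 200 0.90   # prints: 0.875 flagged
check_length 175 200 0.65   # prints: 0.875 kept
```

In this example the 25 AA difference alone is enough to flag an otherwise intact short protein at a 0.90 cut-off, while a more lenient cut-off keeps it.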

Database recommendations:

Database selection is critical to the speed and sensitivity of Pseudofinder. Users can provide any database they like, as long as it is a non-redundant protein database formatted for BlastP/BlastX searches. Keep in mind that larger databases increase the runtime, while smaller databases can reduce sensitivity if they lack relevant protein sequences. For those who don’t have manually curated databases tailored to their specific microbe, the developers recommend the NCBI-NR (non-redundant) protein database (or a similar one such as SwissProt).
If you have access to the bioinformatics share, then SwissProt is already installed.
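Before submitting a job, it can be useful to confirm that the database has been formatted for BLAST. The sketch below assumes a single-volume protein database created with makeblastdb, which writes .phr, .pin and .psq files next to the FASTA file. Very large databases are split into numbered volumes and would need a different check. The function name has_blast_protein_db is made up for this example.

```shell
# Return 0 if the core BLAST protein index files exist for a FASTA prefix.
# Assumption: single-volume database created with 'makeblastdb -dbtype prot'.
has_blast_protein_db() {
    local prefix="$1"
    local ext
    for ext in phr pin psq; do
        [ -f "${prefix}.${ext}" ] || return 1
    done
    return 0
}

# SwissProt database on the bioinformatics share
pseudo_db="/zfs/omics/projects/bioinformatics/databases/uniprot/uniprot_sprot_070524.fasta"
if has_blast_protein_db "$pseudo_db"; then
    echo "BLAST database found for ${pseudo_db}"
else
    echo "No BLAST database found for ${pseudo_db}; run makeblastdb first"
fi
```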

More on how pseudogenes are detected and what categories are identified by Pseudofinder is described here.

mkdir -p results/pseudofinder

#activate the environment
conda activate pseudofinder_1.1.0

#set variable to access the reference protein database
pseudo_db="/zfs/omics/projects/bioinformatics/databases/uniprot/uniprot_sprot_070524.fasta"

#run pseudofinder without a reference genome
srun --cpus-per-task 20 --mem=10G pseudofinder.py annotate \
    --genome data/prokka/GCF_000005845.gbk \
    --outprefix results/pseudofinder/pseudofinder \
    --database "$pseudo_db" --threads 20

conda deactivate

Check out this page for more information about the individual output files.

References

Syberg-Olsen, Mitchell J, Arkadiy I Garber, Patrick J Keeling, John P McCutcheon, and Filip Husnik. 2022. “Pseudofinder: Detection of Pseudogenes in Prokaryotic Genomes.” Molecular Biology and Evolution 39 (7): msac153. https://doi.org/10.1093/molbev/msac153.