conda config --add envs_dirs /zfs/omics/projects/bioinformatics/software/miniconda3/envs/
Prokka
Introduction
Prokka is a software tool to annotate bacterial, archaeal and viral (but NOT eukaryotic) genomes quickly and produce standards-compliant output files (Seemann 2014). Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labeling them with useful information. Briefly, Prokka uses a two-step process for the annotation of protein coding regions:
- protein coding regions on the genome are identified using Prodigal
- the function of the encoded protein is predicted by similarity to proteins in one of many protein or protein domain databases. These databases are ISfinder, NCBI Bacterial Antimicrobial Resistance Reference Gene Database and UniProtKB. You can also add your own databases, such as the TIGRFAM or Pfam databases.
Visit the manual for more information.
Installation
Installed on crunchomics: Yes,
- Prokka v1.14.6 is installed as part of the bioinformatics share. If you have access to crunchomics and have not yet access to the bioinformatics you can send an email with your Uva netID to Nina Dombrowski.
- Afterwards, you can add the bioinformatics share as follows (if you have already done this in the past, you don’t need to run this command):
If you want to install it yourself, you can run:
mamba create -n prokka -c conda-forge -c bioconda -c defaults prokka
#setup extra databases (optional, but useful for better functional assignments)
#to do this:
#cd into whatever folder prokka gets installed followed by db/hmm,
#this might look something like this:
cd /zfs/omics/projects/bioinformatics/software/miniconda3/envs/prokka_1.14.6/db/hmm
wget https://ftp.ncbi.nlm.nih.gov/hmm/TIGRFAMs/release_15.0/TIGRFAMs_15.0_HMM.LIB.gz
gzip -d TIGRFAMs_15.0_HMM.LIB.gz
mv TIGRFAMs_15.0_HMM.LIB 1-TIGRFAMs_15.0.hmm
wget ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gzip -d Pfam-A.hmm.gz
mv Pfam-A.hmm 2-Pfam-A.hmm
mv HAMAP.hmm 3-HAMAP.hmm
#ensure that the prokka databases are setup correctly
conda activate prokka
prokka --setupdb
conda deactivate
Usage
Let’s assume you downloaded a genome from NCBI and want to annotate the genome or if you have a set of MAGs and you want to identify the coding regions.
Required input: The contigs from a genome of interest
Generated output:
.gff
This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV..gbk
This is a standard Genbank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-Genbank, with one record for each sequence..fna
Nucleotide FASTA file of the input contig sequences..faa
Protein FASTA file of the translated CDS sequences..ffn
Nucleotide FASTA file of all the prediction transcripts (CDS, rRNA, tRNA, tmRNA, misc_RNA).sqn
An ASN1 format “Sequin” file for submission to Genbank. It needs to be edited to set the correct taxonomy, authors, related publication etc..fsa
Nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines..tbl
Feature Table file, used by “tbl2asn” to create the .sqn file..err
Unacceptable annotations - the NCBI discrepancy report..log
Contains all the output that Prokka produced during its run. This is a record of what settings you used, even if the –quiet option was enabled..txt
Statistics relating to the annotated features found..tsv
Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product
mkdir data
#download genome
wget -O - https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz | gzip -d > data/GCF_000005845.2_ASM584v2_genomic.fna
conda activate prokka_1.14.6
prokka \
--outdir data/prokka \
--prefix GCF_000005845 \
\
data/GCF_000005845.2_ASM584v2_genomic.fna --cpus 4 --kingdom Bacteria
conda deactivate
For a list of all options, visit the manual. Some key options are:
--outdir
[X] Output folder [auto] (default ’’)--force
Force overwriting existing output folder (default OFF)--prefix
[X] Filename output prefix [auto] (default ’’)--addgenes
Add ‘gene’ features for each ‘CDS’ feature (default OFF)--locustag
[X] Locus tag prefix (default ‘PROKKA’)--increment
[N] Locus tag counter increment (default ‘1’)--kingdom
[X] Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default ‘Bacteria’)--gcode
[N] Genetic code / Translation table (set if –kingdom is set) (default ‘0’)--metagenome
Improve gene predictions for highly fragmented genomes (default OFF)--norrna
Don’t run rRNA search (default OFF)--notrna
Don’t run tRNA search (default OFF)--compliant
: Force Genbank/ENA/DDJB compliance: –addgenes –mincontiglen 200 –centre XXX (default OFF). This setting can be useful if you need a specific format for downstream analyses.
Unsure what genetic code/translation table to use? Check out wikipedia. You mainly need to watch out for this when working with mitochondria, mycoplasma and spiroplasma.