# Define location of COG HMM and mapping file
cog_mapping="/zfs/omics/projects/bioinformatics/databases/cog/2024_release/cog-24.def.tab"
cog_hmmdb="/zfs/omics/projects/bioinformatics/databases/cog/2024_release/hmm/NCBI_COGs_Nov2024.hmm"
# Generate output folders
mkdir -p results/cog/
# Run hmmsearch against all COGs
hmmsearch \
--tblout results/cog/sequence_results.txt \
--domtblout results/cog/domain_results.txt \
--notextw --cpu 20 \
$cog_hmmdb \
data/GCF_003697165_2.faa
COG database
Introduction
The Clusters of Orthologous Genes (COG) database provides a comprehensive functional annotation of widespread bacterial and archaeal genes by clustering their protein products by sequence similarity reflecting their common evolutionary origin (Tatusov, Koonin, and Lipman 1997; Galperin et al. 2024) . The current versions was generated based on genomes from 2103 bacteria and 193 archaea, in most cases, with a single representative genome per genus.
For more information, please also have a look at the NCBI COG website.
Installation
Available on crunchomics: Yes,
- A COG HMM database is installed as part of the bioinformatics share. If you have access to Crunchomics and have not yet access to the bioinformatics share you can send an email with your Uva netID to Nina Dombrowski.
- HMMER is installed on Crunchomics by default
The COG database can be found here:
/zfs/omics/projects/bioinformatics/databases/cog/2024_release
- The original data can be found here.
- If you want to generate your own HMM database, instructions can be found at
/zfs/omics/projects/bioinformatics/databases/cog/scripts/generate_COG_hmms.md
Example usage
The COG database can be used against a query of proteins using HMMER. For example, if you have protein-coding genes of a genome of interest (GCF_003697165_2.faa) you could do the following:
The resulting table of per-sequence hits (tblout) and table of per-domain hits (domtblout) contain several hits per query with no e-value or bit-score cutoff. You can further parse the output using your favorite coding language. A brief bash example, that makes also use of a COG to description mapping file is outlined below:
# Format the full table and only select hits above a certain e-value
sed 's/ \+ /\t/g' results/cog/sequence_results.txt | \
sed '/^#/d'| sed 's/ /\t/g'| \
awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | \
awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > results/cog/sequence_results_red_e_cutoff.txt
# Get best hit/protein based on bit score, and e-value
sort -t$'\t' -k3,3gr -k4,4g results/cog/sequence_results_red_e_cutoff.txt | \
sort -t$'\t' --stable -u -k1,1 | \
sort -t$'\t' -k3,3gr -k4,4g > results/cog/temp1
# Merge with COG mapping file
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.3,2.2,2.5,1.4,1.3 <(LC_ALL=C sort -k2 results/cog/temp1) <(LC_ALL=C sort -k1 $cog_mapping) | LC_ALL=C sort > results/cog/temp2
# Add a header
echo -e "accession\tCOG\tCOG_Description\tCOG_PathwayID\tCOG_Pathway\tCOG_evalue\tCOG_bitscore" | \
cat - results/cog/temp2 > results/cog/NCBI_COGs2024.tsv
# Cleanup
rm results/cog/temp*