COG database

Introduction

The Clusters of Orthologous Genes (COG) database provides comprehensive functional annotation of widespread bacterial and archaeal genes by clustering their protein products according to sequence similarity, which reflects their common evolutionary origin (Tatusov, Koonin, and Lipman 1997; Galperin et al. 2024). The current version was generated from the genomes of 2103 bacteria and 193 archaea, in most cases with a single representative genome per genus.

For more information, please also have a look at the NCBI COG website.

Installation

Available on Crunchomics: Yes

  • A COG HMM database is installed as part of the bioinformatics share. If you have access to Crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski.
  • HMMER is installed on Crunchomics by default.

The COG database can be found here:

  • /zfs/omics/projects/bioinformatics/databases/cog/2024_release
  • The original data can be found here.
  • If you want to generate your own HMM database, instructions can be found at /zfs/omics/projects/bioinformatics/databases/cog/scripts/generate_COG_hmms.md
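
Before starting a search, it can help to confirm that the database files are readable from your account. The snippet below is a minimal sketch of such a check, not part of the official setup: `check_readable` is a hypothetical helper name, and the two file names are the ones used in the example usage section of this page.

```shell
# Hypothetical helper: print one status line per file and
# return a non-zero exit code if any file is not readable.
check_readable() {
    local missing=0
    for f in "$@"; do
        if [ -r "$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            missing=1
        fi
    done
    return "$missing"
}

# Location of the COG release on Crunchomics
cog_dir="/zfs/omics/projects/bioinformatics/databases/cog/2024_release"

check_readable \
    "$cog_dir/cog-24.def.tab" \
    "$cog_dir/hmm/NCBI_COGs_Nov2024.hmm" \
    || echo "At least one file is not readable; check your access to the bioinformatics share."
```

If a file is reported as missing, the most likely cause is that your account has not yet been added to the bioinformatics share.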

Example usage

The COG HMMs can be searched against a set of proteins using HMMER. For example, if you have the protein-coding genes of a genome of interest (GCF_003697165_2.faa), you could do the following:

# Define location of COG HMM and mapping file 
cog_mapping="/zfs/omics/projects/bioinformatics/databases/cog/2024_release/cog-24.def.tab" 
cog_hmmdb="/zfs/omics/projects/bioinformatics/databases/cog/2024_release/hmm/NCBI_COGs_Nov2024.hmm"

# Generate output folders
mkdir -p results/cog/

# Run hmmsearch against all COGs 
hmmsearch \
    --tblout results/cog/sequence_results.txt \
    --domtblout results/cog/domain_results.txt \
    --notextw --cpu 20 \
    $cog_hmmdb \
    data/GCF_003697165_2.faa

The resulting table of per-sequence hits (tblout) and table of per-domain hits (domtblout) can contain several hits per protein, because no e-value or bit-score cutoff was set and only HMMER's permissive default reporting threshold (an e-value of at most 10) applies. You can parse the output further using your preferred programming language. A brief bash example that also uses a COG-to-description mapping file is outlined below:

# Convert the table to tab-separated columns, keep protein ID, COG ID, bit score and e-value,
# and retain only hits with an e-value of at most 1e-3
sed 's/ \+ /\t/g' results/cog/sequence_results.txt | \
    sed '/^#/d'| sed 's/ /\t/g'| \
    awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | \
    awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3'  > results/cog/sequence_results_red_e_cutoff.txt

# Keep the best hit per protein: highest bit score first, then lowest e-value
sort -t$'\t' -k3,3gr -k4,4g  results/cog/sequence_results_red_e_cutoff.txt | \
    sort -t$'\t' --stable -u -k1,1  | \
    sort -t$'\t' -k3,3gr -k4,4g >  results/cog/temp1

# Merge with COG mapping file 
# (both inputs are sorted on the join field: the COG ID)
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.3,2.2,2.5,1.4,1.3 \
    <(LC_ALL=C sort -t $'\t' -k2,2 results/cog/temp1) \
    <(LC_ALL=C sort -t $'\t' -k1,1 "$cog_mapping") | \
    LC_ALL=C sort > results/cog/temp2

# Add a header
echo -e "accession\tCOG\tCOG_Description\tCOG_PathwayID\tCOG_Pathway\tCOG_evalue\tCOG_bitscore" | \
    cat - results/cog/temp2 > results/cog/NCBI_COGs2024.tsv

# Cleanup
rm results/cog/temp*
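
The e-value filter and best-hit selection above can also be done in a single awk pass. The block below is a minimal sketch of that alternative, run on a small made-up tblout file so that it works without HMMER or the database. The protein IDs, COG IDs and scores in the toy input are invented for illustration, and the variable names (`workdir`, `max_evalue`) are assumptions. Unlike the sort-based version, ties in bit score are resolved by keeping the first hit encountered rather than the one with the lowest e-value.

```shell
# Create a toy tblout file (invented values) in a temporary folder
workdir=$(mktemp -d)
cat > "$workdir/toy_tblout.txt" <<'EOF'
# target name  accession  query name  accession  E-value  score  bias
prot_1         -          COG0001     -          1.2e-50  170.3  0.1
prot_1         -          COG0002     -          3.0e-10   40.2  0.0
prot_2         -          COG0003     -          0.5        5.1  0.0
prot_3         -          COG0004     -          2e-05     25.0  0.0
prot_3         -          COG0005     -          1e-20     70.5  0.0
EOF

# awk splits on runs of whitespace by default, so the first six columns
# (protein ID, -, COG ID, -, e-value, bit score) can be read directly
awk -v OFS='\t' -v max_evalue=1e-3 '
    /^#/ { next }                             # skip comment lines
    ($5 + 0) > (max_evalue + 0) { next }      # drop hits above the e-value cutoff
    !($1 in best) || ($6 + 0) > best[$1] {    # first hit, or better bit score
        best[$1] = $6 + 0
        row[$1]  = $1 OFS $3 OFS $6 OFS $5    # protein, COG, bit score, e-value
    }
    END { for (p in row) print row[p] }
' "$workdir/toy_tblout.txt" | LC_ALL=C sort > "$workdir/best_hits.tsv"

cat "$workdir/best_hits.tsv"
```

For the toy input, prot_2 is dropped because its e-value of 0.5 exceeds the cutoff, prot_1 keeps COG0001 and prot_3 keeps COG0005, the higher-scoring of its two hits. The output has the same four columns as results/cog/temp1, so it could be passed to the join step in the same way.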

References

Galperin, Michael Y., Roberto Vera Alvarez, Svetlana Karamycheva, Kira S. Makarova, Yuri I. Wolf, David Landsman, and Eugene V. Koonin. 2024. “COG Database Update 2024.” Nucleic Acids Research, November. https://doi.org/10.1093/nar/gkae983.
Tatusov, Roman L., Eugene V. Koonin, and David J. Lipman. 1997. “A Genomic Perspective on Protein Families.” Science 278 (5338): 631–37. https://doi.org/10.1126/science.278.5338.631.