KOfam database

Introduction

Kyoto Encyclopedia of Genes and Genomes (KEGG) is a widely used reference knowledge base, which helps investigate genomic functions by linking genes to biological knowledge such as metabolic pathways and molecular networks (Aramaki et al. 2020). In KEGG, the KEGG Orthology (KO) database—a manually curated large collection of protein families (i.e. KO families)—serves as a baseline reference to link genes with other KEGG resources such as metabolic maps through K number identifiers.

KOfam is a database of profile hidden Markov models (HMMs) for KEGG Orthology (KO) families. Profiles are built from sequences in the KO database using CD-HIT, MAFFT and HMMER. Each profile has its own HMMER score threshold, which is used to decide whether the KO is assigned to a sequence.

Personal note: The HMMER score threshold is a good starting point for confidently assigning functions. However, for novel organisms that are underrepresented in the KEGG database, for example DPANN or CPR, it can be useful to also explore hits with lower scores.

Installation

Available on crunchomics: Yes

  • The KEGG database is installed as part of the bioinformatics share. If you have access to crunchomics but do not yet have access to the bioinformatics share, you can send an email with your UvA netID to Nina Dombrowski.

The database can be found here:

  • /zfs/omics/projects/bioinformatics/databases/kegg/release_2024_26-04

If you want to download the database yourself, you can run:

cd database_folder

wget https://www.genome.jp/ftp/db/kofam/profiles.tar.gz
wget https://www.genome.jp/ftp/db/kofam/ko_list.gz

tar -xzvf profiles.tar.gz

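# Create the combined database (hmmpress reports: Pressed and indexed 26206 HMMs)
cat profiles/*hmm > KO_db.hmm
hmmpress KO_db.hmm

# Prepare the mapping file: decompress it and replace spaces with underscores
gzip -d ko_list.gz
sed -i 's/ /_/g' ko_list

# Clean up the downloaded archive and the extracted single profiles
rm -r profiles
rm profiles.tar.gz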
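
Before running the cleanup step, it can be worth checking that the combined file holds as many profiles as there were input files. The following is a minimal sketch of such a check, not part of the official KOfam instructions. It builds a toy stand-in for the profiles folder in a temporary directory (the file contents and the variable names workdir, n_files and n_models are made up for illustration) and relies on every HMMER3 profile containing exactly one line that starts with NAME.

```shell
# Toy stand-in for the extracted profiles/ folder (hypothetical content,
# just enough structure to demonstrate the counting logic).
workdir=$(mktemp -d)
mkdir "$workdir/profiles"
for ko in K00001 K00002 K00003; do
    printf 'HMMER3/f [toy]\nNAME  %s\n//\n' "$ko" > "$workdir/profiles/$ko.hmm"
done

# Same concatenation step as in the download instructions
cat "$workdir"/profiles/*hmm > "$workdir/KO_db.hmm"

# Number of input files versus number of NAME records in the combined file
n_files=$(find "$workdir/profiles" -name '*.hmm' | wc -l | tr -d ' ')
n_models=$(grep -c '^NAME' "$workdir/KO_db.hmm")

if [ "$n_files" -eq "$n_models" ]; then
    echo "OK: $n_models profiles combined"
else
    echo "Mismatch: $n_files files but $n_models profiles" >&2
fi
```

On a real download, replace the toy folder with the extracted profiles directory; the count should then match the number reported by hmmpress.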

This directory then contains the following files:

  • KO_db.hmm: the combined profile HMMs of all KO groups.
  • ko_list: tab-separated file containing the following information:
    • knum … K number
    • threshold … score threshold
    • score_type … score type used for the KO (full or domain)
    • profile_type … whether poorly aligned sequences are removed (trim) or not (all) when building the profile
    • F-measure … F-measure obtained when calculating the threshold
    • nseq … number of sequences
    • nseq_used … number of sequences used to build the profile
    • alen … alignment length
    • mlen … length of consensus positions
    • eff_nseq … effective number of sequences
    • re/pos … relative entropy per position
    • definition … KO definition
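
To illustrate how the threshold column can be used, here is a minimal sketch that keeps only hits whose score reaches the KO-specific threshold, with an optional relaxation factor for exploring lower-scoring hits. Several things are assumptions and not part of the original text: the three-column hits table (sequence id, K number, score) is a hypothetical, already simplified format rather than raw HMMER output, the threshold values are invented and are not real KOfam thresholds, and the names workdir, filter_hits, strict and relaxed are made up. The sketch also ignores score_type; in practice the full or domain score should be compared depending on that column.

```shell
workdir=$(mktemp -d)

# Toy excerpt of ko_list (first three columns only).
# Thresholds are invented for illustration; they are NOT real KOfam values.
printf '%s\t%s\t%s\n' knum   threshold score_type  > "$workdir/ko_list"
printf '%s\t%s\t%s\n' K00001 100.0     domain     >> "$workdir/ko_list"
printf '%s\t%s\t%s\n' K00002 250.5     full       >> "$workdir/ko_list"
printf '%s\t%s\t%s\n' K00003 -         -          >> "$workdir/ko_list"

# Hypothetical hits table: sequence id, K number, score
printf '%s\t%s\t%s\n' seq1 K00001 150.2  > "$workdir/hits.tsv"
printf '%s\t%s\t%s\n' seq2 K00001 85.0  >> "$workdir/hits.tsv"
printf '%s\t%s\t%s\n' seq3 K00002 210.0 >> "$workdir/hits.tsv"
printf '%s\t%s\t%s\n' seq4 K00003 50.0  >> "$workdir/hits.tsv"

# Keep hits with score >= factor * threshold.
# KOs without a numeric threshold (header line, "-") are skipped,
# so their hits are dropped in this sketch.
filter_hits() {
    awk -F'\t' -v f="$1" '
        NR == FNR { if ($2 ~ /^[0-9.]+$/) thr[$1] = $2; next }
        ($2 in thr) && ($3 + 0 >= f * thr[$2])
    ' "$workdir/ko_list" "$workdir/hits.tsv"
}

strict=$(filter_hits 1.0)    # threshold as provided
relaxed=$(filter_hits 0.8)   # accept hits down to 80% of the threshold

echo "Strict:";  echo "$strict"
echo "Relaxed:"; echo "$relaxed"
```

With the toy values, only seq1 passes the strict filter, while seq2 and seq3 are additionally kept with the relaxed factor. Any relaxed hits should be treated as candidates that need further checking, not as confident assignments.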

KO mapping files

In addition to the ko_list, we provide the following mapping files:

/zfs/omics/projects/bioinformatics/databases/kegg/release_2024_26-04/pathway_to_kegg.tsv:

maps each KO to a pathway and looks like this:

pathway_desc	pathway_map	KO_id	KO_desc
Mineral absorption	map04978	K00510	HMOX1; heme oxygenase 1 [EC:1.14.14.18]
Mineral absorption	map04978	K00522	FTH1; ferritin heavy chain [EC:1.16.3.1]
Mineral absorption	map04978	K01539	ATP1A; sodium/potassium-transporting ATPase subunit alpha [EC:7.2.2.13]
Mineral absorption	map04978	K05849	SLC8A, NCX; solute carrier family 8 (sodium/calcium exchanger)
Mineral absorption	map04978	K05850	ATP2B; P-type Ca2+ transporter type 2B [EC:7.2.2.10]
Mineral absorption	map04978	K07213	ATOX1, ATX1, copZ, golB; copper chaperone
Mineral absorption	map04978	K08370	CYBRD1, Dcytb; plasma membrane ascorbate-dependent reductase [EC:7.2.1.3]
Mineral absorption	map04978	K14738	STEAP2; metalloreductase STEAP2 [EC:1.16.1.-]
Mineral absorption	map04978	K14739	MT1_2; metallothionein 1/2
Mineral absorption	map04978	K17686	copA, ctpA, ATP7; P-type Cu+ transporter [EC:7.2.2.8]
Mineral absorption	map04978	K21398	SLC11A2, DMT1, NRAMP2; natural resistance-associated macrophage protein 2
Mineral absorption	map04978	K21418	HMOX2; heme oxygenase 2 [EC:1.14.14.18]
Fatty acid biosynthesis	map00061	K00059	fabG, OAR1; 3-oxoacyl-[acyl-carrier protein] reductase [EC:1.1.1.100]
Fatty acid biosynthesis	map00061	K00208	fabI; enoyl-[acyl-carrier protein] reductase I [EC:1.3.1.9 1.3.1.10]
Fatty acid biosynthesis	map00061	K00645	fabD, MCAT, MCT1; [acyl-carrier-protein] S-malonyltransferase [EC:2.3.1.39]
Fatty acid biosynthesis	map00061	K00647	fabB; 3-oxoacyl-[acyl-carrier-protein] synthase I [EC:2.3.1.41]
Fatty acid biosynthesis	map00061	K00648	fabH; 3-oxoacyl-[acyl-carrier-protein] synthase III [EC:2.3.1.180]
Fatty acid biosynthesis	map00061	K00665	FASN; fatty acid synthase, animal type [EC:2.3.1.85]
Fatty acid biosynthesis	map00061	K00667	FAS2; fatty acid synthase subunit alpha, fungi type [EC:2.3.1.86]
Fatty acid biosynthesis	map00061	K00668	FAS1; fatty acid synthase subunit beta, fungi type [EC:2.3.1.86]
Fatty acid biosynthesis	map00061	K01071	MCH; medium-chain acyl-[acyl-carrier-protein] hydrolase [EC:3.1.2.21]
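
As a rough illustration of how this mapping file can be queried, the sketch below lists all KOs of one pathway and counts the KOs per pathway with awk. It assumes the file is tab-separated with one header line, as the .tsv extension and the excerpt suggest. To stay self-contained it writes a small excerpt of the rows above to a temporary file; the variable names (workdir, tsv, fatty_kos, counts) are made up for this example.

```shell
workdir=$(mktemp -d)
tsv="$workdir/pathway_to_kegg.tsv"

# Small excerpt of the mapping file, written with tab separators
printf '%s\t%s\t%s\t%s\n' 'pathway_desc' 'pathway_map' 'KO_id' 'KO_desc' > "$tsv"
printf '%s\t%s\t%s\t%s\n' 'Mineral absorption' 'map04978' 'K00510' 'HMOX1; heme oxygenase 1 [EC:1.14.14.18]' >> "$tsv"
printf '%s\t%s\t%s\t%s\n' 'Mineral absorption' 'map04978' 'K00522' 'FTH1; ferritin heavy chain [EC:1.16.3.1]' >> "$tsv"
printf '%s\t%s\t%s\t%s\n' 'Fatty acid biosynthesis' 'map00061' 'K00059' 'fabG, OAR1; 3-oxoacyl-[acyl-carrier protein] reductase [EC:1.1.1.100]' >> "$tsv"
printf '%s\t%s\t%s\t%s\n' 'Fatty acid biosynthesis' 'map00061' 'K00208' 'fabI; enoyl-[acyl-carrier protein] reductase I [EC:1.3.1.9 1.3.1.10]' >> "$tsv"
printf '%s\t%s\t%s\t%s\n' 'Fatty acid biosynthesis' 'map00061' 'K00645' 'fabD, MCAT, MCT1; [acyl-carrier-protein] S-malonyltransferase [EC:2.3.1.39]' >> "$tsv"

# All KO ids that belong to one pathway map (header skipped with NR > 1)
fatty_kos=$(awk -F'\t' -v map="map00061" 'NR > 1 && $2 == map { print $3 }' "$tsv")

# Number of KOs per pathway, sorted by pathway name
counts=$(awk -F'\t' 'NR > 1 { n[$1]++ } END { for (p in n) print p "\t" n[p] }' "$tsv" | sort)

echo "$fatty_kos"
echo "$counts"
```

On crunchomics, the same awk commands can be pointed at the full pathway_to_kegg.tsv instead of the temporary excerpt.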

/zfs/omics/projects/bioinformatics/databases/kegg/release_2024_26-04/modules_to_kegg.tsv:

contains all genes (KOs) that are part of a KEGG module, in the order in which they occur within the module:

Module	module description	KO id	Order	KO desc
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K00844	1	HK; hexokinase [EC:2.7.1.1]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K12407	1	GCK; glucokinase [EC:2.7.1.2]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K00845	1	glk; glucokinase [EC:2.7.1.2]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K25026	1	glk; glucokinase [EC:2.7.1.2]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K00886	1	ppgK; polyphosphate glucokinase [EC:2.7.1.63]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K08074	1	ADPGK; ADP-dependent glucokinase [EC:2.7.1.147]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K00918	1	pfkC; ADP-dependent phosphofructokinase/glucokinase [EC:2.7.1.146 2.7.1.147]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K01810	2	GPI, pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K06859	2	pgi1; glucose-6-phosphate isomerase, archaeal [EC:5.3.1.9]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K13810	2	tal-pgi; transaldolase / glucose-6-phosphate isomerase [EC:2.2.1.2 5.3.1.9]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K15916	2	pgi-pmi; glucose/mannose-6-phosphate isomerase [EC:5.3.1.9 5.3.1.8]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K00850	3	pfkA, PFK; 6-phosphofructokinase 1 [EC:2.7.1.11]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K16370	3	pfkB; 6-phosphofructokinase 2 [EC:2.7.1.11]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K21071	3	pfk, pfp; ATP-dependent phosphofructokinase / diphosphate-dependent phosphofructokinase [EC:2.7.1.11 2.7.1.90]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K00918	3	pfkC; ADP-dependent phosphofructokinase/glucokinase [EC:2.7.1.146 2.7.1.147]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K01623	4	ALDO; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K01624	4	FBA, fbaA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]
M00001	Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate	K11645	4	fbaB; fructose-bisphosphate aldolase, class I [EC:4.1.2.13]
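
One possible use of this file is to check which steps of a module are covered by the KOs detected in a genome. The sketch below is only a rough illustration under several assumptions: the file is tab-separated with one header line, KOs that share the same Order value are treated as alternatives for the same step (as the excerpt above suggests), and a step counts as present as soon as any one of its KOs is found, which ignores multi-subunit complexes. The list of detected KOs and the names workdir, mod, kos and coverage are hypothetical.

```shell
workdir=$(mktemp -d)
mod="$workdir/modules_to_kegg.tsv"
kos="$workdir/genome_kos.txt"
desc='Glycolysis (Embden-Meyerhof pathway), glucose => pyruvate'

# Small excerpt of the mapping file, written with tab separators
printf '%s\t%s\t%s\t%s\t%s\n' 'Module' 'module description' 'KO id' 'Order' 'KO desc' > "$mod"
printf '%s\t%s\t%s\t%s\t%s\n' M00001 "$desc" K00844 1 'HK; hexokinase [EC:2.7.1.1]' >> "$mod"
printf '%s\t%s\t%s\t%s\t%s\n' M00001 "$desc" K00845 1 'glk; glucokinase [EC:2.7.1.2]' >> "$mod"
printf '%s\t%s\t%s\t%s\t%s\n' M00001 "$desc" K01810 2 'GPI, pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]' >> "$mod"
printf '%s\t%s\t%s\t%s\t%s\n' M00001 "$desc" K00850 3 'pfkA, PFK; 6-phosphofructokinase 1 [EC:2.7.1.11]' >> "$mod"
printf '%s\t%s\t%s\t%s\t%s\n' M00001 "$desc" K01624 4 'FBA, fbaA; fructose-bisphosphate aldolase, class II [EC:4.1.2.13]' >> "$mod"

# Hypothetical list of KOs detected in a genome, one K number per line
printf '%s\n' K00845 K00850 > "$kos"

coverage=$(awk -F'\t' '
    NR == FNR { found[$1] = 1; next }      # first file: detected KOs
    FNR == 1  { next }                     # skip header of the mapping file
    {
        key = $1 "\t" $4                   # module id + step number
        if (!(key in seen)) { seen[key] = 0; order[++n] = key }
        if ($3 in found) seen[key] = 1     # any alternative KO covers the step
    }
    END {
        for (i = 1; i <= n; i++)
            print order[i] "\t" (seen[order[i]] ? "present" : "missing")
    }
' "$kos" "$mod")

echo "$coverage"
```

With the toy input, steps 1 and 3 of M00001 are reported as present and steps 2 and 4 as missing.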

References

Aramaki, Takuya, Romain Blanc-Mathieu, Hisashi Endo, Koichi Ohkubo, Minoru Kanehisa, Susumu Goto, and Hiroyuki Ogata. 2020. “KofamKOALA: KEGG Ortholog Assignment Based on Profile HMM and Adaptive Score Threshold.” Bioinformatics 36 (7): 2251–52. https://doi.org/10.1093/bioinformatics/btz859.