09 Transcriptomic and metabolomic data integration

Overview

Teaching: 45 min
Exercises: 15 min

Questions

How can I map differential genes to metabolic pathways?

How do I retrieve KEGG identifiers given a list of gene identifiers?

Objectives

Retrieve common metabolic identifiers such as KEGG KO by querying the Ensembl API with the biomartr package.

Be able to use MapMan to generate pathway visualisations.

1. Introduction
- 1.2 Setup
2. iPath webtool
6. Other data mining tools
7. Troubleshooting
8. Going further
- 8.1 Useful links
- 8.2. References

workflow overview

1. Introduction

1.2 Setup

2. iPath webtool

2.1 Tool presentation

So far, we have mostly looked at our transcriptomic results in isolation, although KEGG has started to give us some insight into metabolism. Let’s dig further with two tools: iPath and MapMan.

From the Interactive Pathways Explorer v3 home page:

Interactive Pathways Explorer (iPath) is a web-based tool for the visualization, analysis and customization of various pathway maps.
iPath provides extensive map customization and data mapping capabilities. Colors, width and opacity of any map element can be changed using various types of data (for example KEGG KOs, COGs or EC numbers).

We will use this webtool to map our genes on metabolic pathways to provide some contextual information.

home page of iPath 3 web interface

2.2 Retrieving Uniprot identifiers

From our list of differential genes, we are going to generate a list of Uniprot identifiers, since iPath3 can use these to highlight the corresponding reactions on its maps.

diff_arabidopsis_genes_annotated %>% 
  filter(uniprotswissprot != "") %>%                                       # to remove genes with no matching Uniprot entries
  unique() %>% 
  mutate(id_for_ipath = paste("UNIPROT",uniprotswissprot,sep = ":")) %>%   # to create an ID that iPath can use
  dplyr::select(id_for_ipath) %>%                                          # we keep only the relevant ID for further copy-pasting 
  write.table(., 
    file = "diff_genes_swissprot.tsv", 
    row.names = FALSE, 
    quote = FALSE)

If you open the diff_genes_swissprot.tsv file in a text editor, this is what you should see.

id_for_ipath
UNIPROT:Q9MAN1
UNIPROT:Q5XEZ0
UNIPROT:O24457
UNIPROT:Q9MAM1
UNIPROT:Q9LNJ9
UNIPROT:Q06402
... more lines ....
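To check the logic of this step without querying Ensembl, here is a minimal base R sketch run on a toy vector. The helper name `build_ipath_ids` is hypothetical. The accessions are taken from the output above, but the empty string and the repeated entry were added purely for illustration.

```r
# Toy vector of Swiss-Prot accessions (illustrative only):
# "" stands for a gene with no matching Uniprot entry.
toy_uniprot <- c("Q9MAN1", "", "Q5XEZ0", "Q5XEZ0")

# Hypothetical helper: drop missing/empty accessions, de-duplicate,
# then prefix each accession with "UNIPROT:".
build_ipath_ids <- function(uniprot_ids) {
  kept <- unique(uniprot_ids[!is.na(uniprot_ids) & uniprot_ids != ""])
  paste("UNIPROT", kept, sep = ":")
}

ipath_ids <- build_ipath_ids(toy_uniprot)
print(ipath_ids)
```

Unlike the pipeline above, which calls unique() on whole rows, this sketch de-duplicates the accessions themselves, so an accession shared by two genes appears only once.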

We can copy-paste this list in the box to the right.

If you then click on the “Submit data” button, you should see the following reactions highlighted.

first metabolic pathway map

OK, there is a lot of information here, possibly too much. Let’s filter out some genes based on their \(\log_{2}\) fold change.

We can calculate the median, 75th percentile and the 90th percentile of our fold changes.

diff_genes %>% 
  filter(log2FoldChange > 0) %>% 
  with(.,quantile(log2FoldChange, c(0.5,0.75,0.9)))

50%       75%       90% 
0.8695469 1.6819535 3.2561618 

Since 75% of the genes with a positive \(\log_{2}\) fold change have a value lower than 1.68, we are going to keep the genes with a higher fold change.
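The thresholding idea can be tried on a handful of made-up numbers first. This is a minimal base R sketch: the fold changes are hypothetical, not lesson results. It also shows that the 75th percentile depends on whether it is computed on the positive fold changes only or on all genes.

```r
# Hypothetical log2 fold changes (not real results)
log2fc <- c(-2.5, -1.0, 0.2, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0)

# Restrict to up-regulated genes before computing the percentile
positive <- log2fc[log2fc > 0]
threshold_positive <- quantile(positive, 0.75, names = FALSE)  # 2.5 with R's default quantile type

# For comparison: the same percentile computed on all values
threshold_all <- quantile(log2fc, 0.75, names = FALSE)         # 2.0 with R's default quantile type

# Keep only the fold changes above the positive-only threshold
kept <- positive[positive > threshold_positive]
print(kept)
```

Because negative values pull the percentile down, computing it on all genes would give a lower cut-off and let more genes through than the 75th percentile of the positive fold changes described above.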

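diff_genes_filtered = 
  diff_genes %>% 
  filter(log2FoldChange > 0) %>%                              # keep up-regulated genes only, as in the quantile calculation
  filter(log2FoldChange > quantile(log2FoldChange, 0.75))     # keep genes above the 75th percentile of positive fold changes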

# we query Ensembl again to retrieve the attributes
# attributes_to_retrieve = c("tair_symbol", "uniprotswissprot","entrezgene_id")
diff_arabidopsis_genes_annotated_2 <- biomartr::biomart(genes = diff_genes_filtered$genes,
                                                      mart       = "plants_mart",                 
                                                      dataset    = "athaliana_eg_gene",           
                                                      attributes = attributes_to_retrieve,        
                                                      filters =     "ensembl_gene_id" )  

diff_arabidopsis_genes_annotated_2 %>% 
  filter(uniprotswissprot != "") %>% 
  unique() %>% 
  mutate(id_for_ipath = paste("UNIPROT",uniprotswissprot,sep = ":")) %>% 
  dplyr::select(id_for_ipath) %>% 
  write.table(., file = "diff_genes_swissprot_2.tsv", row.names = FALSE, quote = FALSE)

Your diff_genes_swissprot_2.tsv file should now contain fewer entries, which will make the map slightly clearer.

simplified ipath metabolic map

5.2 MapMan

MapMan is a standalone software package that you can install on your laptop to create insightful metabolic visualisations of transcriptomic, proteomic and/or metabolomic data.

From Schwacke et al., 2019:

The MapMan framework was developed specifically for plants with the design goal to facilitate the visualization of omics data on plant pathways (Thimm et al., 2004). MapMan uses a simple hierarchical tree structure of terms referred to as “bins,” which describe biological contexts/concepts.
Major biological processes (e.g., photosynthesis) are encompassed in top-level bins, and each child bin represents a more narrowly focused subprocess or component within the context of the parent bin. Assignment of proteins to the lowest-level (i.e., leaf) bins was preferred in order to make the annotation as precise as possible, although assignment to abstract higher-level bins was supported.
Proteins were mostly assigned to a single bin, but for some proteins with functions in diverse biological processes it was necessary to correspondingly assign to multiple bins.

5.2.1 Requirements

You will first need to create an account and download the MapMan installer for your OS here. I’d recommend downloading the latest stable version (not the “release candidate”).

You will then have to download the so-called “mapping file”, an ontology relating genes to functional bins. You can also generate your own classification for your species of interest with the Mercator webtool, in which case you’ll need to supply a FASTA file containing your transcripts or proteins of interest.

The latest list of available mappings is here.

5.2.2 Pathway visualisation

One of the most used visualisations is the global overview of metabolism. Up-regulated genes are shown in red and down-regulated genes in blue.

mapman visualisation
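To produce this kind of view from your own results, the fold changes first have to be exported to a file. The sketch below is a minimal base R example under a stated assumption: that MapMan reads a tab-delimited text file with gene identifiers in the first column and numeric values (here \(\log_{2}\) fold changes) in the following column. Please check the MapMan documentation for the exact layout it expects. The data frame is a toy stand-in with hypothetical gene names.

```r
# Toy stand-in for the differential expression table (hypothetical values)
diff_genes_toy <- data.frame(
  genes          = c("gene_a", "gene_b", "gene_c"),
  log2FoldChange = c(2.1, -1.4, 0.7),
  stringsAsFactors = FALSE
)

# Write identifiers and fold changes as tab-delimited text.
# A temporary file is used here; pick a real path for your own data.
out_file <- tempfile(fileext = ".txt")
write.table(diff_genes_toy[, c("genes", "log2FoldChange")],
            file      = out_file,
            sep       = "\t",
            row.names = FALSE,
            quote     = FALSE)

# Read the file back to check it, and count genes by direction of change
reloaded <- read.delim(out_file, stringsAsFactors = FALSE)
n_up   <- sum(reloaded$log2FoldChange > 0)   # would be drawn in red
n_down <- sum(reloaded$log2FoldChange < 0)   # would be drawn in blue
print(c(up = n_up, down = n_down))
```

The sign of each value decides the colour, so export signed \(\log_{2}\) fold changes rather than absolute values.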

6. Other data mining tools

6.1 ThaleMine

https://bar.utoronto.ca/thalemine/begin.do

6.2 Expression atlas

https://www.ebi.ac.uk/gxa/home

6.3 BAR

http://www.bar.utoronto.ca/

6.4 CoExprViz

http://bioinformatics.psb.ugent.be/webtools/coexpr/

7. Troubleshooting

If biomart refuses to query Ensembl again, run this command:

biomaRt::biomartCacheClear() # to solve a known bug https://github.com/BioinformaticsFMRP/TCGAbiolinks/issues/335

This clears the biomaRt cache and allows you to run the Ensembl query again.
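If this happens often, the "clear the cache, then retry" step can be wrapped in a small helper. This is a minimal sketch rather than part of biomaRt or biomartr: the name `retry_after_cleanup` is hypothetical, and the query and cleanup functions below are stand-ins so that the example runs without a network connection.

```r
# Hypothetical helper: run a query; if it fails, run a cleanup step and retry once.
retry_after_cleanup <- function(query_fn, cleanup_fn) {
  tryCatch(
    query_fn(),
    error = function(e) {
      message("Query failed (", conditionMessage(e), "); cleaning up and retrying once")
      cleanup_fn()
      query_fn()
    }
  )
}

# Stand-ins that mimic a query failing until the cache is cleared
state <- new.env()
state$cache_is_stale <- TRUE

fake_query <- function() {
  if (state$cache_is_stale) stop("stale cache")
  "query ok"
}
fake_cleanup <- function() {
  state$cache_is_stale <- FALSE
}

result <- retry_after_cleanup(fake_query, fake_cleanup)
print(result)
```

In the lesson's context, `query_fn` would be a function wrapping the biomartr::biomart() call and `cleanup_fn` would be biomaRt::biomartCacheClear. The helper retries only once, so a genuine server outage still surfaces as an error.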

8. Going further

8.1 Useful links

8.2. References

The Cluster Profiler companion book, a great place to start.
Zhou et al. (2019) Metascape provides a biologist-oriented resource for the analysis of systems-level datasets. Nat Commun 10, 1523.
Yates et al. (2020) Ensembl 2020. Nucleic Acids Research, Volume 48, Issue D1, Pages D682–D688.
Tian et al. (2017) agriGO v2.0: a GO analysis toolkit for the agricultural community. Nucleic Acids Research, Volume 45, Issue W1, Pages W122–W129.
Schwacke et al. (2019) MapMan4: A Refined Protein Classification and Annotation Framework Applicable to Multi-Omics Data Analysis. Molecular Plant, 12(6):879-892.
Drost et al. (2017) Biomartr: genomic data retrieval with R. Bioinformatics 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821.
Darzi et al. (2018) iPath3.0: interactive pathways explorer v3. Nucleic Acids Research, Volume 46, Issue W1, Pages W510–W513.

Key Points

Integrating transcriptomic data with metabolic pathways requires mapping gene identifiers to the corresponding pathway identifiers.
