# Advanced Data Exploration on dataset #2

## Overview

Teaching:30 min

Exercises:30 minQuestions

Objectives

# Table of Contents

- 1. Data exploration
- 2. Finding the differential CpG sites
- 3. Heatmap of the differential CpG
- 4. Exercise
- References

# 1. Data exploration

## 1.1 First steps

First things first. Let’s download and import the CpG beta values into R.

```
download.file(url = "https://zenodo.org/record/4056406/files/cpg_methylation_beta_values.tsv?download=1",
destfile = "data/cpg_methylation_beta_values.tsv",
method = "wget",
quiet = TRUE)
df_cpg <- read.delim("data/cpg_methylation_beta_values.tsv",
header = TRUE,
stringsAsFactors = FALSE)
```

## Question

- Can you show the first 5 rows and 5 columns?
- How many CpG sites are being investigated?
- How many different samples?
## Solution

`df_cpg[1:5,1:5]`

`nrow(df_cpg)`

`ncol(df_cpg`

We can then download and import the age to sample correspondence into R.

```
download.file(url = "https://zenodo.org/record/4056406/files/cpg_methylation_sample_age.tsv?download=1",
destfile = "data/cpg_methylation_sample_age.tsv",
method = "wget",
quiet = TRUE)
# Import sample to age correspondence
sample_age <- read.delim("data/cpg_methylation_sample_age.tsv",
header = TRUE,
stringsAsFactors = FALSE)
```

## 1.2 Number and distribution of beta values

## Question

Are the number of CpG per sample comparable?

## Solution

Yes absolutely. In fact, there are exactly 27,578 CG sites measured for each sample.

`df_cpg_tidy %>% group_by(sample) %>% tally() %>% summary()`

Since the beta values arise from a microarray hybridization, there is *always* a signal associated to each CpG probe set. Some signals might be filtered but usually each probe get assigned a value.

## Question

Can you make a density plot to show the distribution of beta values across the different samples? Overlay all density curves so you can see if one sample is very different from the others (possible outlier).

## Solution

`df_cpg_tidy %>% ggplot(data = ., aes(x = beta_coef, fill = sample)) + geom_density(alpha = 0.1) + guides(fill = FALSE)`

This is what you should see:

# 2. Finding the differential CpG sites

As a way to represent CpG methylation values, a heatmap of CpG beta values clustered on both samples and CpG could reveal groups of CpG islands related to samples of the same age. This would be our first attempt to relate age to methylation values. Yet, we need to take it a step backward because…

## Discussion

What could be a potential issue when creating a heatmap based on the complete set of CpG values?

As we have thousands of CpGs, this would create a massive probably uninformative heatmap. Instead, we could keep the CpGs that shows contrasted methylation values across the different samples. Non-differential CpG islands would unlikely to be related to age differences.

## 2.1 Data massage

First, we need to add the age to each sample to the `df_cpg_tidy`

tibble.

```
# left join to add age to each sample using the "sample" column as common key
df_cpg_tidy_with_age = dplyr::left_join(df_cpg_tidy,
sample_age,
by = "sample")
df_cpg_tidy_with_age
```

You should get the following output:

```
# A tibble: 1,820,148 x 8
CpG_id sample beta_coef title source pair race age
<chr> <chr> <dbl> <int> <chr> <int> <chr> <int>
1 cg00000292 GSM712302 0.790 111 saliva sample 1 White 40
2 cg00000292 GSM712303 0.684 112 saliva sample 1 White 40
3 cg00000292 GSM712306 0.766 811 saliva sample 8 White 39
4 cg00000292 GSM712307 0.821 812 saliva sample 8 White 39
5 cg00000292 GSM712308 0.633 911 saliva sample 9 White 39
(... more lines ...)
```

If you would have only one CpG, this would be easy right? Let’s see how easy!

## Question

Can you perform a one-way ANOVA on the “cg00000292” CpG island?

Hint1: Use`age`

as the Y response variable and`beta_coef`

as the explanatory X variable?

Hint2: use`summary(my_aov_model)`

to display the related ANOVA p-value.## Solution

One approach to do this would be:

`df_cpg_tidy_with_age %>% filter(CpG_id == "cg00000292") %>% with(data = ., aov(age ~ beta_coef)) %>% summary()`

It gives the following output: ~~~ Df Sum Sq Mean Sq F value Pr(>F) beta_coef 1 154 154.09 1.99 0.163 Residuals 64 4955 77.42

To do this, the issue is that we need to perform a one-way ANOVA between using the age as the response variable (the Y) and *each* CpG island methylation values as the X explanatory variable. There, we need to perform….27,568 ANOVAs!

That’s really a lot of tests but we can rely on the *List Column Workflow* also called the *Nest Map Unest* strategy to do this. Try to understand the code below and if too complicated, simply execute it:

## 2.2 Multiple one-factor ANOVAs

Now the *piece de resistance* to perform the multiple ANOVAs all at once:

```
fits <- df_cpg_tidy_with_age %>%
nest(.data = ., - CpG_id) %>% # categorical variable used for ANOVA
mutate(fit = map(data, ~ aov(data = ., formula = .$age ~ .$beta_coef))) %>% # One-factor ANOVA with age as the Y and beta value as X
mutate(res = map(fit, glance)) %>% # Use broom::glance to return a tibble from a aov/lm object
unnest(res)
```

## 2.3 Detailed explanation

## Optional reading

If this section is too complex, feel free to skip it and move to the next section.

**Nesting:** The `nest(.data = ., - CpG_id)`

line reads the tidy dataframe and perform a nest operation placing all data related to a single CpG island on a single row. There are only two columns, the `CpG_id`

column and another special column called `data`

.

Try the following code:

```
df_cpg_tidy_with_age %>% nest(.data = ., - CpG_id)
# A tibble: 27,578 x 2
CpG_id data
<chr> <list>
1 cg00000292 <tibble [66 × 7]>
2 cg00002426 <tibble [66 × 7]>
3 cg00003994 <tibble [66 × 7]>
```

The `data`

column contains all info related to a single CpG island. For instance try this:

```
test <- df_cpg_tidy_with_age %>% nest(.data = ., - CpG_id)
test$data[test$CpG_id == "cg00000292"]
```

This will display the content of the list stored in the `data`

column related to CpG island `cg00000292`

.

```
[[1]]
# A tibble: 66 x 7
sample beta_coef title source pair race age
<chr> <dbl> <int> <chr> <int> <chr> <int>
1 GSM712302 0.790 111 saliva sample 1 White 40
2 GSM712303 0.684 112 saliva sample 1 White 40
3 GSM712306 0.766 811 saliva sample 8 White 39
```

**Mapping:** Mapping is the action of applying a function to a list for instance. Here we will *map* the `aov()`

function to each list item stored in the special `data`

column.

```
df_cpg_tidy_with_age %>%
nest(.data = ., - CpG_id) %>%
mutate(fit = map(data, ~ aov(data = ., formula = .$age ~ .$beta_coef)))
```

The `dplyr::mutate()`

function will add a new column named `fit`

to the tibble. Then the `map`

function will perform a one-way ANOVA on the `data`

special column. It will use `age`

as the response variable and `beta_coef`

as the explanatory variable.

```
# A tibble: 27,578 x 3
CpG_id data fit
<chr> <list> <list>
1 cg00000292 <tibble [66 × 7]> <aov>
2 cg00002426 <tibble [66 × 7]> <aov>
3 cg00003994 <tibble [66 × 7]> <aov>
```

The `<aov>`

indicates that we have ANOVA models stored for each CpG island in the `fit`

column.

**Unnesting:** To extract our results as a normal tibble, we need to *unnest* the fit column which contains list items and extract it as rows and columns.

This is what the following piece of code does:

```
fits <- df_cpg_tidy_with_age %>%
nest(.data = ., - CpG_id) %>% # categorical variable used for ANOVA
mutate(fit = map(data, ~ aov(data = ., formula = .$age ~ .$beta_coef))) %>%
mutate(res = map(fit, glance)) %>%
unnest(res)
```

There is one extra step because the `<aov>`

data structure is a bit peculiar and does not work well as a dataframe. We therefore use the `glance()`

function from the `broom`

package to do this. We create a new column for this called `res`

that will store our result temporarily. The corresponding R code is `mutate(res = map(fit, glance))`

.

Finally, we can unnest the `res`

column that contains our desired one-way ANOVA results (including ANOVA p-values).

```
# A tibble: 27,578 x 14
CpG_id data fit r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance
<chr> <lis> <lis> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 cg000… <tib… <aov> 0.0302 0.0150 8.80 1.99 0.163 2 -236. 478. 485. 4955.
2 cg000… <tib… <aov> 0.0105 -0.00500 8.89 0.677 0.414 2 -237. 480. 486. 5055.
3 cg000… <tib… <aov> 0.0301 0.0149 8.80 1.98 0.164 2 -236. 478. 485. 4955.
4 cg000… <tib… <aov> 0.0531 0.0383 8.69 3.59 0.0628 2 -235. 477. 483. 4838.
5 cg000… <tib… <aov> 0.000256 -0.0154 8.93 0.0164 0.899 2 -237. 480. 487. 5108.
```

As you can see, all CpG sites are present since our tibble exhibits 27,578 rows. Each CpG site thus has a associated p-value in the `p.value`

column.

Let’s use this info to filter our `df_cpg_tidy_with_age`

.

## 2.4 Filtering non-differential CpG sites

Since we performed more than 27,000 one-way ANOVAs, we need to correct for type I error (false positives). We can add a column named `fdr`

for false discovery rate that will contain adjusted p-values.
There are several methods to correct for type I error which can be “bonferroni” (most stringent) or “hochberg” (more relaxed). It goes beyond the scope of this lesson so we will choose one i.e. “hochberg” but feel free to explore: type `p.adjust.methods`

in the R console to show a list of possible methods.

```
fits_with_fdr <- fits %>%
mutate(fdr = p.adjust(p.value, method = "hochberg"))
# Threshold of 5% false discovery rate
cpg_differentials <- filter(fits_with_fdr, fdr < 0.05) %>%
dplyr::pull(CpG_id)
length(cpg_differentials)
```

We have 228 differential CpGs which should will be easier to display all together on the same heatmap.

Let’s pull out all CpG sites that are differential. This can be achieved rather simply with dplyr `filter()`

and `pull()`

. The latter function takes a tibble column and returns a vector (simpler data structure).

```
cpg_differentials <- filter(fits, p.value < 0.05) %>%
dplyr::pull(CpG_id)
head(cpg_differentials)
```

```
"cg00021527" "cg00037940" "cg00054706" "cg00055233" "cg00059225" "cg00080012"
```

We can now use this vector of differential CpG sites to filter out the “main” dataframe.

```
df_cpg_tidy_with_age_filtered <- filter(df_cpg_tidy_with_age,
CpG_id %in% cpg_differentials)
# Sanity check: do you find back 228 unique CpG sites?
length(unique(df_cpg_tidy_with_age_filtered$CpG_id))
```

If you find 228 unique CpG sites with the FDR hochberg correction, then you’re good to go!

# 3. Heatmap of the differential CpG

Now that with have a nice usable tibble with only differential CpG sites, let’s create the heatmap.

## 3.1 First version of the heatmap

```
df_for_heatmap <- pivot_wider(df_cpg_tidy_with_age_filtered,
id_cols = CpG_id,
names_from = "sample",
values_from = "beta_coef") %>%
column_to_rownames("CpG_id")
```

This is how it looks like.

```
df_for_heatmap[1:5,1:5]
GSM712302 GSM712303 GSM712306 GSM712307 GSM712308
cg00059225 0.3232136 0.36157080 0.34965740 0.35154480 0.4328789
cg00083937 0.5869906 0.61236270 0.61208990 0.50702600 0.7512166
cg00090147 0.0933590 0.09505562 0.07671635 0.08565685 0.1023634
cg00107187 0.1037067 0.12210180 0.11644700 0.11337940 0.1568705
cg00201234 0.2174944 0.23351220 0.19471140 0.21341510 0.3491383
```

We create the heatmap using the `pheatmap()`

function from the `pheatmap`

package. It has many different options. A convenient way to visualise it is to look at its documentation online here.

```
my_heatmap <- pheatmap(df_for_heatmap,
show_colnames = TRUE, # sample names
show_rownames = FALSE, # CpG identifiers (not very useful to show so hidden)
cluster_rows = TRUE, # group similar CpG sites together
cluster_cols = TRUE, # group similar samples together
fontsize_col = 4) # to display sample names
```

## 3.2 Slight upgrade

One way to dig into our data would be to add the sample age on top of the heatmap.

```
sample_annot <- sample_age %>% column_to_rownames("sample") %>% select(age)
my_heatmap <- pheatmap(df_for_heatmap,
annotation_col = sample_annot,
show_colnames = TRUE,
show_rownames = FALSE,
cluster_rows = TRUE,
cluster_cols = TRUE,
fontsize_col = 4)
```

## Discussion

How easy is it to find groups of CpG sites related to high ages?

## 3.3 Saving image on your hard drive

If you want to save your heatmap to an image file on your machine, you can use this helper function.

```
save_pheatmap_png <- function(x, filename, width=1200, height=1000, res = 150) {
png(filename, width = width, height = height, res = res)
grid::grid.newpage()
grid::grid.draw(x$gtable)
dev.off()
}
save_pheatmap_png(my_heatmap, "heatmap.png")
```

## 3.4 Saving the filtered tidy CpG dataset for the next episodes

```
# filter out unnecessary columns
# comply with wide format (same as original one but with only 228 CpG sites)
final_df <-
df_cpg_tidy_with_age_filtered %>%
dplyr::select(CpG_id, sample, beta_coef) %>%
pivot_wider(id_cols = sample, names_from = CpG_id, values_from = beta_coef)
# make it wide to comply with the original data format
write.table(x = final_df,
file = "data/differential_cpgs.tsv",
quote = FALSE,
sep = "\t")
```

# 4. Exercise

## Exercise

Here, a one-way ANOVA was used to test the relationship between CpG site methylation level and age.

- On which hypothesis the one-way ANOVA did rely?
- Can you test some or all of these hypotheses using R?
- If these assumptions are not met, find the equivalent of a one-way ANOVA among non-parametric tests. You can search for
`non-parametric ANOVA`

in your favorite web search engine.

# References

- The
`broom()`

package documentation - A vignette to demonstrate the use of
`dplyr`

,`broom`

and`map`

altogether to perform nested computations: Link

## Key Points