Introduction to dataset #1 (gene expressions)
|
This dataset measures gene expression from various human tissues.
Gene expression is measured from the hybridization of mRNA molecules to microarray probes.
Some genes are uniquely or strongly expressed in a particular tissue, which makes them markers of that tissue.
Tissue-specific markers can be found with several methods: PCA, clustering, or custom approaches.
|
Exploratory Data Analysis of dataset #1
|
Descriptive metrics and distribution plots help to visualise how values are distributed and to spot potential outliers.
Scaling is necessary to visualise values that differ by several orders of magnitude.
A pairwise plot matrix can help to pinpoint samples with similar gene expression profiles.
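
The key points above can be sketched in a few lines of Python. This is a minimal illustration only: the expression matrix is synthetic, the sample names are made up, and the log2 transform with a pseudocount of 1 is one common scaling choice, not necessarily the one used in the lesson. The pairwise correlation matrix is the numeric counterpart of a pairwise plot matrix.

```python
import numpy as np

# Synthetic stand-in for an expression matrix: rows = genes, columns = samples.
# Values and sample names are invented for illustration only.
rng = np.random.default_rng(0)
samples = ["liver_1", "liver_2", "brain_1"]
base = rng.lognormal(mean=5.0, sigma=2.0, size=(1000, 1))  # per-gene baseline
expr = base * rng.lognormal(mean=0.0, sigma=0.1, size=(1000, 3))
expr[:50, 2] *= 20.0  # 50 genes are much higher in the "brain" sample

# Descriptive metrics per sample on the raw scale.
for name, col in zip(samples, expr.T):
    print(name, "min=%.1f median=%.1f max=%.1f"
          % (col.min(), np.median(col), col.max()))

# Log2 scaling (pseudocount of 1 avoids log of zero) compresses values
# that span several orders of magnitude into a range that can be plotted.
log_expr = np.log2(expr + 1.0)

# Pairwise Pearson correlation between samples (columns).
corr = np.corrcoef(log_expr.T)
print(np.round(corr, 3))
```

In this toy setup the two "liver" columns should correlate more strongly with each other than with the "brain" column, which is the pattern a pairwise plot matrix would show visually.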
|
Principal Component Analysis (PCA)
|
|
Hierarchical clustering and heatmaps
|
Scaling of expression values is essential for distance calculation and hierarchical clustering.
The choice of clustering method can have a profound impact on the resulting clusters.
Although clustering is a powerful technique to describe data structure, it does not easily pinpoint specific genes of interest.
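
As a rough sketch of these points, the example below scales a synthetic matrix gene by gene, computes Euclidean distances between samples, and builds trees with three linkage methods from SciPy. The data, the group structure and the choice of linkage methods are assumptions made for illustration; they are not taken from the lesson's dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Hypothetical log-expression matrix: 6 samples (rows) x 100 genes (columns).
# Samples 0-2 and 3-5 mimic two tissues that differ in the first 50 genes.
data = rng.normal(loc=8.0, scale=1.0, size=(6, 100))
data[3:, :50] += 6.0

# Scale each gene to zero mean and unit variance across samples so that
# highly expressed genes do not dominate the distance calculation.
scaled = (data - data.mean(axis=0)) / data.std(axis=0)

# Condensed matrix of pairwise Euclidean distances between samples.
dist = pdist(scaled, metric="euclidean")

results = {}
heights = {}
for method in ("single", "average", "complete"):
    tree = linkage(dist, method=method)
    # Cut the tree into at most two clusters.
    results[method] = fcluster(tree, t=2, criterion="maxclust")
    # Height of the final merge, which depends on the linkage method.
    heights[method] = tree[-1, 2]
    print(method, results[method], round(float(heights[method]), 2))
```

On this cleanly separated toy data the three methods should agree on the two groups and differ only in merge heights. On noisier real data the cluster assignments themselves can change with the method, which is why the choice deserves attention.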
|
Finding tissue-specific genes through feature engineering
|
Sometimes, creating a new variable is a necessary step to find interesting leads in a dataset.
A data transformation that brings a distribution closer to a normal one can benefit the analysis.
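
One possible illustration of both ideas follows. The "specificity" variable used here (the log2 gap between the highest and second-highest tissue for each gene) is a hypothetical engineered feature chosen for this sketch, not necessarily the variable built in the lesson. The data are synthetic, with one strongly tissue-specific gene planted on purpose.

```python
import numpy as np

rng = np.random.default_rng(2)
tissues = ["brain", "heart", "liver", "lung"]
# Hypothetical mean expression per gene (rows) and tissue (columns).
expr = rng.lognormal(mean=6.0, sigma=0.5, size=(500, 4))
expr[0, 2] = 1e5  # plant one strongly liver-specific gene


def skewness(values):
    """Sample skewness: third standardised moment of the values."""
    centred = values - values.mean()
    return float((centred ** 3).mean() / values.std() ** 3)


# Log2 transform: brings the right-skewed raw values closer to a normal shape.
log_expr = np.log2(expr + 1.0)
skew_raw = skewness(expr.ravel())
skew_log = skewness(log_expr.ravel())
print("skewness raw:", round(skew_raw, 2), "log2:", round(skew_log, 2))

# New variable: gap between the highest and second-highest tissue (log2 scale).
sorted_log = np.sort(log_expr, axis=1)
specificity = sorted_log[:, -1] - sorted_log[:, -2]
top_tissue = np.argmax(log_expr, axis=1)

# The gene with the largest gap is the strongest tissue-specific candidate.
best = int(np.argmax(specificity))
print("gene", best, "is most specific to", tissues[top_tissue[best]])
```

Ranking genes by such a derived variable gives a direct list of candidates, which clustering alone does not provide.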
|
Exercises on dataset #1
|
|
Introduction to dataset #2 (methylation and age)
|
|
Advanced Data Exploration on dataset #2
|
|
Linear, Multiple and Regularised Regressions
|
|
Random Forest analysis
|
|