
Random Forest analysis

Overview

Teaching: 30 min
Exercises: 30 min
Questions
Objectives


1. Introduction

1.1 Definitions

Random Forest: an ensemble method that combines the predictions of many decision trees.

The "forest" in the name comes from this collection of decision trees, so we start with the single decision tree.

2. Decision trees

2.1 Tree vocabulary

Node = a point in the tree where the data are split on one predictor.
Leaf (terminal node) = a node that is not split any further; it holds the prediction.
Branch = a path connecting a node to the nodes below it.
Pruning = how far do you grow the tree (how many splits do you keep)?

A small code sketch below illustrates these terms on a fitted tree.
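This is a minimal sketch, assuming a Python/scikit-learn workflow (the lesson does not specify its tooling) and hypothetical toy data. Printing the fitted tree shows internal nodes as split rules, branches as indentation levels, and leaves as "value:" lines.

```python
# Minimal sketch, assuming Python + scikit-learn; the toy data are hypothetical.
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=100, n_features=4, random_state=0)

# max_depth limits how far the tree is grown (a simple stand-in for pruning).
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Internal nodes appear as split rules, branches as indentation,
# and leaves as "value: ..." lines.
print(export_text(tree, feature_names=["x0", "x1", "x2", "x3"]))
```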

Regression tree

Metric for split quality = Residual Sum of Squares (RSS) = how homogeneous the observations are after the split at a given decision node.
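For a candidate split that sends the observations at a node into a left and a right group, a standard way to write the split RSS (supplied here as a reminder; the lesson text does not spell it out) is:

$$\mathrm{RSS} = \sum_{i \in \text{left}} (y_i - \bar{y}_{\text{left}})^2 + \sum_{i \in \text{right}} (y_i - \bar{y}_{\text{right}})^2$$

where $\bar{y}_{\text{left}}$ and $\bar{y}_{\text{right}}$ are the mean outcomes in each group. The split with the smallest RSS produces the most homogeneous groups.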

3. From a single tree to a forest

3.1 Bagged trees

Bagging = bootstrap aggregation = reducing the variance that comes from the random split of the dataset by combining the results of multiple decision trees. Each tree works on all predictors/variables at once (here, all CpG sites). You generate N bootstrapped training datasets (by random sampling with replacement) and grow one tree per dataset, with no tree pruning.
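A minimal sketch of bagged regression trees, assuming a Python/scikit-learn workflow; the simulated data stand in for a predictor matrix (e.g. CpG methylation values) and a continuous outcome.

```python
# Minimal sketch, assuming Python + scikit-learn; the toy data are hypothetical.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each tree (the default base estimator is an unpruned decision tree)
# is fit on a bootstrap resample of the training samples and considers
# all predictors at every split.
bagged = BaggingRegressor(n_estimators=100, bootstrap=True, random_state=0)
bagged.fit(X_train, y_train)
print("R^2 on held-out data:", bagged.score(X_test, y_test))
```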

3.2 Random Forests

Similar to bagged trees, but Random Forest = bagging on random subsets of both the variables and the samples: at each split only a random subset of the predictors is considered, so you don't work on all predictors at once.

Since not all of the observations/samples are used to build a given tree, the Y predictions of that tree can be compared to the actual values of the left-over ("out of bag") data. Formally, this is called the Out-Of-Bag error estimate (OOB for short).
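A minimal sketch of a random forest with an out-of-bag error estimate, again assuming a Python/scikit-learn workflow and the same kind of hypothetical data as in the bagging sketch above.

```python
# Minimal sketch, assuming Python + scikit-learn; the toy data are hypothetical.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)

# Unlike plain bagging, each split considers only a random subset of predictors
# (max_features); oob_score=True evaluates each tree on the samples it did not
# see during its bootstrap draw.
forest = RandomForestRegressor(
    n_estimators=500,
    max_features="sqrt",   # random subset of predictors at each split
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)
print("Out-of-bag R^2:", forest.oob_score_)
```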

References

Key Points