Random Forest analysis

Overview

Teaching: 30 min
Exercises: 30 min

Questions

Objectives

1. Introduction
- 1.1 Definitions
2. Decision trees
- 2.1 Tree vocabulary
3. From a single tree to a forest
- 3.1 Bagged trees
- 3.2 Random Forests
References

1. Introduction

1.1 Definitions

Random Forest

The “forest” in Random Forest comes from decision trees: a Random Forest is a collection of many decision trees.

2. Decision trees

2.1 Tree vocabulary

- Node
- Leaf = Terminal Node
- Branch
- Pruning = how far do you grow the tree?

Regression tree

Metric for split quality = Residual Sum of Squares (RSS) = how homogeneous the observations are once they are split at decision node X. The lower the RSS, the more homogeneous the resulting groups.
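To make the RSS criterion concrete, here is a minimal sketch of how a single split could be scored. The function names `rss` and `best_split` and the toy data (one predictor, e.g. the methylation level at one CpG site, against a numeric outcome) are made up for illustration and are not part of the lesson material.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    group_mean = sum(values) / len(values)
    return sum((v - group_mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_rss) for the split on x with the lowest RSS."""
    best_threshold, best_total = None, float("inf")
    distinct = sorted(set(x))
    # Candidate thresholds: midpoints between consecutive distinct x values
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = rss(left) + rss(right)  # homogeneity of both child nodes
        if total < best_total:
            best_threshold, best_total = threshold, total
    return best_threshold, best_total


# Toy data, invented for this example
x = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
y = [20.0, 22.0, 21.0, 50.0, 52.0, 51.0]
threshold, total_rss = best_split(x, y)
print(threshold, total_rss)
```

On this toy data the best split falls between 0.3 and 0.7, where each side contains only similar outcome values, so the summed RSS is small (about 4). Any other threshold mixes low and high outcomes in one group and gives a much larger RSS.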

3. From a single tree to a forest

3.1 Bagged trees

Bagging = bootstrap aggregation = reduce the variance due to the random dataset split by combining the results of multiple decision trees. You generate N bootstrapped training datasets (random sampling with replacement), grow one tree on each of them and combine their predictions. Each tree works on all predictors/variables at once (all CpG sites). No tree pruning.
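The bagging procedure can be sketched in a few lines. This is a simplified illustration, not the lesson's implementation: each "tree" here is a one-split stump on a single predictor, whereas bagged trees are normally grown deep and left unpruned. The names `fit_stump` and `bagged_predict`, the number of trees and the toy data are assumptions made for the example.

```python
import random
from statistics import mean


def fit_stump(rows):
    """Fit a one-split regression tree on (x, y) pairs; return a predict function."""
    overall = mean(y for _, y in rows)
    distinct = sorted({x for x, _ in rows})
    best = None  # (rss, threshold, left_mean, right_mean)
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        lm, rm = mean(left), mean(right)
        rss = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or rss < best[0]:
            best = (rss, t, lm, rm)
    if best is None:
        # The bootstrap sample held a single distinct x value: no split possible
        return lambda x: overall
    _, best_t, best_lm, best_rm = best
    return lambda x: best_lm if x <= best_t else best_rm


def bagged_predict(rows, x_new, n_trees=25, seed=0):
    """Average the predictions of n_trees stumps, each fit on a bootstrap sample."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_trees):
        # Bootstrap: draw len(rows) observations with replacement
        sample = [rng.choice(rows) for _ in rows]
        predictions.append(fit_stump(sample)(x_new))
    return mean(predictions)


# Toy data, invented for this example
rows = list(zip([0.1, 0.2, 0.3, 0.7, 0.8, 0.9],
                [20.0, 22.0, 21.0, 50.0, 52.0, 51.0]))
prediction = bagged_predict(rows, x_new=0.15)
print(prediction)
```

Each stump sees a slightly different dataset and therefore gives a slightly different answer. Averaging them is what smooths out the dependence on any one random draw.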

3.2 Random Forests

Similar to bagged trees, but with a second source of randomness: Random Forest = bagging on random subsets of both samples and variables. You don’t work on all predictors at once.
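The two random draws can be illustrated as follows. This sketch only shows the sampling step, not the tree fitting. The predictor labels are invented, and taking one third of the predictors as the subset size (`mtry`) is an assumption borrowed from a common default for regression forests, not something stated in this lesson. Many implementations also redraw the variable subset at every split rather than once per tree.

```python
import random

rng = random.Random(42)  # fixed seed so the draws are repeatable

# Hypothetical dataset: 10 samples and 6 predictors (made-up CpG site labels)
n_samples = 10
predictors = ["cpg_1", "cpg_2", "cpg_3", "cpg_4", "cpg_5", "cpg_6"]
mtry = max(1, len(predictors) // 3)  # assumption: one third of the predictors

for tree_id in range(3):
    # Random subset of samples: bootstrap draw, with replacement
    sample_ids = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Random subset of variables: drawn without replacement
    variable_subset = rng.sample(predictors, mtry)
    print(tree_id, sorted(sample_ids), variable_subset)
```

Because every tree is restricted to a different handful of predictors, the trees are less alike than plain bagged trees would be.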

Since not all of the observations/samples are used to build each tree, the left-over data (“out of bag”) can serve as a test set: each tree predicts Y for the samples it was not trained on, and these predictions are compared to their actual values. Formally, this is called the Out-Of-Bag error estimate (OOB in short).
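A minimal sketch of how an OOB error could be computed is given below. It rests on several assumptions: the trees are one-split stumps rather than fully grown trees, the error is reported as a mean squared error, and `fit_stump`, `random_forest_oob`, `mtry` and the toy data are names and values invented for the example.

```python
import random
from statistics import mean


def fit_stump(X, y, rows, features):
    """One-split regression tree using only the given rows and candidate features."""
    fallback = mean(y[i] for i in rows)
    best = None  # (rss, feature, threshold, left_mean, right_mean)
    for f in features:
        distinct = sorted({X[i][f] for i in rows})
        for lo, hi in zip(distinct, distinct[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            lm, rm = mean(left), mean(right)
            rss = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or rss < best[0]:
                best = (rss, f, t, lm, rm)
    if best is None:
        # No candidate feature could be split in this bootstrap sample
        return lambda x: fallback
    _, best_f, best_t, best_lm, best_rm = best
    return lambda x: best_lm if x[best_f] <= best_t else best_rm


def random_forest_oob(X, y, n_trees=50, mtry=1, seed=0):
    """Grow n_trees stumps and return (trees, oob_mse)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    trees = []
    oob_predictions = [[] for _ in range(n)]
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        features = rng.sample(range(p), mtry)          # random variable subset
        tree = fit_stump(X, y, in_bag, features)
        trees.append(tree)
        # Samples this tree never saw are its out-of-bag samples
        for i in set(range(n)) - set(in_bag):
            oob_predictions[i].append(tree(X[i]))
    # Compare each sample's averaged OOB prediction with its actual value
    errors = [(mean(preds) - y[i]) ** 2
              for i, preds in enumerate(oob_predictions) if preds]
    oob_mse = mean(errors) if errors else float("nan")
    return trees, oob_mse


# Toy data, invented for this example: 6 samples, 2 predictors
X = [[0.10, 0.55], [0.20, 0.40], [0.30, 0.60],
     [0.70, 0.45], [0.80, 0.50], [0.90, 0.65]]
y = [20.0, 22.0, 21.0, 50.0, 52.0, 51.0]
trees, oob_mse = random_forest_oob(X, y)
forest_prediction = mean(tree([0.15, 0.50]) for tree in trees)
print(oob_mse, forest_prediction)
```

The OOB estimate needs no separate test set, because each sample is scored only by trees that did not see it during training.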

References

DataCamp decision tree guide.
University of Cincinnati Business Analytics tutorials on regression trees and Random Forests.

Key Points

Advanced Forensic Biology