Random Forest analysis
Overview
Teaching: 30 min
Exercises: 30 min

Questions
Objectives
1. Introduction
1.1 Definitions
Random Forest = an ensemble method that combines many decision trees.
The "forest" is built from decision trees, so we start with a single decision tree before moving to the full forest.
2. Decision trees
2.1 Tree vocabulary
- Node
- Leaf / terminal node
- Branch
- Pruning = how far do you grow the tree?
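To make this vocabulary concrete, the sketch below fits a small regression tree on synthetic data and prints its structure (internal nodes, branches, and leaves). The data, the scikit-learn calls, and the max_depth value are illustrative assumptions, not part of the lesson.

```python
# Minimal sketch: fit a small regression tree and inspect its structure.
# Synthetic data and all parameter values are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))           # one predictor
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 100)   # noisy response

# max_depth limits how far the tree is grown (a simple stand-in for pruning)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

print(export_text(tree, feature_names=["x"]))   # nodes, branches, and leaves
print("number of leaves:", tree.get_n_leaves())
print("depth:", tree.get_depth())
```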
Regression tree
Metric for split quality = Residual Sum of Squares (RSS) = how homogeneous the observations are on each side of a split at a given decision node; the split that gives the lowest RSS is preferred.
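In its usual textbook form (given here as background, not taken from the lesson), the RSS for a tree whose leaves define regions $R_1, \dots, R_J$ is

$$
RSS = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2
$$

where $\hat{y}_{R_j}$ is the mean response of the training observations in region $R_j$; a candidate split is better when it makes this sum smaller.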
3. From a single tree to a forest
3.1 Bagged trees
Bagging = bootstrap aggregation = reducing the variance caused by a random dataset split by combining the results of multiple decision trees. You work on all predictors/variables at once (all CpG sites). You generate N bootstrapped training datasets (by random sampling with replacement) and grow one tree on each, with no tree pruning.
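The sketch below shows one way bagging could look in code: each tree is grown without pruning on a bootstrap sample of the rows, and the trees' predictions are averaged. The synthetic data, the scikit-learn usage, and the number of trees are assumptions made for illustration only.

```python
# Minimal sketch of bagged regression trees (illustrative, not the lesson's code).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
n_samples, n_trees = 200, 100
X = rng.uniform(0, 1, size=(n_samples, 5))               # 5 predictors (think CpG sites)
y = 2 * X[:, 0] - X[:, 3] + rng.normal(0, 0.1, n_samples)

trees = []
for _ in range(n_trees):
    idx = rng.integers(0, n_samples, n_samples)           # bootstrap sample: rows drawn with replacement
    trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))  # no pruning: tree grown fully

# Bagged prediction = average of the individual trees' predictions
X_new = rng.uniform(0, 1, size=(3, 5))
y_pred = np.mean([t.predict(X_new) for t in trees], axis=0)
print(y_pred)
```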
3.2 Random Forests
Similar to bagged trees, but a Random Forest = bagging on random subsets of variables as well as samples: at each split only a random subset of the predictors is considered, so you don't work on all predictors at once.
Since not all of the observations/samples are used to build a given tree, that tree's Y predictions can be compared to the actual values of the left-over ("out-of-bag") observations. Formally, this is called the Out-Of-Bag error estimate (OOB for short).
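As a hedged illustration (synthetic data and all parameter values are assumptions, and scikit-learn is just one possible implementation), the sketch below fits a random forest that considers a random subset of predictors at each split and reports the OOB estimate of performance.

```python
# Minimal sketch: random forest regression with an out-of-bag (OOB) estimate.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 20))            # e.g. methylation values at 20 CpG sites
y = 2 * X[:, 0] - X[:, 3] + rng.normal(0, 0.1, 300)

forest = RandomForestRegressor(
    n_estimators=500,        # number of trees in the forest
    max_features="sqrt",     # random subset of predictors tried at each split
    oob_score=True,          # score each observation with the trees that did not see it
    random_state=0,
).fit(X, y)

print("OOB R^2:", forest.oob_score_)             # out-of-bag estimate of predictive performance
```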
References
- DataCamp decision tree guide
- University of Cincinnati Business Analytics tutorials on regression trees and Random Forests
Key Points