# Random Forest analysis

## Overview

Teaching: 30 min

Exercises: 30 min

Questions

Objectives


# 1. Introduction

## 1.1 Definitions

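Random Forest: an ensemble method that combines the predictions of many decision trees.

The "forest" in the name refers to this collection of decision trees.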

# 2. Decision trees

## 2.1 Tree vocabulary

- Node
- Leaf (also called terminal node)
- Branch
- Pruning: how far do you grow the tree?

Regression tree

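The metric for split quality is the Residual Sum of Squares (RSS): the sum of squared differences between each observation and the mean of its group. It measures how homogeneous the observations are within the groups created by the split at decision node X. The lower the RSS, the better the split.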
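To make the RSS criterion concrete, here is a minimal sketch in plain Python, assuming a single numeric predictor and a made-up toy dataset. The function names `rss` and `best_split` are illustrative and do not come from any library. It tries every midpoint between consecutive predictor values and keeps the threshold with the lowest total RSS.

```python
def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_rss) for the split on x that minimises RSS."""
    best_threshold, best_rss = None, float("inf")
    levels = sorted(set(x))
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(levels, levels[1:]):
        threshold = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= threshold]
        right = [yi for xi, yi in zip(x, y) if xi > threshold]
        total = rss(left) + rss(right)
        if total < best_rss:
            best_threshold, best_rss = threshold, total
    return best_threshold, best_rss


# Toy data (made up): the response jumps once x goes above 3.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
threshold, split_rss = best_split(x, y)
print("best threshold:", threshold)          # 3.5 for this toy data
print("RSS after split:", round(split_rss, 3))
```

A real regression tree repeats this search recursively inside each resulting group, and over every predictor rather than just one.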

# 3. From a single tree to a forest

## 3.1 Bagged trees

Bagging (bootstrap aggregation) reduces the variance caused by the random split of the dataset by combining the results of multiple decision trees. You work on all predictors/variables at once (all CpG sites). You generate N bootstrapped training datasets by random sampling with replacement, grow one tree on each, and combine their predictions. The trees are not pruned.
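The bagging procedure can be sketched as follows. This is an illustration under simplifying assumptions, not a full implementation: each "tree" is a one-split regression tree (a stump) rather than a fully grown tree, the data are made up, and all function names are hypothetical. What it does show is the two defining steps: drawing bootstrap samples with replacement and averaging the predictions, with every predictor available to every tree.

```python
import random
from statistics import mean


def fit_stump(X, y):
    """Fit a one-split regression tree, searching every predictor (column)."""
    best = {"feature": None, "threshold": None, "left": mean(y), "right": mean(y)}
    best_rss = float("inf")
    for j in range(len(X[0])):          # bagging: all predictors are candidates
        levels = sorted({row[j] for row in X})
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            ml, mr = mean(left), mean(right)
            total = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if total < best_rss:
                best_rss = total
                best = {"feature": j, "threshold": t, "left": ml, "right": mr}
    return best


def predict_stump(stump, row):
    if stump["feature"] is None:        # no split was possible: predict the mean
        return stump["left"]
    if row[stump["feature"]] <= stump["threshold"]:
        return stump["left"]
    return stump["right"]


def fit_bagged(X, y, n_trees=25, seed=0):
    """Grow one stump per bootstrapped training dataset."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # draw n rows with replacement
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict_bagged(trees, row):
    """Aggregate by averaging the predictions of all trees."""
    return mean(predict_stump(t, row) for t in trees)


# Toy data (made up): the response depends on column 0 only.
X = [[1, 7], [2, 3], [3, 9], [4, 1], [5, 8], [6, 2]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
trees = fit_bagged(X, y)
print(predict_bagged(trees, [1, 5]), predict_bagged(trees, [6, 5]))
```

Averaging over many bootstrapped trees smooths out the quirks of any single resample, which is where the variance reduction comes from.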

## 3.2 Random Forests

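Random Forests are similar to bagged trees, but add a second source of randomness: in addition to bootstrapping the samples, only a random subset of the variables is considered, typically a fresh subset at each split. You don’t work on all predictors at once.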
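The difference from bagging can be shown with a small sketch of the split search alone, assuming made-up toy data. The function `best_split_on_subset` is hypothetical, and the parameter name `mtry` is borrowed from the R `randomForest` package purely for illustration. Only `mtry` randomly chosen predictors are examined, so the most informative predictor is sometimes not even a candidate.

```python
import random


def rss(values):
    """Residual sum of squares of the values around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split_on_subset(X, y, mtry, rng):
    """Search for the best split among `mtry` randomly chosen predictors only."""
    n_features = len(X[0])
    candidates = rng.sample(range(n_features), mtry)   # random subset of columns
    best = None                                        # (rss, feature, threshold)
    for j in candidates:
        levels = sorted({row[j] for row in X})
        for lo, hi in zip(levels, levels[1:]):
            t = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, j, t)
    return candidates, best


# Toy data (made up): 4 predictors, only column 0 drives the response.
X = [[1, 7, 4, 2], [2, 3, 6, 9], [3, 9, 1, 5],
     [4, 1, 8, 3], [5, 8, 2, 7], [6, 2, 5, 1]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

rng = random.Random(1)
for _ in range(3):
    candidates, best = best_split_on_subset(X, y, mtry=2, rng=rng)
    print("candidate predictors:", sorted(candidates),
          "-> chosen (feature, threshold):", best[1:])
```

Setting `mtry` equal to the number of predictors turns this back into the split search used by bagged trees. Restricting the candidates makes the trees less similar to one another, which is the usual motivation for the extra randomness.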

Since not all of the observations/samples are used to build each tree, the left-over samples (“out of bag”) can serve as a test set for that tree: the tree predicts Y for them, and these predictions are compared to the actual observed values. Formally, this is called the *Out-Of-Bag* error estimate (OOB for short).
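The bookkeeping behind the OOB estimate can be sketched as below. This is an illustration under stated assumptions: a single predictor, one-split trees (stumps) standing in for full trees, mean squared error as the error measure, and made-up data. All function names are hypothetical. For each tree, only the samples that were not drawn into its bootstrap sample are predicted, and each sample's OOB predictions are averaged before being compared to the observed value.

```python
import random
from statistics import mean


def fit_stump(x, y):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    levels = sorted(set(x))
    best, best_rss = (None, mean(y), mean(y)), float("inf")
    for lo, hi in zip(levels, levels[1:]):
        t = (lo + hi) / 2
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        ml, mr = mean(left), mean(right)
        total = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if total < best_rss:
            best, best_rss = (t, ml, mr), total
    return best


def predict_stump(stump, xi):
    t, ml, mr = stump
    return ml if t is None or xi <= t else mr


def oob_error(x, y, n_trees=50, seed=0):
    """Return (OOB mean squared error, number of samples that were scored)."""
    rng = random.Random(seed)
    n = len(x)
    oob_preds = [[] for _ in range(n)]   # predictions made while sample i was out of bag
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]    # bootstrap sample
        stump = fit_stump([x[i] for i in in_bag], [y[i] for i in in_bag])
        for i in set(range(n)) - set(in_bag):            # samples this tree never saw
            oob_preds[i].append(predict_stump(stump, x[i]))
    # Score only samples that were out of bag for at least one tree.
    scored = [(mean(p), y[i]) for i, p in enumerate(oob_preds) if p]
    mse = sum((pred - obs) ** 2 for pred, obs in scored) / len(scored)
    return mse, len(scored)


# Toy data (made up): the response jumps once x goes above 6.
x = list(range(1, 13))
y = [1.0, 1.2, 0.8, 1.1, 0.9, 1.0, 5.0, 5.2, 4.8, 5.1, 4.9, 5.0]
mse, n_scored = oob_error(x, y)
print("OOB mean squared error:", round(mse, 3), "over", n_scored, "samples")
```

Because every prediction is made on samples the tree did not see, the OOB error behaves like a test-set error without requiring a separate hold-out split.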

# References

- Data Camp decision tree guide
- University of Cincinnati Business Analytics tutorials on regression trees and Random Forests.

## Key Points