Videos and questions for Chapter 2 of the course "Empirical Economics with R" at Ulm University (taught by Sebastian Kranz)

Machine Learning

Which model will make the best predictions for the training data set?

Which model will make the worst predictions for the test data set? Make a guess...

One popular measure to quantify the prediction inaccuracy of a model is the Mean Absolute Error (MAE). Based on this name, make a guess about the precise formula for the MAE:
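(Once you have made your guess, you can compare it with the usual definition: writing $y_i$ for the observed outcome and $\hat{y}_i$ for the model's prediction in observation $i = 1,\dots,n$,)

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$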

Another popular measure to quantify the prediction inaccuracy of a model is the Root Mean Squared Error (RMSE). Make a guess about its formula:
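(Again, compare your guess with the usual definition, using the same notation as above:)

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$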

For our three models, we find the following RMSE on the training and test data sets:

Based on the RMSE in the table above, which model would be considered the best according to the standard machine learning approach?
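As a minimal R sketch of how such training and test RMSEs can be computed: the simulated data frame `d` (with columns `price`, `year`, `ps`) and the simple linear regression are illustrative stand-ins, not the course's actual data set or models.

```r
# Simulated stand-in for a used-car data set (price in 1000 Euro)
set.seed(1)
n = 1000
d = data.frame(
  year = sample(2000:2015, n, replace = TRUE),
  ps   = sample(60:300, n, replace = TRUE)
)
d$price = 2 + 1.5 * (d$year - 2000) + 0.08 * d$ps + rnorm(n, sd = 3)

# 70% training / 30% test split
train_rows = sample.int(n, size = 0.7 * n)
train = d[train_rows, ]
test  = d[-train_rows, ]

# One example model: a simple linear regression
mod = lm(price ~ year + ps, data = train)

rmse = function(y, y_hat) sqrt(mean((y - y_hat)^2))

rmse(train$price, predict(mod, train))  # RMSE on the training data
rmse(test$price,  predict(mod, test))   # RMSE on the test data
```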


Regression Trees

Take a look at the estimated regression tree from our slides:

What is the predicted price (in 1000 Euro) for a car registered in 2005 with a horsepower of 250?

What is the share of cars registered before 2006?

What is the share of cars with less than 224 PS (horsepower)?

Estimating a regression tree

NOTE: The following videos have been recycled from another course of mine. Therefore, the slide numbers and the references to exercises mentioned in the videos don't match this course.

Remark: Take a look at the lecture slides to see how a split for a nominal variable is computed.
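A minimal sketch of estimating a regression tree with the rpart package; the data frame `d` and the variable names `price`, `year` and `ps` are the hypothetical ones from the sketch above, not necessarily those used in the course.

```r
library(rpart)
library(rpart.plot)

# d: car data with price (in 1000 Euro), year and ps,
# e.g. the simulated data frame from the earlier sketch
tree = rpart(price ~ year + ps, data = d,
             control = rpart.control(cp = 0.01, minbucket = 30))

# Plot the estimated tree and predict, e.g., for a car
# registered in 2005 with 250 horsepower
rpart.plot(tree)
predict(tree, newdata = data.frame(year = 2005, ps = 250))
```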

Optional: Regression trees and dummy variable regression

The following content is not relevant for the exam.
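As an optional illustration of this connection (again assuming the hypothetical data frame `d` from above): a regression tree restricted to a single split predicts the mean price on each side of the split, which is what a regression on the corresponding dummy variable produces.

```r
library(rpart)

# A regression tree with a single split predicts the mean
# price within each of the two resulting groups
tree1 = rpart(price ~ year, data = d,
              control = rpart.control(maxdepth = 1, cp = 0))
tree1  # shows the chosen split point, e.g. year < 2005.5

# If that split separates cars registered before 2006 from later ones,
# the tree's two leaf predictions equal the group means that a
# dummy variable regression yields:
reg = lm(price ~ I(year >= 2006), data = d)

sort(unique(predict(tree1)))
predict(reg, newdata = data.frame(year = c(2005, 2006)))
```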


Random forests

What are the random elements in a random forest? (Make a guess)
A: Each tree is estimated with a data set that is randomly drawn with replacement from the training data set.
B: We estimate several trees but pick a random subset of trees for the prediction.
C: When training a tree, at each node only a random subset of variables is considered for the optimal split.
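As a hedged illustration (not necessarily the package used in the course), here is a minimal random forest sketch with the ranger package; `d` is the hypothetical car data from the earlier sketch. The comments indicate where the random elements referred to in options A and C enter.

```r
library(ranger)

# d: data frame with price, year and ps (hypothetical, as above)

# num.trees: number of trees, each grown on a bootstrap sample
#            drawn with replacement from the training data
# mtry:      number of randomly chosen variables considered
#            at each split of each tree
rf = ranger(price ~ year + ps, data = d,
            num.trees = 500, mtry = 1, seed = 123)

# Prediction: average over the predictions of all trees
predict(rf, data = data.frame(year = 2005, ps = 250))$predictions
```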


Hyperparameter Tuning & Cross Validation

Consider the following proposal to tune the hyperparameter cp of a regression tree:

  1. We define a grid of different candidate values for cp like 10, 5, 2, 1, 0.5, 0.2, 0.1, ...

  2. For each value of cp, we estimate a regression tree on the training data set and assess its prediction quality on the test data set.

  3. We then pick the value of the hyperparameter cp that yields the lowest RMSE (or MAE) on the test data set. (We could also refine the grid around the best value of cp and search again.)

Which assessment of the method above do you think is commonly shared by machine learning experts?

k-fold Cross Validation
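A minimal sketch of tuning cp with k-fold cross validation using rpart; the data frame `d`, the cp grid and the choice k = 5 are illustrative assumptions, not taken from the course.

```r
library(rpart)

# d: data frame with price, year and ps (hypothetical, as above)
cp_grid = c(0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001)
k = 5

# Randomly assign each observation to one of k folds
set.seed(1)
fold = sample(rep(1:k, length.out = nrow(d)))

rmse = function(y, y_hat) sqrt(mean((y - y_hat)^2))

# For each cp: train on k-1 folds, validate on the held-out fold,
# and average the validation RMSE over all k folds
cv_rmse = sapply(cp_grid, function(cp) {
  errs = sapply(1:k, function(j) {
    train = d[fold != j, ]
    valid = d[fold == j, ]
    tree = rpart(price ~ year + ps, data = train,
                 control = rpart.control(cp = cp))
    rmse(valid$price, predict(tree, valid))
  })
  mean(errs)
})

# Pick the cp with the lowest cross-validated RMSE
cp_grid[which.min(cv_rmse)]
```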