Videos and questions for Chapter 2 of the course "Empirical Economics with R" at Ulm University (taught by Sebastian Kranz)

Machine Learning

Which model will make the best predictions for the training data set?

Which model will make the worst predictions for the test data set? Make a guess...

One popular measure to quantify the prediction inaccuracy of a model is the Mean Absolute Error (MAE). Based on this name, make a guess about the precise formula for the MAE:
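(Once you have made your guess, you can compare it with the usual definition: writing $y_i$ for the observed outcome and $\hat{y}_i$ for the model's prediction in observation $i = 1,\dots,n$,)

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$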

Another popular measure to quantify the prediction inaccuracy of a model is the Root Mean Squared Error (RMSE). Make a guess about its formula:
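(Again, compare your guess with the usual definition, using the same notation as above:)

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$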

For our three models, we find the following RMSE on the training and test data sets:

Based on the RMSE in the table above, which model would be considered the best according to the standard machine learning approach?
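As a minimal R sketch of how such training and test RMSEs can be computed: the simulated data frame `d` (with columns `price`, `year`, `ps`) and the simple linear regression are illustrative stand-ins, not the course's actual data set or models.

```r
# Simulated stand-in for a used-car data set (price in 1000 Euro)
set.seed(1)
n = 1000
d = data.frame(
  year = sample(2000:2015, n, replace = TRUE),
  ps   = sample(60:300, n, replace = TRUE)
)
d$price = 2 + 1.5 * (d$year - 2000) + 0.08 * d$ps + rnorm(n, sd = 3)

# 70% training / 30% test split
train_rows = sample.int(n, size = 0.7 * n)
train = d[train_rows, ]
test  = d[-train_rows, ]

# One example model: a simple linear regression
mod = lm(price ~ year + ps, data = train)

rmse = function(y, y_hat) sqrt(mean((y - y_hat)^2))

rmse(train$price, predict(mod, train))  # RMSE on the training data
rmse(test$price,  predict(mod, test))   # RMSE on the test data
```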


Regression Trees

Take a look at the estimated regression tree from our slides:

What is the predicted price (in 1000 Euro) for a car registered in 2005 with a horsepower of 250?

What is the share of cars registered before 2006?

What is the share of cars with less than 224 PS (horsepower)?

Estimating a regression tree

NOTE: The following videos have been recycled from another course of mine. Therefore, the slide numbers and the references to exercises mentioned in the videos don't match this course.

Remark: Take a look at the lecture slides to see how a split for a nominal variable is computed.
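A minimal sketch of estimating a regression tree with the rpart package; the data frame `d` and the variable names `price`, `year` and `ps` are the hypothetical ones from the sketch above, not necessarily those used in the course.

```r
library(rpart)
library(rpart.plot)

# d: car data with price (in 1000 Euro), year and ps,
# e.g. the simulated data frame from the earlier sketch
tree = rpart(price ~ year + ps, data = d,
             control = rpart.control(cp = 0.01, minbucket = 30))

# Plot the estimated tree and predict, e.g., for a car
# registered in 2005 with 250 horsepower
rpart.plot(tree)
predict(tree, newdata = data.frame(year = 2005, ps = 250))
```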

Optional: Regression trees and dummy variable regression

The following content is not relevant for the exam.
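As an optional illustration of this connection (again assuming the hypothetical data frame `d` from above): a regression tree restricted to a single split predicts the mean price on each side of the split, which is what a regression on the corresponding dummy variable produces.

```r
library(rpart)

# A regression tree with a single split predicts the mean
# price within each of the two resulting groups
tree1 = rpart(price ~ year, data = d,
              control = rpart.control(maxdepth = 1, cp = 0))
tree1  # shows the chosen split point, e.g. year < 2005.5

# If that split separates cars registered before 2006 from later ones,
# the tree's two leaf predictions equal the group means that a
# dummy variable regression yields:
reg = lm(price ~ I(year >= 2006), data = d)

sort(unique(predict(tree1)))
predict(reg, newdata = data.frame(year = c(2005, 2006)))
```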


Random forests

What are the random elements in a random forest? (Make a guess)
A: Each tree is estimated with a data set that is randomly drawn with replacement from the training data set.
B: We estimate several trees but pick a random subset of trees for the prediction.
C: When training a tree, at each node only a random subset of variables is considered for the optimal split.
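As a hedged illustration (not necessarily the package used in the course), here is a minimal random forest sketch with the ranger package; `d` is the hypothetical car data from the earlier sketch. The comments indicate where the random elements referred to in options A and C enter.

```r
library(ranger)

# d: data frame with price, year and ps (hypothetical, as above)

# num.trees: number of trees, each grown on a bootstrap sample
#            drawn with replacement from the training data
# mtry:      number of randomly chosen variables considered
#            at each split of each tree
rf = ranger(price ~ year + ps, data = d,
            num.trees = 500, mtry = 1, seed = 123)

# Prediction: average over the predictions of all trees
predict(rf, data = data.frame(year = 2005, ps = 250))$predictions
```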


Hyperparameter Tuning & Cross Validation

Consider the following proposal to tune the hyperparameter cp of a regression tree:

  1. We define a grid of different candidate values for cp like 10, 5, 2, 1, 0.5, 0.2, 0.1, ...

  2. For each value of cp, we estimate a regression tree on the training data set and assess its prediction quality on the test data set.

  3. We then pick the value of the hyperparameter cp that yields the lowest RMSE (or MAE) on the test data set. (We could also refine the grid around the best value of cp and search again.)

Which assessment of the method above do you think is commonly shared by machine learning experts?

k-fold Cross Validation
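A minimal sketch of tuning cp with k-fold cross validation using rpart; the data frame `d`, the cp grid and the choice k = 5 are illustrative assumptions, not taken from the course.

```r
library(rpart)

# d: data frame with price, year and ps (hypothetical, as above)
cp_grid = c(0.1, 0.05, 0.02, 0.01, 0.005, 0.002, 0.001)
k = 5

# Randomly assign each observation to one of k folds
set.seed(1)
fold = sample(rep(1:k, length.out = nrow(d)))

rmse = function(y, y_hat) sqrt(mean((y - y_hat)^2))

# For each cp: train on k-1 folds, validate on the held-out fold,
# and average the validation RMSE over all k folds
cv_rmse = sapply(cp_grid, function(cp) {
  errs = sapply(1:k, function(j) {
    train = d[fold != j, ]
    valid = d[fold == j, ]
    tree = rpart(price ~ year + ps, data = train,
                 control = rpart.control(cp = cp))
    rmse(valid$price, predict(tree, valid))
  })
  mean(errs)
})

# Pick the cp with the lowest cross-validated RMSE
cp_grid[which.min(cv_rmse)]
```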