Videos and questions for Chapter 2 of the course "Empirical Economics with R" at Ulm University (taught by Sebastian Kranz)
Which model will make the best predictions for the training data set?
Which model will make the worst predictions for the test data set? Make a guess...
One popular measure to quantify the prediction inaccuracy of a model is the Mean Absolute Error (MAE). Based on this name, make a guess about the precise formula for the MAE:
Another popular measure to quantify the prediction inaccuracy of a model is the Root Mean Squared Error (RMSE). Make a guess about its formula:
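For reference, here is a minimal R sketch of both measures; the vectors y (observed outcomes) and y.hat (model predictions) are hypothetical example values, not data from the course:

```r
# Hypothetical example: observed values y and model predictions y.hat
y = c(10, 20, 30)
y.hat = c(12, 18, 33)

mae = mean(abs(y - y.hat))        # Mean Absolute Error: average absolute prediction error
rmse = sqrt(mean((y - y.hat)^2))  # Root Mean Squared Error: square root of the average squared error
mae   # approx. 2.33
rmse  # approx. 2.38
```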
For our three models, we find the following RMSE on the training and test data sets:
Based on the RMSE in the table above, which model would be considered the best according to the standard machine learning approach?
Take a look at the estimated regression tree from our slides:
What is the predicted price (in 1000 Euro) for a car registered in 2005 with a horsepower of 250 PS?
What is the share of cars registered before 2006?
What is the share of cars with fewer than 224 PS?
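As a reference, here is a minimal R sketch of how such a regression tree could be estimated and used for a prediction. The data frame used.cars and its columns price (in 1000 Euro), year, and ps are assumptions for illustration, not the actual data from the slides:

```r
library(rpart)
library(rpart.plot)

# Hypothetical data frame used.cars with columns price (in 1000 Euro),
# year (year of first registration) and ps (horsepower)
tree = rpart(price ~ year + ps, data = used.cars)

rpart.plot(tree)  # draw the estimated tree

# Predicted price for a car registered in 2005 with 250 PS
predict(tree, newdata = data.frame(year = 2005, ps = 250))
```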
NOTE: The following videos have been recycled from another course of mine. The chapter numbers shown in the lecture slides and the references to exercises in the videos therefore don't match this course.
Remark: Take a look at the lecture slides to see how a split for a nominal variable is computed.
The following content is not relevant for the exam.
What are the random elements in a random forest? (Make a guess; the R sketch after the answer options shows where these elements appear in practice.)
A: Each tree is estimated with a data set that is randomly drawn with replacement from the training data set.
B: We estimate several trees but pick a random subset of trees for the prediction.
C: When training a tree, at each node only a random subset of variables is considered for the optimal split.
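For illustration, here is a minimal sketch using the randomForest package; the data frames train.dat and test.dat and the outcome price are hypothetical:

```r
library(randomForest)

# Hypothetical training data train.dat with outcome price
rf = randomForest(price ~ ., data = train.dat,
  ntree = 500,     # number of trees in the forest
  replace = TRUE,  # each tree is fit on a bootstrap sample drawn with replacement (A)
  mtry = 3)        # only mtry randomly chosen variables are split candidates at each node (C)

# The prediction averages over all ntree trees; no random subset of trees is picked (so not B)
pred = predict(rf, newdata = test.dat)
```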
Consider the following proposal to tune the hyperparameter cp of a regression tree: We define a grid of different candidate values for cp, like 10, 5, 2, 1, 0.5, 0.2, 0.1, .... We then estimate for each value of cp a regression tree on the training data set and assess the prediction quality on the test data set. We then pick the value of the hyperparameter cp that yields the lowest RMSE (or MAE) on the test data set. (We could also refine the grid around the optimal value of cp and search again.)
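A minimal sketch of this proposal using rpart; the data frames train.dat and test.dat and the outcome price are hypothetical, and the grid values are adapted to rpart's cp scale (typically between 0 and 1):

```r
library(rpart)

# Grid of candidate values for the complexity parameter cp
cp.grid = c(0.2, 0.1, 0.05, 0.02, 0.01, 0.005, 0.002)

# For each candidate cp: estimate a tree on the training data
# and compute the RMSE of its predictions on the test data
rmse = sapply(cp.grid, function(cp) {
  tree = rpart(price ~ ., data = train.dat, cp = cp)
  pred = predict(tree, newdata = test.dat)
  sqrt(mean((test.dat$price - pred)^2))
})

# Pick the cp with the lowest RMSE on the test data set
cp.best = cp.grid[which.min(rmse)]
```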
Which assessment do you think is commonly shared by machine learning experts about the method above?