As an example dataset, we'll import the Boston housing dataset.
Beautiful! Our goal here is to use the feature columns, such as CRIM (the per capita crime rate by town), AGE (the proportion of owner-occupied units built prior to 1940) and more, to predict the target column: the median house price.
In essence, each row is a different town in Boston (the data) and we're trying to build a model to predict the median house price (the label) of a town given a series of attributes about the town.
Since we have data and labels, this is a supervised learning problem. And since we're trying to predict a number, it's a regression problem.
Our model achieves an MAE of 2.122. This means, on average, our model's predictions are 2.122 units away from the actual value.
Let's make it a little more visual.
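For example, we can plot our model's predictions against the actual values. Here's a minimal sketch, assuming the true values and predictions live in y_test and y_preds (names assumed for illustration):

```python
import matplotlib.pyplot as plt

# Assumes y_test (actual values) and y_preds (model predictions) already exist
fig, ax = plt.subplots()
ax.scatter(y_test, y_preds, alpha=0.5)

# A perfect model would place every point on this diagonal
ax.plot([y_test.min(), y_test.max()],
        [y_test.min(), y_test.max()],
        color="red", linestyle="--")

ax.set(xlabel="Actual values",
       ylabel="Predicted values",
       title="Predictions vs. actual values")
plt.show()
```

The closer the points hug the diagonal, the closer the predictions are to the actual values.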
Similar to classification, there are several metrics you can use to evaluate your regression models.
We'll check out the following.
- R^2 (pronounced r-squared) or coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
- Mean absolute error (MAE) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.
- Mean squared error (MSE) - The average of the squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).
MSE will usually be higher than MAE because it squares the errors rather than only taking the absolute difference into account (squaring makes any error larger than 1 bigger).
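All three metrics are available in scikit-learn's metrics module. A minimal sketch, again assuming y_test and y_preds hold the true values and predictions:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Assumes y_test (actual values) and y_preds (model predictions) already exist
print(f"R^2: {r2_score(y_test, y_preds):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_preds):.3f}")
print(f"MSE: {mean_squared_error(y_test, y_preds):.3f}")
```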
Now you might be thinking, which regression evaluation metric should you use?
- R^2 is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generally, the closer your R^2 value is to 1.0, the better the model. But it doesn't tell you exactly how wrong your model is in terms of how far off each prediction is.
- MAE gives a better indication of how far off each of your model's predictions is on average.
- As for MAE vs. MSE, because of the way MSE is calculated, squaring the differences between predicted and actual values, it amplifies larger differences. So if being off by 10 is more than twice as bad as being off by 5, MSE is the better choice; otherwise MAE is easier to interpret (see the small worked example below).
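To see the amplifying effect in numbers, here's a small worked example (values made up for illustration): two sets of errors with the same MAE, one of which contains an outlier.

```python
import numpy as np

errors_uniform = np.array([2.0, 2.0, 2.0, 2.0])  # all errors the same size
errors_outlier = np.array([1.0, 1.0, 1.0, 5.0])  # same total error, one outlier

for name, errors in [("uniform", errors_uniform), ("outlier", errors_outlier)]:
    print(f"{name} -> MAE: {np.mean(np.abs(errors)):.1f}, "
          f"MSE: {np.mean(errors ** 2):.1f}")

# uniform -> MAE: 2.0, MSE: 4.0
# outlier -> MAE: 2.0, MSE: 7.0  (same MAE, but MSE penalises the outlier)
```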
Scikit-Learn's RandomizedSearchCV allows us to randomly search across different hyperparameters to see which work best.
It also stores details about the ones which work best!
Let's see it in action.
First, we create a grid (dictionary) of hyperparameters we'd like to search over.
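For example, a grid for a random forest could look like this (the values here are illustrative, not necessarily the exact ones we used, but the number of options per hyperparameter matches the calculation below):

```python
# Hypothetical values; what matters is the count of options per hyperparameter
grid = {"n_estimators": [10, 100, 200, 500, 1000],  # 5 options
        "max_depth": [None, 5, 10, 20],             # 4 options
        "max_features": ["sqrt", "log2"],           # 2 options
        "min_samples_split": [2, 4, 6],             # 3 options
        "min_samples_leaf": [1, 2, 4]}              # 3 options
```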
RandomizedSearchCV tries n_iter combinations of different values, whereas GridSearchCV will try every single possible combination.
And if you remember from before when we did the calculation: max_depth has 4 options, max_features has 2, min_samples_leaf has 3, min_samples_split has 3 and n_estimators has 5.
That's 4x2x3x3x5 = 360 models!
This could take a long time depending on the power of the computer you're using, the amount of data you have and the complexity of the hyperparameters (usually, higher values mean a more complex model).
In our case, the data we're using is relatively small (only ~300 samples).
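Putting it together, here's a minimal sketch of the search, assuming a RandomForestClassifier and existing X_train and y_train splits (the rs_clf name matches the note below):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

clf = RandomForestClassifier(n_jobs=-1)

# n_iter=10 randomly samples 10 of the 360 possible combinations
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10,
                            cv=5,
                            verbose=2)
rs_clf.fit(X_train, y_train)

rs_clf.best_params_  # the best combination found
```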
Since we've already tried to find some ideal hyperparameters using RandomizedSearchCV, we'll create another hyperparameter grid based on the best_params_ of rs_clf with fewer options and then try to use GridSearchCV to find a more ideal set.
Note: Basing the grid on the best_params_ of rs_clf implies the next set of hyperparameters we'll try are roughly in the same range as the best set found by RandomizedSearchCV.
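A minimal sketch of what this could look like (the reduced values are hypothetical, chosen as if they sat near rs_clf.best_params_):

```python
from sklearn.model_selection import GridSearchCV

# Hypothetical reduced grid, narrowed around rs_clf.best_params_
grid_2 = {"n_estimators": [100, 200, 500],
          "max_depth": [None],
          "max_features": ["sqrt"],
          "min_samples_split": [4, 6],
          "min_samples_leaf": [1, 2]}

# GridSearchCV tries every combination: 3 x 1 x 1 x 2 x 2 = 12 models
gs_clf = GridSearchCV(estimator=clf,
                      param_grid=grid_2,
                      cv=5,
                      verbose=2)
gs_clf.fit(X_train, y_train)

gs_clf.best_params_
```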
The Jupyter Notebook can be found on GitHub.