Training and Tuning Machine Learning Models for Predicting Migration Rate
In this post we will apply various machine learning models and perform hyperparameter tuning to the problem of predicting a country`s net migration rate as a time series from its socio-economic factors.
The code for my project can be found here.
Data Preprocessing
Refer to the ML Training.ipynb
notebook for the code and final_dataset.csv for the data
Before diving into fitting a machine learning model for our dataset, we will do some data preprocessing to be aware of the structure of our dataset and to find any potential anomalies. After importing our final joint dataset into a dataframe object, we observe its first 5 rows, description, and info.
First of all, we can identify that the output column is the Net Migration Rate, the columns Country Name and Country Code need to be dropped, and the other columns are the input. Secondly, The description of the distribution values of the columns reassures us that no erroneous outlier is present in the data, since the min, max, and mean of each feature makes sense. What is worth noting is that the Net Migration Rate column has a minimum of -54.746, a maximum of 134.414, a mean of 0.287, a standard deviation of 9.865, and an IQR of (-2.89, 2.02). We will keep these values in mind when we evaluate the performance of the models later. Thirdly, the info table of the dataset shows us that two of our input features (Continent Code and HDI) are categorical. As such, we perform a One Hot Encoding on these two features to arrive at our final dataset (after dropping the Country Name, Country Code, and the categorical columns).
We split the data into input and output, and further split those into training and testing data. We have used an 80% - 20% split into training and testing which ensures that enough data is set aside for testing (around 733 rows of testing from the original 3,662 rows of the data).
We will also standardize the input features to use the scaled ones for the Support Vector Regression (SVR) and Neural Networks. While standardizing we fit the scaler for the testing data alone to prevent any data leakage. We use the same scaler to transform the testing data.
Linear Regression and its Variants
Plain Linear Regression
There was no need for a hyperparameter search for the plain linear regression model since there are no relevant hyperparameters (the normalization parameter is deprecated). After fitting the model and printing the intercept and coefficients of each feature, the error metrics and learning curve of the model are:
R2: 0.98438, MSE: 1.84479, RMSE: 1.35823, MAE: 0.41999
The R2 score is very close to 1 which is a positive indicator. The mean squared error seems acceptable considering that there are a lot of legitimate outliers in our data (standard deviation of 9 while the IQR has a range of 4). The mean absolute error is on the high end considering the distribution of the data: half of the data is between -2 and +2, so an average absolute error of around 0.4 seems considerable. We will keep an eye at this metric in the following models.
Huber Regressor
Considering the vast presence of outliers in our data, we will try a linear regression model robust to outliers. The Huber Regressor is one of the robust regression models that is “less sensitive to outliers than the mean squared error, and it is often more precise and converges faster than the mean absolute error” (Géron 290). After fitting this model with our training data, we get the below error metrics and the learning curve:
R2: 0.97997, MSE: 2.36503, RMSE: 1.53787, MAE: 0.23481
As expected, the mean absolute error decreased by 45% (from 0.42 to 0.23). This proves that the model is more reliable for the majority of the data that is centered around the mean. However, it performed worse on the outliers: this can be clearly seen by the wide range of the scores in the learning curve. The mean squared error increased from 1.84 to 2.36. The Huber Regressor can therefore help us predict the future net migration rate of a country as long as it doesn’t present outlier values in one of its features.
Linear Regression with Quadratic Features
We will introduce a pipeline that transforms the features into a polynomial of degree 2 followed by a linear regression model. After fitting the data, the error metrics and learning curve of this model is shown below:
R2: 0.98443, MSE: 1.83862, RMSE: 1.35596, MAE: 0.44821
We can conclude that the performance of the polynomial regression model is similar to the plain linear regression. The mean squared error only decreased slightly, so we can be confident that there is no need to square the features of our dataset to improve the performance of our model (the relationship appears to be linear at best). When we tried increasing the degree of the polynomial, it significantly worsened the performance of the model.
Elastic Net
According to the literature, “it is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless features’ weights down to zero” (Géron 142).
Therefore, we will perform a hyperparameter tuning for an Elastic Net using Polynomial Features (considering that Elastic Nets are quickly modeled, we will directly perform a grid search without a random hyperparameter search). We are using a 10 times repeated 10-fold cross-validation. The hyperparameter grid is as follows:
Pipeline | Hyperparameter | Values | ||
---|---|---|---|---|
Elastic Net | alpha | 0.05 | 0.1 | 0.5 |
l1_ratio | 0.5 | 0.75 | 1.0 | |
tol | 0.00001 | 0.0001 | 0.001 | |
max_iter | 5,000 | |||
Polynomial Features | degree | 1 | 2 | 3 |
include_bias | True | False |
Keep in mind that when the polynomial degree is 1, it is equivalent to a stand alone Elastic Net model, alpha is the regularization parameter, and the closer the l1_ratio is to 1, the more the model behaves as a Lasso Regression and less like a Ridge Regression. After fitting 162 candidates, the code outputs the following as the best hyperparameters for the model.
{
'alpha': 0.05,
'l1_ratio': 1.0,
'max_iter': 5000,
'tol': 0.001,
'degree': 3,
'include_bias': True
}
We can conclude that the best model is a Lasso Regression (Elastic Net with l1_ratio equal to 1) with Polynomial Features of degree 3. This model might be preferable since the polynomial of degree 3 can capture many more combinations and nonlinear relations of the input features than a plain linear regression model. The Lasso Regression helps filter out the useless combination of features which also helps the model from overfitting (it reduces the useless features’ weights to zero unlike the linear regression). After running the model with the above parameters, the error metrics and learning rate are shown below.
R2: 0.97814, MSE: 2.58095, RMSE: 1.60653, MAE: 0.50331
While we are seeing a slight decrease in performance with this model, it has the advantage of being regularized to prevent over & under-fitting while capturing more complex relationships of the data than a plain linear regression (as confirmed by the hyperparameter tuning and cross-validation). We will also see later on that this model generalizes very well to the dataset without the time series.
Decision Tree Regressor
We will perform a random hyperparameter search on the following parameters.
Hyperparameters | Values | ||||
---|---|---|---|---|---|
max_features | auto | sqrt | |||
max_depth | 10 | 45 | 80 | 115 | 150 |
min_samples_split | 2 | 5 | 10 | ||
min_samples_leaf | 1 | 3 | 5 | 7 | 10 |
max_leaf_nodes | 10 | 45 | 80 | 115 | 150 |
The randomized search with cross validation returns the following dictionary of best parameters:
{
'min_samples_split': 10,
'min_samples_leaf': 3,
'max_leaf_nodes': 115,
'max_features': 'auto',
'max_depth': 115
}
Accordingly, we will narrow our search and provide a comprehensive grid search with the following parameters.
Hyperparameters | Values | ||
---|---|---|---|
max_features | auto | ||
max_depth | 100 | 120 | 150 |
min_samples_split | 5 | 10 | 20 |
min_samples_leaf | 1 | 5 | 10 |
max_leaf_nodes | 50 | 100 | 125 |
The grid search and cross validation outputs the following as the best hyperparameters.
{
'max_features': 'auto',
'max_depth': 150,
'min_samples_split': 5,
'min_samples_leaf': 1,
'max_leaf_nodes': 125
}
After creating a model with the above hyperparameters, the error metrics of the predicted feature and the learning curve of the model is shown below.
R2: 0.93394, MSE: 7.80085, RMSE: 2.79300, MAE: 0.88348
The R2 score is still relatively close to 1, however we are observing a considerable increase in the mean squared error and mean absolute error. This increase is especially concerning since the distribution of the variables is close to 0: IQR of (-2,2). Therefore, a root mean squared error of around 2.8 isn’t favorable. While it is true that the plain linear regression models (along with its variants) performed much better than the decision tree regressor, this model can help us interpret the rationale behind the machine learning prediction.
Random Forest Regressor
We will perform a random hyperparameter search on the following parameters.
Hyperparameters | Values | ||||
---|---|---|---|---|---|
n_estimators | 1 | 75 | 100 | ||
max_features | auto | sqrt | |||
max_depth | 10 | 45 | 80 | 115 | 150 |
min_samples_split | 2 | 6 | 10 | ||
min_samples_leaf | 1 | 5 | 10 | 20 | |
bootstrap | True | False |
The randomized search with cross validation returns the following dictionary of best parameters:
{
'bootstrap': False,
'max_depth': 92,
'max_features': 'sqrt',
'min_samples_leaf': 1,
'min_samples_split': 6,
'n_estimators': 25
}
Accordingly, we will narrow our search and provide a comprehensive grid search with the following parameters.
Hyperparameters | Values | ||||
---|---|---|---|---|---|
n_estimators | 25 | 50 | 100 | ||
max_features | auto | sqrt | |||
max_depth | 80 | 100 | 120 | ||
min_samples_split | 1 | 5 | 10 | ||
min_samples_leaf | 1 | 5 | 10 | ||
bootstrap | True | False |
The grid search and cross validation outputs the following as the best hyperparameters.
{
'bootstrap': True,
'max_depth': 100,
'max_features': 'auto',
'min_samples_leaf': 1,
'min_samples_split': 5,
'n_estimators': 100
}
After creating a model with the above hyperparameters, the error metrics of the predicted feature and the learning curve of the model is shown below.
R2: 0.96524, MSE: 4.10408, RMSE: 2.02585, MAE: 0.59007
As expected, the performance is boosted when using an ensemble of trees compared to only one decision tree. The mean squared error almost dropped in half and the mean absolute error dropped from 0.88 to 0.6. While the performance of the random forest regression is slightly worse than the linear regression models, it is a fairly reliable and robust model as we will see later when dealing with the inputs deprived of the time steps.
Support Vector Regression
We will perform a random hyperparameter search on the following parameters.
Hyperparameters | Values | |||
---|---|---|---|---|
C | 1 | 4 | 7 | 10 |
Kernel | linear | rbf | sigmoid |
The randomized search with cross validation returns the following dictionary of best parameters:
{
'C': 7,
'kernel': 'linear'
}
Accordingly, we will narrow our search and provide a comprehensive grid search with the following parameters.
Hyperparameters | Values | |||
---|---|---|---|---|
C | 6 | 7 | 8 | 10 |
Kernel | linear | rbf |
The grid search and cross validation outputs the following as the best hyperparameters.
{
'C': 10,
'kernel': 'linear'
}
After creating a model with the above hyperparameters, the error metrics of the predicted feature and the learning curve of the model is shown below.
R2: 0.98042, MSE: 2.31166, RMSE: 1.52041, MAE: 0.26166
We can see that the SVR model was able to achieve a considerable decrease in the mean absolute error (almost half of the linear regression’s error). However, according to the learning curve, for relatively few training data (less than 1,500 examples), the training error is worse than the cross-validation error. This hints at the possibility that the model is overfitting for small values of training examples, which might be due to the large value of C (10).
Feed Forward Neural Network
We will create two neural network models to compare their performance: one with the scikit-learn module and another with the keras and tensorflow modules.
Multilayer Perceptron Regressor with ScikitLearn
We will perform a grid hyperparameter search with the following parameters and values.
Hyperparameters | Values | ||
---|---|---|---|
hidden_layer_sizes | (50,) | (100,) | (150,) |
activation | identity | logistic | |
solver | lbfgs | adam | |
max_iter | 500 |
The grid search and cross validation outputs the following as the best hyperparameters.
{
'activation': 'identity',
'hidden_layer_sizes': (50,),
'max_iter': 500,
'solver': 'lbfgs'
}
While we do believe that the identity activation function is sub-optimal (it doesn’t allow for the complex patterns of passing through multiple layers), we will stick to the findings of the hyperparameter tuning and allow more freedom in the following keras neural network. After creating a model with the above hyperparameters, the error metrics of the predicted feature and the learning curve of the model is shown below.
R2: 0.98419, MSE: 1.86722, RMSE: 1.36646, MAE: 0.49410
Surprisingly and as shown above, the MLP model performs similarly to the Linear Regression models: all the error metrics are very close to the plain Linear Regression model. We can safely conclude that fitting our dataset with a Multilayer Perceptron Regressor is unnecessary as the relationship between our features isn’t very complex (a simple linear / polynomial relationship suffices).
Sequential Neural Network with Keras
We provide a build_model function to pass to the wrapper of KerasRegressor. That way, we can easily perform hyperparameter tuning and cross validation with the Keras module. The distribution of the hyperparameters is shown below.
Hyperparameters | Values | |||
---|---|---|---|---|
n_hidden | 0 | 1 | 2 | 3 |
activation | sigmoid | tanh | relu | identity |
n_neurons | continuous list from 1 to 100: 1, 2, 3, …, 100 | |||
learning_rate | continuous reciprocal function from 0.0003 to 0.03 |
Unfortunately, the random hyperparameter search took too long to yield any result (we left it running on Google Colab for a couple of hours in vain). Therefore, we will refer to the current literature when building our neural network.
The neural network built will consist of one input layer with 20 neurons (to capture as much of the input features as possible), two hidden layers with 30 neurons each, and one output layer (of one neuron). The activation function of the layers will be the identity function for the input layer and the SELU activation function for the two hidden layers. In fact, according to a 2017 paper by Günter Klambauer et al., if all the hidden layers use the SELU activation function (scaled version of the ELU activation function), the network will self-normalize, solving the vanishing/exploding gradients problem. Therefore, we will follow their recommendation which also includes initializing the hidden layer’s weights using the LeCun normal initialization. The following code represents our final model.
model = Sequential()
model.add(Dense(20, input_shape = (len(X_trainNorm[0]),),
activation = 'linear'))
model.add(Dense(30, activation = 'selu',
kernel_initializer="lecun_normal"))
model.add(Dense(30, activation = 'selu',
kernel_initializer="lecun_normal"))
model.add(Dense(1))
Concerning the optimizer, we will use the Nadam optimizer (Adam—adaptive moment estimation—optimizer with Nesterov Momentum), as it is recommended by the literature because “it will often converge slightly faster than Adam” (Géron 351).
After creating the neural network model with the above hyperparameters, the error metrics of the predicted feature and the learning curve of the model is shown below.
R2: 0.98136, MSE: 2.20120, RMSE: 1.48364, MAE: 0.44678
In line with our results from the previous MLP model, the sequential neural network doesn’t outperform the other machine learning model. As mentioned previously, this might be due to the fact that the relationship between the input features is fairly simple, so that a simple linear regression model can easily capture this pattern without the need for an overly advanced artificial neural network model.
Summary of the Error Metrics
For convenience, we provide below a summary table of the error metrics of all the machine learning models considered. We can observe that the Huber Regressor and Support Vector Regression are able to achieve the lowest mean absolute error (around 0.23 and 0.26 respectively). However, the plain linear regression and the linear regression with quadratic features are able to achieve the lowest mean squared error (around 1.84 each).
Machine Learning Model | R2 score | MSE | RMSE | MAE |
---|---|---|---|---|
Linear Regression | 0.98438 | 1.84479 | 1.35823 | 0.41999 |
Huber Regressor | 0.97997 | 2.36503 | 1.53787 | 0.23481 |
Linear Regression with Quadratic Features | 0.98443 | 1.83862 | 1.35596 | 0.44821 |
Lasso Regression (Elastic Net with l1_ratio=1) | 0.98397 | 1.89338 | 1.37600 | 0.42761 |
Elastic Net with Polynomial Features | 0.97814 | 2.58095 | 1.60653 | 0.50331 |
Decision Tree Regression | 0.93394 | 7.80085 | 2.79300 | 0.88348 |
Random Forest Regression | 0.96524 | 4.10408 | 2.02585 | 0.59007 |
Support Vector Regression | 0.98042 | 2.31166 | 1.52041 | 0.26166 |
Multilayer Perceptron Regressor | 0.98419 | 1.86722 | 1.36646 | 0.49410 |
Sequential Neural Network | 0.98136 | 2.20120 | 1.48364 | 0.44678 |
Interpretation of the Linear Regression and Decision Tree
We will briefly interpret the linear regression and the decision tree model to better understand how our machine learning models make their prediction. Shown below are the list of coefficients of the linear regression model, a bar plot of the feature importance of the decision tree, and a subset of the decision tree path of the model.
Intercept | 0.22119481051553427 |
Prev1 | 16.547780 |
Prev2 | -7.397643 |
DALYs | 0.049705 |
GDP | 0.044089 |
Life Expectancy | 0.145769 |
Inflation | -0.004923 |
Mortality | 0.026234 |
Healthcare expenditure | -0.039050 |
AF | 0.052919 |
AS | 0.061712 |
EU | -0.068737 |
NA | -0.016998 |
OC | -0.053462 |
SA | -0.022309 |
High | 0.015329 |
Low | -0.030158 |
Medium | -0.063294 |
Very High | 0.074775 |
As can be seen from the coefficient table, bar plot, and tree nodes, the models give the two previous time steps the most importance. The immediate previous time step (Prev1) seems to dominate the rationale of the model. For the linear regression model, the Life Expectancy, DALYs, and GDP seem to have the most weights from the other input features (the features have been standardized to allow for a direct comparison of the coefficients without inspecting the scales of the feature). For the decision tree model, the Inflation feature seems to be the most important factor from the rest of the input features for the decision tree (while still having an overly low importance relative to the time steps). However, the other input features (HDI, Continent, Life Expectancy, DALYs, etc…) seem to have a negligible effect on the model prediction.
Does this mean that a country is doomed by its previous net migration rate? If most of the prediction can be generated by just looking at the previous two time steps (since they have the largest coefficients and relative importance), how can a country target its policy to prevent a sudden emigration from its land?
Model Fitting without Time Steps
Since the above interpretation isn’t satisfying for the purpose of our project, we will remove the previous time steps from our model and follow the same guidelines above to tune the hyperparameters of the model and observe the change in performance so that we can provide a more useful interpretation of the model that doesn’t take into consideration the previous time steps. After performing the random and grid hyperparameter search with cross validation, we fit the models with the dataset. We provide below the error metrics of the models, the coefficients of the linear regression model, and the feature importance of the decision tree.
Machine Learning Model | R2 score | MSE | RMSE | MAE |
---|---|---|---|---|
Linear Regression | 0.23202 | 90.68742 | 9.52299 | 4.79592 |
Huber Regressor | 0.15452 | 99.83856 | 9.99192 | 4.38152 |
Linear Regression with Quadratic Features | 0.56009 | 51.94693 | 7.20742 | 4.27266 |
Lasso Regression (Elastic Net with l1_ratio=1) | 0.23450 | 90.39476 | 9.50762 | 4.63189 |
Elastic Net with Polynomial Features (degree 4) | 0.75917 | 28.43818 | 5.33275 | 3.32673 |
Decision Tree Regression | 0.77732 | 26.29485 | 5.12785 | 2.92504 |
Random Forest Regression | 0.84301 | 18.53802 | 4.30558 | 2.24595 |
Support Vector Regression | 0.42771 | 67.57886 | 8.22064 | 3.54472 |
36.33389 | 6.02776 | 3.33691 | ||
Sequential Neural Network | 0.79502 | 24.20565 | 4.91992 | 3.04692 |
Intercept | 0.25190295055289097 |
DALYs | 2.354790 |
GDP | 0.4475479 |
Life Expectancy | -0.3226781 |
Inflation | -0.3919908 |
Mortality | -1.927345 |
Healthcare expenditure | -1.170903 |
AF | 5.461093e+12 |
AS | 5.084628e+12 |
EU | 4.863591e+12 |
NA | 3.791603e+12 |
OC | 2.576027e+12 |
SA | 2.971740e+12 |
High | 8.253593e+14 |
Low | 8.447472e+14 |
Medium | 8.031996e+14 |
Very High | 8.513803e+14 |
As seen above, it appears that the DALYs, Life Expectancy, Mortality, Healthcare expenditure, Inflation, and GDP seem to be the most significant features considering their considerable weight in the linear regression model and in the bar plot of the feature importance for the decision tree. The HDI and Continent categories seem to play a negligible role (as their weight is in the order of 1014 and 1012 respectively, and they are barely shown in the bar plot above). This is good news! countries can’t change the continent in which they are located, so our model shows that a nation isn’t doomed by the continent in which it is located. Additionally, it often takes years of hard work for a country to improve its Human Development Index, because this index spans multiple economic, educational, and social sectors of life. This interpretation was able to show that if countries focus on the other features (DALYs, GDP, Inflation, etc…), they can produce tangible change in preventing a migration crisis.
Concerning the performance of the model, as expected the performance decreased significantly for nearly all models. Removing the time steps seriously weakens the models. However, through this analysis, we are able to detect the most robust models who still performed relatively strongly without the time steps (which could help countries that don’t regularly or accurately measure their net migration rate, so they can’t rely on their previous time steps). Those robust models are: Elastic Net with Polynomial Features (degree 4), Decision Tree Regressor, Random Forest Regressor, and the Feed Forward Neural Network. It is striking that the decision tree and random forest models were able to achieve comparable results without the time steps unlike the other models.
In conclusion, we strongly recommend the usage of an Elastic Net with Polynomial Features or a Random Forest Regression model in the task of predicting the net migration rate from the features considered in this project. These two models proved to be strongly robust and performed the best on the task at hand. We believe there is no need for an overly complex model like a Support Vector Regressor or an Artificial Neural Network as the relationship among our features is fairly simple.
References
“Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow,” Aurélien Géron (2019).
“Incorporating Nesterov Momentum into Adam,” Timothy Dozat (2015).
“Self-Normalizing Neural Networks,” G. Klambauer, T. Unterthiner and A. Mayr (2017).