Exploratory Data Analysis for Predicting Migration Rate from Socio-economic Factors
In this post we will apply exploratory data analysis concepts to the problem of predicting a country’s net migration rate as a time series from its socio-economic factors.
The code for my project can be found here.
Introduction
In this project, we will perform data analysis followed by ML model fitting to predict a country’s net migration rate as a time series from socio-economic factors. The following are some useful definitions to keep in mind.
Net migration rate: the difference between the number of migrants entering and those leaving a country in a year, per 1,000 midyear population (U.S. Census definition). If the rate is positive, it indicates more people leaving the country than entering it (net immigration rate), and if it is negative, it indicates more people entering the country than leaving it (net emigration rate).
DALYs (Disability-Adjusted Life Years): One DALY represents the loss of the equivalent of one year of full health. DALYs for a disease or health condition are the sum of the years of life lost due to premature mortality (YLLs) and the years lived with a disability (YLDs) due to prevalent cases of the disease or health condition in a population (WHO).
HDI (Human Development Index): A composite index measuring average achievement in three basic dimensions of human development: healthcare, education, and economic situation (UNDP).
Other socio-economic factors which include GDP, Life Expectancy, Inflation, Mortality, and Healthcare expenditure (collected from the World Bank) will also guide us in performing our data analysis and ML training.
The final dataset can be found here.
Feature Correlation
At first, we have to look at the correlation of the different features to get a sense of how the features interact with each other, which will help guide us through the exploratory data analysis. After reading the dataset into a Pandas data frame, we obtain the following correlation matrix.
import pandas as pd
df = pd.read\_csv('Dataset.csv', header=0)
df.corr()
One one level, we can look at how the various features correlate with the net migration rate. The year variable has a near zero correlation with the net migration rate, which is a good aspect of our dataset. It points to the fact that the distribution of the net migration rate is independent of the year considered. However, the main factors correlated with the net migration rate are the Human Development Index (HDI), Mortality, and disability-adjusted life years (DALYs). The former indicator correlates positively with the migration while the two latter indicators correlate negatively. This makes sense as the higher the country is developed, the more immigrants might come into the country. While the higher the mortality and DALYs are, the more likely the country is to experience a loss of citizens through emigration. The other indicators, namely life expectancy, healthcare expenditure, and GDP seem to have limited correlation with the output feature. The inflation indicator has the lowest correlation (-0.5%) which might push us to eliminate the column altogether from the data. However, we still have to aggregate the data and visualize it to make that decision.
Another important aspect to note is how the input features correlate among each other. For example, the HDI and life expectancy seem to highly correlate (91%). In fact, one of the factors taken into consideration when calculating the HDI of a country by the UN is the life expectancy in that country. Hence, it makes total sense for these two indicators to correlate. However, we might have to remove one of them when training a machine learning model on the data, or alternatively merge the two columns. Unexpectedly also, mortality and DALYs are highly correlated (67%): they both indicate the development of the health sector in a country. The variables HDI and DALYs are also highly correlated (-80%). We will keep a close look at all the above features throughout the EDA so that we can conclude on whether to remove on of the above features or merge them together through dimensionality reduction at a later stage.
Data Analysis with R
We will perform the first part of our analysis using R. We will have to import the necessary library to visualize our plots (using ggplot2) and read our dataset. We will also transform the Year column into a Date type in R, divide the HDI from numerical to categorical (low, medium, high or very high), and remove any entry with unknown continent code.
library(ggplot2)
setwd("dataset\_directory")
df <- read.csv("Dataset.csv", header=TRUE, na.strings = "")
for (i in seq\_len(length(df$Year))) {
df$Year\[i\] <- (paste("01-01-",as.character(df$Year\[i\]),sep=""))
}
df$Year <- as.Date(df$Year, format="%d-%m-%Y")
unique(df$Year)
sapply(df, class)
df$HDI <- sapply(df$HDI, cut, breaks = c(0, 0.55, 0.7, 0.8, 1),
labels = c("Low", "Medium", "High", "Very High"))
df <- subset(df, df$Continent.Code!="Unknown")
Feature Distribution
We will plot the distribution of some of the features throughout the years. The following R code plots the corresponding figures.
ggplot(df, aes(x=Year, y=Net.Migration.Rate, group=Year)) +
geom\_boxplot() +
coord\_cartesian(ylim=c(-50,50)) +
labs(title = "Variation of Net Migration Rate per Year",
y = "Net Migration Rate", x = "Year")
ggplot(df\[df$DALYs<150000,\], aes(x=Year, y=DALYs, group=Year)) +
geom\_boxplot() +
labs(title = "Variation of DALYs per Year",
y = "DALYs", x = "Year")
ggplot(df, aes(x=Year, y=GDP, group=Year)) +
geom\_boxplot() +
coord\_cartesian(ylim=c(-30,30)) +
labs(title = "Variation of GDP growth per Year",
y = "GDP growth (%)", x = "Year")
ggplot(df, aes(x=Year, y=Inflation, group=Year)) +
geom\_boxplot() +
coord\_cartesian(ylim=c(-25,110)) +
labs(title = "Variation of Inflation per Year",
y = "Inflation, consumer prices (annual %)", x = "Year")
Net Migration Rate per Year
As we can see the median of this distribution remains roughly stable throughout the years: it stays around 0. It is also clear how there are a lot of outliers in this distribution. This is expected since a large positive migration rate in one country should translate into a negative migration rate in other countries (the immigrants of one are the emigrants of the other). The fact that the median is stable around 0 from 1960 till 2020 confirms the reliability of the data.
DALYs
Concerning the boxplot of the DALYs, the data presents a decrease in the median and standard deviation of the DALYs with the progress of the years. This shows the overall increase in the quality of healthcare around the globe, which explains the decrease in the DALYs. The data has become more concentrated in recent years (decrease in the standard deviation) which might be due to globalization and recent efforts by the UN to better the living conditions of underdeveloped nations.
GDP Growth
Concerning the boxplot of the GDP growth (as a percentage) per year, we can see the fluctuations of this feature throughout the years. The median fluctuates around 5% with occasional downfalls. We can point out specific years where the fall of the GDP growth was expected. For example, the US stock market crash of 2008, which affected most countries, caused a sharp decline of the GDP growth in that year. Additionally, the 2020 Coronavirus Stock Market Crash is also clearly visible in the boxplot where the median GDP growth of countries becomes negative for the first time since 1960, which indicates an overall decline in the GDP in most countries of the world due to the pandemic.
Inflation
Concerning inflation, it is striking to see that in early and later years (1960–1970 and 2000–2020) the median inflation fluctuates a bit above 0 and the standard deviation is smaller to the other periods. However from 1970 till 2000, we see that the data becomes more dispersed and that the median inflation is significantly larger than before 1970 or after 2000. This might be due to the fact that during this period, a lot of countries experienced political and economical crises, which skewed the data towards having a larger inflation. We can clearly see a large number of outliers in the data too, which confirms our hypothesis.
Data Aggregation
We will perform data aggregation according to the HDI level as well as the country’s continent.
Aggregation by HDI Level
First the UN divides the HDI into four levels: low, medium, high, and very high. After aggregating according to this classification, we can visualize the variation of the Net migration rate per year per HDI level using the R code below.
agg1 <- aggregate(cbind(Net.Migration.Rate) ~ HDI+Year, data=df, FUN=mean)
ggplot(agg1, aes(x=Year, y=Net.Migration.Rate, color=HDI)) +
geom\_line(stat="identity", lwd=1.2) +
geom\_smooth(linetype=2) +
labs(title = "Variation of Net Migration per Year",
subtitle = "Divided according to HDI",
y = "Net Migration Rate", x = "")
We can clearly see a distinction of this evolution according to the HDI level of the countries. As expected, the countries with a very high HDI have the greatest net migrant rate: a lot of immigrants come to highly developed countries. However, there is a decline in this rate starting in 2010. This might be due to stricter immigration policies.
We can also point out that countries who have a high HDI rank second in terms of migration rate. However, it is worthy to note that while this rate was positive in 1990, it decreased slowly to become negative in 2020. One hypothesis might be that some countries who had a high HDI in 1990 progressed and developed into having a very high HDI, climbing up the HDI ranking, while the remaining countries might have faced national difficulties preventing them from increasing their HDI. Hence, this decreased the overall average of the net migrant rate of high HDI countries (similar to what a sampling bias might do to the statistic).
Surprisingly, although countries with a low and medium HDI present a negative net migration rate, countries with a medium HDI have the lowest rate. Why aren’t countries with a low HDI with the lowest net migration rate? This might be due to the fact that extremely underdeveloped countries do not allow their citizens to emigrate freely from the country. For instance, African and Asian countries with very low HDI might suffer from conservative norms and low education and financial status, preventing them from immigration (such as the African tribes and Arab Bedouins) or it could even be a political regime prohibiting emigration.
Aggregation by Continent
Furthermore, we can split the data according to the six continents: Asia (AS), Africa (AF), Europe (EU), North America (NA), South America (SA), and Oceania (OC). The code below does this aggregation.
agg2 <- aggregate(cbind(Net.Migration.Rate) ~ Continent.Code+Year, data=df, FUN=mean)
ggplot(agg2\[,\], aes(x=Year, y=Net.Migration.Rate, color=Continent.Code)) +
geom\_line(stat="identity", lwd=1, alpha=0.5, linetype=1) +
#facet\_wrap(~ Continent.Code) +
geom\_smooth(linetype=2)+
labs(title = "Variation of Net Migration per Year",
subtitle = "Divided according to Continent",
y = "Net Migration Rate", x = "Year")
We don’t see a strong correlation between the continents. Most continents have a negative migration rate, except for Asia. This might be due to the fact that most immigration happens between neighboring countries in the same continent. For instance, Syrians fleeing war to Lebanon, Venezuelans fleeing the inflation to Columbia, Mexicans immigrating to the USA, Ukrainians fleeing the Russian war to Poland and Moldova. All these examples strengthen our hypothesis that migration is concentrated among neighboring countries. Hence, the emigrants of a country get translated into immigrants for the neighboring country, in the same continent: resulting in an overall fluctuation of the net migration rate around 0 for most continents, especially in the past two decades i.e., from 2000 till 2020. We will try applying this theory when visualizing the migration rates through maps.
Variation of the features in specific countries
We will visualize the effect of inflation on net migration through the case study of a few countries in order to generalize on the interplay of the different features.
Net migration and Inflation
We will visualize the effect of inflation on net migration through their variation in two countries: Honduras, a country in Central America, and Iraq, a middle eastern country in Asia. The time series of this evolution can be shown below.
countryName <- "Honduras"
plot1 <- ggplot(df\[df$Country.Name==countryName,\], aes(x=Year, y=Inflation)) +
geom\_line(stat="Identity") +
geom\_smooth() +
labs(title = "Variation of Net Migration and Inflation",
y = "Inflation, consumer prices (annual %)", x = "", subtitle = countryName)
plot2 <- ggplot(df\[df$Country.Name==countryName,\], aes(x=Year, y=Net.Migration.Rate)) +
geom\_line(stat="Identity") +
geom\_smooth() +
labs(y = "Net Migration Rate", x = "Year")
countryName <- "Iraq"
plot3 <- ggplot(df\[df$Country.Name==countryName,\], aes(x=Year, y=Inflation)) +
geom\_line(stat="Identity") +
geom\_smooth() +
labs(title = "Variation of Net Migration and Inflation",
y = "Inflation, consumer prices (annual %)", x = "", subtitle = countryName)
plot4 <- ggplot(df\[df$Country.Name==countryName,\], aes(x=Year, y=Net.Migration.Rate)) +
geom\_line(stat="Identity") +
geom\_smooth() +
labs(y = "Net Migration Rate", x = "Year")
gridExtra::grid.arrange(plot1, plot3, plot2, plot4,nrow = 2)
The negative correlation between these two variables is clear: an increase in inflation correlates with a decrease in the net migration rate and vice versa in the two countries. When these countries experience a peak in inflation, it correlates with a significant decrease in the net migration i.e., an increase in the number of emigrants fleeing the country. When this inflation decreases, the net migration rate increases, signifying either that the emigrants are coming back to the country or that new immigrants are coming into the country due to the stabilization of the economical situation in this nation.
Net migration and neighboring countries
Furthermore, it is interesting to visualize the evolution of net migration in neighboring countries, especially those undergoing economical, financial or political crises. The graphs below shows the variation of the net migration for Venezuela & Colombia, and for Mexico and the United States. As a historical background, we should consider that since 1970 Colombians have been fleeing to Venezuela to avoid the violent conflict of their homeland. However, as of 2016, the roles have been reversed: Venezuelans have been immigrating to Colombia because of the terrible financial crises of their country.
This trend is clearly shown in the data, whereas in early years the net migration of Colombia was negative but that of Venezuela positive. However, in recent years, the migration rate has been increasing in Colombia and decreasing in Venezuela (due to the mass immigration of Venezuelans to Colombia). The sharp decrease in this rate in Venezuela coincides with its spike in Colombia (around 2016–2020), which confirms our hypothesis.
While the situation in Colombia and Venezuela might be considered as an outlier, we can see a similar trend when neighboring countries have a huge disparity in economical or developmental opportunities. For instance, in Mexico and the United States, the variation of their net migration rate coincides with each other (whereas a sharp increase in one reflects a decrease in the other, especially around the year 2000).
A similar trend can be seen in other neighboring countries, e.g. Albania & Greece, Bangladesh & India, Syria & Lebanon, Oman & Yemen. The spike in the migration rate in one of these countries is correlated with a fall in its neighboring country around the same period. The widespread nature of this phenomena (which we can’t disregard as being a few outliers) will necessitate us to encode spatial locality in our model to consider the features of the neighboring countries (including the net migration rate) in order to predict the final migration rate of the given country.
Data Analysis with Python
We will now plot the data as scatter plots and visualize it on maps using Python.
Scatter Plots
First, we will divide the countries into their HDI rank and plot them according to their net migration rate, GDP growth and Inflation percentage. The three dimensional scatter plot can be shown below.
As expected, the countries with a high net migration rate also happen to have a high HDI, a high GDP growth and a low inflation percentage, whereas countries with a negative migrant rate are mostly underdeveloped and suffer from a low GDP growth and a high inflation.
Next, we will visualize the scatter plot of net migration rate, DALYs, and mortality.
Considering the high correlation of DALYs and mortality, we expect to see a clear distinction in the plot. As expected the countries with a high DALYs also have a high mortality and vice versa. We can clearly see in the plot that highly developed countries with low DALYs and mortality experience a high migrant rate whereas underdeveloped countries with high DALYs and mortality suffer from an increased number of emigration (negative net migration rate).
We are also able to animate this scatter plot by year to visualize the variation in those different features as a time series. You can refer to the code on GitHub to generate this scatter plot animation. Through this animation, we can learn about the yearly trends of the data. We notice that the DALYs and mortality tend to decrease from year to year and that the GDP growth fluctuates around 0 and 10% for most years except for its occasional fall during a market crash (e.g. 2008–2009 and 2020). Concerning the inflation, before 200 we can see a considerable dispersion of the data (indicating a large standard deviation) while after the year 2000 countries become closer in regards to the percentage of inflation. All these insights confirm what we deduced at the beginning from the boxplots of the distribution of the features in prior project section.
Variation of migration according to DALYs, Mortality (social factors), and HDI
Variation of migration according to GDP, inflation (economic factors), and HDI
Animated Heatmap of Migration
Considering the spatial and temporal features of our dataset, the best way to visualize the data would be through an animated world map. To do so, we will use the geopandas python module and merge our dataset with the world dataset available through the module. That way, the resulting dataset will include the appropriate country names and geometries so that it can be easily converted into a folium Map. Besides doing so, the prepForHeatMap function normalizes the net migration rate because the weight input of the HeatMap function must be between 0 and 1. The normalization disregards the outliers (with a z-score above 3 or below -3) when calculating the normal score, and then substitutes those with a z-score above 3 with a score of 1 and those below -3 with a score of 0. That way the outliers won’t skew the normal distribution while staying in the data (so that we don’t end up with missing countries on the map). Refer to the GitHub code for details about the implementation in the GetMap.py file.
The purple countries are the ones with a low migration rate while the green ones have a high migration rate. This map gives us an overview of the migration flow in the year 2020. The most popular destinations for immigrants appear to be the US, Canada, Eastern Europe, Australia & New Zealand (not shown in the figure) and the Arab states of the Persian Gulf (Saudi Arabia, Kuwait, UAE). There are a few outliers in the map, namely the countries of central America who are accepting Venezuelans immigrants suffering from the crisis of their country along with Lebanon who housed the Syrian immigrants fleeing war.