CM3015 - Machine Learning and Neural Networks

1. Introduction

1.1 Context

The World Health Organization (WHO)’s Global Health Observatory (GHO) data repository tracks life expectancy for countries worldwide by following health status and many other related factors.

Dataset:
Kaggle - WHO Life Expectancy

Many past studies have examined the factors affecting life expectancy, considering demographic variables, income composition and mortality rates. However, the effect of immunization and the human development index was not taken into account. In addition, some past research used multiple linear regression on a data set covering only one year for all countries. This motivates addressing both gaps by formulating a regression model, based on a mixed-effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations such as Hepatitis B, Polio and Diphtheria are also considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors and other health-related factors. Since the observations in this dataset are organised by country, it will be easier for a country to identify the predicting factor that contributes to a lower life expectancy. This will help suggest to a country which area should be prioritised in order to efficiently improve the life expectancy of its population.

1.2 Content

The project relies on the accuracy of data. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The data sets are made available to the public for the purpose of health data analysis. The data set related to life expectancy and health factors for 193 countries was collected from the same WHO data repository website, and the corresponding economic data was collected from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. It has been observed that in the past 15 years there has been huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, in this project, data from the years 2000-2015 for 193 countries were considered for further analysis.

The individual data files were merged into a single data set. An initial visual inspection of the data showed some missing values. As the data sets were from WHO, no evident errors were found. Missing data were handled in R using the missmap command. The result indicated that most of the missing data were for population, Hepatitis B and GDP. The missing data were from lesser-known countries such as Vanuatu, Tonga, Togo and Cabo Verde. Finding all data for these countries was difficult and hence it was decided to exclude these countries from the final model data set.

The final merged file (final dataset) consists of 22 columns and 2938 rows, which means 20 predicting variables. All predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors and social factors.

1.3 Inspiration

The data-set aims to answer the following key questions:

Ideally, this data will eventually inform countries concerning which factors to change in order to improve the life expectancy of their populations. If we can predict life expectancy well given all the factors, this is a good sign that there are some important patterns in the data. Life expectancy is expressed in years, and hence it is a number. This means that in order to build a predictive model one needs to use regression.

Reference: Kaggle - WHO Life Expectancy

1.4 Purpose

The purpose of our model is not to answer the questions as listed above, but rather to build a model that can predict the life expectancy of a person with reasonable accuracy, based on a set of 20 input parameters, namely:

1.5 Model information

We're trying to predict the life expectancy of a person based on the set of input parameters listed above, so our problem is one of regression: the model outputs a single continuous value, the life expectancy in years.

We'll be using the Adam optimizer and the "mse" loss function when compiling our model which is done in section 2.5.

2. Methodology

2.1 Setting up the environment

2.1.1 Anaconda / Miniconda

I'll be using Jupyter Notebook with Anaconda, so I followed the instructions below.

Install TensorFlow

  1. Download and install Anaconda or the smaller Miniconda.
  2. On Windows open the Start menu and open an Anaconda Command Prompt. On macOS or Linux open a terminal window. Use the default bash shell on macOS or Linux.
  3. Choose a name for your TensorFlow environment, such as “tf”.
  4. To install the current release of CPU-only TensorFlow, recommended for beginners:
    conda create -n tf tensorflow
    conda activate tf
    Or, to install the current release of GPU TensorFlow on Linux or Windows:
    conda create -n tf-gpu tensorflow-gpu
    conda activate tf-gpu

Open the Anaconda Prompt and start jupyter notebook:
jupyter notebook

I received an error the first time I tried starting jupyter notebook, so I had to add the following paths to my environment variables in Windows:
anaconda3/Library/bin
anaconda3/Scripts

See screenshot below:
[Screenshot: Screenshot 2022-02-08 130026.png]

Once jupyter notebook started up, I received an error stating that the notebook wasn't trusted.
This was solved with the following command:
jupyter trust Final_Project.ipynb

Next, I was still unable to import TensorFlow in jupyter notebook and it gave me an error: Module TensorFlow not found.
I knew it was installed correctly and this was confirmed when checking the tensorflow version number in the Anaconda prompt.
I found the following post that shows that a link needs to be created between the virtual environment and TensorFlow:

There was also another problem which was that I had to update the kernelspec list as explained here:

So, the issues were solved with the following commands:

Once I had the kernelspec path corrected, I created the link to jupyter:

Start Jupyter Notebook from within the virtual environment in the Anaconda prompt:
jupyter notebook

The browser opens automatically and the kernel, named "Python TensorFlow" can now be selected as the kernel to use as seen in the below screenshot:
screenshot1.png

Check if TensorFlow was successfully installed and verify the version number:
import tensorflow as tf
print(tf.__version__)

References:

2.1.2 Additional packages

We'll also be using other Python packages like Pandas and Scikit-learn, which can be installed using the Anaconda command prompt as follows:
pip install wheel
pip install pandas
pip install -U scikit-learn
conda install -c conda-forge matplotlib
pip install scikeras

Keras is built on top of TensorFlow and can be accessed through TensorFlow as follows:
from tensorflow import keras

2.2 Data loading and observing

To start with, we'll load the life_expectancy.csv dataset file into a pandas DataFrame called dataset.

Let's have a look at the data by printing out the first 5 rows of the dataframe.

From the table above, we can see the columns and the data types we'll be working with.
Life Expectancy is the column with the data that we'll be predicting, so this will be used as the label data when training the model.
Since life expectancy is expressed as a number, our predictive model will be making use of regression.

Let's have a look at the summary statistics of the data.

Next, we want to drop the Country column. When creating a predictive model, knowing which country the data comes from can be misleading, and it's not a column we can generalize over.
We want to learn general patterns for all the countries, not only those dependent on specific countries.
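As a rough sketch of the loading and inspection steps described so far, the snippet below reads a tiny made-up CSV from a string instead of the real life_expectancy.csv file, so the rows and values are purely hypothetical. Only the pandas calls themselves mirror what the text describes.

```python
import io
import pandas as pd

# Hypothetical miniature stand-in for life_expectancy.csv (values are made up).
csv_text = """Country,Year,Status,Life expectancy,Adult Mortality,GDP
Aland,2014,Developing,59.9,271,612.7
Aland,2015,Developing,65.0,263,584.3
Borduria,2014,Developed,81.1,66,
Borduria,2015,Developed,81.5,65,43665.9
"""

# With the real file this would be pd.read_csv("life_expectancy.csv").
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())      # first rows of the DataFrame (at most 5)
print(dataset.describe())  # summary statistics of the numeric columns

# Drop the Country column so the model learns general patterns
# rather than country-specific ones.
dataset = dataset.drop(columns=["Country"])
print(dataset.columns.tolist())
```

The empty GDP field in the third row is read as NaN, which is the kind of missing value dealt with in the next step.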

Next, we want to ensure that our dataset does not contain any NaN values. Any NaN values in the dataset will cause problems later on when we train our model: the "loss" and "mae" values could not be calculated, the weights could not be updated, and the training would essentially fail.

I ran into this problem while training the model, and although it's not shown here in this notebook, I went through the below-referenced guide to troubleshoot the problem.

Reference:

Let's see if there are any NaN values in our dataset.

As we can see, our dataset does contain NaN values. Let's see how many.

Ok, that's quite a few.

We have two options here, namely:

  1. we could either replace each NaN value with the mean value of its column,
  2. or we could just remove the entire row from the dataset so that none of the values for that data point are taken into consideration when training the model.

The second option would however reduce the size of our dataset and could potentially lead to overfitting, but how much is still to be determined.

Let's just go for the second option for now and see how our dataset is affected. We'll save the modified dataset into a new variable in order to retain the original for reference purposes and in case we want to explore option 1.

Now that I've removed all rows containing NaN values, let's see how big our dataset is now in comparison to before the removal of data.
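The two options for handling NaN values can be compared on a small example. The sketch below uses a made-up DataFrame (the values are hypothetical), and the variable names dataset_filled and dataset_clean are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up toy frame with a few missing entries.
dataset = pd.DataFrame({
    "Life expectancy": [65.0, 59.9, np.nan, 81.1],
    "GDP": [584.3, np.nan, 631.7, 43665.9],
    "Population": [33736494.0, 327582.0, np.nan, 4614.0],
})

print(dataset.isna().any().any())  # True if at least one NaN exists
print(dataset.isna().sum())        # number of NaN values per column

# Option 1: replace each NaN with the mean of its column.
dataset_filled = dataset.fillna(dataset.mean(numeric_only=True))

# Option 2: drop every row that contains at least one NaN.
dataset_clean = dataset.dropna()

# Compare the sizes: option 1 keeps all rows, option 2 shrinks the dataset.
print(len(dataset), len(dataset_filled), len(dataset_clean))
```

Keeping the original in dataset and writing the result to a new variable matches the approach described above, so either option can be revisited later.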

Now that we have the data we need, we're ready to split the data into labels and features.
The labels are contained in the "Life expectancy" column, so let's start with that.

Features span from 1st column to the last column, not including the "Life expectancy" column.
Let's assign a subset of the dataframe in a new variable called features.

As we can see from the printout above, we now have all of the columns, except the "Life expectancy" column in the features subset of the dataset.
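A minimal sketch of this split, assuming a cleaned DataFrame called dataset_clean (a hypothetical name, filled here with made-up values):

```python
import pandas as pd

# Made-up stand-in for the cleaned dataset.
dataset_clean = pd.DataFrame({
    "Year": [2014, 2015],
    "Status": ["Developing", "Developed"],
    "Life expectancy": [59.9, 81.5],
    "GDP": [612.7, 43665.9],
})

# The label is the column we want to predict.
labels = dataset_clean["Life expectancy"]

# The features are all remaining columns.
features = dataset_clean.drop(columns=["Life expectancy"])

print(features.columns.tolist())
```

Dropping the label column by name avoids relying on its position in the DataFrame.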

2.3 Data Preprocessing

2.3.1 One-Hot-Encoding

Since neural networks cannot work with string data directly, we need to convert our categorical features into numerical. One-hot encoding creates a binary column for each category.

Looking at the features data we currently have, we'll note that it still contains columns that are categorical, for example, the "Status" column tells us whether the country is developed, or developing. This column needs to be converted into a numerical column and a method to do so is called one-hot-encoding. Let's apply one-hot-encoding on all the categorical columns.

Looking at the printout above, we can see that our features dataset has one extra column. It now has 21 columns and previously had 20 columns. The "Status" column was split into two columns, "Status_Developed" and "Status_Developing".
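One common way to one-hot encode with pandas is pd.get_dummies(). The sketch below assumes that this is the function used and applies it to a made-up two-column frame; the real features DataFrame has more columns.

```python
import pandas as pd

# Made-up features with one categorical and one numerical column.
features = pd.DataFrame({
    "Status": ["Developing", "Developed", "Developing"],
    "GDP": [612.7, 43665.9, 584.3],
})

# Every text/categorical column is replaced by one indicator column
# per category; numerical columns are left untouched.
features_encoded = pd.get_dummies(features)

print(features_encoded.columns.tolist())
print(features_encoded)
```

Because "Status" has two categories, one column is replaced by two, which is why the total column count grows by one.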

2.3.2 Splitting the data

In machine learning, we train a model on training data, and we evaluate its performance on a held-out set of data, our test set, not seen during the learning.

Let's move on to splitting our data into training and test sets.
We can use the sklearn.model_selection.train_test_split() function to do this.
We'll start with a 20% test size and a random state of 20.
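The split can be sketched as follows. The arrays are synthetic placeholders for the real features and labels; the test_size and random_state values are the ones mentioned above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders: 20 samples with 2 features each.
features = np.arange(40).reshape(20, 2)
labels = np.arange(20, dtype=float)

# 20% of the samples are held out for testing; random_state makes
# the shuffle reproducible.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.20, random_state=20
)

print(features_train.shape, features_test.shape)
```

Fixing random_state means every run produces the same split, which makes later comparisons between models fair.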

2.3.3 Standardize / Normalize

The usual preprocessing step for numerical variables, among others, is standardization that rescales features to zero mean and unit variance.
We need to do this because our features have different scales or units. The “BMI” and “Adult Mortality” columns have completely different interval units.
By having features with differing scales, the optimizer might update some weights faster than the others.

Normalization is another way of preprocessing numerical data: it scales the numerical features to a fixed range - usually between 0 and 1.

There is more than one way to do this. One example would be to subtract the mean and divide by the standard deviation, as shown here:
# statistics are computed from the training features only
mean = features_train.mean(axis=0)
std = features_train.std(axis=0)
features_train = (features_train - mean) / std

# the test features are scaled with the training mean and std
features_test = (features_test - mean) / std

Important to note from the example above is that the mean and std are computed from the training set only. No statistics should be computed from the test data, otherwise information about the test set leaks into training.

For this step, however, I'm going to use the sklearn.compose.ColumnTransformer instance to set up the normalization procedure.
We'll keep the example above commented out for reference purposes.

For this instance, we need to list all the numerical features in the dataset, but we can also use DataFrame.select_dtypes() to select float64 or int64 feature types automatically.

Now that we have an instance of ColumnTransformer, we're going to fit it to the training data and at the same time transform it using the ColumnTransformer.fit_transform() method.
We'll assign the scaled data to a new variable called features_train_scaled.
We'll also use the same instance to transform the data instance features_test and assign it to a new variable called features_test_scaled.

It's important to note that the fit_transform() and transform() methods of ColumnTransformer return NumPy arrays.
We can convert them back to a pandas DataFrame so we can see some useful summaries of the scaled data.

Using the pandas DataFrame .describe() method, we can get useful information regarding the normalized data, such as the mean, standard deviation, etc.
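The scaling steps above can be sketched end to end. The text does not say which scaler was placed inside the ColumnTransformer, so StandardScaler is an assumption here, as are the transformer name "scale" and the made-up values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up training and test features.
features_train = pd.DataFrame({"BMI": [19.1, 18.6, 58.0, 25.4],
                               "Adult Mortality": [263.0, 271.0, 74.0, 65.0]})
features_test = pd.DataFrame({"BMI": [30.0], "Adult Mortality": [150.0]})

# Select the numerical columns automatically instead of listing them.
numerical_columns = features_train.select_dtypes(
    include=["float64", "int64"]).columns.tolist()

# Assumption: StandardScaler is the scaler used inside the transformer.
ct = ColumnTransformer([("scale", StandardScaler(), numerical_columns)],
                       remainder="passthrough")

# Fit on the training data only, then reuse the fitted instance on the test data.
features_train_scaled = ct.fit_transform(features_train)
features_test_scaled = ct.transform(features_test)

# Both results are NumPy arrays; wrap them in DataFrames again.
# Reusing the original column names is valid here because every column is scaled.
features_train_scaled = pd.DataFrame(features_train_scaled,
                                     columns=features_train.columns)
features_test_scaled = pd.DataFrame(features_test_scaled,
                                    columns=features_test.columns)

print(features_train_scaled.describe())
```

Note that describe() reports the sample standard deviation, so with StandardScaler the printed std is close to, but not exactly, 1 for small datasets.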

2.4 Building the model

2.4.1 Model instance

Let's create a model instance of the tensorflow.keras.models.Sequential model.

2.4.1 Input layer

The following code initializes an input layer for a DataFrame my_data that has 15 columns:
from tensorflow.keras.layers import InputLayer
my_input = InputLayer(input_shape=(15,))

Notice that input_shape describes a single sample, so for tabular data it is a one-element tuple holding the number of features. You don’t need to specify the batch dimension (the number of samples or the batch size).

The following code avoids hard-coding by using the .shape property of the my_data DataFrame:
num_features = my_data.shape[1]
my_input = tf.keras.layers.InputLayer(input_shape=(num_features,))

Add the input layer to the model:
my_model.add(my_input)

We can print a useful summary of the model like this:
my_model.summary()

Let's create the input layer to the network model using tf.keras.layers.InputLayer with the shape corresponding to the number of features in the dataset.

Add the input layer to the model.

2.4.2 Hidden layers

Hidden layers are added to the model with the following command:
from tensorflow.keras.layers import Dense
my_model.add(Dense(64, activation='relu'))

With the activation parameter, we specify which activation function we want to apply to the output of our hidden layer. There are a number of activation functions, such as softmax and sigmoid, but relu (Rectified Linear Unit) is very effective in many applications.

Adding more layers to a neural network naturally increases the number of parameters to be tuned. Every layer has an associated weight matrix and bias vector.

We can add any number of hidden layers to our model, but for starters, we'll just add one.

2.4.3 Output layer

The output layer can be added to the model as follows:
from tensorflow.keras.layers import Dense
my_model.add(Dense(1))

Note that you don’t need to specify the input shape of this layer, since TensorFlow with Keras can automatically infer its shape from the previous layer.

Let's add an output layer with one neuron to the model, since a single output is needed for regression prediction.

2.4.4 Model build function

Instead of building the model step by step, as shown above, we can create a function that builds the complete model in one go. This will allow us to create multiple model instances should we want to in order to make comparisons between models, etc.

2.4.5 Model summary

It's important to distinguish here between the model (my_model), which was built step by step, and the build_model function, which contains more hidden layers and makes use of regularization techniques.

I'll be using my_model up to section 2.7 before we start looking at ways to improve model performance.

2.4.5.1 Manually built model: my_model

Now that we've completed building the model (my_model), let's print out a summary of the model.

2.4.5.2 Model builder function

The model builder function (section 2.4.4) contains multiple hidden layers, as well as utilizes regularization techniques in the form of dropout. Refer to section 2.5.1.5 for information on regularization.

Note the difference in the total parameters between the two models. We'll see later how the performance differs between the two models.
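The parameter totals reported in a model summary can be checked by hand: a fully connected layer with n inputs and m units has n * m weights plus m biases, and dropout layers add no parameters. The sketch below assumes 21 input features, one hidden layer of 64 units for my_model, and hidden layers of 128, 64 and 24 units for the builder function; these sizes are taken from the examples in this document and may differ from the actual notebook.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units


def count_params(n_features, layer_sizes):
    """Sum the parameters of a stack of dense layers."""
    total, n_inputs = 0, n_features
    for n_units in layer_sizes:
        total += dense_params(n_inputs, n_units)
        n_inputs = n_units  # the output of one layer feeds the next
    return total


n_features = 21  # assumed feature count after one-hot encoding

small = count_params(n_features, [64, 1])           # one hidden layer + output
large = count_params(n_features, [128, 64, 24, 1])  # three hidden layers + output

print(small, large)  # 1473 12657
```

Under these assumptions the larger model has roughly nine times as many parameters to train.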

2.5 Compiling the model

2.5.1 Optimizers

Optimizers update the attributes of a neural network, such as its weights, during the training phase of the model; adaptive optimizers also adjust the effective learning rate as training progresses.
This reduces the overall loss and therefore improves the model's predictions.

There are quite a few different optimizers to choose from. I'm not going to cover all of them in detail, but we'll list the ones that are published on the TensorFlow API docs:

Reference:

Below is an example of how to set up the Adam optimizer:
from tensorflow.keras.optimizers import Adam
opt = Adam(learning_rate=0.01)

When configuring the optimizer, the learning rate is set. See section 2.5.1.1 for more information regarding the learning rate.

A model is compiled as follows:
my_model.compile(loss='mse', metrics=['mae'], optimizer=opt)

loss denotes the measure of learning success and the lower the loss the better the performance. In the case of regression, the most often used loss function is the Mean Squared Error mse (the average squared difference between the estimated values and the actual value).

Additionally, we want to observe the progress of the Mean Absolute Error (mae) while training the model because MAE can give us a better idea than mse on how far off we are from the true values in the units we are predicting.
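A small worked example, using made-up ages, shows why the two measures read differently: MSE is in squared years, while MAE is in years.

```python
# Made-up true and predicted life expectancies, in years.
y_true = [70.0, 65.0, 80.0, 55.0]
y_pred = [72.0, 61.0, 79.0, 58.0]

errors = [p - t for p, t in zip(y_pred, y_true)]  # [2.0, -4.0, -1.0, 3.0]

# Mean squared error: (4 + 16 + 1 + 9) / 4
mse = sum(e ** 2 for e in errors) / len(errors)

# Mean absolute error: (2 + 4 + 1 + 3) / 4
mae = sum(abs(e) for e in errors) / len(errors)

print(mse)  # 7.5 (squared years)
print(mae)  # 2.5 (years)
```

The MAE of 2.5 can be read directly as "off by 2.5 years on average", whereas the MSE weights the single 4-year miss more heavily than the others.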

2.5.2 The Adam optimizer

In the paper that introduced it, Kingma and Ba describe Adam as an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. According to the authors, the method is straightforward to implement, is computationally efficient, has modest memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. It is also appropriate for non-stationary objectives and for problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. The paper also discusses connections to the related algorithms that inspired Adam, analyzes the theoretical convergence properties of the algorithm, and provides a regret bound on the convergence rate that is comparable to the best-known results under the online convex optimization framework. Empirical results reported there show that Adam works well in practice and compares favourably to other stochastic optimization methods. Finally, the paper introduces AdaMax, a variant of Adam based on the infinity norm.

Reference:

Create an instance of the Adam optimizer and set the learning rate to 0.01.

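Now that we have an instance of the Adam optimizer, we need to compile the model with the following parameters:

  1. loss='mse': the mean squared error as the loss function
  2. metrics=['mae']: the mean absolute error as an additional metric to track
  3. optimizer=opt: the Adam optimizer instance created above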

2.6 Fit, Evaluate and Test the model

2.6.1 Train the model

The following command trains a model instance "my_model" using training data "my_data" and training labels "my_labels":
my_model.fit(my_data, my_labels, epochs=50, batch_size=3, verbose=1)

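model.fit() takes the following parameters:

  1. my_data: the training features
  2. my_labels: the training labels
  3. epochs: the number of complete passes through the training data
  4. batch_size: the number of samples processed before each weight update
  5. verbose: how much progress output is shown (0 is silent, 1 shows a progress bar)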

Let's log the results into a variable called history so that we can use that information when evaluating the model.

2.6.2 Evaluate the model

Let's have a look at the model history in order to get a clearer picture of the model's performance.

Above we can see that the values are very large during the first few epochs, which stretches the scale of the plot and makes it hard to see where the curve reaches its lowest point and at which epoch. Let's omit the first couple of epochs so that the scale is easier to read.

2.6.3 Improving model

There are a couple of things we can do to improve the model's performance. Some of the things we'll consider are tuning the hyperparameters mentioned below, as well as implementing K-fold cross-validation, since our dataset is fairly small.

2.6.3.1 Hyperparameters

Hyperparameter tuning is probably the most costly and intensive process of neural network training. The following parameters are called hyperparameters and they can be changed and tweaked to improve the performance of the model:

  1. the learning rate
  2. batch size
  3. number of epochs
  4. hidden layers
  5. regularization (dropout)

2.5.1.1 Learning Rate

Neural networks are trained with the gradient descent algorithm and one of the most important hyperparameters in the network training is the learning rate. The learning rate determines how big of a change you apply to the network weights as a consequence of the error gradient calculated on a batch of training data.

A larger learning rate speeds up learning, but at the risk of settling on a suboptimal solution. A smaller learning rate might reach a better local or even global minimum, but it will take much longer to converge. In the extremes, a learning rate that is too large leads to an unstable learning process that oscillates over the epochs, while a learning rate that is too small may fail to converge in the available epochs or get stuck in a local minimum.
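The effect of the learning rate can be illustrated on a toy problem. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0. This is a deliberately simple stand-in for a network's loss, so it shows slow convergence and divergence but not local minima.

```python
def gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimise f(w) = w**2 starting from w, returning the final w."""
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # gradient descent update
    return w


too_small = gradient_descent(0.001)  # barely moves away from the start
reasonable = gradient_descent(0.1)   # gets close to the minimum at 0
too_large = gradient_descent(1.1)    # overshoots further on every step

print(too_small, reasonable, too_large)
```

With the largest rate each update flips the sign of w and increases its magnitude, which is the oscillating, unstable behaviour described above.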

2.5.1.2 Batch size

The batch size is a hyperparameter that determines how many training samples are seen before updating the network’s parameters (weight and bias matrices).

When the batch contains all the training examples, the process is called batch gradient descent. If the batch has one sample, it is called stochastic gradient descent. And finally, when 1 < batch size < number of training points, it is called mini-batch gradient descent. An advantage of using batches is that GPUs can parallelize the neural network computations within a batch.

How do we choose the batch size for our model? On one hand, a larger batch size will provide our model with better gradient estimates and a solution close to the optimum, but this comes at the cost of computational efficiency and generalization performance. On the other hand, a smaller batch size gives a poorer estimate of the gradient, but learning is performed faster. Finding the “sweet spot” depends on the dataset and the problem, and can be determined through hyperparameter tuning.
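To make the trade-off concrete, the number of weight updates per epoch is the number of training samples divided by the batch size, rounded up. The training-set size below is a hypothetical figure chosen only for illustration.

```python
import math

n_train = 1319  # hypothetical number of training samples

updates_per_epoch = {}
for batch_size in (1, 16, 64, n_train):
    # The last batch may be smaller than batch_size, hence the ceiling.
    updates_per_epoch[batch_size] = math.ceil(n_train / batch_size)

for batch_size, updates in updates_per_epoch.items():
    print(f"batch_size={batch_size:>4}: {updates} updates per epoch")
```

A batch size of 1 corresponds to stochastic gradient descent, a batch size equal to n_train to batch gradient descent, and the values in between to mini-batch gradient descent.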

2.5.1.3 Epochs

The number of epochs is a hyperparameter representing the number of complete passes through the training dataset. If the data is split into batches, in one epoch the optimizer will see all the batches.

How do you choose the number of epochs? Too many epochs can lead to overfitting, and too few to underfitting. One trick is to use early stopping: when the monitored performance (typically the validation loss) reaches a plateau or starts degrading, training stops.

Below is an example plot of overfitting where the parameters in the neural network are increased:
[Screenshot: Screenshot 2022-02-09 112436.png]

We know we are overfitting because the validation error at first decreases but eventually starts increasing. From the plot, we can see that the training could have been stopped earlier (around epoch 50).

We can specify early stopping in TensorFlow with Keras by creating an EarlyStopping callback and adding it as a parameter when we fit our model.
from tensorflow.keras.callbacks import EarlyStopping
stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=40)
history = model.fit(features_train, labels_train, epochs=num_epochs, batch_size=16, verbose=0, validation_split=0.2, callbacks=[stop])

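Here, we include the following:

  1. monitor='val_loss': the quantity to track, here the validation loss
  2. mode='min': lower values of the monitored quantity are better
  3. verbose=1: print a message when training is stopped early
  4. patience=40: the number of epochs without improvement to wait before stopping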
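The idea behind the patience setting can be sketched in plain Python. This mirrors the concept only and is not the Keras implementation; the function name and the validation losses are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stopped_epoch, best_epoch), both counted from 1."""
    best = float("inf")
    best_epoch = 0
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop: no improvement for `patience` epochs
    return len(val_losses), best_epoch    # ran through all epochs


# Made-up validation losses: they improve until epoch 4, then get worse.
val_losses = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.9, 7.2]

print(early_stopping_epoch(val_losses, patience=3))   # (7, 4)
print(early_stopping_epoch(val_losses, patience=10))  # (8, 4)
```

A small patience stops soon after the validation loss turns upward, while a large patience tolerates longer stretches without improvement at the cost of extra epochs.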

2.5.1.4 Hidden layer

Hidden layers are added between the input and the output layer of the model. We can have any number of hidden layers and we can specify the number of neurons that each layer should have, as well as specify the activation function that should be used. These parameters can influence the performance of the model.

2.5.1.5 Regularization: dropout

Regularization is a set of techniques that prevent the learning process from completely fitting the model to the training data, which can lead to overfitting. It makes the model simpler, smooths out the learning curve, and hence makes it more ‘regular’. There are many techniques for regularization, such as simplifying the model, adding weight regularization, weight decay, and so on. One of the most common regularization methods for neural networks is called "dropout".

Dropout is a technique that randomly ignores, or “drops out”, a number of outputs of a layer by setting them to zero during training. The dropout rate is the fraction of layer outputs set to zero (usually between 20% and 50%).
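What a dropout layer does to its inputs during training can be sketched with NumPy. The function below is an illustrative stand-in, not the Keras layer: it zeroes a random fraction of the outputs and scales the survivors by 1 / (1 - rate) so that the expected sum stays the same, which is also how the Keras Dropout layer behaves in training mode.

```python
import numpy as np


def dropout(outputs, rate, rng):
    """Zero a random fraction `rate` of the outputs and rescale the rest."""
    keep_mask = rng.random(outputs.shape) >= rate  # True where the output is kept
    return outputs * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
outputs = np.ones((1000, 64))  # pretend layer outputs, all equal to 1

dropped = dropout(outputs, rate=0.2, rng=rng)

zero_fraction = np.mean(dropped == 0.0)
print(zero_fraction)   # close to the dropout rate of 0.2
print(dropped.mean())  # close to 1.0 thanks to the rescaling
```

At prediction time no outputs are dropped, so the layer passes its inputs through unchanged.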

In Keras, we can add a dropout layer by introducing the Dropout layer.

from tensorflow.keras import layers

# Input layer (my_input is the input layer created in section 2.4.1)
model.add(my_input)

# Hidden layers, each followed by a dropout layer
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dropout(0.1))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dropout(0.2))
model.add(layers.Dense(24, activation='relu'))
model.add(layers.Dropout(0.3))

# Output layer
model.add(layers.Dense(1))

2.6.4 Test the model

When the training is finalized, we use the trained model to predict values for samples that the training procedure hasn’t seen: the test set.

The following command evaluates the model instance "my_model" using the test data "my_data" and test labels "my_labels":
test_mse, test_mae = my_model.evaluate(my_data, my_labels, verbose = 0)

In our case, model.evaluate() returns the value of our chosen loss (mse) followed by the value of our chosen metric (mae).

Let's just double check that our mae score is indeed correct by running a model prediction on the holdout data and comparing the mae scores.

2.6.5 Baseline - How good is our model?

At the moment, we don't know how good the model really is, because we have nothing to compare it with. We can see from the final metric from the test result on the holdout data above that we're out by about 3.71 years, which does not seem that bad, but what we need is a baseline to compare it with.

Above we can see the baseline model's error. Our model's error is much lower than the baseline error, so our model performs much better than the baseline.
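A simple baseline for regression is to always predict the mean of the training labels. The text does not state which baseline was used, so the sketch below is one possible choice, shown on made-up labels.

```python
# Made-up life expectancies, in years.
labels_train = [60.0, 70.0, 80.0, 50.0]
labels_test = [62.0, 75.0, 58.0]

# The baseline ignores the features and always predicts the training mean.
baseline_prediction = sum(labels_train) / len(labels_train)

# Mean absolute error of that constant prediction on the test labels.
baseline_mae = sum(abs(y - baseline_prediction) for y in labels_test) / len(labels_test)

print(baseline_prediction)  # 65.0
print(baseline_mae)         # about 6.67 years
```

scikit-learn offers the same idea as DummyRegressor(strategy="mean"). A trained model is only useful if its test MAE is clearly below this kind of baseline.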

2.7 Auto tuning

Manually tuning the hyperparameters to find the combination that results in a model with the lowest loss is quite cumbersome. Luckily, there are two ways of auto-tuning our model, namely:

  1. Grid search
  2. Randomized search

Grid search, or exhaustive search, tries every combination of desired hyperparameter values, whereas Random Search goes through random combinations of hyperparameters and doesn’t try them all.
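The difference between the two strategies can be shown without training any model. The parameter grid below is hypothetical; it only illustrates how many candidate combinations each strategy would have to train and evaluate.

```python
import itertools
import random

# Hypothetical hyperparameter grid.
param_grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "batch_size": [8, 16, 32],
    "epochs": [50, 100],
}

# Grid search: every combination of the listed values (3 * 3 * 2 = 18).
keys = list(param_grid)
grid = [dict(zip(keys, values))
        for values in itertools.product(*param_grid.values())]

# Randomized search: only a fixed number of randomly chosen combinations.
rng = random.Random(0)
random_subset = rng.sample(grid, k=5)

print(len(grid), len(random_subset))  # 18 5
```

Each candidate means one full training run (or several with cross-validation), so the grid grows quickly as more values are added, while the cost of a randomized search is fixed by the number of samples drawn.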

Let's implement a grid search for the best parameters and see what the results are.

Let's print out the result and see which hyperparameters are the best suited to our model.

2.7.2 Early stopping

The number of computations we had to do with Autotuning using the Grid search method was immense and it took a really long time to complete. To make this process more efficient, we could just use Randomized search, or we could make use of early stopping.

As we know, too many epochs lead to overfitting and too few lead to underfitting. Early stopping helps determine a suitable number of epochs: when the monitored performance reaches a plateau or starts degrading, training stops.

Let's build a new model and see how early stopping could be used to find the correct number of epochs.

Above we can see that the training process was stopped at epoch number 307. The validation score came down and then started increasing again just before this epoch, so training was stopped. Let's plot the results for more clarity.

From the plots above we can see the loss and validation score decrease and then stabilise. It's at this point that we know we're starting to overfit the model and can terminate training.

3. Conclusions

We've seen how a Neural Network model can be used to predict the life expectancy of a person based on a set of 20 input parameters with a relatively good degree of accuracy.

We've gone through the process of importing and preprocessing the data as well as splitting the data into groups used for training, validating and testing. The testing set is used exclusively for testing the model and we've seen how this model performs better than a baseline model.

We've looked at auto-tuning the model to find the best hyperparameters for the model to use, and we've also seen how early stopping could be used to find a suitable number of epochs, one that lies between underfitting and overfitting.

4. Acknowledgements and References

I've made references throughout the document where relevant, but the main sources upon which this project is built are the following:

The data was collected from WHO and United Nations websites with the help of Deeksha Russell and Duan Wang.

The following resources were used to gain a good understanding of deep learning, as well as contain coding examples: