1. Exploratory Data Analysis

In this coursework we are going to be working with the Wine dataset. This is a 178 sample dataset that categorises 3 different types of Italian wine using 13 different features. The code below loads the Wine dataset and selects a subset of features for you to work with.

1.1. Visualising the data

The first part of tackling any ML problem is visualising the data in order to understand some of the properties of the problem at hand. When there are only a small number of classes and features, it is possible to use scatter plots to visualise interactions between different pairings of features.

The following image shows what such a visualisation might look like on the Iris dataset that you worked on during the Topic exercises.

[image: example pairwise feature grid on the Iris dataset]

Your first task is to recreate a similar grid for the Wine dataset, with each off-diagonal subplot showing the interaction between two features, and each of the classes represented as a different colour. The on-diagonal subplots (representing a single feature) should show a distribution (or histogram) for that feature.

You should create a function that, given data X and labels y, plots this grid. The function should be invoked something like this:

myplotGrid(X,y,...)

where X is your training data and y are the labels (you may also supply additional optional arguments). You can use an appropriate library to help you create the visualisation, or you might want to code it yourself using the matplotlib functions scatter and hist; however, this is not strictly necessary here, so try not to spend too much time on it.

Manual Grid plot implementation

I started off coding this grid plot from scratch, which took me a while.
I used the Iris dataset, since the example given was based on Iris, which made it easier to use as a reference.
After struggling for hours, I decided to look for resources; I found the Seaborn library and decided to use it, since its implementation is quite easy and straightforward.

I left the original code here for reference purposes, but as you'll see, I was unable to implement the diagonal charts correctly.
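
For reference, below is a minimal working sketch of the manual approach, using only matplotlib's scatter and hist (a reconstruction under those assumptions, not my original attempt; the optional feature_names argument is my own):

```python
import numpy as np
import matplotlib.pyplot as plt

def myplotGrid(X, y, feature_names=None):
    """Pairwise feature grid: scatter off-diagonal, per-class histogram on-diagonal."""
    X, y = np.asarray(X), np.asarray(y)
    n_features = X.shape[1]
    if feature_names is None:
        feature_names = [f"feature {i}" for i in range(n_features)]
    fig, axes = plt.subplots(n_features, n_features,
                             figsize=(3 * n_features, 3 * n_features))
    for i in range(n_features):
        for j in range(n_features):
            ax = axes[i, j]
            for label in np.unique(y):
                mask = (y == label)
                if i == j:
                    # on-diagonal: distribution of the single feature, one colour per class
                    ax.hist(X[mask, i], alpha=0.5, label=str(label))
                else:
                    # off-diagonal: feature j against feature i, coloured by class
                    ax.scatter(X[mask, j], X[mask, i], s=10, label=str(label))
            if i == n_features - 1:
                ax.set_xlabel(feature_names[j])
            if j == 0:
                ax.set_ylabel(feature_names[i])
    axes[0, 0].legend(title="class")
    plt.tight_layout()
    plt.show()
```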

Seaborn implementation

Below is the working implementation on the Wine data, using the Seaborn library.
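
A minimal sketch of this implementation, assuming the features and labels are wrapped in a pandas DataFrame so seaborn can colour by class (the feature_names argument is my own):

```python
import pandas as pd
import seaborn as sns

def myplotGrid(X, y, feature_names=None, diag_kind="hist"):
    """Pairwise grid via seaborn: scatter off-diagonal, distribution on-diagonal."""
    df = pd.DataFrame(X, columns=feature_names)
    df["class"] = y
    sns.pairplot(df, hue="class", diag_kind=diag_kind)
```

Invoked as myplotGrid(X, y), pairplot draws the scatter grid with per-class histograms (or KDEs, via diag_kind) on the diagonal.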


Let's just test the docstring to see that it works as expected.

Next, let's call the myplotGrid function with the wine data as given.
As we can see below, implementing the distribution chart on the diagonal, which is what I struggled with in my manual implementation, is quite easy and straightforward.

1.2. Exploratory Data Analysis under noise

When data are collected under real-world settings they usually contain some amount of noise that makes classification more challenging. In the cell below, invoke your exploratory data analysis function above on a noisy version of your data X.

Try to perturb your data with some Gaussian noise,

import numpy as np

# initialize random seed to replicate results over different runs
mySeed = 12345
np.random.seed(mySeed)
XN = X + np.random.normal(0, 0.6, X.shape)

and then invoke

myplotGrid(XN,y)

Data with added noise

Below is the grid plot of the wine data with Gaussian noise added using the code above.

Q1. Exploratory data analysis

Based on your exploratory analysis, if you were to build a classifier using only two of the available features, which ones would you choose and why? Answer as fully as you can.

Answer:

I would choose flavanoids and alcohol as the two features to use for classification.

If we look at the scatter plot at row 0, col 1 (or row 1, col 0, which is the same plot mirrored), we can see a clear distinction between the classes along these two features.
These two features would therefore work well when we need to classify the wines into one of the three classes.

Also, when looking at the distribution charts of these two features, we can see that the three classes peak in distinctly different places.
The ash distribution chart is an example of a poor feature for classification, since the three class distributions look very similar.

Q2. Data with noise

What do you observe by plotting the data without noise compared to plotting with added Gaussian noise?

Answer:

Looking at the scatter plots, the data plotted without noise are more tightly clustered by class.
With the noise added, there is a lot more overlap between the different classes, which would make classification more difficult.

Looking at the distribution plots, we see a big difference in the flavanoids.
Without the noise, class 2 had a peak much higher than classes 0 and 1, whereas with the noise, classes 0 and 1 have higher peaks than class 2.

2. Implementing kNN

In the cell below, develop your own code for performing k-Nearest Neighbour classification. You may use the scikit-learn k-NN implementation from the labs as a guide - and as a way of verifying your results - but it is important that your implementation does not use any libraries other than the basic numpy and matplotlib functions.

Define a function that performs k-NN given a set of data. Your function should be invoked similarly to:

    y_ = mykNN(X,y,X_,options)

where X is your training data, y is your training outputs, X_ are your testing data and y_ are your predicted outputs for X_. The options argument (can be a list or a set of separate arguments depending on how you choose to implement the function) should at least contain the number of neighbours to consider as well as the distance function employed.

Hint: it helps to break the problem into various sub-problems, implemented as helper functions. For example, you might want to implement a separate function for calculating the distance between two vectors, and another function that finds the nearest neighbour(s) to a given vector.

Helper functions

The k-Nearest Neighbor algorithm is a supervised machine learning algorithm that uses labelled training data to assign the appropriate label to new, unlabelled data.

There are a few basic steps that need to be taken for the algorithm to work, and these are:

Step 1

Calculate the Euclidean distance between the test point and each point in the training set.

Step 2

Find the nearest neighbors for a given point.
The number of neighbors is given by the parameter k.

Step 3

Classify the point based on the majority class among its neighbors.
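
As a sketch of these three steps (assuming numpy only; the helper names euclidean_distance, nearest_neighbours and majority_vote are my own, not necessarily those used in the notebook):

```python
import numpy as np

def euclidean_distance(a, b):
    """Step 1: Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def nearest_neighbours(X, y, x_, k):
    """Step 2: labels of the k training points closest to the test point x_."""
    distances = np.array([euclidean_distance(x, x_) for x in X])
    nearest = np.argsort(distances)[:k]   # indices of the k smallest distances
    return np.asarray(y)[nearest]

def majority_vote(labels):
    """Step 3: the most common label among the neighbours."""
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]
```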

KNN function

Call the functions in sequence on the data given: run Steps 1, 2 and 3.
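
A minimal sketch of how the three steps might be composed into a single classifier (the mykNN name and signature come from the brief; the default Euclidean distance is my assumption):

```python
import numpy as np

def mykNN(X, y, X_, k=3, distance=None):
    """Predict a label for every test point by running steps 1-3 in sequence."""
    if distance is None:
        distance = lambda a, b: np.sqrt(np.sum((a - b) ** 2))  # default: Euclidean
    X, y, X_ = np.asarray(X), np.asarray(y), np.asarray(X_)
    y_ = []
    for x_ in X_:
        dists = np.array([distance(x, x_) for x in X])     # step 1: distances
        neighbour_labels = y[np.argsort(dists)[:k]]        # step 2: k nearest
        values, counts = np.unique(neighbour_labels, return_counts=True)
        y_.append(values[np.argmax(counts)])               # step 3: majority vote
    return np.array(y_)
```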

Normalizer function

Normalizing the data puts all features on a comparable scale, so that features with large ranges (or outlying values) don't dominate the distance calculation.
We'll see further down how normalizing the data improves the accuracy of the classifier.
There are several ways to normalize data; I've implemented a min-max normalizer function.
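
A minimal sketch of such a function (the name min_max_normalize matches the one mentioned in section 6; it assumes no feature column is constant):

```python
import numpy as np

def min_max_normalize(X):
    """Rescale each feature (column) linearly to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # assumes no constant columns, otherwise this divides by zero
    return (X - col_min) / (col_max - col_min)
```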

Run the function

Let's run the function on the wine data and see what predictions we get on the test data.

Sklearn

We have no idea whether the classified output from the implemented function above is correct.
We need to compare the classification output with that of the sklearn KNN classifier.

Let's create a classifier with sklearn and run it on the same data.
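
A minimal sketch of this comparison, assuming the full 13-feature Wine data, a default train/test split and k = 3 (the notebook's actual split and feature subset may differ; mykNN is the scratch function from above):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12345)

# sklearn's k-NN with the same k and metric as our own function
clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X_train, y_train)
y_sklearn = clf.predict(X_test)

# predictions from the scratch implementation
y_ours = mykNN(X_train, y_train, X_test, k=3)
print(np.array_equal(y_sklearn, y_ours))   # should match, barring tie-break differences
```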

Compare the data

Let's compare the output from the implemented function to that of the sklearn function.

Output vs sklearn

As we can see above, the outputs from the two functions are the same, so we can assume that the implementation is correct.

It should be noted that we ran the functions on the raw data; in other words, no preprocessing was done on the data beforehand.
Let's normalize the data and see if that makes any improvement.

It should also be noted that there are several ways to normalize data, so we'll use the sklearn library to make comparisons with the implemented function.

Normalized data

As we can see above, the results from the sklearn preprocessing library differ from those of the implemented normalize function.

I looked into the way normalization is done in that library and found the solution in the reference given above. The normalize function takes one of three options for its norm parameter: "l1", "l2" and "max".

I tested all three options on the data to see which one gives the best results, and it turns out that "l2" is the option to use.
Based on the above reference, "l2" divides each row by its Euclidean norm, computed as np.sqrt((X * X).sum(axis=1)).
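
As a quick sanity check of that formula (a minimal sketch; the toy matrix is my own):

```python
import numpy as np
from sklearn.preprocessing import normalize

X_demo = np.array([[3.0, 4.0],
                   [1.0, 2.0]])

# manual l2 row normalisation: divide each row by its Euclidean norm
norms = np.sqrt((X_demo * X_demo).sum(axis=1))
X_manual = X_demo / norms[:, np.newaxis]

# sklearn's normalize defaults to norm="l2", so the two should match
print(np.allclose(X_manual, normalize(X_demo, norm="l2")))  # True
```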

Let's continue and see if and how normalizing the data improves the classifier's accuracy.

Normalizing performance

Above we can clearly see that normalizing the data before running it through the KNN classifier has a benefit.
The accuracy has increased from 0.86 to 0.94.

Let's again make a comparison with the sklearn classifier and see if the results from the implemented function are the same.

Sklearn validation

Above we can see that the results correspond with the output from the sklearn classifier.

3. Classifier evaluation

In the cell below, implement your own classifier evaluation code. This should include some way of calculating confusion matrices, as well as common metrics like accuracy.

Write some additional code that lets you display the output of your confusion matrices in a useful and easy-to-read manner.

You might want to test your functions on some test data, and compare the results to the sklearn library versions.

Accuracy score

The accuracy score is based on the number of correctly predicted results in comparison to the actual target values.
We simply count the number of correct predictions and divide it by the total number of predictions.
This gives us a value between 0 and 1.
We could, if we wanted, multiply that number by 100 to get a percentage.
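
A minimal sketch of such a function, assuming numpy only (the name accuracy_score_scratch is my own, chosen to mirror the *_scratch naming used later):

```python
import numpy as np

def accuracy_score_scratch(y_true, y_pred):
    """Fraction of predictions that match the true labels, between 0 and 1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sum(y_true == y_pred) / len(y_true)
```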

Run the accuracy function

Let's run the accuracy function above on the data and see what we get.
We'll also run the sklearn accuracy_score function on the same data so that we can validate the results.

Accuracy conclusion

We can see from the results above that they are the same, thus we can safely assume that the implemented function is correct.

Confusion Matrix

Let's start by first using the confusion matrix from the sklearn library.
This way we can see what the results should be, and we can work backwards from there to implement a function with the same output.

Confusion matrix from scratch

In order to duplicate the confusion matrix above, we're going to create a 3 x 3 matrix and use the label values as the bin positions.
We'll count the label pairs and add them to the correct bin.
The true labels correspond to the rows of the matrix and the predicted labels correspond to the columns.
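
A minimal sketch of this construction (assuming the labels are the integers 0-2, as in the Wine dataset; the actual notebook implementation may differ):

```python
import numpy as np

def confusion_matrix_scratch(y_true, y_pred, n_labels=3):
    """3 x 3 matrix: rows are true labels, columns are predicted labels."""
    matrix = np.zeros((n_labels, n_labels), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[t, p] += 1   # the label values themselves are the bin positions
    return matrix
```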

We can see from the output above that the matrix returned from the function is correct.
Let's plot the matrix.

Precision score

Precision for Multi-Class Classification

In an imbalanced classification problem with more than two classes, precision is calculated as the sum of true positives across all classes divided by the sum of true positives and false positives across all classes.

$$\text{Precision} = \frac{\sum_{c \in C} \text{TruePositives}_c}{\sum_{c \in C} \left( \text{TruePositives}_c + \text{FalsePositives}_c \right)}$$

Example:

A model makes predictions and predicts 70 examples for the first minority class, where 50 are correct and 20 are incorrect.
It predicts 150 for the second class, with 99 correct and 51 incorrect. Precision can be calculated for this model as follows:

$$\text{Precision} = \frac{50 + 99}{(50 + 20) + (99 + 51)} = \frac{149}{220} \approx 0.677$$

This example is similar to the sklearn micro method.

Reference:

Machinelearningmastery.com

sklearn library

The sklearn library can be used with five different options for the average parameter.

If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

- 'binary': only report results for the class specified by pos_label (applicable only if the targets are binary).
- 'micro': calculate metrics globally by counting the total true positives, false negatives and false positives.
- 'macro': calculate metrics for each label and find their unweighted mean; this does not take label imbalance into account.
- 'weighted': calculate metrics for each label and find their average weighted by support (the number of true instances per label).
- 'samples': calculate metrics for each instance and find their average (only meaningful for multilabel classification).

Reference:

scikit-learn.org

Let's run the precision score on the data and see what the results are.

Above we can see that the micro option returns the same result as the accuracy score.
With the 'micro' option, the metric is calculated globally by counting the total true positives, false negatives and false positives.

Let's look at the macro option.

From the score above, it's difficult to see how that method arrives at its answer.

Let's use the None option and see if we can get more information by looking at the score per label.

Above we can see that the score is calculated per column, which corresponds to the predicted labels.
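
A minimal sketch of what a column-based precision_score_scratch might look like, assuming it operates on the confusion matrix from above (the notebook's actual implementation may differ):

```python
import numpy as np

def precision_score_scratch(matrix):
    """Per-class precision: each diagonal entry divided by its column total,
    i.e. true positives over everything predicted as that class."""
    matrix = np.asarray(matrix)
    return np.diag(matrix) / matrix.sum(axis=0)
```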

Let's compare the precision scores between the implemented function and the sklearn function to ensure the result is correct.

Recall score

sklearn library

This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data; the options are the same as for the precision score above.

Reference:

scikit-learn.org

Let's print out the recall score on the same data and see what we get.

The recall score seems to be very similar to the precision score, but the calculation is done over the true labels, as opposed to the predicted labels.
Let's print out the score again with the None option for the average parameter.

Above we can see that the calculation is based on the true labels, which correspond to the matrix rows.
Let's implement the function.

I'm going to use the precision_score_scratch function as a baseline and just make the required modifications: basically, change the calculation from column-based to row-based.
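
Under the same assumptions as the precision sketch above, the row-based change might look like this:

```python
import numpy as np

def recall_score_scratch(matrix):
    """Per-class recall: each diagonal entry divided by its row total,
    i.e. true positives over everything that truly belongs to that class."""
    matrix = np.asarray(matrix)
    return np.diag(matrix) / matrix.sum(axis=1)
```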

Let's compare the recall scores between the implemented function and the sklearn function to ensure the result is correct.

4. Nested Cross-validation using your implementation of KNN

In the cell below, develop your own code for performing 5-fold nested cross-validation along with your implementation of k-NN above. You must write your own code -- the scikit-learn module may only be used for verification purposes.

Your code for nested cross-validation should invoke your kNN function (see above). Your cross-validation function should be invoked similarly to:

accuracies_fold = myNestedCrossVal(X,y,5,list(range(1,11)),['euclidean','manhattan'],mySeed)

where X is your data matrix (containing all samples and features for each sample), y are your known output labels, 5 is the number of folds, list(range(1,11)) evaluates the neighbour parameter from 1 to 10, and ['euclidean','manhattan',...] evaluates the distances on the validation sets. mySeed is simply a random seed to enable us to replicate your results.

Notes:

Nested cross validation implementation

The data will be separated according to the number of splits/folds chosen: for instance, if 4 is chosen, the data will be separated into 4 folds with a 75% / 25% split, where 75% is used as the training set and 25% as the testing set.

We'll add the accuracy scores of the folds together and then divide that cumulative sum by the number of folds to give us the average accuracy score.
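
A minimal sketch of the overall structure, assuming the mykNN function from section 2 accepts a distance callable (the 80/20 inner validation split and the distance lookup table are my own assumptions; the real implementation may differ):

```python
import numpy as np

# distance functions looked up by the names passed in
DISTANCES = {
    "euclidean": lambda a, b: np.sqrt(np.sum((a - b) ** 2)),
    "manhattan": lambda a, b: np.sum(np.abs(a - b)),
}

def myNestedCrossVal(X, y, n_folds, k_values, distance_names, seed):
    """Outer loop estimates test accuracy; the inner loop picks k and the
    distance metric on a validation split carved out of the training folds."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)

    accuracies_fold = []
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])

        # inner split: hold out 20% of the outer training data for validation
        cut = int(0.8 * len(train_idx))
        inner_train, inner_val = train_idx[:cut], train_idx[cut:]

        best_acc, best_k, best_dist = -1.0, None, None
        for name in distance_names:
            for k in k_values:
                y_val = mykNN(X[inner_train], y[inner_train], X[inner_val],
                              k=k, distance=DISTANCES[name])
                acc = np.mean(y_val == y[inner_val])
                if acc > best_acc:
                    best_acc, best_k, best_dist = acc, k, name

        # retrain on the full outer training set with the chosen parameters
        y_test = mykNN(X[train_idx], y[train_idx], X[test_idx],
                       k=best_k, distance=DISTANCES[best_dist])
        accuracies_fold.append(np.mean(y_test == y[test_idx]))
    return accuracies_fold
```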

Run the N-Fold function on the wine data

Below we'll run the function on the wine data and target values, using 5 as the number of folds/splits.
This results in 5 folds, each split into 80% training data and 20% testing data.

You'll note in the code above that quite a few lines are commented out.
I ran a number of tests to see which methods produce the best results.
I found that using the sklearn normalizing function with the "l2" option produced the best results, and that's the one I ended up using.

Now that we've called the function above and have all our dataframes, we simply need to print them out to get detailed information on each fold and k value.
The dataframes are constructed in such a way that the columns represent the folds and the rows represent the k values.
The last column of the dataframe, "avg", is the average value across the folds.

N-Fold function with noise induced

Below we're going to add noise to the wine dataset before running it through the nested kNN function.
The same code as provided earlier will be used to generate the noise.

Results

As we can see above, the accuracy deteriorated quite significantly.
With the clean wine data we found the best accuracy to be 0.848889, achieved with k = 1, whereas with the noise-induced data the best accuracy was 0.623968, achieved with k = 4.

5. Summary of results

Using your results from above, fill out the following table using the clean data:

| Fold | Accuracy | k | Distance |
|------|----------|---|----------|
| 1 | 0.833333 | 1, 2, 5, 9 & 10 | ? |
| 2 | 0.861111 | 1-4 | ? |
| 3 | 0.750000 | 1, 2 & 4 | ? |
| 4 | 0.885714 | 4 | ? |
| 5 | 0.942857 | 1 & 2 | ? |
| total | 0.854603 $\pm$ 0.06355032244 | | |

Where total is given as an average over all the folds, and $\pm$ the standard deviation.

Now fill out the following table using the noisy data:

| Fold | Accuracy | k | Distance |
|------|----------|---|----------|
| 1 | 0.555556 | 1 & 2 | ? |
| 2 | 0.722222 | 3, 4 & 6 | ? |
| 3 | 0.583333 | 1, 2, 6, 7 & 8 | ? |
| 4 | 0.742857 | 4 | ? |
| 5 | 0.600000 | 3 | ? |
| total | 0.6407936 $\pm$ 0.07652294918 | | |

I'm not exactly sure what the expectation is for the completion of the tables.
Since we have 5 fold accuracies for every k, that could mean there are several tables to be completed, not just one.

Since only one table is requested for the clean data and one for the noisy data, I'll continue on the basis that the best score for each fold should be recorded, along with the corresponding number of nearest neighbours, k. There are instances where multiple values of k resulted in the same highest score; all of them are recorded in the table under the column k.

The tables above were completed based on this assumption.

5.2. Confusion matrix summary

Summarise the overall results of your nested cross validation evaluation of your K-NN algorithm using two summary confusion matrices (one for the noisy data, one for the clean data). You might want to adapt your myNestedCrossVal code above to also return a list of confusion matrices.

Use or adapt your evaluation code above to print the two confusion matrices below. Make sure you label the matrix rows and columns. You might also want to show class-relative precision and recall.

6. More questions

Now answer the following questions as fully as you can. The answers should be based on your implementation above. Write your answers in the Markdown cells below each question.

Q3. Influence of noise

Do the best parameters change when noise is added to the data? Can you say that one parameter choice is better regardless of the data used?

Answer:

As we can see from the dataframes printed under section 4, there is a significant difference in the performance of the classifier. We could already see in section 1, from the scatter and distribution plots, how the data were influenced by the noise, with a visible increase in the overlap between the classes.

From the printed data the effect became clear: the best parameters do change when noise is added (the best accuracy on the clean data was obtained with k = 1, whereas on the noisy data it was obtained with k = 4), so no single parameter choice can be said to be better regardless of the data used.

Accuracy is the metric I prefer to use, and it is also the one that seems to have been affected the most by the noisy data.

Q4. Tie break

Assume that you have selected the number of neighbours to be an even number, e.g., 2. For one of the neighbours, the suggested class is 1, and for the other neighbour the suggested class is 2. How would you break the tie? Write example pseudocode that does this.

Answer:

There are a couple of options to choose from for resolving a tie, for example:

- choosing one of the tied classes at random;
- weighting the neighbours by their distance to the query point;
- reducing k (dropping the furthest neighbour) until the tie is broken.

Choosing a random one would be the easiest, but since that would not take the distances into consideration, I feel it's not the best solution.
If pnt1 is 1 unit away and pnt2 is 2 units away from the point to be classified, pnt3, should pnt1 not have a greater weight than pnt2, since it's closer to pnt3?

In step 2, the nearest_neighbors function, we're already sorting the dataframe by distance and then choosing the closest ones based on the number specified.

If it turns out to be a tie, simply ignore the least-weighted (furthest) point; in the case of the sorted neighbour list, you would change the voting step along the following lines:
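
(A sketch only; majority_vote_with_tiebreak is my own name, and it assumes the neighbour labels arrive sorted nearest-first, as produced by the sorted dataframe.)

```python
import numpy as np

def majority_vote_with_tiebreak(sorted_labels):
    """sorted_labels: neighbour labels ordered nearest-first.
    On a tied vote, drop the furthest neighbour and vote again."""
    labels = list(sorted_labels)
    while labels:
        values, counts = np.unique(labels, return_counts=True)
        winners = values[counts == counts.max()]
        if len(winners) == 1:
            return winners[0]   # unique majority found
        labels.pop()            # discard the least-weighted (furthest) point
    return None
```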

This method is essentially the same as changing k, but based on the weight of the points.

Q5. Beyond Wine

If you were to run your k-NN algorithm on a new dataset (e.g., the breast cancer dataset, or Iris), what would you need to take into consideration? Outline any changes that might be needed to your code.

Answer:

From an analytical point of view, you would need to take into consideration which features you want to use.
On the Iris dataset that would not be too difficult, since you only have four features, so you'd most likely use all of them; but with the breast cancer dataset you have around 30 features, with many possible combinations, all of which would give different results and accuracy.
Thus, the data would have to be analysed, and you'd need to decide which features are best suited to the desired outcome.

As far as the code is concerned, no changes are required. I have tested the same code on the Iris dataset and it worked as expected. The only concern would be the implemented min_max_normalize function: during testing I got radically different results when using this function in comparison with the sklearn normalize function.

Another place where changes might be required is the implemented confusion matrix, precision and recall functions.
These functions are currently written to produce a 3 x 3 matrix, for datasets with 3 labels.
If you use a dataset with more than 3 labels, modifications or optimisations to these functions would have to be made.
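
For example, a label-agnostic version of the confusion matrix might look like this sketch (my own generalisation, not the notebook's code; the per-class precision and recall functions above would then work unchanged on the resulting n x n matrix):

```python
import numpy as np

def confusion_matrix_scratch(y_true, y_pred, labels=None):
    """n x n confusion matrix for an arbitrary label set, not just 3 classes."""
    if labels is None:
        labels = np.unique(np.concatenate([y_true, y_pred]))
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        matrix[index[t], index[p]] += 1   # rows: true labels, cols: predicted
    return matrix
```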