Homework 7

Due: Friday, May 10 at 11:59pm

Skeleton Code: hw7.zip

Instructions

Please read the instructions carefully. An incorrectly formatted submission may not be counted.

In this homework, we will start from some skeleton code. You are given the mlmodel module. We will implement part 4 of this homework on top of the skeleton code, just as in homework 6. To submit, place these modules inside a folder called uni-hw7 (for for me this would be tkp2108-hw7). Then compress this folder into a .zip file. For most operating systems, you should be able to right click and have a “compress” option in the context menu. This should create a file called tkp2108-hw7.zip (with your uni). Submit this on courseworks.

Part 1 - Knowledge Check

Answer the following questions in a file part1.txt:

What is Vectorization? Provide a definition and provide an example of vectorization from class or the homework.
What is recursion? What is required in order to write a recursive algorithm? Give an example of a recursive function from class or the homework.
Provide a brief (1 sentence) explanation of each of the following terms. Then provide an example of the term from the homeworks.

Object Oriented Programming
Constructor
Inheritance
Polymorphism
Encapsulation
Supervised Learning

Part 2 - Pandas

You will find a CSV file with generated data here. Using this file, implement the following python functions with Pandas:

def readData(filename):
    """return a dataframe with index 'Row ID'"""

def filterBy(df, quantity=None, discount=None, profit=None):
    """given the input dataframe, return just the rows where Quantity column is greater than the input variable quantity, and similarly for discount and profit.

    If the value passed in is None, do not filter with it.
    """

def stateSegmentHeatmap(df)
    """Draw a heatmap with seaborn where:
        - states as row pivots
        - segments as column pivots
        - color by profit

    See the example below"""

Part 3 - OOP

Implement a new class RandomList that inherits from Python’s standard list, with the following features:

Implement a constructor that takes a single argument
- if its a list, create a RandomList from that data shuffled
- if its an integer, create a RandomList from that number of random integers between 0 and 100
- else, raise an exception
Overload the __str__ and __repr__ methods to exhibit the behavior below
Overload the __add__ operator to exhibit the behavior below
- if adding to a regular python list, return a list with the two sets of data
- if adding to a RandomList, return a new RandomList with the concatenated data shuffled
- else, raise an exception
add a new method shuffle which randomly shuffle’s the RandomList’s data

Here is an example IPython session with with a sample run:

In [1]: rl1 = RandomList(10)  # creates a RandomList with 10 random integers between 0 and 100
In [2]: rl2 = RandomList([1, 2, 3])  # creates a RandomList from this list, shuffling the contents
In [3]: rl3 = RandomList("test")  # throws error
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
----> 1 rl3 = RandomList("test")  # throws error

final.py in __init__(self, data)
---> 18             raise NotImplementedError


In [4]: print(rl1)
random([28, 31, 22, 63, 45, 8, 28, 9, 31, 96])

In [5]: print(rl2)
random([2, 1, 3])

In [6]: lst = [rl1, rl2]

In [7]: print(lst)
[<RandomList [28, 31, 22, 63, 45, 8, 28, 9, 31, 96]>, <RandomList [2, 1, 3]>]

In [8]: print(rl1)
random([28, 31, 22, 63, 45, 8, 28, 9, 31, 96])

In [9]: rl1.shuffle()

In [10]: print(rl1)
random([63, 9, 96, 28, 8, 22, 31, 45, 28, 31])

In [11]: print(rl1 + [1, 2, 3])  # make new `list` from rl1's data and the other list
[63, 9, 96, 28, 8, 22, 31, 45, 28, 31, 1, 2, 3]

In [12]: print(rl1 + rl2)  # make a new RandomList frp, rl1'1 data and the other list, shuffling
random([22, 96, 3, 28, 31, 1, 45, 8, 31, 9, 2, 28, 63])

In [13]: print(rl1 + "abc")  # throws error
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-14-4481c54f83fc> in <module>
----> 1 print(rl1 + "abc")  # throws error

final.py in __add__(self, other)
---> 71         raise NotImplementedError

Part 4 - Modeling - KNN and Scikit Learn

For this part, we will build on the code we started in hw6 and implement 3 predictive models for solving our supervised classification problem.

As a high level overview, we will do a few things:

Split out the first column of our training and testing data as “training labels” and “test labels”
Train our hand written KNN, the SKLearn KNN, and the SKLearn SVM with our training data and training labels
Use the models to predict the testing data, generating a “predicted labels” numpy array
Compare our “predicted labels” to our “test labels” to calculate the accuracy

This sounds complicated, but in code its pretty straightforward:

# split dataset into training and testing data, with separate labels
# make sure data is randomly shuffled beforehand!!
train, train_labels, test, test_labels = splitDataset(df, test_percentage=20)

# Print some info about test/train split
print("Test dataset has {} entries".format(len(test)))
print("Train dataset has {} entries".format(len(train)))

# run our knn classifier
nn_accuracy = NNClassifier(train, test, train_labels, test_labels, k)

# run the knn classifier from scikit-learn
knn_accuracy = SKLearnKnnClassifier(train, test, train_labels, test_labels, k)

# run the svm classifier from scikit-learn
svm_accuracy = SKLearnSVMClassifier(train, test, train_labels, test_labels)

# print the accuracies
print("Accuracies:\n\tKNN (1006):{:.1%}\n\tKNN (sklearn):{:.1%}\n\tSVM (sklearn):{:.1%}".format(nn_accuracy, knn_accuracy, svm_accuracy))

This should lead to some output like:

Accuracies:
	KNN (1006):91.7%
	KNN (sklearn):91.7%
	SVM (sklearn):89.3%

More concretely, we will be implementing 3 main functions:

Scikit Learn Classifiers

Recall from class the appeal of scikit-learn: Polymorphism! That means that all scikit-learn classifiers have the same shape:

model.fit(training_data, training_labels)

predicted_labels = model.predict(testing_data)

So both these functions should be super straightforward to implement

def SKLearnKnnClassifier(training, testing, training_labels, testing_labels, k):
    # instantiate model
    knn = KNeighborsClassifier(n_neighbors=k)

    # fit model to training data
    # predict test labels
    # return % where prediction matched actual

def SKLearnSVMClassifier(training, testing, training_labels, testing_labels):
    # instantiate model
    svm = SVC(kernel="rbf")

    # fit model to training data
    # predict test labels
    # return % where prediction matched actual

NNClassifier

This function is our handwritten K Nearest Neighbors algorithm. It sounds a littel daunting, but in reality its pretty straightforward. My skeleton code breaks it apart into two separate steps:

Go through each unknown datapoint and call the knn function to get the predicted label
For a given data point, calculate the predicted label

Let’s start with the first function:

def NNClassifier(training, testing, training_labels, testing_labels, k):
    # preallocate labels
    predicted_labels = np.zeros(len(testing_labels)).astype(str)

    # run knn on each point and assign its label
        # fill in the labels with knn

    # return % where prediction matched actual
    return 0.0

For each row in testing, we call knn to get back a label ("B" or "M") and we put that in a preallocated vector predicted_labels. Then at the end, we compare predicted_labels to testing_labels to generate our accuracy. When we’re testing NNClassifier, we can have knn just hardcoded to return “B” or “M” until we implement it.

def knn(data, data_labels, vector, k):
    return "B"

For the knn function, we have to do a little work. Given known datapoints data with labels data_labels, and unknown data point vector and an input parameter k:

Calculate the distance from vector to each known data point in data
Get the smallest k
Get the labels in data_labels corresponding to those k
Take a vote (statistically speaking this is the “mode”)

def knn(data, data_labels, vector, k):
    # preallocate distance array
    distances = np.zeros(len(data_labels))

    # calculate distances

    # set labels

    # take vote amongs top labels

Here is one possible solution (but you’re welcome to come up with your own!)

To calculate euclidean distances, we can use exactly the same strategy as from our nbody example (just in 30 dimensions). For each data point in data, calculate the distance between it and vector and put the result into the preallocated array distances.

To find the closest k, we just want to sort from minimum to maximum. But be careful! If we try to sort our distances array, we’ll lose the order, and we need the order because each position in distances corresponds to a point in data and a label in data_labels. So if we sort distances, the element at position 5 might move to position 3, but the label in data_labels will still be at position 5! So we need to be careful not to reorder our data.

Instead, we can take advantage of some clever numpy slicing. Recall the following code:

x = np.array([5, 2, 1, 4, 0])

# we can slice with an array of positions to select

x[ [0, 1, 2] ]  # slice the 0th position, 1st position, and 2nd position element

# result: array([5, 2, 1])

So if we could calculate the order of the elements without rearranging them:

x = np.array([3, 5, 3, 2, 2, 1, 4, 4, 0])

order = some_function(x)

# order == [8, 5, 3, 4, 0, 2, 6, 7, 1]

# because the 8th element of x (0) would be 1st in order if sorted
#         the 5th element of x (1) would be 2nd in order if sorted
#         the 3rd element of x (2) would be 3rd in order if sorted
#         the 4th element of x (2) would be 4th in order if sorted

Then slicing the data would put it in order:

x = np.array([3, 5, 3, 2, 2, 1, 4, 4, 0])

order = some_function(x)

x[order]

# result: [0, 1, 2, 2, 3, 3, 4, 4, 5]

But with our code, if we can calculate the order of the distances, then we can slice the labels with the same order:

distances = np.array([0, 3, 2, 0, 1])
labels = np.array(["B", "M", "M", "B", "B"])

order = some_function(distances)  # [0, 3, 4, 2, 1]
labels_in_order = labels[order] # ['B', 'B', 'B', 'M', 'M']

And to look at the k nearest labels, we just look at the first k labels of labels_in_order!

k = 3
first_k_labels = labels_in_order[:k] # ["B", "B", "B"]
predicted_label = mode(first_k_labels)  # mode(["B", "B", "B"]) which is "B"