- May 15, 2020

Assignment 3: kNN and SVM UVA CS 6316 : Machine Learning (Fall 2019) Out: Oct. 10 2019 Due: Oct. 22, Sun midnight 11:59pm, @ Collab a The assignment should be submitted in the PDF format through Collab. If you prefer hand-writing QA parts of answers, please convert them (e.g., by scanning or using an app) into PDF form. b For questions and clarifications, please post on piazza. c Policy on collaboration: Homework should be done individually: each student must hand in their own answers. It is acceptable, however, for students to collaborate in figuring out answers and helping each other solve the problems. We will be assuming that, with the honor code, you will be taking the responsibility to make sure you personally understand the solution to any work arising from such collaboration. d Policy on late homework: Homework is worth full credit at the midnight on the due date. Each student has three extension days to be used at his or her own discretion throughout the entire course. Your grade would be discounted by 15% per day when you use these 3 late days. You could use the 3 days in whatever combination you like. For example, all 3 days on 1 assignment (for a maximum grade of 55%) or 1 each day over 3 assignments (for a maximum grade of 85% on each). After you’ve used all 3 days, you cannot get credit for anything turned in late. e Policy on grading: 1: 50 points in total. 20 points for code submission (and able to run). 30 points for displaying results in the report. 2: 40 points in total. 20 points for code submission (and able to run). 10 point for result submission. 10 points for the result table in your report. 3: 10 points in total. 5 points for answering each question. 4: 10 extra points if you wins the competition! (Get Top 20 accuracy in leaderboard) The overall grade will be divided by 10 and inserted into the grade book. Therefore, you can earn 11 out of 10. 1 KNN and Model Selection (k) (programming) In this problem, we will implement k-Nearest Neighbor classifier and select k using cross validation. We have provided dataset file “Movie Review Data.txt”. Please use the “knn template.py” template to guide your work. Please follow the instructions and the function names/descriptions in the template. Feel free to cross-check your implementation against sci-kit’s KNN. Other requirements or recommendations are the same as HW1 and HW2. This problem covers two main tasks: • (1): To implement a very simple classifier, k-nearest neighbors (kNN), from scratch. • (2): To implement a 4-folds Cross Validation strategy for selecting the best k for kNN classification. Feel free to reuse the cross-validation code in the previous HWs. 1 Detailed Instructions: 1. We have provided dataset file “Movie Review Data.txt”. The template also provides a x, y = read file(file) function to load the dataset. 2. You are required to fill-in the function findBestK(x, y, klist, nfolds) following the guidance in the template. You are given a list of k ∈ {3, 5, 7, 9, 11, 13}. The goal is to find the best k using nfolds = 4 fold cross validation. 3. The first step is implementing the cross validation method that splits data into folds: x train, y train, x test, y test = fold(x, y, i, nfolds) .i indicates the fold number (iterated from 0 to nfolds − 1). This function returns the ith train and test fold from x and y. Feel free to try other folds for cross validation, kfold=3-fold or kfold=10-fold. 4. After getting a train and test fold, the next step is implementing kNN classification method using the function: y predict = classify(x train, y train, x test, k) Where k is the number of data points that kNN takes into consideration when labeling an unlabeled data point. Note here the training data is part of the testing (classify) algorithm. In kNN, we label a new data point based on the k points that are the closest to it in our train dataset. In this function, for each testing point, find its k nearest neighbors in the training set (those having the smallest Euclidean distance to the testing point). As reviewed in class, kNN requires a distance metric to compute distance between samples. We will use Euclidean distance as the measurement of distance in kNN. Then we label the testing point according to the majority voting rule. The predictions y predict is a list of 0 or 1 predictions obtained from the classify method. 5. Once predictions are obtained, the next step is to evaluate the prediction. For this, implement the calc accuracy method: acc = calc accuracy(y predict, y) Where y predict is the list of 0 or 1 predictions obtained from the classify method and y is the true label for the testing points (the last column of the test data). Repeat this process for all folds, to get cross validation accuracy for each k. (Hint1: If your accuracy is below 50%, please consider how the order of the samples are dictated by the class.) 6. Task 1: Report the cross validation accuracy for all values of k as well as the best k. Discuss why some values work better than others. 7. Task 2: Implement function barplot(klist, accuracy list). A bar graph is recommended to show how accuracy varies with k. Use k as x-axis and accuracy as y-axis. Put the bar plot in your report. You can use matplotlib for this question. Att: we will not consider the computational speed in grading your kNN code. 2 2 Support Vector Machines with Scikit-Learn and preprocessing This question covers two main tasks: • (1) For this assignment, you will implement an SVM classifier. Given a proper set of attributes, your program will be able to determine whether an individual makes more than 50,000 USD/year. • (2) You will use scikit-learn’s C-Support Vector Classifier: sklearn.svm.SVC. Install the latest stable version of scikit-learn following directions available at http://scikit-learn.org/stable/install.html Dataset Details: Two dataset files are provided, labeled sample set ”salary.labeled.csv” and unlabeled sample set ”salary.2Predict.csv”. Make sure to download them from Collab! The ”salary.labeled.csv” is the total training data available to you. The unlabeled sample set ”salary.2Predict.csv” is a text file in the same format as the labeled dataset “salary.labeled.csv”, except that its last column includes a fake field for class labels. The details of the column names and feature properties are provided in the svm template.py. You are required to generate/ predict labels for samples in ”salary.2Predict.csv” (including about 10k samples). Note: We will evaluate your output ‘predictions’ – an array of strings (“>50K” or “<=50K”) according to the true labels of these test samples in ”salary.2Predict.csv” (ATT: you don’t have these labels !!! ). This simulates a Kaggle-competition in which test labels are always held out and only team-ranking will be released after all teams have submitted their predictions. When grading this assignment, we will rank all students’ predictions. So please try to submit the best performing model that you can! Best groups in the tournament will receive bonus credit. Detailed Steps: 1. The first step is loading and preprocessing the data: • The dataset contains both categorical and continuous features(See sample data in Table 1). First, implement load data function in svm template.py. This function takes as input filename training csv and processes into numerical vector representation. For example, to start, you can try normalization on continuous features and one-hot encoding on categorical data. • For preprocessing, use techniques mentioned in class! • Be creative. A good data preprocessing can help you win the tournament. 2. The next step is to choose a good hyperparameter for SVM classifier via cross validation. You are required to write a wrapper code for cross-validation as part of train and select model(training csv), and iterate through different hyperparameters (just like we did in HW1/2). • The different hyperparameters you can choose include SVM kernels (basic linear kernel / polyno- mial kernel / RBF kernel), C value, .... A few examples are provided in the svm template.py. • Call sklearn.svm.SVC with your parameter to get a classification! • You need to include the CV classification accuracy results in your pdf report by performing 3-fold cross validation (CV) on the labeled set ”salary.labeled.csv” (including about 38k samples) using at least three different SVM kernels you pick. Please provide details about the kernels you have tried and their performance (e.g. 3CV classification accuracy ) including results on both CV train accuracy and CV test accuracy into the writing. For instance, you can summarize the results into a table with each row containing kernel choice, kernel parameter, CV train accuracy and CV test accuracy. 3. Next, also as part of train and select model(training csv), select the best hyperparameter, train a model using the selected hyperparameter and also return the corresponding cross validation score as well as the trained SVC model. 4. Finally, use the trained model to make predictions on the test set using predict(“salary.2Predict.csv”, trained model) and save the output. The svm template.py provides output results(predictions) to save the predictions in the desired format. Remember not to shuffle your prediction outputs! 3 Submission Instructions: You are required to submit the following : 1. A python program svm.py (filled in svm template.py): It should be able to train and select a model using a set of hyperparameters on the training data, these hyperparameters should be hard coded. 2. We should be able to use the learned/ trained model to classify samples in an unlabeled test set using the following function: p r e d i c t i o n s = p r e d i c t ( ‘ ‘ s a l a r y . 2 Pred i c t . csv ” , t ra ined mode l ) 3. A file “predictions.txt” generated by: o u t p u t r e s u l t s ( p r e d i c t i o n s ) Please do not archive the file or change the file name for the automated grading. 4. In your PDF submission, • A table reporting classification accuracy (score) averaged over the test folds, along with details of the best performing kernel, C etc. • You need to include the CV classification accuracy results in your pdf report. Please provide details about the hyperparameters you have tried and their performance (e.g. 3CV classification accuracy ) including results on both train and test folds into the writing. For instance, you can summarize the results into a table with each row containing kernel choice, kernel parameter, CV train accuracy and CV test accuracy. • Please also discuss in the written submission about how you preprocessed the data, and how you chose your SVM hyperparameters. Classes: >50K, <=50K. Attributes: age: continuous. workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. fnlwgt: continuous. education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. education-num: continuous. marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. sex: Female, Male. capital-gain: continuous. capital-loss: continuous. hours-per-week: continuous. native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. Table 1: About the data in SVM coding. 3 Sample QA Questions: Question 1. Support Vector Machine Soft-margin linear SVM can be formulated as the following constrained quadratic optimization problem: argmin{w,b} 12w Tw + C ∑m i=1 i such that yi(w Txi + b) ≥ 1− i i ≥ 0 ∀i where C is the regularization parameter, which balances the margin (smaller wTw) and the penalty of mis-classification (smaller ∑m i=1 i). 4 (a) (True/False) Number of support vectors do not depend on the selected value of C. Please provide a one-sentence justification. In the following figure, nC=0 is the number of support vectors for C getting close to 0 (limC→0) and nC=∞ is the number of support vectors for C gets close to ∞. (b) Select the correct option:( ) (1) nC=0 > nC=∞ (2) nC=0 < nC=∞ (3) nC=0 = nC=∞ (4) Not enough information provided (c) Match the following list of kernels used for SVM classification with the following figures: (1) Linear Kernel : K(x, x′) = xTx′ (2) Polynomial Kernel (order = 2) : K(x, x′) = (1 + xTx′)2 (3) Radial Basis Kernel : K(x, x′) = exp(− 12 ||x− x′||2) 5 6