- May 15, 2020

Assignment 2: More Regressions
UVA CS 6316: Machine Learning (Fall 2018)
Out: W4
Due: 1001 Tue midnight 11:59pm @ Collab

a) The assignment should be submitted in PDF format through Collab. If you prefer to hand-write the QA parts of your answers, please convert them (e.g., by scanning or using an app like Genius Scan) into PDF form.
b) For questions and clarifications, please post on Piazza.
c) Policy on collaboration: Homework should be done individually: each student must hand in their own answers. It is acceptable, however, for students to collaborate in figuring out answers and helping each other solve the problems. We will be assuming that, with the honor code, you will be taking the responsibility to make sure you personally understand the solution to any work arising from such collaboration.
d) Policy on late homework: Homework is worth full credit at midnight on the due date. Each student has three extension days to be used at his or her own discretion throughout the entire course. Your grade will be discounted by 15% per day when you use these 3 late days. You can use the 3 days in whatever combination you like. For example, all 3 days on 1 assignment (for a maximum grade of 55%) or 1 day on each of 3 assignments (for a maximum grade of 85% on each). After you’ve used all 3 days, you cannot get credit for anything turned in late.
e) Policy on grading:
1: 45 points in total. 20 points for code submission (and able to run). 10 points for each question answered in your report.
2: 45 points in total. 5 points for answering the primer question. 20 points for code submission (and able to run). 10 points for displaying figures. 10 points for answering questions in your report.
3: 10 points in total. 5 points for each question.
4: 10 points of extra credit. See question 2.
The overall grade will be divided by 10 and inserted into the grade book.
Therefore, you can earn 11 out of 10.
Please provide proper steps to show how you get the answers.

1 Polynomial Regression (Programming)

There are many datasets where a standard linear regression model is not sufficient to fit the data (i.e., the data is not linear). Thus, we need a higher-order model to better fit the data. Recall from the class lectures that the problem of learning the parameters is still linear although we are learning a nonlinear function. There are TWO portions to this section.

First, you must generate 300 samples using “DataGenerationPoly.py”. Then you must fill out “PolyRegressionTemplate.py” and turn it in as “PolyRegression.py”.

Second, you must complete a written portion and turn it in as part of the PDF with the rest of the assignment. You must explain (and include) the three plots (see details below) and whether the outputted theta makes sense given the true underlying distribution.

1.1 Data Generation

• First, we will generate our data. In machine learning, we assume that there is an underlying data distribution. There should be some hidden function that maps the inputs to the outputs we are trying to predict, plus some noise. The noise comes from not having all the relevant information and from some of the inputs being wrong. We try to figure out what the hidden function is so that we can make good guesses of the output given the input.
• We have provided “DataGenerationPoly.py”, which contains the underlying data distribution that we are trying to find. We get samples by running the script with a command line argument specifying the number of samples we want. The data is saved as “dataPoly.txt”. We will use this data for the next steps.

1.2 Polynomial Regression Model Fitting

• ATT: Please re-use the gradient-descent regression solver you implemented in HW1 in the function: solve_regression(x, y).
If you did not have a good implementation, please contact [email protected]collab.its.virginia.edu.
• Task 1 (hyperparameter tuning): We will plot polynomial order versus training and validation loss. We will use all 300 samples for this task. For this plot, use 60% of the total data as training data and 20% as validation data. Leave the last 20% for testing later. As discussed in class, validation loss is used to tune hyperparameters because training loss is minimized by the highest-variance model. Additionally, if you use test loss to tune hyperparameters, then test loss is no longer a good estimate of the true error of your model, which is useful to know after producing the model.
Refer to get_loss_per_poly_order(x, y, degrees) in the template. This function should explore multiple different orders of polynomial regression, like d ∈ {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}, as discussed in class. In this plot, the x-axis shows the value of d and the y-axis shows the MSE loss on the training set and the MSE loss on the validation set (as two different curves in the same plot).
• Task 2: In Task 2, we will use the best hyperparameter d from Task 1 to train our final model. We use both the validation and training data to train our model. This is because, now that we have tuned our hyperparameter, we no longer need the validation set, but the extra training data is useful. Then we use the test error as a good estimate of the true error. Similar to HW1, you will plot the best-fit curve: show the data samples and draw the learned best-fit curve on the same graph. Please include the best-fit curve, the final test MSE and the best θ as part of your write-up. Also discuss how you got the best θ.
Now, please take a look at the data generation script and note in your write-up whether the learnt θ makes sense in relation to the true underlying distribution.
Att: When you draw the best-fit curve, please do not directly use the x from the training set or validation set. Instead, you can (1) sort the x and y values by x when plotting the curve; OR (2) use the function linspace(start, end, NumSamples) from numpy to get a set of uniformly spaced x values for plotting the best-fit curve nicely.
• Task 3: Similar to the epoch vs. training loss figure we required in HW1, please now generate a figure including two curves: 1. the GD epoch vs. training loss, and 2. the GD epoch vs. loss on the testing data. This is a great figure for visually checking how GD reduces the training loss, and how the gap between the training loss and the test loss evolves along the gradient descent optimization path.
• Task 4: Last, you are required to plot training and testing loss versus dataset size. In real life, collecting more data can be expensive. This plot can help tell whether collecting more data would even help the model increase accuracy. Additionally, this graph indicates whether your model has overfit or underfit.
Refer to get_loss_per_num_examples(x, y, example_num, train_proportion) in the coding template. Polynomial regression with a degree of 8 will be used for this question. The code should generate a figure with the x-axis representing the value of n, the number of examples used for the model, and the y-axis showing the training MSE loss and the test MSE loss (as two different curves). For this plot, you will vary n over {10, 20, 30, 40, 50, 60, 70, 80, 90, 100}. example_num is a list of n values. train_proportion is the proportion of n to be used for training, with the rest used for testing. For this question, you will use train_proportion = 0.5. In a ‘real’ setting, we would use more of the data for training.
Please include the plot in the written submission.

We should be able to run “python3 PolyRegression.py” and it should work!

2 Ridge Regression (programming and QA)

There are THREE portions to this section.

First, you have questions to answer that act as a primer for the coding question. Include the answers in the written portion.

Second, you must generate 200 samples using “DataGenerationRidge.py”. Then you must fill out “RidgeRegressionTemplate.py” and turn it in as “RidgeRegression.py”.

Third, you must explain the plot, the L2 norms, and the testing loss, i.e., why some parts are relatively high and other parts low.

Extra Credit: code up gradient descent for ridge regression and show that it matches the modified normal equation for the first 5 values of beta.

2.1 QA

• Here we assume $X_{n \times p}$ represents a data sample matrix which has p features and n samples. $Y_{n \times 1}$ contains the target variable’s values for the n samples. We use β to represent the coefficients. (Just a different notation. We had used θ for the coefficients before.)
• 1.1 Please provide the math derivation procedure for ridge regression (shown in Figure 1).
Figure 1: Ridge Regression / Solution Derivation / 1.1
(Hint1: provide a procedure similar to how linear regression gets the normal equation through minimizing its loss function.)
(Hint2: $\lambda \|\beta\|^2 = \lambda \beta^T \beta = \lambda \beta^T I \beta = \beta^T (\lambda I) \beta$)
(Hint3: Linear Algebra Handout Page 24, first two equations after the line “To recap,”)
• 1.2 Suppose $X = \begin{bmatrix} 1 & 2 \\ 3 & 6 \\ 5 & 10 \end{bmatrix}$ and $Y = [1, 2, 3]^T$. Could this problem be solved through linear regression? Please provide your reasons.
(Hint: just use the normal equation to explain the reason)
• 1.3 If you have the prior knowledge that the coefficient β should be sparse, which regularized linear regression method should you choose? (Hint: sparse vector)

2.2 Programming

• Similar to the previous section, we will first generate data using the provided script “DataGenerationRidge.py”. We generate 200 samples in this part. The data is saved as “dataRidge.txt”.
We will use this data for the next steps.
• Task 1: We will use cross-validation for this question. You are required to plot training loss and validation loss (as two curves in the same plot) as a function of the hyperparameter λ. In this plot, the x-axis is λ and the y-axis is the training loss and validation loss. You will use 4-fold cross-validation for this question. Refer to the function cross_validation(x_train, y_train, lambdas) in the template. Here, the validation loss is the average validation loss across the 4 folds during cross-validation. You are also required to reimplement the normal equation for ridge regression. The values of λ are specified in the template. Please include some discussion of your observations from the plot, i.e., discuss the trends as λ increases or decreases.
• Task 2: Please write down the best λ from the previous step in the written submission. Finally, your code should print out the L2 norms and the test loss of the learnt $\beta_\lambda$ parameters for the best λ as well as for the other values of λ (specified in the template). Please include these values as part of the written submission. Also, include your observations about the norms and the test loss, discussing general trends of the values.
• Task 3: In this part, you are required to submit a bar graph showing the learnt values of the β vector for the best λ. If β is a p × 1 vector, then in this plot the x-axis is i, where i ∈ {1, . . . , p}, and the y-axis denotes $\beta_i$. Now take a look at the data generation file and discuss whether the learnt β makes sense in relation to the true underlying distribution.

Explanation of Ridge Regression:

• As you can see from the data generation file, most of the features are useless. The output depends on the bias (via the pseudofeature) and x1. The rest of the xs are noise that has no influence on y.
• However, plain linear regression will use those values to predict the output exactly.
In other words, the model learns the noise.
• Ridge regression penalizes the L2 vector norm of beta. Shrinking the weights of unimportant features degrades the model’s predictions less than shrinking the weights of important features, so the penalty mainly shrinks the unimportant weights. As a result, the model learns less noise.

Explanation of k-fold cross validation:

• If you do not have a lot of data, then using a training set, validation set, and testing set doesn’t work very well. For example, if your validation set is too small, you won’t be able to tune hyperparameters well because the validation loss will be too noisy.
• One solution is k-fold cross validation. It is more complicated (so it takes longer to code) and runs slower than using training, validation, and testing sets. However, it allows you to tune hyperparameters better.
• We take the training set and split it into “k” folds. Then we combine all the folds except one and use that as the training set, with the left-out fold as the validation set. We do this k times, using each fold once as the validation set. Then we average the training losses and validation losses and take those averages as our training and validation loss.
• In this way all the training data gets some time being part of the validation set, so the validation loss is less noisy and we can pick better hyperparameters.

3 Sample Questions:

Question 1. Basis functions for regression

Figure 2: Basis functions for regression (c) with one real-valued input (x as horizontal axis) and one real-valued output (y as vertical axis).

We plan to run regression with the basis functions shown above, i.e., $y = \beta_1 \phi_1(x) + \beta_2 \phi_2(x) + \beta_3 \phi_3(x)$. Assume all of our data points and future points are within 1 ≤ x ≤ 5. Is this a generally useful set of basis functions to use? If “yes”, explain their prime advantage. If “no”, explain their biggest drawback. (1 to 2 sentences of explanation are expected.)

Question 2. Polynomial Regression

Suppose you are given a labeled dataset (with one real-valued input and one real-valued output) including points as shown in Figure 3:

Figure 3: A reference dataset for regression with one real-valued input (x as horizontal axis) and one real-valued output (y as vertical axis).

(a) Assuming there is no bias term in our regression model and we fit a quadratic polynomial regression (i.e., the model is $y = \beta_1 x + \beta_2 x^2$) on the data, what is the mean squared LOOCV (leave-one-out cross-validation) error?
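
As a rough illustration of the LOOCV procedure named in part (a), the sketch below fits the no-bias quadratic model on all points but one, predicts the held-out point, and averages the squared errors. The function name loocv_mse_quadratic_no_bias and the demo points are assumptions made for illustration only; the points are NOT the ones in Figure 3, which is not reproduced here. A least-squares solver from numpy is used in place of the gradient-descent solver from HW1.

```python
import numpy as np


def loocv_mse_quadratic_no_bias(x, y):
    """Mean squared leave-one-out CV error for y = b1*x + b2*x^2 (no bias)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Design matrix without a column of ones, since the model has no bias term.
    features = np.column_stack([x, x ** 2])
    squared_errors = []
    for i in range(n):
        keep = np.arange(n) != i  # boolean mask that drops point i
        # Least-squares fit on the remaining n - 1 points.
        beta, *_ = np.linalg.lstsq(features[keep], y[keep], rcond=None)
        prediction = features[i] @ beta
        squared_errors.append((y[i] - prediction) ** 2)
    return float(np.mean(squared_errors))


# Hypothetical demo points (not the Figure 3 data): an exact no-bias quadratic,
# so every held-out point is predicted almost perfectly.
x_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 0.5 * x_demo ** 2
print(loocv_mse_quadratic_no_bias(x_demo, y_demo))
```

On these demo points the printed error is close to zero (up to floating-point round-off) because the data follow the model exactly. For the actual question, the same steps can be carried out by hand on the points read off Figure 3.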