COMP0036 LSA – Assignment 1 September 16, 2019

1 Question 1

1.1 Answer for (a)

Using the ordinary least squares (OLS) method to find such a curve is not feasible, because OLS has no unique solution here. A cubic function modelling the relationship between $x$ and $y$ takes the form
\[
y = \sum_{n_i \ge 0,\; n_1 + n_2 + n_3 + n_4 + n_5 \le 3} w_{n_1 \dots n_5} \prod_{1 \le i \le 5} x_i^{n_i}.
\]
The number of independent terms in this model is $\binom{8}{3} = 56$, which is larger than the sample size of 50. Writing $Z$ for the design matrix of this expanded model, the OLS solution takes the form
\[
w = (Z^T Z)^{-1} Z^T y.
\]
As mentioned, $Z$ is a $50 \times 56$ matrix, so its rank is at most 50 and the $56 \times 56$ matrix $Z^T Z$ is singular, hence non-invertible. In other words, we do not have enough equations to determine all of the unknowns.

1.2 Answer for (b)

If one of the five attributes is discarded from the cubic model, the number of independent terms drops to $\binom{7}{3} = 35$, and the design matrix has dimension $50 \times 35$. We now have enough equations to determine the unknown parameters, so the problem in (a) is resolved.

1.3 Answer for (c)

In (a) the problem is under-determined: the model can fit the in-sample data perfectly while performing poorly on out-of-sample data, which is a severe form of over-fitting. To improve this, we need a shrinkage method that controls the complexity of the model. Ridge regression does so by penalising the weights: it imposes the $\ell_2$ constraint $\|w\|_2^2 < t$, which, by the theory of Lagrange multipliers, is equivalent to the objective
\[
\arg\min_w \; \frac{1}{2} \sum_{i=1}^{n} \big(y^{(i)} - w^T x^{(i)}\big)^2 + \frac{\lambda}{2} \|w\|_2^2.
\]
Minimising over $w$ gives the closed-form solution $w = (X^T X + \lambda I)^{-1} X^T y$. In this question the design matrix is $Z$, so we replace $X$ by $Z$. For any $\lambda > 0$, the matrix $Z^T Z + \lambda I$ always has full rank 56, so it is invertible and the model has a unique solution.
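The rank argument above can be checked numerically. The following is a minimal sketch, assuming a random $50 \times 56$ matrix as a stand-in for the real design matrix $Z$ and synthetic targets:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 56                       # 50 samples, 56 cubic-expansion terms
Z = rng.standard_normal((n, p))     # stand-in for the real design matrix Z
y = rng.standard_normal(n)          # synthetic targets

# OLS normal matrix: 56 x 56, but its rank is at most 50, hence singular
rank_ols = np.linalg.matrix_rank(Z.T @ Z)
print(rank_ols)                     # < 56: no unique OLS solution

# Ridge normal matrix: full rank for any lambda > 0
lam = 1.0
A = Z.T @ Z + lam * np.eye(p)
print(np.linalg.matrix_rank(A))     # 56: the ridge solution is unique
w_ridge = np.linalg.solve(A, Z.T @ y)
```

The same comparison holds for any $n < p$: the $\lambda I$ term lifts every eigenvalue of $Z^T Z$ away from zero, which is exactly why the ridge solution exists and is unique.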
1.4 Answer for (d)

If we need to perform feature selection and keep only the most important features, ridge regression is of limited use; we should use the LASSO method instead. Unlike ridge regression, the LASSO uses an $\ell_1$ penalty, which shrinks some weights exactly to zero. So if we want to keep the important features and discard the unimportant ones, we add an $\ell_1$-norm regulariser to the objective function and find the optimal solution with a (sub)gradient descent method; finally, we keep only the features with non-zero weights.

2 Question 2

2.1 Answer for (a)

Since $\epsilon \sim N(0, \alpha^2)$ and $x$ are independent, $y \sim N(wx, \alpha^2)$. Therefore
\[
P(S \mid w) = \prod_{i=1}^{n} P\big((x^{(i)}, y^{(i)}) \mid w\big)
= \prod_{i=1}^{n} (2\pi\alpha^2)^{-\frac{1}{2}} \exp\!\Big(\!-\frac{(y^{(i)} - w x^{(i)})^2}{2\alpha^2}\Big)
= (2\pi\alpha^2)^{-\frac{n}{2}} \exp\!\Big(\!-\frac{\sum_{i=1}^{n} (y^{(i)} - w x^{(i)})^2}{2\alpha^2}\Big).
\]
So we can read off $A = (2\pi\alpha^2)^{-\frac{n}{2}}$ and $B = -\frac{\sum_{i=1}^{n} (y^{(i)} - w x^{(i)})^2}{2\alpha^2}$.

2.2 Answer for (b)

In the Bayesian view, $w$ is now a random variable with prior $w \sim N(0, \beta^2)$. As before, $y \sim N(wx, \alpha^2)$, so the posterior distribution of $w$ is
\[
P(w \mid x, y) = \frac{P(x, y, w)}{P(x, y)} = \frac{P(y \mid x, w)\, P(w)}{P(x, y)}.
\]
Since $P(x, y)$ does not depend on $w$, we may treat it as a constant $Z$, and then
\[
P(w \mid x, y) = \frac{1}{Z} P(y \mid x, w)\, P(w)
= \frac{1}{Z} (2\pi\alpha^2)^{-\frac{1}{2}} (2\pi\beta^2)^{-\frac{1}{2}} \exp\!\Big(\!-\Big(\frac{(y - wx)^2}{2\alpha^2} + \frac{w^2}{2\beta^2}\Big)\Big).
\]

2.3 Answer for (c)

The maximum a posteriori (MAP) estimate is obtained from
\[
\arg\max_w P(w \mid x, y) \;\Leftrightarrow\; \arg\max_w P(y \mid x, w)\, P(w).
\]
Over the whole sample, the log-posterior (note that the prior enters once, not once per data point) is
\[
L(w) = \ln P(w) + \sum_{i=1}^{n} \ln P\big(y^{(i)} \mid x^{(i)}, w\big)
= -\frac{1}{2\alpha^2} \sum_{i=1}^{n} \big(y^{(i)} - w x^{(i)}\big)^2 - \frac{w^2}{2\beta^2} + \text{const}.
\]
Setting the derivative of $L$ to zero gives the MAP estimate
\[
w_{\mathrm{MAP}} = \frac{\sum_{i=1}^{n} x^{(i)} y^{(i)}}{\alpha^2} \times \Big(\frac{\sum_{i=1}^{n} (x^{(i)})^2}{\alpha^2} + \frac{1}{\beta^2}\Big)^{-1}.
\]

2.4 Answer for (d)

No, this is not the only reasonable answer, because MAP gives only a point estimate of $w$, while in Bayesian theory $w$ is a random variable with a full posterior distribution.
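As a sanity check on the closed-form $w_{\mathrm{MAP}}$ derived in (c), the sketch below compares it with a brute-force maximisation of the log-posterior on synthetic data; the values $\alpha = \beta = 1$, the sample size, and the true weight are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 1.0, 1.0                 # assumed noise and prior scales
n, w_true = 100, 0.7                   # assumed sample size and true weight
x = rng.standard_normal(n)
y = w_true * x + alpha * rng.standard_normal(n)

# Closed-form MAP estimate from (c)
w_map = (x @ y / alpha**2) / (x @ x / alpha**2 + 1 / beta**2)

# Brute force: maximise the log-posterior on a fine grid of w values
grid = np.linspace(-2.0, 2.0, 40001)
resid = y[:, None] - grid[None, :] * x[:, None]
log_post = -(resid**2).sum(axis=0) / (2 * alpha**2) - grid**2 / (2 * beta**2)
w_grid = grid[np.argmax(log_post)]

print(w_map, w_grid)    # the two estimates agree to grid resolution
```

Note that the grid search maximises exactly the $L(w)$ written above, dropping the constant term, which does not move the argmax.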
MAP therefore serves only as a sort of halfway house on the way to full Bayesian inference.

2.5 Answer for (e)

The MAP estimate of $w$ is reminiscent of the maximum likelihood estimator (MLE). From the analysis in (c), the biggest difference between MAP estimation and MLE is that MAP takes the prior distribution of $w$ into consideration. So if the prior on $w$ is a uniform distribution, the two methods give the same result.

3 Question 3

3.1 Answer for (a)

Discriminative classification, also known as the distribution-free approach, learns a classifier by directly optimising a loss function; the SVM, for example, simply learns a decision function that minimises the classification loss. Generative classification, by contrast, models the probability distribution of the samples and produces its answer by maximising the posterior probability that a sample belongs to a class.

3.2 Answer for (b)

The main purpose of classification is to find a relationship between the observed sample $x$ and the label $y$. Because $y$ is a binary variable, we introduce a latent variable $y^*$ and map it to $y$ by
\[
y = \begin{cases} 1, & \text{if } y^* \ge 0 \\ 0, & \text{if } y^* < 0 \end{cases}
\]
We assume $y^*$ is related to $x$ by $y^* = wx + \epsilon$ with $\epsilon \sim \mathrm{Logistic}(0, 1)$. Then
\[
P_Y(y = 1 \mid x) = P(y^* \ge 0 \mid x) = P(wx + \epsilon \ge 0) = P(\epsilon \ge -wx)
= 1 - P(\epsilon < -wx) = 1 - \frac{1}{1 + e^{wx}} = \frac{e^{wx}}{1 + e^{wx}} = \frac{1}{1 + e^{-wx}},
\]
and this is equivalent to the posterior probability described in question (b).

3.3 Answer for (c)

If we allow the logistic parameters to take more general values $a \in \mathbb{R}$ and $b > 0$, the resulting logistic regression model takes the form
\[
P_Y(y = 1 \mid x) = \frac{1}{1 + \exp\!\big(-\frac{wx + a}{b}\big)}.
\]
Then, unlike the logistic regression model stated in (b), we obtain a different linear discriminant separating hyperplane.
The hyperplane in this situation is defined by $wx + a = 0$.

3.4 Answer for (d)

When we use a Gaussian noise latent variable model, i.e. $\epsilon \sim N(0, 1)$, then, similarly to (b), the posterior $P_Y(y \mid x)$ is
\[
P_Y(y = 1 \mid x) = P(y^* \ge 0 \mid x) = P(wx + \epsilon \ge 0) = P(\epsilon \ge -wx)
= 1 - \Phi(-wx) = \Phi(wx) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{wx} e^{-\frac{t^2}{2}}\, dt,
\]
and $P_Y(y = 0 \mid x) = 1 - P_Y(y = 1 \mid x)$.

3.5 Answer for (e)

There is indeed a difference between the two models when the sample contains outliers. Comparing the two noise distributions, the Gaussian has much lighter tails than the logistic, so the model derived in (d) rises more steeply and penalises distant misclassified points more heavily. The logistic noise latent variable model of (b) is therefore more robust to outliers than the Gaussian noise model of (d).

3.6 Answer for (f)

When $K = 2$, there are two classes. To conform to our previous notation, label the two classes 0 and 1. Then
\[
P_Y(y = 1 \mid x) = \frac{e^{w_1 x}}{e^{w_0 x} + e^{w_1 x}} = \frac{1}{1 + e^{-(w_1 - w_0) x}}.
\]
Letting $w = w_1 - w_0$, we can see that this is exactly logistic regression.

3.7 Answer for (g)

For multinomial logistic regression with $K > 2$, a sample belongs to class $j$ ($1 \le j \le K$) exactly when $P_Y(y = j \mid x) > P_Y(y = i \mid x)$ for all $i \ne j$, $1 \le i \le K$. That is,
\[
\frac{\exp(w_j x)}{\sum_{k=1}^{K} \exp(w_k x)} > \frac{\exp(w_i x)}{\sum_{k=1}^{K} \exp(w_k x)}
\;\Rightarrow\; \exp(w_j x) > \exp(w_i x)
\;\Rightarrow\; w_j x > w_i x
\;\Rightarrow\; (w_j - w_i) x > 0.
\]
So the classification boundaries of multinomial logistic regression are given by $(w_j - w_i) x = 0$: a family of linear discriminant separating hyperplanes.
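The $K = 2$ reduction in (f) can be verified numerically. A minimal sketch, using hypothetical scalar weights $w_0$ and $w_1$ and synthetic inputs:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
w0, w1 = 0.5, -1.3                  # hypothetical per-class weights
x = rng.standard_normal(10)         # synthetic scalar inputs

# Softmax probability of class 1 with K = 2 ...
p_soft = softmax(np.stack([w0 * x, w1 * x], axis=-1))[:, 1]
# ... equals logistic regression with w = w1 - w0
p_logit = sigmoid((w1 - w0) * x)

print(np.allclose(p_soft, p_logit))   # True
```

The same cancellation is what makes the boundaries in (g) linear: only differences $w_j - w_i$ of the per-class weights matter, since the shared normalising denominator divides out.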