UNIVERSITY COLLEGE LONDON
Faculty of Engineering Sciences

BENG0095: Assignment 2

Dr. Dariush Hosseini ([email protected])

Monday 14th December 2020

Overview

• Assignment Release Date: 14th December 2020
• Assignment Hand-in Date: 5th January 2021 at 4.00 p.m.
• Weighting: 25% of module total
• Format: Problems

1 Guidelines

• You should answer both questions.
• Note that not all questions carry equal marks.
• You should submit your final report as a pdf via the module's Moodle page.
• Within your report you should begin each question on a new page.
• You should preface your report with a single page containing, on two lines:
  – The module code and assignment title: 'BENG0095: Assignment 2'
  – Your candidate number: 'Candidate Number: [YOUR NUMBER]'
• Your report should be neat and legible.
• You must use LaTeX to format the report. A template, 'BENG0095 Solution Template.tex', is provided on the module's Moodle page for this purpose.
• Please attempt to express your answers as succinctly as possible.
• Please note that if your answer to a question or sub-question is illegible or incomprehensible to the marker, then you will receive no marks for that question or sub-question.
• Please remember to detail your working, and to state clearly any assumptions which you make.
• Unless a question specifies otherwise, please make use of the Notation section as a guide to the definition of objects.
• Failure to adhere to any of these guidelines may result in question-specific deduction of marks. If warranted, these deductions may be punitive, and on occasion may result in no marks being awarded for the assignment.
Notation & Formulae

Inputs: \(\mathbf{x} = [1, x_1, x_2, \ldots, x_m]^T \in \mathbb{R}^{m+1}\)

Outputs: \(y \in \mathbb{R}\) for regression problems; \(y \in \{0, 1\}\) for binary classification problems

Training Data: \(\mathcal{S} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}\)

Input Training Data: The design matrix, \(\mathbf{X}\), is defined as:
\[
\mathbf{X} = \begin{bmatrix} \mathbf{x}^{(1)T} \\ \mathbf{x}^{(2)T} \\ \vdots \\ \mathbf{x}^{(n)T} \end{bmatrix}
= \begin{bmatrix} 1 & x_1^{(1)} & \cdots & x_m^{(1)} \\ 1 & x_1^{(2)} & \cdots & x_m^{(2)} \\ \vdots & \vdots & & \vdots \\ 1 & x_1^{(n)} & \cdots & x_m^{(n)} \end{bmatrix}
\]

Output Training Data:
\[
\mathbf{y} = [y^{(1)}, y^{(2)}, \ldots, y^{(n)}]^T
\]

Data-Generating Distribution: \(\mathcal{S}\) is drawn i.i.d. from a data-generating distribution, \(\mathcal{D}\)

Derivatives of Traces:
\[
\frac{\partial}{\partial \mathbf{X}} \mathrm{tr}\left(\mathbf{B}\mathbf{X}\mathbf{X}^T\right) = \mathbf{B}\mathbf{X} + \mathbf{B}^T\mathbf{X}
\qquad
\frac{\partial}{\partial \mathbf{X}} \mathrm{tr}\left(\mathbf{B}\mathbf{X}^T\mathbf{X}\right) = \mathbf{X}\mathbf{B}^T + \mathbf{X}\mathbf{B}
\]

Properties of Rotation Matrices: For any n-dimensional rotation matrix, \(\mathbf{R} \in \mathbb{R}^{n \times n}\), its properties include:
\[
\mathbf{R}\mathbf{R}^T = \mathbf{I}, \qquad \det \mathbf{R} = \pm 1
\]

1. (a) [3 marks] State the Representer Theorem, and briefly explain its importance in kernel approaches to machine learning.

(b) [4 marks] Consider a version of the so-called 'Support Vector Machine' (SVM) classifier optimisation problem:
\[
\min_{\mathbf{w}} \; \frac{1}{2}\|\mathbf{w}\|_2^2 + C \sum_{i=1}^{n} \max\left(0, \, 1 - y^{(i)}\,\mathbf{w} \cdot \mathbf{x}^{(i)}\right)
\]
Here: \(\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}\) represents a set of training data, where \(\mathbf{x} \in \mathbb{R}^m\) are input attributes, while \(y \in \{-1, 1\}\) is the output label; \(\mathbf{w} \in \mathbb{R}^m\) is the weight vector of the linear discriminant which we seek; and \(C > 0\) is some constant.

Is this problem susceptible to the 'Kernel Trick'? Explain.

(c) [3 marks] State Mercer's Theorem, and briefly explain its importance in kernel approaches to machine learning.
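The trace-derivative identities in the 'Notation & Formulae' section can be sanity-checked numerically. The sketch below (not part of the assignment; it assumes numpy and uses arbitrary random matrices) compares each analytic gradient against a central finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
X = rng.standard_normal((n, m))
B = rng.standard_normal((n, n))   # multiplier for tr(B X X^T) must be n x n
C = rng.standard_normal((m, m))   # multiplier for tr(C X^T X) must be m x m

# Analytic gradients from the formula sheet
grad1 = B @ X + B.T @ X           # d/dX tr(B X X^T) = BX + B^T X
grad2 = X @ C.T + X @ C           # d/dX tr(C X^T X) = XC^T + XC

def fd_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

num1 = fd_grad(lambda M: np.trace(B @ M @ M.T), X)
num2 = fd_grad(lambda M: np.trace(C @ M.T @ M), X)
assert np.allclose(num1, grad1, atol=1e-5)
assert np.allclose(num2, grad2, atol=1e-5)
```

Both checks pass for any choice of the matrices, which is a quick way to convince yourself of the identities before using them in Question 2(b).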
(d) Given: that \(\kappa_1 : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}\) and \(\kappa_2 : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}\) are valid Mercer kernels; that \(\tilde{\kappa}(\mathbf{u},\mathbf{v}) = \kappa_1(\mathbf{u},\mathbf{v})\,\kappa_2(\mathbf{u},\mathbf{v})\) is also a valid kernel; that \(\alpha > 0\) is some scalar; and that \(p(\cdot)\) is a polynomial with non-negative coefficients, show that the following are valid kernels (here \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^m\) denote input attribute vectors):

(i) [3 marks] \(\kappa(\mathbf{u},\mathbf{v}) = \kappa_1(\mathbf{u},\mathbf{v}) + \kappa_2(\mathbf{u},\mathbf{v})\)

(ii) [3 marks] \(\kappa(\mathbf{u},\mathbf{v}) = \alpha\,\kappa_1(\mathbf{u},\mathbf{v})\)

(iii) [3 marks] \(\kappa(\mathbf{u},\mathbf{v}) = p(\kappa_1(\mathbf{u},\mathbf{v}))\)

(iv) [3 marks] \(\kappa(\mathbf{u},\mathbf{v}) = \exp(\kappa_1(\mathbf{u},\mathbf{v}))\)

(e) [3 marks] Hence show that the following 'Radial Basis Function' (RBF) is a valid kernel:
\[
\kappa(\mathbf{u},\mathbf{v}) = \exp\left(-\frac{\|\mathbf{u}-\mathbf{v}\|_2^2}{2\sigma^2}\right)
\]
where \(\sigma > 0\).

[Total for Question 1: 25 marks]

2. Assume an unlabelled dataset, \(\{\mathbf{x}^{(i)} \in \mathbb{R}^m\}_{i=1}^{n}\), with sample mean \(\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}^{(i)}\). In the 'Projected Variance Maximisation' approach to PCA we are interested in finding the d-dimensional subspace spanned by \(\{\mathbf{u}^{[j]} \in \mathbb{R}^m\}_{j=1}^{d}\), where \(d < m\), such that \(\mathbf{u}^{[i]} \cdot \mathbf{u}^{[j]} = \delta_{ij} \; \forall i,j\), for which the sum of the sample variances of the data projected onto this subspace is maximised.

We let \(\mathbf{X} = [(\mathbf{x}^{(1)} - \bar{\mathbf{x}}), (\mathbf{x}^{(2)} - \bar{\mathbf{x}}), \ldots, (\mathbf{x}^{(n)} - \bar{\mathbf{x}})]^T\) and \(\mathbf{U} = [\mathbf{u}^{[1]}, \mathbf{u}^{[2]}, \ldots, \mathbf{u}^{[d]}]\).

(a) [9 marks] Consider the orthogonal projection, \(\mathbf{P}_\mathbf{U}\), of the unlabelled data into the space spanned by the columns of \(\mathbf{U}\), such that the projection of the data point \(\mathbf{x}^{(i)}\) is given by:
\[
\mathbf{P}_\mathbf{U}\mathbf{x}^{(i)} = \mathbf{U}(\mathbf{U}^T\mathbf{U})^{-1}\mathbf{U}^T\mathbf{x}^{(i)}
\]
Using this expression, and without making any further assumptions regarding the nature of \(\mathbf{U}\), demonstrate that the 'Projected Variance Maximisation' approach to PCA leads to an optimisation problem which can be solved by finding the stationary points associated with the following Lagrangian, where \(\mathbf{H}\) is a symmetric matrix composed of Lagrange multipliers associated with the problem. (Note that we denote the Lagrange multiplier associated with the constraint \(\mathbf{u}^{[i]} \cdot \mathbf{u}^{[j]} = \delta_{ij}\) by \(\lambda^{[i,j]}\)):
\[
\mathrm{tr}\left(\mathbf{U}^T\mathbf{S}\mathbf{U}\right) + \mathrm{tr}\left(\mathbf{H}(\mathbf{I} - \mathbf{U}^T\mathbf{U})\right)
\]
In so doing you should show that:
\[
\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X},
\qquad
[\mathbf{H}]_{ij} = \begin{cases} \lambda^{[i,i]}, & i = j, \\[4pt] \dfrac{\lambda^{[i,j]}}{2}, & i \neq j. \end{cases}
\]
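Mercer's condition can also be illustrated empirically: for any finite set of points, the Gram matrix of a valid kernel must be symmetric and positive semi-definite. The following minimal numpy sketch (not part of the assignment; random points and an arbitrary \(\sigma\)) checks this for the RBF kernel of part (e):

```python
import numpy as np

def rbf_gram(U, V, sigma=1.0):
    """Gram matrix of the RBF kernel, via ||u-v||^2 = ||u||^2 - 2 u.v + ||v||^2."""
    sq = np.sum(U**2, axis=1)[:, None] - 2 * U @ V.T + np.sum(V**2, axis=1)[None, :]
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))     # 20 arbitrary points in R^5
K = rbf_gram(X, X)

eigvals = np.linalg.eigvalsh(K)      # K is symmetric, so use eigvalsh
assert np.allclose(K, K.T)           # symmetry
assert eigvals.min() > -1e-10        # positive semi-definite (up to round-off)
```

Passing this check for one point set does not prove validity, of course; the proof requested in (e) is the chain of closure properties from part (d).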
(b) [7 marks] Making use of the expressions for the derivatives of traces included in the 'Notation & Formulae' section, and seeking critical points of the Lagrangian, we derive the following problem:
\[
\nabla_\mathbf{U}\mathcal{L} = \mathbf{0}
= \frac{\partial}{\partial \mathbf{U}}\mathrm{tr}\left(\mathbf{U}^T\mathbf{S}\mathbf{U}\right) + \frac{\partial}{\partial \mathbf{U}}\mathrm{tr}\left(\mathbf{H}(\mathbf{I} - \mathbf{U}^T\mathbf{U})\right)
= \mathbf{S}\mathbf{U} + \mathbf{S}^T\mathbf{U} - \mathbf{U}\mathbf{H} - \mathbf{U}\mathbf{H}^T
= 2(\mathbf{S}\mathbf{U} - \mathbf{U}\mathbf{H})
\]
\[
\implies \mathbf{S}\mathbf{U} = \mathbf{U}\mathbf{H} \tag{1}
\]
One solution to this problem is given by \(\mathbf{U} = \mathbf{U}_e\), where the columns of \(\mathbf{U}_e\) are the first d eigenvectors of \(\mathbf{S}\), \(\{\mathbf{u}_e^{[j]}\}_{j=1}^{d}\), with corresponding eigenvalues \(\{\lambda_e^{[j]}\}_{j=1}^{d}\), ordered by magnitude with the largest first. In this case \(\mathbf{H}_e\), the form of the matrix \(\mathbf{H}\) associated with this solution, is characterised by:
\[
\mathbf{H}_e = \mathrm{diag}[\lambda_e^{[1]}, \ldots, \lambda_e^{[d]}]
\]
Typical algorithms for finding the eigenvectors of a \(k \times k\) matrix have a computational cost that scales like \(O(k^3)\).

Using problem (1) and the solution given by \(\mathbf{U} = \mathbf{U}_e\), derive the following equation, which characterises equivalent solutions \(\{\mathbf{v}^{[j]} \in \mathbb{R}^n\}_{j=1}^{d}\):
\[
\frac{1}{n}\mathbf{X}\mathbf{X}^T\mathbf{v}^{[j]} = \lambda_e^{[j]}\mathbf{v}^{[j]}
\]
In so doing, characterise \(\mathbf{v}^{[j]}\) in terms of quantities given so far.

(c) [2 marks] Explain why this might be valuable when \(n \ll m\).

(d) [7 marks] Given a solution \(\mathbf{v}^{[j]}\), find an expression for \(\mathbf{u}_e^{[j]}\) in terms of this solution.

[Total for Question 2: 25 marks]
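The computational point behind parts (b) and (c) can be checked numerically. The sketch below (an illustration only, not the requested derivation; it assumes numpy and random centred data with \(n \ll m\)) confirms that the small \(n \times n\) matrix \(\frac{1}{n}\mathbf{X}\mathbf{X}^T\) shares its non-zero eigenvalues with the large \(m \times m\) matrix \(\mathbf{S} = \frac{1}{n}\mathbf{X}^T\mathbf{X}\), and that applying \(\mathbf{X}^T\) to an eigenvector \(\mathbf{v}\) of the small problem yields an eigenvector of \(\mathbf{S}\):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 10, 50                        # far fewer samples than features (n << m)
X = rng.standard_normal((n, m))
X = X - X.mean(axis=0)               # centre, matching the question's X

S_big   = (X.T @ X) / n              # m x m: O(m^3) eigendecomposition
S_small = (X @ X.T) / n              # n x n: O(n^3) eigendecomposition

lam_big = np.sort(np.linalg.eigvalsh(S_big))[::-1]
lam_small, V = np.linalg.eigh(S_small)
order = np.argsort(lam_small)[::-1]  # eigh returns ascending order; flip it
lam_small, V = lam_small[order], V[:, order]

# The n x n problem reproduces the non-zero spectrum of the m x m one.
assert np.allclose(lam_big[:n], lam_small, atol=1e-8)

# X^T v (normalised) is an eigenvector of S_big with the same eigenvalue.
u = X.T @ V[:, 0]
u /= np.linalg.norm(u)
assert np.allclose(S_big @ u, lam_small[0] * u, atol=1e-8)
```

Here the eigendecomposition touches only an \(n \times n\) matrix, which is exactly the saving that part (c) asks you to explain.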