
Tutoring Case: ECMM444

May 15, 2020

ECMM444 Fundamentals of Data Science: Continuous Assessment 2

This continuous assessment (CA) comprises 60% of the overall module assessment. This is an individual exercise and your attention is drawn to the College and University guidelines on collaboration and plagiarism, which are available from the College website. As a rule of thumb, to understand when collaboration becomes plagiarism, consider the following: it is OK when students communicate and support each other in better understanding the concepts presented in the lectures; it is not OK when students communicate how these concepts can be combined and used to solve specific assignment questions.

Question 1

Acquire the Iris dataset using the following procedure:

    from sklearn.datasets import load_iris
    X, y = load_iris(return_X_y=True)

The data matrix X contains 150 vectors (also called instances) with 4 attributes each (i.e. it is a 150 x 4 matrix) and the vector y contains the class encoded as the integers 0, 1, and 2.

a) Split the data matrix X into two data matrices D_tr and D_ts, each containing a balanced number of instances per class (i.e. D_tr contains as many instances from class k as D_ts). Split the class vector y into y_tr and y_ts accordingly (i.e. the first instance in y_tr is the class of the first instance in D_tr, etc.).

b) Define a function to compute the distance between two vectors (of arbitrary dimension) as the length of the difference vector.

c) Using the distance function, build the function one_knn_predict that implements the 1-nearest-neighbor classification technique. This will later be used to implement a k-nearest neighbor classifier. The function takes in input two data matrices D_ts and D_tr and a target vector y_tr. For each instance in D_ts it returns the class associated to the closest (i.e. the least distant) instance in D_tr.

d) Create a function fit_LDA that takes in input a data matrix and an associated class vector and outputs the fit parameters for the Linear Discriminant Analysis (LDA) classifier. Create a function test_LDA that takes in input a data matrix and the fit parameters for the LDA classifier and returns a prediction for each element in the data matrix. The two functions fit_LDA and test_LDA form your implementation of an LDA classifier (i.e. do not use a third-party library implementation for the LDA classifier).

e) Use D_tr and y_tr to fit your implementation of the 1-nearest-neighbor classifier using the one_knn_predict function and your implementation of the LDA classifier. Compute the accuracy of the 1-nearest-neighbor classifier and the accuracy of the LDA classifier on D_ts. The accuracy is the proportion of true results (i.e. when the class predicted was the same as the true class) over the total number of predictions.

(Total 30 marks)
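As a concrete but non-prescriptive illustration of parts b) and c) above, the sketch below computes the distance as the Euclidean norm of the difference vector and performs brute-force 1-nearest-neighbor prediction. The names one_knn_predict and the argument order (D_ts, D_tr, y_tr) follow the question text; the helper name distance and everything else are assumptions, not the required submission.

    import numpy as np

    def distance(u, v):
        # Length of the difference vector (Euclidean norm), for vectors of any dimension.
        return np.linalg.norm(np.asarray(u) - np.asarray(v))

    def one_knn_predict(D_ts, D_tr, y_tr):
        # For each instance in D_ts, return the class of the least distant instance in D_tr.
        predictions = []
        for x in D_ts:
            dists = [distance(x, z) for z in D_tr]
            predictions.append(y_tr[int(np.argmin(dists))])
        return np.array(predictions)

Given such a function, the accuracy required in part e) could, for example, be computed as np.mean(one_knn_predict(D_ts, D_tr, y_tr) == y_ts).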
Question 2

Acquire the Iris dataset as indicated in Question 1. Select only the instances relative to a single class and denote the data matrix as D.

a) Create a function add_missing that takes in input a data matrix D and a number k and returns a data matrix D′ and a k × 2 matrix P. Each row of the matrix P contains the row and column indices (i, j) of an entry D_ij in D chosen uniformly at random. The matrix D′ is a copy of D except for the entries specified in P: the value in each entry D′_ij is the column average of D (i.e. the average of column j of D).

b) Create a function impute that takes in input a data matrix M and a number r and returns a data matrix M′. The matrix M′ is the reconstruction of M using its r largest singular vectors and values (i.e. it is the truncated SVD reconstruction of M).

c) Compute the average length E of the difference vectors between the corresponding instances in D and in its reconstructed version.

d) Repeat the following 30 times: from D generate D′ using k = 50; apply impute to D′ using a specific r to compute M; compute E between D and M. Consider the average E over the 30 trials. Report a plot of the average E value when the procedure is repeated 30 times, for k = 50 and r = 1, 2, 3, 4.

(Total marks 30)

Question 3

a) Build a function make_data to generate a sample matrix from a random multivariate Gaussian distribution. The function should take in input the vector space dimension p and the desired number of samples. To define the distribution, generate a random p-dimensional vector as the mean with values in [−20, 20] and a random p × p covariance matrix with values in [−1, 1]. Your procedure should guarantee that the covariance matrix is a positive definite matrix.

b) Using the function make_data, generate 3 sample matrices, each containing 200 instances of dimension p = 4. Combine them in a single data matrix D. Build a corresponding class vector y containing the identity of the Gaussian distribution of origin for each instance.

c) Perform a Principal Component Analysis and compute D′ as the 2-dimensional projection of D along the main components. Plot D′ distinguishing the instance class by color.

d) Create a function fit_QDA that takes in input a data matrix and an associated class vector and outputs the fit parameters for the Quadratic Discriminant Analysis (QDA) classifier. Create a function test_QDA that takes in input a data matrix and the fit parameters for the QDA classifier and returns a prediction for each element in the data matrix. The two functions fit_QDA and test_QDA form your implementation of a QDA classifier (i.e. do not use a third-party library implementation for the QDA classifier).

e) Compute and plot the decision surface of your Quadratic Discriminant Analysis classifier relative to the data matrix D′ and class vector y (hint: you can use a set of test points arranged on a regular 2D grid). You should obtain a plot similar to the following:

[example QDA decision-surface plot from the original assignment sheet, not reproduced here]

(Total marks 40)
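For Question 2 b), the truncated SVD reconstruction can be obtained directly from numpy's SVD routine. The sketch below is a minimal example under that assumption; only the name impute and the arguments M and r come from the question.

    import numpy as np

    def impute(M, r):
        # Reconstruct M from its r largest singular values and the
        # corresponding singular vectors (truncated SVD reconstruction).
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        return (U[:, :r] * s[:r]) @ Vt[:r, :]

For Question 3 a), one possible way (an assumption, not mandated by the question) to guarantee a positive definite covariance matrix with entries in [−1, 1] is to draw a random square matrix A with entries in [−1, 1] and use A @ A.T / p, which keeps every entry in [−1, 1] and is positive definite with probability 1:

    import numpy as np

    def make_data(p, n_samples, rng=None):
        # Sample n_samples points from a random p-dimensional Gaussian distribution.
        rng = np.random.default_rng() if rng is None else rng
        mean = rng.uniform(-20, 20, size=p)   # random mean with values in [-20, 20]
        A = rng.uniform(-1, 1, size=(p, p))
        cov = A @ A.T / p                     # entries in [-1, 1]; positive definite almost surely
        return rng.multivariate_normal(mean, cov, size=n_samples)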
Submitting your work

Please write your student ID in the first cell of the notebook. You should submit the Jupyter notebook containing the code with its output for all the questions. Make a separate cell for each point a), b), c), etc. of each question. Submit a single archive file (.zip or .tgz) containing both a PDF copy of your notebook and the source file with extension .ipynb. Markers will not be able to give feedback if you do not submit the PDF version of your code, and marks will be deducted if you fail to do so.

Marking criteria

Work will be marked against the following criteria. Although it varies a bit from question to question, they all have approximately equal weight.

- Does your algorithm correctly solve the problem? In most of the questions the required code has been described, but not always in complete detail, and some decisions are left to you.
- Is the code syntactically correct? Is your program a legal Python program, regardless of whether it implements the algorithm?
- Is the code beautiful or ugly? Is the implementation clear and efficient, or is it unclear and extremely inefficient (e.g. it takes more than a few minutes to execute)?
- Is the code well structured? Have you made good use of functions? Are you using Numpy functions on entire arrays when possible?
- Is the code well laid out and commented? Is there a comment describing what the code does? Have you used space to make the code clear to human readers?

There are 10% penalties for:

- Not submitting the PDF version of your programs.
- Not creating functions as instructed in the questions.
