- June 15, 2020

Question   Mark   Out of
A1                8
A2                12
A3                10
A4                6
A5                9
B1                8
B2                7
B3                10
C1                6
C2                12
C3                6
C4                6
TOTAL             100

Page 2 of 43

Instructions

Answer each question in the space provided. You can write in pen or pencil. Marks are indicated next to each question. The total mark for the exam is 100.

Part A (45 marks in total)

Question A.1 (1+1+1+1+2+1+1=8 marks)

Consider the following set of numbers: -25, 2, 3, 8, 10, 14, 18, 21, 32. For each of the questions below, state your answer, showing working if necessary.

(a) What is the median?
(b) What is the 1st quartile?
(c) What is the 3rd quartile?
(d) What is the interquartile range?
(e) Hence sketch a box-plot. Lay it out horizontally below. Be sure to mark the values of the various parts.
(f) You are told the mean of the numbers is 9.222 and the mean of their squares is 309.666. What is the sample standard deviation?
(g) If you only knew the mean and sample standard deviation of the sample, what does Chebyshev’s inequality tell you?

Question A.2 (4+2+2+4=12 marks)

Throughout this question, show your working and leave your answer in a clear form.

Of those reporting to a medical clinic, 2% have medical condition Z. It is assumed that this figure of 2% is also the base rate across the population. There is a test for condition Z such that, for those patients who have condition Z, 85% will test positive; and for those patients who do not have condition Z, 25% will test positive.

(a) If a patient tests positive, what is the probability that the patient has condition Z?

After some consideration, it is decided that the test gives too many false positives, and the test is modified as follows. The new test is simply to administer the original test twice, where it is assumed that these two tests give results that are independent of one another.
A patient will be considered to have tested positive on the new test precisely when both administrations of the original test return a positive result.

(b) If a patient has condition Z, what is the probability that the patient will test positive on the new test?
(c) If a patient does not have condition Z, what is the probability that the patient will test positive on the new test?
(d) If a patient returns a positive result on this new test, what is the probability that the patient has condition Z?

Question A.3 (2+3+3+2=10 marks)

Consider the probability density function given at the right, defined by

$$p(x) = \begin{cases} \frac{1}{2}x & 0 \le x \le 1 \\ 0.5 & 1 \le x \le 2 \\ 0.25 & 2 \le x \le 3 \\ 0 & \text{otherwise} \end{cases}$$

Consider the cumulative distribution function $P(x)$ corresponding to $p(x)$, and the quantile function $Q(p)$.

(a) What is $P(0.5)$ and $Q(0.375)$?
(b) Derive the function for $P(x)$.
(c) Hence give the quantile function $Q(p)$ corresponding to $p(x)$.
(d) Hence, or otherwise, write pseudo-code for an algorithm that will generate a sample from this distribution.

Question A.4 (2+2+2=6 marks)

If $E[X] = 1$ and $E[X^2] = 4$, $E[Y] = 0$ and $E[Y^2] = 1$, and $X$ and $Y$ are independent, then:

(a) Calculate $E[2X^2 + (X + 1)^2]$.
(b) Calculate $E[(X + 1)(Y + 1)^2]$.
(c) Calculate $V[(X + 1)(Y + 1)]$.

Question A.5 (3+3+3=9 marks)

Consider the probability density function given by a mixture of two Gaussians with identical standard deviation $\sigma$, as

$$p(x \mid \rho, \mu_1, \mu_2, \sigma) = \rho\, N(x \mid \mu_1, \sigma) + (1 - \rho)\, N(x \mid \mu_2, \sigma)$$

where $N(\cdot \mid \cdot)$ is the probability density function of a Gaussian. Thus the expected value of a function $f(x)$ under this distribution is given by

$$E_{\rho,\mu_1,\mu_2,\sigma}[f(x)] = \rho\, E_{N(\mu_1,\sigma)}[f(x)] + (1 - \rho)\, E_{N(\mu_2,\sigma)}[f(x)]$$

where the two expected values on the right-hand side are taken under the respective Gaussian distributions.

(a) What is the mean of $x$ for the mixture of two Gaussians?
(b) What is the mean of $x^2$ for the mixture of two Gaussians?
(c) What is the variance for the mixture of two Gaussians?

Part B (25 marks in total)

Question B.1 (3+2+3=8 marks)

You have data $x$ distributed as Poisson with rate $\lambda = 16$, so $x \sim \mathrm{Pois}(16)$.

(a) Show how to use the central limit theorem to get an approximate value for $p(10 \le x \le 20)$. Compute the approximate value, noting that the Z tables are only accurate to 2 decimal places.
(b) You have a sample of 10 values from this distribution, and compute its mean $\bar{x}$. What is an approximate distribution for $\bar{x}$?
(c) What are 95% confidence intervals for the mean $\bar{x}$, according to this approximation?

Question B.2 (2+5=7 marks)

IQ is considered to have a mean of 100 and a standard deviation of 15. You expect that students in your masters class will have a higher mean.

(a) Given a sample of size 10, compute a one-sided 95% confidence interval in the form $(-\infty, I]$ for where the measured mean should lie.
(b) You get data from 10 students with the form [104, 120, 100, 112, 133, 138, 111, 118, 114, 118]. Note that the mean of the sample is 116.8 and the mean of the squares of the sample is 13765.8. Test the null hypothesis that the students’ IQ has mean 100. Without assuming you know the standard deviation, give the test statistic and the p-value for this data. Note that the tables of statistics given at the back of the exam will not allow you to look up the p-value precisely.

Question B.3 (2+2+4+2=10 marks)

You obtain paired data $(X, Y)$ with values

$\vec{x}$ = [4.59, 4.60, 6.32, 4.85, 3.27, 5.92, 1.92, 6.90, 4.82, 5.39] and
$\vec{y}$ = [2.89, 2.46, 3.28, 2.34, 2.11, 3.56, 1.77, 3.29, 2.46, 2.60].

The various sample means (using the above data) are:

$$\overline{x} = 4.859, \quad \overline{y} = 2.677, \quad \overline{x^2} = 25.516, \quad \overline{y^2} = 7.460, \quad \overline{xy} = 13.670$$

(a) What is the correlation coefficient between $X$ and $Y$? What does this tell you about $X$ and $Y$?
(b) Fit a simple linear model to this data in the form $\hat{Y} = \beta_0 + \beta_1 X$. What are your estimates for $\beta_0$ and $\beta_1$?
(c) What are the standard errors for $\beta_0$ and $\beta_1$?
(d) Test the hypothesis that $\beta_1 = 0$. What is your test statistic and its p-value? What is the outcome of the test?

Part C (30 marks in total)

Question C.1 (2+2+2=6 marks)

You have a data set supplied as real-valued pairs $(X, Y)$ and you wish to regress X onto Y. You have 2 models:

A: a degree-4 polynomial $\hat{y} = \sum_{i=0}^{4} a_i x^i$
B: a degree-20 polynomial $\hat{y} = \sum_{i=0}^{20} a_i x^i$

(a) Describe how the bias of models A and B differ.
(b) Describe how the variance of models A and B differ.
(c) If you had 100 data points in your sample, which of the two models would you recommend? Justify your answer.

Question C.2 (5+3+2+2=12 marks)

(a) You wish to build a naïve Bayes classifier regressing Booleans A, B and C onto the Boolean X. Someone has already counted the data for you to create the frequency tables below:

       A=0   A=1   B=0   B=1   C=0   C=1
X=0     10    40    30    20    15    35
X=1     30    20     5    45    40    10

Construct probability tables as needed to specify the estimated naïve Bayes classifier for the task. Then give the formula for the classifier and describe how it would be used.

(b) Consider the probabilities $p(A{=}0 \mid X{=}0)$ and $p(B{=}0 \mid X{=}1)$. Compute their standard errors, making any assumptions as needed. What can you say about the resulting estimates?
(c) Which would be better for this data set, the naïve Bayes classifier or the logistic regression classifier? Justify your answer.
(d) The first step of the k-means algorithm is to initialise the centroids. Describe a way this could be done, and why it is OK to use it.

Question C.3 (6=6 marks)

Consider the probability density function given below, defined by

$$p(x) = \begin{cases} \frac{2}{\pi}\sqrt{1 - (2x - 1)^2} & 0 \le x \le 1 \\ \frac{2}{\pi}\sqrt{1 - (2x - 3)^2} & 1 \le x \le 2 \\ 0 & \text{otherwise} \end{cases}$$

This is two semicircles of radius 1/2 placed side by side, then scaled by $4/\pi$ to get a PDF.
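As a sanity check on the density above, the following minimal Python sketch evaluates it and confirms numerically that it integrates to about 1 and peaks at $2/\pi$. The function names `semicircle_pdf` and `midpoint_integral` are illustrative choices and not part of the exam, and the midpoint rule is just one convenient way to approximate the area.

```python
import math


def semicircle_pdf(x: float) -> float:
    """Density from Question C.3: two scaled semicircles on [0, 1] and [1, 2]."""
    if 0.0 <= x <= 1.0:
        centre = 1.0  # first piece uses (2x - 1)
    elif 1.0 < x <= 2.0:
        centre = 3.0  # second piece uses (2x - 3)
    else:
        return 0.0
    # max(0.0, ...) guards against tiny negative values caused by rounding.
    return (2.0 / math.pi) * math.sqrt(max(0.0, 1.0 - (2.0 * x - centre) ** 2))


def midpoint_integral(f, a: float, b: float, n: int = 100_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


total_area = midpoint_integral(semicircle_pdf, 0.0, 2.0)
peak = semicircle_pdf(0.5)  # top of the first semicircle

print(f"area under p(x) ~ {total_area:.6f}")
print(f"peak value      = {peak:.6f} (2/pi = {2.0 / math.pi:.6f})")
```

The same peak value is attained at x = 1.5, the top of the second semicircle, and the density drops to zero at x = 0, 1 and 2.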
(a) Devise pseudo-code for a rejection sampler for this distribution. Note the maximum value is marked at $\frac{2}{\pi}$.

Question C.4 (5+1=6 marks)

You wish to build a decision tree to predict a three-valued variable X. The first two features to test are Booleans A and B. Someone has already counted the data for you to create the frequency tables below:

       A=0   A=1   B=0   B=1
X=0     10    40    30    20
X=1     30    20     5    45
X=2     30    20    45     5

(a) Compute and report the quality measure for the attributes A and B using the information gain metric.
(b) Hence state which attribute you would recommend using at the root of the tree.

Blank page for additional answers if needed.