August 20, 2020

UNIVERSITY OF SOUTHAMPTON

MATH6170W1 SEMESTER 1 EXAMINATION 2019/20

MATH6170 Statistical Inference for Data Scientists

Duration: 120 min (2 hours)

This paper contains five questions. Answer ALL questions. An outline marking scheme is shown in brackets to the right of each question. Formula sheet FS/MATH6170/2019-20 will be available. Only University approved calculators may be used. A foreign language direct 'Word to Word' translation dictionary (paper version) ONLY is permitted, provided it contains no notes, additions or annotations.

Copyright 2020 v01 © University of Southampton

1. [20 marks] Let $X_1, \ldots, X_n$ be independent and identically distributed random variables from the Laplace distribution with p.d.f.
$$f_X(x; \theta) = \frac{1}{2\theta} \exp\left(-\frac{|x|}{\theta}\right), \quad x \in \mathbb{R}, \quad \theta > 0.$$
Here $|x|$ denotes the absolute value of $x$.

(a) [6 marks] Write down the log-likelihood function for $\theta$ and find an expression for its maximum likelihood estimator, $\hat\theta$.

(b) [6 marks] Find the expected information for $\theta$ and hence show that the asymptotic variance of $\hat\theta$ is $\theta^2/n$.

(c) Suppose the $X_i$ represent the strength of cables (measured on the log scale). If a cable is too weak, then it is not suitable for industrial use. Let $\nu$ denote the first percentile of the distribution of the $X_i$, so that the probability that a log-cable strength $X_i$ is less than $\nu$ is 0.01.

(i) [4 marks] Show that the distribution function of the $X_i$ is, for $t \le 0$,
$$F_X(t) = \frac{1}{2}\exp\left(\frac{t}{\theta}\right)$$
and hence find the maximum likelihood estimator of $\nu$ as a function of $\hat\theta$.

(ii) [4 marks] Derive an approximate 95% confidence interval for $\nu$ as a function of $\hat\theta$.

2. [20 marks] Suppose that $X_1, \ldots, X_n$ are a random sample from the random variable $X$ with p.d.f.
$$f_X(x; \theta) = \frac{2}{\theta}\, x \exp\left(-\frac{x^2}{\theta}\right), \quad x > 0, \quad \theta > 0.$$

(a) [4 marks] Show that $Y = X^2$ has an exponential distribution and obtain the parameter of this distribution.
(b) [4 marks] State the Neyman–Pearson Lemma for testing $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$.

(c) [7 marks] Consider the problem of testing $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$, where $\theta_1 > \theta_0$. Show that the critical region of the most powerful test for testing $H_0$ against $H_1$ can be expressed as
$$C = \left\{x_1, \ldots, x_n : \sum_{i=1}^n x_i^2 > c\right\},$$
for some $c > 0$. Hence conclude that this test is also the uniformly most powerful test for testing $H_0: \theta = \theta_0$ against $H_A: \theta > \theta_0$.

(d) [5 marks] Explain why, for the level $\alpha$ uniformly most powerful test, $c$ is the $1-\alpha$ quantile of a gamma distribution, and give the parameters of this distribution.

3. [20 marks] Assume that $Y_1, Y_2, \ldots, Y_n$ are independent observations and $Y_i$ is normally distributed with mean $\beta x_i$ and known variance $\sigma^2$ for $i = 1, \ldots, n$, where $x_1, \ldots, x_n$ are known constants.

(a) [5 marks] Write down the likelihood function and then the log-likelihood function for $\beta$. Show that the maximum likelihood estimate of $\beta$ is given by
$$\hat\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}.$$

(b) [6 marks] Show that $E(\hat\beta) = \beta$ and $\mathrm{Var}(\hat\beta) = \dfrac{\sigma^2}{\sum_{i=1}^n x_i^2}$.

(c) [6 marks] Assume that $\beta$ has a normal prior distribution with mean $\beta_0$ and variance $\tau^2$, where $\beta_0$ and $\tau^2$ are known constants. Derive the posterior distribution of $\beta$, and show that
$$E(\beta \mid \mathbf{y}) = \sigma_1^2\left(\frac{\sum_{i=1}^n x_i y_i}{\sigma^2} + \frac{\beta_0}{\tau^2}\right), \quad \text{where} \quad \sigma_1^2 = \mathrm{Var}(\beta \mid \mathbf{y}) = \frac{1}{\frac{\sum_{i=1}^n x_i^2}{\sigma^2} + \frac{1}{\tau^2}},$$
for $\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$.

(d) [3 marks] Show that the mean of the posterior distribution is a weighted average of the prior mean $\beta_0$ and the maximum likelihood estimate of $\beta$.

4. [20 marks]

(a) [3 marks] Describe the role of a prior distribution in Bayesian inference, and give examples of two different methods of selecting prior distributions.

Suppose that $X_1, \ldots,$
$X_n$ are independent and identically distributed random variables following an exponential distribution with probability density function
$$f(x \mid \theta) = \theta e^{-\theta x}, \quad x > 0, \quad \theta > 0.$$

(b) [5 marks] Show that the Jeffreys prior distribution for $\theta$ is given by $\pi(\theta) \propto 1/\theta$.

(c) [4 marks] Show that the posterior distribution for $\theta$ under the Jeffreys prior is a gamma distribution with parameters $\alpha = n$ and $\beta = \sum_{i=1}^n x_i$.

(d) [8 marks] Find an appropriate normal approximation to the posterior distribution of $\theta$ using the posterior mode and the Hessian of the log posterior density.

5. [20 marks] Suppose that $X_1, \ldots, X_n$ are independent and identically distributed uniform random variables on the interval $(0, \theta)$. A Pareto prior distribution is assumed for $\theta$, with density
$$\pi(\theta) = \begin{cases} b a^b \theta^{-b-1}, & \theta > a \\ 0, & \text{otherwise}, \end{cases}$$
with known constants $a > 0$ and $b > 1$.

(a) [10 marks] Show that this is a conjugate prior distribution.

(b) [10 marks] Find the mean and the mode of the posterior distribution. For which loss functions are these the Bayes estimators?

END OF PAPER

Solutions

1. (a) The log-likelihood function is
$$\ell(\theta) = \sum_{i=1}^n \left(-\log 2 - \log\theta - \frac{|x_i|}{\theta}\right) = -n\log\theta - \frac{1}{\theta}\sum_{i=1}^n |x_i| + c,$$
where $c = -n\log 2$ does not depend on $\theta$. Differentiating gives
$$\frac{d\ell}{d\theta} = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^n |x_i|.$$
Setting this to zero gives
$$\hat\theta = \frac{\sum_{i=1}^n |x_i|}{n},$$
which is a maximum, since
$$H(\theta) = \frac{d^2\ell}{d\theta^2} = \frac{n}{\theta^2} - \frac{2}{\theta^3}\sum_{i=1}^n |x_i| \quad \text{and} \quad H(\hat\theta) = \frac{n}{\hat\theta^2} - \frac{2}{\hat\theta^3}\, n\hat\theta = -\frac{n}{\hat\theta^2} < 0.$$

(b) Now
$$I(\theta) = -E[H(\theta)] = -\frac{n}{\theta^2} + \frac{2}{\theta^3}\sum_{i=1}^n E(|X_i|).$$
Furthermore,
$$E(|X|) = \int_{-\infty}^{\infty} |x|\, \frac{1}{2\theta}\exp\left(-\frac{|x|}{\theta}\right) dx = \int_0^{\infty} x\, \frac{1}{\theta}\exp\left(-\frac{x}{\theta}\right) dx = \theta,$$
since this integral is the expectation of an exponential distribution with mean $\theta$. Therefore,
$$I(\theta) = -\frac{n}{\theta^2} + \frac{2}{\theta^3}\, n\theta = \frac{n}{\theta^2} \quad \text{and} \quad \mathrm{Var}(\hat\theta) \approx \frac{1}{I(\theta)} = \frac{\theta^2}{n}.$$
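The results of parts (a) and (b), together with the confidence interval of part (c), can be checked by simulation. The sketch below is illustrative only; the values of $\theta$, $n$, the number of replications and the seed are arbitrary choices, and NumPy's Laplace `scale` parameter corresponds to $\theta$ in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 400, 4000   # arbitrary illustrative values

# Laplace(0, theta) samples: numpy's `scale` matches theta in the question.
x = rng.laplace(loc=0.0, scale=theta, size=(reps, n))

# Part (a): the m.l.e. is the mean absolute value.
theta_hat = np.abs(x).mean(axis=1)

# Part (b): the sampling variance of theta_hat should be close to theta^2 / n.
print(theta_hat.var(), theta**2 / n)

# Part (c): nu_hat = log(0.02) * theta_hat, with an approximate 95% CI.
nu_true = np.log(0.02) * theta
nu_hat = np.log(0.02) * theta_hat
half = 1.96 * (-np.log(0.02)) * theta_hat / np.sqrt(n)
coverage = np.mean((nu_hat - half <= nu_true) & (nu_true <= nu_hat + half))
print(coverage)   # should be close to 0.95
```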
(c) (i) For $t \le 0$,
$$F_X(t) = \int_{-\infty}^{t} \frac{1}{2\theta}\exp\left(-\frac{|x|}{\theta}\right) dx = \int_{-\infty}^{t} \frac{1}{2\theta}\exp\left(\frac{x}{\theta}\right) dx = \left[\frac{1}{2}\exp\left(\frac{x}{\theta}\right)\right]_{-\infty}^{t} = \frac{1}{2}\exp\left(\frac{t}{\theta}\right).$$
We require $\nu$ such that $F_X(\nu) = 0.01$. Therefore $\nu = \theta\log(0.02)$ and, by the invariance property of the m.l.e., $\hat\nu = \hat\theta\log(0.02) = -3.91\hat\theta$.

(ii) An approximate 95% confidence interval for $\nu$ is $\hat\nu \pm 1.96\,\mathrm{s.e.}(\hat\nu)$, where $\mathrm{s.e.}(\hat\nu) = 3.91\,\mathrm{s.e.}(\hat\theta) = 3.91\,\hat\theta/\sqrt{n}$. Hence
$$\left[-3.91\hat\theta - 1.96 \times 3.91\frac{\hat\theta}{\sqrt{n}},\ -3.91\hat\theta + 1.96 \times 3.91\frac{\hat\theta}{\sqrt{n}}\right].$$
Alternatively, applying the monotonic (decreasing) transformation $\theta \mapsto -3.91\,\theta$ to the approximate 95% confidence interval for $\theta$, $\hat\theta \pm 1.96\,\mathrm{s.e.}(\hat\theta)$, gives, with the endpoints swapped since multiplication by a negative constant reverses their order,
$$\left[-3.91\left(\hat\theta + 1.96\frac{\hat\theta}{\sqrt{n}}\right),\ -3.91\left(\hat\theta - 1.96\frac{\hat\theta}{\sqrt{n}}\right)\right].$$
Both simplify to
$$\left[-3.91\left(1 + \frac{1.96}{\sqrt{n}}\right)\hat\theta,\ -3.91\left(1 - \frac{1.96}{\sqrt{n}}\right)\hat\theta\right].$$

2. (a) Here $y = x^2$, so $x = \sqrt{y}$ and $\frac{dx}{dy} = \frac{1}{2\sqrt{y}}$. Since $y > 0$,
$$f_Y(y) = \frac{1}{2\sqrt{y}} \cdot \frac{2}{\theta}\sqrt{y}\,\exp\left(-\frac{y}{\theta}\right) = \frac{1}{\theta}\exp\left(-\frac{y}{\theta}\right).$$
Therefore $Y$ follows the exponential distribution with rate parameter $1/\theta$ (mean $\theta$).

(b) Suppose that $x_1, \ldots, x_n$ are observations of random variables $X_1, \ldots, X_n$ whose joint p.d.f. is $f_X(\mathbf{x}; \theta)$. Amongst all tests of significance level (size) $\le \alpha$ of $H_0: f_X = f_X(\mathbf{x}; \theta_0)$ against $H_1: f_X = f_X(\mathbf{x}; \theta_1)$, the test with the smallest Type II error probability (largest power) is the likelihood ratio test, which rejects $H_0$ when $\frac{f_X(\mathbf{x};\theta_1)}{f_X(\mathbf{x};\theta_0)}$ is large.

(c) Here the log-likelihood function is
$$\log f_X(\mathbf{x}; \theta) = n\log\left(\frac{2}{\theta}\right) + \sum_{i=1}^n \log x_i - \frac{1}{\theta}\sum_{i=1}^n x_i^2.$$
Hence the test obtained from the Neyman–Pearson Lemma rejects $H_0$ if
$$\log f_X(\mathbf{x}; \theta_1) - \log f_X(\mathbf{x}; \theta_0) > K$$
$$\Longrightarrow \; n\log\left(\frac{2}{\theta_1}\right) - \frac{1}{\theta_1}\sum_{i=1}^n x_i^2 - n\log\left(\frac{2}{\theta_0}\right) + \frac{1}{\theta_0}\sum_{i=1}^n x_i^2 > K$$
$$\Longrightarrow \; -n\log\left(\frac{\theta_1}{\theta_0}\right) + \sum_{i=1}^n x_i^2\left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right) > K \;\Longrightarrow\; \sum_{i=1}^n x_i^2\left(\frac{1}{\theta_0} - \frac{1}{\theta_1}\right) > K_1 \;\Longrightarrow\; \sum_{i=1}^n x_i^2 > c,$$
for some $c$, provided $\theta_1 > \theta_0$ (so that $1/\theta_0 - 1/\theta_1 > 0$).
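As a numerical aside, this most powerful test can be carried out directly. The sketch below uses the gamma null distribution of $T = \sum X_i^2$ stated in part (d); the values of $\theta_0$, $\theta_1$, $n$, $\alpha$ and the seed are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Arbitrary illustrative values.
theta0, theta1, n, alpha = 1.0, 2.0, 30, 0.05

# Under H0, T = sum(X_i^2) ~ gamma(shape=n, scale=theta0), so the critical
# value c is the (1 - alpha) quantile of that distribution (part (d)).
c = stats.gamma.ppf(1 - alpha, a=n, scale=theta0)

# Part (a) gives X^2 ~ exponential with mean theta, so simulate T directly
# from exponential draws.
def sample_T(theta, reps):
    e = rng.exponential(scale=theta, size=(reps, n))  # E = X^2
    return e.sum(axis=1)

size = np.mean(sample_T(theta0, 20_000) > c)    # should be close to alpha
power = np.mean(sample_T(theta1, 20_000) > c)   # should be well above alpha
print(size, power)
```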
Since this test does not depend on the actual value of $\theta_1$, but only on the fact that $\theta_1 > \theta_0$, it is also the uniformly most powerful test of $H_0$ against $H_A: \theta > \theta_0$.

(d) For the uniformly most powerful test of size $\alpha$ we require $c$ such that $P(T > c) = \alpha$, where $T = \sum_{i=1}^n X_i^2$, or equivalently $P(T < c) = 1 - \alpha$; hence $c$ is the $1-\alpha$ quantile of the distribution of $T$. In part (a) it was shown that $Y = X^2$ is exponential with mean $\theta$, so under $H_0$ each $X_i^2 \sim \text{exponential}(\theta_0)$ and therefore
$$T = \sum_{i=1}^n X_i^2 \sim \text{gamma}(n, \theta_0),$$
a gamma distribution with shape parameter $n$ and scale parameter $\theta_0$.

3. (a) Here the likelihood function is
$$f_{\mathbf{Y}}(\mathbf{y}; \beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(y_i - \beta x_i)^2\right\} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta x_i)^2\right\}.$$
Hence the log-likelihood function is
$$\log f_{\mathbf{Y}}(\mathbf{y}; \beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta x_i)^2.$$
Now
$$\frac{\partial}{\partial\beta}\log f_{\mathbf{Y}}(\mathbf{y}; \beta) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (-2x_i)(y_i - \beta x_i).$$
Setting this to zero gives
$$\hat\beta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}.$$
Also,
$$\frac{\partial^2}{\partial\beta^2}\log f_{\mathbf{Y}}(\mathbf{y}; \beta) = -\frac{1}{\sigma^2}\sum_{i=1}^n x_i^2 < 0,$$
so a maximum has been identified.

(b) Here
$$E(\hat\beta) = E\left(\frac{\sum_{i=1}^n x_i Y_i}{\sum_{i=1}^n x_i^2}\right) = \frac{\sum_{i=1}^n x_i E(Y_i)}{\sum_{i=1}^n x_i^2} = \frac{\sum_{i=1}^n x_i \beta x_i}{\sum_{i=1}^n x_i^2} = \beta$$
and
$$\mathrm{Var}(\hat\beta) = \mathrm{Var}\left(\frac{\sum_{i=1}^n x_i Y_i}{\sum_{i=1}^n x_i^2}\right) = \frac{\sum_{i=1}^n x_i^2\, \mathrm{Var}(Y_i)}{\left(\sum_{i=1}^n x_i^2\right)^2} = \frac{\sigma^2\sum_{i=1}^n x_i^2}{\left(\sum_{i=1}^n x_i^2\right)^2} = \frac{\sigma^2}{\sum_{i=1}^n x_i^2}.$$

(c) Likelihood:
$$f(y_1, \ldots, y_n \mid \beta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - \beta x_i)^2} \propto e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta x_i)^2}.$$
Prior:
$$\pi(\beta) = \frac{1}{\sqrt{2\pi\tau^2}}\, e^{-\frac{1}{2\tau^2}(\beta - \beta_0)^2}.$$
Therefore the posterior density is proportional to
$$\pi(\beta \mid \mathbf{y}) \propto e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \beta x_i)^2 - \frac{1}{2\tau^2}(\beta - \beta_0)^2} = e^{-\frac{1}{2}M},$$
with
$$M = \frac{\sum y_i^2}{\sigma^2} - 2\beta\frac{\sum y_i x_i}{\sigma^2} + \beta^2\frac{\sum x_i^2}{\sigma^2} + \frac{\beta^2}{\tau^2} - \frac{2\beta\beta_0}{\tau^2} + \frac{\beta_0^2}{\tau^2}$$
$$= \beta^2\left(\frac{\sum x_i^2}{\sigma^2} + \frac{1}{\tau^2}\right) - 2\beta\left(\frac{\sum y_i x_i}{\sigma^2} + \frac{\beta_0}{\tau^2}\right) + \frac{\sum y_i^2}{\sigma^2} + \frac{\beta_0^2}{\tau^2} = \frac{\beta^2}{\sigma_1^2} - \frac{2\beta\beta_1}{\sigma_1^2} + \frac{\sum y_i^2}{\sigma^2} + \frac{\beta_0^2}{\tau^2}$$
$$= \frac{(\beta - \beta_1)^2}{\sigma_1^2} - \sigma_1^2\left(\frac{\sum y_i x_i}{\sigma^2} + \frac{\beta_0}{\tau^2}\right)^2 + \frac{\sum y_i^2}{\sigma^2} + \frac{\beta_0^2}{\tau^2}.$$
Here,
$$\sigma_1^2 = \frac{1}{\frac{\sum x_i^2}{\sigma^2} + \frac{1}{\tau^2}} \quad \text{and} \quad \beta_1 = \sigma_1^2\left(\frac{\sum y_i x_i}{\sigma^2} + \frac{\beta_0}{\tau^2}\right).$$
The posterior density is therefore proportional to a $N(\beta_1, \sigma_1^2)$ density, i.e.
$$\beta \mid \mathbf{y} \sim N(\beta_1, \sigma_1^2).$$

(d) The m.l.e. of $\beta$ is $\hat\beta = \frac{\sum y_i x_i}{\sum x_i^2}$. We can write the posterior expectation as
$$\beta_1 = \sigma_1^2\left(\frac{\sum y_i x_i}{\sigma^2} + \frac{\beta_0}{\tau^2}\right) = \frac{\frac{\sum y_i x_i}{\sigma^2} + \frac{\beta_0}{\tau^2}}{\frac{\sum x_i^2}{\sigma^2} + \frac{1}{\tau^2}} = \frac{\tau^2\sum y_i x_i + \sigma^2\beta_0}{\tau^2\sum x_i^2 + \sigma^2} = \frac{\tau^2\,\frac{\sum y_i x_i}{\sum x_i^2} + \frac{\sigma^2}{\sum x_i^2}\,\beta_0}{\tau^2 + \frac{\sigma^2}{\sum x_i^2}} = \frac{w_1\hat\beta + w_2\beta_0}{w_1 + w_2},$$
where $w_1 = \tau^2$ and $w_2 = \frac{\sigma^2}{\sum x_i^2}$.

4. (a) The prior distribution encapsulates knowledge about the unknown parameters, or other quantities, before data are observed. After data have been observed, it is updated to a posterior distribution using Bayes' theorem. Two common choices of prior distribution are conjugate priors, which when combined with the likelihood result in a posterior from the same family of distributions, and non-informative priors, which contain little or no information about the parameter, for example by assigning equal probability/density to each possibility.

(b) Jeffreys prior: $\pi(\theta) \propto I(\theta)^{1/2}$, where $I(\theta)$ is the Fisher information for $\theta$.
Likelihood:
$$f(\mathbf{x} \mid \theta) = \theta^n \exp\left\{-\theta\sum_{i=1}^n x_i\right\}.$$
Log-likelihood:
$$\log f(\mathbf{x} \mid \theta) = n\log\theta - \theta\sum_{i=1}^n x_i.$$
Hence
$$\frac{d\log f}{d\theta} = \frac{n}{\theta} - \sum_{i=1}^n x_i \quad \text{and} \quad \frac{d^2\log f}{d\theta^2} = -\frac{n}{\theta^2}.$$
Then
$$I(\theta) = -E\left(-\frac{n}{\theta^2}\right) = \frac{n}{\theta^2},$$
and therefore $\pi(\theta) \propto 1/\theta$.

(c) Posterior density:
$$\pi(\theta \mid \mathbf{x}) \propto \theta^n\exp\left\{-\theta\sum_{i=1}^n x_i\right\} \times \frac{1}{\theta} = \theta^{n-1}\exp\left\{-\theta\sum_{i=1}^n x_i\right\}.$$
Hence $\theta \mid \mathbf{x} \sim \text{Gamma}\left(n, \sum_{i=1}^n x_i\right)$.

(d) A normal approximation is $\theta \mid \mathbf{x} \approx N(\tilde\mu, \tilde\sigma^2)$, where $\tilde\mu$ is the posterior mode and $\tilde\sigma^2$ is the inverse of the negative Hessian of the log posterior density, evaluated at the posterior mode.
Log-posterior:
$$\log\pi(\theta \mid \mathbf{x}) = (n-1)\log\theta - \theta\sum_{i=1}^n x_i + \text{const}.$$
Hence
$$\frac{d\log\pi(\theta \mid \mathbf{x})}{d\theta} = \frac{n-1}{\theta} - \sum_{i=1}^n x_i,$$
and so the posterior mode is
$$\tilde\mu = \frac{n-1}{\sum_{i=1}^n x_i}.$$
The Hessian is
$$\frac{d^2\log\pi(\theta \mid \mathbf{x})}{d\theta^2} = -\frac{n-1}{\theta^2},$$
and hence
$$\tilde\sigma^2 = -\left[-\frac{n-1}{\tilde\mu^2}\right]^{-1} = \frac{1}{n-1}\left(\frac{n-1}{\sum_{i=1}^n x_i}\right)^2 = \frac{n-1}{\left(\sum_{i=1}^n x_i\right)^2}.$$

5. (a) The likelihood and prior are
$$f(\mathbf{x} \mid \theta) = \begin{cases} \theta^{-n}, & 0 < x_i < \theta \ \text{for all } i \\ 0, & \text{otherwise} \end{cases} = \begin{cases} \theta^{-n}, & \theta > \max_i(x_i) \\ 0, & \text{otherwise} \end{cases}$$
$$\pi(\theta) = \begin{cases} b a^b\theta^{-b-1}, & \theta > a \\ 0, & \theta \le a. \end{cases}$$
Hence the posterior density is proportional to
$$\pi(\theta \mid \mathbf{x}) \propto \begin{cases} \theta^{-(b+n)-1}, & \theta > \max\{a, \max_i(x_i)\} = m \\ 0, & \text{otherwise}, \end{cases}$$
and so $\theta \mid \mathbf{x} \sim \text{Pareto}(m, b+n)$. As the prior and posterior distributions are from the same family, the Pareto prior is conjugate for the uniform likelihood.

(b) The posterior density is
$$\pi(\theta \mid \mathbf{x}) = (b+n)\, m^{b+n}\, \theta^{-(b+n)-1}, \quad \theta > m.$$
Posterior mode: $\pi(\theta \mid \mathbf{x})$ is clearly maximised when $\theta$ takes its minimum value, since for fixed $p > 0$, $\theta^{-p}$ is decreasing in $\theta$; hence $\tilde\theta = m$.
Posterior mean:
$$E(\theta \mid \mathbf{x}) = (b+n)\, m^{b+n}\int_m^{\infty}\theta^{-(b+n)}\, d\theta = \frac{(b+n)\, m^{b+n}}{(b+n-1)\, m^{b+n-1}} = \frac{(b+n)\, m}{b+n-1}.$$
The posterior mode is the Bayes estimator under the 0–1 loss function; the posterior mean is the Bayes estimator under squared-error loss.
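The closed-form posterior mean and mode for Question 5 can be checked against Monte Carlo draws from the Pareto posterior. In the sketch below the values of $a$, $b$, $\theta$ and the seed are arbitrary illustrative choices; note that NumPy's `pareto` sampler is the Lomax distribution, so a Pareto($m$, $\alpha$) draw is $m(1 + \text{Lomax}(\alpha))$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative values, not part of the question.
a, b, theta = 1.0, 2.0, 3.0
x = rng.uniform(0.0, theta, size=50)
n = x.size

# Posterior: theta | x ~ Pareto(m, b + n) with m = max(a, max_i x_i).
m = max(a, x.max())
post_mode = m                          # density decreases in theta, so mode = m
post_mean = (b + n) * m / (b + n - 1)  # closed form from part (b)

# rng.pareto draws Lomax; m * (1 + Lomax(alpha)) is Pareto with scale m.
draws = m * (1.0 + rng.pareto(b + n, size=200_000))
print(post_mean, draws.mean())         # the two should be close
```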