
Paper ID: 00205

FAMILY NAME:            OTHER NAME(S):
STUDENT ID:             SIGNATURE:

SCHOOL OF RISK AND ACTUARIAL STUDIES
TERM 2, 2019
ACTL 3142: ACTUARIAL DATA AND ANALYSIS
FINAL EXAM

INSTRUCTIONS:
1. TIME ALLOWED: 2 HOURS
2. READING TIME: 10 MINUTES
3. THIS EXAMINATION PAPER HAS 26 PAGES.
4. TOTAL NUMBER OF QUESTIONS: 5
5. TOTAL MARKS AVAILABLE: 100
6. MARKS AVAILABLE FOR EACH QUESTION ARE SHOWN IN THE EXAMINATION PAPER (AND OVERLEAF). ALL QUESTIONS ARE NOT OF EQUAL VALUE.
7. ANSWER ALL QUESTIONS IN THE SPACE ALLOCATED TO THEM. IF MORE SPACE IS REQUIRED, USE THE ADDITIONAL PAGES AT THE END.
8. CANDIDATES MAY BRING THEIR OWN CALCULATORS. ALL CALCULATORS MUST BE APPROVED.
9. ALL ANSWERS MUST BE WRITTEN IN INK. EXCEPT WHERE THEY ARE EXPRESSLY REQUIRED, PENCILS MAY BE USED ONLY FOR DRAWING, SKETCHING OR GRAPHICAL WORK.
10. THIS PAPER MAY NOT BE RETAINED BY THE CANDIDATE. CANDIDATES MUST CEASE WRITING IMMEDIATELY WHEN INSTRUCTED TO DO SO BY THE SUPERVISOR AT THE END OF THE EXAMINATION.

Question 1 [25 marks]
Question 2 [29 marks]
Question 3 [12 marks]
Question 4 [16 marks]
Question 5 [18 marks]
[Total: 100 marks]

Question 1 [25 marks]

In this question we are going to apply several statistical learning techniques to analyse a dataset of daily counts of rented bicycles in Washington D.C. during the year 2012, along with weather and seasonal information. The goal is to predict how many bikes will be rented depending on the weather and the day. The following data are available for each day during 2012:

count: Daily count of the number of bike rentals. The count is used as the target in the predictive models.
season: The season, either spring, summer, fall (autumn) or winter.
holiday: Indicator of whether the day was a holiday or not.
workingday: Indicator of whether the day was a working day or not.
weather: The weather situation on that day. One of:
  GOOD: for clear to partly cloudy days;
  MIST: for misty days;
  RAIN/SNOW/STORM: for days with rain, snow or storms.
temp: Temperature in degrees Celsius.
windspeed: Wind speed in km per hour.
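The categorical inputs above enter the regression in part a. as dummy variables, with one baseline level per factor dropped; the coefficient names in the output (seasonSUMMER, weatherMISTY, and so on, with no seasonSPRING or weatherGOOD term) reflect this. A sketch of the same encoding, in Python with pandas rather than the R `lm()` used in the exam (which does this automatically), with illustrative data:

```python
# Sketch (assumed pandas-style dummy coding, not the exam's R code): expanding
# the factors season and weather into dummy variables with the first level of
# each factor (SPRING, GOOD) as the baseline.
import pandas as pd

df = pd.DataFrame({
    "season": pd.Categorical(["SPRING", "SUMMER", "FALL", "WINTER"],
                             categories=["SPRING", "SUMMER", "FALL", "WINTER"]),
    "weather": pd.Categorical(["GOOD", "MIST", "RAIN/SNOW/STORM", "GOOD"],
                              categories=["GOOD", "MIST", "RAIN/SNOW/STORM"]),
})

# drop_first=True removes the first category of each factor, so the fitted
# coefficients are contrasts against the SPRING / GOOD baseline.
X = pd.get_dummies(df, drop_first=True)
print(list(X.columns))
```

The baseline choice explains why, for example, the weatherMISTY coefficient in the output below is read as a difference relative to GOOD days.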
The data was randomly divided into 250 training cases and 116 test cases.

a. [6 marks] As a first approach we fitted a multiple linear regression with count as the response and all the other variables as inputs. The results of this regression are shown below:

Call:
lm(formula = count ~ ., data = trainData)

Residuals:
    Min      1Q  Median      3Q     Max
-3615.8  -441.6   114.5   617.3  3310.0

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)              3575.52     286.09  12.498  < 2e-16 ***
seasonSUMMER             1056.63     242.23   4.362 1.91e-05 ***
seasonFALL                562.44     321.74   1.748  0.08172 .
seasonWINTER             1532.35     198.55   7.718 3.18e-13 ***
holidayHOLIDAY          -1105.89     358.31  -3.086  0.00226 **
workingdayWORKING DAY     134.75     141.40   0.953  0.34157
weatherMISTY             -835.05     138.55  -6.027 6.24e-09 ***
weatherRAIN/SNOW/STORM  -3894.90     519.06  -7.504 1.21e-12 ***
temp                      129.66      14.22   9.122  < 2e-16 ***
windspeed                 -41.87      12.82  -3.266  0.00125 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1009 on 240 degrees of freedom
Multiple R-squared: 0.7109, Adjusted R-squared: 0.7001
F-statistic: 65.57 on 9 and 240 DF, p-value: < 2.2e-16

The test MSE for this model was 788,118.5.

i. [3 marks] Provide an interpretation of the coefficients associated with the variables weather and temp.

ii. [3 marks] The plot below of the model residuals against the temperature indicates a non-linear association between the daily count of bike rentals and the temperature.
[Figure: scatter plot of the model residuals (vertical axis, about -2000 to 2000) against temperature (horizontal axis, 0 to 30), showing a curved pattern.]

Suggest one way in which the multiple linear regression fitted in part a. could be extended to deal with this non-linear association.

b. [4 marks] A regularised regression was also fitted to the bike rental data. The figure below shows the regularisation path as a function of the regularisation parameter λ (in log scale). The dashed vertical line indicates the best value of λ estimated using cross-validation and applying the one-standard-error rule. The test MSE for this value of λ was 837,853.3.

[Figure: standardised coefficients (about -500 to 1000) for seasonSUMMER, seasonFALL, seasonWINTER, workingdayWORKING DAY, weatherMISTY, weatherRAIN/SNOW/STORM, temp, holidayHOLIDAY and windspeed plotted against log(λ) (0 to 6), with a dashed vertical line labelled "Cross-Validation Solution".]

i. [2 marks] Was ridge regression or lasso used to produce the plot above? Justify your answer.

ii. [2 marks] What variables are included in the model associated with the cross-validation solution indicated by the vertical line in the above figure?

c. [3 marks] A regression tree was fitted to the bike rental data. The test MSE for this regression tree was 762,770. The graphical representation of the tree is shown below.

[Figure: regression tree with root split temp < 11.6225; further splits temp < 5.17958, season: SPRING,SUMMER, weather: MISTY,RAIN/SNOW/STORM, windspeed < 11.2499, temp < 15.97 and temp < 27.9354; leaf predictions 2541, 3981, 4967, 6588, 5150, 6414, 7247 and 6189.]

What are the characteristics of the days for which the regression tree prediction is 7247 bike rentals in the day?
Remember that at each split, observations satisfying the criterion at the node follow the left branch of the tree.

d. [8 marks] In order to improve the prediction performance of the regression tree we applied bagging and boosting to the bike rental data. Figure 1 shows the test MSE for bagging and boosting as a function of the number of trees B.

[Figure 1: test MSE (vertical axis, 0 to about 2,750,000) against the number of trees B (horizontal axis, 0 to 2000), with curves for bagging and boosting and the single-tree test MSE shown for reference.]

Figure 1: Test MSE for bagging, boosting and a single decision tree.

i. [4 marks] Bagging and boosting are both ways of ensembling multiple trees together. Briefly discuss how bagging and boosting differ in the way they produce an ensemble of trees.

ii. [4 marks] How many trees B would you choose for bagging? How many trees B would you choose for boosting? Justify your answer.

e. [4 marks] Which of the five statistical learning methods (multiple regression, regularised regression, regression tree, bagging and boosting) would you recommend for predicting the number of bike rentals? Justify your answer.

Question 2 [29 marks]

In this question we are going to explore the scale invariance properties of simple linear regression under different estimation methods. Consider a regression problem in which there is a univariate predictor X and a quantitative output Y. We have observed a training set comprising observation pairs \{x_i, y_i\} for i = 1, \dots, n. Let Z represent the standardised version of X, so that z_i = x_i / \hat{\sigma} with \hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}. Consider the following two simple linear regression models:

Y = \beta_0 + \beta_1 X + \epsilon    (1)

and

Y = \theta_0 + \theta_1 Z + \epsilon.    (2)

Recall that the standard least squares estimates of \beta_0 and \beta_1 are given by:

\hat{\beta}_1^{LS} = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0^{LS} = \bar{y} - \hat{\beta}_1^{LS} \bar{x}.

a.
[13 marks] As an alternative to least squares estimation we can estimate the parameters in Equations 1 and 2 using ridge regression. We get ridge regression coefficient estimates \hat{\beta}_0^R, \hat{\beta}_1^R, \hat{\theta}_0^R and \hat{\theta}_1^R by minimising

\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 + \lambda \beta_1^2 \quad \text{and} \quad \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 z_i)^2 + \lambda \theta_1^2,

where \lambda \geq 0 is a tuning parameter.

i. [3 marks] In ridge regression regularisation we add a penalty term to the residual sum of squares. What is the rationale for adding this penalty? What is the connection between regularisation and the bias-variance trade-off?

ii. [7 marks] Show that the ridge regression estimates of \beta_0 and \beta_1 are given by:

\hat{\beta}_1^R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2 + \lambda},    (3)

\hat{\beta}_0^R = \bar{y} - \hat{\beta}_1^R \bar{x}.    (4)

iii. [3 marks] Provide an interpretation of Equation 3. How does the estimated coefficient \hat{\beta}_1^R vary with the regularisation parameter \lambda?

b. [6 marks] We now compare the scale invariance properties of least squares and ridge regression.

i. [3 marks] Assume that we estimate the parameters in Equations 1 and 2 using least squares to obtain \hat{\beta}_0^{LS}, \hat{\beta}_1^{LS}, \hat{\theta}_0^{LS} and \hat{\theta}_1^{LS}. Show that the least squares coefficient estimates are scale equivariant. That is, show that for a given value x_0 with corresponding standardised value z_0 = x_0 / \hat{\sigma}, the predictions \hat{\beta}_0^{LS} + \hat{\beta}_1^{LS} x_0 and \hat{\theta}_0^{LS} + \hat{\theta}_1^{LS} z_0 are the same.

ii. [3 marks] Show that the ridge regression coefficient estimates are not scale equivariant. That is, show that for fixed \lambda and a given value x_0 with corresponding standardised value z_0 = x_0 / \hat{\sigma}, the predictions \hat{\beta}_0^R + \hat{\beta}_1^R x_0 and \hat{\theta}_0^R + \hat{\theta}_1^R z_0 are different.

c. [4 marks] It can be shown that the scale invariance properties of least squares and ridge regression also apply in the context of multiple linear regression.
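Both claims above can be checked numerically. The sketch below (Python, synthetic data; not part of the exam) verifies that the closed form in Equations (3) and (4) minimises the penalised residual sum of squares, and that least squares predictions are unchanged by standardising the predictor while ridge predictions, at a fixed \lambda, are not:

```python
# Numeric sketch (synthetic data): closed-form ridge estimates and the scale
# (non-)equivariance of least squares versus ridge.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(5.0, 10.0, 80)
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, 80)
lam = 5.0

sigma = np.sqrt(np.mean((x - x.mean()) ** 2))
z = x / sigma                      # standardised predictor

def ridge_fit(u, lam):
    """Closed-form simple-regression ridge estimates; lam = 0 is least squares."""
    b1 = np.sum((u - u.mean()) * (y - y.mean())) / (np.sum((u - u.mean()) ** 2) + lam)
    return y.mean() - b1 * u.mean(), b1

# 1) Equations (3)-(4) minimise the penalised RSS: perturbing either
#    coefficient strictly increases the objective.
b0, b1 = ridge_fit(x, lam)
pen_rss = lambda a0, a1: np.sum((y - a0 - a1 * x) ** 2) + lam * a1 ** 2
base = pen_rss(b0, b1)
nearby = min(pen_rss(b0 + d0, b1 + d1)
             for d0 in (-0.01, 0.01) for d1 in (-0.01, 0.01))
print(base < nearby)               # True: the closed form is the minimiser

# 2) Predictions at x0 and at its standardised value z0 = x0 / sigma.
ls0_x, ls1_x = ridge_fit(x, 0.0)
ls0_z, ls1_z = ridge_fit(z, 0.0)
r0_x, r1_x = ridge_fit(x, lam)
r0_z, r1_z = ridge_fit(z, lam)

x0 = 20.0
z0 = x0 / sigma
ls_same = np.isclose(ls0_x + ls1_x * x0, ls0_z + ls1_z * z0)   # least squares
ridge_same = np.isclose(r0_x + r1_x * x0, r0_z + r1_z * z0)    # ridge
print(ls_same, ridge_same)
```

The reason ridge differs is visible in Equation (3): rescaling x rescales \sum (x_i - \bar{x})^2 but not \lambda, so the effective amount of shrinkage changes with the units of the predictor.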
Briefly discuss the practical implications of the fact that least squares estimates are scale equivariant and that ridge regression estimates are not scale equivariant.

d. [6 marks] Assume that the predictor in our training dataset has a standard deviation \hat{\sigma} = 10. Using the training data we have estimated the following four linear regressions:

Unstandardised least squares: Y = \beta_0 + \beta_1 X + \epsilon with parameters estimated using standard least squares.
Standardised least squares: Y = \theta_0 + \theta_1 Z + \epsilon with parameters estimated using standard least squares.
Unstandardised ridge regression: Y = \beta_0 + \beta_1 X + \epsilon with parameters estimated using ridge regularisation with \lambda = 5.
Standardised ridge regression: Y = \theta_0 + \theta_1 Z + \epsilon with parameters estimated using ridge regularisation with \lambda = 5.

Providing clear reasons, order these four methods in ascending order according to:

i. [3 marks] their bias;
ii. [3 marks] their variance.

Question 3 [12 marks]

For this question, assume that we are training a maximal margin classifier on n = 4 observations in p = 2 dimensions as indicated in the table below:

Obs.  x1   x2   y
1     0    3    -1
2     h    1    -1
3     0    0    +1
4     2    2    +1

where we treat 0 ≤ h ≤ 3 as a parameter. These observations are depicted in Figure 2, in which cases where y = -1 are plotted as 'o' and cases where y = +1 as 'x'. For illustration, in this figure we have set h = 0.5.

[Figure 2: the four observations (0,3), (h,1), (0,0) and (2,2) plotted in the (x1, x2) plane, both axes from 0.0 to 3.0, with y = -1 cases as 'o' and y = +1 cases as 'x'.]

Figure 2

a. [2 marks] How large can h be so that the training observations are still linearly separable? Justify your answer.

b. [2 marks] Does the orientation (slope) of the maximal margin hyperplane change as a function of h when the observations are separable? Briefly justify.

c. [4 marks] Derive an expression for the margin M achieved by the maximal margin classifier as a function of h. Show all your workings.

d. [4 marks] Assume that h = 0.5 as in Figure 2.
What is the leave-one-out cross-validation error estimate of the test error for the maximal margin classifier in Figure 2? Show all your workings.

Question 4 [16 marks]

For this question consider a binary classification problem with only two inputs x1 and x2. Your task is to draw datasets for different settings. For each of the settings below, present a data point as a circle ('o') for cases where y = -1 and as a cross ('x') for cases where y = +1.

a. [4 marks] Draw a dataset where a support vector machine with a radial kernel would perform better than a support vector classifier. Include in your drawing the decision boundary for the support vector machine with a radial kernel. Justify your answer.

[Blank (x1, x2) axes provided for the drawing.]

b. [4 marks] Draw a dataset where logistic regression would have exactly the same performance as a decision tree. Justify your answer.

[Blank (x1, x2) axes provided for the drawing.]

c. [4 marks] Draw a dataset where linear discriminant analysis would be better than a maximal margin classifier. Justify your answer.

[Blank (x1, x2) axes provided for the drawing.]

d. [4 marks] Draw a dataset where K nearest neighbours with K = 1 would have a higher leave-one-out error than a support vector classifier. Justify your answer.

[Blank (x1, x2) axes provided for the drawing.]

Question 5 [18 marks]

Assume you are interested in understanding the distribution of blood types around the world. For this you have gathered data on the prevalence of each blood type (A+, A-, B+, B-, AB+, AB-, O+ and O-) in 50 countries around the world. The table below shows the average and standard deviation of the frequencies of each of the 8 blood types.

                     O+       A+       B+       AB+     O-      A-      B-      AB-
Average frequency    40.73%   29.20%   16.57%   4.84%   3.98%   3.27%   1.27%   0.51%
Standard deviation   11.07%    7.10%    8.85%   2.30%   2.88%   2.83%   1.06%   0.54%

Table 1: Average and standard deviation of blood type frequencies among 50 countries.

To explore these blood type data you have used Hierarchical Clustering and Principal Component Analysis (PCA).
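A minimal sketch of that pipeline, centring and scaling the variables and then running complete-linkage hierarchical clustering on Euclidean distances, can be written in Python with SciPy. The data here are a small synthetic stand-in for the 50-by-8 blood-type matrix, with two planted "country" groups:

```python
# Sketch (synthetic stand-in data, assumed SciPy tooling): centre and scale,
# then complete (maximal) linkage clustering with Euclidean distance.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
# Two artificial groups of "countries" with different blood-type profiles.
group1 = rng.normal([40.0, 30.0, 10.0], 2.0, (10, 3))
group2 = rng.normal([30.0, 20.0, 25.0], 2.0, (10, 3))
mat = np.vstack([group1, group2])

# Centre and scale each variable, as the question states was done.
mat_scaled = (mat - mat.mean(axis=0)) / mat.std(axis=0)

Z = linkage(pdist(mat_scaled), method="complete")   # complete-linkage merges
labels = fcluster(Z, t=2, criterion="maxclust")     # cut the dendrogram at 2

# Each planted group should come out as one pure cluster.
print(len(set(labels[:10])), len(set(labels[10:])))
```

Cutting the dendrogram in Figure 3 at a chosen height plays the same role as `fcluster` here: it turns the merge tree into a flat cluster assignment.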
Figure 3 shows the result of performing Hierarchical Clustering using complete (maximal) linkage and Euclidean distance. Figure 4 plots the first two principal components of the data and Table 2 gives the loadings associated with each of the principal components. We note that both PCA and Hierarchical Clustering have been carried out after centring and scaling each variable.

[Figure 3: cluster dendrogram of the 50 countries (Chile, Peru, Ethiopia, Jamaica, Egypt, Mauritania, Saudi Arabia, United Arab Emirates, Guinea, Kenya, Bahrain, Nigeria, Argentina, Cameroon, Mexico, Mongolia, Japan, Thailand, Malaysia, Philippines, Indonesia, Nepal, China, Hong Kong, Bangladesh, India, Iraq, Pakistan, Switzerland, Poland, Israel, Finland, Turkey, Austria, Denmark, Germany, Serbia, Sweden, Iceland, Ireland, Belgium, South Africa, Spain, Australia, New Zealand, Norway, France, Canada, Netherlands), with heights ranging from 0 to 10.]

Figure 3: Dendrogram of Hierarchical Clustering of countries based on blood type prevalence with complete linkage and Euclidean distance.

[Figure 4: biplot showing the 50 country scores on the first two principal components, with arrows for the loadings of O+, A+, B+, AB+, O-, A-, B- and AB-.]

Figure 4: Biplot of the first two principal components for the blood type data. The gray country names represent the scores of the first two principal components (with axes on the top and right).
The black arrows indicate the first two principal component vectors (with axes on the bottom and left).

       PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8
O+    -0.25   0.54  -0.17   0.29   0.28  -0.14  -0.08  -0.66
A+     0.35  -0.12   0.71  -0.12   0.12   0.12   0.30  -0.47
B+    -0.30  -0.47  -0.24  -0.07  -0.56   0.18   0.03  -0.53
AB+   -0.11  -0.61   0.07   0.10   0.42  -0.54  -0.35  -0.10
O-     0.43   0.18  -0.21  -0.46  -0.30  -0.65   0.04  -0.15
A-     0.48   0.03  -0.06  -0.03  -0.03   0.36  -0.78  -0.15
B-     0.34  -0.23  -0.59  -0.11   0.49   0.26   0.37  -0.12
AB-    0.43  -0.10  -0.07   0.82  -0.28  -0.16   0.17   0.03
Variance
       4.04   2.32   0.90   0.32   ?      0.11   0.09   0.01
Cumulative Variance Explained
       0.51   0.80   ?      ?      0.97   0.99   1.00   1.00

Table 2: The principal component loading vectors for the blood type data. The last two rows show the variance and the cumulative proportion of variance explained by each of the principal components.

a. [4 marks] Explain briefly the purpose of clustering and PCA. How are these two techniques different? How are they similar?

b. [3 marks] Complete the missing entry in the row "Variance" for PC5 in Table 2. Explain your answer.

c. [3 marks] Complete the bottom row of Table 2, which gives the Cumulative Variance Explained. Show all your workings.

d. [8 marks] Based on the results from the clustering analysis and the PCA, discuss what clusters of countries can be identified and discuss with clear reasons the characteristics of the blood type distribution of the countries in each of these clusters.

End of Paper

ADDITIONAL PAGES
Answer any unfinished questions here, or use for rough working.
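The arithmetic behind Question 5 parts b and c rests on one fact: after centring and scaling, each of the 8 variables has variance 1, so the principal component variances in Table 2 must sum to 8. A short Python sketch of the calculation:

```python
# Worked arithmetic for Table 2: the missing PC5 variance and the cumulative
# proportion of variance explained. With 8 standardised variables, the
# component variances sum to 8.
variances = [4.04, 2.32, 0.90, 0.32, None, 0.11, 0.09, 0.01]

total = 8.0                                    # p = 8 standardised variables
known = sum(v for v in variances if v is not None)
variances[4] = total - known                   # missing "Variance" entry, PC5

cum = []
running = 0.0
for v in variances:
    running += v
    cum.append(running / total)                # cumulative proportion explained

print(variances[4], cum)
```

This reproduces the entries already shown in the table (0.51, 0.80, ..., 0.97, 0.99, 1.00) and fills the gaps: the PC5 variance is 0.21 and the missing cumulative values for PC3 and PC4 are about 0.91 and 0.95.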