May 15, 2020

Assignment 2
MAST90083 Computational Statistics and Data Mining
Due time: 5 PM, Friday October 25th. You must submit your report via LMS.

Data analysis

The data set chicago, in the package gamair, contains data on the relationship between air pollution and the death rate in Chicago from 1 January 1987 to 31 December 2000. The seven variables are: the total number of (non-accidental) deaths each day (death); the median density over the city of large pollutant particles (pm10median); the median density of smaller pollutant particles (pm25median); the median concentration of ozone (O3) in the air (o3median); the median concentration of sulfur dioxide (SO2) in the air (so2median); the time in days (time); and the daily mean temperature (tmpd).

We will model how the death rate changes with pollution and temperature. Epidemiologists tell us that risk factors usually multiply together rather than add, so we will fit additive models to the logarithm of the number of deaths. For fitting additive models, please use the mgcv package. For this exercise, you may find it helpful to refer to Chapters 7 and 8 of Shalizi.

1. Load the data set and run summary on it.
   (a) Is temperature given in degrees Fahrenheit or degrees Celsius?
   (b) The pollution variables are negative at least half the time. What might this mean?
   (c) We will ignore the pm25median variable in the rest of this problem set. Why is this reasonable?

2. Fit a smoothing spline of log(death) on time. (You can use either smooth.spline or gam.)
   (a) Plot the smoothing spline along with the actual values.
   (b) There should be four large outliers, right next to each other in time. When are they?

3. Use gam to fit an additive model for log(death) on pm10median, o3median, so2median, tmpd and time. Use spline smoothing for each of these predictor variables. Hint: Because of some missing-data issues, some plots later may be easier to make if you set the na.action=na.exclude option when estimating the model.
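A minimal, runnable sketch of the question-3 fit. The mgcv package ships with R; the real data are gamair::chicago, but here a small simulated data frame with the same column names stands in so the sketch runs without gamair installed. The simulated coefficients are arbitrary, chosen only so the model has something to find.

```r
library(mgcv)

# Simulated stand-in for gamair::chicago (same column names, made-up values)
set.seed(1)
n  <- 400
df <- data.frame(time = 1:n,
                 tmpd = 50 + 20 * sin(2 * pi * (1:n) / 365) + rnorm(n),
                 pm10median = rnorm(n), o3median = rnorm(n), so2median = rnorm(n))
df$death <- rpois(n, exp(4.7 - 0.005 * df$tmpd + 0.02 * df$pm10median))
df$pm10median[sample(n, 10)] <- NA        # mimic the missing-data issue

# Additive model with spline smooths; na.exclude keeps NA rows in the output
fit <- gam(log(death) ~ s(pm10median) + s(o3median) + s(so2median) +
             s(tmpd) + s(time),
           data = df, na.action = na.exclude)

plot(fit, pages = 1, residuals = TRUE)    # partial responses with residuals
fitted_vals <- predict(fit, newdata = df) # NA for rows with missing predictors
```

With na.action = na.exclude the predictions line up day-by-day with the original data frame, which makes the later fitted-versus-actual plots straightforward.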
   (a) Plot the partial response functions, with partial residuals. Describe the partial response functions in words.
   (b) Plot the fitted values as a function of time, along with the actual values of log(death).
   (c) Are the outliers still there? Are they any better?

4. Using the last model you fit, we will consider the predicted impact of a 2° Celsius increase in temperature on log(death), taking the last full year of the data as a baseline. (2° Celsius is in the middle of the range of current projections for the global average effect of climate change by the end of this century.)
   (a) Prepare a data frame containing only the last full year of the data. What is the average predicted value of log(death)?
   (b) Modify this data frame to increase all temperatures by 2° Celsius.
   (c) Find the new average change in the predicted values of log(death) associated with a 2° Celsius warming.
   (d) Find a standard error for this average predicted change, using the standard errors for the prediction on each day, and assuming no correlation among them. Include an explanation of why your calculation is correct. Also give the corresponding Gaussian 95% confidence interval. Hint 1: the se.fit option to predict. Hint 2: Appendix C of Shalizi on propagation of error.
   (e) Find a standard error for the predicted change in the number of deaths (not the change in log(death)) and the corresponding Gaussian 95% confidence interval. Hint: propagation of error again.

EM algorithm

For the following exercises, you may find it helpful to refer to Chapter 17 of Shalizi.

5. Code up the EM algorithm for a mixture of K Gaussians. Simulate data from K = 3. How well does your code assign data points to clusters if you give it the actual Gaussian parameters as your initial guess? What if you give it other initial parameters?

6. Read Section 17.4 of Shalizi for the analysis of the Snoqualmie Falls data with a Gaussian mixture. As it turns out, the Gaussian mixture is rather unsatisfactory.
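Question 5's mixture-of-Gaussians EM (the same model Section 17.4 applies to the Snoqualmie Falls data) can be sketched in base R as follows. The function and variable names are my own, not from the course or Shalizi; this is one straightforward univariate implementation, not the only correct one.

```r
# EM for a univariate mixture of K Gaussians: iterate E-step (responsibilities)
# and M-step (weighted maximum-likelihood updates) until the log-likelihood stalls.
em_gauss <- function(x, mu, sigma, pi_k, n_iter = 200, tol = 1e-8) {
  K <- length(mu); n <- length(x)
  ll_old <- -Inf
  for (it in 1:n_iter) {
    # E-step: r[i, k] = P(component k | x_i) under current parameters
    dens <- sapply(1:K, function(k) pi_k[k] * dnorm(x, mu[k], sigma[k]))
    r <- dens / rowSums(dens)
    # M-step: responsibility-weighted estimates
    nk    <- colSums(r)
    pi_k  <- nk / n
    mu    <- colSums(r * x) / nk
    sigma <- sqrt(colSums(r * (x - rep(mu, each = n))^2) / nk)
    ll <- sum(log(rowSums(dens)))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(mu = mu, sigma = sigma, pi = pi_k, resp = r, loglik = ll)
}

# Simulate from K = 3 well-separated components and start EM at the truth
set.seed(42)
x <- c(rnorm(200, -4, 1), rnorm(200, 0, 1), rnorm(200, 4, 1))
fit <- em_gauss(x, mu = c(-4, 0, 4), sigma = c(1, 1, 1), pi_k = rep(1/3, 3))
cluster <- max.col(fit$resp)   # hard assignment: most responsible component
```

Starting from the true parameters, the assignments are nearly perfect here; rerunning with deliberately bad initial guesses (e.g. all means at 0) shows how sensitive EM is to initialization.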
   Write a function to fit a mixture of exponential distributions using the EM algorithm. Does it do any better than a Gaussian mixture at discovering sensible structure in the Snoqualmie Falls data? You can read the dataset into R with the command snoqualmie.
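The exponential-mixture version changes only the component density and the M-step: for a weighted exponential likelihood, the rate update is lambda_k = sum_i r[i,k] / sum_i r[i,k] x_i. A hedged sketch (names are illustrative, and the two-component simulated data below stand in for the Snoqualmie Falls values):

```r
# EM for a mixture of K exponential distributions (positive data only)
em_exp <- function(x, lambda, pi_k, n_iter = 500, tol = 1e-8) {
  K <- length(lambda); ll_old <- -Inf
  for (it in 1:n_iter) {
    # E-step: responsibilities under the current rates and mixing weights
    dens <- sapply(1:K, function(k) pi_k[k] * dexp(x, rate = lambda[k]))
    r <- dens / rowSums(dens)
    # M-step: weighted MLEs -- mean of responsibilities, inverse weighted mean of x
    pi_k   <- colMeans(r)
    lambda <- colSums(r) / colSums(r * x)
    ll <- sum(log(rowSums(dens)))
    if (abs(ll - ll_old) < tol) break
    ll_old <- ll
  }
  list(lambda = lambda, pi = pi_k, resp = r, loglik = ll)
}

# Two-component check on simulated data with rates 0.2 and 2
set.seed(7)
x <- c(rexp(300, 0.2), rexp(300, 2))
fit <- em_exp(x, lambda = c(0.1, 1), pi_k = c(0.5, 0.5))
```

Since exponential densities put all their mass on the positive half-line and are monotone decreasing, this family is a more natural candidate than Gaussians for skewed, positive precipitation amounts.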