- September 17, 2020

ENEN90032 – Environmental Analysis Tools
Assignment 1
Dongryeol Ryu, Manish K. Patel, and Jie Jian
26 August 2020

Submit an electronic copy (in PDF format) to the Turnitin menu of the subject LMS by 12pm (NOON) on Monday 14 September 2020. Make sure you meet the Infrastructure Engineering submission requirements (include the coversheet with signatures of team members). Include appropriate graphs and tables in your solution. The report should contain no more than 2500 words excluding figures.

Submit your Matlab codes via your assignment group folder. Give your code files self-explanatory names (e.g., Q1 1.m, Q3 1.m) and comment the code properly. Compress your codes/report and name the archive with your group number (i.e., Assignment 01 Group 01.zip).

For the Hypothesis Test problems, you must explain the rationale behind the choice of the null hypothesis, test statistic, alternative hypothesis and conclusions. 30% of the marks for each section will be given based on the report quality and the quality of figures and tables whenever they are required. Figures should be properly labeled and self-explanatory. For each question, please describe individual members’ contributions briefly. You are expected to work TOGETHER on all questions; issues arising from splitting assignment questions between the assignment group members will not be assisted with.

1 Exploratory Data Analysis – Meteorological Datasets (20 marks)

Go to the Climate Data Online [1] of the Bureau of Meteorology and choose a weather station in Perth, Western Australia (or surrounding area, with a six-digit station number starting with 009XXX), one in Brisbane, Queensland (or surrounding area, with 04XXXX) and a station in Melbourne, Victoria (station number starting with 08XXXX). Download daily rainfall data of the stations collected in a year between 2010 and 2019 inclusive (the selected year for the Brisbane and the Melbourne stations should be identical).

[1] http://www.bom.gov.au/climate/data/

Missing values in the selected year should be fewer than 10. For the rainfall data analysis, we will be using wet-day daily rainfall data, which excludes zero-rainfall events and values lower than the detection limit (assume that the detection limit is 0.25 mm).

1. Make a table that summarizes the location (sample mean, median and trimean), spread (sample standard deviation, IQR and median absolute deviation) and symmetry (sample skewness and Yule-Kendall index) of the datasets in the cities. Can you infer the skewness of the datasets by comparing the mean with the median? Based on the shape of the distribution (refer to the figures produced in the next question), discuss the robustness of the summary statistics calculated above.
2. For the wet-day daily rainfall data, fit i) a Gaussian, ii) a gamma, and iii) a Weibull [2] distribution function to the dataset and compare the fitted distribution models with the data distribution. For graphical representation of the probability density of the data, use Gaussian kernel estimates that produce a smoothed curve for the probability density (you may use the Matlab function ksdensity to generate kernel estimates). Also, compare the empirical cumulative distribution with the fitted CDFs (Gaussian, gamma, and Weibull) and make a Q-Q plot for evaluation. Judge which model fits your data best based on the graphical examinations.
3. Suggest a probability density model that closely fits the wet-day daily rainfall data (use all datasets from Perth, Brisbane and Melbourne) and compare the performance of your choice with the results in the previous question. You can use any existing model with 1 to 3 parameters. You may refer to published research articles, e.g., Ye et al. (2018) [3].
4. For the above fits to the rainfall data (including your suggested model), calculate the log-likelihood values of the fits and use them to support your judgement above quantitatively.
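
As a rough illustration of the summary statistics requested in item 1 above, the sketch below computes them for a made-up sample. It is written in Python for illustration only (submitted code must still be Matlab). The function names and the sample values are hypothetical, and the formulas used (trimean = (q25 + 2 q50 + q75)/4, Yule-Kendall index = (q25 - 2 q50 + q75)/IQR, skewness with an n - 1 denominator) are one common convention, so check them against the definitions given in the subject notes.

```python
import statistics

DETECTION_LIMIT_MM = 0.25  # detection limit stated in the assignment


def wet_day_values(daily_rain_mm):
    """Keep wet days only: drop missing values (None), zeros and
    values below the detection limit."""
    return [r for r in daily_rain_mm
            if r is not None and r >= DETECTION_LIMIT_MM]


def summary_statistics(x):
    """Location, spread and symmetry measures for a sample x."""
    n = len(x)
    mean = statistics.fmean(x)
    median = statistics.median(x)
    std = statistics.stdev(x)  # sample standard deviation (n - 1)
    # Quartiles by linear interpolation; other quartile conventions
    # (assumption) give slightly different values.
    q25, q50, q75 = statistics.quantiles(x, n=4, method="inclusive")
    iqr = q75 - q25
    mad = statistics.median([abs(v - median) for v in x])
    skewness = sum((v - mean) ** 3 for v in x) / ((n - 1) * std ** 3)
    return {
        "mean": mean,
        "median": median,
        "trimean": (q25 + 2.0 * q50 + q75) / 4.0,
        "std": std,
        "iqr": iqr,
        "mad": mad,
        "skewness": skewness,
        "yule_kendall": (q25 - 2.0 * q50 + q75) / iqr,
    }


# Hypothetical daily totals in mm (not real station data),
# including dry days, a sub-detection value and a missing value.
sample = [0.0, 0.2, 1.0, 2.0, 3.0, 5.0, 10.0, None]
for name, value in summary_statistics(wet_day_values(sample)).items():
    print(f"{name:>12s}: {value:.3f}")
```

In this toy sample the mean exceeds the median and both symmetry measures are positive, which is the pattern the mean-versus-median question in item 1 asks about; real station data may of course behave differently.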

[2] https://en.wikipedia.org/wiki/Weibull_distribution; you may use Matlab functions weibcdf, weibfit, weibpdf for the Weibull distribution.
[3] https://doi.org/10.5194/hess-22-6519-2018

2 Newcomb-Michelson Velocity of Light Experiments (10 marks)

Simon Newcomb of the Nautical Almanac Office (NAO), U.S., published the velocity of light [Newcomb, 1883] [4] based on a series of experiments he conducted with Albert Michelson until 1882. The dataset ‘NewcombLight.txt’ contains 66 samples (time in seconds taken for light to travel 7442 meters at sea level) that Newcomb collected in 1882. Conduct the t-test and the bootstrap-based one-sample tests and provide an estimate of the population mean of the light velocity (in m/s) at a confidence level of your choice. Do the estimates include the widely known speed of light given HERE [5]? Do the estimates from the t-test and the bootstrap show any systematic difference? If so, provide possible reasons based on the sampling distributions used by the two approaches.

3 Space Shuttle O-Ring Failures (10 marks)

On 27 January 1986, the night before the space shuttle Challenger exploded, engineers at the company that built the shuttle warned NASA scientists that the shuttle should not be launched because of predicted cold weather. Fuel seal problems, which had been encountered in earlier flights, were suspected of being associated with low temperatures. It was argued, however, that the evidence was inconclusive. The decision was made to launch, even though the temperature at launch time was 29 °F (approximately −1.67 °C). The dataset ‘O Ring Data.XLS’ summarizes the number of O-ring incidents on 24 space shuttle flights prior to the Challenger disaster. Launch temperature was below 65 °F for data labeled ‘COOL’ and above 65 °F for data labeled ‘WARM’. Conduct a permutation test of whether the number of O-ring incidents was associated with the temperature, using a 99% confidence interval with your choice of one-sided or two-sided test options.
Use 10,000 permutations to draw your conclusion. Justify your choice and show your null distribution as a histogram with the test statistic marked on it. Make your final suggestion about the launch of the space shuttle on the day of the accident, based on the quantitative evidence that supports your suggestion.

[4] http://vigo.ime.unicamp.br/~fismat/newcomb.pdf
[5] https://en.wikipedia.org/wiki/Speed_of_light

4 Cloud Seeding Experiment (15 marks)

The dataset ‘Cloud Seeding Case Study.XLS’ contains data collected in southern Florida between 1968 and 1972 to test the hypothesis that massive injection of silver iodide into cumulus clouds can lead to increased rainfall (J. Simpson, and J. Eden, “A Bayesian Analysis of a Multiplicative Treatment Effect in Weather Modification,” Technometrics 17 (1975)). An airplane flew on 52 days in total; silver iodide was injected on 26 randomly chosen days. To prevent bias, the pilot was not aware of whether the seeding material was loaded on any particular day. The rainfall was measured by radar as the total rain volume falling from the cloud base following the airplane seeding.

1. Using a parametric method, test whether the cloud seeding had a significant impact on rainfall, using both 95% and 99% confidence intervals. Choose between one-sided and two-sided tests and justify your choice.
2. Repeat the above test using a permutation test. Use 10,000 permutations to draw your conclusion and show your resampled data in a histogram. Compare your results with those from the parametric test above and explain any differences based on the pros and cons of the two methods.
3. You may have noticed that the rainfall data in the cloud seeding experiment are highly skewed. Transform the rainfall values using a logarithm function and repeat the parametric test under the same conditions used for Question 4.1 above. Does the transformation change your conclusion?
If so, discuss the difference and what the results imply about the need for data transformation when the data are highly asymmetric.

5 Atmospheric CO2 Concentration during Global Forced Confinement by COVID-19 (12 marks)

Global forced lockdowns caused by the fast-spreading COVID-19 [6] since late January 2020 reportedly reduced global CO2 emissions. A recent report [7] estimates the reduction in CO2 emissions to be as high as 17%. In this section, we examine whether the atmospheric CO2 concentration during the peak forced confinement period, April and May 2020, was lower than the level it would have reached without COVID-19.

[6] https://ourworldindata.org/grapher/covid-stringency-index?year=2020-08-24
[7] https://www.nature.com/articles/s41558-020-0797-x

To examine the atmospheric CO2 in April and May 2020, we use the monthly CO2 data [8] maintained by the Scripps Institution of Oceanography at the University of California, San Diego in the US. The Mauna Loa CO2 monitoring station in Hawaii provides the longest continuous record of atmospheric CO2 concentration, starting in 1958, and is ideally located to measure globally representative CO2 values. Download the monthly CO2 data HERE [9] and use the unadjusted values in the 5th column.

1. The monthly time series of atmospheric CO2 shows a steady increase from the beginning of monitoring, with strong seasonal fluctuations. The anomaly of the concentration in 2020 should therefore be examined after removing the background trend and cyclic fluctuations. The seasonal fluctuation can be removed by sampling only April values to test concentrations in April and only May values to test concentrations in May. Construct separate batches of CO2 values in April and May for 1958-2020. To remove the long-term trend from the time series, fit a quadratic function (2nd-order polynomial) to the CO2 concentration values in 1958-2019 for April and May separately.
Once the trend is removed, your samples can be viewed as residuals (deviations) from the expected long-term trend of the atmospheric CO2. Conduct a residual analysis of the detrended April and May samples and provide your assessment of the residuals following the steps suggested in the Linear Regression section.
2. Now conduct a hypothesis test using confidence interval(s) and Student’s t distribution. Provide the null and alternative hypotheses for the test. If you choose the t statistic as the test statistic and a 95% confidence level to determine acceptance/rejection, what are the critical values for the null distributions in April and May? What is the result of your test?
3. What is your assessment of the atmospheric CO2 concentration in April and May of 2020 based on the test statistics in the previous question? Are they within the range of your intuitive expectations? A large number of academic and general articles about the expectations and interpretations around the observed atmospheric CO2 have been published online this year. Provide your assessment and interpretation of the test statistic, supported by your choice of relevant articles.

[8] https://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record.html
[9] https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv

6 Was July 2019 the Hottest Month in Recorded History in the Northern Hemisphere? (13 marks)

There was a claim that July of 2019 may be the hottest month in recorded history in the northern hemisphere [10]. In this question, you will test whether the claim is correct using daily air temperature measured in or near a selection of major cities in the northern hemisphere.
Go to the Global Historical Climatology Network (GHCN) [11] site of the National Oceanic and Atmospheric Administration (NOAA) of the US and download daily July temperatures for two cities chosen from Paris (France), Moscow (Russia), Berlin (Germany), Beijing (China), Tokyo (Japan), and New York (USA). Your chosen weather stations should have records back to the year 1951 or earlier. Exclude the months that include less than 50% of measurements.

1. Suggest a test statistic that can be used to examine the extremity of monthly temperature using the downloaded daily temperature values as input. Choose one or more input variables from daily maximum, daily minimum, and daily average temperature values. Justify your choice of the test statistic based on existing works in science/engineering publications (e.g., articles, reports, online material). Examine whether your sample statistic shows that July 2019 was the hottest month in your chosen locations.
2. We want to check how extreme the July temperatures of the most recent 5 years have been within the recorded period of temperature in the cities you chose. Use the test statistic chosen in the previous question, averaged over July in 2015-2019, and quantitatively assess the extremity of the July temperature of 2015-2019 against all July temperatures (defined by your monthly test statistic) since the beginning of the record. You can use either a parametric or a non-parametric method for the quantitative assessment. Justify your choice of the method based on your analysis of the data distribution.

7 Exploratory Data Analysis and Linear Regression (20 marks)

Nutrient/sediment concentration vs. stream discharge relationships have been widely used as a clue to explore hydro-chemical processes that control runoff chemistry. Here we examine sediment concentration vs.
stream discharge relationships using linear regression. This question investigates the correlation between instantaneous streamflow and the Total Kjeldahl Nitrogen concentration (TKN) collected from the site “222101, Curdies River at Curdie” located in the Otway Coast of Victoria. The data file ‘Q TKN data.csv’ contains three columns: the monitoring date (column 1), catchment-averaged streamflow (mm/d, column 2) and TKN (mg/L, column 3).

[10] https://doi.org/10.1029/2019EO130843
[11] https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn

1. Calculate the Pearson correlation coefficient and Spearman’s rank correlation coefficient for the paired Q vs. TKN data. Then, calculate the same correlation coefficients for the paired natural log(Q) and natural log(TKN), i.e., logarithms with base e. What do these values tell you about the relationship between TKN and Q? Suggest which paired data (raw data pair vs. log(data) pair) is more suitable for constructing a linear regression, and justify your selection. In the subsequent questions, Q vs. TKN means their relationship based on your chosen pair.
2. Based on your selection of the paired data, plot Q vs. TKN concentrations and fit a simple linear regression: i) report the regression parameters and the goodness-of-fit of the regression; ii) using the linear regression developed, predict the TKN concentration expected when discharge reaches 2 mm/d.
3. Calculate the 95% confidence intervals for i) the conditional mean and ii) the prediction. Construct a figure showing the linear model and the confidence intervals with the observed data values, and discuss the difference between the confidence intervals for the conditional mean and the prediction (e.g., how should the intervals be interpreted and used?).
4. What is the pattern of the residuals of your developed Q∼TKN model? Provide your assessment of the residuals, including an assessment of autocorrelation (serial correlation) in the residuals.
5.
Based on the plot you created in Question 7.3, check what fraction of the observed data values actually falls within the 95% prediction interval (e.g., you can write a for loop in Matlab to check whether each individual observation falls within the 95% interval). Given the results and the pattern/distribution of the residuals, do you recommend applying this linear model to predict further TKN concentrations? Did you find any specific range of Q where your model struggles to predict TKN? (For this last question, compare the predicted TKN with the observed TKN values in the original (raw data) space if you built your model in the log-transformed space.)
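
The coverage check described in the last item can be sketched as follows. This is a minimal illustration in Python rather than the Matlab required for submission. It assumes a simple least-squares line, the textbook interval half-widths t * s_e * sqrt(1/n + (x0 - xbar)^2 / Sxx) for the conditional mean and t * s_e * sqrt(1 + 1/n + (x0 - xbar)^2 / Sxx) for a prediction, and a critical t value supplied by the user (for example from a table or Matlab's tinv). All function names and the data are hypothetical.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit y = a + b*x.
    Returns (a, b, s_e, x_mean, sxx)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_mean - b * x_mean
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))  # residual standard error, n - 2 dof
    return a, b, s_e, x_mean, sxx


def interval_half_widths(x0, n, s_e, x_mean, sxx, t_crit):
    """Half-widths of the conditional-mean and prediction intervals at x0."""
    leverage = 1.0 / n + (x0 - x_mean) ** 2 / sxx
    mean_hw = t_crit * s_e * math.sqrt(leverage)
    pred_hw = t_crit * s_e * math.sqrt(1.0 + leverage)
    return mean_hw, pred_hw


def prediction_coverage(x, y, t_crit):
    """Fraction of observations that fall inside the prediction interval."""
    a, b, s_e, x_mean, sxx = fit_line(x, y)
    inside = 0
    for xi, yi in zip(x, y):
        _, pred_hw = interval_half_widths(xi, len(x), s_e, x_mean, sxx, t_crit)
        if abs(yi - (a + b * xi)) <= pred_hw:
            inside += 1
    return inside / len(x)


# Hypothetical data (not the Curdies River record): a line plus
# alternating +/- 0.5 noise.
x = [float(i) for i in range(1, 11)]
y = [2.0 * xi + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
T_CRIT = 2.306  # two-sided 95% critical value of Student's t with 8 dof
print("slope, intercept:", fit_line(x, y)[1], fit_line(x, y)[0])
print("coverage:", prediction_coverage(x, y, T_CRIT))
```

If the regression was built on log-transformed Q and TKN, the same check would be applied to the transformed values, with the back-transformed predictions compared against raw TKN separately.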