- January 8, 2021

QUESTION PAPER TEMPLATE

MSc Examination
Friday 8th May 2014 14:30 – 17:00
ECS764 Applied Statistics
Duration: 2 hours 30 minutes

YOU ARE NOT PERMITTED TO READ THE CONTENTS OF THIS QUESTION PAPER UNTIL INSTRUCTED TO DO SO BY AN INVIGILATOR

Answer FOUR questions

If you answer more questions than specified, only the first answers (up to the specified number) will be marked. Cross out any answers that you do not wish to be marked.

Calculators are/are not permitted in this examination. Please state on your answer book the name and type of machine used.

Complete all rough workings in the answer book and cross through any work that is not to be assessed.

Possession of unauthorised material at any time when under examination conditions is an assessment offence and can lead to expulsion from QMUL. Check now to ensure you do not have any notes, mobile phones or unauthorised electronic devices on your person. If you do, raise your hand and give them to an invigilator immediately. It is also an offence to have any writing of any kind on your person, including on your body. If you are found to have hidden unauthorised material elsewhere, including toilets and cloakrooms, it will be treated as being found in your possession. Unauthorised material found on your mobile phone or other electronic device will be considered the same as being in possession of paper notes. A mobile phone that causes a disruption in the exam is also an assessment offence.

EXAM PAPERS MUST NOT BE REMOVED FROM THE EXAM ROOM

Examiners: Steve Uhlig

© Queen Mary, University of London, 2013

Question 1 – Descriptive statistics & probability distributions

a) Consider the following popular centrality statistics: mode, mean, median and mid-range. Explain the strengths and weaknesses of each of them to describe the centrality of a normal random variable, depending on the number of data samples available.
Answer: When many samples are available, the choice does not matter: all of them are more or less equivalent, except the mid-range, which is always unstable and therefore a poor centrality statistic. With a limited number of sample points, the mode is unlikely to be meaningful at all. The mean will be biased but should still be meaningful. The median is the least biased of all centrality statistics. The mid-range is always the most biased.

[10 marks]

b) Assume that a set of points, distributed according to a normal distribution, suffers from a few unusually large or small values (referred to as “outliers”). Explain in what way such “outliers” affect the variance, and why this is the case.

Answer: Outliers in the form of very large values increase the variance. Because the variance is a second-order statistic (the square of deviations around the mean), any large deviation from the mean increases the variance.

[5 marks]

c) Explain the main difference between exponential and heavy-tailed distributions. Illustrate this difference by explaining how well the first (percentile 25) and third quartiles (percentile 75) describe them.

Answer: The main difference between exponential and heavy-tailed distributions is the decay of the tail probabilities. Exponential distributions have an exponential tail, meaning that the probability of observing a large value decays exponentially fast. A heavy-tailed distribution, on the other hand, has a heavier tail, in the sense that the probability of observing a large value x is proportional to $x^{-a}$, where a is a positive constant. Exponential distributions are reasonably well described by the first and third quartiles, as their deviation around the mean is limited, and these quartiles will likely capture most of the mass of the distribution. Heavy-tailed distributions, on the other hand, are best described by high quantiles, e.g., the percentile 95 or 99, so the first and third quartiles will not sample the large values of a heavy-tailed distribution.
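The contrast in the answer to c) can be checked numerically. The following is a minimal sketch, not part of the original paper: it assumes an exponential distribution with rate 1 and a Pareto distribution with shape a = 1 and minimum value 1 as the heavy-tailed example, and evaluates their closed-form quantile functions. The function names and parameter values are illustrative choices.

```python
import math


def exponential_quantile(p: float, rate: float = 1.0) -> float:
    """Inverse CDF of an exponential distribution with the given rate."""
    return -math.log(1.0 - p) / rate


def pareto_quantile(p: float, a: float = 1.0, x_min: float = 1.0) -> float:
    """Inverse CDF of a Pareto distribution, whose tail P(X > x) decays like x**(-a)."""
    return x_min * (1.0 - p) ** (-1.0 / a)


def tail_ratio(quantile) -> float:
    """How far the 99th percentile lies beyond the third quartile."""
    return quantile(0.99) / quantile(0.75)


# Assumed example distributions: exponential (rate 1) and Pareto (a = 1).
for name, quantile in [("exponential", exponential_quantile),
                       ("Pareto (a=1)", pareto_quantile)]:
    q25, q75, q99 = quantile(0.25), quantile(0.75), quantile(0.99)
    print(f"{name:>13}: Q25={q25:.2f}  Q75={q75:.2f}  Q99={q99:.2f}  "
          f"Q99/Q75={tail_ratio(quantile):.1f}")
```

Under these assumptions the 99th percentile is only about 3.3 times the third quartile for the exponential distribution, but 25 times the third quartile for the Pareto distribution, which is why the quartiles say little about the large values of a heavy-tailed distribution.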
[10 marks]

Question 2 – Fitting distributions

a) Explain the three main steps of the methodology through which some statistical variable (e.g., data) will be fit to a given probability distribution. Describe each of the 3 steps and how they relate to each other.

Answer: The 3 steps in fitting a probability distribution are: (1) finding the distribution from which the data might be drawn, (2) fitting the parameters of this distribution, and (3) evaluating the quality of the fit. The first step requires prior knowledge about the process that generates the data or some guess about the likely distributions. The second step builds on the first and estimates the most appropriate values of the parameters of the considered distribution (which could be one or multiple). Finally, the last step is used to quantify the distance between the data and the fitted distribution.

[15 marks]

b) Explain the purpose of the QQ-plot in the methodology of fitting a probability distribution to some statistical variable (e.g., data). You may illustrate your answer with a diagram showing an example of a QQ-plot.

Answer: The QQ-plot is a graphical technique to compare how the quantiles of two given distributions relate to each other. It consists of plotting on the x-axis the values of the quantiles of the first distribution, and on the y-axis the values of the quantiles of the second distribution. The purpose of the QQ-plot is to assess how similar two empirical distributions are, by visually comparing their quantiles. If the two distributions are similar, their quantiles should fall on the diagonal of the plot.

[10 marks]

Question 3 – Hypothesis testing

a) Explain the notion of a statistical test in the case of a one-sample test, i.e., when some data is compared to a known population. Describe the respective roles of the null hypothesis (H0), the test statistic, and the p-value in the outcome of the statistical test.
In particular, explain when the null hypothesis will be rejected.

Answer: A statistical test is a procedure to test a hypothesis about a set of numerical values, i.e., data. A statistical test relies on a null hypothesis (H0), i.e., a statement about the data that is tested. Sometimes, an alternative hypothesis (HA, often the complement of H0) will also be stated, which is hoped to be true in case the null hypothesis is rejected. A one-sample test relies on a “test statistic” that provides a distance function between the data and the known population that defines H0. The test statistic is specifically selected or defined in such a way as to quantify, within observed data, behaviors that would distinguish H0 from HA. Depending on the size of the data, a given distance of the test statistic is translated into a probability, called the p-value. The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis H0 is true. If the p-value is below a pre-defined threshold, the null hypothesis will be rejected.

[15 marks]

b) A statistical test does not accept the null hypothesis, but either rejects or fails to reject it. Imagine that you believe that the null hypothesis should be rejected, but with a given dataset (of a given size) you fail to reject it. Explain two strategies that you may pursue to reject the null hypothesis, but without changing the null hypothesis.

Answer: The first strategy is to increase the size of the dataset and hope that the distance will increase and the p-value will decrease. The second strategy is not to change the size of the dataset, but to rely on a test statistic that is stricter and will therefore require a smaller distance between the data and the known population, e.g., the KS test, which is very strict.
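As an illustration of the roles of the test statistic, the p-value, and the dataset size, here is a minimal sketch, not part of the original paper. It assumes a two-sided one-sample z-test against a known population with mean `mu0` and standard deviation `sigma`. The z-test, the threshold `ALPHA`, and all numeric values are hypothetical choices made for the example.

```python
import math
from statistics import NormalDist


def z_test_p_value(sample_mean: float, mu0: float, sigma: float, n: int) -> float:
    """Two-sided p-value of a one-sample z-test against a known population."""
    # Test statistic: distance between the sample mean and the H0 mean,
    # measured in standard errors (sigma / sqrt(n)).
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # p-value: probability, under H0, of a statistic at least as extreme as z.
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


ALPHA = 0.05  # hypothetical pre-defined rejection threshold

# Same observed sample mean, two assumed dataset sizes.
for n in (25, 100):
    p = z_test_p_value(sample_mean=0.2, mu0=0.0, sigma=1.0, n=n)
    decision = "reject H0" if p < ALPHA else "fail to reject H0"
    print(f"n={n:>3}: p-value={p:.4f} -> {decision}")
```

With these assumed numbers, the same observed difference of 0.2 fails to reject H0 at n = 25 (p-value about 0.32) but rejects it at n = 100 (p-value about 0.046), which illustrates the first strategy in the answer to b).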
[10 marks]

Question 4 – Time-series analysis

a) Explain the auto-correlation function, and its use in time-series analysis.

Answer: The auto-correlation function is the correlation between values of the process at different times s and t, as a function of the two times s, t or of the time difference t-s. Its formula is the following:

$$R(s, t) = \frac{E[(X_t - \mu)(X_s - \mu)]}{\sigma^2}$$

where $E$ is the expectation, $\mu$ is the mean, and $\sigma$ is the standard deviation. The auto-correlation is used to understand the dependence over time within time-series. Auto-correlation decays (in absolute value) with time lag, so plotting how this decay depends on the time lag gives insight into the properties of the time-series.

[10 marks]

b) In time-series analysis, one often relies on decomposition, by which the time-series is decomposed into a “trend” and a “remainder”. The remainder component is often expected to be uncorrelated, i.e., close to random noise. Describe the auto-correlation of perfect random noise.

Answer: The auto-correlation of perfect random noise should be 1 at lag 0 (as is the case for all time-series) and should be close to 0 for any non-zero lag. Close to 0 actually means within the 1/sqrt(n) confidence intervals, where n is the length of the time-series.

[5 marks]

c) Stationarity is a fundamental concept in time-series analysis. Give one of its multiple definitions, and give an example of a stationary process and of a non-stationary process.

Answer: Stationarity has multiple definitions (weak, strong, and statistical). Intuitively, stationarity means that a set of statistics of the time-series do not vary over time. More formally, it means that the probability laws that govern the process do not change over time. For example, the mean or the variance should be constant over time for a time-series to be considered second-order stationary.
An example of a stationary time-series is white noise or a moving average, and an example of a non-stationary time-series is the random walk.

[10 marks]

End of Paper