- May 15, 2020

Homework 1: Combinatorics & Empirical Distributions UC Irvine CS177: Applications of Probability in Computer Science Due on October 10, 2019 at 11:59pm Question 1: (20 points) A system is called “k out of n” if it functions reliably when at least k of its n components are working; in other words, the system uses redundancy to ensure robustness to failure. As an example, consider a redundant array of inexpensive disks (RAID) in which one uses n disks to store a collection of data, and as long as at least k are functioning the data can be correctly read. Suppose that disks fail independently, and that the probability of an individual disk failing in a one-year period is p. a) Suppose we have a n = 3 disk array which can survive one failure (k = 2). What is the expected number of disk failures in one year? As a function of p, what is the probability that the whole array will continue to function without any data loss after one year? b) Suppose we have a n = 5 disk array which can survive two failures (k = 3). What is the expected number of disk failures in one year? As a function of p, what is the probability that the whole array will continue to function without any data loss after one year? c) Suppose p = 0.05. Which is more reliable (has greater probability of not losing any data in one year), the RAID from part (a) or part (b)? d) Suppose p = 0.65. Which is more reliable, the RAID from part (a) or part (b)? Question 2: (20 points) Consider a social network that allows accounts to be secured with a 6-digit passcode (any sequence of exactly six digits between 0-9 is valid). Assume the network has m users includ- ing you, and that all users choose one of the valid 6-digit passcodes uniformly at random. A user’s passcode is considered safe if no other user has the same passcode. a) As a function of m, what is the probability that your own passcode is safe? b) How many users must there be for there to be a 50% or greater chance that your own passcode is not safe? c) As a function of m, what is the probability that all users have a safe passcode? d) How many users must there be for there to be a 50% or greater chance that at least one user’s passcode is not safe? 1 Question 3: (20 points) Consider a set of n people who are members of an online social network. Suppose that each pair of people are linked as “friends” independently with probability 1/2. We can think of their relationships as a graph with n nodes (one for each person), and an undirected edge between each pair that are friends. A clique is a fully connected subset of the graph, or equivalently a subset of people for which all pairs are friends. a) A clique of size 2 is simply a pair of nodes that are linked by an edge. Find the expected number of edges as a function of the number of nodes, n. What is the expected number of friend relationships among n = 10 people? b) A clique of size 3 is a triplet of nodes within which all three pairs are linked by an edge. Find the expected number of 3-cliques as a function of the number of nodes, n. What is the expected number of 3-cliques among n = 10 people? c) Larger cliques may occur involving groups of nodes of any size k. Derive a general formula for the expected number of cliques of any size 2 ≤ k ≤ n as a function of the number of nodes, n. What is the expected number of cliques of size k = 4 among n = 10 people? Question 4: (40 points) We will now analyze some data collected by observing the famous “Old Faithful” geyser in Yellowstone National Park. We define random variable S to be the time an eruption lasts, and random variable T to be the “waiting time” until the next eruption. These are clearly continuous random variables, but we do not precisely know their true distribution. Instead we have a dataset with n = 272 independent observations (si, ti), i = 1, . . . , 272, of the eruption time si and subsequent waiting time ti. See Figure 1 for a plot of this data. In the following questions, we compute various quantities using the empirical distribution of the data. The empirical distribution of eruption time and waiting time can be represented by a probability mass function pST (s, t) which places probability 1/n on each of the n data points, and probability 0 on the continuous range of other (s, t) values. Under this distribu- tion, the expected values of S and T then take the following simple form: E[S] = 1 n n∑ i=1 si, E[T ] = 1 n n∑ i=1 ti. a) The variance of random variable S equals Var[S] = E[S2] − E[S]2. Give formulas for computing Var[S] and Var[T ] under the empirical distribution. Use Python’s numpy.sum function to write your own code that computes these variances, and report their values. Hint: Various definitions of the “sample variance” can be found in statistics references, and they are not all equivalent to the variance of the empirical distribution. b) The cumulative distribution of S equals FS(s) = P (S ≤ s), where the probability is under the empirical distribution. Find eruption times s¯1, s¯2, s¯3 such that FS(s¯1) = 0.25, FS(s¯2) = 0.50, FS(s¯3) = 0.75. Using the cumulative distribution of T , also find waiting times t¯1, t¯2, t¯3 such that FT (t¯1) = 0.25, FT (t¯2) = 0.50, FT (t¯3) = 0.75. Hint: One solution would be to use Python’s numpy.argsort function. 2 1.5 2 2.5 3 3.5 4 4.5 5 5.5 40 50 60 70 80 90 100 Eruption Time (minutes) W ai tin g Ti m e to N ex t E ru pt io n (m inu tes ) Figure 1: A “scatter plot” of the observations of Old Faithful’s eruption time (horizontal axis) and waiting time to the next eruption (vertical axis). Each point is one of the n = 272 observations. Consider two new random variables. Let X indicate whether the eruption time S is “short” or “long”: X = 0 if S ≤ 3.5, and X = 1 if S > 3.5. Let Y indicate whether the waiting time T is “short” or “long”: Y = 0 if T ≤ 70, and Y = 1 if T > 70. c) Using the empirical distribution of S and T , determine and report the joint probabil- ity mass function pXY (x, y). Also determine and report the marginal probability mass functions pX(x) and pY (y). d) Are the random variables X and Y independent? If not, is the amount of dependence weak or strong? Clearly justify your answer using the probability mass functions from part (c). 3