CSCI 2520 Data Structures and Applications Assignment Four Deadline: ​23:55, May. 16, 2020 Total Marks: 100 Submission: In this Assignment, you need to answer question 1 and question 2 in one pdf file. For question 3 and 4, you need to provide .c file for each question, which can be compiled and give the correct answer, the name should be q3.c and q4.c respectively. Then compress all 4 files (one pdf file for q1 & q2 and three .c files for q3 and q4 respectively) as one zip named as ​your_student_id_assign4.zip ​ and submit it via Blackboard. 1. Shortest-Path Algorithm (20 marks) Use Dijkstra’s algorithm to calculate the shortest distance from V​S to all vertices and the shortest path from V​S​ to V​F​. Note: Please use the table used in ​lecture​ (Shortest-Path Algorithm) to represent each step. Example: Initial Step Step 1 2. Minimum Spanning Tree (20 marks) Given an undirected graph, find the minimum spanning tree. The number of the edge connecting vertex a and vertex b is ab. Like the edge with weight 1 in the graph, its number is 02. (1) Use Prim’s Algorithm starting from vertex 0 to find the minimum spanning tree, write the number of the edge added into the minimum spanning tree in turn (if there are many edges satisfy requirement, choose the edge with minimum number). Please use the table used in lecture​ (Prim’s Algorithm) to represent each step. Example: Initial Step Step 1 (2) Use Kruskal’s Algorithm and write the number of the valid edge with minimum merging cost in turn (if there are many edges satisfy requirement, choose the edge with minimum number). 3. ​The diameter of a graph​ (30 marks) Definition: ​The distance d(u, v) between two vertices u and v is defined as the length of a shortest path from u to v. The diameter of a graph is the greatest distance between any pair of vertices. Now give you a undirected graph, get the diameter of the graph. The first line of input is the number of nodes ​N and edges ​M. ​Next ​M lines, contains three positive integers ​x, y, w​ means an undirected edge between vertex x and vertex y with length w. (The data guarantees that the graph is connected.) Input: 5 7 1 2 3 1 4 4 4 5 2 2 3 3 3 5 1 1 3 2 3 4 1 Output: 4 4. The power of the team. (30 marks) We want to build the best team to solve a problem. Now there are ​n great people on the waiting list. Everyone has different strengths and weaknesses. The power of the team is the sum of the strength of the teammate multiplied by the minimum weakness of the team.(​Cannikin Law​). Since the budget is not sufficient, we can only choose at most ​k people. You need to solve this problem by finding the maximum power of the team you can have.(Hint: use heap to solve it) The first line of input is​ n and k. (1<= k <= n <= 10000) The following two lines are the strengths and weaknesses of the ​n people. All the values are less than 10000. For example: Input: 5 3 1 2 3 4 5 5 4 3 2 1 Output: 18 (There are two different ways to get the answer. You can choose the 1, 2, 3 and 2, 3, 4 )

The University of Sydney Page 1 QBUS6860 Visual Data Analytics Weekly Assignment 8 Dr Demetris Christodoulou Discipline of Accounting MEAFA Research Group http://sydney.edu.au/business/research/meafa The University of Sydney Page 2 Weekly Assignment 8 o The Weekly Assignment 8 asks to investigate the changing relation over time between income per capita and life expectancy o Theory suggests that as income per capita for a country increases, there would be more investment in education, public health, technology and pro-social initiatives in that country, thus improving living conditions overall and life expectancy at birth o More specifically, the graph objective to be analysed is defined as: “The changing relation between GDP per capita and life expectancy from 1961 to 2017” Hint: think of GDP per capita as an explanatory variable that can help explain the variation in life expectancy The University of Sydney Page 3 Weekly Assignment 8 The data is available from the World Bank Open Data initiative. o The life expectancy at birth by jurisdiction is found here: https://data.worldbank.org/indicator/sp.dyn.le00.in o The GDP income per capita (in constant 2010 UDD) by jurisdiction is found here: https://data.worldbank.org/indicator/NY.GDP.PCAP.KD o These datasets contain jurisdiction-level observations for life expectancy and GDP per capita o From these links you can download the CSV or Excel files – look for these links: o You are required to manage the data to prepare it for analysis. You are also required to validate the data as in every assignment The University of Sydney Page 4 Weekly Assignment 8 o We do not know in advance the answer to the graph objective question and we must investigate the changing relation between the two variables using a visual mining approach. o Thus, you are required to use your knowledge from Week 9 to analyse the graph objective using appropriate visual mining methods. Hint: not all visual methods are useful for this assignment. o Do not use any transformations. The data must be analysed exactly as provided by the World Bank links in their nominal values The University of Sydney Page 5 Weekly Assignment 8 o You are required to produce data just one graph that will show the changing relation between the two variables over time. This data graph must qualify as a high-quality data graph as discussed in weeks 1-6. o The graph objective is multidimensional involving observations on jurisdictions, years, GDP per capita and life expectancy. It is up to you how to design the graph, but you may want to consider using interactive features or using the motion retinal variable o You are not required to do any particular research on this research question, but you may want to do so if it helps you understand the question better The University of Sydney Page 6 Weekly Assignment 8 o You will be evaluated on the quality of data management, data validation, and the quality of the data graph o You will be evaluated on the application of appropriate visual data mining methods o You need to submit the Tableau workbook (.twbx), a Word or PDF report, and the managed Excel or CSV dataset

− 1 : H0] ≤ α, ⇔1− α ≤ P [X ≤ c− 1 : H0] . From the table at the front of the exam booklet we can see that this state- ment holds true for c−1 = 12, 13, . . .. The lowest critical value (leading to the most powerful test) is c = 13. Explicitly, our rejection rule be- comes: ‘Reject the hypothesis that the suspect is innocent if he is seen at 13 or more crime scenes in the past year’. 5 Marks (iii) A suspect seen at 12 crime scenes would not trigger the rejection rule. We do not reject the ‘innocent’ hypothesis. 1 Mark (iv) The p-value is the probability that new data distributed according to the specifications of the null hypothesis would take a value even more ex- treme than the value we have just observed, i.e. p = P [X ≥ 12 : H0] = 1− P [X ≤ 11 : H0] = 1− 0.98 = 0.02 5 Marks 13 SOLUTIONS: MAT00033I/MAT00035I Remarks. A similar question, taking from Dekking et. al., was looked at in a revision class towards the end of term. Question is relevant to TD1 and TD2. Total: 30 Marks 3. (a) (i) The model scores we looked at in lectures rewarded models which did well at predicting previous observations having been optimized to fit those same observations. In practice this was quantified by the maximum value of the likelihood for the model parameters. 3 Marks (ii) The models were punished if they had many parameters, or degrees of freedom. 3 Marks (iii) The R2 statistic arises as the fraction of the variance of the response variable that can be explained by the model. It is computed as R2 = n−1 ∑n i=1(yˆi − y¯)2 n−1 ∑n i=1(yi − y¯)2 . 4 Marks (b) (i) The medic would be interested in the ANOVA because it will help them identify covariates that are useful for predicting the age at which at individual will die. 3 Marks (ii) The ANOVA procedure assumes that the model error terms are iid nor- mal. We also need to provide an order in which the coefficients for co- variates are to be tested. Given such an order, row k contains information relevant to the test between H0 : βk = 0 ∩ βj 6= 0, j = 1, . . . , k − 1, H1 : βk 6= 0 ∩ βj 6= 0, j = 1, . . . , k − 1, i.e. we test the null hypothesis that the kth coefficient is zero against the alternative hypothesis that it is non-zero, where both hypotheses assume that preceding coefficients are non-zero. 4 Marks (iii) The distributions of the test statistics for each test/row are F-distributions given the respective null hypotheses are true. We use quantiles of these distributions to calculate critical values for the tests. The F-distributions are parameterized by two degrees of freedom: one relating to the num- ber of degrees of freedom in the part of the model being tested, the other relating to the number of degrees of freedom for error (given the full model). We denoted these quantities pk and n− p in lectures. 3 Marks 14 SOLUTIONS: MAT00033I/MAT00035I (c) (i) The unbiased estimate for the variance of the residuals can be computed by dividing the sum of squares for the residuals by the number of de- grees of freedom for them. Both of these quantities can be found in the ANOVA table. σˆ2 = 1 n− p n∑ i=1 (yi − yˆi)2 = Sum of squares of residuals degrees of freedom for residuals = 14.2604 59 =0.2417 3 Marks (ii) The R2 statistic is computed by dividing the sum of squares attributable to the model by the total sum of squares. The former is the sum of the sums of squares attributable to each covariate. The latter is the sum of the sums of squares for the the covariates plus the sum of squares attributable to the errors R2 = Sum of squares of fitted values Total sum of squares = 3.0696 + 6.1795 + 0.3129 + 1.5506 3.0696 + 6.1795 + 0.3129 + 1.5506 + 14.2604 =0.438. 3 Marks (iii) The missing values are Mean Sq4 = Sum Sq4 Df4 = 1.5506 1 = 1.5506, Mean Sqresid = Sum Sqresid Dfresid = 0.2417 (which we have already computed), and F = Mean Sq4 Mean Sqresid = 1.5506 0.2417 = 6.4154. 15 SOLUTIONS: MAT00033I/MAT00035I The completed table is as follows Analysis of Variance Table Response: y Df Sum Sq Mean Sq F value Pr(>F) x_1 1 3.0696 3.0696 12.7001 0.0007317 *** x_2 1 6.1795 6.1795 25.5665 4.437e-06 *** x_3 1 0.3129 0.3129 1.2946 0.2597951 x_4 1 1.5506 1.5506 6.4153 0.0139934 * Residuals 59 14.2604 0.2417 — Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1 4 Marks Remarks. The ANOVA material tested here is routine. The question requires students to have remembered the meaning and structure of ANOVA calculation without asking for many calculations. Total: 30 Marks 4. (a) The SSE is the sum of squared differences between the model’s predic- tions for data quantities and the values for those quantities that are actu- ally observed. We write this formally as SSE(β˜) = ∑ (yi − xTi β˜)2 = (y −Xβ˜)T (y −Xβ˜). 5 Marks (b) The OLS estimate for the true vector of regression coefficients is βˆ = (XTX)−1XTy. The estimate is optimal insofar as minimizing the SSE βˆ = argminβ˜ ( SSE(β˜) ) . . 5 Marks 16 SOLUTIONS: MAT00033I/MAT00035I (c) (i) Using the vector notation, the difference between the two SSEs is ∆SSE =SSE(βˆ + δ)− SSE(βˆ) =(y −X(βˆ + δ))T (y −X(βˆ + δ)) − (y −Xβˆ)T (y −Xβˆ) =((y −Xβˆ)−Xδ)T ((y −Xβˆ)−Xδ) − (y −Xβˆ)T (y −Xβˆ) =(((( (((( ((( (y −Xβˆ)T (y −Xβˆ)− 2(y −Xβˆ)TXδ + (Xδ)T (Xδ) −((((((( (((( (y −Xβˆ)T (y −Xβˆ) =− 2(y −Xβˆ)TXδ + (Xδ)T (Xδ) 5 Marks (ii) Xδ is a vector of length n just like yˆ = Xβˆ is. The product δTXTXδ − (Xδ)TXδ is the squared modulus of this vector, which cannot be a neg- ative quantity. Equivalently, we can write (Xδ)TXδ = n∑ i=1 (Xδ)2i and see that the product is a sum of squared (real) quantities. 5 Marks (iii) (y −Xβˆ)TX =(y −X(XTX)−1XTy)TX =(yT − yTX(XTX)−1XT )X =yT (X −X(XTX)−1XTX) =yT (X −X) =yT0 = 0 5 Marks (d) The optimality result means that if we went back and tried to predict the previously observed data with our linear model, the model using the OLS es- timates would achieve a smaller sum of squared residuals. We cannot prove that the OLS estimates will lead to a smaller sum of squared residuals for unobserved data in the future. 5 Marks Remarks. The question tests TD3. Final question is more open-ended than most. Total: 30 Marks 17 SOLUTIONS: MAT00033I/MAT00035I 5. (a) (i) The marginal probabilities for the A-level grade categories are (0.100, 0.205, 0.695). The marginal probabilities for the degree classifications are (0.0625, 0.2550, 0.4950, 0.1875). 3 Marks (ii) The joint probabilities consistent with the assumption of independence follow from multiplying the marginal probabilities. The expected counts follow from multiplying the probabilities by n = 400. Doing this, we arrive at the figures 3rd 2:2 2:1 1st F-E 2.50 10.20 19.80 7.50 C-D 5.12 20.91 40.59 15.38 B-A 17.38 70.89 137.61 52.12 3 Marks (iii) Our assumptions are that, collectively, the counts are well described by a multinomial distribution. Pearson’s χ2-test further assumes that the test statistic t(O) = ∑ i,j (Oij − Eij)2 Eij is approximately χ2-distributed with (ni − 1)(nj − 1) degrees of free- dom. Our test hypotheses concern the probabilities of any individual falling into a particular pair of categories. More specifically, our null hy- pothesis states that all the joint probabilities pij are the products of the marginal probabilities pi,· and p·,j . Our alternative hypothesis states that this relationship is violated for at least one pair of categories, i.e. H0 : pij = pi,·p·,j ∀i, j, H1 : ∃(i, j) such that pij 6= pi,·p·,j. The test’s rejection rule tells us to reject the null hypothesis when the test statistic exceeds a critical value c. The result here is that t(O) = 15.12 > c = χ22×3 = 12.59 so we do reject the null hypothesis. 8 Marks (b) (i) Pearson’s χ2-test is based on approximations that improve as the ex- pected counts for every category increase. When the expected counts are low we cannot be as confident that the test is well calibrated in terms of its significance. 4 Marks 18 SOLUTIONS: MAT00033I/MAT00035I (ii) When we are unable to test an
d criticize the null hypothesis in as much detail it becomes harder to reject the null when it is false. It is thus harder to avoid type II error. We would therefore expect the test with the merged categories to have less power than the first test. 4 Marks (iii) A test that rejects the null no matter the values taken by the data will al- ways reject the null, so will always reject the null when the null is false. Its power is 1. 4 Marks (iv) The Neyman-Pearson lemma cannot help your colleague here. It would be able to tell her that the likelihood ratio is more powerful if the other test had the same significance level, but it doesn’t. 4 Marks Remarks. Opportunities to access TDs 1 and 2. Questions (c)(ii) and (c)(iv) require genuine understanding of results covered in lectures. Total: 30 Marks 19

tclair, California 5 Nick Williams RF L/L 6’3″ 195Lbs 26yrs Galveston, Texas Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 1 Enter the number to search: 37 37 Odubel Herrera CF L/R 5’11” 206Lbs 28yrs Zulia, Venezuela Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 1 Enter the number to search: 52 52 Jose Alvarez P L/L 5’11” 190Lbs 30yrs Barcelona, Venezuela Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 1 Enter the number to search: 5 5 Nick Williams RF L/L 6’3″ 195Lbs 26yrs Galveston, Texas Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 1 Enter the number to search: 99 Sorry, no matches! Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 2 Enter the name to search: Vince 21 Vince Velasquez P R/R 6’3″ 203Lbs 27yrs Montclair, California Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 2 Enter the name to search: Maria Sorry, no matches! Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 2 Enter the name to search: Andrew 15 Andrew Knapp C B/R 6’1″ 204Lbs 28yrs Roseville, California 22 Andrew McCutchen RF R/R 5’11” 195Lbs 33yrs Fort Meade, Florida Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 3 Enter the state or (if not US) country: Pennsylvania 9 Phil Gosselin 2B R/R 6’1″ 200Lbs 31yrs West Chester, Pennsylvania Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 3 Enter the state or (if not US) country: Venezuela 52 Jose Alvarez P L/L 5’11” 190Lbs 30yrs Barcelona, Venezuela 16 Cesar Hernandez 2B B/R 5’10” 167Lbs 29yrs Valencia, Venezuela 37 Odúbel Herrera CF L/R 5’11” 206Lbs 28yrs Zulia, Venezuela Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 3 Enter the state or (if not US) country: Dominican Republic 58 Seranthony Dominguez P R/R 6’1″ 185Lbs 25yrs Esperanza, Dominican Republic 7 Maikel Franco 3B R/R 6’1″ 229Lbs 27yrs Azua, Dominican Republic 50 Hector Neris P R/R 6’2″ 227Lbs 30yrs Villa Altagracia, Dominican Republic 12 Juan Nicasio P R/R 6’4″ 252Lbs 33yrs San Francisco de Macoris, Dominican Republic 2 Jean Segura SS R/R 5’10” 220Lbs 29yrs San Juan, Dominican Republic Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 3 Enter the state or (if not US) country: Australia Sorry, no matches! Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 3 Enter the state or (if not US) country: Sorry, no matches! Enter 1 to search on player number, 2 to search on name, 3 to search on player birthplace, or 0 to quit: 0 Goodbye! 