School of Computer Science
Uwe Roehm

DATA3404: Data Science Platforms, 1. Sem./2020
Big Data Analysis Assignment
Group Assignment (15%)
06.05.2020

Introduction

This is the practical assignment of DATA3404, in which you have to write a series of Apache Spark programs to analyse an air traffic data set and then optimise your programs for scalability on increasing data volumes. We provide you with the schema and dataset. Your task is to implement the three given data analysis tasks, to evaluate their performance, and to decide which optimisations are best suited to improve each task's performance.

You will find links to online documentation, data, and hints on the tools and schema needed for this assignment in the 'Assignments' section in Canvas.

Data Set Description and Preparation

This assignment is based on an Aviation On-time data set which includes information about airports, airlines, aircrafts, and flights. The data set has the following structure:

Airports(airport_code, airport_name, city, state, country)
Aircrafts(tail_number, manufacturer, model, aircraft_type, year)
Airlines(carrier_code, name, country)
Flights(flight_id, carrier_code, flight_number, flight_date, origin, destination, tail_number, scheduled_departure_time, scheduled_arrival_time, actual_departure_time, actual_arrival_time, distance)

You will find a set of corresponding data files (as zip archives) on our course website in Canvas in the "Assignment" module.

1. Download the linked air traffic data archives from the course website and unpack them.
2. Load the contained CSV files into the storage of your AWS Educate account (cf. tutorial Week 9), typically S3 containers.

Important: Only do this data load for the two smallest data sets. We will also provide you with a larger data set for the performance evaluation. Due to its size, this one will, however, only be available as a shared resource later in this unit of study.

Question 1: Data Analysis with Apache Spark

You shall implement three different analysis tasks on the given data set using plain Apache Spark (using Apache Spark's RDD API or DataFrame API, in either Java or Python). One possible approach to all three tasks is sketched after this list.

1. Task 1: Top-3 Cessna Models
Write an Apache Spark program that determines the top-3 Cessna aircraft models with regard to the number of flights, listed in descending order of number of flights. Output the Cessna models in the form "Cessna 123" as one string, with only the initial 'C' capitalised and the model number reduced to its three digits. The output file should have the following tab-delimited format, ordered by number of flights in descending order:

Cessna XYZ \t numberOfDepartingFlights

2. Task 2: Average Departure Delay
In the second task, write an Apache Spark program that determines the average, minimum, and maximum delay (in minutes) of flights by US airlines in a given, user-specified year. Only consider delayed flights, i.e. flights whose actual departure time is after their scheduled departure time, and ignore any cancelled flights. The output file should have the following tab-delimited format (ordered alphabetically by airline name):

airline_name \t num_delays \t average_delay \t min_delay \t max_delay

3. Task 3: Most Popular Aircraft Types
In the third task, write an Apache Spark program that lists, per airline of a given (user-specified) country, the five most-used aircraft types (manufacturer, model). List the airlines in alphabetical order, and show the five most-used aircraft types in descending order of the number of flights as a single, comma-separated string enclosed in '[' and ']' (indicating a list). Format the name of an aircraft type as MANUFACTURER ' ' MODEL (for example, "Boeing 787" or "Airbus A350"). The output should have the following tab-delimited format (alphabetically by airline name):

airline_name \t [aircraft_type1, aircraft_type2, …, aircraft_type5]
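For illustration only, the following minimal PySpark sketch shows one possible DataFrame-API approach to the three tasks on Spark 2.4. The bucket paths and file names, the "United States" country value, the NULL convention for cancelled flights, and the HH:MM time format are assumptions made for this sketch, not facts stated in the assignment; writing the results out as tab-delimited files is left to you.

```python
# Illustrative sketch only (Spark 2.4, DataFrame API). Paths, file names,
# country values, and time formats below are assumptions, not given data.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("data3404-sketch").getOrCreate()

flights   = spark.read.csv("s3://your-bucket/flights.csv",   header=True)
aircrafts = spark.read.csv("s3://your-bucket/aircrafts.csv", header=True)
airlines  = spark.read.csv("s3://your-bucket/airlines.csv",  header=True)

def top3_cessna():
    """Task 1: top-3 Cessna models by number of flights."""
    return (flights.join(aircrafts, "tail_number")
            .where(F.upper(F.col("manufacturer")) == "CESSNA")
            # assumes the model field starts with its three-digit number
            .select(F.concat(F.lit("Cessna "),
                             F.col("model").substr(1, 3)).alias("model"))
            .groupBy("model").count()
            .orderBy(F.desc("count"))
            .limit(3))

def delay_stats(year):
    """Task 2: delay statistics of US airlines in the given year."""
    # assumes cancelled flights have a NULL actual departure time and that
    # departure times are same-day HH:MM strings
    delay = (F.unix_timestamp("actual_departure_time", "HH:mm")
             - F.unix_timestamp("scheduled_departure_time", "HH:mm")) / 60
    return (flights
            .where(F.year("flight_date") == year)  # assumes yyyy-MM-dd dates
            .where(F.col("actual_departure_time").isNotNull())
            .where(F.col("actual_departure_time") >
                   F.col("scheduled_departure_time"))
            .withColumn("delay", delay)
            .join(airlines.where(F.col("country") == "United States"),
                  "carrier_code")
            .groupBy("name")
            .agg(F.count("*").alias("num_delays"),
                 F.avg("delay").alias("average_delay"),
                 F.min("delay").alias("min_delay"),
                 F.max("delay").alias("max_delay"))
            .orderBy("name"))

def popular_aircraft(country):
    """Task 3: five most-used aircraft types per airline of a country."""
    counts = (flights.join(aircrafts, "tail_number")
              .join(airlines.where(F.col("country") == country),
                    "carrier_code")
              .withColumn("actype",
                          F.concat_ws(" ", "manufacturer", "model"))
              .groupBy("name", "actype")
              .agg(F.count("*").alias("num_flights")))
    w = Window.partitionBy("name").orderBy(F.desc("num_flights"))
    # collect (rank, type) pairs and sort the array, since collect_list
    # alone gives no ordering guarantee
    return (counts.withColumn("rank", F.row_number().over(w))
            .where(F.col("rank") <= 5)
            .groupBy("name")
            .agg(F.sort_array(F.collect_list(F.struct("rank", "actype")))
                 .alias("pairs"))
            .select("name",
                    F.concat(F.lit("["),
                             F.concat_ws(", ", F.col("pairs.actype")),
                             F.lit("]")).alias("top5"))
            .orderBy("name"))
```

Each function returns a DataFrame in the requested column order; note that the row_number window in Task 3 keeps exactly five types per airline and breaks ties arbitrarily, which you may want to document in your design.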
General Coding Requirements

1. You should solve this assignment with Apache Spark version 2.4 as installed in AWS EMR. You will need an AWS Educate account for this.
2. If you use any code fragments or code clichés from third-party sources (which you should not need for these tasks…), you must reference them properly. Include a statement on which parts of your submission are your own work.
3. Always test your code on a small data set before running it on any larger one.

Question 2: Performance Evaluation and Tuning

a) Conduct a performance evaluation of your implementation of each task on varying dataset sizes. We will provide you with five different data sizes, the two largest of which are shared among all groups. Execute your code on each data size and record the execution times and the sizes of the intermediate results (communication effort).

b) Suggest some optimisations to the analysis task implementations such that the performance of your task(s) improves, and show that they work. One possible starting point is sketched below.
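Question 2 is open-ended, but a common first candidate in this workload is the join between the large flights table and the small dimension tables. The snippet below is a hypothetical sketch (paths are placeholders as before): it hints a broadcast hash join, caches a reused intermediate result, and uses explain() to compare the physical plans before and after, which feeds directly into the DAG-based documentation asked for in Question 3.

```python
# Hypothetical tuning sketch: the dimension tables are tiny compared to the
# flights fact table, so a broadcast hash join avoids shuffling the large side.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("data3404-tuning").getOrCreate()
flights   = spark.read.csv("s3://your-bucket/flights.csv",   header=True)
aircrafts = spark.read.csv("s3://your-bucket/aircrafts.csv", header=True)

# Baseline: Spark may choose a shuffle-based sort-merge join here.
baseline = flights.join(aircrafts, "tail_number")
baseline.explain()    # inspect the physical plan before tuning

# Tuned: hint that the small table should be broadcast to every executor.
tuned = flights.join(F.broadcast(aircrafts), "tail_number")
tuned.explain()       # the plan should now show a BroadcastHashJoin

# If the joined result feeds several aggregations (e.g. Tasks 1 and 3),
# caching it avoids recomputing the join for every action.
tuned.cache()
```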
Question 3: Documentation of Implementation and Tuning Decisions

Write a text document (plain text, Word document, or PDF file; no more than 5 pages plus an optional appendix) in which you document your implementation and your performance evaluation. Your document should contain the following:

1. Job Design Documentation: describe the Apache Spark jobs you use to implement Tasks 1 to 3. For each job, briefly describe the different transformation functions. If you use any user-defined functions, classes, or operators, please describe those too.
2. Justification of any tuning decisions or optimisations: document the changes in the execution plans and the estimated execution costs of each individual analysis task before and after your optimisations, using the DAG visualisations of Apache Spark.
3. A brief justification of each tuning decision.
4. Performance Evaluation: include a chart and a table with the average execution times of your tasks on the different data sets.
5. As an appendix, the S3 storage locations of your final output files from the various executions.

Milestones

Have the first task ready in the Week 11 tutorials for the tutors to review and give feedback on.

Deliverables and Submission Details

There are three deliverables: the source code, a brief program design and performance documentation (up to 5 pages, as per the content description above), and a demo in Week 12 via Zoom. All deliverables are due in Week 12, no later than 8 pm, Friday 22 May 2020. Late submission penalty: -20% of the awarded marks per day late. We will make a marking rubric available in Canvas.

Please submit the source code and a soft copy of the design documentation as a zip or tar file electronically in Canvas, one per group. Name your zip archive after your UniKey: abcd1234.zip

Demo: A few points of the marking scheme will be given to any submission which can be demoed successfully on our own cluster.

Students must retain electronic copies of their submitted assignment files and databases, as the unit coordinator may request to inspect these files before marking of an assignment is completed. If these assignment files are not made available to the unit coordinator when requested, the marking of this assignment may not proceed.

All the best!

Group member participation

This is a group assignment. The mark awarded for your assignment is conditional on you being able to explain any of your answers to your tutor or the subject coordinator if asked. If members of your group do not contribute sufficiently, you should alert your tutor as soon as possible. Based on the outcome of the group's demo in Week 12, the tutor has the discretion to scale the group's mark for each member as follows:

Level of contribution                                          Proportion of final grade received
No participation.                                              0%
Passive member, but full understanding of the submitted work.  50%
Minor contributor to the group's submission.                   75%
Major contributor to the group's submission.                   100%
