

May 15, 2020

7CCSMBDT – Big Data Technologies
Coursework 1

Coursework assigned: 7 February 2020.
Coursework submission deadline: 4:00pm, 21 February 2020.
Late submission deadline (capped at 50%): 4:00pm, 22 February 2020.

Overview: The coursework aims to make you familiar with the following concepts: (i) Big Data characteristics and analytics, (ii) Big Data collection, and (iii) programming using the MapReduce framework. This coursework is formally assessed and is worth 10% of your final mark. You will receive feedback as part of the marking of the coursework after 4 weeks from the coursework submission deadline.

Submission: Include BOTH files below:
(i) A file, Coursework1.PDF, containing your answers. For tasks that require writing code, write your code as part of the answer. For tasks that require showing output of a program, show the output or part of the output if the file is large.
(ii) A file, Coursework1_code.ZIP, containing, for each program, the code of the program (.py file) and a file containing the entire output of applying the program to the required dataset. Name the code and output to indicate the task it corresponds to (e.g., for the code and task3.out for the output of Task 3).

Evaluation: The maximum number of marks (out of 100) for each task is given in square brackets [] next to each question.

Plagiarism: “Plagiarism is passing off someone else’s work as your own, or submitting a piece of your own work that you have already submitted as part of a different programme, module or at a different institution. The penalties for plagiarising by the College can be severe. Uploading work to KEATS is regarded by the Department as a statement by the student concerned, confirming that the work has not been plagiarised.”

Late submission: “If you are submitting your coursework after the deadline, you must submit a Mitigating Circumstances Form (MCF) to your Programme Administrator, with evidence to justify why you have not submitted on time.
If you do not do this or your reasons are not acceptable, your coursework may be given a mark of zero.” Please speak to your personal tutor about the MCF. Lecturers have no control over submission deadlines, nor can they provide extensions.

Task 1. Big Data characteristics
(a) Why can data from the transportation domain be classified as Big Data? Justify your answer by referring to the 5Vs (characteristics) of Big Data. [10]
(b) Describe the challenges entailed by each characteristic of Task 1(a). [15]
Note: Refer to lecture 1 for a discussion of the characteristics and an example of a Big Data domain (game industry).

Task 2. Big Data collection using Apache Sqoop
(a) Discuss what happens when the following command is executed:

    sqoop export --connect jdbc:mysql://localhost/hadoop --username U --password P --table mytable --export-dir /user/hive/warehouse/mytable -m 1 --input-fields-terminated-by '\001'

Your answer should explain step by step how the database table, client, and MapReduce cluster interact during the execution of the command. [15]
(b) What are the benefits of using Apache Sqoop to import data from a database table, managed by a Relational Database Management System (RDBMS), compared to a manual solution, such as custom code that reads the data from the table and writes them into local files, followed by commands or custom code that copy the files into HDFS? [10]
Note: Refer to lecture 2 for details on Sqoop.

Task 3. MapReduce combiners
Write a program using mrjob, which applies a function f of your choice to a small input file of your choice without using a combiner. Also, write a program using mrjob, which applies f to the same input file and uses a combiner. f must be inappropriate for use with a combiner. Please comment your code appropriately to explain what each step does. Provide the output of both programs and explain why the output of the second program is incorrect.
[25]
Note: You can use redirection (e.g., python3 > myoutput.txt) to get the output. You can execute the program in local mode (i.e., without -r hadoop).

Task 4. Join in MapReduce
Download the datasets id_age_occ.csv and id_educ_marital.csv from KEATS. Write a Python program based on the MapReduce framework, using mrjob, which performs a join between these two datasets. Please comment your code appropriately to explain what each step does. Provide the output of your program on the datasets in a file program_task4.out. Your report should also contain a small part of program_task4.out. [25]
Notes:
• You are asked to join the two files. Solutions that generalize to multiple files are not needed.
• You can use two input files in the mrjob program. The following example creates two input files and then applies a program to them that counts how many times each word appears in the files.

    [cloudera@quickstart Desktop]$ cat file1.txt
    one two three
    [cloudera@quickstart Desktop]$ cat file2.txt
    one four five
    [cloudera@quickstart Desktop]$ python3 file1.txt file2.txt
    "five" 1
    "four" 1
    "one" 2
    "three" 1
    "two" 1

• The join attribute is the id (it is included in the files and is not something you need to calculate). You can see its function in the join from the example output below. I expect to see the example output, based on the example input.
• IMPORTANT: The order of the attributes must be maintained. That is, every record in the joined table has id, then the attributes age and occupation of id_age_occ.csv, and last the attributes education and marital status of id_educ_marital.csv.
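The two-file word-count example in the notes above can be mimicked in plain Python to show what the map, shuffle, and reduce phases do. This is only a sketch: it does not use mrjob, the function names map_words, shuffle, reduce_counts and word_count are hypothetical, and the two strings stand in for the contents of file1.txt and file2.txt.

```python
from collections import defaultdict

# Stand-ins for the contents of file1.txt and file2.txt from the example.
FILE1 = "one two three\n"
FILE2 = "one four five\n"


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word on the line."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce phase: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(*texts):
    """Run map, shuffle and reduce over the lines of every input text."""
    pairs = []
    for text in texts:
        for line in text.splitlines():
            pairs.extend(map_words(line))
    groups = shuffle(pairs)
    # Keys are sorted here so the result is listed alphabetically.
    return [reduce_counts(word, groups[word]) for word in sorted(groups)]


if __name__ == "__main__":
    for word, count in word_count(FILE1, FILE2):
        print(f'"{word}"\t{count}')
```

Because both inputs feed the same map step, words from either file end up under the same key, which is why "one" is counted twice.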
Example input:
(i) sample of id_age_occ.csv

    1, 39, State-gov
    2, 50, Self-emp-not-inc
    3, 38, Private
    4, 53, Private

(ii) sample of id_educ_marital.csv

    1, Bachelors, Never-married
    2, Bachelors, Married-civ-spouse
    3, HS-grad, Divorced
    4, 11th, Married-civ-spouse

Example output:

    "1" [["39", " State-gov"], ["Bachelors", "Never-married"]]
    "2" [["50", " Self-emp-not-inc"], ["Bachelors", "Married-civ-spouse"]]
    "3" [["38", " Private"], ["HS-grad", "Divorced"]]
    "4" [["53", " Private"], ["11th", "Married-civ-spouse"]]

[END of Coursework 1]
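One way to read the example above is as a reduce-side join on id. The plain-Python sketch below illustrates that idea under several assumptions: it does not use mrjob, the names map_record, join, AGE_OCC and EDUC_MARITAL are hypothetical, each record is tagged with its source by hand (0 for id_age_occ.csv, 1 for id_educ_marital.csv), each id is assumed to occur at most once per file, and surrounding whitespace is stripped from every field, so the values differ slightly from the example output, which keeps some leading spaces.

```python
from collections import defaultdict

# Sample rows copied from the example input above.
AGE_OCC = ["1, 39, State-gov", "2, 50, Self-emp-not-inc",
           "3, 38, Private", "4, 53, Private"]
EDUC_MARITAL = ["1, Bachelors, Never-married", "2, Bachelors, Married-civ-spouse",
                "3, HS-grad, Divorced", "4, 11th, Married-civ-spouse"]


def map_record(line, tag):
    """Map phase: emit (id, (tag, attributes)); tag marks the source file."""
    fields = [field.strip() for field in line.split(",")]
    return fields[0], (tag, fields[1:])


def join(age_occ_lines, educ_marital_lines):
    """Group tagged records by id, then order each group by source tag."""
    groups = defaultdict(list)
    for line in age_occ_lines:
        key, value = map_record(line, 0)
        groups[key].append(value)
    for line in educ_marital_lines:
        key, value = map_record(line, 1)
        groups[key].append(value)

    joined = {}
    for key in sorted(groups):
        # Sorting by tag puts age/occupation before education/marital status.
        records = sorted(groups[key], key=lambda rec: rec[0])
        # Inner join (assumes one record per id per file): keep only ids
        # that occur in both sources.
        if [tag for tag, _ in records] == [0, 1]:
            joined[key] = [attrs for _, attrs in records]
    return joined


if __name__ == "__main__":
    for key, value in join(AGE_OCC, EDUC_MARITAL).items():
        print(key, value)
```

Tagging each record with its source is what lets the reduce step keep the required attribute order, regardless of the order in which the records for an id arrive.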


Author admin
