Tutoring Case – ECE 795

Posted May 15, 2020

ECE 795 — Advanced Big Data Analytics
Final Project: Comprehensive Design of Big Data Analyses
Spring 2020
Assigned: March 3, 2020
Project Demonstration: April 14 and 16, 2020
Project and Report Due: April 16, 2020

In this project, you will leverage the knowledge and tools discussed in this course to design a comprehensive big data analysis workflow. Please select one task from the following (first come, first served); each task allows only five people to work on it (Task 1 allows six, each aiming at a different format-conversion path). Please make sure to provide sufficient comments in your code to receive full credit. For the sake of space, the references, hints, and some requirements are not included here. Please find the complete description of each task on GitHub.

Task 1: Large Scale Web Record Format Conversion
1. Download the provided CSV data from the link and store it in HDFS.
2. Pick one of the following data format conversion paths:
   a. CSV to XML to JSON
   b. CSV to XML to YAML
   c. CSV to JSON to XML
   d. CSV to JSON to YAML
   e. CSV to YAML to XML
   f. CSV to YAML to JSON
3. Implement a PySpark application to pre-process the raw data if necessary and convert the original CSV data to the first data format chosen in Step 2. Afterwards, convert the data again to the second data format chosen in Step 2.
4. Repeat Step 3 after increasing the number of workers in the cluster to 3 and then 4. Compare the computing times before and after the changes and plot the figure "Computing time vs. #workers".
5. Note: there will be two sets of CSV files as inputs. One is a large number of small CSV files; the other is a single large CSV file. Please make sure your PySpark application can handle both cases, and provide a performance analysis comparing the two input sets.

Task 2: Stack Overflow Data Analysis in PySpark
1. Use the Google Cloud BigQuery API to load the provided data into HDFS.
2. Use PySpark to read the data from your cluster.
3. Analyze the data and answer the following questions:
   a. How many questions were posted from Sept. 1, 2019 to Dec. 31, 2019?
   b. What percentage of questions were answered over the above period?
   c. On average, how long did it take for questions to be answered on the website over the above period?
4. Using the questions in Step 3 as examples, perform further analyses of the given dataset and try to find other types of useful information. Implement all analyses in PySpark and justify your conclusions with the results of the code. Your report can be designed to cover tasks such as the following:
   a. Find one way to improve the answer rate for a question.
   b. Generate an analysis of user changes on Stack Overflow over the last twelve years.
   c. Generate a review of topical trends during the previous twelve years.
5. The complexity and novelty of the analyses will affect the scoring. External data may be used along with the Stack Overflow data.

Task 3: Publication Analysis of Chosen Universities from Google Scholar
1. Pick a list of universities and search for them on the profile pages of the Google Scholar website.
2. Implement a web crawler to identify the top 300 professors (ranked by total citations) from the homepage of each university, find the complete paper list from the homepage of each identified professor, and store all the related web pages in HDFS.
   a. https://scholar.google.com/citations?view_op=view_org&org=16589592566991147599&hl=en&oi=io — homepage of the University of Miami.
   b. https://scholar.google.com/citations?hl=en&user=7fQX_pYAAAAJ — homepage of Prof. A. Parasuraman, who has the top citations at the University of Miami.
   c. The total number of papers collected should be no fewer than 1,000,000 (more than 10 universities).
   d. cstart and pagesize are the URL parameters used to scan the paper list.
3. Find the fastest way to determine the best co-author of each professor and justify why your method is the fastest.
4. Use PySpark code to partition the collected papers in various ways and analyze the collected data. Justify your conclusions with the results of the code. Your report can be designed to cover tasks such as the following (the complexity and novelty of the analyses will affect the scoring):
   a. Generate an analysis of the best department in each university.
   b. Generate a review of popular research keywords in each university during the previous years.

Task 4: Word Count on Streaming Tweets
1. Set up Cloud Dataflow and configure it correctly in Google Cloud Platform.
2. Use Cloud Dataflow to import tweets from the Twitter API with keywords of your selection.
3. Use PySpark to perform a word count on all newly arriving tweets in a configurable interval (as small as possible) and save the results.
4. Test the word count system and report the smallest interval it supports (e.g., 1 min). Explain what the bottleneck is to achieving a smaller interval.
5. Write a PySpark application to count the number of tweets whose word count falls within a given range.
6. Plot the distribution of tweet word counts for a given time interval.
7. Compare the performance of computing the word count distribution from the raw data and from the saved results. Repeat the comparison with different numbers of tweets and plot the figure "computing time vs. number of tweets".

Please turn in a written report of your project (no more than 6 pages, in the same template as the first project) including:
- Instructions on how to compile and run your program
- Documented program listings
- The design of your implementation
- A detailed discussion of your implementation and analyses
- Necessary diagram(s), flowchart(s), pseudocode, etc. for your implementation
- A conclusion summarizing your understanding and analyses
- A list of references, if any

The final report (submitted to Blackboard) and code are due on April 16, 2020.
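For Task 1, the heart of the PySpark application is the per-record conversion logic. A minimal sketch of the CSV-to-JSON leg is shown below using only the standard library; the sample rows and column names are invented for illustration, and in the actual application the same mapping would run inside Spark's distributed read/write (e.g., spark.read.csv followed by df.write.json), which handles both a single large file and a directory of many small files.

```python
import csv
import io
import json

# Hypothetical sample data; the real schema comes from the provided dataset.
sample_csv = "id,name,score\n1,alice,90\n2,bob,85\n"

def csv_to_json_lines(text):
    """Convert CSV text into a list of JSON strings, one per record.

    This mirrors the record-level transformation a PySpark map() would
    apply to each row of the distributed dataset.
    """
    reader = csv.DictReader(io.StringIO(text))
    return [json.dumps(dict(row)) for row in reader]

lines = csv_to_json_lines(sample_csv)
print(lines[0])  # {"id": "1", "name": "alice", "score": "90"}
```

The second leg of the chosen conversion path (e.g., JSON to YAML) follows the same pattern: parse each record from the intermediate format and serialize it into the target format inside another map.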
Project demonstrations (no more than 8 minutes each) are on April 14, 2020 (Tasks 1 & 2) and April 16, 2020 (Tasks 2, 3 & 4).
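As a starting point for Task 4, the per-batch counting logic can be sketched with the standard library; the sample tweets are invented, and in the actual PySpark streaming application the same logic would become a flatMap/map/reduceByKey pipeline over each micro-batch.

```python
import re
from collections import Counter

# Invented sample batch of tweet texts for illustration.
tweets = [
    "big data is big",
    "streaming word count with spark",
]

WORD_RE = re.compile(r"[a-z0-9']+")

def word_count(batch):
    """Count word occurrences across a batch of tweet texts."""
    counts = Counter()
    for text in batch:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

def tweets_in_range(batch, lo, hi):
    """Count tweets whose word count lies within [lo, hi] (Task 4, Step 5)."""
    sizes = (len(WORD_RE.findall(t.lower())) for t in batch)
    return sum(1 for n in sizes if lo <= n <= hi)

counts = word_count(tweets)
print(counts["big"])  # 2
```

For Step 7's comparison, the distribution computed from `word_count` results can be reused across intervals, whereas recomputing from raw tweets repeats the tokenization work — which is what the "computing time vs. number of tweets" figure is meant to expose.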
