
Programming ghostwriting sample - DSC 102 - Assignment 0

January 16, 2021 | Study Abroad Consulting

DSC 102: Systems for Scalable Analytics
Programming Assignment 0

1 Introduction

The goal of this programming assignment is to get you comfortable with datasets that do not fit in single-node memory and are too big for tools like Pandas or NumPy. You will use the Dask library to explore secondary-storage-aware data access on a single machine. In this assignment, you will learn to set up Dask on AWS and compute several descriptive statistics about the data to build intuition for feature engineering in the final assignment.

2 Dataset Description

You are provided with the Amazon Reviews dataset, with the reviews table as a CSV file. The schema is provided in Table 1. The dataset is available in the S3 bucket s3://dsc102-public.

Column name      Column description                 Example
reviewerID       ID of the reviewer                 A32DT10X9WS4D0
asin             ID of the product                  B003VX9DJM
reviewerName     name of the reviewer               Slade
helpful          helpfulness rating of the review   [0, 0]
reviewText       text of the review                 this was a gift for my friendwho loves touch lamps.
overall          rating of the product              1
summary          summary of the review              broken piece
unixReviewTime   time of the review (Unix time)     1397174400
reviewTime       time of the review (raw)           04 11, 2014

Table 1: Schema of Reviews table

3 Tasks

You will use the reviews table to explore features related to users. Specifically, you will create a users table with the schema given in Table 2. A code stub with the function signature for this task has been provided to you. The input to the function is the reviews CSV file, and you will carry out a series of transformations to produce the users table as a DataFrame. Plug the resulting DataFrame into the code stub and write it to the results_PA0.json file. We will time the execution of the function PA0.

We have shared with you the “development” dataset and our accuracy results. Our code’s runtime on 1 node is roughly 570s. You can use these to validate your results and debug your code.
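The transformation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the provided code stub: the function name `build_users` and both helpers are hypothetical, the `helpful` column is assumed to be stored as text like `[0, 0]` (helpful votes, total votes) with no missing values, and counting `asin` rows is assumed to be an acceptable reading of "number of products rated".

```python
import json
from datetime import datetime, timezone


def parse_helpful(cell):
    """Turn a cell such as '[2, 5]' into (helpful_votes, total_votes)."""
    helpful, total = json.loads(cell)
    return int(helpful), int(total)


def review_year(unix_seconds):
    """Calendar year (UTC) of a Unix timestamp given in seconds."""
    return datetime.fromtimestamp(int(unix_seconds), tz=timezone.utc).year


def build_users(reviews_csv):
    """Hypothetical helper: aggregate the reviews CSV into a lazy users table."""
    # Imported here so the two helpers above can be tried without Dask installed.
    import dask.dataframe as dd

    # Read only the columns the users table needs; Dask loads them in chunks.
    reviews = dd.read_csv(
        reviews_csv,
        usecols=["reviewerID", "asin", "helpful", "overall", "unixReviewTime"],
        dtype={"overall": "float64", "unixReviewTime": "int64"},
    )

    # Derive per-review vote counts and the review year.
    # Explicit `meta` tells Dask the output dtype so it does not have to guess.
    reviews = reviews.assign(
        helpful_votes=reviews["helpful"].apply(
            lambda cell: parse_helpful(cell)[0], meta=("helpful_votes", "int64")
        ),
        total_votes=reviews["helpful"].apply(
            lambda cell: parse_helpful(cell)[1], meta=("total_votes", "int64")
        ),
        year=reviews["unixReviewTime"].apply(review_year, meta=("year", "int64")),
    )

    # One row per reviewer; "count" counts review rows (an assumption, see above).
    users = reviews.groupby("reviewerID").agg(
        {
            "asin": "count",
            "overall": "mean",
            "year": "min",
            "helpful_votes": "sum",
            "total_votes": "sum",
        }
    )
    users = users.rename(
        columns={
            "asin": "number_products_rated",
            "overall": "avg_ratings",
            "year": "reviewing_since",
        }
    )
    return users  # still lazy; call .compute() to materialize it
```

Nothing is read from disk until `.compute()` is called on the returned object, which is what lets the same code run on a CSV larger than memory.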
The final evaluation will happen on separate held-out test sets. The runtime will be different for the held-out test set.

4 Deliverables

Submit your source code as a .py file on Canvas. Your source code must conform to the function signatures provided to you. Make sure that your code writes its results to results_PA0.json.

Column name                Column description
reviewerID (PRIMARY KEY)   ID of the reviewer
number_products_rated      Total number of products rated by the reviewer
avg_ratings                Average rating given by the reviewer across all the reviewed products
reviewing_since            The year in which the user gave their first review
helpful_votes              Total number of helpful votes received for the user’s reviews
total_votes                Total number of votes received for the user’s reviews

Table 2: Schema of users table

5 Getting Started

0. Access your AWS account using single sign-on ID:
   Credentials for CLI / API usage can be retrieved using a modified URL:

1. We have set up the Dask environment on an AMI named “dsc102-dask-environment-public.” Go to “AMIs” (under “Images”) in your EC2 dashboard, select public images, and then search by name to find it. Select this AMI. See Figure 1 and Figure 2.

2. Now you will launch one EC2 instance that will be used to run Dask locally. Follow the steps below. Launch one EC2 instance of type “t2.xlarge” with “40GB” of storage. On the next page, put 1 in the “Number of instances” box and don’t change anything else. Create a new security group. Leave the other fields unchanged. Finally, after pressing the “Launch” button, add a key pair and download it locally. This will allow you to SSH into the instance. See Figure 3 to Figure 7. At the end, you should be able to see one instance in your dashboard.

3. Once the EC2 instance is launched, go to the security group of the instance (under “Network & Security” in the left panel) and add the following rule (under “Inbound”).
   A rule with type “SSH”, port 22, and source “My IP”. This rule will allow you to SSH into the instance. See Figure 8.

   Note: If you are currently not in the US, you may have trouble connecting to the instance over SSH. If you are unable to SSH (step 4.b) into the AWS instance, change the source in your rule to “Anywhere”, hit save rules, and try again.

4. In this step, you will start the Jupyter notebook server on the instance.

   a. Change the permissions of the SSH key file to make sure your private key file isn’t publicly viewable:

          chmod 400 .pem

      Linux and Mac users in particular will need the chmod.

   b. SSH into one of the nodes using the command:

          ssh -i ".pem" [email protected]

      This command is shown in Figure 9. The instance address is shown in the red box in Figure 10. Activate the Dask environment with the command:

          source dask_env/bin/activate

      Start the Jupyter notebook server in one terminal with:

          jupyter notebook --port=8888

   c. Open a new terminal and connect to the Jupyter notebook over SSH using:

          ssh -i ".pem" [email protected]EC2-instance> -L 8000:localhost:8888

      The “-L” option forwards any connection to port 8000 on the local machine to port 8888 on the instance. Type in jupyter notebook list to get the token/password for the Jupyter notebook. Open your browser, go to localhost:8000, and paste the token. You can write your code here using the Jupyter notebook. To see the dashboard on localhost port 8001, use the command:

          ssh -i ".pem" [email protected] -L 8001:localhost:8787

      Consider using utilities like tmux or nohup for managing terminals.

5. The data and files are available from the S3 bucket (s3://dsc102-public). It contains the function signatures, the dataset (user_reviews.csv), the schema of the expected output (OutputSchema_PA0.json), and the expected result on the development dataset (results_PA0.json).

   a. First, set up your AWS credentials on all nodes using:

          export AWS_ACCESS_KEY_ID=
          export AWS_SECRET_ACCESS_KEY=
          export AWS_SESSION_TOKEN=

      You can find your AWS credentials at:

   b. Use this command to download the files:

          aws s3 sync s3://dsc102-public /local-file-path

      Download the data files to local disk on all nodes. Make sure the data is available in the same path where the Jupyter notebook client is running.

6. Open the dashboard and click on “Workers” to double-check that all workers (all threads of the single machine) are connected. You are now ready to start coding.

7. Terminate EC2 instances once you are done. Remember that when you terminate an EC2 instance you lose all the data on it. We therefore suggest you use a private GitHub repo to routinely push your work (code, logs, and other small files) and pull your repo whenever you create a new instance to resume your work.
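Once the files from step 5 are downloaded, one possible way to check your output against the expected development result is a small JSON comparison like the sketch below. It assumes only that both files are valid JSON; the function names and the float tolerance are hypothetical choices, and the actual grading procedure may differ.

```python
import json
import math


def json_close(expected, actual, rel_tol=1e-6):
    """Recursively compare two parsed JSON values, tolerating small float error."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        # Same keys, and every value matches.
        return expected.keys() == actual.keys() and all(
            json_close(expected[key], actual[key], rel_tol) for key in expected
        )
    if isinstance(expected, list) and isinstance(actual, list):
        # Same length, and elements match position by position.
        return len(expected) == len(actual) and all(
            json_close(e, a, rel_tol) for e, a in zip(expected, actual)
        )
    if isinstance(expected, (int, float)) and isinstance(actual, (int, float)):
        return math.isclose(expected, actual, rel_tol=rel_tol)
    # Strings, null, and mismatched types fall back to plain equality.
    return expected == actual


def compare_files(expected_path, actual_path, rel_tol=1e-6):
    """Load two JSON files and report whether their contents agree."""
    with open(expected_path) as f:
        expected = json.load(f)
    with open(actual_path) as f:
        actual = json.load(f)
    return json_close(expected, actual, rel_tol)
```

Because your code writes to a file with the same name as the downloaded expected result, keep the downloaded copy under a different name or directory (for example a hypothetical expected/results_PA0.json) before calling `compare_files`.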


Author admin
