

May 15, 2020

COMP20008 Project 1
V1.0: August 14, 2019

DUE DATE

The assignment is worth 20 marks (20% of the subject grade) and is due Friday 6 September 2019. Submission is via the LMS. Please ensure you get a submission receipt via email. If you don’t receive a receipt via email, this means your submission has not been received and hence cannot be marked. The late penalty structure is described at the end of this document.

Introduction

Road fatalities in Victoria in the first half of 2019 increased by 50% compared with the same period in 2018. Across Australia, speeding was the single largest factor contributing to road fatalities. Over one third of these road fatalities occurred in capital cities.

As a data scientist, you would like to understand the patterns surrounding motorists speeding in Melbourne so that you can suggest targeted interventions. To that end, you are pointed to the Traffic Count Vehicle Classification dataset from the City of Melbourne. The dataset contains survey results showing the number of vehicles of different types that have traversed particular road segments at a particular time. It also includes the speed limit of the roads and data related to the speeds of vehicles travelling on the roads. You can see a full explanation of the dataset at the City of Melbourne website.

For this project, you will apply various wrangling skills, including processing and visualisation techniques, to make sense of the data. As a data scientist, you are also expected to be able to use library functions which are unfamiliar and which require you to consult additional documentation from resources on the Web. You are also expected to ensure that all of the visualisations you produce are effective communication tools. In particular, all axes should be labelled, legends included where appropriate, and all visualisations should include an informative title.

The project is to be coded in Python 3.
Three relevant datasets can be downloaded from LMS:
• traffic.csv
• roads.json
• special_traffic.csv (for visualisation stage only)

Stage 1: Understand the dataset (3/20 marks)

To begin, we would like to understand the characteristics of the dataset we are working with. We would also like to clean the dataset by removing malformed or missing entries that will make analysis more difficult.

Q 1.1 (1 mark)
Print the number of traffic survey entries, number of attributes, attribute names and their data types from the traffic.csv dataset. The output of this step should look like

    ***
    Q1.1
    Number of traffic survey entries: #
    Number of attributes: #
    @ $
    @ $
    …
    @ $
    ***

where # is the value you find, @ is an attribute name and $ is its datatype.

Q 1.2 (1 mark)
A number of survey entries detected no vehicles of any type. For some of these entries, a maximum speed of '-' has been included. For others, the maximum speed is blank. These entries can cause problems when analysing data. Create a DataFrame traffic from the traffic.csv dataset with all such entries removed and print the number of remaining traffic survey entries. Your output should look like this:

    ***
    Q1.2
    Number of remaining traffic survey entries: #
    ***

Q 1.3 (1 mark)
The vehicle_class_1 attribute represents the number of short vehicles, such as cars, detected in a survey hour. Using the traffic DataFrame (from Q1.2), what is the median value of vehicle_class_1? What is the highest maximum speed detected across all survey entries? Your code should print out the results with the following format:

    ***
    Q1.3
    Median value of vehicle_class_1: #
    Highest value of maximum_speed: #
    ***

where # is the value you find, rounded to 1 decimal place.

Stage 2: Data selection & manipulation (5/20 marks)

We are particularly interested in understanding traffic patterns across different types of roads. While the traffic.csv file does not contain any information about the type of road a survey entry is related to, this information is contained in the dataset roads.json.
We can map the road segment attribute from the traffic DataFrame to the SegID attribute in the roads JSON dataset to discover which type of road a particular survey entry is related to.

Q 2.1 (2 marks)
Using the traffic DataFrame as well as roads.json, add the attribute StrType from the JSON dataset to the traffic DataFrame. Print the first 3 rows. The output of this step should look like

    ***
    Q2.1
    The first three rows of traffic DataFrame with the attribute StrType are:
    @
    @
    @
    ***

where @ is the entire row in the DataFrame.

Q 2.2 (1 mark)
We would now like to understand which roads have the most serious incidents of speeding. Add a new attribute max_speed_over_limit to the traffic DataFrame. This column should represent the difference between the maximum speed and the speed limit attributes; that is

    max_speed_over_limit = maximum_speed - speed_limit

Print the first 3 rows. The output of this step should look like

    ***
    Q2.2
    The first three rows of traffic DataFrame with the new max_speed_over_limit attribute are:
    @
    @
    @
    ***

where @ is the entire row in the traffic DataFrame.

Q 2.3 (2 marks)
We are particularly concerned with instances of speeding on arterial roads, as arterial roads are the responsibility of VicRoads rather than local councils.

Create a new DataFrame, arterials, which contains only the survey entries from traffic that relate to Arterial roads. Group the survey results in your arterials DataFrame by road name. Print the names of the three roads with the highest maximum max_speed_over_limit. (1 mark)

Comment on how you think this information could be useful for VicRoads.
(1 mark)

The output of this step should look like

    ***
    Q2.3
    Three Arterial roads with the highest maximum max_speed_over_limit:
    @: #
    @: #
    @: #
    ***

where @ is the name of the road and # is the maximum max_speed_over_limit for that road.

Stage 3: Visualisation and Clustering (11/20 marks)

We now start to get a sense of the data characteristics through various types of visualisation.

Q 3.1 Plotting groups (3 marks)
Using the traffic DataFrame, draw the following two plots: (2 marks)
a. A bar plot showing suburb (x-axis) versus mean average speed (y-axis).
b. A Tukey boxplot showing the distribution of vehicle_class_1 (the number of short vehicles) within the traffic survey results.

Comment on what you can observe from this output. (1 mark)

Note: You may need to do additional data cleaning to handle other invalid survey entries before plotting the data.

Q 3.2 Dimension reduction and visualisation (4 marks)
The special_traffic.csv dataset contains entries from traffic.csv with some added information. Although there are fewer entries than in the original traffic.csv dataset, there are still too many to visualise efficiently. For this set of visual analysis, we wish to create a DataFrame special_traffic by randomly sampling 1000 entries from the dataset special_traffic.csv. The attribute idx is the identifier of a traffic survey entry and StrType is the road type of the survey entry.

Using this special_traffic DataFrame and all attributes except for StrType and idx as features:
a. Perform Principal Component Analysis to determine the first and second principal components of this dataset. Produce a scatter plot of the first 2 principal components, colouring the points by their StrType value. (1 mark)
b. Interpret the scatter plot. (1 mark)
c. Perform two VAT visual analyses: one on all features and the other on the two components from task Q 3.2.a. (1 mark)
d. Interpret the two VAT plots and explain the differences.
(1 mark)

Q 3.3 Clustering and visualisation (4 marks)
Perform K-means clustering on all data in the special_traffic DataFrame using all attributes except for StrType and idx as features.
a. Recall that the SSE (sum of squared errors) can be used to measure the quality of clustering. The SSE is the sum of squared distances of objects from their cluster centroids:

    SSE = \sum_{i=1}^{k} \sum_{x \in c_i} \mathrm{distance}(x, c_i)^2

where c_i denotes the i-th cluster in the inner sum and its centroid in the distance term. Produce a plot of SSE vs the number of clusters k as you vary the number of clusters. Use the ‘elbow method’ to identify the optimal value of k from your plot. Is this expected given your previous results in Q3.2? Why or why not? (1.5 marks)
b. Perform K-means clustering on the data for k = 3.
Show the size of each cluster with a bar plot. (0.5 marks)

We are interested in knowing if the model groups data well based on the attribute StrType.
c. Suggest a plot to help one visually evaluate the K-means model against the measurement statement. Justify your suggestion. (1 mark)
d. Draw the suggested plot and explain your finding. (1 mark)

Marking scheme

Correctness (19 marks): For each of the questions, a mark will be allocated for level of correctness (does it provide the right answer, is the logic right), according to the number in parentheses next to each question. Note that your code should work for any data input formatted in the same way as traffic.csv, roads.json, and special_traffic.csv. For example, if a random sample of 1000 records was taken from traffic.csv, your code should provide a correct answer if this was instead used as the input.

Correctness will also take into account the readability and labelling provided for any plots and figures (plots should include title of the plot, labels/scale on axes, names of axes, and legends for colours and symbols where appropriate).

Coding style (1 mark): A mark will be allocated for coding style. In particular the following aspects will be considered:
• Formatting of code (e.g. use of indentation and overall readability for a human)
• Code modularity and flexibility. Use of functions or loops where appropriate, to avoid highly redundant or excessively verbose definitions of code.
• Use of Python library functions (you should avoid reinventing logic if a library function can be used instead)
• Code commenting and clarity of logic.
You should provide comments about the logic of your code for each question, so that it can be easily understood by the marker.

Resources

The following are some useful resources for refreshing your knowledge of Python and for learning about the functionality of pandas.
• Python tutorial
• Python beginner reference
• pandas 10min tutorial
• Official pandas documentation
• Official matplotlib tutorials
• Python pandas Tutorial by Tutorialspoint
• pandas: A Complete Introduction by Learn Data Sci
• pandas quick reference sheet
• sklearn library reference
• NumPy library reference
• Python Data Analytics by Fabio Nelli (available via University of Melbourne sign on)

Submission Instructions

Via the LMS, submit a Jupyter notebook containing your Python 3 code. A template notebook, "notebook-for-answers.ipynb", is provided in the folder with the datasets.

Other

Extensions and Late Submission Penalties: If requesting an extension due to illness, please submit a medical certificate to the lecturer. If there are any other exceptional circumstances, please contact the lecturer with plenty of notice. Late submissions without an approved extension will attract the following penalties:
• 0 < hourslate <= 24 (2 marks deduction)
• 24 < hourslate <= 48 (4 marks deduction)
• 48 < hourslate <= 72 (6 marks deduction)
• 72 < hourslate <= 96 (8 marks deduction)
• 96 < hourslate <= 120 (10 marks deduction)
• 120 < hourslate <= 144 (12 marks deduction)
• 144 < hourslate (20 marks deduction)
where hourslate is the elapsed time in hours (or fractions of hours).

This project is expected to require 20-25 hours of work.

Academic Honesty

You are expected to follow the academic honesty guidelines on the University website.

Information

A project discussion forum has also been created on the subject LMS. Please use this in the first instance if you have questions, since it will allow discussion and responses to be seen by everyone. There will also be a list of frequently asked questions on the project page.
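
As an illustration of the late penalty bands listed under Extensions and Late Submission Penalties, the schedule can be written as a small Python function. This is only a sketch: the function name late_penalty and the choice to return 0 when hourslate is zero or negative (an on-time submission) are assumptions made for the example and are not part of the official marking scheme.

```python
import math


def late_penalty(hours_late: float) -> int:
    """Return the marks deducted for a submission that is hours_late hours late.

    Hypothetical helper mirroring the penalty bands listed in this handout.
    """
    # Assumption: an on-time (or early) submission attracts no deduction.
    if hours_late <= 0:
        return 0
    # Anything more than 144 hours late loses all 20 marks.
    if hours_late > 144:
        return 20
    # Each started 24-hour block costs 2 marks:
    # (0, 24] -> 2, (24, 48] -> 4, ..., (120, 144] -> 12.
    return 2 * math.ceil(hours_late / 24)


if __name__ == "__main__":
    for hours in (0.5, 24, 30.25, 144, 150):
        print(f"{hours} hours late: {late_penalty(hours)} marks deducted")
```

Because hourslate may be a fraction of an hour, the band boundaries matter: a submission exactly 24 hours late falls in the first band, while one 24.1 hours late falls in the second.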


Author admin
