- May 15, 2020
The Australian National University
Research School of Computer Science, CECS

COMP3430 – Data Wrangling – 2019
Record linkage project
Due 11:55pm Sunday 20 October 2019
Worth 20% of the final grade for COMP3430
Draft – Last update September 11, 2019

Overview and Objectives

For this project you will take another look at the record linkage program that you developed in the lab sessions. Specifically, we provide you with two new data sets and ask you to work with the programs we have developed in the labs, and report on your findings. As with the previous assessments, the emphasis is on your understanding, descriptions, and justification as much as the raw (numerical) record linkage evaluation results that you are able to achieve.

Important

• Submit one zip archive file, named uNNNNNNN_record_linkage_project.zip, where uNNNNNNN is your ANU ID. For example, if your ANU ID is u1234567 you should submit the file u1234567_record_linkage_project.zip. Only use underscores and not spaces, and only lower-case letters in your file name (as this will greatly help our marking efforts – thanks). You receive a -1 mark penalty if you do not follow this naming convention.
• Make sure that your student ID is included on the first page of your submitted report. You receive a -1 mark penalty if you do not include your student ID.
• Do NOT include your name anywhere in your submission. All marking will be done anonymously. You receive a -1 mark penalty if you do include your name.
• The zip file must contain:
  1. Your report, a .pdf document named uNNNNNNN_record_linkage_project_report.pdf
  2. Your output file for the best linkage results you were able to obtain (see task 3 below), a .csv file named uNNNNNNN_record_linkage_project_result.csv
• The allowed total maximum length of your report is four (4) A4 pages (single pages, not 4 double pages!) and around 1,500 words.
  We expect you to use at least 12 point font size with a standard font (such as Times New Roman or Liberation Serif) for all text in your submitted report. We encourage you to use larger font size or bold font for titles, section headers, etc. Include the total word-count of your report on the first page of your report. The 4 page maximum length does include any figures, tables, references and appendices.
• Your submitted report does not need to have a cover page.
• Word documents or any other formats besides PDF are not accepted and will not be marked.
• Hand-written submissions are not accepted and will not be marked.
• Make sure you submit the final version of your project before the submission deadline.

Submission

Submission will be done using Wattle. Click on the link COMP3430 record linkage project submission (to be made available) in week 11 to upload your ZIP file. You may submit as many draft versions of your project as you wish. However, you must make sure you submit a final version before the submission deadline. We will mark the final version present at the due date. Note that Wattle does not allow us to access earlier submitted versions of your project, therefore check carefully what you submit as the final version! We cannot accept submissions via email.

Penalties

The following will attract penalties:
-1 mark if you do not follow the file naming convention discussed above.
-1 mark if you do not include your student ID on the first page of your submitted report.
-1 mark if you do include your name in your submitted report.
-1 mark for every page over the maximum 4 page limit (so a 6 page report will attract a -2 penalty).
-1 mark if you use a font size smaller than 12 points, or a difficult to read font type.

Deadlines, Extensions and Late Submissions

The record linkage project is due 11:55pm, Sunday 20 October 2019.
Students will only be granted an extension on the submission deadline in extenuating circumstances, as defined by ANU policy (http://www.anu.edu.au/students/program-administration/assessments-exams/deferred-examinations). If you think you have grounds for an extension, you must notify the course convener as soon as possible and provide written evidence in support of your case (such as a medical certificate). The course convener will then decide whether to grant an extension and inform you as soon as practical.

In accordance with the CECS and ANU late submission policy, no late submissions will be accepted, except where an extension has been approved by the course convener.

Plagiarism

No group work is permitted for this project. We do encourage you to discuss your work, but we expect you to do the project work by yourself. If you are unsure about what constitutes plagiarism, make sure you carefully read the ANU Academic Honesty Policy (http://academichonesty.anu.edu.au/).

If you do include ideas or material from other sources, you must clearly attribute them by providing a reference to the material or source in your submitted project report. We do not require a specific referencing format, as long as you are consistent and your references allow us to find the source, should we need to while we are marking your report.

Marking

This project will be marked out of 20, and it will count for 20% of your final course mark. Note that not all project tasks are equally difficult. For some of the tasks there is no single right or wrong answer. Marks will be awarded based on your reasoning and the justification of your decisions and explanations, as well as clarity and correctness of writing.

We will endeavour to release your marks and feedback within two teaching weeks after the submission deadline.
If you feel we have made an error in marking, you have two weeks following the release of marks to raise any issues with the course convener, after which time your mark will be considered final. If you request that we re-mark your project, we will re-mark the entire project and your mark may go up or down as a result.

Project Structure

This project consists of four (4) tasks as described below which can be worth different numbers of marks. Make sure you answer all aspects of each task.

If you have any questions on the project please post them on Wattle – however, do not post any partial solutions, program codes, equations, calculations, URLs, etc. or any hints on how to solve any of the project tasks.

Project Tasks

For this project, we provide you with the following two new data sets, dataset A.csv and dataset B.csv, as well as a truth data set true matches.csv, available for download from Wattle in week 8. The tasks for this project are similar to what you had to do in lab 7 in week 9.

You are required to run your record linkage program (including any modifications you have made to this program) on the two data sets provided, and write a report which addresses the following questions:

1. Blocking (6 marks):
   • How does blocking affect your results? Specifically, describe your choice of blocking method and choice of blocking keys. Discuss which attributes and/or attribute combination(s) in the given data sets were useful as blocking keys and which were not, and why.
   • If there is a trade-off between performance (reduction ratio, pairs completeness and pairs quality) and the quality of the final record linkage results, where do you think the optimal balance is, and why?
   • Do you think this trade-off would change on different data sets with different levels and characteristics of data quality? If so, how and why?

2. Comparison and Classification (6 marks):
   • How do different comparison techniques affect linkage results?
     Discuss and justify how you selected appropriate comparison functions for different attributes, and why these selected functions are suitable while others were not.
   • How do different classification techniques using different parameter settings affect linkage quality? Discuss and justify how you selected an appropriate classification function to obtain high linkage quality.
   • For suitable linkage quality measures, as discussed in the lectures in week 8, describe how the final record linkage quality changes with the choice of parameters and techniques.
   • Is the record linkage quality particularly sensitive to certain parameters or choice of comparison or classification techniques? If so, why is this the case?
   • Provide the numerical linkage evaluation results for other (not optimal, see below) parameter settings that you have used (you only have to provide the output file for your best obtained linkage results – see next task). Ideally you include tables or plots to show linkage quality results for different parameter settings.

3. Optimal Settings (4 marks):
   • What is the best linkage quality result you are able to achieve, both in the blocking and the classification steps? Why do you think this combination of parameters and techniques works well?
   • Are the results good for all evaluation measures discussed in the lectures in week 8, or only for some? If the results are good only for some measures, why do you think the results are not good for other measures?

   In addition to answering this task in your report, you must also submit the output file which contains the linked and classified matching record pairs (as a CSV file) for the best linkage result you were able to obtain. Use the Python program saveLinkResult.py which we use in lab 7 to write linkage output into a file. Your submitted output file must exactly follow this CSV file format! We will use a program to check linkage quality using this file to validate what you write in your report.
   If our program does not work with your submitted file because it does not follow the required file structure, you will lose marks.

4. Data Quality (4 marks):
   • How dirty are these new data sets compared to all the data sets you have worked with in labs 3 to 7? Describe your impression after having conducted the linkage project.
   • How can you determine this? Describe the methodology you used to assess the data quality of the data sets we provided for this project (such as any calculations you used, or how you determined the data quality using data exploration and profiling).

Visualisations: You should use appropriate data visualisations such as tables, plots, etc. for your descriptions to the above tasks. Marks will be awarded for good visualisations and appropriate tables. Assume you are presenting your record linkage project to an audience without a strong technical background, so make sure you adequately explain any visualisations you use (i.e. describe what tables and figures show and interpret the content of the obtained results).
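
As a reminder of what the measures named in tasks 1 to 3 compute, the following is a minimal sketch that assumes the commonly used definitions of reduction ratio, pairs completeness and pairs quality (for blocking) and of precision, recall and F-measure (for the final classification). The function names, the representation of a record pair as an (identifier in A, identifier in B) tuple, and the choice of precision, recall and F-measure as the linkage quality measures are illustrative assumptions; they are not taken from the lab programs and may differ from the exact set of measures covered in the week 8 lectures.

```python
def blocking_measures(candidate_pairs, true_matches, num_records_a, num_records_b):
    """Return reduction ratio, pairs completeness and pairs quality.

    candidate_pairs and true_matches are iterables of (id_a, id_b) tuples
    (an assumed representation, not the lab programs' data structure).
    """
    candidates = set(candidate_pairs)
    truth = set(true_matches)
    true_in_candidates = len(candidates & truth)
    total_pairs = num_records_a * num_records_b  # full comparison space

    # Reduction ratio: share of the full comparison space removed by blocking.
    rr = 1.0 - len(candidates) / total_pairs if total_pairs else 0.0
    # Pairs completeness: share of true matches that survive blocking.
    pc = true_in_candidates / len(truth) if truth else 0.0
    # Pairs quality: share of candidate pairs that are true matches.
    pq = true_in_candidates / len(candidates) if candidates else 0.0
    return {"reduction_ratio": rr, "pairs_completeness": pc, "pairs_quality": pq}


def linkage_measures(classified_matches, true_matches):
    """Return precision, recall and F-measure for pairs classified as matches."""
    predicted = set(classified_matches)
    truth = set(true_matches)
    tp = len(predicted & truth)   # true matches classified as matches
    fp = len(predicted - truth)   # non-matches classified as matches
    fn = len(truth - predicted)   # true matches that were missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2.0 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_measure": f_measure}


if __name__ == "__main__":
    # Toy data only: 4 records in A, 5 records in B, 3 true matches.
    truth = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
    candidates = [("a1", "b1"), ("a2", "b2"), ("a3", "b4"), ("a4", "b5")]
    predicted = [("a1", "b1"), ("a3", "b4")]
    print(blocking_measures(candidates, truth, 4, 5))
    print(linkage_measures(predicted, truth))
```

Computing both groups of measures for each parameter setting makes the trade-off asked about in task 1 visible: a blocking key that raises the reduction ratio but lowers pairs completeness also caps the recall that any later classification step can reach.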