• May 15, 2020

The University of Melbourne • COMP90056 2019s2 • Assignment ACOMP90056 Stream Computing and ApplicationsAssignment ASecond (Spring) Semester 2019Posted on LMS: Monday, 2 September 2019Due: Monday, 16 September 2019 [07:30]Important: Each student must submit their own code and report, written individually.This Assignment contributes 10% towards your total mark for this subject. As areminder, there is a hurdle on the non-final-examination component for this subject.Part 1: Theory questionsThis Part is worth 1.5 marks.(a) [1.5 marks] Bloom Filter variant. The analysis of the Bloom filter in the slidesassumes that the available hash functions are uniform and fully independent. It is knownthat a hash function with the latter properties requires O(n log n) bits to store, makingit larger in size than the Bloom filter. In a theoretical sense, the solution can be seenas “incorrect”. That being said, we do have access to other hash functions that fallwithin the memory bound. This leads to the following Bloom filter variant. To store asubset S, with |S| = m, of a universe U = [n], initialise a bitmap of width r and choosea single hash function drawn from a 2-universal hash family. The update and outputprocedures remain the same. With what width r should we initialise the Bloom filterbitmap so that the false positive rate is no more than a parameter ε > 0?Part 2: Count-min sketch and its variationsThis Part is worth 8.5 marks.The aim of this part of the Assignment is to perform an experimental evaluationof the count-min sketch and its variants. We focus on the trade-offs expressed by thesevariants of the scheme. Since this is a rather involved task, we have broken it up intoseveral components, with marks available for partial progress.BackgroundFormally, given a universe U and a stream S = 〈(s1,41), . . . (sm,4m)〉, with si ∈ U, thefrequency of item x is defined as:fx = ∑i:si=x4x. (1)The count-min sketch is a data structure that supports estimates of fx, ∀x ∈ U, onturnstile streams. It provides the following guarantee: for parameters ε, δ > 0, onquery x, the count-min sketch returns fˆx such that:fx ≤ fˆx ≤ fx + εF1, with probability (1− δ). (2)Pseudo-code is provided in Algorithm 1.1The University of Melbourne • COMP90056 2019s2 • Assignment AAlgorithm 1: CountMinSketchRequire : fx ≥ 0, ∀x ∈ [n]1 Procedure initialise(n, w, d)2 initialise pairwise independent hash functions h1, . . . , hd on [n]→ [w];3 create a d× w matrix of counters C initialised to 0;4 return;1 Procedure update(x, 4)2 for i ∈ [d] do3 C[i][hi(x)]← C[i][hi(x)] +4;4 return;1 Procedure output(x)2 return min{C[i][hi(x)] | i ∈ [d]};Algorithm 2: CMSConservativeUpdateRequire : fx ≥ 0, ∀x ∈ [n]1 Procedure update(x, 4)2 fˆx ← output(x); // get the current estimate for x3 for i ∈ [d] do4 C[i][hi(x)]← max{C[i][hi(x)], fˆx +4};5 return;The tasksYou are required to complete the implementation of two variants of the count-minsketch. Most importantly, you need to analyse the variants and the trade-offs. We willprovide a basic implementation of the count-min sketch in Java, and scaffolds of thetwo respective variants. Each variant expresses at least one kind of trade-off.You must test your implementations in a comparative experimental environment andwrite a report that discusses the results. The report, rather than the code, is the mainpiece of assessment. We will also provide a scaffold for the report (see below), to giveyou more time to focus on the analysis and discussion. But first, we introduce thevariants.Count-min sketch variants(a) conservative update The conservative update variant is designed to mitigate theeffect of collisions. It only changes the update procedure: initialise and outputremain the same. The pseudo-code is provided in Algorithm 2.(b) Morris counters A count-min sketch that replaces ordinary frequency counts withMorris counters as its counting primitive.Template We will provide you with code templates to get you started. This includesa full count-min sketch program in Java and scaffold code for both the conservative2The University of Melbourne • COMP90056 2019s2 • Assignment Aupdate and Morris counter variants. However, your implementation can be in theprogramming language of your choice. We will not be testing code or allocating marksfor program quality. If you prefer to use a language like C or C++, for greater controlover memory allocation, you are encouraged to do so.Preparing the ReportSubmit a brief report (around 1200 words) that provides a rich comparison of the threedata structures (Count-min sketch and its two variants). To assist in presenting yourresults in a clear and organised manner, we recommend breaking the report down intofive sections: Introduction, Theory, Implementation, Experimental Set-up (includingdata sets), Results and Discussion.Introduction Establish the aims and purpose of the report. Provide some context andinclude a use-case for the count-min sketch (and variants).Theory Introduce the three data-structures that form the study. Offer some intuitionbehind the conservative update and Morris counter variants. Compare, in a theoreticalsense, the accuracy of each data structure and provide a complexity analysis. You coulddo the latter in the form of a table:Standard Conservative Update Morris CounterMemoryUpdate timeQuery timeA good theory section will help your discussion. For example, your experimentsmay ‘reflect’ what is stated in the complexity table.Implementation Detail some of the key decisions you made in your implementation.Which programming language did you use and why? Provide a justification for yourchoice of hash function.Experimental Set-up Detail your data-collection process. Did you generate your owndata? If so, how and why? Did you use actual-world data-sets? State the names andattributes of the data-sets used in the experiments. Describe the metrics and processesused for measuring memory, time and accuracy.Results and Discussion Present the results to support your discussion points, forexample, via plots and tables. Discuss the results. Do your results express any trade-offs1? Are your results consistent with what you stated in the theory section?You may deviate from this template if you think a different structure suits yourarguments and comparisons. Regardless, your report must be in the following format:A4 page size, minimum 11-point font, minimum 2cm margins, and must be a pdf file!1Hint: they should!3The University of Melbourne • COMP90056 2019s2 • Assignment ACriteriaThere are 8.5 marks available for Part 2 of this Assignment. The indicative markingcriteria for this Part are:• What is the standard of your testing regimen, as described in the report? [1.5 marks]– Have you tested your code at different scales (input sizes)?– Do your datasets vary in the distribution of items?– Do your datasets vary in the kinds of data they can handle?– Are you testing corner/tricky cases for your code?• How have you demonstrated good design choices in line with the goals of streamcomputing? [1.5 marks]– Are your data structures compact?– Do they support fast update per stream input?– Do they support fast response to queries?• Is the theory well understood and presented clearly? [1.5 marks]– Have you presented the ideas with consistent logic?– Have you comprehensively covered all aspects of your algorithms and datastructures?• Are the experimental results clearly presented and organized? [1 mark]– Can the experiments be replicated?– Have you used plots, graphics, tables appropriately? Don’t just repeat what’sin text: amplify and summarize it.• Does the discussion align the results with the goals and purpose of the report?[1 mark]– Is there a systematic critical engagement and analysis of the design decisionsand experimental results in line with the aims of streaming algorithms anddata structures?– Were there relevant hypotheses and how were these reflected upon?• Is the report well structured and written? [1.5 marks]– Are there appropriate paragraph and section headings?– Have all aspects of the task been covered?– Are arguments presented in a logical way and sequence?• Does the report display some particular creativity or insight? [0.5 marks]– Is independent thinking demonstrated?4The University of Melbourne • COMP90056 2019s2 • Assignment AExpectationsWe want you to demonstrate that you understand the differences between the threedata structures within a stream computing framework. This includes knowing how totest the implementation in a way that reveals these differences.SubmissionsYou should lodge your submission for Assignment A via the LMS. You must identifyyourself in each of your source files and the report. Poor-quality scans of solutions writtenor printed on paper will not be accepted. There are scanning facilities on campus,not to mention scanning apps for smartphones etc. Solutions generated directly on acomputer are of course acceptable. Submit three files:• A part1.pdf file containing your answer to the Part-1 theory question.• A report.pdf file comprising your report for Part 2.• A cms.zip file containing all your source files for Part 2, but not including“standard” .jar files (refer to the section on Libraries).Do not include the testing files, as these might be large. REPEAT: DO NOT INCLUDETESTING FILES! It is very important, so that you can justify ownership of your work,that you detail your contributions in comments in your code, and in your report.Administrative issuesWhen is late? What do I do if I am late? The due date and time are printed on thefront of this document. The lateness policy is on the handout provided at the firstlecture. As a reminder, the late penalty for non-exam assessment is two marks perday (or part thereof) overdue. Requests for extensions or adjustment must follow theUniversity policy (the Melbourne School of Engineering “owns” this subject), includingthe requirement for appropriate evidence.Late submissions should also be lodged via the LMS, but, as a courtesy, pleasealso email Tony Wirth when you submit late. If you make both on-time and latesubmissions, please consult the subject coordinator as soon as possible to determinewhich submission will be assessed.Individual work You are reminded that your submission for this Assignment is to beyour own individual work. Students are expected to be familiar with and to observe theUniversity’s Academic Integrity policy http://academicintegrity.unimelb.edu.au/.For the purpose of ensuring academic integrity, every submission attempt by a studentmay be inspected, regardless of the number of attempts made.Students who allow other students access to their work run the risk of also beingpenalized, even if they themselves are sole authors of the submission in question. Bysubmitting your work electronically, you are declaring that this is your own work.Automated similarity checking software may be used to compare submissions.5The University of Melbourne • COMP90056 2019s2 • Assignment AYou may re-use code provided by the teaching staff, and you may refer to resourceson the Web or in published or similar sources. Indeed, you are encouraged to readbeyond the standard teaching materials. However, all such sources must be cited fullyand, apart from code provided by the teaching staff, you must not copy code.Finally Despite all these stern words, we are here to help! There is information aboutgetting help in this subject on the LMS pages. Frequently asked questions about theAssignment will be answered in the LMS discussion group.William Holland and Tony Wirth, with assistance from the COMP90056 team.September 10, 20196