COMP90049 Assignment Analysis

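Task: develop and evaluate methods for detecting word blends among frequent terms in Twitter data. Analysis: first pre-process the Twitter data set in which the blend words occur, then use a reference set of dictionary words for approximate string matching, and with the help of that reference set identify word-blend candidates among the list of tokens. Finally, evaluate the algorithm against a list of true word blends using one or more evaluation metrics. Techniques involved: data analysis, data processing, string matching. Implementation code: in progress.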
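The candidate-identification step described above could be sketched roughly as follows. This is only a minimal prefix/suffix heuristic, not the method prescribed by the assignment and not one of the lecture's approximate-matching algorithms: it assumes a blend starts like one dictionary word and ends like another. The names `blend_pairs`, `find_candidates` and the `min_part` parameter are hypothetical, introduced here for illustration.

```python
from typing import Iterable, List, Set, Tuple


def blend_pairs(token: str, dictionary: Set[str], min_part: int = 2) -> List[Tuple[str, str]]:
    """Return sorted (first_word, second_word) pairs that could have formed `token`.

    Assumption: a blend is a prefix of one dictionary word followed by a
    suffix of a different dictionary word, each part at least `min_part` long.
    """
    if token in dictionary:
        return []  # known dictionary words are not treated as blends
    pairs = set()
    # Try every split point that leaves at least `min_part` characters per side.
    for split in range(min_part, len(token) - min_part + 1):
        head, tail = token[:split], token[split:]
        prefix_words = [w for w in dictionary if w.startswith(head)]
        suffix_words = [w for w in dictionary if w.endswith(tail)]
        pairs.update((p, s) for p in prefix_words for s in suffix_words if p != s)
    return sorted(pairs)


def find_candidates(tokens: Iterable[str], dictionary: Set[str]) -> List[str]:
    """Keep only the tokens for which at least one source-word pair exists."""
    return [t for t in tokens if blend_pairs(t, dictionary)]
```

With a real dictionary this heuristic will also flag ordinary compounds and misspellings, which is why the evaluation step matters. Scanning the whole dictionary for every split point is slow on large word lists; a prefix index or trie would be a natural refinement.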
Department of Computing and Information Systems
The University of Melbourne
COMP90049 Knowledge Technologies, Semester 2 2019

Project 1: Word Blending in Twitter

Released: Friday 16 Aug
Due: Research Paper: Friday 13 Sep – 5PM
Reviews: Wednesday 18 Sep – 5PM
Marks: The project will be marked out of 20 (according to the given criteria), and will contribute 20% of your total mark.

Overview

The goal of this project is to develop and critically assess methods for detecting word blends among frequent terms in Twitter data, and to express the knowledge that you have gained about this task in a short research paper. Twitter users use language innovatively, and coining new terms by blending two existing words is a common phenomenon, known as lexical blending. Consider the following examples:

Component 1    Component 2    Blend word
Britain        exit           Brexit
spoon          fork           spork
breakfast      lunch          brunch

You will detect occurrences of blend words among a pre-processed list of tokens from a Twitter data set, using a reference set of English words from a dictionary, and using methods for approximate string matching as encountered in the lectures. We will also provide you with a set of tweets the token list was extracted from, which you may (but are not expected to) use. You will evaluate the output of your algorithm(s) against a list of true word blends. The project aims to reinforce concepts in approximate matching and evaluation, and to strengthen your skills in data analysis and problem solving.

The goal of this assignment is not to develop a system which achieves near-perfect precision (in fact, this is impossible – we are developing knowledge technologies after all!).

Deliverables

1. One or more programs, implemented in the programming language(s) of your choice, which must:
• Process the data input file(s), to identify word blend candidates
• Identify word blend candidates among a set of tokens, with the help of a reference collection of tokens (dictionary)
• Evaluate the matches, with respect to the list of true word blends, using one or more evaluation metrics

2. A README that briefly details how your program(s) work(s). You may use any external resources for your program(s) that you wish: you must indicate these, and where you obtained them, in your README. The program(s) and README are required submission elements, but will not typically be directly assessed.

3. An anonymous short research paper of 1100–1350 words (±10%), as a single file in PDF format, which should include:
• A short description of the problem and data set
• A brief summary of some relevant literature
• A brief explanation of the approximate matching techniques used
• Presentation of your results in terms of the evaluation metrics discussed and illustrative examples
• A discussion on the knowledge you have gained about the problem at hand, and about the (un)suitability of the approaches you have adopted

4. Reviews of two research papers written by your peers, each of 250–350 words (±10%), comprising 4 out of the 20 marks and a critical self-reflection on your own work.

Terms of Use

As part of the terms of use of Twitter, in using the data you agree to the following:

• The Twitter dataset is based on the data set presented in

Jacob Eisenstein, Brendan O’Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP 2010), pages 1277–1287

You need to cite this paper in your research paper.

• The list of blend words was compiled using resources presented in the following publications

Deri, A. and Knight, K. (2015) How to Make a Frenemy: Multitape FSTs for Portmanteau Generation. In Human Language Technologies: The 2015 Annual Conference of the North American Chapter of the ACL, pages 206–210

Das, K. and Ghosh, S. (2017) Neuramanteau: A Neural Network Ensemble Model for Lexical Blends. In Proceedings of the 8th International Joint Conference on Natural Language Processing, pages 576–583

Cook, P. and Stevenson, S. (2010) Automatically Identifying the Source Words of Lexical Blends in English. In Computational Linguistics, Volume 36(1)

You need to cite these papers in your research paper.

You are strictly forbidden from reproducing documents in the document collection in any publication, other than in the form of isolated examples.

Additionally note that the document collection is a sub-sample of actual data posted to Twitter, without any filtering whatsoever. As such, the opinions expressed within the documents in no way express the official views of The University of Melbourne or any of its employees, and my using them does not constitute endorsement of the views expressed within. We recognize that some of you may find certain of the documents in bad taste and possibly insulting, but please look beyond this to the task at hand. The University of Melbourne accepts no responsibility for offence caused by any content contained in the documents.

Assessment Criteria

(1) Short research paper: (15 marks out of 20)

• Method: (30% of the paper mark)
You will make one or more suitable hypotheses regarding the coinage of blend words, and design experiments using one or more approximate matching methods which could plausibly test your hypotheses. You will use the data to evaluate the method(s) logically and formally. You will describe your implementation in a manner that would make your work reproducible.

• Critical Analysis: (40% of the paper mark)
You will analyze the effectiveness of your system(s), referring to the underlying theoretical behavior where appropriate. You will attempt to confirm or reject your hypotheses, using supporting evidence in terms of illustrative examples and evaluation metrics. You will derive some knowledge about the problem of identifying the causes of typographical errors.

• Report Quality: (30% of the paper mark)
You will produce a report which is commensurate in style and structure with a (short) research paper. You will express your ideas clearly and concisely, and remain within the word limits. You will include a short summary of related research.

NOTE: A marking rubric is available on LMS to indicate what we will be looking for in each of these categories when marking.

(2) Reviews and self-reflection (5 marks out of 20)

You will have 250–350 words to respond to three “questions” for two research papers of your peers (2 marks each) and for your own paper (1 mark):
• Briefly summarize what the author has done
• Indicate what you think the author has done well, and why
• Indicate what you think could have been improved, and why

Completing the reviews is expected to take about 3–4 hours in total.

Changes/Updates to the Project Specifications

If we require any (hopefully small-scale) changes or clarifications to the project specifications, they will be posted on the LMS. Any addendums will supersede information included in this document.

Academic Misconduct

For most people, collaboration will form a natural part of the undertaking of this project. However, it is still an individual task, and so reuse of ideas or excessive influence in algorithm choice and development will be considered cheating. We will be checking submissions for originality and will invoke the University’s Academic Misconduct policy (http://academichonesty.unimelb.edu.au/policy.html) where inappropriate levels of collusion or plagiarism are deemed to have taken place.

Late Submission Policy

You are strongly encouraged to submit by the time and date specified above; however, if circumstances do not permit this, then the marks will be adjusted as follows:

Each business day (or part thereof) that this project is submitted after the due date (and time) specified above, 10% will be deducted from the marks available, up until 5 business days (1 week) have passed, after which regular submissions will no longer be accepted.
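For the evaluation deliverable in the specification above (comparing predicted candidates against the list of true word blends), one possible starting point is set-based precision, recall and F1. This is a hedged sketch: the specification does not name particular metrics, so the choice of these three is an assumption, and the function name `evaluate` and its return format are hypothetical.

```python
from typing import Dict, Iterable


def evaluate(predicted: Iterable[str], true_blends: Iterable[str]) -> Dict[str, float]:
    """Compare predicted blend candidates with the list of true blends."""
    predicted_set = set(predicted)
    true_set = set(true_blends)
    true_positives = len(predicted_set & true_set)

    # Precision: share of predicted candidates that are true blends.
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    # Recall: share of true blends that were predicted.
    recall = true_positives / len(true_set) if true_set else 0.0
    # F1: harmonic mean of the two, defined as 0 when both are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Recall is only meaningful relative to the true blends that actually occur in the token list, so it may be worth intersecting the true-blend list with the tokens before calling `evaluate`.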

