
Tutoring Case – IFN647 Assignment 2


IFN647 – Assignment 2 Requirements

Weighting: 35% of the assessment for IFN647

Items required to be submitted through IFN647 Blackboard:

1. A PDF or Word file that includes both
• a statement of completeness, with your name(s) and student ID(s) on a cover page; and
• your solutions to questions Q1, Q2, Q4 and Q7, together with a README paragraph describing how to execute your Python code in a terminal or in IDLE, the structure of your data folder, and the packages you import.
2. Your source code for all other questions, containing all files necessary to run the solutions and perform the evaluation (source code only, no executables), and a main Python file ("script.py") that runs all the source code you defined for all questions (put them together in a zip file "code.zip").
3. A zip file "result.zip" containing all "result" data files (in text). Please note that you do not need to include the dataset folder generated from "dataset101-150.zip" in your submission.

Zip all of the above files as "student ID_Surname_Asm2.zip" and submit it via Blackboard before 11:59pm on 29 May 2020.

Due date of Blackboard submission: Friday of week 12 (29 May 2020)

Individual or pair work: You may work on this assignment individually or in a pair (please note the different requirements for individuals and pairs indicated in the questions).

A major challenge at present is to build communication between users and Web search systems. However, most Web search systems use user queries rather than user information needs, because information needs are difficult to acquire automatically. The first reason for this is that users may not know how to represent their topics of interest. The second reason is that users may not wish to invest a great deal of effort digging relevant pages out of the hundreds of thousands of candidates provided by a Web search system. In this assignment, you are expected to design a system, the "Weak Supervision Model (WSM)", to provide a solution to this challenging issue. The system is broken up into three parts: Part I (Training Set Discovery), Part II (IF Model) and Part III (Evaluation).

In Part I, the major task is to present an approach that automatically discovers a training set for a specified topic (50 topics are provided), including both positive documents (e.g., labelled "1") and negative documents (e.g., labelled "0"). You may need to use the topic title, description or narrative, a pseudo-relevance feedback technique (or a clustering technique), and an IR model to find a training set D that includes both D+ (positive: likely relevant documents) and D- (negative: likely irrelevant documents) in a given unlabelled document set U. Part II selects more terms from D and discovers weights for them, and then uses the selected terms and their weights to rank the documents in U. Part III is the evaluation: you are required to show that your solution is better than the query-based method (the "baseline model"), which uses only the topic titles to rank U.

Example: topic 102, "Convicts, repeat offenders", is described as follows:

Number: R102
Convicts, repeat offenders
Description: Search for information pertaining to crimes committed by people who have been previously convicted and later released or paroled from prison.
Narrative: Relevant documents are those which cite actual crimes committed by "repeat offenders" or ex-convicts. Documents which only generally discuss the topic or efforts to prevent its occurrence with no specific cases cited are irrelevant.
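For reference, topic statements in this format can be read programmatically. The snippet below is a minimal sketch only: it assumes the 50 topics are supplied in one text file with TREC-style <num>, <title>, <desc> and <narr> tags, and the file path, tag layout and function name are assumptions for illustration, not part of this specification.

    # Sketch only: parse TREC-style topic statements into number, title, description and narrative.
    # The tag layout assumed here is not prescribed by the assignment.
    import re

    def parse_topics(path):
        topics = {}
        text = open(path, encoding="utf-8").read()
        for block in text.split("</top>"):
            num = re.search(r"<num>\s*Number:\s*(R\d+)", block)
            if not num:
                continue
            title = re.search(r"<title>\s*(.+)", block)
            desc = re.search(r"<desc>\s*Description:\s*(.*?)\s*<narr>", block, re.S)
            narr = re.search(r"<narr>\s*Narrative:\s*(.*)", block, re.S)
            topics[num.group(1)] = {
                "title": title.group(1).strip() if title else "",
                "desc": desc.group(1).strip() if desc else "",
                "narr": narr.group(1).strip() if narr else "",
            }
        return topics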
Part I: Training Set Discovery

Part I requires obtaining a complete training set D, which consists of a set of positive documents D+ and a set of negative documents D-. In this part, you are asked to present an approach (or two approaches for a pair) that finds a complete training set D in U (a given unlabelled document set, e.g., the set of documents in the Training102 folder), including at least some likely relevant documents (the positive part) and some likely irrelevant documents (the negative part). The proposed approach should draw on the knowledge you have acquired in this unit. You may discuss your approach with your tutor before you implement it.

Q1) (6 marks) Write an algorithm (or two algorithms for a pair) in plain English describing your approach for discovering a complete training set for the 50 topics and the corresponding 50 datasets (Training101 to Training150). Your approach should be generic, meaning it is feasible for all (or most) topics. For each topic, e.g., topic 102, you should use the following input and generate the following output.

Inputs: query Q = a topic (you may use the title only, e.g., 'Convicts repeat offenders', or all the information, including the description and narrative); and U = folder "Training102".
Output: D = D+ ∪ D-, where D+ ∩ D- = ∅ and D ⊆ U.

The following is a possible output in D (not the answer) for topic 102:

R102 73038 1
R102 26061 1
R102 65414 1
R102 57914 1
R102 58476 1
R102 76635 1
R102 12769 1
R102 12767 1
R102 25096 1
R102 78836 1
R102 82227 1
R102 26611 1
R102 15200 0
R102 13320 0
R102 54745 0
R102 15082 0
R102 53523 0
R102 65306 0
R102 68419 0
R102 29920 0
R102 30456 0
R102 75563 0
R102 28657 0
R102 65394 0
R102 85372 0

Q2) (6 marks) Implement the algorithm (two algorithms for a pair) in Python. You also need to discuss the output to justify why the proposed algorithm is likely to generate high-quality training sets. You may use figures to support the justification.

Q3) (3 marks) BM25-based baseline model implementation (see the week 8 workshop). Use the topic titles as queries to rank the documents for each topic, and save the results into 50 files, e.g., BaselineResult1.dat, …, BaselineResult50.dat, where each row includes the document number and the corresponding relevance degree or ranking (in descending order). The following is a possible result (not the answer) for topic 102 (in BaselineResult2.dat):

73038 5.898798484774149
26061 4.273638903483098
65414 4.1414522450167475
57914 3.967136888209526
58476 3.708467957856744
76635 3.5867337114200843
12769 3.4341129093591456
12767 3.352170358051889
25096 2.7646308089876177
78836 2.6823617071618404
82227 2.6056189593652537
26611 2.3595327588643613
24515 2.2258395867976226
33172 2.218657303566887
33203 2.2027873338265396
29908 2.188504022701605
…
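One way Q1–Q3 could fit together is to rank U with BM25 against the topic title and then apply pseudo-relevance feedback: treat the top-ranked documents as the likely positive set D+ and the lowest-ranked documents as the likely negative set D-. The sketch below illustrates this idea only; it assumes documents have already been tokenised into a dictionary of term-frequency dictionaries, and the parameter values (k1, b, top_k, bottom_k) and helper names are illustrative assumptions, not the required method.

    # Sketch only: BM25 ranking of U plus a simple pseudo-relevance-feedback split into
    # D+ (top-ranked) and D- (bottom-ranked). Parameter values are illustrative.
    import math

    def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
        # docs: {doc_id: {term: frequency}}
        N = len(docs)
        avgdl = sum(sum(tf.values()) for tf in docs.values()) / N
        df = {}
        for tf in docs.values():
            for t in set(tf) & set(query_terms):
                df[t] = df.get(t, 0) + 1
        scores = {}
        for doc_id, tf in docs.items():
            dl = sum(tf.values())
            score = 0.0
            for t in query_terms:
                if t not in tf:
                    continue
                idf = math.log10((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
                score += idf * (tf[t] * (k1 + 1)) / (tf[t] + k1 * (1 - b + b * dl / avgdl))
            scores[doc_id] = score
        return scores

    def discover_training_set(query_terms, docs, top_k=12, bottom_k=13):
        ranked = sorted(bm25_scores(query_terms, docs).items(),
                        key=lambda x: x[1], reverse=True)
        d_pos = [doc_id for doc_id, _ in ranked[:top_k]]      # likely relevant
        d_neg = [doc_id for doc_id, _ in ranked[-bottom_k:]]  # likely irrelevant
        return d_pos, d_neg

Under this kind of setup, the same BM25 scoring routine could also be reused for the Q3 baseline by writing the sorted (document number, score) pairs to the BaselineResult files.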
Part II: Information Filtering Model

Q4) (5 marks) Design an information filtering model (your WSM) that includes both a training algorithm and a testing algorithm (for an individual), or two information filtering models (for a pair), in plain English, illustrating your idea for using the training set discovered in Part I to learn the model. Please note that the keywords (terms) you select from the discovered training set should be highly important terms for each given topic.

You will use the following input and output for the training algorithm, which selects some useful features.
Input: D = D+ ∪ D-
Output: Features

For the testing algorithm, you will have the following input and output.
Input: U (e.g., folder "Training102")
Output: sorted U

Q5) (5 marks) Implement your WSM (or two models for a pair) in Python. You need to find useful features (e.g., terms) and their weights for every topic using the proposed training algorithm (from Q4) and store them in a data structure or a file. For all documents in U, you also need to calculate a relevance score for each document using the proposed testing algorithm, then sort the documents in U for each topic according to their relevance scores and save the results into "result1.dat" to "result50.dat" for the 50 topics, where each row includes the document number and the corresponding relevance score or ranking (in descending order). The following is a possible result (not the answer) for topic 102 (in result2.dat):

73038 5.898798484774149
26061 4.273638903483098
65414 4.1414522450167475
57914 3.967136888209526
58476 3.708467957856744
76635 3.5867337114200843
12769 3.4341129093591456
12767 3.352170358051889
25096 2.7646308089876177
78836 2.6823617071618404
…
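As one illustration of the training/testing split asked for in Q4 and Q5, the sketch below weights terms by how much more often they appear in D+ than in D-, keeps the highest-weighted terms as features, and then scores every document in U by summing the weights of the features it contains. The weighting scheme, the number of features and the helper names are assumptions chosen for illustration; they are not the required WSM.

    # Sketch only: a simple feature-weighting training algorithm and a matching
    # testing algorithm. The weighting scheme and feature count are illustrative.
    import math

    def train_wsm(d_pos, d_neg, docs, num_features=20):
        # docs: {doc_id: {term: frequency}}; d_pos / d_neg: lists of document ids
        def doc_freq(doc_ids):
            df = {}
            for doc_id in doc_ids:
                for t in docs[doc_id]:
                    df[t] = df.get(t, 0) + 1
            return df
        pos_df, neg_df = doc_freq(d_pos), doc_freq(d_neg)
        weights = {}
        for t, r in pos_df.items():
            # terms common in D+ but rare in D- receive higher weights
            weights[t] = math.log((r + 0.5) / (neg_df.get(t, 0) + 0.5))
        top = sorted(weights.items(), key=lambda x: x[1], reverse=True)[:num_features]
        return dict(top)

    def test_wsm(features, docs):
        # relevance score = sum of the weights of the selected features a document contains
        scores = {doc_id: sum(w for t, w in features.items() if t in tf)
                  for doc_id, tf in docs.items()}
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)

The sorted (document number, score) pairs returned by the testing step can then be written out one per line to produce "result1.dat" to "result50.dat".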
Part III: Evaluation

Q6) (5 marks) Implement a Python program to calculate top-10 precision, recall and F1 (you may use extra measures, e.g., average precision) for both the baseline model and your WSM on all topics, using the provided relevance judgements for each topic, and save the results into "EvaluationResult.dat". Please note that you can use the evaluation results to update your WSM. For each topic, e.g., topic 102, you should use the following inputs for your WSM; the output includes the evaluation results for all 50 topics.

Input: "result2.dat" and "Training102.txt"
Output: EvaluationResult.dat

The following is a possible result (not the answer) in a CSV file:

Topic  precision  recall    F1
101    0.130435   0.428571  0.20
102    0.020100   0.029630  0.023952
103    0.046875   0.214286  0.076923
…

Q7) (5 marks) You will get the 5 marks if you can prove that your WSM is significantly better than the baseline model (you can choose any measure used in Q6); otherwise, you will lose the 5 marks. Please use a t-test to help you answer this question.
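For Q6 and Q7, the sketch below shows one way the per-topic measures and the paired significance test could be computed. It assumes the relevance judgements have already been loaded into a set of relevant document ids per topic, that SciPy is available, and that the exact measure definitions (e.g., whether recall is taken at the cut-off or over the full ranking) follow what is taught in the unit; the helper names are assumptions for illustration.

    # Sketch only: top-10 precision, recall and F1 for one topic, plus a paired t-test
    # across all topics. Helper names and inputs are illustrative.
    from scipy import stats

    def evaluate_topic(ranked_doc_ids, relevant_ids, cutoff=10):
        top = ranked_doc_ids[:cutoff]
        hits = sum(1 for doc_id in top if doc_id in relevant_ids)
        precision = hits / cutoff
        recall = hits / len(relevant_ids) if relevant_ids else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        return precision, recall, f1

    def paired_t_test(wsm_values, baseline_values):
        # per-topic values of the chosen measure (e.g., 50 F1 scores each), in the same topic order
        t_stat, p_value = stats.ttest_rel(wsm_values, baseline_values)
        return t_stat, p_value

A claim that the WSM is significantly better would normally require both that the mean of the chosen measure favours the WSM and that the reported p-value falls below the chosen significance level (e.g., 0.05); state clearly which measure, test variant and threshold you used.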

Please note:
• Your programs should be well laid out, easy to read and well commented.
• All items submitted should be clearly labelled with your name and student number.
• Marks will be awarded for programs (correctness, programming style, elegance, commenting) and evaluation results, according to the marking guide.
• You will lose marks for missing or inaccurate statements of completeness, and for missing files or items.

END OF ASSIGNMENT 2
