CO832: Data Mining and Knowledge Discovery
Assessment 2: Practical Data Analysis
Instructions and General Marking Scheme

Marek Grześ

March 6, 2020

This document is complemented by a video that is available on the module's Moodle page. You are strongly encouraged to watch the video before you start working on this assessment.

You can do this assessment either individually or in a small group of just two people. In the latter case, the group must hand in a single assessment, and the two students in the group will receive the same mark. This assessment is worth 10% of the total marks for this module.

1 Objectives

Section 3 presents the dataset that you will analyse in this assessment. The dataset defines a standard classification problem. Your task can be summarised as follows:

1. Select a classification algorithm that you will use for your analysis. Decision trees and classification rules were introduced in lectures, and you are encouraged to use them, but you may want to explore other types of classification algorithms. You should not use instance-based classification algorithms (e.g. k-NN) because they will make parts of your analysis challenging. Naïve Bayes is strongly biased, and you may not be able to reduce its bias easily; for this reason, you should avoid Naïve Bayes in this assessment too.

2. Having selected the algorithm in the previous step, you will explore its performance on the data. The goal of your exploration is to tune the parameters of the algorithm such that the 10-fold cross-validation accuracy is maximised. In doing so, you will also explore the training error of the method.

3. Using the results obtained in the previous step, you will analyse them with reference to the bias-variance trade-off that was presented in lectures. Specifically, your challenge is to identify results and their parameter settings that lead to low bias, low variance, and a good trade-off between bias and variance. In order to receive the highest mark for this assessment, you will need to argue about the properties of the data. This discussion should be supported by the results of your bias-variance trade-off analysis. For example, you can argue whether the data that you are analysing is noisy or not, or whether a complex decision boundary is required. It will be helpful for you to read (James et al., 2013, Secs. 2.2.1–2.2.2), whose e-book is available at https://www.kent.ac.uk/library/, and to study their Fig. 2.12 to see how the properties of the data can influence the bias-variance trade-off.

4. The last step of your investigation is to analyse the models learned from the data. In particular, you will identify attributes that influence the class attribute. For example, inspecting the decision trees learned from the data, you will look for attributes that are important. Importance will depend on the position of the attribute in the decision tree or on the number of nodes in which the attribute appears. If your model is Naïve Bayes, you will look for attributes that have a high probability P(attribute | class). Your goal in this step is essentially to look into the models that you will obtain and to extract domain-specific knowledge about the data from the models (a code sketch illustrating this idea follows below).
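If you take the Python route, the following minimal sketch illustrates the model-inspection idea in objective 4. It uses scikit-learn's DecisionTreeClassifier as a stand-in for J48 (scikit-learn does not implement C4.5 itself, so this is an analogue rather than the lecture algorithm), and it assumes the dataset has been exported to CSV as shown later in Listing 1; the path and the parameter value are illustrative assumptions only.

# A sketch only: inspect a learned decision tree to find influential
# attributes (objective 4). DecisionTreeClassifier is an analogue of J48,
# not J48 itself; the CSV path is the hypothetical export from Listing 1.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("/tmp/diabetes.csv")
X, y = df.drop(columns=["class"]), df["class"]

# A small tree is easier to interpret; min_samples_leaf plays roughly the
# role of J48's minNumObj parameter.
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# Attributes tested near the root of the printed tree are typically the
# important ones.
print(export_text(tree, feature_names=list(X.columns)))

# Impurity-based importances give an aggregate view of the same question.
for name, imp in sorted(zip(X.columns, tree.feature_importances_),
                        key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")

Attributes that appear at the top of the printed tree, or with a large importance score, are natural candidates for the domain-specific discussion asked for above.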
2 What to submit?

Submit a technical report that has the following structure:

• A title page with your name and Kent login.

• A one-page report that presents your findings, observations, and technical analysis of your results. Note that it is not sufficient to present observations; you need to analyse your observations using technical terms. Mentioning observations without explaining and justifying them will not give you the highest marks.

• An appendix that contains the figures and tables that you want to include to support the discussion in the technical report. The figures and tables in the appendix should have informative and self-explanatory captions. This means that the figures and tables can, to a large extent, be self-contained. The size of the appendix is unlimited, and you can have as many figures and tables as you want (but you don't have to add many if you don't need to).

Note that the main part of the technical report has a strict limit of one page, and your marks will be reduced for violating this restriction. All pages should be of the usual A4 size, with margins of at least 1 inch and a font size of 12pt or larger.

If you produce any code for this assessment in any programming language, the code should be submitted as well. The code and the report will be separate parts in the submission page.

3 Dataset

The dataset is called Pima Indians Onset of Diabetes. Each instance represents medical details for one patient, and the task is to predict whether the patient will have an onset of diabetes within the next five years. There are 8 numerical input variables, all of which have varying scales. You can learn more about this dataset on the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). Top results on this dataset are in the order of 77% accuracy.

The dataset is available in a file called diabetes.arff on the module's web page on Moodle. The file extension ".arff" means it is a file in a format specifically suitable for WEKA. The class attribute in the diabetes.arff dataset is called "class". Make sure that you use this attribute as the class attribute in your investigation.

4 Data Format

After a few lines at the top of the file defining the dataset name, its attributes and their corresponding values, each line in the file represents a data example/instance, and it consists of several attribute values. One of those attributes is then chosen as a class attribute, whose value is to be predicted by a classification algorithm. Missing values (if present in a dataset) are represented using the "?" symbol. The arff format is a standard data format in Weka, but it can be imported into Python easily. The following script reads an example dataset.

Listing 1: load-arff-data.py

import os
from scipy.io import arff
import pandas as pd

data = arff.loadarff('/home/mgrzes/Documents/Teaching/datasets/diabetes.arff')
df = pd.DataFrame(data[0])
print("Original arff data:")
print(df.head())
# above we can see that we need to decode the last column
df['class'] = df['class'].str.decode("utf-8")
# strings are fine now
print("Data with fixed strings:")
print(df.head())
# let's save this data frame in the CSV format
df.to_csv(r'/tmp/diabetes.csv', index=False, header=True)
print("The CSV file:")
os.system("head -n 5 /tmp/diabetes.csv")
# you can load the csv data easily
df = pd.read_csv("/tmp/diabetes.csv")
# preview the first 5 lines of the loaded data
print("Read from the CSV file:")
print(df.head())

The output of this program is below.
Original arff data:
   preg   plas  pres  skin   insu  mass   pedi   age               class
0   6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  b'tested_positive'
1   1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  b'tested_negative'
2   8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  b'tested_positive'
3   1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  b'tested_negative'
4   0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  b'tested_positive'
Data with fixed strings:
   preg   plas  pres  skin   insu  mass   pedi   age            class
0   6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  tested_positive
1   1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  tested_negative
2   8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  tested_positive
3   1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  tested_negative
4   0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  tested_positive
The CSV file:
preg,plas,pres,skin,insu,mass,pedi,age,class
6.0,148.0,72.0,35.0,0.0,33.6,0.627,50.0,tested_positive
1.0,85.0,66.0,29.0,0.0,26.6,0.351,31.0,tested_negative
8.0,183.0,64.0,0.0,0.0,23.3,0.672,32.0,tested_positive
1.0,89.0,66.0,23.0,94.0,28.1,0.167,21.0,tested_negative
Read from the CSV file:
   preg   plas  pres  skin   insu  mass   pedi   age            class
0   6.0  148.0  72.0  35.0    0.0  33.6  0.627  50.0  tested_positive
1   1.0   85.0  66.0  29.0    0.0  26.6  0.351  31.0  tested_negative
2   8.0  183.0  64.0   0.0    0.0  23.3  0.672  32.0  tested_positive
3   1.0   89.0  66.0  23.0   94.0  28.1  0.167  21.0  tested_negative
4   0.0  137.0  40.0  35.0  168.0  43.1  2.288  33.0  tested_positive

If you decide to use Python for this assessment, you don't have to use the arff format. You can find this dataset in alternative formats, and you can use those.
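As a side note on the "?" convention mentioned above: if the copy of the dataset you work with does contain "?" markers, the following minimal sketch shows one way to handle them in pandas. The CSV path is the hypothetical export from Listing 1, and the file may well contain no such markers at all.

# A minimal sketch, not part of the assessment: treat "?" as a missing
# value when loading a CSV copy of the data. The path is the hypothetical
# export from Listing 1; whether any "?" entries exist depends on the file.
import pandas as pd

df = pd.read_csv("/tmp/diabetes.csv", na_values="?")  # map "?" to NaN
print(df.isna().sum())  # number of missing values per attribute
df = df.dropna()        # one simple option: drop incomplete instances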
5 Software

You can choose one of two options:

1. Weka, which won't require any programming.

2. Machine learning packages in Python, which will require writing code in Python.

5.1 Option 1—Weka

WEKA is a freely available data mining tool that is installed on university PCs. It implements the J48 algorithm for learning decision trees, which was introduced in our lectures. You can use this algorithm for your investigation, or you can choose another classification algorithm. Note that J48 is Weka's implementation of the algorithm known in the machine learning literature as C4.5 (whose commercial successor is C5.0). The video that complements this document contains a brief introduction to WEKA, and it shows the major features of this package that are required to do the assessment. The module's Moodle page includes a few additional documents and tutorials on WEKA.

5.2 Option 2—Python

There is no dedicated Python tutorial associated with this assessment. However, an example script is provided that could be a good start for you. If you don't want to use Python, you can simply choose the first option and do this assessment in Weka. The following script runs the k-NN algorithm on the Iris dataset and plots a validation curve in which the parameter k is varied. Note that validation curves were explained in our lecture on the bias-variance decomposition and inductive bias.

Listing 2: validation-curve.py

# https://scipy-lectures.org/packages/scikit-learn/index.html
# The validation curve is computed using the sklearn function validation_curve.
# Note that only one parameter can be varied when this function is used.
# If you decide to vary more than one parameter, you may need to plot
# a 3D surface using matplotlib or to generate several univariate plots
# (like the one generated by this program) for different values of the
# other parameters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# we use the standard Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# range of k in k-NN
nrange = np.arange(1, 31)

model = KNeighborsClassifier()  # weights='uniform' by default

# we vary k in the k-NN algorithm
train_scores, validation_scores = validation_curve(
    model, X, y, param_name='n_neighbors', param_range=nrange)

# Plot the mean train score and validation score across the folds.
# plot accuracy
plt.plot(nrange, validation_scores.mean(axis=1), label='cross-validation')
plt.plot(nrange, train_scores.mean(axis=1), label='training')
plt.legend(loc='best')
plt.title("Accuracy - NOTE HIGH BIAS ON THE RIGHT-HAND SIDE")
plt.show()

# plot error
plt.figure()
plt.plot(nrange, 1 - validation_scores.mean(axis=1), label='cross-validation')
plt.plot(nrange, 1 - train_scores.mean(axis=1), label='training')
plt.legend(loc='best')
plt.title("Error - NOTE HIGH BIAS ON THE RIGHT-HAND SIDE")
plt.xlabel('k in KNN')
plt.ylabel('Error')
plt.show()

The output of this program is shown in Fig. 1.

[Figure 1: Output of the Python example included in this document. The plot shows error against k in KNN for the cross-validation and training curves, with high bias visible on the right-hand side.]

6 Implementation

Make sure the test mode for evaluating predictive accuracy is set to 10-fold cross-validation. This will give you an estimate of the generalisation error on unseen data. Note that when cross-validation is used in WEKA, there is no information in the output about the error on the training data. The % of correctly classified instances that is printed for cross-validation is the average across the k folds of cross-validation. In order to obtain the % of correctly classified instances on the training data in WEKA, you will need to repeat every experiment selecting "Use training set" instead of "Cross-validation" in the "Test options" panel of the "Classify" tab. This means that you will need to run every experiment (i.e., for every set of parameters) twice to record accuracy on the training data and on cross-validation.

In order to implement your analysis and to address the objectives stated in Sec. 1, you will need to run your classification algorithm several times, specifying different values of the algorithm's parameters. The parameters that you adjust should be those that influence the bias-variance trade-off of the algorithm that you use. For example, if your algorithm is J48, you may want to tune the parameters shown in Tab. 1, because those parameters control the size of the decision trees.

Parameter            Description
minNumObj            The minimum number of instances per leaf.
unpruned             Whether pruning is performed.
confidenceFactor     The confidence factor used for pruning (smaller values incur more pruning).
reducedErrorPruning  Whether reduced-error pruning is used instead of C4.5 (J48) pruning.
numFolds             Determines the amount of data used for reduced-error pruning. One fold is used for pruning, the rest for growing the tree.

Table 1: J48 parameters that can be explored in this assessment.

Note that the last objective in Sec. 1 may require different values of the parameters than the previous objectives, because its goal is to interpret the model rather than purely to increase its predictive accuracy.
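For the Python option, the run-every-experiment-twice procedure above can be scripted directly. The sketch below is a minimal illustration under the same assumptions as before: scikit-learn's DecisionTreeClassifier as an analogue of J48 (with min_samples_leaf playing roughly the role of minNumObj from Tab. 1), the hypothetical CSV export from Listing 1, and an illustrative parameter grid.

# A minimal sketch of the Sec. 6 procedure in Python: for each parameter
# value, record both the training accuracy and the 10-fold cross-validation
# accuracy. DecisionTreeClassifier is an analogue of J48, not J48 itself;
# min_samples_leaf plays roughly the role of J48's minNumObj.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("/tmp/diabetes.csv")   # hypothetical path from Listing 1
X, y = df.drop(columns=["class"]), df["class"]

for leaf in [1, 2, 5, 10, 20, 50, 100]:  # an example grid, not prescribed
    model = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    cv_acc = cross_val_score(model, X, y, cv=10).mean()  # 10-fold CV accuracy
    train_acc = model.fit(X, y).score(X, y)              # training accuracy
    print(f"min_samples_leaf={leaf:3d}  train={train_acc:.3f}  cv={cv_acc:.3f}")

Small leaf sizes should give near-perfect training accuracy with a visible gap to the cross-validation score (low bias, high variance), while very large values should pull both scores down (high bias), mirroring the behaviour of Fig. 1.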
7 Deadline

The printed technical report has to be handed in to the Student Administration Office by the deadline for this assessment, which is specified in the Student Data System. It is your responsibility to find out what time the Student Administration Office closes on the day of the deadline.

If you prefer to submit your report electronically, there will be a Turnitin link on Moodle that will allow you to do so. For electronic submissions, the preferred format is PDF. If you submit ODT or DOCX, your file will be automatically converted to PDF using my bash script that will run libreoffice on your file. If you submit electronically, please include your name and your Kent login on the title page of your document. It takes a lot of time for us to deal with anonymous submissions after we have printed them.

8 Time Estimated to Complete the Assessment

The time that students take to write a short report as required in this assessment varies significantly across students, but as a rough estimate, students can be expected to spend about 20 hours on this assessment. This estimate refers to the total time to do the assessment, i.e., including the time to read documentation about the parameters of the algorithms and their software implementation, carry out the experiments, analyse the results, and write the technical report. Note that this time estimate assumes that the students have been learning the module material on a regular basis. If they have not engaged in intensive self-study and reflection on the material provided in the lectures, they may need to spend considerably more time on this assessment.

9 Notes on Plagiarism

Senate has agreed the following definition of plagiarism:

Plagiarism is the act of repeating the ideas or discoveries of another as one's own. To copy sentences, phrases or even striking expressions without acknowledgement in a manner that may deceive the reader as to the source is plagiarism; to paraphrase in a manner that may deceive the reader is likewise plagiarism. Where such copying or close paraphrase has occurred the mere mention of the source in a bibliography will not be deemed sufficient acknowledgement; in each such instance it must be referred specifically to its source. Verbatim quotations must be directly acknowledged either in inverted commas or by indenting.

The work you submit must be your own, except where its original author is clearly referenced. We reserve the right to
run checks on all submitted work in an effort to identify possible plagiarism, and to take disciplinary action against anyone found to have committed plagiarism. When you use other people's material, you must clearly indicate its source.

10 General Marking Scheme

Your technical report will be assessed based on two main criteria: (1) technical quality, and (2) the comprehensibility of the report. Technical quality, which is the more important criterion, involves the correct use of technical terms, concepts and arguments. In general, the more advanced (and correct) the technical concepts and arguments used in your report, the higher the mark. The comprehensibility of the report involves the use of well-written sentences, which are understandable, meaningful, and grammatically correct. It also involves the use of clear figures to illustrate your arguments; for instance, you will lose marks if a figure includes text in a very small font size that is hard to read. The more clearly (and correctly) written your text is, and the clearer the figures are, the higher the mark.

Your report will be assigned a mark based on a categorical marking scale used by the University, which includes a range of a few discrete numerical marks for each categorical mark, as follows:

Mark range: 100, 95, 85, 78, 75, 72. Marks within this range are allocated based on the extent to which your technical report has the following characteristics: The analyses are of excellent technical quality, reporting all the information required in the assessment's instructions and with many arguments that involve advanced technical concepts and are clearly and correctly explained, with no technical mistakes.

Mark range: 68, 65, 62. Marks within this range are allocated based on the extent to which your technical report has the following characteristics: The analyses are of very good technical quality, reporting all the information required in the assessment's instructions and with several arguments that involve advanced technical concepts and are in general clearly and correctly explained, possibly with a few relatively minor technical mistakes or a few hard-to-understand sentences.

Mark range: 58, 55, 52. Marks within this range are allocated based on the extent to which your technical report has the following characteristics: The analyses are not of good quality, in general, but at least the report contains most of the information required in the assessment's instructions. The technical arguments are not clearly and correctly explained; there are some significant technical mistakes and possibly many hard-to-understand sentences. If the report is of reasonable technical quality with arguments that show legitimate (although not advanced) technical knowledge, then a higher mark in this range (e.g. 58) will be allocated.

The marks below (i.e., marks < 50) correspond to a "fail" mark, since the pass mark for this module is 50%.

Mark range: 48, 45, 42. Marks within this range are allocated based on the extent to which your technical report has the following characteristics: The analyses are of poor quality, in general, and/or the report contains a relatively small part of the information required in the assessment's instructions. The technical arguments are not clearly and correctly explained; there are many significant technical mistakes and many hard-to-understand sentences, and/or too few technical arguments.
Mark range: 38, 35, 32, 20, 10, 0. Marks within this range are allocated based on the extent to which your technical report has the following characteristics: The analyses are of very poor quality, in general, and/or the report lacks most of the information required in the assessment's instructions. The technical arguments cannot be understood, and/or the existing arguments are invalid.

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R, volume 112. Springer.
