CS 383 – Machine Learning
Assignment 2 – Classification

Introduction

In this assignment you will perform classification using Logistic Regression, Naive Bayes, and Decision Tree classifiers. You will run your implementations on a binary-class dataset and report your results. You may not use any functions from an ML library in your code unless explicitly told otherwise.

Grading

    Part 1 (Theory)                      15pts
    Part 2 (Logistic Regression)         20pts
    Part 3 (Spam Logistic Regression)    25pts
    Part 4 (Naive Bayes)                 30pts
    Report                               10pts
    TOTAL                                100

Datasets

Iris Dataset (sklearn.datasets.load_iris)

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The Iris data set is widely used as a beginner's dataset for machine learning purposes. The dataset is included in the machine learning package scikit-learn, so that users can access it without having to find a source for it. The following Python code illustrates usage:

    from sklearn.datasets import load_iris
    iris = load_iris()

Spambase Dataset (spambase.data)

This dataset consists of 4601 instances of data, each with 57 features and a class label designating whether the sample is spam or not. The features are real-valued and are described in much detail here:
https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.names
Data obtained from: https://archive.ics.uci.edu/ml/datasets/Spambase

1 Theory

1. Consider the following set of training examples for an unknown target function (x1, x2) → y:

    Y   x1   x2   Count
    +   T    T    3
    +   T    F    4
    +   F    T    4
    +   F    F    1
    -   T    T    0
    -   T    F    1
    -   F    T    3
    -   F    F    5

   (a) What is the sample entropy H(Y) of this training data (using log base 2)? (2pts)
   (b) What are the information gains for branching on variables x1 and x2? (2pts)
   (c) Draw the decision tree that would be learned by the ID3 algorithm, without pruning, from this training data. (3pts)

2. We decided that maybe we can use the number of characters and the average word length of an essay to determine if the student should get an A in a class or not. Below are five samples of this data:

    # of Chars   Average Word Length   Give an A
    216          5.68                  Yes
    69           4.78                  Yes
    302          2.31                  No
    60           3.16                  Yes
    393          4.2                   No

   (a) What are the class priors, P(A = Yes) and P(A = No)? (2pts)
   (b) Find the parameters of the Gaussians necessary to do Gaussian Naive Bayes classification on this decision to give an A or not. Standardize the features first over all the data together, so that there is no unfair bias towards features of different scales. (2pts)
   (c) Using your response to the prior question, determine if an essay with 242 characters and an average word length of 4.56 should get an A or not. (3pts)

3. Consider the following question pertaining to a k-Nearest Neighbors algorithm (1pt):

   (a) How could you use a validation set to determine the user-defined parameter k?

Sketches for sanity-checking questions 1 and 2 appear after this section.
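Question 1 is meant to be worked by hand, but a few lines of Python make a convenient sanity check for the entropy and information gains. A minimal sketch using the counts from the table above; the entropy and info_gain helpers are illustrative names, not anything the assignment requires:

    import math

    def entropy(counts):
        # Shannon entropy (base 2) of a label distribution given as counts
        total = sum(counts)
        return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

    # Counts from the table: (x1, x2) -> (# positive, # negative)
    data = {('T', 'T'): (3, 0), ('T', 'F'): (4, 1),
            ('F', 'T'): (4, 3), ('F', 'F'): (1, 5)}

    n = sum(p + m for p, m in data.values())    # 21 samples in total
    pos = sum(p for p, _ in data.values())      # 12 positive samples
    h_y = entropy([pos, n - pos])               # sample entropy H(Y)

    def info_gain(var):
        # Information gain for branching on x1 (var=0) or x2 (var=1)
        gain = h_y
        for value in ('T', 'F'):
            p = sum(v[0] for k, v in data.items() if k[var] == value)
            m = sum(v[1] for k, v in data.items() if k[var] == value)
            gain -= (p + m) / n * entropy([p, m])
        return gain

    print(f"H(Y) = {h_y:.4f}")
    print(f"IG(x1) = {info_gain(0):.4f}, IG(x2) = {info_gain(1):.4f}")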
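Question 2 can be checked the same way. A sketch, assuming the features are standardized over all five samples together (as the question instructs) and that the sample standard deviation (ddof=1) is used throughout; if your course uses the population standard deviation, change ddof accordingly:

    import numpy as np

    # Samples from the table: [# of chars, avg word length], label 1 = give an A
    X = np.array([[216, 5.68], [69, 4.78], [302, 2.31], [60, 3.16], [393, 4.2]])
    y = np.array([1, 1, 0, 1, 0])

    mu_all, sd_all = X.mean(axis=0), X.std(axis=0, ddof=1)
    Xs = (X - mu_all) / sd_all                   # standardize over all the data

    priors = {c: np.mean(y == c) for c in (0, 1)}        # class priors
    params = {c: (Xs[y == c].mean(axis=0),               # per-class feature means
                  Xs[y == c].std(axis=0, ddof=1))        # per-class feature stds
              for c in (0, 1)}

    def gaussian_pdf(x, mu, sigma):
        return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    # Standardize the query essay (242 chars, 4.56 avg word length) the same way
    query = (np.array([242, 4.56]) - mu_all) / sd_all
    for c in (0, 1):
        mu, sigma = params[c]
        posterior = priors[c] * np.prod(gaussian_pdf(query, mu, sigma))
        print(f"class {c}: unnormalized posterior = {posterior:.4f}")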
2 Logistic Regression

Let's train and test a Logistic Regression classifier to classify flowers from the Iris Dataset. First, import the data from sklearn.datasets. As mentioned in the Datasets section, the data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. We will map this into a binary classification problem: Iris setosa versus Iris virginica and Iris versicolor. We will use just the first two features, the width and length of the sepals. For this part, we will be practicing gradient descent with logistic regression. Use the following code to load the data and binarize the target values:

    import sklearn.datasets as skdata

    iris = skdata.load_iris()
    X = iris.data[:, :2]
    y = (iris.target != 0) * 1

Write a script that:

1. Reads in the data with the code above.
2. Standardizes the data using the mean and standard deviation.
3. Initializes the parameters θ using random values in the range [-1, 1].
4. Does batch gradient descent.
5. Terminates when the absolute value of the change in the loss on the data is less than 2^-23, or after 10,000 iterations have passed (whichever occurs first).
6. Uses a learning rate η = 0.01.
7. While the termination criteria mentioned above haven't been met:
   (a) Computes the loss of the data using the logistic regression cost.
   (b) Updates each parameter using batch gradient descent.

Plot the data and the decision boundary using matplotlib. Verify your solution with the sklearn LogisticRegression method:

    from sklearn.linear_model import LogisticRegression

    lgr = LogisticRegression(penalty='none', solver='lbfgs', max_iter=10000)
    lgr.fit(X, y)

In your writeup, present the thetas from gradient descent that minimize the loss function, as well as plots of your method versus the built-in LogisticRegression method. A sketch of the training loop appears after Part 3.

3 Logistic Regression Spam Classification

Let's train and test a Logistic Regression classifier to classify Spam or Not from the Spambase Dataset. First download the dataset spambase.data from Blackboard. As mentioned in the Datasets section, this dataset contains 4601 rows of data, each with 57 continuous-valued features followed by a binary class label (0 = not-spam, 1 = spam). There is no header information in this file and the data is comma separated.

Write a script that:

1. Reads in the data.
2. Randomizes the data.
3. Selects the first 2/3 (round up) of the data for training and the remainder for testing (you may use sklearn's train_test_split for this part).
4. Standardizes the data (except for the last column, of course) using the training data.
5. Initializes the parameters θ using random values in the range [-1, 1].
6. Does batch gradient descent.
7. Terminates when the absolute value of the change in the loss on the data is less than 2^-23, or after 1,500 iterations have passed (whichever occurs first; this will likely be a slow process).
8. Uses a learning rate η = 0.01.
9. Classifies each testing sample using the model, choosing the class label based on which class probability is higher.
10. Computes the following statistics using the testing data results:
    (a) Precision
    (b) Recall
    (c) F-measure
    (d) Accuracy

Implementation Details

1. Seed the random number generator with zero prior to randomizing the data.
2. There are a lot of θs, so this will likely be a slow process.

In your report you will need:

1. The statistics requested for your Logistic Regression classifier run.
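Since no ML-library functions may be used for the training itself, the gradient descent loop in Parts 2 and 3 has to be hand-rolled. Below is a minimal sketch of that loop on the Part 2 iris data, assuming the bias term is folded into θ through a leading column of ones, the loss is the mean binary cross-entropy, and a small eps guards against log(0); the seed and variable names are illustrative choices, not requirements:

    import numpy as np
    import sklearn.datasets as skdata

    # Load and binarize the iris data as described in Part 2
    iris = skdata.load_iris()
    X = iris.data[:, :2]
    y = (iris.target != 0) * 1

    # Standardize each feature (population std here; use ddof=1 if your
    # course prefers the sample version), then prepend a bias column of ones
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    X = np.hstack([np.ones((X.shape[0], 1)), X])

    rng = np.random.default_rng(0)             # illustrative seed
    theta = rng.uniform(-1, 1, X.shape[1])     # init parameters in [-1, 1]

    eta = 0.01                                 # learning rate
    prev_loss = np.inf
    for _ in range(10000):                     # iteration cap
        g = 1.0 / (1.0 + np.exp(-X @ theta))   # sigmoid of X·theta
        eps = 1e-12                            # guard against log(0)
        loss = -np.mean(y * np.log(g + eps) + (1 - y) * np.log(1 - g + eps))
        if abs(prev_loss - loss) < 2 ** -23:   # loss-change termination test
            break
        prev_loss = loss
        theta -= eta * (X.T @ (g - y)) / len(y)   # batch gradient step

    print("thetas:", theta)

Plotting the decision boundary then amounts to drawing the line θ0 + θ1·x1 + θ2·x2 = 0 over the standardized scatter. The same loop serves Part 3 once the Spambase data has been read, randomized, split, and standardized.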
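The four statistics requested in Parts 3 and 4 follow directly from the confusion-matrix counts. A sketch, assuming label 1 (spam) is the positive class and that no denominator is zero; the binary_stats name is hypothetical, not anything sklearn or the assignment defines:

    import numpy as np

    def binary_stats(y_true, y_pred):
        # Confusion-matrix counts, treating label 1 (spam) as positive
        y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        precision = tp / (tp + fp)             # assumes tp + fp > 0
        recall = tp / (tp + fn)                # assumes tp + fn > 0
        f_measure = 2 * precision * recall / (precision + recall)
        accuracy = np.mean(y_true == y_pred)
        return precision, recall, f_measure, accuracy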
4 Naive Bayes Classifier

Let's train and test a Naive Bayes classifier to classify Spam or Not from the Spambase Dataset. First download the dataset spambase.data from Blackboard. As mentioned in the Datasets section, this dataset contains 4601 rows of data, each with 57 continuous-valued features followed by a binary class label (0 = not-spam, 1 = spam). There is no header information in this file and the data is comma separated. As always, your code should work on any dataset that lacks header information and has several comma-separated continuous-valued features followed by a class id ∈ {0, 1}.

Write a script that:

1. Reads in the data.
2. Randomizes the data.
3. Selects the first 2/3 (round up) of the data for training and the remainder for testing.
4. Standardizes the data (except for the last column, of course) using the training data.
5. Divides the training data into two groups: Spam samples and Non-Spam samples.
6. Creates Normal models for each feature for each class.
7. Classifies each testing sample using these models, choosing the class label based on which class probability is higher.
8. Computes the following statistics using the testing data results:
   (a) Precision
   (b) Recall
   (c) F-measure
   (d) Accuracy

Implementation Details

1. Seed the random number generator with zero prior to randomizing the data.
2. You may want to consider using the log-exponent trick to avoid underflow issues. Here's a link about it:
   https://stats.stackexchange.com/questions/105602/example-of-how-the-log-sum-exp-trick-works-in-naive-bayes
3. If you decide to work in log space, be aware that log(0) evaluates to -inf in Python (and 0·log 0 to NaN). You should identify this situation and either add an EPS (a very small positive number) or treat the term as a value of zero.

In your report you will need:

1. The statistics requested for your Naive Bayes classifier run.
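One possible shape for steps 5–7 of the script above, assuming the features are already standardized, that "Normal models" means one univariate Gaussian per feature per class, and that all work happens in log space, which sidesteps the underflow issue raised in the Implementation Details. The function names and the sample standard deviation (ddof=1) are illustrative choices:

    import numpy as np

    def fit_models(X_train, y_train):
        # One univariate Gaussian per feature per class, plus log class priors
        models = {}
        for c in (0, 1):
            Xc = X_train[y_train == c]
            mu = Xc.mean(axis=0)
            sigma = Xc.std(axis=0, ddof=1)
            models[c] = (mu, sigma, np.log(len(Xc) / len(X_train)))
        return models

    def classify(models, X_test):
        # Work in log space: sum per-feature log densities instead of
        # multiplying probabilities, which avoids numerical underflow
        scores = []
        for c in (0, 1):
            mu, sigma, log_prior = models[c]
            log_lik = (-0.5 * np.log(2 * np.pi * sigma ** 2)
                       - (X_test - mu) ** 2 / (2 * sigma ** 2))
            scores.append(log_prior + log_lik.sum(axis=1))
        return np.argmax(np.stack(scores, axis=1), axis=1)

If a feature has zero variance within a class, sigma is 0 and the log density blows up; adding an EPS to sigma is one way to handle that, per the note above.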
Submission

For your submission, upload to Blackboard a single zip file containing:

1. PDF writeup
2. Python notebook code

The PDF document should contain the following:

1. Part 1: (a) Answers to theory questions
2. Part 2: (a) Requested Logistic Regression thetas and plots
3. Part 3: (a) Requested classification statistics
4. Part 4: (a) Requested classification statistics
