
Tutoring Case: DSC 190

May 15, 2020

DSC 190: Midterm
May 7, 2020

Guidelines

- Exam duration: 9:30 AM, May 7 to 9:30 AM, May 8 (PT).
- There are 9 problems in total.
- You are not supposed to code anything for this midterm.
- Submission must be made on Gradescope.
- Some of the problems may not have unique answers. Your justification and reasoning will be the most important part.

1. SVM (10 points): Suppose you have the following dataset with 6 observations and 2 classes (Green, Blue).

   x1   x2   y
   2    4    Green
   4    4    Blue
   2    1    Green
   2    2    Green
   4    3    Blue
   4    2    Blue

   Table 1: Dataset for SVM

   (a) Draw a rough plot of the 2D observations and also draw the maximal margin hyperplane separating the two classes. What is the equation for the hyperplane?
   (b) What are the support vectors for the maximal margin classifier?
   (c) What is the margin for the maximal margin hyperplane?

   You have to solve this question manually. Code will not be accepted as a valid answer.

2. Logistic Regression (10 points): Assume we collect some data on NFL players with two variables, X1 = number of hours of training per week and X2 = player rating on a scale of 1 to 5, and Y = whether the player scores a touchdown (Yes = 1, No = 0). We fit a logistic regression and find the estimated weights w0 = -6, w1 = 0.05, w2 = 1. Assume the regression model is accurate to answer the following questions:

   (a) Estimate the probability that a player who trains for 40 hours per week and has a rating of 3.5 scores a touchdown in the upcoming game.
   (b) How many hours per week would the player in part (a) need to train to have a 50% chance of scoring a touchdown in the next game?

3. Overfitting, Bias vs. Variance (10 points):

   (a) List three common strategies to address overfitting.
   (b) Draw a graph of bias and variance vs. model complexity to show how bias and variance change as the model complexity increases. Briefly explain the graph.

4.
Comparing Data Mining Concepts/Methods (10 points): Briefly describe two major differences between the following pairs of concepts or methods. Please be concise in your explanations.

   (a) Linear Regression and Logistic Regression
   (b) Linear Regression and Linear SVM
   (c) Linear SVM and Kernel SVM
   (d) k-Means and Expectation-Maximization (EM)
   (e) PCA and t-SNE
   (f) Frequent Patterns and Association Rules

5. Naive Bayes (10 points): In this question, please give the final value as well as the necessary intermediate steps. Suppose you have the following training set with three Boolean input variables x, y, and z, and a Boolean output variable U.

   x   y   z   U
   1   0   0   0
   0   1   1   0
   0   0   1   0
   1   0   0   1
   0   0   1   1
   0   1   0   1
   1   1   0   1

   Table 2: Input-output features

   Suppose you have trained a Naive Bayes classifier to predict U with x, y, and z as features. Once the learning is finished:

   (a) What would be the predicted probability P(U = 0 | x = 0, y = 1, z = 0)?
   (b) What would be the predicted probability P(U = 0 | x = 0)?

   For the next two parts, assume that a Naive Bayes classifier is learned by considering the combination (x, y, z) as one feature instead of x, y, z being three different features:

   (c) What would be the predicted probability P(U = 0 | x = 0, y = 1, z = 0)?
   (d) What would be the predicted probability P(U = 0 | x = 0)?

6. Evaluation Measurement (10 points): Consider the sentiment detection task where the possible labels are Positive and Negative. Suppose you are evaluating on two datasets, Dataset-A and Dataset-B. The dataset statistics are given in Tables 3 and 4.
   Label      Number of Samples
   Positive   900
   Negative   100

   Table 3: Dataset-A statistics

   Label      Number of Samples
   Positive   550
   Negative   450

   Table 4: Dataset-B statistics

                         Actual Positive   Actual Negative
   Predicted Positive    800               70
   Predicted Negative    100               30

   Table 5: Confusion matrix on Dataset-A

                         Actual Positive   Actual Negative
   Predicted Positive    450               130
   Predicted Negative    100               320

   Table 6: Confusion matrix on Dataset-B

   (a) Which of the following performance metrics would be the best representative of the performance of any classifier on Dataset-A? That is, high performance on Dataset-A according to a metric M implies that the classifier is very good. Explain your reasoning.
       (i) Precision (ii) Recall (iii) F1-score (iv) Accuracy
   (b) Which of the following performance metrics would be the best representative of the performance of any classifier on Dataset-B? That is, high performance on Dataset-B according to a metric M implies that the classifier is very good. Explain your reasoning.
       (i) Precision (ii) Recall (iii) F1-score (iv) Accuracy
   (c) Suppose we evaluated a classifier on Dataset-A and Dataset-B and obtained the confusion matrices shown in Tables 5 and 6. For each dataset, compute Precision, Recall, F1-score, and Accuracy. Comment on the performance of the classifier: on which dataset is the classifier more effective, and why?
   (d) Consider the metric M = (TP × TN) / (FN × FP), where TP is the number of true positives, TN the number of true negatives, FN the number of false negatives, and FP the number of false positives. Come up with a scenario in which this metric is useful and is representative of the performance of the classifier, and justify why.

7. Model Choices for Binary Classification (10 points):

   (a) Given a binary classification training set of 1,000,000 instances, suppose 1% of the training instances were wrongly labeled. Which classifier would you prefer for training on this dataset?
Decision Tree or Random Forest? Why?

   (b) Given a binary classification training set of 1,000,000 instances, suppose there are a few outliers, as observed in a visualization of the training data. Which classifier would you prefer for training on this dataset: Logistic Regression or SVM? Why?
   (c) To build a classifier on high-dimensional features using small training data, one needs to consider the scenario where many features are just irrelevant noise. To train a generalizable classifier, would you use Naive Bayes or Logistic Regression? If you choose Logistic Regression, which regularization setting would you use, and why?

8. Arya Mixture Model (20 points): Suppose there exists a distribution called Arya (a distribution created by us) whose probability density function is given below. Suppose there are K clusters, each cluster is characterized by an Arya distribution, and the cluster priors are given below. Assume each data point Xi ∈ R+ (i = 1, ..., n) is drawn as follows:

   P(Zi = k) = πk for k = 1, 2, ..., K
   Xi ~ Arya(2, βZi)

   The probability density function of Arya(2, β) is:

   P(X = x) = β^2 x e^(−βx)

   (a) Suppose K = 3 and β1 = 1, β2 = 2, β3 = 4. What is P(Z = 1 | X = 1)?
   (b) Describe the E-step. Write an equation for each value that is computed.

9. K-means Clustering (10 points): Given the (x, y) pairs in Table 7, you have to cluster them into 2 clusters using the k-means algorithm. Assume k-means uses Euclidean distance.

   Data Point Index   x      y
   1                  1.90   0.97
   2                  1.76   0.84
   3                  2.32   1.63
   4                  2.31   2.09
   5                  1.14   2.11
   6                  5.02   3.02
   7                  5.74   3.84
   8                  2.25   3.47
   9                  4.71   3.60
   10                 3.17   4.96

   Table 7: Dataset for k-means clustering

   Figure 1: Dataset for k-means

   The plot of the points is shown in Figure 1. Let the first cluster center be the tenth data point and the second cluster center be the first data point. Run the k-means algorithm for 1 iteration with the number of clusters = 2.
What are the cluster assignments after 1 iteration? What are the cluster assignments after convergence? Fill in the table below. You need not code this; you can do it manually by calculating the Euclidean distances.

   Data Point Index   Cluster Assignment after One Iteration   Cluster Assignment after Convergence
   1
   2
   3
   4
   5
   6
   7
   8
   9
   10

   Table 8: Cluster assignments
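The exam requires manual solutions, so none of the snippets below would be a valid submission; they are only post-hoc arithmetic checks. For Problem 2, the logistic model with the stated weights can be evaluated directly (the variable names are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w0, w1, w2 = -6.0, 0.05, 1.0

# Part (a): 40 hours of training, rating 3.5.
p = sigmoid(w0 + w1 * 40 + w2 * 3.5)  # sigmoid(-0.5) ~ 0.3775

# Part (b): a 50% chance means the linear score is exactly 0.
hours_for_50 = (0 - w0 - w2 * 3.5) / w1  # 50.0 hours
```

The part (b) step just solves w0 + w1·h + w2·3.5 = 0 for h.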
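For Problem 5 parts (a) and (b), the Naive Bayes posteriors over Table 2 can be checked with exact fractions. This is a minimal sketch using maximum-likelihood estimates with no smoothing (the exam does not specify smoothing, so that is an assumption), and the helper names are mine:

```python
from fractions import Fraction

# Training set from Table 2: rows of (x, y, z, U).
rows = [(1, 0, 0, 0), (0, 1, 1, 0), (0, 0, 1, 0),
        (1, 0, 0, 1), (0, 0, 1, 1), (0, 1, 0, 1), (1, 1, 0, 1)]

def prior(u):
    return Fraction(sum(1 for r in rows if r[3] == u), len(rows))

def cond(feat, value, u):
    # P(feature = value | U = u), maximum-likelihood estimate.
    sub = [r for r in rows if r[3] == u]
    return Fraction(sum(1 for r in sub if r[feat] == value), len(sub))

def posterior_u0(evidence):
    # P(U = 0 | evidence) under the naive (conditional independence) assumption;
    # evidence maps feature index (0=x, 1=y, 2=z) to its observed value.
    scores = {}
    for u in (0, 1):
        s = prior(u)
        for feat, value in evidence.items():
            s *= cond(feat, value, u)
        scores[u] = s
    return scores[0] / (scores[0] + scores[1])

p_a = posterior_u0({0: 0, 1: 1, 2: 0})  # part (a)
p_b = posterior_u0({0: 0})              # part (b)
```

Using Fraction keeps the intermediate products exact, which matches the exam's request for intermediate steps.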
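For Problem 6(c), the four metrics follow mechanically from each confusion matrix; a small sketch (function name is mine) to verify hand computations on Tables 5 and 6:

```python
def metrics(tp, fp, fn, tn):
    # Standard definitions with Positive as the target class.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Table 5 (Dataset-A) and Table 6 (Dataset-B).
prec_a, rec_a, f1_a, acc_a = metrics(tp=800, fp=70, fn=100, tn=30)
prec_b, rec_b, f1_b, acc_b = metrics(tp=450, fp=130, fn=100, tn=320)
```

Comparing the two sets of numbers is what part (c) asks you to comment on.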
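For Problem 8(a), the posterior is the Bayes ratio of prior-weighted Arya densities at x = 1. The exam leaves the priors πi symbolic, so the uniform priors below (πi = 1/3) are an assumption purely for illustration; with symbolic priors the answer keeps the πi factors:

```python
import math

def arya_pdf(x, beta):
    # P(X = x) = beta^2 * x * exp(-beta * x), the Arya(2, beta) density.
    return beta ** 2 * x * math.exp(-beta * x)

betas = [1.0, 2.0, 4.0]
priors = [1 / 3, 1 / 3, 1 / 3]  # assumption: uniform cluster priors

weighted = [pi * arya_pdf(1.0, b) for pi, b in zip(priors, betas)]
posterior_z1 = weighted[0] / sum(weighted)  # P(Z = 1 | X = 1) ~ 0.306
```

The same ratio, written for a generic data point, is exactly the responsibility computed in the E-step of part (b).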
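For Problem 9, a tiny 2-cluster k-means over Table 7's points can double-check the manual distance calculations (again, not a valid submission; helper names are mine):

```python
points = [(1.90, 0.97), (1.76, 0.84), (2.32, 1.63), (2.31, 2.09), (1.14, 2.11),
          (5.02, 3.02), (5.74, 3.84), (2.25, 3.47), (4.71, 3.60), (3.17, 4.96)]

def dist2(a, b):
    # Squared Euclidean distance; the argmin matches the true distance.
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def assign(centers):
    return [min(range(len(centers)), key=lambda k: dist2(p, centers[k]))
            for p in points]

# Initial centers: tenth data point (cluster 0) and first data point (cluster 1).
centers = [points[9], points[0]]
after_one_iter = assign(centers)

# Iterate to convergence: recompute centroids, then reassign.
labels = after_one_iter
while True:
    centers = [
        (sum(p[0] for p, l in zip(points, labels) if l == k) / labels.count(k),
         sum(p[1] for p, l in zip(points, labels) if l == k) / labels.count(k))
        for k in range(2)
    ]
    new_labels = assign(centers)
    if new_labels == labels:
        break  # assignments stable -> converged
    labels = new_labels
```

On this data, points 1-5 join the cluster seeded at point 1 and points 6-10 join the cluster seeded at point 10, and the assignment is already stable after the first iteration.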

Author: admin