辅导案例-COMPSCI 671D

COMPSCI 671D Fall 2019 Homework 4 1 EM Algorithm for Topic Modeling (35 points) In this question we will try to design an algorithm for discovering the abstract topics that occur in a set of documents. In the problem, we have a set of M abstract documents in the universe, D, which are mixtures of latent topics. We also have a dictionary of words, W, with size N . Intuitively, W is the list of all unique words that occur at least once within D. Using these two sets we can denote the number of times word wj occurs in document di as n(wj , di). It follows that we can represent a document di as a vector of size N , each entry of which corresponds to how many times word wj appears in it. The topics z are latent variables in a topic set Z sized K. Intuitively, if a document is about a certain topic, it will include some particular words with higher probability. For example, “music”, “show” and “film” appear frequently in documents about Arts while “school”, “student” and “teachers” usually occur if the document is about Education. Meanwhile, a document can be a mixture of several topics, e.g., a document can be about arts education for high school students. Inspired by the intuition, a reasonable approach to model the generative process is as follows. Pr(di, zk, wj) = Pr(di) Pr(zk|di) Pr(wj |zk) which means the topics only depend on the documents and the words only depend on topics. For computational convenience, let Pr(di) be a uniform distribution and Pr(zk|di) and Pr(wj |zk) be Multinomial distributions. i.e. Pr(di) = 1 M Pr(zk|di) = αik, K∑ k=1 αik = 1 Pr(wj |zk) = βkj , N∑ j=1 βkj = 1. We are interested in learning αik and βkj because they are substantively interesting: αik represents the proportion of topic k that makes up document i, and βkj the probability of seeing word j under topic k. Therefore, a single word wj in a document di is generated in this way: (1) Choose a topic zk from the topic distribution Pr(zk|di), the probability of choosing topic k is αik; (2) Choose a word from the vocabulary distribution p(wj |zk) in this topic, the probability is βkj . We want to learn these parameters by maximizing the likelihood of the data using the EM algorithm. 1 a) For our set of N documents and W word of choices, write down the log-likelihood of the model above, i.e. log(p(D = di,W = wj |α, β)). Remember that in the EM (Expectation-Maximization) algorithm, we can figure out the parame- ters and the hidden variables by iteratively computing (1) the distribution of latent variables using the old parameters and (2) find new parameters that maximize the likelihood using the old latent variable distribution. So, you can do the following: b) E-step: Derive the distribution of latent variable p(zk|wj , di, αold, βold) by fixing the old pa- rameters. c) M-step: Find the αnew and βnew that optimize the log likelihood using p(zk|wj , di, αold, βold) as the distribution of z. You would iterate these until convergence to get the final parameters in practice. Just so you know, the model you have just been working with in this problem is a very famous and very popular model. 🙂 2 Neural Networks (65 points) In this problem you will get the chance to construct a neural network with architecture 784−H − H − 1, where H is the number of hidden nodes you choose. This neural network can be used for binary classification, such 0, 1 digits classification for MNIST. The model can be represented as f(x; θ) : R784 → [0, 1] where θ = {W1 ∈ R784×H ,b1 ∈ RH ,W2 ∈ RH×H ,b2 ∈ RH ,w3 ∈ RH , b3 ∈ R, }. Explicitly, f(x) is h1 = σ ( W>1 x+ b1 ) h2 = σ ( W>2 h1 + b2 ) f(x) = σ ( w>3 h2 + b3 ) where σ(z) = 1 1+e−z , and it is an element-wise operator, which means if x = (x 1, x2, …, xd) ∈ Rd, we have σ(x) = (σ(x1), σ(x2), …, σ(xd)). The loss function for this model (or the negative log-likelihood) is the usual one: L(θ) = − N∑ i=1 (yi log f(xi) + (1− yi) log(1− f(xi))) a) Backpropogation Derive the gradients ∇w3L(θ), ∇b3L(θ), ∇W2L(θ), ∇b2L(θ), ∇W1L(θ), ∇b1L(θ) by backprop- agation. Hints: The problem would be much easier if you define zl = W T l hl−1 + bl, and derive the relationship between ∇zlL(θ) and ∇zl−1L(θ). Also, you can give the final result in a recursive way as long as it is correct and consistent. 2 b) Weight Initialization Neural Networks are normally trained by gradient descent based optimization algorithms, for which we have to give initial values for our parameters (weight and bias). The bias parameters are usually initialized as 0’s, while the weight parameters are usually initialized by independently sampling from N(0, σ2). (i) What are the potential problems if we choose σ to be either too small or too large? (ii) One nice property we would like our weights to have is that by proper initialization, we want the values before activation to be similar in distribution for each layer. It turns out we can achieve this by choosing σ properly. Now suppose we use ReLU activation in our neural network and the inputs are normalized. We want to choose the proper σl for the l-th hidden layer such that V ar(zl) = V ar(zl−1). We assume the neurons in each layer are mutually independent and share the same distribution and we use zl to represent the random variable that generates zl, where zl = W T l hl−1 + bl and hl−1 = max(0, zl−1) and Wl ∈ Rnl−1×nl . Find σt such that V ar(zl) = V ar(zl−1) and prove your result. c) Implement your Neural Network Use the preprocessed MNIST data provided in the attachment. MNIST is a dataset of images of handwritten digits that also contain labels for the number they correspond to. We want to train a neural network to learn to recognize these handwritten digits. Use gradient descent with the gradients you have derived in part a) to train a neural network on the data. Report the accuracy for the network on the train and test data. Since the sample size is large, we can approximate the true gradient with stochastic gradient for computational convenience: ∇L(θ) ≈ 1 B B∑ j=1 ∇L(θ)j which means at each step, instead of calculating the gradients with all samples in the dataset, we randomly pick a mini-batch of B samples and calculate the gradient using these samples. In implementing the neural network, it would be beneficial to build and train the model in a general way. That is, build a NN with d hidden layers such that we can easily modify d. d) Vanishing Gradient (i) In our current model, we have 2 hidden layers. What is the magnitude of ∇W1L(θ) (you can use Frobenius norm or induced matrix norm)? Now try to modify the the number of hidden layers d, from 1 to 10. What happens to the magnitude of ∇W1L(θ) when we increase d? Plot the magnitude against d on log scale. In this question, you don’t need to train the whole neural network, just initialize it with random weights, e.g. N (0, 1), and calculate the gradient. (ii) Suppose that all weights are finite and the weight matrices W2,W3, …,Wd are diagonal- izable matrices in which the absolute value of the eigenvalues are upper-bounded by 4 − . Prove that lim d→∞ ∇W1L(θ) = 0 Hints: Try to use the the relationship between ∇zlL(θ) and ∇zl−1L(θ) and try to apply a common inequality. 3 (iii) The result above is known as the vanishing gradient problem in the feedforward neural network. Will there still be such a problem if we use a tanh activation? What if we use a ReLU activation? Give an intuitive explanation of why vanishing gradient happens in light of the shapes of acti- vation functions and provide 2 ways of dealing with this problem. e) Sigmoid Activation One potential problem of using the sigmoid activation function for hidden layers is that it maps from R to [0, 1], which means the output of the sigmoid function is always positive. What happens to the signs of the gradients of hidden layer weights in this situation if we only have one example? What if we are adding gradients up across a batch of data? 4

辅导案例-COMPSCI 671D

Related

Previous Post辅导案例-GU4206/GR5206

Next Post辅导案例-E 109

Author admin