
Tutoring Case: COMP2610/COMP6261

May 15, 2020

COMP2610/COMP6261 — Information Theory
Assignment 1
Robert C. Williamson
August 18, 2019
Due 10:00 AM, 16 September

Follow the submission and format requirements spelled out on the Wattle page. COMP2610 students, do questions 1 and 2 (worth 50 and 30 marks respectively); you will be marked out of 80. COMP6261 students, do all questions.

Most of the questions require you to show a result which I state, and most build upon the claims in earlier parts. If you are unable to show a particular part, you may still use the result (even if you have not shown it) in showing subsequent parts. Thus you can start anywhere!

The assignment has a philosophical theme, which is somewhat contrary to the flavour of our text: namely, that the standard Shannon information is not sufficient for theoretically understanding data analysis (machine learning) problems. The good news is that there are other notions of information (which you will meet in the assignment) which are the "right" notions for such a task. The justification for this claim is in the deceptively simple-looking boxed expression in question 2(g); almost all preceding questions are steps in its derivation.

Question 1 (50 marks total)

Suppose $g \colon \mathbb{R}^+ \to \mathbb{R}$ is convex and $g(1) = 0$ (such a $g$ is called a generator henceforth). The generalised divergence between two distributions $p$ and $q$ on some set $\mathcal{X}$ is defined by
$$D_g(p \,\|\, q) := \mathbb{E}_{X \sim q}\left[ g\!\left( \frac{p(X)}{q(X)} \right) \right] = \sum_{x \in \mathcal{X}} q(x)\, g\!\left( \frac{p(x)}{q(x)} \right).$$
Here it is assumed that $g(0) = \lim_{x \downarrow 0} g(x)$, $0\, g(0/0) = 0$, and $0\, g(a/0) = \lim_{x \downarrow 0} x\, g(a/x)$ for $a > 0$.

(a) Prove that for all generators $g$, and for all $p, q$, $D_g(p \,\|\, q) \ge 0$. (2 marks)

(b) For any convex function $g \colon \mathbb{R}^+ \to \mathbb{R}$, the perspective of $g$ is the function $\breve{g} \colon \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}$ given by
$$\breve{g}(x, y) = y\, g(x/y).$$
It is known that the convexity of $g$ guarantees the (joint) convexity of $\breve{g}(x, y)$ (meaning it is convex in the pair $(x, y)$, and consequently convex separately in each of $x$ and $y$). Show that the map $(p, q) \mapsto D_g(p \,\|\, q)$ is convex. (2 marks)

(c) Show that $\breve{g}(x, y)$ is one-homogeneous: that is, for all $x, y$ and any $\alpha > 0$, $\breve{g}(\alpha x, \alpha y) = \alpha\, \breve{g}(x, y)$. Hence show that
$$\lim_{\alpha \to \infty} \frac{1}{\alpha}\, \breve{g}(\alpha x, \alpha y) = \breve{g}(x, y).$$
(2 marks)

(d) Using the previous result, show that for any generator $g$, $D_g(p \,\|\, q)$ satisfies
$$D_g(p \,\|\, q) = \sum_{x \in \mathcal{X}} \breve{g}(p(x), q(x)).$$
(5 marks)

(e) Suppose $p_{XY}$ and $q_{XY}$ are two joint distributions with marginals $p_Y$ and $q_Y$ respectively. Show that for any generator $g$,
$$D_g(p_{XY} \,\|\, q_{XY}) \ge D_g(p_Y \,\|\, q_Y).$$
(5 marks)

(f) Given two distributions $p_X, q_X$ and a channel $p_{Y|X}$, define
$$p_Y = \sum_{x \in \mathcal{X}} p_{Y|X}(\cdot \mid x)\, p_X(x), \qquad q_Y = \sum_{x \in \mathcal{X}} p_{Y|X}(\cdot \mid x)\, q_X(x).$$
Show the data processing theorem for generalised divergences: for any generator $g$,
$$D_g(p_Y \,\|\, q_Y) \le D_g(p_X \,\|\, q_X).$$
Hint: use the expression in part (d) for $D_g$ in terms of $\breve{g}$, and exploit the positive homogeneity and convexity of $\breve{g}$. (15 marks)

(g) Given a generator $g$, let $g^\diamond(x) := x\, g(1/x) = \breve{g}(x, 1)$, which is thus guaranteed convex (by the property of the perspective) and satisfies $g^\diamond(1) = 0$; thus $g^\diamond$ is also a generator. Show that for any generator $g$, for all $p, q$,
$$D_{g^\diamond}(p \,\|\, q) = D_g(q \,\|\, p),$$
and show that $(g^\diamond)^\diamond = g$. (5 marks)

(h) Consequently, determine the form of the generator $g_{\mathrm{KL}}$ such that
$$D_{g_{\mathrm{KL}}}(p \,\|\, q) = D_{\mathrm{KL}}(p \,\|\, q).$$
(5 marks)

(i) Let $g_{\mathrm{TV}}(x) = |x - 1|$. Show that for all $p, q$,
$$D_{g_{\mathrm{TV}}}(p \,\|\, q) = \sum_{x \in \mathcal{X}} |p(x) - q(x)|.$$
(5 marks)

(j) Show that for any $p, q$, $D_{g_{\mathrm{TV}}}(p \,\|\, q) \le 2$. Thus prove that there cannot exist a function $\Phi \colon \mathbb{R} \to \mathbb{R}$ such that for all $p, q$:
$$D_{\mathrm{KL}}(p \,\|\, q) \le \Phi\!\left( D_{g_{\mathrm{TV}}}(p \,\|\, q) \right).$$
That is, one cannot upper-bound $D_{\mathrm{KL}}$ in terms of $D_{g_{\mathrm{TV}}}$, or equivalently, one cannot lower-bound $D_{g_{\mathrm{TV}}}$ in terms of $D_{\mathrm{KL}}$ (in a manner that holds uniformly for all distributions $p, q$). (4 marks)
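Though not part of the assignment, the claims of Question 1 are easy to sanity-check numerically. Below is a minimal sketch: the helper names (`gen_divergence`, `g_kl`, `g_tv`, `dual`) are my own, not from the course, and it assumes strictly positive distributions so the $0/0$ conventions above are never triggered. It spot-checks parts (a), (f), (g), (h) and (i) on random distributions.

```python
# Minimal numerical sanity check for Question 1 (not assignment code).
# Assumes strictly positive p and q, so the 0/0 edge conventions never fire.
import numpy as np

def gen_divergence(g, p, q):
    """D_g(p || q) = sum_x q(x) g(p(x)/q(x))."""
    return float(np.sum(q * g(p / q)))

g_kl = lambda t: t * np.log(t)               # a natural candidate for part (h)
g_tv = lambda t: np.abs(t - 1.0)             # the generator of part (i)
dual = lambda g: (lambda t: t * g(1.0 / t))  # g-diamond of part (g)

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))

# (a) non-negativity; (h) D_{g_KL} recovers KL; (i) D_{g_TV} = sum |p - q|
assert gen_divergence(g_kl, p, q) >= 0.0
assert np.isclose(gen_divergence(g_kl, p, q), np.sum(p * np.log(p / q)))
assert np.isclose(gen_divergence(g_tv, p, q), np.sum(np.abs(p - q)))

# (g) duality: D_{g-diamond}(p || q) = D_g(q || p)
assert np.isclose(gen_divergence(dual(g_kl), p, q), gen_divergence(g_kl, q, p))

# (f) data processing: push p and q through the same random channel p(y|x)
channel = rng.dirichlet(np.ones(4), size=5)  # row x is p_{Y|X}(. | x)
p_y, q_y = p @ channel, q @ channel
assert gen_divergence(g_kl, p_y, q_y) <= gen_divergence(g_kl, p, q) + 1e-12
print("Question 1 sanity checks passed")
```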
Question 2 (30 marks total)

Let $X$ be a random variable on $\mathcal{X}$ and let $Y$ be a binary random variable. Let $\eta(x) := P(Y = 1 \mid X = x) = \mathbb{E}(Y \mid X = x)$. Let
$$f^*(x) := \mathbb{I}_{\{\eta(x) > 1/2\}}(x),$$
where $\mathbb{I}_A(x)$ is the indicator function of $A$: $\mathbb{I}_A(x) = 1$ if $x \in A$, and $\mathbb{I}_A(x) = 0$ if $x \notin A$; when we write $\mathbb{I}_{\{\Phi(x)\}}$ for some proposition $\Phi$, we mean $\mathbb{I}_{\{x \in \mathcal{X} \,:\, \Phi(x)\}}$. Let us assume henceforth, for simplicity, that $\eta(x) \ne 1/2$ for all $x \in \mathcal{X}$. (This assumption can easily be removed by defining $f^*$ as above, except that when $\eta(x) = 1/2$, $f^*$ is defined to be a $\mathrm{Bern}(1/2)$ random variable. All of the results below actually hold for that more general case.)

(a) Suppose $f \colon \mathcal{X} \to \{0, 1\}$ is an arbitrary decision function. Show that
$$P(f(X) \ne Y \mid X = x) = 1 - \left( \mathbb{I}_{\{f(x) = 1\}}(x)\, \eta(x) + \mathbb{I}_{\{f(x) = 0\}}(x)\, (1 - \eta(x)) \right).$$
(4 marks)

(b) Show that for any $f \colon \mathcal{X} \to \{0, 1\}$, and all $x \in \mathcal{X}$,
$$P(f(X) \ne Y \mid X = x) - P(f^*(X) \ne Y \mid X = x) \ge 0.$$
(4 marks)

(c) Thus show that for any $f$,
$$P(f^*(X) \ne Y) \le P(f(X) \ne Y),$$
and hence that $f^*$ is the optimal decision function. (3 marks)

(d) Show that for all $x \in \mathcal{X}$,
$$P(f^*(X) \ne Y \mid X = x) = \min(\eta(x), 1 - \eta(x)),$$
and thus that
$$P(f^*(X) \ne Y) = \mathbb{E}_X \min(\eta(X), 1 - \eta(X)).$$
(3 marks)

(e) The quantity $P(f^*(X) \ne Y)$ denotes the error of the optimal decision function; this is the smallest error one can make in predicting $Y$ from $X$ (regardless of method or computational resources). Show that
$$P(f^*(X) \ne Y) = \frac{1}{2} - \frac{1}{2}\, \mathbb{E}_X\left[ |2\eta(X) - 1| \right].$$
Hint: use $\min(a, b) = \frac{1}{2}(a + b) - \frac{1}{2}|a - b|$. (3 marks)

(f) Let $p(x) := P(X = x \mid Y = 1)$ and $q(x) := P(X = x \mid Y = 0)$, for $x \in \mathcal{X}$. Suppose $P(Y = 1) = 1/2$. Show that
$$D_{g_{\mathrm{TV}}}(p \,\|\, q) = 2\, \mathbb{E}_X |2\eta(X) - 1|.$$
(8 marks)

(g) With $f^*$, $p$, $q$, $X$ and $Y$ as in part (f), thus show that
$$P(f^*(X) \ne Y) = \frac{1}{2} - \frac{1}{4}\, D_{g_{\mathrm{TV}}}(p \,\|\, q),$$
and thus that the optimal performance of a statistical classifier is governed by the $D_{g_{\mathrm{TV}}}$ divergence between the "class-conditional distributions" $p$ and $q$. (2 marks)

(h) Using question 1(j), argue that no such control of the error performance of a statistical classifier is possible in terms of $D_{\mathrm{KL}}$, and thus that for the purposes of statistical classification (machine learning) the standard notion of information (i.e. Shannon information) does not suffice. (3 marks)

Question 3 (20 marks total)

Suppose $X$ is a random variable on a set $\mathcal{X}$ with $|\mathcal{X}| = M$. Let $p_X$ denote its distribution. Let $p_{\max} := \max_{x \in \mathcal{X}} p_X(x)$.

(a) Let $p = (p_{\max}, p_2, \ldots, p_M)$, where $p_{\max} \ge p_i$ for $i = 2, \ldots, M$. Let
$$q = \left( p_{\max}, \frac{1 - p_{\max}}{M - 1}, \ldots, \frac{1 - p_{\max}}{M - 1} \right).$$
Show that
$$D_{\mathrm{KL}}(p \,\|\, q) = (1 - p_{\max}) \log(M - 1) + h_2(p_{\max}) - H(X),$$
and thus that for any random variable $X$ on $\mathcal{X}$,
$$H(X) \le (1 - p_{\max}) \log(M - 1) + h_2(p_{\max}) =: F_M(p_{\max}).$$
(10 marks)

(b) Show that the function $F_M$ defined above is concave. (5 marks)

(c) Suppose $|\mathcal{X}| = M$ and that $X \longrightarrow Y \longrightarrow \hat{X}$ forms a Markov chain. The idea is that one wants to predict $X$ by using $\hat{X}$. Let $\hat{X} = \operatorname{argmax}_x p_{X|Y}(x \mid y)$ (i.e. choose the most likely value of $x$ as one's estimate $\hat{X}$). Let $p_e := P(X \ne \hat{X})$ (the probability of error in predicting $X$ by using $\hat{X}$). Show that
$$H(X \mid Y) \le F_M(P(X = \hat{X})) = p_e \log(M - 1) + h_2(p_e).$$
(5 marks)
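As a complement to Question 2, here is a small illustrative check (again, the variable names are mine, not assignment code) of the boxed relation in part (g): with $P(Y = 1) = 1/2$, the Bayes error $\mathbb{E}_X \min(\eta(X), 1 - \eta(X))$ from part (d) coincides with $\frac{1}{2} - \frac{1}{4} D_{g_{\mathrm{TV}}}(p \,\|\, q)$ for random class-conditional distributions.

```python
# Sketch for Question 2(g): with P(Y = 1) = 1/2, the Bayes error equals
# 1/2 - (1/4) * D_{g_TV}(p || q). Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(6))  # p(x) = P(X = x | Y = 1)
q = rng.dirichlet(np.ones(6))  # q(x) = P(X = x | Y = 0)

px = (p + q) / 2               # P(X = x) under P(Y = 1) = 1/2
eta = p / (p + q)              # eta(x) = P(Y = 1 | X = x)

bayes_error = np.sum(px * np.minimum(eta, 1 - eta))  # part (d)
d_tv = np.sum(np.abs(p - q))                         # Question 1, part (i)
assert np.isclose(bayes_error, 0.5 - 0.25 * d_tv)    # part (g)
```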
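Similarly, the bound $H(X) \le F_M(p_{\max})$ of Question 3(a) can be spot-checked on random distributions. The sketch below assumes natural logarithms; the choice of base only rescales both sides of the inequality.

```python
# Spot check of Question 3(a): H(X) <= F_M(p_max), in nats.
# Illustrative sketch, not assignment code.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -float(np.sum(p * np.log(p)))

def h2(t):
    """Binary entropy of t."""
    return entropy(np.array([t, 1.0 - t]))

def F(M, p_max):
    return (1.0 - p_max) * np.log(M - 1) + h2(p_max)

rng = np.random.default_rng(2)
M = 8
for _ in range(100):
    p = rng.dirichlet(np.ones(M))
    assert entropy(p) <= F(M, p.max()) + 1e-12
print("H(X) <= F_M(p_max) held on all samples")
```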
