
  • January 31, 2021

Machine Learning – Lab Mini-Project 1
PPHA 30545 – Professor Clapp
Winter 2021

This assignment must be handed in via Gradescope on Canvas by 11:45pm Central Time on Monday, February 1st.

You are welcome (and encouraged!) to form study groups (of no more than 3 students) to work on the problem sets and mini-projects together. But you must write your own code and your own solutions. Please be sure to include the names of those in your group on your submission.

You should submit your code as a single Python (*.py) file and the write-up of your solutions as a single PDF. For the former, please also be sure to follow the good coding practices you learned in PPHA 30535/6: comment your code, cite any sources you consult, etc. For the latter, you may type your answers or write them out by hand and scan them (as long as they are legible).

You are allowed to consult the textbook authors’ websites, Python documentation, and websites like StackOverflow for general coding questions. You are not allowed to consult material from other classes (e.g., old problem sets, exams, answer keys) or websites that post solutions under the guise of tutoring.

1 Overview

After graduating from Harris, you are quickly hired to work for the President’s Council of Economic Advisors (CEA).[1] The CEA is an agency within the Executive Branch that provides the President with objective advice to inform both domestic and international policy. According to its webpage, the “[CEA] bases its recommendations and analysis on economic research and empirical evidence, using the best data available to support the President in setting our nation’s economic policy.” Your boss has asked you to conduct research using data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS) to predict the returns to education and inform policy.
Your analysis will help shape your office’s recommendations to the President and help set her education agenda with a specific focus on the expansion of access to higher education.[2] The project has three parts: (1) obtaining data from the Internet, (2) cleaning that data, and (3) performing data analysis and answering questions.

[1] Your family is very proud and all of your friends are jealous of your great gig. You tell them you’re so glad that you took Machine Learning, as it really helped you land the job.
[2] The ACS contains information similar to the Decennial Census Long Form Questionnaire that it replaced after the 2000 Census. It is an annual sample of one in 40 households in the country. For reference, every decade the Long Form sampled one in 6 households. See https://www.census.gov/programs-surveys/acs/technical-documentation/pums.html for more information.

2 Obtaining the Data

1. First, navigate to the IPUMS USA website: https://usa.ipums.org/usa/index.shtml [3]
2. Choose “Browse and Select Data” from the menu on the left.
3. Choose “Select Samples” by clicking the light blue box.
4. Select the most current year of ACS data only. Do not include the 3 and 5-year versions of the data.[4] Then “Submit sample selections.”
5. Now you get to go shopping for data.[5] Under “Select Variables” ->
   (a) “Person” -> “Demographics,” add the following to your cart:
       i. AGE
       ii. SEX
       iii. MARST
   (b) “Person” -> “Family Interrelationship,” add the following to your cart:
       i. NCHILD
       ii. NCHLT5
   (c) “Person” -> “Race, Ethnicity and Nativity,” add the following to your cart:
       i. RACE
       ii. HISPAN
   (d) “Person” -> “Education,” add the following to your cart:
       i. EDUC
   (e) “Person” -> “Work,” add the following to your cart:
       i. EMPSTAT
   (f) “Person” -> “Income,” add the following to your cart:
       i. INCWAGE
   (g) “Person” -> “Veteran Status,” add the following to your cart:
       i. VETSTAT
6. Click on the “View Cart” button. Check to make sure you got everything. Click on the “Create data extract” button.
[3] Census Bureau datasets are notoriously difficult to download in usable forms. In order to make the data more accessible, the wonderful people at the Institute for Social Research and Data Innovation at the University of Minnesota created the Integrated Public Use Microdata Series (IPUMS), which is an awesomely streamlined way to get your hands on the data you want. Note that they make many additional datasets available for download via (for example) IPUMS International, IPUMS Global Health, and IPUMS Time Use, among others.
[4] In order to ensure large enough sample sizes to maintain confidentiality, the Census pools data over multiple years for geographic units with fewer people.
[5] This is like opening birthday presents for a data scientist!

7. Click on “Customize sample sizes.”
   (a) Since we’re dealing with a large sample from the national population, we have far more observations than we can easily process. Under “Households,” enter “10” so the dataset you create has 10,000 households. This makes working with the data easier, but will still give us a “big data” dataset.[6]
   (b) Click “Submit.”
8. Click on “Select cases.”
   (a) Since we’re looking at wages as a function of education, we’re only going to keep those involved in the labor force. Select EMPSTAT. Just to be safe, let’s also restrict our sample by age. Select AGE.
   (b) Click “Submit.”
   (c) Check “Include only those persons meeting case selection criteria.”
       i. Under EMPSTAT, check the box for “Employed” workers.
       ii. For AGE, select ages from 18 to 65.
   (d) Click “Submit.”
9. To the right of “Data Format,” click on “Change.”
   (a) Select “Comma delimited (.csv)” or whatever your preferred format is.
   (b) Select “Rectangular, person (default).”
   (c) Click “Submit.”
10. Give your extract a brief description.
11. Click on the box that says “Submit extract.”
12. Request an account or sign in.
13. Finally, hit “Submit extract.”[7]
14.
    Once your extract has been created, navigate to the IPUMS download page: https://usa.ipums.org/usa-action/data_requests/download
   (a) Click on the “Download CSV” link (in the first column). Save the file to your hard drive.
   (b) Right-click on the “Basic” codebook file and save the *.cbk (text) file to your hard drive.
   (c) Unzip the data file and load the data in Python. For help unzipping a *.gz file (Unix’s version of *.zip), check out “Step 2: Decompress the data file” here: https://usa.ipums.org/usa/extract_instructions.shtml (just note that the instructions are for the *.dat (text) file and you want the *.csv file).

[6] When you “Customize sample sizes,” IPUMS will randomly draw 10,000 observations for you. Since this is a random process and the “Select cases” step occurs after the random draw of observations, don’t be worried if a study partner has a slightly different number of observations (up to a few hundred).
[7] It will take the IPUMS system a little while to create your extract, so go take a break or work on something else. The IPUMS system will email you once your extract has been created. Try to contain your excitement over the fun data that you’ll soon get to play with, lest friends and family think you’re weird.

3 Preparing the Data

1. First, take a few minutes to become familiar with the data.
2. For our analysis, we’ll need to use the codebook we saved to clean and create a few variables.
   (a) Education – We have a categorical measurement of education (educd). For some of our analysis, we need a continuous variable. Use the educd variable to create a continuous measure of education called educdc using the crosswalk at the end of this document. A *.csv version of the crosswalk is available on Canvas.
   (b) Dummy Variables – Create the following dummy variables:
       i. A dummy, hsdip, equal to 1 if the individual has a high school diploma (but not a bachelor’s or higher degree).
          Note: in general, how one codes individuals with a GED or associate’s degree is a decision the researcher has to make based on the context of his/her research question. To keep things standard for the project, code these individuals as having a high school diploma.
       ii. A dummy, coldip, equal to 1 if the individual has a four-year college diploma (or a higher degree that required earning a college diploma first).
       iii. A dummy, white, equal to 1 if the individual is white.
       iv. A dummy, black, equal to 1 if the individual is black.
       v. A dummy, hispanic, equal to 1 if the individual is of Hispanic origin.
       vi. A dummy, married, equal to 1 if the individual is married.
       vii. A dummy, female, equal to 1 if the individual is female.
       viii. A dummy, vet, equal to 1 if the individual is a veteran.
   (c) Interaction Terms – Create an interaction between each of the education dummy variables (hsdip and coldip, items i-ii above) and education.
   (d) Created Variables – Create the following:
       i. Age squared.
       ii. The natural log of incwage.

4 Data Analysis

1. Compute descriptive (summary) statistics for the following variables: year, incwage, lnincwage, educdc, female, age, age2, white, black, hispanic, married, nchild, vet, hsdip, coldip, and the interaction terms. In other words, compute sample means, standard deviations, etc.
2. Scatter plot ln(incwage) and education. Include a linear fit line. Be sure to label all axes and include an informative title.
3. Estimate the following model:

   $$\ln(\text{incwage}) = \beta_0 + \beta_1\,\text{educdc} + \beta_2\,\text{female} + \beta_3\,\text{age} + \beta_4\,\text{age}^2 + \beta_5\,\text{white} + \beta_6\,\text{black} + \beta_8\,\text{hispanic} + \beta_9\,\text{married} + \beta_{10}\,\text{nchild} + \beta_{11}\,\text{vet} + \varepsilon$$

   and report your results.
   (a) What fraction of the variation in log wages does the model explain?
   (b) Test the hypothesis that

       $$H_0: \beta_1 = \beta_2 = \dots = \beta_{11} = 0$$
       $$H_A: \beta_j \neq 0 \ \text{for some } j$$

       with α = 0.10.
   (c) What is the return to an additional year of education? Is this statistically significant? Is it practically significant? Briefly explain.
   (d) At what age does the model predict an individual will achieve the highest wage?
   (e) Does the model predict that men or women will have higher wages, all else equal? Briefly explain why we might observe this pattern in the data.
   (f) Interpret the coefficients on the white, black, and hispanic variables.
   (g) Test the hypothesis that race has no effect on wages. Be sure to explicitly state the null and alternative hypotheses and show your calculations.
4. Graph ln(incwage) and education. Include three distinct linear fit lines specific to individuals with no high school diploma, a high school diploma, and a college degree. Be sure to label all axes and include an informative title.
5. Since the President is considering new education legislation, she asks you to determine whether a college degree is a strong predictor of wages. Write down a model that will allow the returns to education to vary by degree acquired (use the three categories in the previous question).[8] Be sure to include the controls from question 3. Explain/justify why you think your model is the best possible representation of the way the world works.
   [8] These are known as “sheepskin” effects.
6. Estimate the model you proposed in the previous question and report your results.
   (a) Predict the wages of a 22-year-old female individual (who is neither white, black, nor Hispanic, is not married, has no children, and is not a veteran) with a high school diploma and an otherwise identical individual with a college diploma. Assume that it takes someone 12 years to graduate high school and 16 years to graduate college.
   (b) The President wants to know, given your results, do individuals with college degrees have higher predicted wages than those without? By how much? Briefly explain.
   (c) The President asked you to look into this question because she is considering legislation that will expand access to college education (for instance, by increasing student loan subsidies).
       She will only support the legislation if there are cost offsets (that is, if college education increases wages and, therefore, future income tax revenues that help reduce the net cost of the subsidy). Given that criterion, how would you advise the President?
7. There are many ways that this model could be improved. How would you do things differently if you were asked to predict the returns to education given the data available on IPUMS?

Table 1: Crosswalk

educd   educdc
2       0
10      0
11      2
12      0
13      2.5
14      1
15      2
16      3
17      4
20      6.5
21      5.5
22      5
23      6
24      7.5
25      7
26      8
30      9
40      10
50      11
61      12
62      12
63      12
64      12
65      13
70      13
71      14
80      14
81      14
82      14
83      14
90      15
100     16
101     16
110     17
111     18
112     19
113     20
114     18
115     18
116     22
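
As an illustration of how the crosswalk in Table 1 might be applied, here is a minimal Python sketch using only the standard library. It looks up educdc from educd and adds the two created variables from Section 3 (age squared and the natural log of incwage). The function name add_created_variables, the lowercase dictionary keys, the two made-up records, and the choice to leave lnincwage empty when incwage is not positive are all assumptions made for this example; they are not part of the assignment.

```python
import math

# Crosswalk copied from Table 1: educd code -> continuous education (educdc).
CROSSWALK = {
    2: 0, 10: 0, 11: 2, 12: 0, 13: 2.5, 14: 1, 15: 2, 16: 3, 17: 4,
    20: 6.5, 21: 5.5, 22: 5, 23: 6, 24: 7.5, 25: 7, 26: 8,
    30: 9, 40: 10, 50: 11, 61: 12, 62: 12, 63: 12, 64: 12,
    65: 13, 70: 13, 71: 14, 80: 14, 81: 14, 82: 14, 83: 14,
    90: 15, 100: 16, 101: 16, 110: 17, 111: 18, 112: 19, 113: 20,
    114: 18, 115: 18, 116: 22,
}


def add_created_variables(row):
    """Return a copy of one person record with educdc, age2 and lnincwage added.

    Hypothetical helper: `row` is assumed to be a dict with the keys
    "educd", "age" and "incwage".
    """
    out = dict(row)
    # None signals an educd code that does not appear in Table 1.
    out["educdc"] = CROSSWALK.get(row["educd"])
    out["age2"] = row["age"] ** 2
    # ln(x) is undefined for x <= 0, so such wages are left as None (an assumption).
    out["lnincwage"] = math.log(row["incwage"]) if row["incwage"] > 0 else None
    return out


# Made-up records for illustration only; they are not drawn from the ACS.
people = [
    {"educd": 63, "age": 30, "incwage": 40000},
    {"educd": 101, "age": 45, "incwage": 0},
]
cleaned = [add_created_variables(p) for p in people]
```

If the extract is loaded with pandas instead, the same lookup would typically be done by mapping the educd column through the crosswalk (for example with Series.map) or by merging with the crosswalk *.csv from Canvas. How to treat zero wages before taking logs is a judgment call that the write-up should state explicitly.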