Tutoring Case: STAT1003

  • June 5, 2020

STAT1003 Take-Home Project

STAT1003 Introduction to Data Science
Take-Home Project
Semester 1, 2020

Objective

This take-home project is one of three assessments (along with Tests 1 & 2) in this unit. It is worth 40% of the overall mark.

The main objective is to allow the participants in the unit to demonstrate their grasp of the fundamentals of data science that have been discussed during the unit: getting data that will help answer one or more substantive questions; cleaning, wrangling, and then exploring it using interesting and informative visualizations; carrying out statistical modelling or machine learning in order to make predictions; and communicating results in a compelling way. The project should focus on the analysis of a substantive dataset that participants may obtain from online sources.

Up to now, in both lectures and tutorials, we have analyzed data and fitted predictive models as if the steps to do so were clear, well laid out, and led invariably to a ‘correct’ answer. Reality, however, is messier. There is not a linear path from problem and data to solution, and one of the pedagogical objectives of the project is to allow participants to get some sense of that.

Participants should work in teams of 3 people. Analysis and reporting are to be carried out in R/RStudio using R Markdown.

Assessment

Task                                                Mark    Due
Proposal (2–3 pp.)                                  5%      TW8
Written Report (10–15 pp. excluding appendices)     25%     TW12
Oral Presentation                                   5%      TW12
Reflections on Workshops                            5%      TW12

Details of how to go about getting data and of the Project Proposal are shown below, and rubrics and other reference material will be available on Blackboard shortly.

Data

There are many public sources of data available, including open data websites such as OpenDataSoft. The appendix also contains a list of data websites compiled by an academic at the University of Idaho.
The idea is to find a dataset that is sufficiently complex to allow you to demonstrate your familiarity with the methods studied in the unit, and with methods that we have not covered. You will find yourself more motivated if you select a dataset from a field that is of interest to you. If you’re not sure about the dataset, please see me to discuss it.

Project Proposal

The project proposal is a short (2–3 page) Word document produced using R Markdown that contains:

1. Title
2. Data & Analyses
   a. Objective: What do you plan on analyzing/predicting/classifying and why?
   b. Where do the data come from? Have these data been analyzed before?
   c. Describe context and variables and their types; show one or two representative plots/tables.
   d. What is the extent of data wrangling and cleaning that will be required? What analyses do you propose to carry out?
   e. How will you evaluate the predictive/classification models?

Project Report

The project report should be written as a formal technical report. It can be written wholly in R Markdown and then converted to Word, or in some combination of R Markdown for technical appendices and Word for the main body. There is no prescribed structure, but it should contain the following elements:

1. Introduction: Problem Statement and Background
   • What is the problem you are trying to solve? Where do the data come from? Include background material as appropriate. What other analyses have been carried out on this data?
   • What was the broader context of the questions you were trying to answer? Are you mainly carrying out a detailed/complex exploratory analysis, or are you also going to try to predict an outcome variable?
2. Methods
   • What are the methods you used for exploratory analysis and for prediction/classification? Provide background information on methods that we did not cover in the unit.
   • What data cleaning/wrangling/tidying did you have to do before analysis? Were there outliers that you had to remove?
   • What worked, and what didn’t work?
3. Results
   • What’s the story you can tell about your dataset based on the analyses you carried out?
   • Provide a detailed description of your results.
   • If the data have been analyzed before, how well did your results/methods compare to those that others used?
   • Use informative and interesting visualizations for EDA and for displaying your results, so that the reader gains insight into them.
4. Conclusions and Lessons Learned
   • Summarize the main results here.
   • What would you have done differently? What other methods could you have used? What worked and didn’t work?
5. Appendices
   • Put all your R code/intermediate results here.

Depending on the complexity of the problem you have decided to tackle, the main body of the report will be 10–15 pages long, including a handful of important plots and tables. The appendix should contain the R Markdown file and the resulting output from your data wrangling, exploratory data analysis, and quantitative analysis.

If you use any external resources such as books or websites – and you are encouraged to do so! – please make sure that you cite them appropriately. The appendix contains the rubric that will be used for marking the report.

You will be required to add the following statement to your report:

1. This assignment is my/our own original work, except where I/we have appropriately cited the original source (appropriate citation of original work will vary from discipline to discipline).
2. This assignment has not previously been submitted in any form for this or any other unit, degree or diploma at any university or other institute of tertiary education.
3. I/we acknowledge that it is my responsibility to check that the file I/we have submitted is: a) readable, b) the correct file and c) fully complete.

Oral Presentation

The last lecture/workshop slot will be devoted to oral presentation of your work.
Depending on the number of presentations, each presentation will be 8–12 minutes long, plus some time for questions. A rubric for assessing oral presentations can be found in the appendix.

Appendix A  Rubric for Project Reports

Student Names:

Each criterion is assessed at one of three levels: Does not meet expectations, Meets expectations, or Exceeds expectations. The maximum score for each criterion is shown in brackets.

Introduction (/10)
Does not meet expectations:
– No or very limited background information
– Very hard to grasp what the problem is and why the data analysis is important
– No outline for remaining report is given. No idea what to expect in the remainder of the report
Meets expectations:
– Some background information is given
– Some information is provided that may not be related to the problem at hand
– Has some idea of what the problem is and its importance
– Outline for remaining report is provided
Exceeds expectations:
– Presents a clear and concise lead-in to the remainder of the report
– Appropriate background information (including references) presented in organized fashion
– Problem is well developed. Hypotheses, if appropriate, are clearly stated
– Outline of the remaining report present and easily understandable

Description of Data and Methods (/10)
Does not meet expectations:
– Far too brief or vague in describing data analysis methods to be used
– No emphasis put on the description of the data set and/or key aspects associated with the data
– No apparent link between question and methods used
Meets expectations:
– Adequately describes selection of data and provides a general explanation on source of data
– Adequately explains and justifies methods to be used
– Minor issues related to the data and analysis may be omitted or not well explained
Exceeds expectations:
– Presents a clear, concise description of the data, sources, and statistical methods used in the report
– Important aspects of data set are clearly explained
– Summary tables (if presented) are understandable and self-contained

Methodology (/10)
Does not meet expectations:
– Methods incorrect and/or very poorly implemented
– Incorrect use of methods or using methods that address a different question
– Clear violation of assumptions
– Key aspects of the data set and/or methodological limitations are ignored
Meets expectations:
– Methods appear to be sound but others could have been used
– Adequately explains and justifies the methodology used and assumptions made
– A more straightforward, concise approach to address question at hand could have been used
Exceeds expectations:
– Utilizes sound exploratory and analysis methods to address the problem
– Methods are not too complicated or simple for the problem at hand
– Possibly questionable underlying assumptions are explained and/or checked
– Uses methods/analyses not discussed in class

Analysis (/40)
Does not meet expectations:
– Very little description of what was found or far too much extraneous information presented
– Summary tables and/or results are difficult to read and/or understand
– Presentation of results is incomplete and there is no organization
– No interpretation of the results relative to the problem is given
Meets expectations:
– Description of the analysis is mostly clear but may be too brief or too long
– Presentation of results in tabular and graphical form is adequate and clear
– Appropriate and adequate interpretation of the results is carried out
Exceeds expectations:
– Clear, concise description of what was found and how it was found
– Results are presented in an easy to understand way with only the necessary pieces of information presented
– Results are interpreted and linked back to the problem at hand
– Tables and figures are easy to understand and self-contained

Conclusions (/10)
Does not meet expectations:
– Completely illogical explanation of the findings and/or how they impact the problem
– Does not highlight important results
– Does not make recommendations for alternative/future analyses
Meets expectations:
– Adequately reviews the problem and highlights main results
– Interpretation and impact on results are explained, but there may be some omissions
– Provides some recommendations for future work
Exceeds expectations:
– Section satisfactorily reviews the problem
– The results are well summarized
– Interpretation and impact of results on the problem are clearly explained
– Recommendations for further/alternative studies are understandable and well thought out

Structure/Documentation & References (/10)
Does not meet expectations:
– Report not structured in logical fashion
– Paragraphs are poorly organized; use of sections is illogical and hinders document navigation
– R code/intermediate output not included in appendix, or appendix is incomplete
– Fails to correctly document any sources or to utilize appropriate citation forms
Meets expectations:
– Report is structured in a logical manner with some improvements possible
– Paragraphs are usually well-organized; use of sections is logical and generally allows easy navigation of the document
– R code and intermediate output appear in appendix
– Most sources are correctly documented; appropriate citation forms are generally utilized
Exceeds expectations:
– Report is organized in a clear, logical manner
– All paragraphs are well-organized; use of sections is logical and allows easy navigation through the document
– R code and intermediate output appear in appendix and are easy to navigate
– All sources are correctly and thoroughly documented; appropriate citation forms are utilized throughout

Grammar (/10)
Does not meet expectations:
– Sentences are poorly written; there are numerous incorrect word choices and errors in grammar, punctuation and spelling
Meets expectations:
– Sentences are generally well-written; there are a few incorrect word choices and errors in grammar, punctuation and spelling
Exceeds expectations:
– Sentences are well-written; there are no incorrect word choices and the text is free of errors in grammar, punctuation and spelling

Total /100

A Phatak, April 2017: Adapted from University of Colorado, Electrical, Computer & Energy Engineering and Purdue University, Department of Statistics

Appendix B  Rubric for Oral Presentations

Names:

Introduction of topic (/5)
Does not meet expectations: Topic introduced.
Meets expectations: Topic introduced clearly, and purpose of talk was made clear.
Exceeds expectations: Topic introduced clearly and in an interesting way. Purpose of talk was made clear. Outline of points was given.

Development of topic (/20)
Does not meet expectations: Some understanding of topic shown. Some links and connections made between ideas. Points are usually developed with minimum detail. Information is usually relevant.
Meets expectations: Good understanding of topic shown. Links and connections between ideas made clear. Information was relevant and expressed in own words. Points were developed with sufficient and appropriate details.
Exceeds expectations: A very good understanding of the topic shown. Links and connections between ideas made clear. Information was relevant and well expressed in own words. Points were well-organised and developed with sufficient and appropriate details.

Technical accuracy (/15)
Does not meet expectations: There were some inconsistencies within the information presented and no explanation was provided.
Meets expectations: Method, assumptions and analysis were consistent with no apparent anomalies.
Exceeds expectations: Method, assumptions and analysis were consistent. Limitations were recognised.

Voice: clarity, pace, fluency; pronunciation (/10)
Does not meet expectations: Presenter occasionally spoke clearly and at a good pace. Pronunciation occasionally correct, but often hesitant and inaccurate.
Meets expectations: Presenter usually spoke clearly to ensure audience comprehension. Delivery was usually fluent. Pronunciation and intonation are usually correct.
Exceeds expectations: Presenter spoke clearly and at a good pace to ensure audience comprehension. Delivery was fluent and expressive. Pronunciation and intonation are correct and confident.

Visual aids (/15)
Does not meet expectations: Slides were unclear and/or wordy.
Meets expectations: Clear, concise and effective.
Exceeds expectations: Clear, concise and effective. Good use of diagrams or other visuals.

Conclusion of topic (/10)
Does not meet expectations: An attempt was made to conclude the presentation.
Meets expectations: The presentation was summed up clearly.
Exceeds expectations: The presentation was summed up clearly and effectively, with key points emphasised.

Answering questions (/15)
Does not meet expectations: Not all questions could be answered. Questions answered with difficulty, and little knowledge of the topic was demonstrated.
Meets expectations: Most questions answered. Answers showed good knowledge and understanding of the topic. Language was mainly correct.
Exceeds expectations: Questions answered with little difficulty. Very good knowledge of the topic was demonstrated. Language was correct and fluent.

Timing (/10)
Does not meet expectations: Too short or stopped by Chair.
Meets/exceeds expectations: On time (+/- 1 min).

Total /100

Appendix C  Additional Data Sources

(Hyperlinked and compiled by Stephen Sauchi Lee, University of Idaho.)

1. 200,000+ Jeopardy questions
2. Awesome Public Datasets on github, curated by caesar0301.
3. AWS (Amazon Web Services) Public Data Sets, provides a centralized repository of public data sets that can be seamlessly integrated into AWS cloud-based applications.
4. Canada Open Data, pilot project with many government and geospatial datasets.
5. Causality Workbench data repository.
6. CDC Data: Medical data from the Centers for Disease Control and Prevention
7. Census.gov: US government source of data about the nation’s people and economy
8. CKAN: Open-source data portal platform
9. Corral Big Data repository at Texas Advanced Computing Center, supporting data-centric science.
10. CrowdFlower Data for Everyone library.
11. Data Market: Portal for shared business data
12. Data Planet, the largest repository of standardized and structured statistical data, with over 25 billion data points, 4.3 billion datasets, 400+ source databases.
13. Data Source Handbook, A Guide to Public Data, by Pete Warden, O’Reilly (Jan 2011).
14. Data.gov: Source of machine readable datasets generated by the US government
15. Data.gov.uk, publicly available data from UK (also London datastore).
16. Data.gov/Education, central guide for education data resources including high-value data sets, data visualization tools, resources for the classroom, applications created from open data and more.
17. Datacatalogs.org, open government data from US, EU, Canada, CKAN, and more.
18. DataFerrett, a data mining tool that accesses and manipulates TheDataWeb, a collection of many on-line US Government datasets.
19. DataMarket, visualize the world’s economy, societies, nature, and industries, with 100 million time series from UN, World Bank, Eurostat and other important data providers.
20. DataSF.org, a clearinghouse of datasets available from the City & County of San Francisco, CA.
21. Dataverse Network: Repository for research datasets
22. Delve, Data for Evaluating Learning in Valid Experiments
23. Donors Choose: data related to their projects
24. EconData, thousands of economic time series, produced by a number of US Government agencies.
25. Enron Email Dataset, data from about 150 users, mostly senior management of Enron.
26. Europeana Data, contains open metadata on 20 million texts, images, videos and sounds gathered by Europeana – the trusted and comprehensive resource for European cultural heritage content.
27. FEDSTATS, a comprehensive source of US statistics and more
28. FIMI repository for frequent itemset mining, implementations and datasets.
29. Financial Data Finder at OSU, a large catalog of financial data sets.
30. FiveThirtyEight: data and code related to their articles
31. Free SVG Maps: Website for free geographic maps
32. GDELT: The Global Data on Events, Location and Tone, described by Guardian as “a big data history of life, the universe and everything.”
33. GeoDa Center, geographical and spatial data.
34. Google ngrams datasets, text from millions of books scanned by Google.
35. Google Public Data Explorer: Google’s public data portal to explore, visualize, and communicate large datasets
36. Grain Market Research, financial data including stocks, futures, etc.
37. Guardian DataBlog: Data journalism and data visualization from the Guardian
38. HitCompanies Datasets, comprehensive data on random 10,000 UK companies sampled from HitCompanies, updated automatically using AI/Machine Learning.
39. ICWSM-2009 dataset contains 44 million blog posts made between August 1st and October 1st, 2008.
40. IMDb Datasets: Webpage for access to IMDb datasets
41. Infochimps, an open catalog and marketplace for data. You can share, sell, curate, and download data about anything and everything.
42. Investor Links, includes financial data
43. Jake Hofman Data Links: Jake Hofman’s bookmarked computational social science data resources
44. Jerry Smith dataset collection, with Finance, Government, Machine Learning, Science, and other data.
45. Kaggle – home of Data Science
46. KDD Cup center, with all data, tasks, and results.
47. KDnuggets Data Repositories List: Data repository list maintained by KDnuggets, a popular data mining website
48. Kevin Chai list of datasets, for text, SNA, and other fields.
49. KONECT, the Koblenz Network Collection, with large network datasets of all types in order to perform research in the area of network mining.
50. Last.fm Datasets: Webpage for access to Last.fm datasets
51. Linked Data: Linkage site for distributed data
52. Linking Open Data project, aimed at making data freely available to everyone.
53. Million Song Dataset
54. MIT Cancer Genomics gene expression datasets and publications, from MIT Whitehead Center for Genome Research.
55. ML Data, the data repository of the EU Pascal2 networks.
56. mldata.org: A public repository for machine learning data
57. NASDAQ Data Store, provides access to market data.
58. National Government Statistical Web Sites, data, reports, statistical yearbooks, press releases, and more from about 70 web sites, including countries from Africa, Europe, Asia, and Latin America.
59. National Space Science Data Center (NSSDC), NASA data sets from planetary exploration, space and solar physics, life sciences, astrophysics, and more.
60. NetworkRepository: Interactive Data Repository, has many collections of graphs and networks from social science, machine learning, scientific computing, and other areas.
61. Open Data Census, assesses the state of open data around the world.
62. Open Source Sports, many sports databases, including Baseball, Football, Basketball, and Hockey.
63. OpenData from Socrata, access to over 10,000 datasets including business, education, government, and fun.
64. Peter Skomoroch (LinkedIn) Data Links: Peter Skomoroch’s bookmarked machine learning data resources
65. PubGene(TM) Gene Database and Tools, genomic-related publications database
66. Quandl, a collaboratively curated portal to millions of financial and economic time-series datasets.
67. qunb, a platform to find and visualize quantitative data.
68. RealClimate Data: Aggregator for selected sources of code and data related to climate science
69. Reddit Open Data: Forum on the social news site reddit for open APIs and datasets
70. Reddit Top 2.5 Million: all-time top 1,000 posts from each of the top 2,500 subreddits
71. Robert Shiller data on housing, stock market, and more from his book Irrational Exuberance.
72. SMD: Stanford Microarray Database, stores raw and normalized data from microarray experiments.
73. SourceForge.net Research Data, includes historic and status statistics on approximately 100,000 projects and over 1 million registered users’ activities at the project management web site.
74. StateMaster: Reference site for data on US states
75. StatLib, CMU Datasets Archive.
77. The Upshot: data related to their articles
78. Time Series Data Library
79. UCI Datasets: The UC Irvine Machine Learning Repository, a popular source of machine learning datasets
80. UCI KDD Database Repository for large datasets used in machine learning and knowledge discovery research.
81. UCR Time Series Data Archive, offering datasets, papers, links, and code.
82. UFO reports: geolocated and time-standardized UFO reports for close to a century
83. UK’s Met Office Data: Climate station records from the UK’s National Weather Service
84. UK’s Office for National Statistics: Source of datasets generated by the UK’s Office for National Statistics
85. United States Census Bureau.
86. Visual Analytics Benchmark Repository.
87. Web Data Commons, structured data from the Common Crawl, the largest public web corpus.
88. Wikipedia Database: Webpage for access to complete Wikipedia database dumps
89. Wikiposit, a (virtual) amalgamation of (mostly financial) data from many different sites, allowing users to merge data from different sources
90. Wolfram Alpha disease and patient level data.
91. Wolfram|Alpha: Computational knowledge engine or answer engine
92. World Bank Catalog: World Bank data
93. Yahoo Sandbox datasets, Language, Graph, Ratings, Advertising and Marketing, Competition
94. Yelp Academic Dataset, all the data and reviews of the 250 closest businesses for 30 universities for students and academics to explore and research.
95. Yelp Dataset Challenge: Yelp reviews, business attributes, users, and more from 10 cities
