- May 15, 2020

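University of Sydney BUSS6002 Assignment 1: Coursework Walkthrough

Task: process house price data and build a model to predict house prices.

Analysis:
1) Process the data computationally, identify any unreasonable or erroneous entries in the data and justify the choice, and explain the relationships between the variables.
2) Answer the following questions: if a regression model of house prices is built, should it include an intercept term; is multicollinearity a potential problem for the dataset; if only three variables can be used, which three best predict house prices; build the regression model from those three variables.
3) Improve the resulting model; explain the purpose of using EDA and present the results; compare the new and old models and explain why Adjusted R-Squared is used for the comparison; explain why the new model is reasonable.
4) Explain how the data science process was used for modelling and evaluation, choosing one process model for the answer; if the firm considers investing in other locations, explain whether the resulting model can be applied there.

Topics covered: data analysis, EDA, regression models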

2019S2 BUSS6002 Assignment 1

Due Date: Friday 27 Sep 2019
Value: 15% of the total mark

Instructions

1. Required Submission Items:
   1. ONE written report (PDF format), submitted via Canvas.
      • Assignments > Report Submission (Assignment 1)
   2. ONE Jupyter Notebook .ipynb, submitted via Canvas.
      • Assignments > Upload Your Code File (Assignment 1)
2. The assignment is due at 12:00pm (noon) on Friday, 27 Sep 2019. The late penalty for the assignment is 5% of the assigned mark per day, starting after 12:00pm on the due date. The closing date Friday, 4 Oct 2019, 12:00pm (noon) is the last date on which an assessment will be accepted for marking.
3. As per anonymous marking policy, please include your Student ID only in the report and do NOT include your name. The name of the report and code file must follow: SID_BUSS6002_Assignment1. Failing to name your submitted files correctly would incur a penalty.
4. Your answers should be provided as a final report giving full explanation and interpretation of any results you obtain. Output without explanation will receive zero marks. You are required to also submit code that can reproduce your reported results, as reproducibility is a key component to data science. Not submitting your code will lead to a loss of 50% of the assignment mark.
5. Be warned that plagiarism between individuals is always obvious to the markers of the assignment and can be easily detected by Turnitin.
6. Presentation of the assignment is part of the assignment. There will be 10 marks for the presentation of your report and code submission.
7. The report should be NOT more than 10 pages including text, figures, tables, small sections of inserted code etc. Think about the best and most structured way to present your work, summarise the procedures implemented, support your results/findings and prove the originality of your work. You will provide your code as a separate submission to the report; however, you may insert small sections of your code into the report when necessary.
8. Your code submission has no length limit, however marks are assigned for code presentation, so make your code as concise as possible and add comments when necessary to explain your logic and the purpose of each code segment. Make sure to remove any unnecessary code and ensure that your code can be run without error.
9. Numbers with decimals should be reported to the third-decimal point.

Project Description and Dataset

Suppose you are working as a Data Scientist for a real estate investment firm. The firm is assessing locations for investing in housing redevelopment in the United States. For this purpose, the firm has identified several potential locations in Seattle to purchase existing houses, which would be demolished to make space for the redevelopment.

In order to estimate the costs involved the firm needs to know the current market value of the houses that it needs to purchase. You are working on a project that aims to build a model to estimate the house prices.

Seattle’s Department of Assessments has been collecting data since 2014 on house sale prices and the characteristics of each house that was sold. You have been given access to a copy of original database “house.db”, which is an SQLite file, as well as a data dictionary file “house_dict.txt”. You can download the dataset and detailed dataset description from the BUSS6002 Canvas site.

Hint: To list all tables in the database you can use the following query

    SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;

Task 1

To start your analysis, you wish to perform a thorough EDA to help you better understand the given datasets. The results you obtain in this task will be used to inform your modelling choice.

Requirements:
a. Check and deal with any missing data (if any) in the given dataset.
b. Look for and remove any potential outliers (if any) that would possibly affect your modelling. Justify your answer.
c. Visualise the relationships between explanatory variables and the target variable through appropriate plotting. Report your analysis and findings.

Task 2

Suppose now you want to build a prototype model to predict house sale prices, which will be demonstrated to a wider team. Therefore, it needs to be easily understood by non-experts, meaning that you can only use a few variables in your model as a starting point.

In order to make informed decisions on your modelling choices, you need to answer the following questions:
a. Suppose you would like to build a linear regression model to predict house sale prices, do you wish to include an intercept term in your model? Carefully explain your answer.
b. Do you think multicollinearity could be a potential problem on the given dataset? Use your understanding of variables to justify your answer and verify your hypothesis using appropriate numeric measures. Explain your decisions to proceed based on your findings.
c. If you wish to use only three variables to predict house sale prices, which three variables would you choose? Carefully justify your choice and explain your selection criterion.
d. Build a linear regression model using the three variables you have chosen (Use original, i.e. not engineered, variables for this task). Report and interpret your regression results.
e. Perform residual diagnostics to measure the goodness of fit. Report your findings.

Task 3

The model you have built so far provides an approximate estimate of house prices. However, to accurately estimate the costs of the redevelopment plan you must be able to estimate house prices as accurately as possible.

Your goal is now to improve your model as much as you can through feature engineering and feature selection. You may consider all variables and apply appropriate transformation to the variables as necessary.

Requirements:
a. Your model should have a minimum adjusted R-Squared of 75%. If your modelling cannot achieve an adjusted R-Squared of 75%, report the best model you can obtain.
b. Justify your choice of feature engineering strategies using EDA and present your results.
c. Compare your new model with the model you have built in Task 2 with respect to Adjusted R-Squared. Explain why you should use Adjusted R-Squared here to compare the two models.
d. Provide residual analysis to justify why your new model is more reasonable.

Task 4

Suppose you have finished your analysis, now you need to report to your manager and reflect on what you have experimented with in your project:
a. Provide a reflection of how you have utilised the data science process model to arrive at modeling and model evaluation based on how you answered the previous three questions. Choose only one process model (CRISP-DM or Snail Shell) to answer this question. Explain how each part of the questions aligns with the different phases of the process model.
b. The firm is also considering redevelopment projects in other locations. Comment on whether the model you have built can or cannot be applied in other locations. Justify your answer.

Marking Outline

20 marks
30 marks
30 marks
10 marks
10 marks
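
As a minimal sketch of the hint in the Project Description, the table-listing query can be run from Python with the standard-library sqlite3 module. The helper name list_tables, the in-memory database, and the table names sales and attributes are hypothetical stand-ins so the snippet runs without the assignment data; the brief does not state which tables house.db actually contains.

```python
import sqlite3

# Query taken from the hint in the brief, written with plain ASCII quotes.
TABLES_QUERY = "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;"


def list_tables(conn):
    """Return the names of all tables in the connected SQLite database, sorted by name."""
    return [row[0] for row in conn.execute(TABLES_QUERY)]


# Demonstration on an in-memory database with two made-up tables
# (hypothetical names, not the real contents of house.db).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sales (id INTEGER, price REAL)")
demo.execute("CREATE TABLE attributes (id INTEGER, sqft REAL)")

demo_tables = list_tables(demo)
print(demo_tables)

demo.close()
```

For the assignment data, the same helper could be called on sqlite3.connect("house.db"), assuming the file sits in the working directory of the notebook.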