BUAN 4310 Project 3: Predicting home prices in King County

Objectives:

In this project, we analyze a sample drawn from the larger dataset 'kc_house_data_2.csv'. Using that sample, we build and select an appropriate model for predicting home prices in King County.

Data Description:

The house_2.csv dataset contains tens of thousands of customer records with many variables, such as year, price, number of bathrooms, number of bedrooms, zipcode, and square footage.

Data Import:

We removed every variable we judged unnecessary for the model. To decide which ones, we used scatterplots, a correlogram, and domain knowledge.

Among the removed variables are the day of the week and the day of the sale, since we usually don't know exactly when a person will want to sell. We also removed sqft_living and sqft_lot, because the sqft_living15 and sqft_lot15 variants hold post-renovation numbers and are therefore more up to date. We kept variables like condition and zipcode, because customers usually care about a house's condition and location.

Besides removing variables, we converted all the categorical variables into factors and removed all the missing values.
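The cleaning step above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's script: the sample rows, the choice of which columns count as categorical, and the `clean` helper are all assumptions made for the example. Casting values to string labels stands in for converting a column to a factor.

```python
# Hypothetical rows standing in for house_2.csv; None marks a missing value.
rows = [
    {"price": 350000.0, "zipcode": 98103, "condition": 3, "waterfront": 0},
    {"price": None, "zipcode": 98052, "condition": 4, "waterfront": 0},
    {"price": 910000.0, "zipcode": 98004, "condition": 5, "waterfront": 1},
]

# Assumed set of categorical columns (the README does not list them).
CATEGORICAL = ("zipcode", "condition", "waterfront")


def clean(rows, categorical=CATEGORICAL):
    """Drop rows with any missing value, then cast categorical columns to labels."""
    complete = [r for r in rows if all(v is not None for v in r.values())]
    return [
        {k: (str(v) if k in categorical else v) for k, v in r.items()}
        for r in complete
    ]


cleaned = clean(rows)
print(len(cleaned), cleaned[0]["zipcode"])
```

Treating zipcode as a label rather than a number matters for a linear model: the numeric value of a zipcode carries no ordering that should affect price.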

Building & evaluating the model

For the model, we chose linear regression because we feel it is a good fit for this problem and because it tells us which variables are the most significant in predicting housing prices.

For training, we used seed number 669 and a 60/40 training/validation split. This means that 60% of the data is used to train the model, while the remaining 40% is used to check whether the model still performs well on data it has not seen.
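A seeded 60/40 split can be sketched as follows. This is an illustrative Python version using only the standard library; the function name `train_valid_split` and the placeholder records are assumptions. Note that the same seed in a different language or library will not reproduce the project's exact partition. The seed only makes this sketch repeatable.

```python
import random


def train_valid_split(rows, train_frac=0.6, seed=669):
    """Shuffle row indices with a fixed seed and split them train/validation."""
    rng = random.Random(seed)          # local generator, so global state is untouched
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_frac * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    valid = [rows[i] for i in indices[cut:]]
    return train, valid


# Placeholder records; in practice these would be the cleaned data rows.
records = list(range(100))
train, valid = train_valid_split(records)
print(len(train), len(valid))
```

Every record lands in exactly one of the two sets, and rerunning with the same seed gives the same partition, which keeps the validation results comparable between runs.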

After building the model and checking for errors, we determined that:

The model is a good fit, with an adjusted R-squared of 80.87%.
There are 10 variables that are significant in predicting price: Year, bathroom, floors, waterfront, grade, sqft_basement, yr_built, yr_renovated, zipcode, and sqft_living15.
Comparing the training set with the validation set, we didn't find any evidence of overfitting.
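The two metrics used in this evaluation, RMSE and adjusted R-squared, can be computed as in the sketch below. The helper names and the numbers are made up for illustration and are not the project's results. One common way to check for overfitting, assumed here, is to compute RMSE on both the training and validation predictions and see whether the validation error is much larger.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def adjusted_r2(actual, predicted, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    n = len(actual)
    mean = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)


# Made-up values, only to show how the functions are called.
train_actual = [1.0, 2.0, 3.0, 4.0, 5.0]
train_pred = [1.1, 1.9, 3.2, 3.8, 5.0]
valid_actual = [2.0, 4.0, 6.0]
valid_pred = [2.3, 3.8, 6.1]

print("train RMSE:", rmse(train_actual, train_pred))
print("valid RMSE:", rmse(valid_actual, valid_pred))
print("train adj. R2:", adjusted_r2(train_actual, train_pred, n_predictors=1))
```

A validation RMSE close to the training RMSE suggests the model generalizes; a much larger validation RMSE would point to overfitting.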

Discussion & Evaluation

The final model was fairly accurate: its RMSE is relatively low, and its adjusted R-squared (80.87%) is relatively high.

In the real world, many of these variables can be put to use. Grade is one example: a house with a high grade can sell for a higher price. A grading system might not be available in every county, but other variables, such as condition, bathroom, and zipcode, are all significant and also affect housing prices.

