---
title: "House Price Prediction using Locally Weighted Linear Regression"
author: Girish Palya
output:
  beamer_presentation:
    theme: "Berkeley"
    colortheme: "beaver"
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## TL;DR
Does the value of a house depend on the neighborhood it is located in? Real estate agents would say yes. To explore the effect of historical selling prices of neighboring houses on the selling price of a given house, a locally (geo-spatially) weighted linear regression (LWR) model is used to predict selling prices, and the results are compared to those of a parametric linear regression model. LWR makes more accurate predictions than linear regression.
## Methodology
Locally weighted linear regression (LWR) minimizes the following cost function:
$$
\sum_{i = 1}^{m}{w^{(i)}(y^{(i)} - \theta^TX^{(i)})^2}
$$
where

- $m$: number of training observations
- $w^{(i)}$: weight of the $i^{th}$ observation
- $y^{(i)}$: target/output variable (scalar)
- $\theta$: parameter to be estimated (vector)
- $X^{(i)}$: feature/input vector
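For a fixed set of weights, this is a weighted least squares problem with the closed-form solution $\theta = (X^TWX)^{-1}X^TWy$, where $W$ is the diagonal matrix of weights. A minimal sketch in R (not from the original analysis; `X`, `y`, and `w` are assumed to be a feature matrix, target vector, and weight vector):

```{r lwr-theta, eval=FALSE}
# Sketch: closed-form weighted least squares, theta = (X'WX)^{-1} X'Wy.
# X: m x n feature matrix (with an intercept column), y: targets, w: weights.
lwr_theta <- function(X, y, w) {
  W <- diag(w)                        # m x m diagonal weight matrix
  solve(t(X) %*% W %*% X, t(X) %*% W %*% y)
}
```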
## Weights
The weight function is given by
$$
w^{(i)} = \exp{\left(-\frac{\left(X^{(i)} - X\right)^2}{2\tau^2}\right)}
$$
- $w^{(i)} \approx 1$ if $|X^{(i)} - X|$ is small
- $w^{(i)} \approx 0$ if $|X^{(i)} - X|$ is large

Here $X$ is the feature/input vector of the test example, and $\tau$ is the bandwidth (window size). The net effect is that the cost function effectively sums only over the $X^{(i)}$'s close to $X$ and ignores the far-away ones.
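A minimal sketch of the weight computation in R (an assumed helper, not part of the original code):

```{r lwr-weights, eval=FALSE}
# Sketch: Gaussian weights for one test point x, given training matrix
# X_train and bandwidth tau.
lwr_weights <- function(X_train, x, tau) {
  d2 <- rowSums(sweep(X_train, 2, x)^2)  # squared Euclidean distances
  exp(-d2 / (2 * tau^2))
}
```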
## Shape of $w^{(i)}$
The shape of $w^{(i)}$ is a bell curve:
```{r echo=FALSE, fig.height=3}
# Gaussian weight centered at the test point X = 5, with bandwidth tau = 1
X <- 5
X_i <- seq(100) / 10
tau <- 1
plot(X_i, exp(-(X_i - X)^2 / (2 * tau^2)), "l", axes = FALSE,
     frame.plot = FALSE, xlab = "", ylab = "weight")
title(xlab="X(i)", line=1)
Axis(side=1, labels=FALSE)
Axis(side=2, labels=FALSE)
abline(v=5, col="gray")
text(x=5, y=0, labels= c("X(i)=X"), pos=3)
```
The width of the bell curve is determined by the value of $\tau$: smaller values of $\tau$ produce a narrower peak. When $\tau$ is small, the $X^{(i)}$'s closest to the test example $X$ have the most influence on the prediction.
## Evaluation Metric
The value of $\tau$ is chosen by trial and error, by minimizing the mean squared error (MSE) under k-fold cross-validation (k = 5). In the regression setting, MSE is the most commonly used measure of fit.
$$
MSE = \frac{1}{m}\sum_{i=1}^m{\left( y^{(i)} - h(X^{(i)})\right)^2}
$$
where $h(X^{(i)})$ is the prediction that $h$ gives for the $i^{th}$ observation. The MSE is small if the predicted responses are very close to the true responses, and large if the predicted and true responses differ substantially.
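A minimal sketch of the selection procedure in R (`predict_lwr` is a hypothetical helper standing in for the LWR prediction step, not part of the original code):

```{r tau-cv, eval=FALSE}
# Sketch: estimate MSE for one value of tau via 5-fold cross-validation.
# predict_lwr(train, test, tau) is a hypothetical LWR prediction helper.
cv_mse <- function(X, y, tau, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(X)))  # random fold labels
  fold_mse <- sapply(seq_len(k), function(i) {
    pred <- predict_lwr(X[folds != i, ], X[folds == i, ], tau)
    mean((y[folds == i] - pred)^2)
  })
  mean(fold_mse)
}
# Pick the tau with the lowest cross-validated MSE, e.g.:
# sapply(c(100, 500, 1000, 2000), function(t) cv_mse(X, y, t))
```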
## Dataset
Historical real estate valuation data were collected from the Xindian district of New Taipei City, Taiwan, over a 10-month period in 2012 and 2013. There are 6 input [variables](https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set) and 414 observations:
* Transaction date
* House age
* Distance to the nearest MRT station
* Number of convenience stores in the living circle on foot
* Geographic coordinate, latitude
* Geographic coordinate, longitude
The output variable is house price per unit area.
## Data Preparation and EDA
**Transaction date (X1)** is a time series. All transactions occurred in 2012 and 2013, during which [Taiwan's housing market](https://www.globalpropertyguide.com/home-price-trends/Taiwan) was in a secular uptrend.
Quarterly price appreciation:
```{r echo=FALSE}
# Quarterly house price appreciation in Taiwan, 2012-2013
data.frame(year = c(2012, 2013),
           Q1 = c("2.76%", "5.33%"),
           Q2 = c("2.27%", "6.05%"),
           Q3 = c("0.18%", "-0.42%"),
           Q4 = c("2.55%", "3.13%"))
```
The time series is not stationary and therefore cannot be included directly in the regression. Instead, the output variable (house price per unit area) is scaled to adjust for the secular trend.
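One possible way to do this scaling (an assumed approach, not necessarily the author's exact method) is to deflate each price by the cumulative appreciation up to its transaction quarter:

```{r detrend-sketch, eval=FALSE}
# Sketch (assumed approach): deflate prices to a common reference level
# using the quarterly appreciation rates from the table above (2012 Q1-Q4,
# then 2013 Q1-Q4).
rates <- c(0.0276, 0.0227, 0.0018, 0.0255, 0.0533, 0.0605, -0.0042, 0.0313)
growth <- cumprod(1 + rates)   # cumulative growth factor per quarter
# price_adjusted <- price / growth[quarter_index]  # quarter_index in 1..8
```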
## (contd.)
```{r include=FALSE}
library(data.table)
library(rio)
library(caret)
library(ggplot2)
library(gridExtra)
# Download the UCI real estate valuation dataset and replace spaces in
# column names with underscores
url <- paste0("https://archive.ics.uci.edu/ml/machine-learning-databases/00477",
              "/Real%20estate%20valuation%20data%20set.xlsx")
data <- setDT(rio::import(url))
setnames(data, new = gsub(" ", "_", names(data)))
```
**Geographic coordinates** of houses are provided as latitude and longitude. All houses in the dataset lie within an 11-kilometer diameter.

Density plots reveal that **House age**, **Distance to the nearest MRT station**, and **Number of convenience stores within walking distance** are more or less normally distributed.
```{r echo=FALSE, fig.height=2}
# Density plots of the three continuous input variables
grid.arrange(ggplot(data, aes(x = X2_house_age)) + geom_density(),
             ggplot(data, aes(x = X3_distance_to_the_nearest_MRT_station)) +
               geom_density(),
             ggplot(data, aes(x = X4_number_of_convenience_stores)) +
               geom_density(),
             ncol = 3)
```
## Bivariate Correlation
Pairwise correlations between the input variables are not significant.
```{r echo=FALSE}
# Pairwise correlations of X2 (house age), X3 (distance to nearest MRT
# station), and X4 (number of convenience stores)
correlation_matrix <- cor(data[, 3:5])
prmatrix(correlation_matrix, collab = c("X2", "X3", "X4"))
```
## Results
LWR gives better prediction results than linear regression, as measured by a lower MSE. The weight window parameter $\tau$ is kept at 1000. Since distance is measured in meters, this roughly equates to giving more weight to houses within a half-kilometer radius (recall that the houses in the training set lie within an 11-kilometer diameter). LWR appears to be the better choice of learning algorithm for predicting real estate prices.
```{r echo=FALSE}
# Cross-validated MSE for the two methods (summary of results)
data.table(Method = c("LWR", "Linear regression"), MSE = c(95, 120))
```
The $\sqrt{MSE}$ corresponds to an average prediction error of around 24%.
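As a rough sanity check (assuming the renamed output column `Y_house_price_of_unit_area` produced by the data-loading chunk above), the relative error can be computed as the RMSE divided by the mean observed price:

```{r rmse-check, eval=FALSE}
# Sketch: relative prediction error = RMSE / mean observed price
sqrt(95) / mean(data$Y_house_price_of_unit_area)
```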