Predicting House Prices Using k-NN

Update History
- 2022-04-29: First published
- 2022-09-05: Corrected bug in visualise() function
Introduction
The aim of this article is to show how easy it is to use the k-NN regressor and classifier provided by scikit-learn, a popular Python machine learning library, to predict house prices in England based on their location. This is an intuitive and highly didactic problem, for which public data sets are readily available.
As a bonus, we also include a classification use case, in which we try to predict a house type (Flat, Terraced, etc.), based on its location and price.
Imports
The underlying libraries (pandas, NumPy, Matplotlib, and scikit-learn) should be installed first using pip3 install <library>.
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.colors as colors
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
The Data Sets
Land Registry’s ‘Sold’ Prices 2021
Our main data set is the Land Registry’s ‘sold’ data set for 2021, which contains the house transactions for that year, including each property’s address, type, and price.
We are only interested in the price, the postcode, the property type (D - Detached, S - Semi, F - Flat, T - Terraced, O - Other), and whether the property is a new build or not (Y/N). We are not interested in month-to-month or year-to-year trends.
house_prices_df = pd.read_csv("pp-2021.csv",
                              usecols=[1,3,4,5],
                              names=["price","postcode","type","new"])
house_prices_df.head()
|   | price  | postcode | type | new |
|---|--------|----------|------|-----|
| 0 | 246000 | NN5 6SQ  | S    | N   |
| 1 | 238000 | NN4 6LS  | T    | N   |
| 2 | 320000 | NN29 7SF | D    | N   |
| 3 | 560000 | NN11 3TD | S    | N   |
| 4 | 265000 | NN5 6AH  | S    | N   |
Postcode to Latitude and Longitude
The next data set is a mapping of postcodes to geographical coordinates, in the form of (postcode, latitude, longitude) tuples.
postcodes_df = pd.read_csv("ukpostcodes.csv",usecols=[1,2,3])
postcodes_df.head()
|   | postcode | latitude  | longitude |
|---|----------|-----------|-----------|
| 0 | AB10 1XG | 57.144165 | -2.114848 |
| 1 | AB10 6RN | 57.137880 | -2.121487 |
| 2 | AB10 7JB | 57.124274 | -2.127190 |
| 3 | AB11 5QN | 57.142701 | -2.093295 |
| 4 | AB11 6UL | 57.137547 | -2.112233 |
Our final dataframe, simply named df, matches each transaction's postcode with its latitude and longitude coordinates, and divides prices by 1000, so that we reason in terms of 'k's (thousands), which is easier.
df = house_prices_df.merge(postcodes_df,how='inner',left_on='postcode',right_on='postcode')
df['price'] = df['price'] / 1000
df.head()
|   | price | postcode | type | new | latitude  | longitude |
|---|-------|----------|------|-----|-----------|-----------|
| 0 | 246.0 | NN5 6SQ  | S    | N   | 52.253869 | -0.946142 |
| 1 | 215.0 | NN5 6SQ  | S    | N   | 52.253869 | -0.946142 |
| 2 | 238.0 | NN4 6LS  | T    | N   | 52.198000 | -0.880000 |
| 3 | 311.0 | NN4 6LS  | T    | N   | 52.198000 | -0.880000 |
| 4 | 305.0 | NN4 6LS  | D    | N   | 52.198000 | -0.880000 |
df['price'].describe()
count 757488.000000
mean 379.990774
std 1972.340933
min 0.001000
25% 171.000000
50% 270.000000
75% 420.000000
max 932540.400000
Name: price, dtype: float64
Visualisation
It is helpful to build some intuition about our data set, so we visualise it as a heatmap, which shows, as expected, that areas such as Greater London and Manchester have higher house prices relative to the rest of England. To reduce noise, we use a colour scale that highlights prices ranging from 200k to 1000k, so that the visualisation is not distorted by either very cheap or very expensive houses (those below 200k and above 1000k, respectively).
def visualise(df, vmin, vmax):

    df_sorted = df.sort_values(by='price')
    x = df_sorted['longitude']
    y = df_sorted['latitude']
    c = df_sorted['price']

    plt.rcParams['figure.figsize'] = [5, 6]
    plt.rcParams['figure.dpi'] = 100

    plt.scatter(x, y, s=0.01, c=c, cmap='plasma_r',
                norm=colors.Normalize(vmin=vmin,vmax=vmax), alpha=0.8)
    plt.colorbar()
    plt.show()

visualise(df, 200, 1000)
House Price Prediction by Location
The first goal is the creation of a function as follows:
Location -> Price
In the real world (e.g., the way the Rightmove and Zoopla price estimators work), we would also include the house number, the number of rooms, and so on, as well as trend-related considerations, in the calculations.
Thus, we are essentially boiling the entire house address down to its geographical location. But not only that: houses aren't just 'houses', they are flats, detached houses, and so on. As such, we will further reduce our search space by focusing on terraced houses only; for the sake of simplicity, the price prediction therefore applies to terraced houses only.
# Terraced houses only (i.e., townhouses in the US)
prices_df = df[(df['type'] == 'T')]
# Obtain the average price paid in each postcode
prices_df = prices_df.groupby(["postcode","latitude","longitude"], as_index=False)["price"].mean()
prices_df.head()
|   | postcode | latitude  | longitude | price  |
|---|----------|-----------|-----------|--------|
| 0 | AL1 1AS  | 51.749072 | -0.335471 | 608.75 |
| 1 | AL1 1HW  | 51.747109 | -0.336542 | 505.00 |
| 2 | AL1 1NF  | 51.749497 | -0.336918 | 720.00 |
| 3 | AL1 1NL  | 51.749318 | -0.339532 | 775.00 |
| 4 | AL1 1PA  | 51.748662 | -0.334472 | 785.00 |
Using k-NN
Our first model is k-Nearest Neighbors (k-NN); in particular, KNeighborsRegressor from scikit-learn, given that prices are continuous values.
Let us start by splitting the data set into training and test sets.
X = prices_df[['latitude','longitude']]
y = prices_df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
k-NN isn't hands-off. We need to decide on the optimal number of neighbours (the hyperparameter k), as well as whether each neighbour should count equally or be weighted by its distance; in other words, the hyperparameter weights. One way to find out is by brute force, that is to say, by trying different values, as follows:
uniform = []
distance = []
r = range(1,21,2)

for k in r:

    # Uniform: every one of the k neighbours contributes equally
    model = KNeighborsRegressor(n_neighbors = k, weights='uniform')
    model.fit(X_train.values, y_train.values)
    uniform.append(model.score(X_test.values,y_test.values))

    # Distance: each neighbour's weight is inversely proportional to its distance
    # (to lessen the influence of far-away neighbours)
    model = KNeighborsRegressor(n_neighbors = k, weights='distance')
    model.fit(X_train.values, y_train.values)
    distance.append(model.score(X_test.values,y_test.values))

uniform = np.array(uniform)
distance = np.array(distance)

plt.rcParams['figure.figsize'] = [10, 3]
plt.rcParams['figure.dpi'] = 100 # e.g. 200 is really fine, but slower
plt.plot(r,uniform,label='uniform',color='blue')
plt.plot(r,distance,label='distance',color='red')
plt.legend()
plt.gca().set_xticks(r)
plt.show()
The results suggest that the hyperparameters k = 15 and weights='distance' would provide the best accuracy.
pd.DataFrame({"k" : r, "uniform" : uniform, "distance" : distance})
|   | k  | uniform  | distance |
|---|----|----------|----------|
| 0 | 1  | 0.557452 | 0.557452 |
| 1 | 3  | 0.608900 | 0.616880 |
| 2 | 5  | 0.626724 | 0.630256 |
| 3 | 7  | 0.655916 | 0.651094 |
| 4 | 9  | 0.656827 | 0.656976 |
| 5 | 11 | 0.661121 | 0.663031 |
| 6 | 13 | 0.657153 | 0.663963 |
| 7 | 15 | 0.660301 | 0.667635 |
| 8 | 17 | 0.658013 | 0.667308 |
| 9 | 19 | 0.654204 | 0.666773 |
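As an aside (not in the original notebook), the same conclusion can be read off programmatically from the arrays built in the loop above, rather than by eyeballing the plot or the table; a minimal sketch:

# Index of the highest test score under distance weighting
best_idx = int(np.argmax(distance))
print("best k (distance weighting):", list(r)[best_idx],
      "score:", round(float(distance[best_idx]), 6))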
This process, though, is tedious, and it makes rigid assumptions about the training data set. A way of further shuffling the training data is through a process called cross-validation. The GridSearchCV function combines the ability to define the number of splits with the ability to iterate through various hyperparameters.
In our case, rather than writing the convoluted loop seen above, we simply specify the range of values we want to try out:
params = {'n_neighbors':range(1,21,2),'weights':['uniform','distance']}
The next step is to pass the k-NN instance, our parameters bundle (params), and the number of splits (cv=5) to GridSearchCV:
model = GridSearchCV(KNeighborsRegressor(), params, cv=5)
model.fit(X_train.values,y_train.values)
model.best_params_
{'n_neighbors': 19, 'weights': 'uniform'}
Above, GridSearchCV suggests k = 19 and weights = 'uniform', rather than the k = 15 and weights = 'distance' favoured by our brute-force process. This is because the cross-validation process performs many different rounds of sampling, unlike our rigid single train/test split.
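For the curious, GridSearchCV also records the per-combination cross-validation scores in its cv_results_ attribute, which makes it easy to see how close the runner-up combinations were; a quick, optional inspection (assuming the fitted model from above):

cv_df = pd.DataFrame(model.cv_results_)
# Mean cross-validated score for each (k, weights) combination, best first
cv_df[['param_n_neighbors','param_weights','mean_test_score']] \
    .sort_values('mean_test_score', ascending=False).head()

On the held-out test set, the resulting model scores as follows: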
model.score(X_test.values,y_test.values)
0.6542038862362233
We now define our Location -> Price function as follows:
def price(description,lat,lon):
    features = [[lat,lon]]
    print("{:30s} -> {:5.0f}k ".format(description,float(model.predict(features)[0])))

# Examples
price('Oxford Circus, London', 51.515276, -0.142038)
price('Harrods (B. Road), London', 51.499814, -0.163366)
price('Peak District, National Park', 53.328508, -1.783416)
Oxford Circus, London -> 5531k
Harrods (B. Road), London -> 3956k
Peak District, National Park -> 270k
House Type by Location and Price
In the last section we used the k-NN regressor to predict house prices. Let us now use the same data set to work on a classification problem. The objective is to predict the house type (Detached, Semi, Terraced, etc.) based on its location and price, as follows:
(Location, Price) -> Type
Let us first create a data set in which we obtain the average price for each house type given a location:
types_df = df.groupby(["latitude","longitude","type"], as_index=False)["price"].mean()
types_df.head()
|   | latitude  | longitude | type | price |
|---|-----------|-----------|------|-------|
| 0 | 49.912272 | -6.300022 | T    | 625.0 |
| 1 | 49.913984 | -6.309107 | S    | 475.0 |
| 2 | 49.914130 | -6.312967 | O    | 440.0 |
| 3 | 49.914392 | -6.315107 | F    | 280.0 |
| 4 | 49.914397 | -6.311555 | F    | 255.0 |
Next, we will discover the optimal number of neighbours and weighting, again using GridSearchCV:
params = {'n_neighbors':range(1,21,2),'weights':['uniform','distance']}

X = types_df[['latitude','longitude','price']]
y = types_df['type']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

model = GridSearchCV(KNeighborsClassifier(), params, cv=5)
model.fit(X_train.values,y_train.values)
model.best_params_
{'n_neighbors': 19, 'weights': 'distance'}
The score is not as good as that for price prediction, but at roughly 53% accuracy it is not bad for a first attempt, without massaging the data set in any way:
model.score(X_test.values,y_test.values)
0.5336038203296112
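As an optional sanity check (not part of the original notebook), a confusion matrix shows which property types the classifier tends to mix up; a minimal sketch, assuming the fitted classifier and test split from above:

from sklearn.metrics import confusion_matrix

# Rows are the true types, columns the predicted ones
labels = ['D','S','T','F','O']
cm = confusion_matrix(y_test.values, model.predict(X_test.values), labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))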
Finally, we declare a function that lets us see the prediction in action:
ht = { 'F' : 'Flat', 'T' : 'Terraced', 'S' : 'Semi', 'D' : 'Detached', 'O' : 'Other'}

def house_type(description,lat,lon,price):
    features = [[lat,lon,price]]
    print("{:30s} {:5.0f}k -> {}".format(description,price,ht[(model.predict(features)[0])]))

house_type('Oxford Circus, London', 51.515276, -0.142038, 500)
house_type('Harrods (B. Road), London', 51.499814, -0.163366, 5500)
house_type('Peak District, National Park', 53.328508, -1.783416, 100)
Oxford Circus, London 500k -> Flat
Harrods (B. Road), London 5500k -> Other
Peak District, National Park 100k -> Terraced
Conclusion
Scikit-learn does a good job at predicting both continuous values and classes, using KNeighborsRegressor() and KNeighborsClassifier(), respectively, for a use case in which the key features are geographical coordinates, where the notion of 'distance' is highly intuitive.
The use of all available Land Registry records (as opposed to only the 2021 data set) would facilitate the prediction of future prices, as well as the analysis of other interesting patterns. For example, flats and terraced houses tend to form clusters of consecutive house numbers, so an effective prediction of the property type could be achieved from the house number and location alone (without the price), without necessarily having a 'sold' record for that specific house number/postcode pair.
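As a hint of how such an experiment might begin (a sketch only, not part of this article's pipeline; it assumes the PAON field, i.e. the primary addressable object name holding the house number or name, is column 7 of the Land Registry CSV, which should be checked against the published column specification):

# Load the house number/name (PAON) alongside the columns used earlier
paon_df = pd.read_csv("pp-2021.csv",
                      usecols=[1,3,4,5,7],
                      names=["price","postcode","type","new","paon"])
# Keep purely numeric house numbers; named houses (e.g. 'ROSE COTTAGE') become NaN and are dropped
paon_df["house_number"] = pd.to_numeric(paon_df["paon"], errors="coerce")
paon_df = paon_df.dropna(subset=["house_number"])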
Acknowledgements
Many thanks to Enrique Riveros, who found a bug in the visualise() function, whereby the price was being sorted outside of the DataFrame, breaking the correspondence with the x and y coordinates.