Predicting House Prices Using k-NN

Update History

  • 2022-04-29: First published
  • 2022-09-05: Corrected bug in visualise() function

Introduction

The aim of this article is to show how easy it is to use the k-NN regressor and classifier provided by Scikit-Learn, a popular Python machine learning library, for predicting house prices in England based on their location. This is an intuitive and highly didactic problem, for which public data sets are readily available.

As a bonus, we also include a classification use case, in which we try to predict a house type (Flat, Terraced, etc.), based on its location and price.

Imports

The underlying libraries should be installed first, e.g. pip3 install pandas numpy matplotlib scikit-learn.

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.colors as colors
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV

The Data Sets

Land Registry’s ‘Sold’ Prices 2021

Our main data set is the Land Registry’s ‘sold’ data set for 2021, which contains the house transactions for that year, including each property’s address, type, and price.

We are only interested in the price, the postcode, the property type (D - Detached, S - Semi, F - Flat, T - Terraced, O - Other), and whether it is a new build (Y/N). We are not interested in month-to-month or year-to-year trends.

house_prices_df = pd.read_csv("pp-2021.csv",
                              usecols=[1,3,4,5],
                              names=["price","postcode","type","new"])
house_prices_df.head()

price postcode type new
0 246000 NN5 6SQ S N
1 238000 NN4 6LS T N
2 320000 NN29 7SF D N
3 560000 NN11 3TD S N
4 265000 NN5 6AH S N

Postcode to Latitude and Longitude

The next data set is a mapping of postcodes to geographical coordinates, in the form of (postcode, latitude, longitude) tuples.

postcodes_df = pd.read_csv("ukpostcodes.csv",usecols=[1,2,3])
postcodes_df.head()

postcode latitude longitude
0 AB10 1XG 57.144165 -2.114848
1 AB10 6RN 57.137880 -2.121487
2 AB10 7JB 57.124274 -2.127190
3 AB11 5QN 57.142701 -2.093295
4 AB11 6UL 57.137547 -2.112233

Our final dataframe, simply named df, matches each transaction's postcode with its latitude and longitude coordinates, and divides prices by 1,000 so that we reason in terms of 'k' (thousands), which is easier.

df = house_prices_df.merge(postcodes_df,how='inner',left_on='postcode',right_on='postcode')
df['price'] = df['price'] / 1000
df.head()

price postcode type new latitude longitude
0 246.0 NN5 6SQ S N 52.253869 -0.946142
1 215.0 NN5 6SQ S N 52.253869 -0.946142
2 238.0 NN4 6LS T N 52.198000 -0.880000
3 311.0 NN4 6LS T N 52.198000 -0.880000
4 305.0 NN4 6LS D N 52.198000 -0.880000
df['price'].describe()
count    757488.000000
mean        379.990774
std        1972.340933
min           0.001000
25%         171.000000
50%         270.000000
75%         420.000000
max      932540.400000
Name: price, dtype: float64

Visualisation

It is helpful to build some intuition about our data set, so we visualise it as a heatmap. As expected, this shows that areas such as Greater London and Manchester suffer from higher house prices relative to the rest of England. To reduce noise, we use a colour scale that highlights prices ranging from 200k to 1000k, so that the visualisation is not distorted by either cheap or expensive houses (those below 200k and above 1000k, respectively).

def visualise(df, vmin, vmax):

    # Sort by price so that the most expensive houses are plotted last (on top)
    df_sorted = df.sort_values(by='price')
    x = df_sorted['longitude']
    y = df_sorted['latitude']
    c = df_sorted['price']

    plt.rcParams['figure.figsize'] = [5, 6]
    plt.rcParams['figure.dpi'] = 100

    plt.scatter(x, y, s=0.01, c=c, cmap='plasma_r',
                norm=colors.Normalize(vmin=vmin,vmax=vmax), alpha=0.8)
    plt.colorbar()
    plt.show()

visualise(df, 200, 1000)

House Price Prediction by Location

The first goal is the creation of a function as follows:

Location -> Price

In the real world (e.g., the way the Rightmove and Zoopla price estimators work) we would also include the house number, the number of rooms, and so on, plus trend-related considerations, in the calculations.

Thus, we are essentially boiling down the entire house address to its geographical location, but not only that. Houses aren't just 'houses'; they are flats, detached houses, and so on. As such, we will further reduce our search space by focusing on terraced houses only. For simplicity, the assumption is therefore that the price prediction applies to terraced houses alone.

# Terraced houses only (i.e., Townhouses in the US)
prices_df = df[(df['type'] == 'T')]
# Obtain the average price paid in each postcode
prices_df = prices_df.groupby(["postcode","latitude","longitude"], as_index=False)["price"].mean()
prices_df.head()

postcode latitude longitude price
0 AL1 1AS 51.749072 -0.335471 608.75
1 AL1 1HW 51.747109 -0.336542 505.00
2 AL1 1NF 51.749497 -0.336918 720.00
3 AL1 1NL 51.749318 -0.339532 775.00
4 AL1 1PA 51.748662 -0.334472 785.00

Using k-NN

Our first model is k-Nearest Neighbours (k-NN); in particular, KNeighborsRegressor from Scikit-Learn, given that prices are continuous values.

Let us start by splitting the data set into training and test sets.

X = prices_df[['latitude','longitude']]
y = prices_df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

k-NN isn’t hands-off. We need to choose the optimal number of neighbours (the hyperparameter k), as well as whether each neighbour should count uniformly or in proportion to its distance; in other words, the hyperparameter weights. One way to find out is by brute force, that is to say, by trying different values, as follows:

uniform  = []
distance = []
r = range(1,21,2)

for k in r:

    # 'uniform': each of the k nearest neighbours contributes equally
    model = KNeighborsRegressor(n_neighbors = k, weights='uniform')
    model.fit(X_train.values, y_train.values)
    uniform.append(model.score(X_test.values,y_test.values))

    # 'distance': each neighbour's weight is inversely proportional to its
    # distance (to lessen the influence of far-away neighbours)
    model = KNeighborsRegressor(n_neighbors = k, weights='distance')
    model.fit(X_train.values, y_train.values)
    distance.append(model.score(X_test.values,y_test.values))

uniform = np.array(uniform)
distance = np.array(distance)

plt.rcParams['figure.figsize'] = [10, 3]
plt.rcParams['figure.dpi'] = 100 # 200 e.g. is really fine, but slower
plt.plot(r,uniform,label='uniform',color='blue')
plt.plot(r,distance,label='distance',color='red')
plt.legend()
plt.gca().set_xticks(r)
plt.show()

The results suggest that the hyperparameters k = 15 and weights='distance' would provide the best accuracy.

pd.DataFrame({"k" : r, "uniform" : uniform, "distance" : distance})

k uniform distance
0 1 0.557452 0.557452
1 3 0.608900 0.616880
2 5 0.626724 0.630256
3 7 0.655916 0.651094
4 9 0.656827 0.656976
5 11 0.661121 0.663031
6 13 0.657153 0.663963
7 15 0.660301 0.667635
8 17 0.658013 0.667308
9 19 0.654204 0.666773

This process, though, is tedious and relies on a single, fixed split of the data. A way of further shuffling the training data set is through a process called cross-validation. The GridSearchCV function combines the ability to define the number of splits (folds) with iterating through the various hyperparameter values.

In our case, rather than defining the convoluted loop seen above, we simply specify the range of values we want to try out:

params = {'n_neighbors':range(1,21,2),'weights':['uniform','distance']}

The next step is to pass the k-NN instance, our parameter bundle (params), and the number of splits (cv=5) to GridSearchCV:

model = GridSearchCV(KNeighborsRegressor(), params, cv=5)
model.fit(X_train.values,y_train.values)
model.best_params_
{'n_neighbors': 19, 'weights': 'uniform'}

Above, GridSearchCV suggests weights='uniform' rather than 'distance', and k = 19 rather than k = 15. This is because the cross-validation process performs several different rounds of sampling, unlike our rigid, single-split loop.
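GridSearchCV also records the mean cross-validated score of every candidate in its cv_results_ attribute. The following is a minimal sketch (not part of the original run) of how the top combinations could be inspected:

# Sketch: mean cross-validated score of each (k, weights) candidate
cv_results = pd.DataFrame(model.cv_results_)
cv_results[['param_n_neighbors','param_weights','mean_test_score']] \
    .sort_values('mean_test_score', ascending=False) \
    .head()

On the held-out test set, the model selected by GridSearchCV scores: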

model.score(X_test.values,y_test.values)
0.6542038862362233
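The score above is the regressor's default metric, the coefficient of determination (R²). As a complementary sketch (not shown in the original article), the mean absolute error expresses the typical prediction error in the same 'k' units as the prices:

from sklearn.metrics import mean_absolute_error

# Average absolute prediction error on the test set, in thousands of pounds
y_pred = model.predict(X_test.values)
mean_absolute_error(y_test.values, y_pred)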

We now define our Location -> Price function as follows:

def price(description,lat,lon):
    features = [[lat,lon]]
    print("{:30s} -> {:5.0f}k ".format(description, model.predict(features)[0]))

# Examples
price('Oxford Circus, London', 51.515276, -0.142038)
price('Harrods (B. Road), London', 51.499814, -0.163366)
price('Peak District, National Park', 53.328508, -1.783416)
Oxford Circus, London          ->  5531k 
Harrods (B. Road), London      ->  3956k 
Peak District, National Park   ->   270k 

House Type by Location and Price

In the last section we used the k-NN regressor to predict house prices. Let us now use the same data set to work on a classification problem. The objective is to predict the house type (Detached, Semi, Terraced, etc.) based on its location and price, as follows:

(Location, Price) -> Type

Let us first create a data set in which we obtain the average price for each house type given a location:

types_df = df.groupby(["latitude","longitude","type"], as_index=False)["price"].mean()
types_df.head()

latitude longitude type price
0 49.912272 -6.300022 T 625.0
1 49.913984 -6.309107 S 475.0
2 49.914130 -6.312967 O 440.0
3 49.914392 -6.315107 F 280.0
4 49.914397 -6.311555 F 255.0

Next, we will discover the optimal number of neighbours and weighting, using GridSearchCV:

params = {'n_neighbors':range(1,21,2),'weights':['uniform','distance']}

X = types_df[['latitude','longitude','price']]
y = types_df['type']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

model = GridSearchCV(KNeighborsClassifier(), params, cv=5)
model.fit(X_train.values,y_train.values)
model.best_params_
{'n_neighbors': 19, 'weights': 'distance'}

The score is not as good as that for price prediction, but at roughly 53% it is not bad for a first attempt, without massaging the data set in any way:

model.score(X_test.values,y_test.values)
0.5336038203296112
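To see where the classifier struggles on a per-class basis, the following is a minimal sketch (assuming the fitted model above, and not part of the original article) using scikit-learn's classification_report:

from sklearn.metrics import classification_report

# Precision, recall and F1-score for each house type on the test set
y_pred = model.predict(X_test.values)
print(classification_report(y_test.values, y_pred))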

Finally, we define a function that lets us see the predictions in action:

ht = { 'F' : 'Flat', 'T' : 'Terraced', 'S' : 'Semi', 'D' : 'Detached', 'O' : 'Other'}

def house_type(description,lat,lon,price):
    features = [[lat,lon,price]]
    print("{:30s} {:5.0f}k -> {}".format(description,price,ht[(model.predict(features)[0])]))

house_type('Oxford Circus, London', 51.515276, -0.142038, 500)
house_type('Harrods (B. Road), London', 51.499814, -0.163366, 5500)
house_type('Peak District, National Park', 53.328508, -1.783416, 100)
Oxford Circus, London            500k -> Flat
Harrods (B. Road), London       5500k -> Other
Peak District, National Park     100k -> Terraced

Conclusion

Scikit-Learn does a good job at predicting both continuous values and classes, using KNeighborsRegressor() and KNeighborsClassifier() respectively, for a use case in which the key features are geographical coordinates and the notion of 'distance' is highly intuitive.

The use of all available Land Registry records (as opposed to only the 2021 data set) would facilitate the prediction of future prices, as well as the analysis of other interesting patterns. For example, flats and terraced houses tend to form clusters of consecutive house numbers, so an effective prediction of the property type could be achieved based on the house number and location (without the price), without necessarily having a 'sold' record for the specific house number/postcode pair.

Acknowledgements

Many thanks to Enrique Riveros, who found a bug in the visualise() function whereby the price was being sorted outside of the DataFrame, breaking the correspondence with the x and y coordinates.

Before You Leave

🤘 Subscribe to my 100% spam-free newsletter!
