Machine Learning for Price Prediction

David Landup

Introduction

Regression is a statistical method used to predict the value of a target (dependent) variable, based on one or more independent, predictor variables. The simplest form is linear regression between two variables, though multiple linear regression can use several independent variables.
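As a minimal sketch of the two-variable case, the snippet below fits a straight line with ordinary least squares in plain Python. The numbers are invented purely for illustration (they are not taken from the Ames dataset), and the function name is hypothetical.

```python
# Toy data, invented for illustration only (not from the Ames dataset).
areas = [50.0, 70.0, 90.0, 110.0, 130.0]       # predictor: living area
prices = [110.0, 150.0, 190.0, 230.0, 270.0]   # target: price, in thousands


def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel out.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_simple_linear_regression(areas, prices)

# Predict the target for a predictor value that was not in the data.
predicted = slope * 100.0 + intercept
print(slope, intercept, predicted)  # 2.0 10.0 210.0
```

Real housing data is noisier and has many more predictors, which is where multiple regression and the models built later in the project come in.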

Through Machine Learning - we can perform both linear and non-linear regression, including on relationships that are typically too complex to write down as a concrete set of steps.

We perform regression in our lives all the time. For instance, if you've ever moved to a new house or apartment - you probably remember the stressful experience of choosing a property, estimating its value based on some features a real estate platform displays, and comparing it to other properties to check whether the features justify the price.

So - what makes the price of a property? If you ask a real estate agent, they can rattle off a fairly long list of features that increase or decrease the price of a property, based on the general consensus of the buyers in a specific area.

Some areas typically have properties with large lawns (such as suburban houses), while properties in other areas typically don't have lawns (say, the city centre). Additionally, European architecture is significantly different from, say, American architecture, and each country, city and even parts of cities have (to a degree) their own spin on urbanism. The prices today probably aren't the same as yesterday - let alone 6 months ago. Does this shift alter your perception of value?

It's hard to estimate the price of a property, which is why we generally delegate the task to someone with more experience than us, and even then, there's wiggle room.

In this Guided Project, we'll be performing house price prediction, for the city of Ames, Iowa. Since we don't know what makes the price of a property what it is - we'll employ Machine Learning to do the job for us.

Using Keras, the deep learning API built on top of TensorFlow, we'll experiment with architectures, build an ensemble of stacked models and train a meta-learner neural network (level-1 model) to figure out the pricing of a house.
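To give a rough feel for what stacking looks like, here is a small sketch under stated assumptions. It is not the project's Keras implementation: it swaps in Scikit-Learn's StackingRegressor with an MLPRegressor as the level-1 meta-learner, and it runs on synthetic data from make_regression rather than on the Ames dataset. The choice of base models and all hyperparameters are arbitrary.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data (assumption: NOT the Ames housing dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Level-0 (base) models: each learns to predict the target on its own.
base_models = [
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("ridge", Ridge(alpha=1.0)),
]

# Level-1 meta-learner: a small neural network trained on the
# cross-validated predictions of the base models.
meta_learner = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)

stack = StackingRegressor(estimators=base_models, final_estimator=meta_learner, cv=5)
stack.fit(X_train, y_train)

stacked_predictions = stack.predict(X_test)
print(stacked_predictions[:3])
```

The general idea carries over to Keras: the base models produce predictions, and a separate network learns how to combine them.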

Deep learning is amazing - but before resorting to it, it's advisable to attempt solving the problem with simpler techniques first, such as shallow learning algorithms. Our baseline performance will be based on a Random Forest Regression algorithm. Additionally - we'll explore creating ensembles of models through Scikit-Learn via techniques such as bagging and voting.
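As a hedged sketch of what such a baseline comparison might look like, the snippet below fits a Random Forest, a bagging ensemble and a voting ensemble with Scikit-Learn and records the mean absolute error of each. The data is synthetic (generated with make_regression, not the Ames dataset), and the model choices and hyperparameters are illustrative assumptions rather than the ones used later in the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: NOT the Ames housing dataset).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

models = {
    # Baseline: a forest of decision trees, predictions averaged.
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=42),
    # Bagging: the same estimator type trained on bootstrap samples.
    "bagging": BaggingRegressor(
        DecisionTreeRegressor(), n_estimators=20, random_state=42
    ),
    # Voting: different estimator types, predictions averaged.
    "voting": VotingRegressor(
        [
            ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
            ("lr", LinearRegression()),
        ]
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = mean_absolute_error(y_test, model.predict(X_test))

for name, mae in scores.items():
    print(f"{name}: MAE = {mae:.2f}")
```

Whichever score the baseline achieves becomes the number a more complex model has to beat to justify its extra cost.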

This is an end-to-end project, and like all Machine Learning projects, we'll start out with Exploratory Data Analysis, followed by Data Preprocessing and finally Building Shallow and Deep Learning Models to fit the data we've explored and cleaned previously.

While this project is focused on house price prediction - the exact same logic, steps and intuition can be carried over to any sort of target variable prediction.

Where Do We Get the Data?

Data comes from various sources. Naturally - one of them would be a real estate platform that offers various properties for you to browse through. You could even build a regression model to help you find your dream house in your area, using the most up-to-date data!

However, most websites and platforms don't have an API that allows you to get the data through controlled means, and web scraping oftentimes goes against the Terms of Service (ToS). Whether or not scraping is moral is a debatable subject, and there isn't a clear-cut answer.

Aggressive scraping can act as a DDoS attack and take a website down or slow it down for actual users, hurting both their experience and the website's brand. Additionally, aggressive scraping can hurt the analytics of the team maintaining the website, be a negative signal for SEO (by having a huge bounce rate) and by extension, impact its rankings.

On the other hand - non-aggressive scraping may be fully legitimate. If a website measures traffic in the millions monthly, a few thousand automated requests from your scraping script will be chalked up as an anomaly and won't mess up the analytics of the team. Such small scraping operations don't really impact the overall brand, experience or rankings of a website and certainly don't act as a DDoS attack.

Finally - you should pose the question:

What does the owner of the website want me to do with the data?

For instance, if you're on a real estate platform, the owner (probably) wants you to find information on listings, compare listings and find a property to buy. They (probably) don't want you to repost those same listings on another website, with altered prices, misleading quotes or false information.

If you're trying to build a mathematical (machine learning) model to help yourself (or others) find a good deal, leading to more and faster sales via that real estate platform, while keeping the number of requests within a reasonable limit (so as not to hurt the website in any form) - there's not much to object to.

Note: Before deciding to scrape a website - please do your best to find out whether the website already offers an API. If it doesn't, it's not hard to reach out to the owner or maintainer of the website via email to politely ask for permission, explaining what you aim to do, and abiding by the essence of the ToS.
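If you do have permission, one way to keep a script on the non-aggressive side is to enforce a minimum delay between requests and a hard cap on their total number. The sketch below is a hypothetical helper, not part of any library: PoliteFetcher, its parameters and the default values are all assumptions, and the actual download function is passed in by the caller so that the throttling logic stays independent of any HTTP client.

```python
import time


class PoliteFetcher:
    """Wrap a fetch function with a minimum delay and a request budget.

    Hypothetical helper for illustration; the defaults are arbitrary.
    """

    def __init__(self, fetch, min_interval_seconds=2.0, max_requests=1000,
                 clock=time.monotonic, sleep=time.sleep):
        self._fetch = fetch                      # callable: url -> content
        self._min_interval = min_interval_seconds
        self._max_requests = max_requests
        self._clock = clock                      # injectable for testing
        self._sleep = sleep                      # injectable for testing
        self._last_call = None
        self.requests_made = 0

    def get(self, url):
        # Refuse to go over the total request budget.
        if self.requests_made >= self._max_requests:
            raise RuntimeError("request budget exhausted")
        # Wait until at least min_interval has passed since the last call.
        if self._last_call is not None:
            elapsed = self._clock() - self._last_call
            if elapsed < self._min_interval:
                self._sleep(self._min_interval - elapsed)
        self._last_call = self._clock()
        self.requests_made += 1
        return self._fetch(url)


# Usage with a stand-in fetch function (no network access involved).
fetcher = PoliteFetcher(lambda url: f"content of {url}", min_interval_seconds=0.1)
pages = [fetcher.get(f"listing-{i}") for i in range(3)]
print(pages)
```

In a real script, the fetch argument would be whatever download function you already use, and the delay and budget would be set according to what the website owner is comfortable with.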

Throughout the Guided Project, we'll be using a pre-existing dataset, which can be found on Kaggle. The dataset isn't clean, and will require a fair bit of exploration, cleaning, making educated guesses, etc. Every dataset is unique, and if you happen to get data through an API - it'll be structured somewhat differently than this one.

Again - the domain (real estate properties) varies by city, region, continent, etc. and people in different cultures and regions value certain property features differently!

Even with these distinctions, much the same pre-processing and exploration steps can be taken. As a matter of fact - the code used in this Guided Project was originally used for a completely different dataset, coming from a different continent where the domain is very different. With tweaks to the architecture (that you can automate, as covered in the project) - the method of discovery and development we'll be covering is general.
