House Price Prediction Jupyter Notebook

<h3 id="introduction">Introduction</h3>
Regression is a statistical method used to predict values of a target (dependent) variable, based on one or more independent, predictor variables. The simplest form is linear regression between two variables, though multiple linear regression can utilize more predictor variables.
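As a minimal sketch of the difference between simple and multiple linear regression, the snippet below fits both with ordinary least squares. The data is synthetic and the variable names (<code>area</code>, <code>rooms</code>, <code>price</code>) are made up for illustration; they are not taken from the dataset used in the project.

```python
import numpy as np

# Synthetic data (made up for illustration): price depends on area and rooms.
rng = np.random.default_rng(42)
area = rng.uniform(50, 250, size=200)
rooms = rng.integers(1, 6, size=200)
price = 1000 * area + 5000 * rooms + rng.normal(0, 2000, size=200)

# Simple linear regression: one predictor (area) plus an intercept column.
X_simple = np.column_stack([np.ones_like(area), area])
coef_simple, *_ = np.linalg.lstsq(X_simple, price, rcond=None)

# Multiple linear regression: two predictors plus an intercept column.
X_multi = np.column_stack([np.ones_like(area), area, rooms])
coef_multi, *_ = np.linalg.lstsq(X_multi, price, rcond=None)

print("simple   (intercept, area):       ", coef_simple)
print("multiple (intercept, area, rooms):", coef_multi)
```

With both predictors included, the fitted coefficients should land close to the values used to generate the data, which is the sense in which more predictors can explain more of the target.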
<blockquote>
Through Machine Learning - we can perform both linear and non-linear regression, even when the relationship is too complex to write down as a concrete set of steps.
</blockquote>
We perform regression in our lives all the time. For instance, if you've gone through the experience of moving to a new house or apartment - you probably remember the stressful experience of choosing a property, estimating its value based on some features a real estate platform displays, and comparing to other properties to check whether the features justify the price.
So - what makes the price of a property? If you ask a real estate agent, they can list out a fairly long list of features that increase or decrease the price of a property, based on the general consensus of the buyers in a specific area.
Some areas typically have properties with large lawns (such as suburban houses), while some properties typically don't have lawns (say, the city centre). Additionally, European architecture is significantly different than, say, American architecture and each country, city and even parts of cities have (to a degree) their own spin on urbanism. The prices today probably aren't the same as yesterday - let alone 6 months ago. Does this skew alter your perception of value?
It's hard to estimate the price of a property, which is why we generally delegate the task to someone with more experience than us, and even then, there's wiggle room.
<blockquote>
In this Guided Project, we'll be performing house price prediction, for the city of Ames, Iowa. Since we don't know what makes the price of a property what it is - we'll employ Machine Learning to do the job for us.
</blockquote>
Using Keras, the deep learning API built on top of TensorFlow, we'll experiment with architectures, build an ensemble of stacked models and train a meta-learner neural network (level-1 model) to figure out the pricing of a house.
Deep learning is amazing - but before resorting to it, it's advised to also attempt solving the problem with simpler techniques, such as with shallow learning algorithms. Our baseline performance will be based on a Random Forest Regression algorithm. Additionally - we'll explore creating ensembles of models through Scikit-Learn via techniques such as bagging and voting.
This is an end-to-end project, and like all Machine Learning projects, we'll start out with Exploratory Data Analysis, followed by Data Preprocessing and finally Building Shallow and Deep Learning Models to fit the data we've explored and cleaned previously.
<blockquote>
While this project is focused on house price prediction - the exact same logic, steps and intuition can be carried over to any sort of target variable prediction.
</blockquote>
<h3 id="wheredowegetthedata">Where Do We Get the Data?</h3>
Data comes from various sources. Naturally - one of them would be a real estate platform that offers various properties for you to browse through. You could even build a regression model to help you find your dream house in your area, using the most up-to-date data!
However, most websites and platforms don't have an API that allows you to get the data through controlled means, and web scraping oftentimes goes against the ToS. Whether or not scraping is moral is a debatable subject, and there isn't a clear-cut answer.
Aggressive scraping can act as a DDoS attack and take a website down or slow it down for actual users, hurting both their experience and the website's brand. Additionally, aggressive scraping can hurt the analytics of the team maintaining the website, be a negative signal for SEO (by having a huge bounce rate) and by extension, impact its rankings.
On the other hand - non-aggressive scraping may be fully legitimate. If a website measures traffic in the millions monthly, a few thousand automated requests from your scraping script are chalked up as an anomaly and don't mess up the analytics of the team. Such small scraping operations don't really impact the overall brand, experience or rankings of a website and certainly don't act as a DDoS attack.
Finally - you should pose the question:
<blockquote>
What does the owner of the website want me to do with the data?
</blockquote>
For instance, if you're on a real estate platform, the owner (probably) wants you to find information on listings, compare listings and find a property to buy. They (probably) don't want you to repost those same listings on another website, with altered prices, misleading quotes or false information.
If you're trying to build a mathematical (machine learning) model to help yourself (or others) find a good deal, leading to more and faster sales via that real estate platform, while keeping the number of requests within a reasonable number (as to not hurt the website in any form) - there's not much to object to.

 <div class="alert alert-note">
 <div class="flex">
 
 <div class="flex-shrink-0 mr-3">
 <img src="/assets/images/icon-information-circle-solid.svg" class="icon" aria-hidden="true" />
 </div>
 
 Note: Before deciding to scrape a website - please do your best to find out whether the website already offers an API. If it doesn't, it's not hard to reach out to the owner or maintainer of the website via email to politely ask for permission, explaining what you aim to do, and abiding by the essence of the ToS.

 </div>
 </div>
 Throughout the Guided Project, we'll be using a pre-existing dataset, which can be found on Kaggle. The dataset isn't clean, and will require a fair bit of exploration, cleaning, making educated guesses, etc. Every dataset is unique, and if you happen to get data through an API - it'll be structured somewhat differently than this one.
<blockquote>
Again - the domain (real estate properties) varies by city, region, continent, etc. and certain cultures and regions of people value certain property features differently!
</blockquote>
Even with these distinctions, much the same pre-processing and exploration steps can be taken. As a matter of fact - the code used in this Guided Project was originally used for a completely different dataset, coming from a different continent where the domain is much different. With tweaks to the architecture (that you can automate, as covered in the project) - the method of discovery and development we'll be covering is general.

David Landup

Machine Learning for Price Prediction

<h4 id="importingmodules">Importing Modules</h4>
Let's take care of all of the imports, at the top of the script/Jupyter Notebook so we don't have to worry about imports later:
<pre><code class="hljs"># Scikit-Learn and Shallow Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn import metrics

# TF and Keras-related imports
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Data manipulation and processing
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
</code></pre>
Since we'll be starting out with shallow learning techniques for the baseline performance - we've imported a utility function, a scaler (for preprocessing), two regressor models and the <code>metrics</code> module from Scikit-Learn.
Through TensorFlow, we import Keras and its commonly used <code>layers</code> module so we can shorten calls such as <code>tf.keras.layers.Dense()</code> to <code>layers.Dense()</code>.
We're naturally importing <code>pandas</code> and <code>numpy</code> for handling and manipulating data, as well as Matplotlib and Seaborn to visualize it.
<h4 id="loadingthedata">Loading the Data</h4>
The dataset we'll be working with reports sales of residential units between 2006 and 2010 in a city called Ames which is located in Iowa, United States.

Ammar Alyousfi

Exploratory Data Analysis (EDA)

In the preprocessing stage, we'll prepare the data to be fed into machine learning models - whether we're performing shallow or deep learning. The first step is clearing the dataset of null values, abnormalities and enforcing data types. Then, we'll one-hot encode categorical variables into numerical ones.
Once that's done - we can split the data into a training and testing set, as well as scale/standardize it to help both train the models faster and allow them to converge a bit easier. Let's start!
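The steps above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the project's actual preprocessing: the column names (<code>area</code>, <code>style</code>, <code>price</code>) are hypothetical, and filling nulls with the median or a placeholder category is just one possible way of clearing them.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny made-up frame standing in for the real dataset; column names are hypothetical.
df = pd.DataFrame({
    "area": [120.0, 85.0, np.nan, 200.0, 150.0, 95.0, 110.0, 175.0],
    "style": ["1Story", "2Story", "1Story", None, "2Story", "1Story", "2Story", "1Story"],
    "price": [210000, 150000, 180000, 340000, 260000, 160000, 190000, 300000],
})

# 1. Clear null values: median for the numeric column, a placeholder for the categorical one.
df["area"] = df["area"].fillna(df["area"].median())
df["style"] = df["style"].fillna("Unknown")

# 2. One-hot encode the categorical column into numerical indicator columns.
df = pd.get_dummies(df, columns=["style"], dtype=float)

# 3. Separate features from the target and split into training and testing sets.
X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# 4. Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler only on the training set keeps information from the testing set from leaking into the preprocessing step.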
<h4 id="removingabnormalities">Removing Abnormalities</h4>
We've previously seen that abnormal listings exist, and by definition, they're in the minority. Since abnormal sales typically include a lower price for good features due to, say, being in a rush to sell a property, we don't want to discount the features as worse indicators of a price than they really are.
There are also &quot;partial&quot; sale conditions, which don't impact other variables, but do impact the price. If a house is partially built, it'll have the same area, lot frontage, etc. as a finished house! The overall quality can also be ranked as high, if high quality materials are used, but there's no guarantee that this will also reflect the overall condition variable (and even if it does, the variable doesn't correlate with the price much).
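One way to act on this is to keep only the listings sold under normal conditions. The sketch below assumes a <code>Sale Condition</code> column with labels such as <code>Normal</code>, <code>Abnorml</code> and <code>Partial</code>; both the column name and the labels are assumptions here and should be checked against the actual dataset, and the rows are made up.

```python
import pandas as pd

# Made-up rows; the "Sale Condition" column name and its labels are assumptions.
df = pd.DataFrame({
    "Sale Condition": ["Normal", "Abnorml", "Normal", "Partial", "Normal"],
    "SalePrice": [200000, 120000, 250000, 310000, 180000],
})

# Inspect how many listings fall under each condition before filtering.
print(df["Sale Condition"].value_counts())

# Keep only the sales made under normal conditions.
normal_sales = df[df["Sale Condition"] == "Normal"].reset_index(drop=True)

print(len(df), "->", len(normal_sales))
```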

Data Preprocessing

<h4 id="shallowlearningalgorithms">Shallow Learning Algorithms</h4>
Let's start out with Shallow Learning - both as a sanity check and as a way to determine the baseline performance of a simpler, rule-based algorithm. Although these models are simpler - they can be exceedingly powerful, and the strongest contender is a Random Forest, which in many cases gets Deep Learning-level results in a fraction of the time.
It'll be our main baseline performance indicator throughout the project! Besides the random forest regressor, we'll try out some other regressors as well.
<h4 id="elasticnetdecisiontreesandrandomforests">ElasticNet, Decision Trees and Random Forests</h4>
ElasticNets are linear regressors with L1 and L2 priors as regularizers. Linear regression won't get us really far if there's a complex relationship between variables, though in our case, even a linear regressor might perform decently given how strongly some of the features correlate with the price. If we had fewer features correlating this strongly, ElasticNet would probably perform significantly worse than it might perform here.
Decision Trees seem like a much more fitting algorithm for our case conceptually, and a Random Forest is an ensemble of Decision Trees, each trained on a different random sample of the data, whose predictions are averaged to produce the final value. Generally speaking, ensemble models perform better than singular models, though it might not always be the case. Random Forests in particular tend to perform significantly better than singular Decision Trees, and are an extremely powerful algorithm.
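To make the comparison concrete, here is a minimal sketch that fits all three regressors and reports their Mean Absolute Error. It uses Scikit-Learn's synthetic, non-linear <code>make_friedman1</code> data rather than the housing dataset, and the hyperparameters are arbitrary placeholders, so the numbers say nothing about the results obtained in the project.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear regression problem (a stand-in, not the housing data).
X, y = make_friedman1(n_samples=500, n_features=10, noise=1.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameters are arbitrary placeholders chosen for illustration.
models = {
    "ElasticNet": ElasticNet(alpha=0.1, l1_ratio=0.5),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training set and score it on the held-out testing set.
maes = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    maes[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {maes[name]:.3f}")
```

The <code>l1_ratio</code> argument controls the mix between the L1 and L2 penalties of the ElasticNet.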

Building Machine Learning Models

<h3 id="ensemblelearning">Ensemble Learning</h3>
Ensemble Learning is a powerful learning technique that generally yields better results than an individual model, though that's not guaranteed. Each model has a small factor of randomness, as well as its own quirks and traits. Increasing the number of models and having them work together typically reduces the variance of the output and &quot;irons out&quot; irregularities.
<img src="//s3.stackabuse.com/media/guided+projects/hands-on-house-price-prediction-machine-learning-in-python-20.png" alt="client and agents">
Different realtors have seen different properties in their time, and each has their own wisdom. It's typical to hear a ballpark of appraisals from different people. In the illustration above, our orange client got a much more accurate idea of the value of the house just by averaging out the generally correct appraisals by multiple realtors, rather than listening to just one.
<blockquote>
In a similar way, many psychological studies show how diversity in human teams helps find solutions to problems faster. <a target="_blank" rel="nofollow noopener noreferrer" href="https://hbr.org/2016/11/why-diverse-teams-are-smarter">Harvard Business Review</a> outlines several studies that show how diverse teams display more accurate group thinking, process facts more carefully and display a higher level of innovation.
</blockquote>
It's not hard to grasp why - having multiple sets of eyes and lenses on a subject makes it easier to see more details. Having a more diverse set of models predicting a target variable and working together to produce this result helps with accuracy.
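The averaging of appraisals described above corresponds to a voting ensemble for regression. The following is a minimal sketch using Scikit-Learn's <code>VotingRegressor</code> on synthetic data; the choice of base models and their hyperparameters is an assumption made for illustration, not the ensemble built in the project.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the housing dataset.
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Three different "realtors", each with its own quirks (placeholder hyperparameters).
voter = VotingRegressor(estimators=[
    ("enet", ElasticNet(alpha=0.1)),
    ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
])
voter.fit(X_train, y_train)

# The ensemble's prediction is the plain average of the individual predictions.
individual = np.array([est.predict(X_test) for est in voter.estimators_])
ensemble_pred = voter.predict(X_test)
manual_average = individual.mean(axis=0)

print(np.allclose(ensemble_pred, manual_average))
```

Because the ensemble simply averages its members, errors that the individual models make in different directions tend to cancel out.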

Ensemble Learning - Combining Model Performances

That concludes this Guided Project - &quot;Hands-On House Price Prediction - Machine Learning in Python&quot;. Thank you for taking a ride with us!
<blockquote>
Online education is spreading through the world, and is becoming an increasingly important part of many lives. We believe that accessible, high-quality resources can help empower people that build tomorrow, and remain guided by that goal.
</blockquote>
At StackAbuse, we believe that learning is not a one-stop time investment. It's life-long. Especially in the volatile and rapidly changing world of Computer Science and Software Engineering. So, we've pledged to update our courses, guides, and other upcoming material to keep the pace of progress in the field. Software is updating - it's only fitting that learning resources are updating as well.
Thank you for purchasing &quot;Hands-On House Price Prediction - Machine Learning in Python&quot;! We hope that it has brought a ton of value to you so far, and know that it will continue to do so as you dive further into this topic.
<blockquote>
Now, we'd like to ask you to get involved in improving the next version of our material.
</blockquote>
We believe that high-quality resources and education are community-driven and that minor (or major) contributions from each member result in a wonderful learning oasis. For this, feedback is crucial.

Thank You for Supporting Online Education

Jovana Ninkovic


 <div class="alert alert-note">
 <div class="flex">
 
 <div class="flex-shrink-0 mr-3">
 <img src="/assets/images/icon-information-circle-solid.svg" class="icon" aria-hidden="true" />
 </div>
 
 Note: This Guided Project is aimed at novices in Machine Learning and assumes rudimentary knowledge of Programming in Python and its ecosystem and at least a surface-level understanding of Keras and Scikit-Learn.

 </div>
 </div>