This article is a continuation of the prior article in a three part series on using Machine Learning in Python to predict weather temperatures for the city of Lincoln, Nebraska in the United States based off data collected from Weather Underground's API services.
In the first article of the series, Using Machine Learning to Predict the Weather: Part 1, I described how to extract the data from Weather Underground, parse it, and clean it. For a summary of the topics for each of the articles presented in this series, please see the introduction to the prior article.
The focus of this article will be to describe the processes and steps required to build a rigorous Linear Regression model to predict future mean daily temperature values based off the dataset built in the prior article. To build the Linear Regression model I will be demonstrating the use of two important Python libraries in the Machine Learning industry: Scikit-Learn and StatsModels.
In the third article of the series, Using Machine Learning to Predict the Weather: Part 3, I describe how the processes and steps required to build a Neural Network using Google's TensorFlow to predict future mean daily temperatures. Using this method I can then compare the results to the Linear Regression model.
Refamiliarizing Ourselves with the Dataset
In this GitHub repository you will find a Jupyter Notebook with the file name Weather Underground API.ipynb which describes the step-by-step actions required to collect the dataset we will be working with in this and the final article. Additionally, in this repository you will find a pickled Pandas DataFrame file called end-part1_df.pkl. So, if you would like to follow along without going through the somewhat painful experience of gathering, processing, and cleaning the data described in the prior article then pull down the pickle file and use the following code to deserialize the data back into a Pandas DataFrame for use in this section.
import pickle with open('end-part1_df.pkl', 'rb') as fp: df = pickle.load(fp)
If you receive an error stating No module named 'pandas.indexes' this is because you are probably using a version of pandas newer than the one I was using at the time of writing this article (v0.18.1).
To avoid this I have since then included a CSV file in the repo which contains the data from the end of part 1 that you can read in using the following code instead:
import pandas as pd df = pd.read_csv('end-part2_df.csv').set_index('date')
Background on Linear Regression using Ordinary Least Squares
Linear regression aims to apply a set of assumptions primary regarding linear relationships and numerical techniques to predict an outcome (Y, aka the dependent variable) based off of one or more predictors (X's independent variables) with the end goal of establishing a model (mathematical formula) to predict outcomes given only the predictor values with some amount of uncertainty.
The generalized formula for a Linear Regression model is:
ŷ = β0 + β1 * x1 + β2 * x2 + ... + β(p-n) x(p-n) + Ε
ŷis the predicted outcome variable (dependent variable)
xjare the predictor variables (independent variables) for j = 1,2,..., p-1 parameters
β0is the intercept or the value of
βjis the change in
ŷbased on a one unit change in one of the corresponding
Εis a random error term associated with the difference between the predicted
ŷivalue and the actual
That last term in the equation for the Linear Regression is a very important one. The most basic form of building a Linear Regression model relies on an algorithm known as Ordinary Least Squares which finds the combination of
βj's values which minimize the
Selecting Features for our Model
A key assumption required by the linear regression technique is that you have a linear relationship between the dependent variable and each independent variable. One way to assess the linearity between our independent variable, which for now will be the mean temperature, and the other independent variables is to calculate the Pearson correlation coefficient.
The Pearson correlation coefficient (r) is a measurement of the amount of linear correlation between equal length arrays which outputs a value ranging -1 to 1. Correlation values ranging from 0 to 1 represent increasingly strong positive correlation. By this I mean that two data series are positively correlated when values in one data series increase simultaneously with the values in the other series and, as they both go up in increasingly equal magnitude the Pearson correlation value will approach 1.
Correlation values from 0 to -1 are said to be inversely, or negatively, correlated in that when the values of one series increase the corresponding values in the opposite series decrease but, as changes in magnitude between the series become equal (with opposite direction) the correlation value will approach -1. Pearson correlation values that closely straddle either side of zero are suggestive to have a weak linear relationship, becoming weaker as the value approaches zero.
Opinions vary among statisticians and stats books on clear-cut boundaries for the levels of strength of a correlation coefficient. However, I have found that a generally accepted set of classifications for the strengths of correlation are as follows:
|0.8 - 1.0||Very Strong|
|0.6 - 0.8||Strong|
|0.4 - 0.6||Moderate|
|0.2 - 0.4||Weak|
|0.0 - 0.2||Very Weak|
To assess the correlation in this data I will call the
corr() method of the Pandas DataFrame object. Chained to this
corr() method call I can then select the column of interest ("meantempm") and again chain another method call
sort_values() on the resulting Pandas Series object. This will output the correlation values from most negatively correlated to the most positively correlated.
In selecting features to include in this linear regression model, I would like to error on the side of being slightly less permissive in including variables with moderate or lower correlation coefficients. So I will be removing the features that have correlation values less than the absolute value of 0.6. Also, since the "mintempm" and "maxtempm" variables are for the same day as the prediction variable "meantempm", I will be removing those also (i.e. if I already know the min and max temperatures then I already have the answer to my prediction).
With this information, I can now create a new DataFrame that only contains my variables of interest.
predictors = ['meantempm_1', 'meantempm_2', 'meantempm_3', 'mintempm_1', 'mintempm_2', 'mintempm_3', 'meandewptm_1', 'meandewptm_2', 'meandewptm_3', 'maxdewptm_1', 'maxdewptm_2', 'maxdewptm_3', 'mindewptm_1', 'mindewptm_2', 'mindewptm_3', 'maxtempm_1', 'maxtempm_2', 'maxtempm_3'] df2 = df[['meantempm'] + predictors]
Visualizing the Relationships
Because most people, myself included, are much more accustomed to looking at visuals to assess and verify patterns, I will be graphing each of these selected predictors to prove to myself that there is in fact a linear relationship. To do this I will utilize matplotlib's pyplot module.
For this plot I would like to have the dependent variable "meantempm" be the consistent y-axis along all of the 18 predictor variables plots. One way to accomplish this is to create a grid of plots. Pandas does come with a useful plotting function called the
scatter_plot(), but I generally only use it when there are only up to about 5 variables because it turns the plot into an N x N matrix (18 x 18 in our case), which becomes difficult to see details in the data. Instead I will create a grid structure with six rows of three columns in order to avoid sacrificing clarity in the graphs.
import matplotlib import matplotlib.pyplot as plt import numpy as np
%matplotlib inline # manually set the parameters of the figure to and appropriate size plt.rcParams['figure.figsize'] = [16, 22] # call subplots specifying the grid structure we desire and that # the y axes should be shared fig, axes = plt.subplots(nrows=6, ncols=3, sharey=True) # Since it would be nice to loop through the features in to build this plot # let us rearrange our data into a 2D array of 6 rows and 3 columns arr = np.array(predictors).reshape(6, 3) # use enumerate to loop over the arr 2D array of rows and columns # and create scatter plots of each meantempm vs each feature for row, col_arr in enumerate(arr): for col, feature in enumerate(col_arr): axes[row, col].scatter(df2[feature], df2['meantempm']) if col == 0: axes[row, col].set(xlabel=feature, ylabel='meantempm') else: axes[row, col].set(xlabel=feature) plt.show()
From the plots above it is recognizable that all the remaining predictor variables show a good linear relationship with the response variable ("meantempm"). Additionally, it is also worth noting that the relationships all look uniformly randomly distributed. By this I mean there appears to be relatively equal variation in the spread of values devoid of any fanning or cone shape. A uniform random distribution of spread along the points is also another important assumption of Linear Regression using Ordinary Least Squares algorithm.
Using Step-wise Regression to Build a Robust Model
A robust Linear Regression model should utilize statistical tests for selecting meaningful, statistically significant, predictors to include. To select statistically significant features, I will utilize the Python
statsmodels library. However, before I jump into the practical implementation of using the
statsmodels library I would like to take a step back and explain some of the theoretical meaning and purpose for taking this approach.
A key aspect of using statistical methods such as Linear Regression in an analytics project are the establishment and testing of hypothesis tests to validate the significance of assumptions made about the data under study. There are numerous hypothesis tests that have been developed to test the robustness of a linear regression model against various assumptions that are made. One such hypothesis test is to evaluate the significance of each of the included predictor variables.
The formal definition of the hypothesis test for the significance of a
βj parameters are as follows:
βj = 0, the null hypothesis states that the predictor has no effect on the outcome variable's value
βj ≠ 0, the alternative hypothesis is that the predictor has a significant effect on the outcome variable's value
By using tests of probability to evaluate the likelihood that each
βj is significant beyond simple random chance at a selected threshold
Α we can be more stringent in selecting the variables to include resulting in a more robust model.
However, in many datasets there can be interactions that occur between variables that can lead to false interpretations of these simple hypothesis tests. To test for the effects of interactions on the significance of any one variable in a linear regression model a technique known as step-wise regression is often applied. Using step-wise regression you add or remove variables from the model and assess the statistical significance of each variable on the resultant model.
Free eBook: Git Essentials
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
In this article, I will be using a technique known as backward elimination, where I begin with a fully loaded general model that includes all my variables of interest.
Backward Elimination works as follows:
- Select a significance level
Αfor which you test your hypothesis against to determine if a variable should stay in the model
- Fit the model with all predictor variables
- Evaluate the p-values of the
βjcoefficients and for the one with the greatest p-value, if p-value >
Αprogress to step 4, if not you have your final model
- Remove the predictor identified in step 3
- Fit the model again but, this time without the removed variable and cycle back to step 3
So, without further delay let us build this fully loaded generalized model using
statsmodels following the above steps.
# import the relevant module import statsmodels.api as sm # separate our my predictor variables (X) from my outcome variable y X = df2[predictors] y = df2['meantempm'] # Add a constant to the predictor variable set to represent the Bo intercept X = sm.add_constant(X) X.ix[:5, :5]
# (1) select a significance value alpha = 0.05 # (2) Fit the model model = sm.OLS(y, X).fit() # (3) evaluate the coefficients' p-values model.summary()
summary() call will produce the following data in your Jupyter notebook:
|Date:||Thu, 16 Nov 2017||Prob (F-statistic):||0.00|
Ok, I recognize that the call to
summary() just barfed out a whole lot of information onto the screen. Do not get overwhelmed! We are only going to focus on about 2-3 values in this article:
- P>|t| - this is the p-value I mentioned above that I will be using to evaluate the hypothesis test. This is the value we are going to use to determine whether to eliminate a variable in this step-wise backward elimination technique.
- R-squared - a measure that states how much of the overall variance in the outcome our model can explain
- Adj. R-squared - the same as R-squared but, for multiple linear regression this value has a penalty applied to it based off the number of variables being included to explain the level of overfitting.
This is not to say that the other values in this output are without merit, quite the contrary. However, they touch on the more esoteric idiosyncrasies of linear regression which we simply just don't have the time to get into now. For a full explanation of them I will defer you to an advanced regression textbook such as Kutner's Applied Linear Regression Models, 5th Ed. as well as the
# (3) cont. - Identify the predictor with the greatest p-value and assess if its > our selected alpha. # based off the table it is clear that meandewptm_3 has the greatest p-value and that it is # greater than our alpha of 0.05 # (4) - Use pandas drop function to remove this column from X X = X.drop('meandewptm_3', axis=1) # (5) Fit the model model = sm.OLS(y, X).fit() model.summary()
|Date:||Thu, 16 Nov 2017||Prob (F-statistic):||0.00|
In respect of your reading time and in an attempt to keep the article to a reasonable length I am going to omit the remaining elimination cycles required to build each new model, evaluate p-values and remove the least significant value. Instead I will jump right to the last cycle and provide you with the final model. Afterall, the main goal here was to describe the process and the reasoning behind it.
Below you will find the output from the final model I converged on after applying the backwards elimination technique. You can see from the output that all the remaining predictors have a p-values significantly below our
Α of 0.05. Another thing worthy of some attention are the R-squared values in the final output. Two things to note here are (1) the R-squared and Adj. R-squared values are both equal which suggests there is minimal risk that our model is being over fitted by excessive variables and (2) the value of 0.894 is interpreted such that our final model explains about 90% of the observed variation in the outcome variable, the "meantempm".
model = sm.OLS(y, X).fit() model.summary()
|Date:||Thu, 16 Nov 2017||Prob (F-statistic):||0.00|
Using SciKit-Learn's LinearRegression Module to Predict the Weather
Now that we have gone through the steps to select statistically meaningful predictors (features), we can use SciKit-Learn to create a prediction model and test its ability to predict the mean temperature. SciKit-Learn is a very well established machine learning library that is widely used in both industry and academia. One thing that is very impressive about SciKit-Learn is that it maintains a very consistent API of "fit", "predict", and "test" across many numerical techniques and algorithms which makes using it very simple. In addition to this consistent API design, SciKit-Learn also comes with several useful tools for processing data common to many machine learning projects.
We will start by using SciKit-Learn to split our dataset into a testing and training sets by importing the
train_test_split() function from
sklearn.model_selection module. I will split the training and testing datasets into 80% training and 20% testing and assign a
random_state of 12 to ensure you will get the same random selection of data as I do. This
random_state parameter is very useful for reproducibility of results.
from sklearn.model_selection import train_test_split
# first remove the const column because unlike statsmodels, SciKit-Learn will add that in for us X = X.drop('const', axis=1) X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)
The next action to take is to build the regression model using the training dataset. To do this I will import and use the
LinearRegression class from the
sklearn.linear_model module. As mentioned previously, scikit-learn scores major usability bonus points by implementing a common
predict() API across its numerous numerical techniques which makes using the library very user friendly.
from sklearn.linear_model import LinearRegression
# instantiate the regressor class regressor = LinearRegression() # fit the build the model by fitting the regressor to the training data regressor.fit(X_train, y_train) # make a prediction set using the test set prediction = regressor.predict(X_test) # Evaluate the prediction accuracy of the model from sklearn.metrics import mean_absolute_error, median_absolute_error print("The Explained Variance: %.2f" % regressor.score(X_test, y_test)) print("The Mean Absolute Error: %.2f degrees celsius" % mean_absolute_error(y_test, prediction)) print("The Median Absolute Error: %.2f degrees celsius" % median_absolute_error(y_test, prediction))
The Explained Variance: 0.90 The Mean Absolute Error: 2.69 degrees celsius The Median Absolute Error: 2.17 degrees celsius
As you can see in the few lines of code above using scikit-learn to build a Linear Regression prediction model is quite simple. This is truly where the library shines in its ability to easily fit a model and make predictions about an outcome of interest.
To gain an interpretative understanding of the models validity I used the regressor model's
score() function to determine that the model is able to explain about 90% of the variance observed in the outcome variable, mean temperature. Additionally, I used the
median_absolute_error() of the
sklearn.metrics module to determine that on average the predicted value is about 3 degrees Celsius off and half of the time it is off by about 2 degrees Celsius.
Want to learn the tools, machine learning techniques, and data analysis used in this tutorial? Here are a few great resources to get you started:
- Data Analysis with Pandas and Python
- Deep Learning Prerequisites: Linear Regression in Python
- Python for Data Science and Machine Learning Bootcamp
In this article, I demonstrated how to use the Linear Regression Machine Learning algorithm to predict future mean weather temperatures based off the data collected in the prior article. I demonstrated how to use the
statsmodels library to select statistically significant predictors based off of sound statistical methods. I then utilized this information to fit a prediction model based off a training subset using Scikit-Learn's
LinearRegression class. Using this fitted model I could then predict the expected values based off of the inputs from a testing subset and evaluate the accuracy of the prediction, which indicates a reasonable amount of accuracy.
I would like to thank you for reading my article and I hope you look forward to the upcoming final article in this machine learning series where I describe how to build a Neural Network to predict the weather temperature.