A Look at Linear Regression (with Examples in Excel and Python)

Eric Kleppen, Contributor to the CareerFoundry blog

Data science is a fascinating field that aims to solve challenging problems like autonomous vehicles and artificial intelligence (AI). Recently, AI and data science have been in the news, as real estate company Zillow Group shut down its home buying program that used predictive analytics to forecast the price of homes months into the future. Although they could have spent more resources tuning their algorithm, they found the potential risks outweighed the benefits

In practice, predicting the price of homes months into the future is certainly a challenge—even for the most sophisticated algorithms. However, at its core, predicting a home’s value is simply a regression problem, since a home’s price is a continuous variable based on many independent variables like rooms, location, year built, and so on. Even simple linear regression—one of the simplest algorithms—can be used to price homes. 

In this tutorial, we will briefly introduce regression analysis and look at linear regression examples. First, we’ll build a simple linear regression model in Microsoft Excel. Then, we’ll build a more complex linear regression model using a few lines of Python code. By the end of this post, you’ll have an understanding of regression analysis, linear regression, and how to apply it.

You can use the clickable menu to skip ahead to any section:

  1. Regression analysis 101
  2. What is linear regression?
  3. Using Python to solve a bigger problem
  4. Assumptions of linear regression
  5. Building a linear regression model using scikit-learn
  6. Evaluating the linear regression model
  7. Linear regression modeling code review
  8. Conclusion and next steps

With that, let’s get started.

Regression analysis 101

In Data science, regression analysis is primarily used in prediction and forecasting tasks. Essentially, regression techniques fit a line to the data, which allows you to estimate changes to the dependent variable (for example, price) as changes occur to the independent variables (for example, size). Linear regression models assume the data have a linear relationship, and thus fit a straight line to it. Other regression models, like logistic regression, fit a curved line to the data. 

Diagram showing linear regression vs logistic regression
Linear regression vs logistic regression (image by author)

Regression analysis is a versatile set of techniques, because they are relatively easy to compute and explain compared to systems like neural networks and deep learning. Beyond a method for making predictions, regression analysis can also be used for things like identifying significant predictors and understanding the strength of relationships within the data. 

What is linear regression?

One of the most common forms of regression analysis is linear regression. It is classified as a supervised learning algorithm. Simple linear regression is the term used when the linear regression model uses a single dependent variable and a single independent variable. When there are multiple variables, it is called multiple linear regression

The linear regression algorithm draws the line of best fit through the data. It can do this by solving for the regression coefficient that minimizes the total error. To calculate the line of best fit we typically use Ordinary Least Squares (OLS).

Want an in-depth look at linear regression? Check out our beginner’s guide in this article!

We won’t dwell on the mathematics, but will do a quick explanation of the equation. It is important to know the basics of the linear regression calculation if you need to explain the predictions. The simple linear regression model equation is this:

prediction = intercept + slope * independent variable + error :

Mathematical equation showing prediction = intercept + slope * independent variable + error
Sourced from Wikipedia: Simple linear regression

y is the predicted value of the dependent variable.

a is the intercept. Think of this as where the line would cross the x axis (x=0).

B is the slope of the line.

x independent variable. 

e is the error or variation in the estimate of the regression coefficient.

All of this sounds complex, but can be done in a few clicks within a spreadsheet or a few lines of code within Python.

A Simple Example

An easy way to understand simple linear regression is to imagine we want to sell a house and have to determine a price. To determine price, first figure the finished square footage of the home is 1280. Using that information, let’s look at sale prices of similarly-sized homes in the surrounding area. Say we find five houses that are roughly the same size, all recently sold for slightly different prices:

Screenshot of part of a Microsoft Excel sheet showing data for SquareFeet (that is, square feet of ahouse) and Price
Example data (image by author)

If we plot the prices by square footage, we’ll notice an upward trend showing price increases as square footage increases. 

Using Microsoft Excel, enter the data into columns A and B respectively and create a chart by clicking Insert > Chart. Note: this can also be done in Google Sheets.

Screenshot of a Microsoft Excel sheet with data and a scatterplot generated from this data
Microsoft Excel scatter chart (image by author)

To predict the price of the home, fit a trendline to the data in the chart. 

Click Chart Elements > Trendline. Mark the Trendline checkbox. Make sure the Linear option is selected within the Trendline options. A trendline will appear.

Screenshot of a scatter graph showing Price versus SquareFeet with an arrow highlighting which Excel function creates the trendline
Microsoft Excel trendline (image by author)

Using the trendline, we can predict the price of the 1280 square foot home should be around $245,000. 

In Excel, open the Data Analysis tools and select Regression. A summary of statistics will appear in a new sheet by default. Notice the intercept and X Variable (slope) values. Plug those values into the above equation, along with the house’s square footage, to get a simple price prediction:

Regression summary output in Microsoft Excel with the Coefficient column and the Intercept and X Value rows highlighted in red.
Regression summary output in Microsoft Excel (image by author)

The negative intercept tells you where the linear model predicts house price (y) would be when square footage (x) is 0. Solve for y by using the Intercept and X Variable coefficient values:

y = -466500 + 555*1280 

= -466500 + 710400

= 243900

So there you have it! We just priced a house using a simple linear regression algorithm and calculation. It is that simple. 

Using Python to solve a bigger problem

Simple linear regression in Microsoft Excel is useful when you only have a couple variables to work with and a small data set. However, if you have hundreds of variables and millions of rows of data (like Zillow Group likely had) you will need a tool like Python to manipulate the data and build the multiple linear regression model, since there are two or more X variables.

Continuing with the house price example, let’s add a few more variables into the data. I am using a modified copy of the free Kaggle House Pricing data in this example

You have two options to get the data:

Reviewing the Data

I’ve reduced the data down to 11 columns of numeric data, from the original 81 columns in the data set. Columns that store categorical data and strings have been removed since engineering those columns to work with the algorithm is beyond this tutorial.

Start by importing dependencies (sandas, Plotly Express, and scikit-learn) and data from the train.csv file. 

import pandas as pd

import plotly.express as px

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegressionfrom sklearn.metrics import mean_absolute_error

 

Read the CSV file into a DataFrame. Use .head() to display the first 5 rows and print the shape. Review the column names too. 

#read training data csv file

train = pd.read_csv(r”C:\Users\bendgame\Desktop\house-prices\train.csv”)

print(train.shape)

train.head()

Screenshot of the 'train' dataframe in Python
Train dataframe (image by author)

Notice there are 11 columns and 1431 rows in the train dataframe. Below is a description of the columns:

  • SalePrice: The property’s sale price in dollars. This is Y, or the target variable that you’re trying to predict . 
  • LotArea: Lot size in square feet
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • BedroomAbvGr: Number of bedrooms above basement level
  • KitchenAbvGr: Number of kitchens
  • Fireplaces: Number of fireplaces
  • GarageCars: Size of garage in car capacity

Assumptions of linear regression

Linear regression is most useful on data that meets certain assumptions. I’ve already filled in missing values and removed some outliers, but understanding how to review the data to make sure it is a good fit for linear regression is important. Follow this checklist to make sure your data is a good fit for the approach:

  • The dependent and independent variables should have a linear relationship.
  • The independent variables are not all highly correlated. Check multicollinearity using a correlation matrix.
  • The outliers are handled appropriately since they can have a heavy impact on results. Check for outliers using a scatter plot or other charts.
  • The data follows a multivariate normal distribution. Check for normality using a goodness of fit test.

Although I’ve performed some pre-processing on the data, let’s review some techniques for checking and visualizing the data. In general, reviewing data before modeling is called exploratory data analysis

Use pandas .describe() to display statistics about the columns. Looking at the statistics can help identify things, like outliers and standard deviation. 

train.describe()

Use Plotly Express histogram to review the distribution of SalePrice. We want it to follow a normal distribution and can use different scaling techniques to transform it if needed. For example, consider using a log transformation on SalePrice since it is a bit skewed.

px.histogram(train, x=’SalePrice’)

SalePrice histogram generated with pandas
SalePrice histogram (image by author)

Use a scatter plot to look for outliers and check the relationship between variables. Check variables SalePrice and LotArea for example. Remember, linear regression is sensitive to outliers. 

px.scatter(train, x=’LotArea’, y=’SalePrice’)

Scatterplot generated using pandas showing house sale price vs. lot area
Sale price vs. lot area scatter plot (image by author)

If the statsmodels python library is installed, one of the fantastic features of Plotly Express is we can perform simple linear regression in one line of code:

#install statsmodels lib if needed

#!pip install statsmodels

#perform simple linear regression

px.scatter(train, x=’LotArea’, y=’SalePrice’, trendline = ‘ols’, trendline_color_override=”red”)

Scatterplot with trendline showing relationship between lot size vs sale price
Plotly simple linear regression (image by author)

Use the scatter_matrix to view multiple scatter plots at the same time. Depending on the number of columns in the dataframe, consider specifying the ones passed into scatter_matrix to make it easier to see. 

px.scatter_matrix(train[[‘SalePrice’, ‘LotArea’,GrLivArea’]])

Lastly, check the correlations using the .corr() function. This can help us check for multicollinearity. It is common to use a heatmap to visualize correlations. 

print(train.corr())

px.imshow(train.corr())

Correlation map generated with Plotly
Correlation heatmap (image by author)

Building a linear regression model using scikit-learn

Creating a linear regression model and generating predictions is only three lines of code, thanks to scikit-learn. However, we must first separate the target variable (SalePrice) from the data. Then we must split the dataframe into a training set and a testing set. Splitting the data into a training and testing set is a best practice in terms of model validation. 

Use scikit-learn’s train_test_split() function to split the data.

#create target variable

y = train[‘SalePrice’]

#create array of features

x = train.drop(columns = ‘SalePrice’)

#split the data into train and test sets

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=.55, random_state=42)

Next, instantiate a linear regression model and fit the data.

model = LinearRegression()

model.fit(xtrain, ytrain)

Now that the model has been trained, it can be used to predict prices. In practice, before pushing a model to production to be used for real predictions, it goes through rigorous evaluation. We can generate the predictions on the xtest data set and compare them to the ytest values to see how close the predictions are to the real prices. 

#Predicting the prices

pred = model.predict(xtest)

#look at the top 5 predictions vs top 5 real

print(pred[:5])

ytest[:5]

Linear regression predictions generated using scikit-learn
Linear regression predictions from scikit-learn (image by author)

We can see the predictions are close, but far from exact. As data scientists, we want to understand how much error exists in the model, so that we can compare multiple models and select the best one. To better understand the error, we must use a model evaluation technique.

Evaluating the linear regression model

We won’t know how accurate our model is unless we use a method for evaluating the line of best fit. One popular way of evaluating a model is using a technique called Mean Absolute Error (MAE). It is easy to understand because it is the average absolute value between the predicted point and the actual point.

#appy mean absolute error

mean_absolute_error(y_true=ytest, y_pred=model.predict(xtest))

Mean absolute error for the linear regression model
Mean absolute error for the linear regression model (image by author)

We can see the MAE is over 26,000 in the test set. A smaller value is better since we want accurate predictions. Other popular evaluation metrics include Mean Squared Error (MSE), Mean Absolute Percentage Error (MAPE), and R-squared. It is common to see multiple evaluation techniques used when evaluating a model because they all measure slightly different things. 

Linear regression modeling code review

Congratulations! You’ve just built a multiple linear regression model in Python, used it to predict house prices, and evaluated the model’s accuracy. All in just a few lines of code:

#create target variable

y = train[‘SalePrice’]

#create array of features

x = train.drop(columns = ‘SalePrice’)

#split the data into train and test sets

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=.55, random_state=42)

#build the model

model = LinearRegression()

#fit the model

model.fit(xtrain, ytrain)

#Predicting the prices

pred = model.predict(xtest)

#look at the top 5 predictions vs top 5 real

print(pred[:5])

print(ytest[:5])

#appy mean absolute error

mean_absolute_error(y_true=ytest, y_pred=model.predict(xtest))

Conclusion and next steps

Linear regression is a useful tool in the data analysis toolbox, and is capable of achieving great results in many use cases. Beyond pricing homes, regression analysis is used in all kinds of tasks, like identifying stock trends, understanding consumer behavior, and analyzing medical research. We just went through two linear regression examples to explain how it works, how to calculate a prediction, and how to visualize the line of best fit. In only a few clicks in Excel or a few lines of code in Python, you can apply linear regression to your data.

If you’re interested in learning more about linear regression, you may be interested in checking out our free, 5-day data analytics short course.

You could also check out other articles in our series on data analytics:

What You Should Do Now

  1. Get a hands-on introduction to data analytics and carry out your first analysis with our free, self-paced Data Analytics Short Course.

  2. Take part in one of our FREE live online data analytics events with industry experts.

  3. Talk to a program advisor to discuss career change and find out what it takes to become a qualified data analyst in just 4-7 months—complete with a job guarantee.

  4. This month, we’re offering the first 100 students reduced tuition—worth up to $1,120 off our full Data Analytics Program. To secure your spot, book an advisor call today.

What is CareerFoundry?

CareerFoundry is an online school for people looking to switch to a rewarding career in tech. Select a program, get paired with an expert mentor and tutor, and become a job-ready designer, developer, or analyst from scratch, or your money back.

Learn more about our programs