Machine Learning: Linear Regression

  • What is Linear Regression?
  • How does Linear Regression work?
  • Assumptions of Linear Regression
  • Features of Linear Regression
  • Does "Linear" really mean "Linear?"
  • How do we evaluate a linear regression
  • How to implement Linear Regression in Python?
  • How to improve a Linear Regression model?
  • When to use Linear Regression?

What is Linear Regression?

Linear regression is a supervised machine learning model. In reality, there are multiple different linear models for regression in machine learning, but in this video, when we say "linear regression", we will be referring to the Ordinary Least Squares (OLS) linear regression, which is the most common form of linear regression.

How does Linear Regression work?

In middle school algebra, we learned that a line takes the form:
y = mx + b
where m is the slope, x is the input value, b is the y-intercept, and y is the value being predicted.
Linear regression works by fitting a line of best fit through your data points. The "error" is the difference between an actual value and a predicted value, and OLS linear regression chooses the line that minimizes the total sum of squared errors. What this really means is that the algorithm finds the "m" and "b" that give you the best possible predictions (by minimizing the sum of squared errors).
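
To make this concrete, here is a minimal sketch (using NumPy and made-up data, not any dataset from these notes) of finding "m" and "b" with an ordinary least squares fit and computing the quantity being minimized:

import numpy as np

# made-up example data (hypothetical, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# np.polyfit with degree 1 performs an ordinary least squares line fit
m, b = np.polyfit(x, y, 1)

# the quantity OLS minimizes: the sum of squared errors (residuals)
predictions = m * x + b
sse = np.sum((y - predictions) ** 2)
print(m, b, sse)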

Assumptions of Linear Regression

These are not listed in any particular order.

  1. Regression model is linear. This is essentially stating that the dependent variable can be modeled as a linear combination of the independent variables and their coefficients. In layman's terms, this assumption is saying that the dependent variable can be described in the form y = mx + b, where m is the slope and b is the intercept.
  2. Error term has average value of zero. If this assumption is not met, then the error terms are predictable, which means important information is missing from your model. If the error term average is consistently negative or positive, then your model is consistently predicting incorrect values and has a bias problem.
  3. Independent variables are uncorrelated with the error term. If this rule is not met, then the error term can be predicted, which violates the assumption that the error term is random. If the error term is not random, it can be predicted, and that information should be included in the model.
  4. Error terms are uncorrelated. If an error term can be used to predict a subsequent error term, then the error terms are correlated. This is an issue called autocorrelation or serial correlation. You may be able to remedy this issue by including past observations of independent variables.
  5. Homoskedasticity. This assumption means that the variance of the errors is constant across all observations. If the variance changes, that is called heteroskedasticity. You can check whether your error terms have constant variance by graphing the errors (see the residual-check sketch after this list).
  6. No independent variable is a perfect linear combination of another independent variable. This is saying that none of your independent variables can have a perfect linear relationship with any other independent variable. Perfect linear correlation means that the Pearson coefficient is -1 or +1. Even if two variables are not perfectly correlated, strong correlation between them can still lower the accuracy of your model. This issue is called multicollinearity. If you have two perfectly correlated variables, you should exclude one from the model.
  7. Normally distributed residuals. This is the one assumption that's more optional than mandatory when it comes to OLS linear regressions. However, if your errors are normally distributed, then you can do accurate hypothesis testing and have reliable confidence intervals.
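
These notes don't include code for checking the assumptions, but here is a minimal sketch (the function name and data are hypothetical, and it assumes NumPy and Matplotlib are available) of how one might eyeball assumptions 2, 4, and 5 from a model's actual and predicted values:

import numpy as np
import matplotlib.pyplot as plt

def check_residuals(y_actual, y_pred):
    # residuals are the "errors" the assumptions refer to
    residuals = np.asarray(y_actual) - np.asarray(y_pred)

    # Assumption 2: the average error should be close to zero
    print("Mean residual:", residuals.mean())

    # Assumption 4: adjacent residuals should not predict each other
    # (a simple lag-1 correlation; values near 0 are what we want)
    print("Lag-1 autocorrelation:", np.corrcoef(residuals[:-1], residuals[1:])[0, 1])

    # Assumption 5: plot residuals against predictions; a roughly constant
    # spread (no funnel shape) suggests homoskedasticity
    plt.scatter(y_pred, residuals)
    plt.axhline(0, color="gray")
    plt.xlabel("Predicted value")
    plt.ylabel("Residual")
    plt.show()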

Features of Linear Regression

  • There is a theorem in statistics called the Gauss-Markov theorem. This theorem states that when the first six assumptions are met, the OLS linear regression is the best linear unbiased estimator, or BLUE for short. This essentially means that, when those six assumptions hold, OLS has the lowest variance of any linear unbiased estimator, so no other linear unbiased model will give you more precise estimates.

Does "linear" really mean "linear"?

  • We have been calling the OLS regression a linear regression because it fits a straight line. However, it can also be used to fit a nonlinear curve. This can be achieved by performing some algebra on one or more of your independent variables, such as taking the natural log of an independent variable and then using that transformed variable in your model, as sketched below. The model would still use the Ordinary Least Squares method to calculate the coefficients and intercept.
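
For example, here is a minimal sketch (with a hypothetical DataFrame and column names) of fitting a curve by log-transforming an independent variable and then running the usual OLS fit on the transformed column:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical data where y grows with the log of x
df = pd.DataFrame({"x": np.arange(1, 21)})
df["y"] = 3.0 * np.log(df["x"]) + 2.0

# transform the independent variable, then fit an ordinary OLS model on it
df["log_x"] = np.log(df["x"])
model = LinearRegression().fit(df[["log_x"]], df["y"])

# the fitted model is linear in log_x, but a curve in the original x
print(model.coef_, model.intercept_)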

How do we evaluate a Linear Regression?

OLS linear regressions are evaluated using the R-squared metric. This metric represents the proportion of the variance in the dependent variable that can be explained by your independent variable(s). It ranges from 0 to 1, with 0 meaning your independent variables explain 0% of the variance in your dependent variable and 1 meaning they explain 100% of it. A higher R-squared value is better.
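
For reference, R-squared compares the model's squared errors to the total variance of the dependent variable. A minimal sketch of computing it by hand (with hypothetical arrays) looks like this:

import numpy as np

# hypothetical actual and predicted values
y_actual = np.array([10.0, 12.0, 15.0, 20.0, 22.0])
y_pred = np.array([11.0, 12.5, 14.0, 19.0, 23.0])

# residual sum of squares: the squared errors the model makes
ss_res = np.sum((y_actual - y_pred) ** 2)
# total sum of squares: the variance around the mean of the dependent variable
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(r_squared)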

How to implement Linear Regression in Python?

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
In [12]:
# load the mock dataset of ages and incomes
data = pd.read_csv("MOCK_Income_Data.csv")
In [13]:
# split the Age feature and Income target into training and test sets
x_train, x_test, y_train, y_test = train_test_split(data["Age"],data["Income"])
In [14]:
x_train
Out[14]:
1     22
6     31
24    65
4     26
15    45
17    54
18    56
8     32
13    41
22    60
14    43
10    39
11    40
23    61
0     21
19    57
16    52
2     24
Name: Age, dtype: int64
In [15]:
# reshape the 1D Series into a 2D array because scikit-learn expects a 2D feature matrix
np.array(x_train).reshape(-1,1)
Out[15]:
array([[22],
       [31],
       [65],
       [26],
       [45],
       [54],
       [56],
       [32],
       [41],
       [60],
       [43],
       [39],
       [40],
       [61],
       [21],
       [57],
       [52],
       [24]])
In [16]:
# fit the OLS model: find the slope and intercept that minimize the sum of squared errors
reg_model = LinearRegression().fit(np.array(x_train).reshape(-1,1), y_train)
In [17]:
# score() returns the R-squared of the model on the held-out test data
reg_model.score(np.array(x_test).reshape(-1,1), y_test)
Out[17]:
0.8765257042185327
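
Once the model is fit, the learned slope and intercept can be inspected and new predictions made. This is a small continuation sketch (assuming the reg_model and imports from the cells above), not part of the original notebook:

# slope (m) and intercept (b) learned by the model
print(reg_model.coef_, reg_model.intercept_)

# predict income for some new ages; scikit-learn expects a 2D array of features
new_ages = np.array([[25], [35], [50]])
print(reg_model.predict(new_ages))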

How to improve a Linear Regression?

  1. Experiment with different variables in different combinations. Choose the combination that works best.
  2. Perform exploratory analysis. Try to include variables that you think make sense. You may also want to eliminate outliers to prevent them from weakening your predictive power.
  3. Consider using adjusted R-squared. You may be able to increase your R-squared by including a lot of extraneous variables in your model. Adjusted R-squared penalizes adding extraneous variables, so you should consider using that metric in addition to regular R-squared (a small sketch of the formula follows this list).
  4. Transform your data. You may want to take the natural log of one or more of your variables. You may want to normalize or standardize your variables. Experiment with different data transformations and see which ones improve your model.
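
For item 3, adjusted R-squared is straightforward to compute from the regular R-squared. A minimal sketch (the values of r_squared, n, and p are hypothetical) follows:

# adjusted R-squared penalizes adding extra independent variables
r_squared = 0.88   # hypothetical regular R-squared
n = 25             # hypothetical number of observations
p = 1              # hypothetical number of independent variables

adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
print(adjusted_r_squared)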

Pros and Cons of OLS Linear Regression

Pros

  1. Simple, easy to understand and explain.
  2. No hyperparameters to tune.
  3. Can perform well with multiple variables.

Cons

  1. Can be thrown off by outliers (you can remove outliers; see the sketch after this list).
  2. Must satisfy the assumptions (e.g., no perfect linear relationships between independent variables, no heteroskedasticity).
  3. Other models are likely to give better predictive performance.
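
For the first con, one common way to drop outliers before fitting is an interquartile-range (IQR) filter. This is a minimal sketch (the function and the example call are hypothetical, not part of the notebook above):

import pandas as pd

def drop_outliers_iqr(df, column):
    # keep rows whose value lies within 1.5 * IQR of the quartiles
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# e.g. data = drop_outliers_iqr(data, "Income") before splitting into train/test sets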

When to use a Linear Regression?

  1. Your dependent variable is numeric.
  2. You plan to use the model for inference (it can also be used for predictions).
  3. You need a simple or easy to explain model.