Introduction to Linear Regression

You have seen how to find the equation of a line that connects two points. Often, we have more than two data points, and usually the data points do not all lie on a single line. It is possible to find the equation of a line that most closely fits a set of data points. Such a line is called a regression line or a linear regression equation. Our goal here is to learn what a regression line is. You can then watch the presentation on how to find the equation of a regression line.

Consider the following table that shows the average price of a two-bedroom apartment in downtown New York City from 1994 to 2004, where t=0 represents 1994.

t (years since 1994)    p (price in millions of $)
0                       0.38
2                       0.40
4                       0.60
6                       0.95
8                       1.20
10                      1.60

We can plot each of these data points on a graph. Each point is of the form (t, p), so we have 6 points to plot. They are (0, 0.38), (2, 0.40), (4, 0.60), (6, 0.95), (8, 1.20), and (10, 1.60). Just looking at them like this doesn't give much indication of a pattern, although we can see that the p-values increase as t increases. When we plot the points all together on a set of axes, we get the following scatter plot:

[Scatter plot: price p in millions of $ (vertical axis, 0 to 1.8) against time t in years since 1994 (horizontal axis, 0 to 12)]

It seems that the data do follow a somewhat linear pattern. We can find the line that most closely fits the data points and graph it over them.

[Figure: the scatter plot with the line of best fit drawn over the data points; price p in millions of $ against time t in years since 1994]

Notice that the line does not go through all of the data points. We can also find the equation of this line of best fit. We can also get what's called the correlation coefficient.

[Figure: the same scatter plot and line of best fit, labeled with the equation f(x) = 0.13x + 0.22 and the value R² = 0.95; price p in millions of $ against time t in years since 1994]
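As a rough illustration of where the equation on the chart comes from, here is a minimal Python sketch that computes the slope and vertical intercept of the line of best fit from the six data points. The function name fit_line and the variable names are our own choices for this sketch, not part of the course materials; in the course itself, Excel does this calculation for you.

```python
# Data from the apartment-price example:
# t = years since 1994, p = average price in millions of dollars.
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def fit_line(xs, ys):
    """Return (slope, intercept) of the line of best fit through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(t_values, p_values)
print(f"p = {slope:.4f} t + {intercept:.4f}")  # p = 0.1264 t + 0.2229
```

Rounded to two decimal places, these are the 0.13 and 0.22 shown on the chart.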

You will be able to do all of this in Excel once you watch the instructional video and read the PDFs for this material. For now, we just want to get an idea of what the regression line is and what the correlation coefficient tells us.

What does the regression equation tell us about the relationship between time and sale price? The slope and the vertical intercept (usually the y-intercept, here the p-intercept) tell us different things.

In this case, the p-intercept tells us what the sale price is predicted to be when t=0 (that is, in the year 1994). The regression equation is p=0.1264t+0.2229. Recall that price is in millions of dollars. Thus, if t=0, the regression equation predicts a price of $0.2229 million or $222,900. According to the table, the actual price was $0.38 million or $380,000. These values don't have to be the same, however, since the regression equation can't match every point exactly. It is only a model that most closely fits the data points.

What does the slope of the regression equation tell us? The slope of our regression equation is 0.1264. We can always write a number x as x divided by 1, so we can write this slope as 0.1264/1. Recall that the definition of slope is (change in y)/(change in x). In this case we are using p and t, so it's (change in p)/(change in t). So for our problem, we have (change in p)/(change in t) = 0.1264/1.

For this problem, t is measured in years and p is measured in millions of dollars. So more specifically, the slope can be interpreted to mean that if t increases by 1 year, the model predicts that the average price p of a two-bedroom apartment will increase by about $0.1264 million, or $126,400. Even more plainly, we can say that the model predicts that the average price of a two-bedroom apartment in New York City will increase by about $126,400 per year.
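To see the "per year" meaning of the slope in numbers, here is a small Python sketch that compares the model's predictions for two consecutive years. The function name predict_price is an assumption made for this illustration, and the choice of t=4 and t=5 is arbitrary; any two values of t that are one year apart give the same increase.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


# Increase in the predicted price when t goes up by one year (from t=4 to t=5).
increase_in_millions = predict_price(5) - predict_price(4)
increase_in_dollars = round(increase_in_millions * 1_000_000)
print(increase_in_dollars)  # 126400
```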

We can now use the linear regression model to predict future prices. For example, if we wanted to predict what the price of an apartment was in 2008, we could plug in 14 for t, since 2008 is 14 years after 1994. Plugging in 14 for t into the regression equation gives p=0.1264(14)+0.2229=1.9925. This means that if the trend continued, we can expect that the price of a two-bedroom apartment was around $1,992,500 in 2008.

You can also use the regression equation to check how closely the model matches the actual price in some years that were given on the table. For example, for 2000 the equation predicts a price of p=0.1264(6)+0.2229=0.9813, or $981,300. According to the table, the actual price was $950,000, so the regression equation is pretty close.

It is important to remember that the regression equation is just a model, and it won't give the exact values. If the equation is a good fit to the data, however, it will give a very good approximation, so it can be used to forecast what may happen in the future if the current trend continues.
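The two calculations above (the 2008 forecast and the check against the year 2000) can be reproduced with a short Python sketch. The function and variable names are our own assumptions for this example; the coefficients and the actual 2000 price come from the text.

```python
def predict_price(t):
    """Predicted average price in millions of dollars, t years after 1994."""
    return 0.1264 * t + 0.2229


price_2008 = predict_price(2008 - 1994)  # t = 14, outside the data range
price_2000 = predict_price(2000 - 1994)  # t = 6, a year that is in the table
actual_2000 = 0.95                       # actual price from the table

# How far the model is from the actual value, in millions of dollars.
difference_2000 = price_2000 - actual_2000

print(f"Predicted 2008 price: ${price_2008 * 1_000_000:,.0f}")
print(f"Predicted 2000 price: ${price_2000 * 1_000_000:,.0f}")
print(f"Difference from actual 2000 price: ${difference_2000 * 1_000_000:,.0f}")
```

The model overshoots the actual 2000 price by about $31,300, which is small compared with a price of nearly one million dollars.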

Next, let's take a quick look at how a regression equation is derived, and then take a look at what the correlation coefficient (or the r-squared value in Excel) tells us about the regression equation. Let's take another look at the data points and the regression line.

[Figure: scatter plot of the data points with the regression line; price p in millions of $ against time t in years since 1994]

Why does this particular line give the best fit for the data? Why not some other line? It has to do with what is called a residual. A residual is the difference between a particular data point and the regression line. If we zoom in on a particular data point, we can see what a residual is. Let's zoom in on this particular data point.

[Figure: zoomed-in view of one data point and the nearby part of the regression line]

We see the data point and the line. The vertical distance between the line and the data point is the residual. The idea behind linear regression is to keep the residuals as small as possible.

There is a method that allows us to minimize the sum of the squares of all of the residuals. This is called the least-squares method. You can read about it in the PDF for linear regression. Since these formulas can get fairly complicated, you will not be required to use them in the course. You will only need to know how to find a regression line using Excel. You can watch the video on how to do this, or read through the PDF, or both.
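To make the idea of "keeping the residuals small" concrete, here is a Python sketch that computes the residuals for the regression line and for one other candidate line, then compares the sums of the squared residuals. The comparison line (drawn through the first and last data points) is a hypothetical choice made only for this illustration, and the function names are our own.

```python
# Data from the apartment-price example (t in years since 1994, p in millions of $).
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def residuals(slope, intercept):
    """Vertical distance from each data point to the line p = slope*t + intercept."""
    return [p - (slope * t + intercept) for t, p in zip(t_values, p_values)]


def sum_of_squares(values):
    """Add up the squares of the given numbers."""
    return sum(v ** 2 for v in values)


# Residuals for the regression line from the text.
regression_residuals = residuals(0.1264, 0.2229)
# Hypothetical comparison: the line through (0, 0.38) and (10, 1.60).
other_residuals = residuals(0.122, 0.38)

sse_regression = sum_of_squares(regression_residuals)
sse_other = sum_of_squares(other_residuals)

print(f"Regression line: {sse_regression:.4f}")
print(f"Other line:      {sse_other:.4f}")
# Positive and negative residuals nearly cancel, which is why they are squared.
print(f"Plain sum of regression residuals: {sum(regression_residuals):.4f}")
```

The regression line gives the smaller sum of squared residuals. The plain sum of its residuals is close to zero because points above and below the line offset each other, so that sum alone would not be a useful measure of fit.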

Next, we look at what the correlation coefficient tells us about the regression equation. Recall that in our graph, a number was given, called the correlation coefficient, denoted by the letter r. The correlation coefficient tells us how closely the regression line fits the data points. It has a value between -1 and 1. A value very close to 1 indicates a very good fit with a positive-sloping linear function. A value very close to -1 indicates a very good fit with a negative-sloping linear function. A value very close to 0 indicates a very poor fit with the data, so there is little or no linear relationship between the variables in this case.
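For readers curious about how such a number could be calculated, here is a minimal Python sketch that computes the correlation coefficient for the apartment-price data using the standard formula. The function name correlation is our own choice, and you will not need to do this by hand in the course.

```python
import math

# Data from the apartment-price example (t in years since 1994, p in millions of $).
t_values = [0, 2, 4, 6, 8, 10]
p_values = [0.38, 0.40, 0.60, 0.95, 1.20, 1.60]


def correlation(xs, ys):
    """Return the correlation coefficient r of the paired values."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(t_values, p_values)
r_squared = r ** 2
print(f"r = {r:.2f}, r squared = {r_squared:.2f}")  # r = 0.97, r squared = 0.95
```

The value of r is close to 1, which matches the positive-sloping, nearly linear pattern in the scatter plot. Squaring r gives about 0.95, the value shown on the chart.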

Excel will not give the value of r; instead, it gives the value of r squared. The r-squared value basically tells us the same thing, but it will only be between 0 and 1. If the r-squared value is close to 1, there is a very good linear fit for the data points. If the r-squared value is close to 0, there is a very poor linear fit for the data points. We will now look at some examples of what it looks like with an r-squared value close to 1 and with an r-squared value close to 0.

Consider the following set of data points.

[Scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 8]

They follow a clear linear pattern, so we should expect the r-squared value to be close to 1. And it is: the line of best fit is f(x) = 0.51x + 1.94, with R² = 0.99.

Now consider the following set of data points.

[Scatter plot of the data points; horizontal axis from 0 to 12, vertical axis from 0 to 20]

These points seem to be scattered everywhere and don't follow any linear pattern. We expect the r-squared value to be close to 0. And it is: the line of best fit is f(x) = 0.18x + 8.33, with R² = 0.01.
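The data behind the two example charts are not listed in this presentation, so the Python sketch below uses two made-up (hypothetical) data sets to show the same contrast: one that follows a nearly straight line and one that is scattered with no linear trend. The function name r_squared and both data sets are assumptions made purely for illustration.

```python
def r_squared(xs, ys):
    """Return the r-squared value for the line of best fit through the points."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


xs = list(range(1, 11))

# Hypothetical data close to the line y = 0.5x + 2 (each point is off by 0.1).
nearly_linear = [2.4, 3.1, 3.4, 4.1, 4.4, 5.1, 5.4, 6.1, 6.4, 7.1]
# Hypothetical data that jump up and down with no overall rise or fall.
scattered = [5, 15, 3, 17, 9, 9, 17, 3, 15, 5]

r2_linear = r_squared(xs, nearly_linear)
r2_scattered = r_squared(xs, scattered)

print(f"Nearly linear data: r squared = {r2_linear:.2f}")
print(f"Scattered data:     r squared = {r2_scattered:.2f}")
```

The first value comes out close to 1 and the second close to 0, mirroring the two charts.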

So, to summarize, a linear regression equation is a line that most closely fits a given set of data points. The regression equation can be used to predict future values, or values that are outside of the given data range. We can find a regression equation for any set of data points, no matter how scattered the data look, but we can tell how closely the data follow a linear pattern by looking at the r-squared value. An r-squared value close to 1 indicates a very good fit to the given data, and an r-squared value close to zero indicates a very poor fit to the data.

The topic of linear regression is very deep, and we have only given a very brief introduction to it here. You can read more about it in the PDF given on the Assigned Reading for section 1.4. Be sure you also watch the video about how to find a linear regression in Excel! You can find the video link in the Assigned Reading for section 1.4.