
Least Squares Trendline and Correlation

Introduction

In the real world, we often gather data that is approximately, but not perfectly, described by a certain function or model. For example, you might have data that roughly follows a line but is a bit scattered. To find the line that best fits the data, we need to determine the correct slope and y-intercept, and it's not always easy to tell by eye what the best line will be. The "least squares" method helps us find it!

Least Squares Line

To find the best line, we want the data points to lie as close as possible to the line that we select. The most common way to do this is the "least squares" method, where we minimize the sum of the squares of the vertical distances from the points to the line. This method is illustrated in the diagram below, where we minimize the total area of the yellow boxes.

graph of line using least squares method

Minimizing the sum of the squares gives the following formulas for the slope m and the y-intercept b of the line of best fit, where n is the number of data points:

\(m=\frac{n(\sum {xy})-(\sum {x})(\sum {y})}{n(\sum {x^2})-(\sum {x})^2}\)   \(b=\frac{(\sum {x^2})(\sum {y})-(\sum {x})(\sum {xy})}{n(\sum {x^2})-(\sum {x})^2}\)
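The two formulas above translate almost directly into code. The following is a minimal sketch in Python, not part of the original lesson: the function name `least_squares_line` and the choice of exact `Fraction` arithmetic are assumptions made for illustration, and the sample points in the usage lines are made up.

```python
from fractions import Fraction

def least_squares_line(points):
    """Return (m, b) for the least squares line y = m*x + b.

    `points` is a list of (x, y) pairs. Fractions keep the arithmetic exact.
    """
    n = len(points)
    sum_x = sum(Fraction(x) for x, _ in points)
    sum_y = sum(Fraction(y) for _, y in points)
    sum_xy = sum(Fraction(x) * Fraction(y) for x, y in points)
    sum_x2 = sum(Fraction(x) ** 2 for x, _ in points)

    # Both formulas share the same denominator.
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x-values are equal; the slope is undefined")

    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    return m, b

# Made-up points lying exactly on y = 2x + 1, so the fit recovers that line.
m, b = least_squares_line([(0, 1), (1, 3), (2, 5)])
print(m, b)  # prints: 2 1
```

The shared denominator is zero only when every point has the same x-value, in which case no single slope fits best, so the sketch raises an error rather than dividing by zero.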


 Example: Consider the three data points: (2, 10), (3, 14), (4, 20). Find the least squares line for this set of data.

Solution: First, let's find the quantities that we need to substitute into the formulas.

\(n=3,\sum{x} = 9, \sum {y} = 44, \sum{xy} = 142, \sum{x^2} = 29\)
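As a quick check of this arithmetic, the sums can be recomputed in a few lines of Python. This is only an illustrative sketch using the points (2, 10), (3, 14), (4, 20); the variable names are arbitrary.

```python
points = [(2, 10), (3, 14), (4, 20)]

n = len(points)
sum_x = sum(x for x, y in points)        # 2 + 3 + 4
sum_y = sum(y for x, y in points)        # 10 + 14 + 20
sum_xy = sum(x * y for x, y in points)   # 20 + 42 + 80
sum_x2 = sum(x * x for x, y in points)   # 4 + 9 + 16

print(n, sum_x, sum_y, sum_xy, sum_x2)   # prints: 3 9 44 142 29
```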

At this point, we just have to substitute the values we found above into the formulas for the slope and the y-intercept:

\( \begin{align} m&=\frac{n(\sum {xy})-(\sum {x})(\sum {y})}{n(\sum {x^2})-(\sum {x})^2}\\
 &=\frac{3(142) - (9)(44)}{3(29) - (9)^2} \\
 &=\frac{30}{6} \\
&=5 \end{align} \)

\( \begin{align} b&=\frac{(\sum {x^2})(\sum {y})-(\sum {x})(\sum {xy})}{n(\sum {x^2})-(\sum {x})^2}\\
&=\frac{(29)(44) - (9)(142)}{3(29) - (9)^2} \\
&=\frac{-2}{6} \\
&=-\frac{1}{3} \end{align} \)

So, finally, our least squares line (i.e., the line of best fit) is

\(y=5x-\frac{1}{3}\)
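One way to convince yourself that this line really is the best one is to compute the sum of the squared vertical distances for it and for a few nearby lines. The sketch below does this for the points (2, 10), (3, 14), (4, 20); the helper name `sum_squared_residuals` and the step size of 1/10 are arbitrary choices made for illustration.

```python
from fractions import Fraction

points = [(2, 10), (3, 14), (4, 20)]

def sum_squared_residuals(m, b, pts):
    """Sum of squared vertical distances from the points to y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in pts)

# Slope and intercept of the least squares line found above.
m, b = Fraction(5), Fraction(-1, 3)
best = sum_squared_residuals(m, b, points)

# Nudge the slope and/or the intercept by a small step in each direction.
step = Fraction(1, 10)
nearby = [
    sum_squared_residuals(m + dm, b + db, points)
    for dm in (-step, 0, step)
    for db in (-step, 0, step)
    if (dm, db) != (0, 0)
]

print(best)                # prints: 2/3
print(best < min(nearby))  # prints: True
```

Every nudged line gives a larger total than 2/3, which is what we expect if the formulas have located the minimum.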

Correlation

To determine whether our data is well fit by a line, we define something called the coefficient of correlation, denoted r. Here's how to interpret it:

  • A value close to 1 means that the data is extremely well fit by a line with positive slope.
  • A value close to -1 means that the data is extremely well fit by a line with a negative slope.
  • A value close to 0 means that the data is not well fit by a line.

\(r=\frac{n(\sum {xy})-(\sum {x})(\sum {y})}{\sqrt{n(\sum {x^2})-(\sum {x})^2}\cdot\sqrt{n(\sum {y^2})-(\sum {y})^2}}\)
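The formula above and the three cases in the list can be tried out with a short Python sketch. The function name `correlation` and the three small data sets are made up for illustration; they are not taken from the lesson.

```python
import math

def correlation(points):
    """Coefficient of correlation r for a list of (x, y) pairs."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)

    # The two quantities under the square roots in the denominator.
    x_term = n * sum_x2 - sum_x ** 2
    y_term = n * sum_y2 - sum_y ** 2
    if x_term == 0 or y_term == 0:
        raise ValueError("r is undefined when all x-values or all y-values are equal")

    return (n * sum_xy - sum_x * sum_y) / (math.sqrt(x_term) * math.sqrt(y_term))

# Made-up data sets, one for each case in the list above.
rising = [(1, 2), (2, 4), (3, 6)]      # exactly on y = 2x
falling = [(1, 6), (2, 4), (3, 2)]     # exactly on y = -2x + 8
scattered = [(1, 1), (2, 3), (3, 1)]   # no linear trend

print(round(correlation(rising), 6))     # prints: 1.0
print(round(correlation(falling), 6))    # prints: -1.0
print(round(correlation(scattered), 6))  # prints: 0.0
```

The results are rounded because floating-point square roots can leave r a hair away from exactly 1 or -1. Note also that r has the same numerator as the slope m, which is why its sign matches the sign of the slope.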

Example: Let's again consider the three data points (2, 10), (3, 14), (4, 20), for which we found the least squares line in the example above. Find the coefficient of correlation.

Solution: We first need to find the quantities that we need to substitute into the formula (all of these were already found above except the last one).

\(n=3,\sum{x} = 9, \sum {y} = 44, \sum{xy} = 142, \sum{x^2} = 29, \sum{y^2} = 696\)

Now, we just go ahead and substitute everything into the formula:

\( \begin{align} r&=\frac{n(\sum {xy})-(\sum {x})(\sum {y})}{\sqrt{n(\sum {x^2})-(\sum {x})^2}\cdot\sqrt{n(\sum {y^2})-(\sum {y})^2}}\\
&=\frac{3(142) - (9)(44)}{\sqrt{3(29) - (9)^2}\cdot\sqrt{3(696) - (44)^2}} \\
&=\frac{30}{\sqrt{6}\cdot\sqrt{152}} \\
&\approx 0.9934 \end{align} \)

Since we obtain a value that is very close to 1, it appears that the data is well fit by a line with positive slope.