How to Calculate Sample and Population Correlation Coefficient

This article explains what the sample and population correlation coefficients are and how to calculate each of them.

Introduction

The correlation coefficient measures the strength and direction of a linear relationship between two variables. It can describe the relationship between two variables in a set of observations drawn from a larger group (a sample), or in the entire group from which those observations were selected (the population). The formula differs slightly depending on which one you are calculating, but the idea behind each calculation is the same. Let's look at the correlation coefficient in more detail.

The meaning of correlation coefficient

Put simply, correlation is a measure of how two variables change together. For example, height and weight are correlated: taller people tend to weigh more. The correlation coefficient measures how consistently the two variables move together along a straight line, not how steep that line is. If every extra inch of height came with the same fixed increase in weight, the correlation would be perfect, whether that increase were 1 pound or 10 pounds. The correlation coefficient ranges from -1 to +1: -1 is a perfect negative linear relationship between x and y (as x goes up, y goes down), +1 is a perfect positive linear relationship (as x goes up, y goes up), and 0 means there is no linear relationship.
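
To make the range from -1 to +1 concrete, here is a minimal Python sketch. The helper `pearson` and the height and weight numbers are made up for illustration; they are not data from this article. It shows that perfectly linear data gives +1 or -1 regardless of how steep the line is.

```python
import math

def pearson(xs, ys):
    """Correlation coefficient of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

heights = [60, 62, 64, 66, 68]                          # hypothetical heights (inches)
weights_steep = [100 + 5 * (h - 60) for h in heights]   # +5 lb per inch
weights_flat = [100 + 1 * (h - 60) for h in heights]    # +1 lb per inch
weights_down = [200 - 3 * (h - 60) for h in heights]    # -3 lb per inch

r_steep = pearson(heights, weights_steep)  # perfect positive relationship
r_flat = pearson(heights, weights_flat)    # also perfect positive, despite the gentler slope
r_down = pearson(heights, weights_down)    # perfect negative relationship

print(r_steep, r_flat, r_down)
```

The first two results are both +1 even though the slopes differ, and the third is -1.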

For a complete overview of Pearson's correlation coefficient, see this article: Correlation Coefficient in Machine Learning

Now you have an intuition for what correlation is, but what are population correlation and sample correlation? Let's see.

What are population and sample in statistics?

In statistics, the population is the complete collection of items being studied. It is usually a large group of people or things, but it can also be a small group. A sample is a subset of data taken from that population. A numerical value that summarizes a sample is called a statistic; a numerical value that summarizes the whole population is called a parameter.

Suppose we want to calculate the average height of the people in a country. Measuring everyone is not practical, so we take a sample instead. We can choose any number of people, as long as they are randomly selected from all over the country. For the selection to be random, people must not be chosen because they are related to or friends with each other: if you choose your best friend and their family, that is not a random sample. After choosing the sample, we collect the heights of its members. The average height of the sample can then be used to estimate the average height of the population. So the population is the entire set from which samples can be drawn. The average height of the population is a parameter, while the average height of the sample is a statistic that estimates it.
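
The idea of estimating a population average from a random sample can be sketched in Python. Everything here is an assumption for illustration: the population is simulated (100,000 heights in centimetres, roughly normal around 170), and the sample size of 500 is arbitrary.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated heights in cm (not real data).
population = [random.gauss(170, 10) for _ in range(100_000)]

# Simple random sample: every member has the same chance of being picked.
sample = random.sample(population, 500)

population_mean = sum(population) / len(population)  # a parameter
sample_mean = sum(sample) / len(sample)              # a statistic estimating it

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

With a random sample of this size, the sample mean typically lands within a centimetre or so of the population mean.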

Figure: Population vs. sample in statistics

Calculation of Sample Correlation Coefficient

How do we calculate the sample correlation coefficient? The sample correlation coefficient of two variables is their sample covariance divided by the product of their sample standard deviations. If that doesn't make much sense yet, let's look at the equations:

The equation for Correlation in relation to Covariance and Standard Deviation:

$$
r_{xy} = \frac{s_{xy}}{s_x \, s_y}
$$

where $s_{xy}$ is the sample covariance of x and y, and $s_x$ and $s_y$ are their sample standard deviations.

The equation for Sample Covariance:

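$$
s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

where $n$ is the number of observations in the sample and $\bar{x}$, $\bar{y}$ are the sample means.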

Now for the sample standard deviation, the equation will be:

$$
s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
s_y = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}
$$
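
The three definitions above translate almost directly into code. The following is a minimal Python sketch, not a library implementation; the function names `sample_covariance`, `sample_std`, and `sample_correlation` are our own, and the three-point data set is made up so the result is easy to check by hand.

```python
import math

def sample_covariance(xs, ys):
    """Sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_std(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

# Tiny made-up data set: covariance 0.5, both standard deviations 1.
r_small = sample_correlation([1, 2, 3], [1, 3, 2])
print(r_small)
```

For this toy data the covariance is 0.5 and both standard deviations are 1, so the correlation is 0.5.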

Example for Sample Correlation Coefficient

Consider some data points:

X: 17, 13, 15, 16, 6, 11, 14, 9, 7, 12
Y: 36, 46, 35, 24, 12, 18, 27, 22, 2, 8

Table for Calculation
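
The columns needed for the calculation (deviations from the mean, their products, and their squares) can be generated with a short Python sketch. The column layout is our own assumption about what such a table would contain; the x and y values are the ones listed above.

```python
x = [17, 13, 15, 16, 6, 11, 14, 9, 7, 12]
y = [36, 46, 35, 24, 12, 18, 27, 22, 2, 8]

n = len(x)
mean_x = sum(x) / n  # 12.0
mean_y = sum(y) / n  # 23.0

# One row per observation: x, y, dx, dy, dx*dy, dx^2, dy^2
rows = []
for xi, yi in zip(x, y):
    dx = xi - mean_x
    dy = yi - mean_y
    rows.append((xi, yi, dx, dy, dx * dy, dx ** 2, dy ** 2))

sum_dxdy = sum(row[4] for row in rows)  # sum of products of deviations
sum_dx2 = sum(row[5] for row in rows)   # sum of squared x deviations
sum_dy2 = sum(row[6] for row in rows)   # sum of squared y deviations

print(f"{'x':>4}{'y':>4}{'dx':>6}{'dy':>6}{'dx*dy':>8}{'dx^2':>7}{'dy^2':>7}")
for xi, yi, dx, dy, prod, dx2, dy2 in rows:
    print(f"{xi:>4}{yi:>4}{dx:>6.0f}{dy:>6.0f}{prod:>8.0f}{dx2:>7.0f}{dy2:>7.0f}")
print(f"sums: dx*dy = {sum_dxdy:.0f}, dx^2 = {sum_dx2:.0f}, dy^2 = {sum_dy2:.0f}")
```

The three sums (315, 126, and 1672) are the only totals the covariance and standard deviation formulas need.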

Let's calculate the Sample Covariance of x and y:
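
Working from the data above, the means are $\bar{x} = 12$ and $\bar{y} = 23$, and the products of the paired deviations sum to 315. Dividing by $n - 1 = 9$ gives the sample covariance (written here as $s_{xy}$, our notation):

$$
s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{315}{9} = 35
$$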

Sample Standard Deviation for x:
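
With $\bar{x} = 12$, the squared deviations of the x values sum to 126, so (writing the sample standard deviation as $s_x$, our notation):

$$
s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{126}{9}} = \sqrt{14} \approx 3.7417
$$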

Sample Standard Deviation for y:
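
With $\bar{y} = 23$, the squared deviations of the y values sum to 1672, so (writing the sample standard deviation as $s_y$, our notation):

$$
s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}} = \sqrt{\frac{1672}{9}} \approx 13.6300
$$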

Let's plug everything into the equation of Correlation:
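
Using the sample covariance $315/9 = 35$ and the sample standard deviations $\sqrt{14} \approx 3.7417$ and $\sqrt{1672/9} \approx 13.6300$ computed from the data above:

$$
r_{xy} = \frac{s_{xy}}{s_x \, s_y} = \frac{35}{3.7417 \times 13.6300} \approx 0.6863
$$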


So the sample correlation coefficient is approximately 0.6863. It is positive and fairly close to 1, which indicates a moderately strong positive linear relationship between x and y.

Calculation of Population Correlation Coefficient

Now let's calculate the population correlation coefficient the same way we calculated the sample correlation coefficient. The difference is that the divisor n - 1 (where n is the number of observations in the sample) is replaced by N (the total number of items in the population). The population correlation coefficient of two variables is their population covariance divided by the product of their population standard deviations.

Population Covariance of x and y is given by:

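$$
\sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)
$$

where $N$ is the size of the population and $\mu_x$, $\mu_y$ are the population means.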

Population Standard Deviation of x and y is given by:

$$
\sigma_x = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)^2}, \qquad
\sigma_y = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu_y)^2}
$$
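
The only change from the sample formulas is the divisor, which a small Python sketch can make explicit. The function `correlation` and its `population` flag are hypothetical names chosen for this illustration, and the three-point data set is made up.

```python
import math

def correlation(xs, ys, population=False):
    """Covariance over the product of standard deviations.

    Divides by N when population=True, otherwise by n - 1.
    """
    n = len(xs)
    divisor = n if population else n - 1
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / divisor
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / divisor)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / divisor)
    return cov / (sd_x * sd_y)

xs = [1, 2, 3]  # made-up data
ys = [1, 3, 2]

r_sample = correlation(xs, ys)                         # divisor n - 1
rho_population = correlation(xs, ys, population=True)  # divisor N

print(r_sample, rho_population)
```

Both calls give 0.5 up to floating-point rounding. The divisor appears once in the covariance and once (under two square roots) in the product of standard deviations, so it cancels in the ratio when the same data is used.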

Example for Population Correlation Coefficient

Let's use the same x and y values as before, this time treating them as the entire population (N = 10).

Calculating the Population Covariance for x and y,
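
For these ten values the means are $\mu_x = 12$ and $\mu_y = 23$, and the products of the paired deviations sum to 315. Dividing by $N = 10$ (with $\sigma_{xy}$ as our notation for the population covariance):

$$
\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} = \frac{315}{10} = 31.5
$$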

Population Standard Deviation for x:
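
With $\mu_x = 12$, the squared deviations of the x values sum to 126, so (with $\sigma_x$ as our notation):

$$
\sigma_x = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu_x)^2}{N}} = \sqrt{\frac{126}{10}} = \sqrt{12.6} \approx 3.5496
$$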



Population Standard Deviation for y:
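
With $\mu_y = 23$, the squared deviations of the y values sum to 1672, so (with $\sigma_y$ as our notation):

$$
\sigma_y = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \mu_y)^2}{N}} = \sqrt{\frac{1672}{10}} = \sqrt{167.2} \approx 12.9306
$$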


When plugging everything into the equation of the Population Correlation Coefficient, we'll get:
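
Using the population covariance $315/10 = 31.5$ and the population standard deviations $\sqrt{12.6} \approx 3.5496$ and $\sqrt{167.2} \approx 12.9306$ for this data (with $\rho_{xy}$ as our notation for the population correlation coefficient):

$$
\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \, \sigma_y} = \frac{31.5}{3.5496 \times 12.9306} \approx 0.6863
$$

This matches the sample correlation coefficient for the same data. The divisor ($N$ here, $n - 1$ in the sample case) appears in both the numerator and the denominator of the ratio, so it cancels: both versions reduce to $315 / \sqrt{126 \times 1672}$.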