Data Analysis for Industry & Education Brighton Webs Ltd.
statistical and data services for industry

Correlation Coefficient

The correlation coefficient is used to measure the strength of the relationship between two variables, say x and y.  It is typically referred to as r.  It is calculated using the formula:

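$$ r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \, \sum (y - \bar{y})^2}} $$

where \bar{x} and \bar{y} are the sample means of x and y.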

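The correlation coefficient indicates how strong the relationship between two variables is, but it does not quantify the relationship itself (for example, how much y changes for a given change in x), and on its own it does not show that one variable causes the other.  It can provide the basis for further analysis designed to determine a causal relationship.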

The nature of the relationship is indicated by the value of r:

Relationship         r
Linear               1.0
Inverse Linear       -1.0
Non-Linear/Random    0 < abs(r) < 1
No Relationship      0
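
The first two rows of the table can be checked directly. The following is a minimal Python sketch, assuming r is the usual product moment (Pearson) coefficient. The function name pearson_r and the five sample points are illustrative and are not part of the original page.

```python
import math


def pearson_r(xs, ys):
    """Product moment correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]                         # illustrative values only
r_linear = pearson_r(x, x)                  # y = x
r_inverse = pearson_r(x, [-v for v in x])   # y = -x
print(r_linear, r_inverse)
```

For these inputs r comes out as 1 and -1, to within floating-point rounding. The function fails with a division by zero if either variable is constant, since the coefficient is undefined in that case.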

From one to almost zero in three graphics

The graphics below illustrate the way that the correlation coefficient varies with the relationship between x and y for a sample of 15 points.

Case 1

If y = x, there is a linear relationship between the two variables.

Case 2

If y = x+random(-20,20), the relationship between the two variables is weaker.

Case 3

If x = random(1,100) and y=random(1,100), there is no relationship between the variables and the value of r is low.

A.K.A.

The correlation coefficient is sometimes referred to as the product moment correlation coefficient.

Significance Testing

If the correlation coefficient of the population is zero, the value of the function below for a sample of n points will follow a t distribution with n-2 degrees of freedom:

$$ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

This is shown in graphic form below with the values based on the sample data set at the 95% confidence level:

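Thus if the absolute value of r is greater than the critical value, that is, the value that chance alone would exceed only 5% of the time when the population correlation is zero, the relationship between x and y is unlikely to be due to chance.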

This relationship provides the basis for tables of the critical value of r at a given significance level. 
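
The link between t and the critical value of r can be made concrete. The sketch below assumes the test statistic has the standard form t = r*sqrt(n-2)/sqrt(1-r^2), which rearranges to r = t/sqrt(t^2 + n - 2). The function names are illustrative, and the critical t values (2.16 for 13 degrees of freedom, 2.78 for 4) are taken from the tables on this page rather than computed.

```python
import math


def t_statistic(r, n):
    """t value for a sample correlation r from n points (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def critical_r(t_crit, n):
    """Critical r for a given critical t, obtained by rearranging t_statistic."""
    return t_crit / math.sqrt(t_crit ** 2 + n - 2)


# Critical t values as quoted in the tables on this page.
r_crit_15 = critical_r(2.16, 15)   # 15 points, 13 degrees of freedom
r_crit_6 = critical_r(2.78, 6)     # 6 points, 4 degrees of freedom
print(round(r_crit_15, 2), round(r_crit_6, 2))   # 0.51 0.81

# t for a weak correlation in a sample of 15 points.
print(round(t_statistic(0.17, 15), 2))           # 0.62
```

Comparing r with its critical value and comparing t with its critical value are therefore the same test. Calling t_statistic with r equal to 1 or -1 raises a division by zero, which corresponds to an infinite t.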

The table below shows the value of t and r for each of the above cases:

Case   r       t          95% critical value for r   95% critical value for t   Significant
1      +1.00   infinite   0.51                       2.16                       Yes
2      +0.91   +7.31      0.51                       2.16                       Yes
3      +0.17   +0.62      0.51                       2.16                       No

In Case 1, where there is no random element, a significance test is irrelevant, as indicated by a t value of infinity.  In Case 2, the X/Y plot suggests there is a relationship between X and Y, and this is confirmed by the value of t, which is greater than the 5% critical value.  Finally, in Case 3, where there is no relationship between X and Y, t is less than the critical value.

Example

The real-world example is based on the price of flour versus its protein content.  The protein content was taken from the packet labels and the prices from a single supermarket, which stocked a range of products wide enough to provide a half-decent dataset.

The data is shown in tabular form below:

x (protein content)   y (price)
10.4                  0.71
10.3                  0.73
11.5                  0.83
12.8                  0.81
13.2                  0.86
13.9                  0.97

The results are presented in the same form as the contrived examples:

Case         r      t      5% critical value for r   5% critical value for t   Significant
Real-world   0.91   4.39   0.81                      2.78                      Yes

The dataset has 6 data points, which gives n-2 = 4 degrees of freedom.

As the value of r is greater than the critical value, we can say that the relationship between the price of flour and the protein content is significant at the 5% level, and that it is probable that protein content is one of the factors that influences the price of flour.
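
The result can be reproduced from the six data points. The following sketch assumes x is the protein content and y the price, and takes the 5% critical value of t (2.78) from the table rather than computing it. The variable names are illustrative.

```python
import math

x = [10.4, 10.3, 11.5, 12.8, 13.2, 13.9]
y = [0.71, 0.73, 0.83, 0.81, 0.86, 0.97]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of cross-products and squares of the deviations from the means.
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)
t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

T_CRIT = 2.78  # 5% critical value of t for 4 degrees of freedom, from the table
significant = abs(t) > T_CRIT

print(f"r = {r:.2f}, t = {t:.2f}, significant: {significant}")
```

The value of r rounds to 0.91, as in the table. Because t is computed here from the unrounded r, it comes out slightly higher than the 4.39 in the table, which is the value obtained from r = 0.91 exactly. The conclusion is the same either way.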

Discussion

Originally, the real-world example was selected because experience of shopping for flour suggested that there was a simple relationship between protein content and price.  However, once the data was collected it became clear, as is often the case, that several other factors had to be considered, and the dataset had to be filtered to remove the variance due to them.

The other factors were provenance (e.g. organic/non-organic) and brand (e.g. independent brands/supermarket own brands).

Page updated: 29-Feb-2008

 

For more information: info@brighton-webs.co.uk