One of the more interesting and important parts of studying two different sets of data is seeing whether they are correlated. That might make one wonder whether the order of the data matters. In this blog post, I show in three different ways - by an empirical example, by looking at the correlation function, and visually - why the order doesn’t matter so long as each data point stays paired with the same partner.
Empirical Examples of Correlation Order
I took the following table and checked the correlation of the first column to each of the other columns.
I got the correlation values of 1, 0.963142661, -0.289999765, 0.988187371, 0.774789519, and 0.775376284, respectively. I then sorted the entire table by the various columns, but the correlation never changed. However, when I sorted each column independently, the correlation did in fact change.
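The experiment above can be reproduced with a minimal sketch. The data here is hypothetical (randomly generated with a fixed seed), not the table from this post, and the helper name corr is my own; the sketch assumes NumPy is available.

```python
import numpy as np

# Hypothetical data, NOT the table from the post: y depends loosely on x.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

def corr(a, b):
    """Pearson correlation between two equal-length arrays."""
    return np.corrcoef(a, b)[0, 1]

original = corr(x, y)

# Sort the whole "table" by x: both columns are reordered with the
# same row order, so every (x, y) pair stays intact.
order = np.argsort(x)
sorted_together = corr(x[order], y[order])

# Sort each column independently: the pairs are broken up.
sorted_independently = corr(np.sort(x), np.sort(y))

print("original:            ", original)
print("sorted together:     ", sorted_together)
print("sorted independently:", sorted_independently)
```

Sorting the rows together leaves the correlation unchanged (up to floating-point rounding), while sorting each column on its own pairs the smallest x with the smallest y, the next smallest with the next smallest, and so on, which produces a different and typically much higher correlation.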
Let’s dig a little deeper as to why by looking into the correlation function.
The Correlation Function
The correlation of x and y is defined as the covariance of x and y divided by the product of the standard deviations of x and y:
correlation(x,y) = covariance(x,y) / (std(x) * std(y))
This means, as described on the Math Is Fun webpage, we only need to know a few things: the number of data points and the sums of x, y, x^2, y^2, and xy. Intuitively, and perhaps by experience, we know that the order in which we sum numbers does not affect the result. This follows from two basic properties of addition: it is commutative and associative.
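To make this concrete, here is a small sketch that computes the correlation from nothing but those sums and the number of points n. It uses the standard sum-based form of Pearson's correlation; the function name correlation_from_sums and the sample values are my own assumptions for illustration.

```python
import math

def correlation_from_sums(pairs):
    """Pearson correlation computed only from n and five sums."""
    n = len(pairs)
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    sum_y2 = sum(y * y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

# Hypothetical (x, y) pairs, not taken from the post.
pairs = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

# The same pairs listed in a different order.
reordered = [pairs[3], pairs[0], pairs[4], pairs[2], pairs[1]]

r_original = correlation_from_sums(pairs)
r_reordered = correlation_from_sums(reordered)
print(r_original, r_reordered)
```

Every quantity in the function is a sum over all the pairs, so listing the pairs in a different order cannot change any of them, and the correlation comes out the same.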
But why does changing the order of one column matter?
Because we also need to know the sum of xy - that is, it’s not so much the order of the column, but the pairings that matter. If we take this very basic example:
We see that reversing the order of y messes with the (x,y) pairings, making the results different:
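The same effect can be checked numerically. The values below are hypothetical, chosen only to keep the arithmetic easy to follow, and are not the example from this post; the sketch assumes NumPy is available.

```python
import numpy as np

# Hypothetical values for illustration only.
x = np.array([1, 2, 3])
y = np.array([2, 4, 7])
y_reversed = y[::-1]  # 7, 4, 2

# The sum of y is unchanged by reversing it...
sum_y = int(y.sum())
sum_y_reversed = int(y_reversed.sum())

# ...but the sum of xy depends on which y sits next to which x.
sum_xy = int((x * y).sum())                    # 1*2 + 2*4 + 3*7 = 31
sum_xy_reversed = int((x * y_reversed).sum())  # 1*7 + 2*4 + 3*2 = 21

r = np.corrcoef(x, y)[0, 1]
r_reversed = np.corrcoef(x, y_reversed)[0, 1]

print("sum of y: ", sum_y, sum_y_reversed)
print("sum of xy:", sum_xy, sum_xy_reversed)
print("correlation:", r, r_reversed)
```

Only the sum of xy changes, and that alone is enough to change the correlation. In this particular case the x values are evenly spaced, so reversing y simply flips the sign of the correlation.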
Visually and Conceptually Comparing Correlation Order
Finally, we look at what correlation is trying to tell us, and why order doesn’t matter so long as the variables remain paired with one another.
A correlation of 1 means that there is a perfect positive linear relationship between the variables (y rises as x rises), whereas a correlation of -1 means that there is a perfect negative linear relationship (y falls as x rises). Visually, this means if you plotted each pair of points in either case and connected the dots, then you would have a straight line.
The closer the correlation is to zero, the less the points resemble a line.
Here are the above examples plotted:
Plotted Correlation Examples
You can imagine that if x were sorted without regard to y, or vice versa, the graphs would look very different. However, it doesn’t matter which dot you draw first. The same is true for their correlations.