In my scientific field (neuroscience), Principal Component Analysis (PCA) is very trendy. Surprisingly, even though it is widely used, I have the impression that many people are scared of this analysis. I understand that: Principal Component Analysis does sound like a scary thing to do. It certainly sounds like advanced analysis.
Well, surprisingly again, PCA is ONLY two lines of code in Matlab. Yes, only 2, and using only good old built-in Matlab functions, without any toolbox.
These 2 lines of code are a little dense conceptually but nothing too fancy, so let’s embark on this adventure to demystify PCA!
First, as usual, we need a good example. Instead of picking something from neuroscience, I decided to take something that would speak to a larger audience and also be interesting: polls.
There is a presidential election very soon in France, so everybody is talking about these polls, and I thought we could see what PCA can tell us about them.
So first, let’s gather data. I went here: http://www.sondages-en-france.fr/
And I collected the results of all the polls since the beginning of the year.
I came up with this graph:
This graph simply shows the percentage for each candidate in the last 55 polls. You can see all the candidates go up or down with time. For instance, the red curve is Jean-Luc Melenchon, a left-wing candidate who has been rising steeply over the last 15 polls.
So I took this data and organized it into one single matrix called PollData.
In this matrix, each column is one candidate and each row is one poll, that is, the distribution of percentages across all candidates for that poll.
Now, in PCA, the first thing to do is to get the covariance matrix. Hold on, that is an easy one. The covariance matrix is just an extension of the variance. On the diagonal, it holds the variance of each variable (here the variance of the poll results for one candidate). The other elements are the covariances between pairs of variables, for example candidate 1 and candidate 2. If the value is strongly positive, they covary: they go up and down together. If the value is negative, they anti-covary: one goes up when the other goes down. If it is close to zero, they are not (linearly) correlated.
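To make this concrete, here is a minimal sketch of what a covariance matrix is under the hood. The random matrix is only a stand-in for the real poll data (55 polls, 9 candidates), and the names X, Xc and C are assumptions of mine:

```matlab
% Stand-in data: 55 polls (rows) x 9 candidates (columns).
% Random numbers, NOT the real polls; only here so the snippet runs.
X = 100 * rand(55, 9);
n = size(X, 1);

% Center each column: subtract each candidate's mean over all polls.
Xc = X - repmat(mean(X, 1), n, 1);

% Covariance matrix: averaged products of the centered columns.
% Dividing by n - 1 is the usual unbiased normalization.
C = (Xc' * Xc) / (n - 1);

% Diagonal = variance of each candidate, off-diagonal = covariances.
variances = diag(C);
```

The result is a 9 x 9 symmetric matrix: one row and one column per candidate.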
In Matlab, getting the covariance matrix is easy, just do:
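As a sketch, assuming PollData is laid out as described above, that line would look like this. The random matrix is only a placeholder so the snippet runs on its own, and the output name C is an assumption:

```matlab
% Placeholder for the real PollData (55 polls x 9 candidates).
PollData = 100 * rand(55, 9);

% cov treats each column as a variable and each row as an observation,
% so this returns a 9 x 9 candidate-by-candidate covariance matrix.
C = cov(PollData);
```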
This is line number 1 of the PCA.
You can actually plot this matrix as an image. It is sort of interesting. Here I get this:
This matrix shows how the candidates covary. Here it is clear that Nicolas Sarkozy (right wing) is anti-correlated with Francois Hollande (left wing). That makes sense. Even more interesting, Le Pen (extreme right) has a strongly negative covariance with Sarkozy (right). That makes sense again: they are fighting for the same voters.
Ok, that’s nice. If you are French, I am sure you are deeply enjoying this mix of mathematics and politics. But let’s suppose we ask the following:
What is really important in these polls? What are the most important variations in the data?
This is where PCA comes in handy.
PCA is a way to re-express the data along the directions in which it varies the most. To do so, it creates a new coordinate system whose axes are aligned with these directions of maximal variance.
But let’s just do it and you will see what I am talking about.
We are going to take our covariance matrix, and we are going to look for the eigenvectors and the eigenvalues of this matrix, like this:
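As a sketch of that call, assuming the covariance matrix from line 1 is stored in a variable called C (that name and the random placeholder data are assumptions, there only so the snippet runs on its own):

```matlab
% Placeholder for the real PollData (55 polls x 9 candidates).
PollData = 100 * rand(55, 9);
C = cov(PollData);          % line 1: covariance matrix

% Line 2: the 4 eigenvectors of C with the largest eigenvalues.
% V is 9 x 4 (one column per component), D is 4 x 4 and diagonal.
[V, D] = eigs(C, 4);
```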
That’s it! This is line 2. We have done PCA. Let’s make sense out of it.
So, what eigs does here is return the first 4 eigenvectors, that is, the 4 with the largest eigenvalues. Conceptually, it looks at the covariance matrix and finds the combination of all 9 candidates (a weighted sum, a sort of new synthetic candidate) that varies the most across polls. This is principal component 1. Then it finds the next combination that varies the most while being orthogonal to the previous one, and so on. Orthogonal means the two weight vectors have a zero dot product, so each component captures variation that the earlier ones did not. A candidate that is very strong on the first component will often have a weaker weight on the following ones, but orthogonality alone does not forbid it from showing up again.
The weights of these combinations are in V: each column is one component and each row is one candidate.
D is a diagonal matrix that gives you the variance of each of these new components: each eigenvalue is the variance of the data along the corresponding eigenvector.
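A quick way to check what D contains is to project the polls onto the components and measure their variance directly. This is only a sketch: the data is a random placeholder, and the names centered, scores, componentVariance, eigenvalues and explained are assumptions of mine:

```matlab
% Placeholder for the real PollData (55 polls x 9 candidates).
PollData = 100 * rand(55, 9);
C = cov(PollData);
[V, D] = eigs(C, 4);

% Project the centered polls onto the 4 components.
n = size(PollData, 1);
centered = PollData - repmat(mean(PollData, 1), n, 1);
scores = centered * V;               % 55 x 4

% Variance of each projected component vs. the eigenvalues in D.
componentVariance = var(scores);     % 1 x 4
eigenvalues = diag(D)';              % 1 x 4

% Share of the total variance carried by each component.
explained = eigenvalues / trace(C);
```

The two vectors componentVariance and eigenvalues should agree up to rounding, and explained gives the fraction of the total variation that each component accounts for.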
If we now plot V, we get the following image:
Now I am going to let you brush up on your French politics. But this is quite interesting.
This graph tells you, in the first column, that Principal Component 1 is strongly positive on Sarkozy and Jean-Luc Melenchon and strongly negative on Marine Le Pen.
In other words, the most important thing in all these polls is that Sarkozy and Melenchon move together while Le Pen moves the opposite way. The overall sign of a component is arbitrary, so it is the curves themselves that tell us the first two are rising and Le Pen is going down. This is Component 1.
The mathematics is telling us here that these 3 people are the variables that change the most in the polls. Hollande, even though he is so far the most likely winner of the election, is not part of this dynamic.