This post has been on the back of my mind for quite some time, ever since I wrote about Principal Component Analysis. Independent Component Analysis or ICA is an algorithm to extract underlying structure hidden in multi-dimensional datasets. As for PCA, It is fairly ubiquitous (including in Neuroscience). However, as opposed to PCA, getting an intuitive understanding of ICA is much harder and many persons use it as a magic black box (no offense, I have been there too). I won’t provide comments on any code as fastICA is already freely available in Matlab. My hope here is to facilitate the understanding of what ICA is and how it works.
Before we dive into more advanced mathematical things, I would like to give a few words of historical background.
According to wikipedia, PCA is a fairly old concept invented by Karl Pearson (yes, the same one that introduced the pearson coefficient) at the very beginning of the 20th century. Independent Component Analysis came much later and is in fact quite contemporary (at least to my generation) as it was pioneered in the 80s-90s in France by Christian Jutten and now retired Jeanny Herault. I am very excited to tell you that the inspiration for ICA actually came from Neuroscience.
Later on, two modern theoretical neuroscientists, Tony Bell and Terry Sejnowsky made an important milestones for the development of ICA by publishing a fast and efficient implementation. It wasn’t long before another variant was released and in 1997, Aapo Hyvärinen and Erkki Oja (two mathematicians who also wrote a lot about neuroscience : have you heard about the Oja learning rule?) published in 1997 a paper about fastICA. I believe it is one of the most used implementation of ICA today.
I would like everyone to grasp what is going on so don’t feel offended if you are already strong in statistics and everything is basic to you. Just jump some parts. For all the others hang there with me: As an algorithm, ICA is just pure beauty.
Let’s start with the beginning. I suppose a good start is to go over that first word : Independent.
- What does independence means?
The most classical example given with ICA is the famous cocktail party problem. Suppose you get into a room where 2 persons are talking. Let’s also suppose that they are not talking to each other but are talking about completely different topics, not even noticing each other conversation. This is a good example of independence.
The formal definition of independence refer to probabilities.
If we assume the voice amplitude are X1 for the person 1 and X2 for the person 2 than you can actually calculate the probability p(X1) which denotes the probability of recording a particular amplitude X1 at any moment in time. You can estimate this probability by recording the person 1 for a long time and looking at the distribution of all amplitudes.
Obviously you can do the same thing for the person 2 and obtain a probability p(X2).
Now here is the trick. Assuming you can record X1 and X2 simultaneously, you could also estimate the probability p(X1,X2) which measure the chance you will have to find a particular combination of X1 and X2 while the two speaker are talking at the same time in the room. In statistics, if the speakers are truly independent, then :
p(X1,X2)=p(X1)p(X2) or the multiplication of both probabilities.
If you think about it for a few seconds, I am sure it will make total sense to you.
In other words, if speaker 1 and speaker 2 are not independent, than you cannot use the equation above to estimate the probability of p(X1,X2) : you have to measure it directly.
When does this happen? Well, pretty easily, if speaker 1 actually start a direct conversation with speaker 2, they will take turn talking and the independence will be lost.
At this point, I hope everything is clear. Very often a strong confusion occurs when you start thinking about other ways to measure dependencies between variables, the most used being correlation or covariance.
- Indeed what is the difference between independence and uncorrelation (or the absence of correlation)?
When two variables X1 and X2 are not independent, there are many ways to quantify their dependency. Covariance, Pearson correlation or Mutual information are all different ways to quantify that particular dependency. Covariance or correlation typically quantified linear relationships well while mutual information will performed better with non-linear dependency.
When two variables are uncorrelated, it simply means that one particular measure of dependency is zero, it does NOT MEAN that the variables are independent. In other words, uncorrelation does NOT involve independence BUT independence DOES involve uncorrelation. You can indeed create examples of variables that have zero correlation but are very much depending on each other. It typically occurs when their dependencies is non-linear. I will let you consult wikipedia for an example of such variables.
Now that you know what independence means, let’s look at the second word : component.
“Independent Component Analysis” clearly seems to indicate that this algorithm is trying to extract hidden variables or components that are buried within the data. It is also trying to extract these by maximizing their independence. This is the KEY difference with Principal Component Analysis or PCA. What PCA does is to extract hidden variables that are uncorrelated, NOT independent. In other words, components that are extracted via ICA need to fulfill much more strict conditions.
Obviously the million dollar question is : How to do that?
I am going to explain how fastICA does this. Keep in mind that there are other ways to achieve such a feat as the solution to this problem is typically not unique.
But first, we need to make another statistical detour: A mathematical theorem that is absolutely “central” to ICA is the Central Limit Theorem (I will refer to it as CLT).
Let’s suppose you have a large number of independent variables. For instance, you are in a very large room or a bar with N people talking independently. Each one of these i person has a probability distribution p(Xi). What the CLT says is that WHATEVER this p(Xi) is (in other words, the person can speak in any ways it pleases him/her), the overall sum of all Xi (in other words, what you hear from this brouhaha) will ALWAYS converge toward a gaussian distribution for sufficiently large N. I am not going to do the mathematical demonstration of this theorem (please go here for that), but here is the real TRICK of fastICA:
Let’s assume you have 2 microphones in the room and you want to discover 2 independent components (or the precise conversation of 2 persons) that are hidden in this brouhaha of sound where N>>2. You know for certain that :
- Each persons is talking independently from the others
- Each person does not talk along a gaussian distribution
- Their conversations are mixed linearly
In other words, what you are looking for is a linear combination of your 2 microphones that will subtract the noise from each other and extract the independent components.
Because of the CLT, you should choose a set of weight that maximize the Non-Gaussianity of the new variables. Indeed the CLT states that adding more variables makes you converge toward gaussian distributions, therefore removing the influence of other variables (or speakers) necessarily makes you converge toward the non-gaussian original variable that is one independent speaker.
Please make sure you understand the last sentence, this is the absolute key to ICA.
In other words, you need to sum your microphones inputs so as to maximize the non-gaussianity of the newly created “microphone”.
Given this general principle established, I am sure you realize that the number of possible combinations is still challenging. For a long time, ICA was recognized as a powerful algorithm but a fast implementation was lacking because of the overall amount of possibilities. To circumvent that problem, fastICA uses 2 important tricks :
- Since you know that the final results need to be made of independent variables, why not start the search from decorrelated variables? Given that is what PCA does, it makes total sense to first extract the principal components from the data, which are decorrelated. ICA next needs to add this extra step that will provide independence on top of decorrelation using this principle extracted from the CLT.
- There is one thing that will never be possible to recover: It is the absolute amplitude of each component. The reason for that lies in this unmixing matrix. For any solution of amplitude A and unmixing matrix M, there is an infinite amount of perfectly equivalent solutions of amplitude α A and unmixing matrix 1 / α * M . Because of that, it makes total sense to restrict the search to solutions that have a unitary amplitude A=1. FastICA therefore restrict its search to unmixing matrix that are orthonormal.
Both of these tricks are typically encapsulated into the term “whitening“. It just means that the inputs data is preprocessed to extract decorrelated components of norm 1.
We are now left with the final important question : How does fastICA process this whitened inputs to extract independent components using this CLM trick?
Well, as surprisingly as it sounds, ICA actually starts from a random unmixing matrix and iteratively improves it. To improve this matrix, it obviously needs a measure a non-gaussianity. The goal is to make sure that at each iteration the amount of non-gaussianity increases. As it turns out, there are many ways to measure the non-gaussianity of a variable. Even if this is one of the important parameters available in the fastICA package, you will almost surely realized that most of them will work quite well. The reason for this is that we actually don’t care exactly how the non-gaussianity is measured as long as it increases at each iteration. Some functions will provide faster convergence than others. Typically, these functions are non-linear functions of the input variable.
Given all that, the mathematical core of fastICA is to convert the search for non-gaussianity into a search for a fixed point. If you lay down the equations that each local maxima of non-gaussianity must respect, you end up with an equation of the form f(X)=X. This is why the original paper is stated “Fast and Robust Fixed-Point Algorithms for Independent Component Analysis”. I am not going to dive into the details here but if you are interested at this point, I invite you to consult the original papers.
At each iteration, fastICA uses this equation to update its unmixing matrix. By design, the non-gaussianity necessarily increases. fastICA measures the changes of each individual component and stop the search if the changes is smaller than a threshold, or if it assumes that it has reached convergence.
I want to discuss 2 very important side-effect of this search.
- Technically, the variable you want to extract should be independent. But in practice, you can actually tolerate a large amount of dependency. The reason for this lies in the way ICA works : it is trying to maximize non-gaussianity. In other words, if the variable are not fully independent, maximizing non-gaussinity of each of them still allow to separate them quite well.
- fastICA starts from a random seed. Typically there are multiple solutions to the unmixing problem (or local maxima of non-gaussianity). You should carefully evaluate the convergence of the algorithm to make sure the solution is valid.
I wanted to dive into why I think ICA is also fundamental to how the brain works but I will keep this for another post.
Please comment if anything is not clear of if you want further details!