How to organize large datasets?

Here I present ideas for organizing large datasets in Matlab. This is not meant to be a definitive answer, as this is a complicated subject that depends a lot on your particular application. I propose a tree organization and demonstrate how to achieve it in Matlab with native objects. I also think this post is interesting for its advanced usage of structures and cells. This is also the first time I introduce recursive functions.

I have run into this problem several times. Organizing your data in a convenient way is not easy, but it matters, because this choice will affect all of your programs. Depending on how you set it up, your coding can be a pain or relatively easy.

The inherent problem is that each experiment, recording, trial, … has some specific details that you want to keep with the data, even if everything else is the same. Another issue is that trials can have different numbers of data points or different sampling frequencies. And so on. I am sure many of you have faced this problem.

Arrays are not well suited to this problem: if your recordings have different lengths, you can’t aggregate all of your data points in a single matrix.
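
To make the limitation concrete, here is a minimal sketch with three made-up recordings of different lengths (the variable names and sizes are only for illustration). A matrix cannot hold them side by side, while a cell array can.

```matlab
% Three hypothetical recordings of different lengths
rec1 = rand(1000,1);
rec2 = rand(300,1);
rec3 = rand(750,1);

% A matrix needs columns of equal length, so [rec1 rec2 rec3] would
% raise a dimension mismatch error. A cell array stores each
% recording in its own cell instead.
recordings = {rec1, rec2, rec3};

% Number of data points in each recording
lengths = cellfun(@numel, recordings);   % 1000 300 750
```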

I believe one elegant solution is a tree. When you record or accumulate data, you inherently create a tree with multiple branches. One experimental day is a branch and each “leaf” is an individual recording. Within a single branch you can also perform multiple types of recording, leading to sub-branches of your branch.

Another reason why I believe this is a good solution is how we store data. A file and folder architecture is also a tree, so any data organization that mimics it can easily be loaded from and saved to the hard drive. It is also intuitive, as we are used to organizing hard drives this way.

Some folks use complicated databases to achieve this. I believe a simpler solution based on a mixture of cells and structures can do the job. I think simplicity is something we should really enforce here.

Let’s assume you have two days of experiments with 50 time traces recorded on each day. Here is how I would organize this in this scheme:

DataTree{1}.data='06/21/12';
DataTree{1}.type='day';
DataTree{2}.data='06/22/12';
DataTree{2}.type='day';

% Filling recording on day1
for i=1:50
   DataTree{1}.branch{i}.data=rand(1000,1);
   DataTree{1}.branch{i}.type='traces';
   DataTree{1}.branch{i}.sampling=0.1; % for 0.1 Hz for instance
end

% Filling recording on day2
for i=1:50
   DataTree{2}.branch{i}.data=rand(1000,1);
   DataTree{2}.branch{i}.type='traces';
   DataTree{2}.branch{i}.sampling=0.1; % for 0.1 Hz for instance
end

As you can see, I first make a cell using { } and fill its elements with structures. Then, for each structure, I have a field named branch that contains a cell again. And the syntax repeats itself as deeply as you need.

Each field name is indicative of some relevant information. So you can decide to have multiple fields for any structure.

Using both objects is advantageous: the structures are independent of each other, so they can have different field names, while the cells are where repetition comes into play. The advantage of this approach is that you can do whatever you want. You can add fields at your convenience. It keeps this freedom while allowing you to grow to very large datasets.
Another important point is that the objects don’t all need to be contiguous in memory. Each cell stores its content separately, so adding more objects doesn’t require copying the existing data into one larger block.
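
Here is a small sketch of that freedom, using a made-up tree with three recordings. The field name note and its content are hypothetical; they only illustrate that one structure can carry a field the others lack. For comparison, the elements of a regular structure array all share the same fields.

```matlab
% Small demo tree: one day holding three recordings (arbitrary values)
tree{1}.type = 'day';
tree{1}.data = '06/21/12';
for i = 1:3
   tree{1}.branch{i}.type = 'traces';
   tree{1}.branch{i}.data = rand(100,1);
end

% Add a field to a single recording only ('note' is a hypothetical field)
tree{1}.branch{2}.note = 'electrode moved';

% Check which recordings carry the extra field
hasNote = cellfun(@(n) isfield(n,'note'), tree{1}.branch);   % 0 1 0

% Append a fourth recording with a different length and its own fields
tree{1}.branch{end+1} = struct('type','traces','data',rand(250,1),'sampling',0.3);
nRecordings = numel(tree{1}.branch);                          % 4
```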

Let’s assume now that you have an entire tree of data stored this way, with branches sitting at different depths in the tree. How do you retrieve the relevant piece of data from such a large and complicated architecture?

Browsing through such a tree looks, at first, like quite a complicated task: you would need multiple nested for loops that search for a particular data type. However, as with browsing through a hard drive, there is a very elegant solution to this problem.

First, I would decide on some fixed fields like ‘type’ and ‘branch’; this way you help your searching algorithm a little.

Then, you should use recursive functions (functions that call themselves). I wrote a dedicated post about them, but an example of this technique is also provided here:

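function FoundData=RecursiveFunction(FoundData,CurrentNode)
% Collect the data of every 'traces' node into the cell array FoundData
for i=1:numel(CurrentNode)
   if strcmp(CurrentNode{i}.type,'traces')
      % Append this trace as a new cell (traces may differ in length)
      FoundData{end+1}=CurrentNode{i}.data;
   elseif isfield(CurrentNode{i},'branch')
      % Not a trace: search the sub-branches of this node
      FoundData=RecursiveFunction(FoundData,CurrentNode{i}.branch);
   end
end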

As you can see, this function calls itself until there are no more branches. This way you can position the relevant data (here the type ‘traces’) at any depth in the tree. The function will locate it and aggregate in FoundData all the data fields with ‘traces’ as type.

The way to call this function is:

FoundData=RecursiveFunction({},DataTree)
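
Assuming the call returns a cell array with one trace per cell, here is a short sketch of what you could do with the result. The FoundData below is a stand-in filled with random values, not the output of a real search.

```matlab
% Stand-in for the search result: one trace per cell (hypothetical values)
FoundData = {rand(1000,1), rand(1000,1), rand(300,1)};

nTraces = numel(FoundData);               % 3
lengths = cellfun(@numel, FoundData);     % 1000 1000 300
means   = cellfun(@mean, FoundData);      % one mean value per trace

% Traces of equal length can be gathered into a matrix, one column per trace
sameLen = (lengths == 1000);
M = [FoundData{sameLen}];                 % 1000-by-2 matrix
```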

You can easily modify this searching function to search for branches that are sub-branches of a particular type. For example you could have a branch with a property that specifies some experimental conditions and limit the data collection to some chosen values of this experimental condition.

Here is an example of such a DataTree that would work as well:

DataTree{1}.data='06/21/12';
DataTree{1}.type='day';

DataTree{1}.branch{1}.type='baseline';
% Filling recording on day1 - baseline
for i=1:50
   DataTree{1}.branch{1}.branch{i}.data=rand(1000,1);
   DataTree{1}.branch{1}.branch{i}.type='traces';
   DataTree{1}.branch{1}.branch{i}.sampling=0.1; % for 0.1 Hz for instance
end

DataTree{1}.branch{2}.type='stimulation';
% Filling recording on day1 - stimulation
for i=1:50
   DataTree{1}.branch{2}.branch{i}.data=rand(300,1);
   DataTree{1}.branch{2}.branch{i}.type='traces';
   DataTree{1}.branch{2}.branch{i}.sampling=0.3; % for 0.3 Hz for instance
end
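
Restricting the search to one experimental condition could look like the sketch below. This is only one possible variant, not a definitive implementation: the function name collectUnder and its arguments wantedType and inside are my own hypothetical additions, and the demo tree is deliberately small. The local function sits at the end of the script, which requires Matlab R2016b or later; otherwise save it in its own file.

```matlab
% Small demo tree: one day with two conditions (sizes are arbitrary)
tree{1}.type = 'day';
tree{1}.data = '06/21/12';
tree{1}.branch{1}.type = 'baseline';
tree{1}.branch{2}.type = 'stimulation';
for i = 1:5
   tree{1}.branch{1}.branch{i}.type = 'traces';
   tree{1}.branch{1}.branch{i}.data = rand(1000,1);
end
for i = 1:3
   tree{1}.branch{2}.branch{i}.type = 'traces';
   tree{1}.branch{2}.branch{i}.data = rand(300,1);
end

% Collect only the traces located below a 'stimulation' branch
stim  = collectUnder({}, tree, 'stimulation', false);
nStim = numel(stim);   % 3

function found = collectUnder(found, nodes, wantedType, inside)
% collectUnder (hypothetical name): gather 'traces' leaves that sit
% below a branch whose type equals wantedType.
for i = 1:numel(nodes)
   node = nodes{i};
   % We are "inside" once this node or any ancestor has the wanted type
   nowInside = inside || strcmp(node.type, wantedType);
   if strcmp(node.type, 'traces')
      if nowInside
         found{end+1} = node.data;
      end
   elseif isfield(node, 'branch')
      found = collectUnder(found, node.branch, wantedType, nowInside);
   end
end
end
```

The only change from the plain search is the extra flag that is passed down the recursion, so the same pattern works for any other condition field you decide to store on a branch.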

The beauty of this system is that you can keep your search algorithm constant. It will work whatever the depth of your tree, so you are more flexible in the way you organize your data. For example, if one day you run your experiment with more conditions, you can add a branch. Or if your days are grouped as well, you can make several branches for the days.
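
As a quick check of that claim, the sketch below runs the same search on a shallow tree and on the same data wrapped in one extra level. The ‘week’ type and the name gatherTraces are hypothetical; gatherTraces simply restates the recursive search so that the example is self-contained (local functions in a script need Matlab R2016b or later).

```matlab
% Shallow tree: one day holding four traces (arbitrary values)
shallow{1}.type = 'day';
for i = 1:4
   shallow{1}.branch{i}.type = 'traces';
   shallow{1}.branch{i}.data = rand(100,1);
end

% Deeper tree: the same day wrapped in a hypothetical 'week' level
deep{1}.type = 'week';
deep{1}.branch = shallow;

a = gatherTraces({}, shallow);
b = gatherTraces({}, deep);
sameResult = isequal(a, b);   % true: the same traces are found at either depth

function found = gatherTraces(found, nodes)
% gatherTraces (hypothetical name): recursive search for 'traces' nodes
for i = 1:numel(nodes)
   if strcmp(nodes{i}.type, 'traces')
      found{end+1} = nodes{i}.data;
   elseif isfield(nodes{i}, 'branch')
      found = gatherTraces(found, nodes{i}.branch);
   end
end
end
```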


4 Responses to How to organize large datasets?

  1. travis says:

    Hey, really like the blog!

    Don’t you run into problems with size very quickly? I, too, am storing neural data, and I don’t think I could store multiple days in this fashion.

    And structures can’t be indexed using matfile with the HDF5 settings, so I would have to load this structure whenever I want to use it, right?

  2. Jerome says:

    Hi Travis,

    Yes, for storing very large things, using Matlab structures and cells might not be the most efficient way.

    You can, however, apply the same idea with pure HDF5 load and save calls. I would not use Matlab’s load and save functions for that, but rather call h5create, h5read and h5write directly in Matlab.
    You can access sub-directories directly within the hdf5 file this way.

    I talked about these issues in a recent blog post on Inscopix’s blog :
    https://www.inscopix.com/blog/perspectives-calcium-movie-storage-and-call-new-standard-hdf5

  3. Nicolas says:

    Hello !

    Thank you very much for your work, it is very helpful to me.

    While looking for the best way to store my datasets, I found the “tree” class : http://www.mathworks.com/matlabcentral/fileexchange/35623-tree-data-structure-as-a-matlab-class

    Do you know it and, if so, what are the pros and cons of the tree class vs your method ?

    Thanks again
