How to store large datasets?

In this post, I talk about how to store very large datasets on a hard drive. I also cover some semi-documented features of Matlab's storage format, the MAT file, and discuss HDF5 files, which can store terabytes of data (and more) in a single file.

I believe large data storage and management is one of today's major challenges, especially in biology. Many phenomena now require high-dimensional datasets. Engineers and researchers often come up with brilliant ideas for machines that can generate these datasets, but they often forget the most important part of all: data analysis and data storage.

In my own field, cameras acquire more and more data simultaneously, so we generate tons and tons of data in the blink of an eye. Still, most researchers use TIFF files to store their movies. TIFF is a standard format for storing raw images (compressed or not) that is limited to a maximum file size of 4 GB. ImageJ found a way around this limitation, but that is a temporary fix for a larger issue. I have already talked about several issues related to TIFF files, particularly in Matlab. I now believe it is time to thank TIFF for its services and move on to something better suited to our needs.

Funnily enough, the solution is right there, already available, and I want to convince you of its true power.

In a previous post, I introduced the idea of using trees, or hierarchy-based structures, to organize your dataset. At the time, I didn't mention that this type of organization is actually very common and that we are all used to it.

The best example of this is how files are organized on a hard drive:

When you access a file at /Folder1/Folder2/File.txt, you are basically navigating a tree. This is the reason why recursive functions are so useful when searching a hard drive.
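As a side note, here is a minimal sketch of what such a recursive search could look like in Matlab (the helper name listAllFiles is made up for the example):

    function files = listAllFiles(rootDir)
    % Walk the folder tree under rootDir and return every file it contains.
    entries = dir(rootDir);
    entries = entries(~ismember({entries.name},{'.','..'}));  % drop the '.' and '..' entries
    files = {};
    for k = 1:numel(entries)
        p = fullfile(rootDir, entries(k).name);
        if entries(k).isdir
            files = [files; listAllFiles(p)];   % branch of the tree: recurse into the sub-folder
        else
            files = [files; {p}];               % leaf of the tree: a file
        end
    end
    end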

If a tree is a nice way to organize any dataset, then why not use a tree to store variables of many kinds in a single file?

This is the idea that was developed at the National Center for Supercomputing Applications (NCSA) more than 20 years ago. Yes, that old. The Hierarchical Data Format (HDF) was born in 1987. The goal was to develop a file format that combines flexibility and efficiency to deal with extremely large datasets. Somehow HDF did not end up being used all over the place, as we didn't really need it; in my field, we were happy with TIFFs. But now datasets are getting so large that we are reconsidering our options, and HDF5 (the current version of HDF) is, I believe, one of the best alternatives.

Why?

  • Because variables can be as big as you want (I mean VERY BIG here, like TB).
  • You can store as many variables as you want in a file and organize them in a tree.
  • Variables of any type and size can coexist in the same file and are stored in binary form.
  • The format is open source and comes with libraries for C, C++, and Java.
  • It provides very fast read and write access to all variables and to subsections of arrays.

AND, maybe most IMPORTANTLY for us, Mathworks chose the HDF5 format to store the data in their MAT files (this is the version 7.3 flavor of the format). Yes, I am sure many of you were not aware of this. The MAT file is NOT a proprietary format: it is based on a standardized data format that you can access very easily OUTSIDE of Matlab.

I will add that if Mathworks, one of the leaders in technical computing, decided to use HDF5, then it is very likely that you should do the same!
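If you want to check this for yourself, here is a minimal sketch (assuming a Matlab version recent enough to have h5disp): save a variable with the '-v7.3' flag, then inspect the very same file with the generic HDF5 tools.

    X=rand(100,100);
    save('check.mat','X','-v7.3');   % '-v7.3' selects the HDF5-based flavor of the MAT file
    h5disp('check.mat');             % lists the content of the file as a plain HDF5 tree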

Alright, maybe you are convinced, maybe not. But I am sure you are wondering how to start using HDF5.

Here you have three options.

  1. Since the Matlab MAT file (in its '-v7.3' flavor) is essentially an HDF5 file, you can just use Matlab's save and load routines to store your large variables, whether they are organized in a tree or not. Note the '-v7.3' flag: it tells save to use the HDF5-based format, and it is required for the partial access shown in option 2.
    X=rand(1000,1000,100);
    save('test.mat','X','-v7.3');
    
  2. Matlab recently introduced a new function called matfile that gives you access to sub-parts of a variable. In essence, matfile uses the HDF5 capabilities to provide quick access to a sub-portion of a dataset located in a file.
    PointToMat=matfile('test.mat');
    Image=PointToMat.X(:,:,10);
    

    In the above code, you are accessing a single plane of the variable X directly from the file.
    The nice thing about this scheme is that you can now create matrices so large that they would not fit in your memory, like here:

    PointToMat=matfile('test.mat','Writable',true);
    PointToMat.Y(1000,1000,2500)=0;            % grows Y on disk to its full size, filled with zeros
    for i=1:2500
       PointToMat.Y(:,:,i)=rand(1000,1000);    % write one plane at a time
    end
    

    In the above code, I first initialize a 1000×1000×2500 matrix of zeros in test.mat. At double precision, this is 20 GB; on most modern computers (as of today) it will not fit in 16 GB of RAM. But because you are working on the hard drive, you can do this :-). I then fill it up with data, one plane at a time. It takes a while to run (come on, you are asking Matlab to generate quite a large matrix on the hard drive, what do you expect?). Check your file afterwards: it should be pretty big in the end (see the sketch below).
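    A minimal sketch of how to check the file and its variables without leaving Matlab (dir and whos are standard functions):

    whos(PointToMat)                                        % lists X and Y with their sizes, without loading them
    d = dir('test.mat');
    fprintf('test.mat is %.1f GB on disk\n', d.bytes/1e9);  % file size as reported by the operating system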

  3. Option 3 is for when you want to take full advantage of all of HDF5's capabilities. In that case you make direct calls to the HDF5 routines instead of going through the wrapper Mathworks provides for MAT files. That would be:
    fileinfo = h5info('test.mat');
    

    h5info gives you information on the actual datasets stored in the file; it actually gives you back the tree structure we were talking about.
    You can then use h5read and h5write to access your variables. For instance:

    X=h5read('test.mat','/X');
    

    You access X as if it were sitting in a file-system hierarchy. Here X is located at the root '/', but it could be at '/bla/bla/bla/X' if you wanted (see the sketch below).
    I would not try to read Y in its entirety if I were you, as you would be asking Matlab to hold 20 GB of data in memory, and that might slow down your computer a little bit…
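    As a minimal sketch of that hierarchy (the group and dataset names below are made up for the example), you can create a dataset deep inside the tree with h5create and then read back only one plane of it:

    h5create('tree.h5','/raw/session1/movie',[512 512 100]);               % dataset nested inside groups
    h5write('tree.h5','/raw/session1/movie',rand(512,512,100));            % fill the whole dataset
    plane10=h5read('tree.h5','/raw/session1/movie',[1 1 10],[512 512 1]);  % read plane 10 only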

If you need to optimize read/write speed when dealing with MAT/HDF5 files, I encourage you to read Yair's excellent post on that topic.

I am curious to know what you guys do to tackle this huge problem. How do you store large datasets?

 


10 Responses to How to store large datasets?

  1. Beatriz says:

    Hi Jerome
    Thanks for your post! What version of Matlab has the matfile and h5info functions? I have version 9 and I didn't find them
    :-)
    Beatriz

  2. jane says:

    Thank you :)

  3. nick says:

    Why not save your data in smaller, more manageable chunks (ones that aren't too big to load)? Creating new variables from parts of other, larger variables is much faster if the larger variable is already loaded in memory.

    For example:

    Using

    PointToMat=matfile('test.mat');
    Image1=PointToMat.X(:,:,10);
    Image2=PointToMat.X(:,:,11);
    Image3=PointToMat.X(:,:,12);

    is much slower (around 50% for me) than

    load test;
    Image1=X(:,:,10);
    Image2=X(:,:,11);
    Image3=X(:,:,12);

    and it would be even faster if the variable X was smaller.

    Just trying to figure out how this helps with speed. The description of these variables is also, as you mentioned, much like existing file systems. Is loading a variable (say Y) stored within a very large hierarchical file (say Q) faster than just loading Y from a folder (called Q) that contains all the variables Q holds? It sounds like it is simply the same thing as making folders and saving variables in them. Please let me know what I am missing. Also, more examples would be helpful to illustrate the utility of this.

    • Jerome says:

      Hi Nick,

      You can access parts of variables that cannot fit entirely into memory. The underlying HDF5 system is capable of that. I often use it for files that are >50 GB, sometimes even 1 TB. Loading the entire thing, as you propose with X, could be much slower if you only need to access one very specific sub-part.

      How big is your X variable in test.mat?

  4. nick says:

    I used your example, so X=rand(1000,1000,100);

    I understand that if a variable is already very large, then this is an excellent way (or the only way) to access parts of it. I see the utility. However, if a choice can be made between storing one large variable or several small ones, then I don't see why making one very large variable is worthwhile. The variable X in the example could be saved as 100 “images” instead of being kept together in X. Loading each image is then faster than accessing it through matfile.

    In the case that X is already one large variable, it can be parsed and saved as smaller ones. This would be worthwhile if each image is accessed many times.

    I often have to decide how large to build variables due to memory constraints. Currently, I am running into an issue in which I could utilize the potential of HDF5, or just save smaller chunks. However, I am not sold that there is any utility to having large variables if they can be avoided.

    • Jerome says:

      Hi Nick,

      I routinely use HDF5 with more than 10,000 “frames” of a raw movie. I don't know if you have tried, but 10,000 files in a single folder often gives Windows some trouble (at least in my hands).
      Also, you might want to try lower-level access to HDF5 (instead of matfile); it is way faster (again, in my hands). Storing 10,000 separate image files is a very inefficient way to store data: each image comes with tons of header data that is duplicated. The HDF5 format is designed for fast random access, so it should be as fast as parsing a single binary file, as long as you navigate along the “chunk” dimension (the dimension along which the data is split in the HDF5 file).

      In the end, when you want to do things very efficiently, it is going to be very system-dependent, and you have to dig deeper into the underlying implementation and your hardware's capabilities.

      • nick says:

        Well, I wouldn’t suggest 10,000. Maybe chunks of 100, or whatever plays nicely with your system.

        Also, h5read clocks in about the same on my Mac as matfile.

        What are your suggestions for lower level access? Can you post some code/links? Otherwise, very informative site. Thanks.

        • Jerome says:

          It is not the reading part of matfile that is slow. I have often noticed, while navigating MAT files created by Matlab (using h5info), that the chunk size was not set optimally. Using h5create lets you organize the HDF5 file in a much more optimal way for your data, especially for sub-access (see the sketch below).
          Matlab has no way to know how to set this chunk size properly for your access pattern.

          This is a good source:
          https://pytables.github.io/usersguide/optimization.html
          Although it is for Python, it still applies here.
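          A minimal sketch of what I mean, with made-up sizes: if you know you will always read one frame at a time, make one chunk correspond to exactly one frame when you create the dataset.

          h5create('movie.h5','/movie',[1000 1000 10000], ...
                   'Datatype','single','ChunkSize',[1000 1000 1]);                       % one chunk = one frame
          h5write('movie.h5','/movie',single(rand(1000,1000,5)),[1 1 1],[1000 1000 5]);  % write frames 1 to 5
          frame1=h5read('movie.h5','/movie',[1 1 1],[1000 1000 1]);                      % reading a frame touches a single chunk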
