In this post, I talk about how to store very, very large datasets on a hard drive. I also cover some semi-documented features of Matlab's storage format, the MAT file, and discuss HDF5 files, which can store terabytes of data (and more) in a single file.
I believe large data storage and management is one of today's major challenges, especially in biology. Many phenomena now require high-dimensional datasets. Engineers and researchers often come up with brilliant ideas for machines that can generate these datasets, but they often forget the most important part of all: data analysis and data storage.
In my own field, cameras acquire more and more data simultaneously, so we are generating tons and tons of data in the blink of an eye. Still, most researchers use TIFF files to store their movies. TIFF is a standard format for storing raw images (compressed or not), but a single file is limited to a maximum size of 4 GB. Somehow ImageJ found a way around this limitation, but this is a temporary fix for a larger issue. I already talked about several issues related to TIFF files, particularly in Matlab. I now believe it is time to thank TIFF for its services and move on to something else, better suited to our needs.
Funnily enough, the solution is right there, already available, and I want to convince you of its true power.
In a previous post, I introduced the idea of using trees, or hierarchy-based structures, to organize your dataset. At that point, I didn’t mention that this type of organization is actually very, very common and that we are all used to it.
The best example of this is how files are organized on the hard drive :
When you access a file at /Folder1/Folder2/File.txt, you are basically navigating a tree. This is the reason why recursive functions are so useful when searching a hard drive.
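To illustrate why recursion fits a folder tree so naturally, here is a minimal sketch of a function that counts every file below a folder. The function name countFiles is hypothetical (it is not a Matlab built-in); it only relies on dir and fullfile.

```matlab
function n = countFiles(folder)
% countFiles  Count the files in FOLDER and in all of its subfolders.
%   Hypothetical helper written for illustration only.
n = 0;
entries = dir(folder);                     % children of this node
for k = 1:numel(entries)
    name = entries(k).name;
    if strcmp(name, '.') || strcmp(name, '..')
        continue;                          % skip the self and parent entries
    end
    if entries(k).isdir
        % A folder is a branch of the tree: recurse into it.
        n = n + countFiles(fullfile(folder, name));
    else
        n = n + 1;                         % a file is a leaf
    end
end
end
```

Calling countFiles(pwd) walks the current folder. Each folder is handled by the same function that handled its parent, which is exactly the tree structure at work.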
If a tree is a nice way to organize any dataset, then why not use a tree to store variables of many kinds in a single file?
This is the idea that was developed at the National Center for Supercomputing Applications (NCSA) more than 20 years ago. Yes, that old. The Hierarchical Data Format (HDF) was born in 1987. The goal was to develop a file format that combines flexibility and efficiency to deal with extremely large datasets. Somehow HDF did not spread all over the place, as we didn’t really need it. In my field, we were happy with TIFFs. But now datasets are getting so large that we are reconsidering alternatives, and HDF5 (the current version of HDF) is, I believe, one of our best options.
- Because variables can be as big as you want (I mean VERY BIG here, like TB).
- You can store as many variables as you want in a file and organize them in a tree.
- Variables can be of any type and size in the same file and are stored in binary form.
- The format is open source and comes with libraries for C, C++ and Java.
- It provides very fast read and write access to all variables and to subsections of arrays.
AND maybe most IMPORTANTLY for us, Mathworks chose the HDF5 format to actually store the data in their MAT file (as of version 7.3 of the MAT format). Yes, I am sure many of you were not aware of this. The MAT file is NOT a closed proprietary format: it is based on a standardized data format that you can access very easily OUTSIDE of Matlab.
I will add that if Mathworks, one of the leaders in technical computing, decided to use HDF5, then it is very likely that you should do the same!
Alright, maybe you are convinced, maybe not. But I am sure you are wondering how to start using HDF5.
Here you have three options.
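- As a version 7.3 MAT file is essentially an HDF5 file, you can just use Matlab's save and load routines to store your large variables, whether they are organized in a tree or not. Note that save only writes this format if you pass the '-v7.3' flag (or make version 7.3 the default in the Matlab preferences).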
- Matlab recently introduced a new function called matfile that gives you access to subparts of a variable. In essence, matfile uses the capabilities of HDF5 to provide quick access to a subportion of a dataset located in a file.
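Here is a minimal sketch of what partial access with matfile can look like. The file name demo_plane.mat and the array size are assumptions chosen for illustration. The file is saved with '-v7.3' so that it is HDF5-based.

```matlab
% Create a small example file in the HDF5-based v7.3 format.
% (demo_plane.mat is a hypothetical file name.)
X = rand(200, 200, 50);
save('demo_plane.mat', 'X', '-v7.3');
clear X

% matfile returns a handle to the file; no data is loaded at this point.
PointToMat = matfile('demo_plane.mat');

% Read only the first plane of X straight from the file.
plane = PointToMat.X(:, :, 1);
```

Only the 200 x 200 values of that single plane end up in memory. The other 49 planes stay on disk.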
In the above code, you are accessing a plane of the variable X as it is within the file.
The nice thing about this scheme is that you can now create matrices that are so large that they would not fit in your memory. Like here:
PointToMat = matfile('test.mat', 'Writable', true);
PointToMat.Y(1000,1000,2500) = 0;
for i = 1:2500
    PointToMat.Y(:,:,i) = rand(1000,1000);
end
In the above code, I first initialize a 1000 x 1000 x 2500 matrix of zeros in test.mat. At double precision (8 bytes per value), this is 20 GB. On most modern computers (as of today), this will not fit in 16 GB of RAM. But because you work on the hard drive, you can do this :-). I then fill it up with data, one plane at a time. It takes a while to run (come on, you are asking Matlab to generate quite a large matrix on the hard drive, what do you expect?). Check your file. It should be pretty big in the end.
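If you want to check the arithmetic, or try the preallocation trick without waiting for 20 GB to be written, here is a scaled-down sketch. The small dimensions and the temporary file name are assumptions for illustration. size(m, 'Y') asks the file for the dimensions without loading the variable.

```matlab
% Size arithmetic for the full-size array discussed in the text.
bytesPerDouble = 8;
fullBytes = 1000 * 1000 * 2500 * bytesPerDouble;
fprintf('Full array: %.0f GB\n', fullBytes / 1e9);

% Scaled-down dimensions (assumed, so that the example runs quickly).
dims = [100 100 25];

% Preallocate on disk by writing a single zero at the last index.
fname = [tempname '.mat'];                 % temporary, hypothetical file
m = matfile(fname, 'Writable', true);
m.Y(dims(1), dims(2), dims(3)) = 0;

% Query the dimensions of Y without loading it into memory.
sz = size(m, 'Y');
```

Writing to the last index is what makes Matlab create the whole array in the file, in the same way as PointToMat.Y(1000,1000,2500)=0 above.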
- Option 3 is for when you want to take full advantage of all HDF5 capabilities. In that case, you would use direct calls to the HDF5 file routines instead of the wrapper provided by Mathworks to access MAT files. That would be:
fileinfo = h5info('test.mat');
h5info gives you information on the actual datasets stored in the file. It actually gives you back the tree structure we were talking about.
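To see that tree, you can walk the structure returned by h5info recursively. The sketch below assumes the fields Name, Groups, Datasets and Dataspace.Size that h5info returns. The function name printH5Tree is hypothetical.

```matlab
function printH5Tree(node, indent)
% printH5Tree  Print the groups and datasets described by h5info.
%   NODE is the struct returned by h5info, or one of its Groups.
%   Hypothetical helper written for illustration only.
if nargin < 2
    indent = '';
end
fprintf('%s%s\n', indent, node.Name);      % group path, '/' for the root
for k = 1:numel(node.Datasets)
    d = node.Datasets(k);
    % Dataset name followed by its dimensions.
    fprintf('%s  %s  [%s]\n', indent, d.Name, num2str(d.Dataspace.Size));
end
for k = 1:numel(node.Groups)
    % Each group is a subtree: recurse with a deeper indent.
    printH5Tree(node.Groups(k), [indent '  ']);
end
end
```

A call such as printH5Tree(h5info('test.mat')) lists every variable of a v7.3 MAT file with its size, without reading any of the data.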
You can then use h5read and h5write to access your variables. For instance :
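A minimal sketch of such a read, assuming a small variable X saved in a v7.3 MAT file (demo_h5.mat is a hypothetical file name): h5read takes the file, the path of the dataset in the tree, and optionally a start index and a block size.

```matlab
% Write a small variable X into a v7.3 (HDF5-based) MAT file.
% (demo_h5.mat is a hypothetical file name.)
X = reshape(1:24, 2, 3, 4);
save('demo_h5.mat', 'X', '-v7.3');

% Read X in full through the HDF5 interface; '/X' is its path in the tree.
Xall = h5read('demo_h5.mat', '/X');

% Read only the third plane: start at index [1 1 3], take a 2x3x1 block.
plane3 = h5read('demo_h5.mat', '/X', [1 1 3], [2 3 1]);
```

The start and count arguments are what let you pull a subsection of a very large array without touching the rest.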
You access X as if it were a file in a folder hierarchy on the hard drive. Here X is located at the root ‘/’ but it could be at ‘/bla/bla/bla/X’ if you wanted.
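As a sketch of the nested case, the example below creates a dataset several groups deep in a plain HDF5 file and reads a small block back. The file name demo_tree.h5 and the sizes are assumptions for illustration. h5create adds the intermediate groups as needed.

```matlab
fname = 'demo_tree.h5';                    % hypothetical file name
if exist(fname, 'file')
    delete(fname);                         % h5create fails if the dataset exists
end

% Create a 100x100x10 dataset three groups below the root.
h5create(fname, '/bla/bla/bla/X', [100 100 10]);
h5write(fname, '/bla/bla/bla/X', rand(100, 100, 10));

% Read a 10x10 corner of the fifth plane using the full path.
corner = h5read(fname, '/bla/bla/bla/X', [1 1 5], [10 10 1]);
```

The path plays the same role as a folder path on disk, so you are free to organize your variables in whatever hierarchy suits your experiment.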
I would not try to access Y entirely if I were you, as you would be asking Matlab to load 20 GB of data into memory, and this might slow down your computer a little bit…
If you need to optimize read/write speed when dealing with MAT/HDF5 files, I encourage you to read Yair’s excellent post on that topic.
I am curious to know what you guys do to tackle this huge problem. How do you store large datasets?