Massive data for minuscule communities

MSU researchers led by James Tiedje have developed a method to better study and analyze the large amount of data from soil microbial communities.


It’s relatively easy to collect massive amounts of data on microbes. But the data files are so large that it takes days to transmit them to other researchers and months to analyze them once they are received.

Researchers at Michigan State University (MSU) have developed a new computational technique, featured in a recent issue of the Proceedings of the National Academy of Sciences, that relieves the logjam that these big data issues create. The paper is co-authored by MSU AgBioResearch scientist James Tiedje and C. Titus Brown, MSU assistant professor in bioinformatics.

“Microbial communities living in soil or the ocean are quite complicated,” said Tiedje, MSU university distinguished professor of microbiology and molecular genetics and the director of the MSU Center for Microbial Ecology. “Their genomic data is easy enough to collect, but their data sets are so big that they actually overwhelm today’s computers.”

The general technique developed by the research team can be used on most microbial communities. The interesting twist is that the team's solution runs on small computers, a novel approach given that most bioinformatics research relies on supercomputers, Brown said.

“To thoroughly examine a gram of soil, we need to generate about 50 terabases [a terabase is equivalent to 10¹² base pairs] of genomic sequence – about 1,000 times more data than was generated for the initial human genome project,” Brown explained. “It would take about 50 laptops to store that much data. Our paper shows the way to make it work on a much smaller scale.”
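As a rough back-of-the-envelope check of those figures (assuming roughly one byte per base pair and about a terabyte of storage per laptop, numbers not stated in the article), the scale works out as follows:

```python
# Back-of-the-envelope scale check (assumed figures, not from the article):
# roughly 1 byte per base pair and ~1 TB of storage per laptop.
terabases_needed = 50                  # ~50 terabases of sequence per gram of soil
bases = terabases_needed * 10**12      # a terabase is 10**12 base pairs
bytes_needed = bases * 1               # assume ~1 byte per base pair
terabytes = bytes_needed / 10**12
laptops = terabytes / 1.0              # assume ~1 TB of disk per laptop

print(f"{bases:.0e} base pairs ≈ {terabytes:.0f} TB ≈ {laptops:.0f} laptops")
# -> 5e+13 base pairs ≈ 50 TB ≈ 50 laptops
```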

Analyzing DNA data using traditional computing methods is like trying to eat a large pizza in a single bite. The huge influx of data bogs down computers’ memory and causes them to “choke.” The new method employs a filter that folds the DNA “pizza” up compactly using a special data structure, allowing the computers to nibble on slices of data and eventually digest the entire sequence. This technique creates a 40-fold decrease in memory requirements, which allows scientists to plow through reams of data without using a supercomputer.
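The article does not spell out the data structure behind that filter, so the following is only an illustrative sketch, not the authors' published method: a minimal Bloom-filter-style membership set for DNA k-mers, showing how a fixed-size bit array can record which subsequences have been seen while using far less memory than storing every k-mer explicitly. All class names and parameters here are hypothetical.

```python
import hashlib


class KmerBloomFilter:
    """A tiny Bloom-filter-style set for DNA k-mers.

    Records presence/absence of k-mers in a fixed-size bit array instead of
    keeping every k-mer string in memory, trading a small false-positive
    rate for a large reduction in memory use.
    """

    def __init__(self, num_bits=1_000_000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, kmer):
        # Derive several bit positions from independent hashes of the k-mer.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{kmer}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def kmers(sequence, k=20):
    """Yield every k-length substring (k-mer) of a DNA read."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# Example: record which k-mers have been seen across a stream of reads,
# processing one read at a time rather than loading everything at once.
seen = KmerBloomFilter()
for read in ["ACGTACGTACGTACGTACGTA", "TTGCATTGCATTGCATTGCAT"]:
    for km in kmers(read):
        seen.add(km)

print("ACGTACGTACGTACGTACGT" in seen)  # likely True (it was added)
```

A filter like this trades a small, tunable false-positive rate for a fixed, predictable memory footprint, which is the general trick that lets huge sequence collections be processed piece by piece on modest hardware.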

Tiedje and Brown will continue to pursue this line of research. To encourage others to investigate it further and improve upon it, the researchers made the complete source code and ancillary software available to the scientific community.
