Processing Rat Brain Neuronal Signals Using an Apache Hadoop Computing Cluster – Part III
Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), explained why the analyses of these recordings are scientifically interesting, and walked through our implementation of these analyses with Apache Hadoop and Apache Hive (Part II). This last part of the story cuts straight to the results, then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.
Here are two plots of the output data from our benchmark run. Both plots show the same data, one in three dimensions and the other in a two-dimensional density format.
These plots show two distinct regions. The high gamma band (in yellow) shows peaks in the middle phase bins. The low gamma band (in dark brown) shows peaks in the lower and upper phases, which wrap around and form a single region.
Since the goal of this project was to see if the Hadoop ecosystem running on a computing cluster could provide performance and usability improvements to the neuronal signal processing workflow, we ran a scale test. The first graph shows the performance of 1, 2, 3, and 4 simultaneous rat runs. It shows that we do in fact see a benefit for up to three simultaneous runs, but the fourth run causes a jump in execution time. This is due to our cluster’s resource limit of 46 computing slots in the convolution step: each rat run requires 15 slots, so the fourth run must wait until 15 slots free up.
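The slot arithmetic behind that jump can be sketched in a few lines. This is a hypothetical illustration using only the numbers quoted above (46 cluster slots, 15 slots per run); the function names are ours, not part of the actual pipeline.

```python
# Hypothetical sketch of the cluster's slot constraint: with 46 slots and
# 15 slots per rat run, only three convolution runs fit at once, so a
# fourth simultaneous run must wait for a whole earlier run to finish.

TOTAL_SLOTS = 46      # computing slots available in the convolution step
SLOTS_PER_RUN = 15    # slots consumed by one 15-channel rat run

def runs_in_parallel(total_slots: int, slots_per_run: int) -> int:
    """How many convolution runs the cluster can execute simultaneously."""
    return total_slots // slots_per_run

def waves_needed(num_runs: int, total_slots: int, slots_per_run: int) -> int:
    """How many sequential 'waves' of runs are needed for num_runs runs."""
    parallel = runs_in_parallel(total_slots, slots_per_run)
    return -(-num_runs // parallel)  # ceiling division

print(runs_in_parallel(TOTAL_SLOTS, SLOTS_PER_RUN))  # → 3
print(waves_needed(4, TOTAL_SLOTS, SLOTS_PER_RUN))   # → 2
```

The second wave is why the fourth run’s wall-clock time roughly doubles: it cannot start its convolution until one of the first three runs releases its 15 slots.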
The next graph shows the breakdown of each processing step. The convolution and averaging steps dominated the total execution time. For the convolution step, we saw only a slight increase in time for runs 2 and 3 compared with run 1, because computing slots were still available. The fourth run, however, had to wait for a previous run to complete, so it took about twice as long as the first run. Because the averaging step consumed all processing slots for each run, its processing time increased linearly with the number of runs.
With a cluster that had additional computing slots, we could handle more rat runs in parallel (given the convolution step’s characteristics) and process the averaging step more quickly. With our current cluster, a rule of thumb was that we could process approximately three 15-channel rat runs at a time, so 100 rat runs would take about 2 weeks to process.
Performance Comparison with Matlab
Previously, the analysis described in this series was performed with Matlab on a single workstation. An apples-to-apples comparison with our Hadoop cluster implementation is difficult, because the workstation’s memory limitations forced some compromises in the workflow, while the Hadoop implementation realizes the ideal workflow.
The Hadoop approach does the processing in this order:
- Averaging the 12 channels
- Computing the mean and standard deviation of the average channel values
- Subsetting the time intervals
This took about 4 hours to process using 46 computing slots. The Matlab approach does the processing in this order:
- Subsetting the time intervals
- Averaging the 12 channels
This took about 10 hours to process using a small number of processor cores. The Matlab approach was therefore much more resource-efficient, but the Hadoop approach provided better, and cheaper, scale-up capability.
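The difference between the two orderings can be illustrated with a toy NumPy sketch (synthetic data, not the actual recordings; the interval and sizes are made up). Both orderings produce the same averaged interval, but only the average-first ordering leaves full-recording statistics available without a second pass over the raw data.

```python
import numpy as np

# Toy illustration of the two processing orders described above:
# 12 channels of 1,000 samples of synthetic data.
rng = np.random.default_rng(0)
signals = rng.standard_normal((12, 1000))
interval = slice(200, 400)  # a hypothetical time interval of interest

# Hadoop order: average channels -> mean/std of the average -> subset.
avg = signals.mean(axis=0)
mu, sigma = avg.mean(), avg.std()
hadoop_subset = avg[interval]

# Matlab order: subset the time interval first -> average the smaller subset.
matlab_subset = signals[:, interval].mean(axis=0)

# Averaging is linear, so both orders yield the same averaged interval...
assert np.allclose(hadoop_subset, matlab_subset)
# ...but only the Hadoop order has whole-recording statistics (mu, sigma)
# on hand for normalization, without re-reading the raw channel data.
```

The Matlab ordering exists precisely because the workstation could not hold the full averaged recording and its statistics in memory; subsetting first shrinks the data to fit.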
For the convolution step, the Matlab approach could read in a channel signal from disk to memory, perform a single kernel convolution, and output the result back to disk in about 11 seconds. The comparable number with the Hadoop implementation was about 37 seconds. However, the Hadoop cluster could process each channel in parallel, one channel per computing slot. Matlab could also be configured to leverage parallel processing, but at a higher cost. Here is a more specific breakdown of the time spent in the convolution processing:
| Step | Time (sec) |
| --- | --- |
| Load Kernel | 0.33 |
| Load Data | 15.82 |
| Signal FFT | 1.08 |
| Kernel FFT | 0.60 |
| Product | 0.20 |
| Inverse FFT | 1.07 |
| Output Data | 38.00 |
The last four steps were repeated for each of the 196 wavelet kernels. Although we initially expected this task to be CPU-bound, the FFT processing was so efficient that most of the time was spent on data output. With more disk arms on each node in our cluster, we would expect this time to decrease significantly.
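The per-kernel loop above can be sketched with NumPy’s FFT routines. This is a minimal stand-in, not the production code: the signal length, kernel shapes, and kernel count are made up (the real pipeline used 196 wavelet kernels per channel), and Hanning windows stand in for the actual wavelet kernels. It does show the structure the timing table implies: the signal FFT is computed once, while the kernel FFT, product, and inverse FFT repeat per kernel.

```python
import numpy as np

# Sketch of FFT-based convolution with hypothetical sizes. In the cluster,
# appending each result corresponds to the dominant "Output Data" disk write.
n = 4096
signal = np.random.default_rng(1).standard_normal(n)
kernels = [np.hanning(64) for _ in range(4)]  # stand-ins for wavelet kernels

sig_fft = np.fft.rfft(signal, n)  # one signal FFT, reused for every kernel
results = []
for k in kernels:
    k_fft = np.fft.rfft(k, n)                # kernel FFT (zero-padded to n)
    conv = np.fft.irfft(sig_fft * k_fft, n)  # product + inverse FFT
    results.append(conv)                     # "output" step per kernel

# The FFT product gives circular convolution; it matches direct (linear)
# convolution everywhere except the first len(kernel)-1 wrap-around samples.
direct = np.convolve(signal, kernels[0])[:n]
assert np.allclose(results[0][63:], direct[63:])
```

Since only the last three in-memory steps (kernel FFT, product, inverse FFT) cost roughly two seconds combined per kernel versus 38 seconds of output, the loop body is dominated by I/O exactly as the table shows.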
An important benefit of the Hadoop cluster approach, when compared with doing analysis on a single Matlab workstation, is that we now have the capacity to fully process the data, instead of being constrained by workstation memory, and can keep all of the intermediate data online for ad hoc analysis. This analysis can be done with Hive.
Future Work
As usual, interesting work spawns ideas for other interesting work. Here is a list of potential future research:
- Look at doing the channel averaging before the convolution step, which would further reduce storage demands, allow more parallel convolution rat runs, and replace the large averaging task with a much smaller one
- Provide better support for incremental rat run additions to the processing data
- Provide better support for selecting specific subsets of data channels, which correspond to specific brain regions
- See if increasing the number of disk arms per node will improve I/O performance
- Experiment with overcommitting MapReduce slots beyond the available physical processing cores
- Process all existing rat run datasets
For more information on big data activities at the University of St. Thomas Graduate Programs in Software, see our Center of Excellence for Big Data (CoE4BD) webpage at http://www.stthomas.edu/coe4bd, and on Twitter @coe4bd.