Over the past year, Nielsen has been working with Intel, Cloudera and the Numerical Algorithms Group to bring the worlds of high-performance computing (HPC) and big data closer together. By merging these technologies, we’ll be able to run large-scale optimizations, simulations and segmentation analyses more efficiently than before, within the same platform that houses our data.
Big data has been in the news a lot over the last few years, and when it comes to putting it to use, companies turn to Hadoop, a technology that has become largely synonymous with big data itself. Hadoop is an open-source software framework for storing data and running applications, one that lets companies process and manage very large amounts of data in a scalable way. Hadoop’s primary programming model for processing data has historically been MapReduce, which scales massively across hundreds or thousands of servers in a cluster. While adequate for many purposes, MapReduce is not well suited to analytic tasks that require multiple passes through the data or communication among all of the processes involved in the analysis.
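To make the MapReduce programming model concrete, here is a single-machine Python sketch of the map, shuffle and reduce steps, using the classic word-count example. This is a toy illustration of the model, not Hadoop's actual API; all function names are our own.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate values by key -- the 'shuffle' between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1) for every word it sees,
# and the reducer sums those counts per word.
def word_mapper(line):
    for word in line.split():
        yield word.lower(), 1

def count_reducer(word, counts):
    return sum(counts)

lines = ["big data meets HPC", "big clusters process big data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts["big"])  # prints 3
```

Each map call sees only its own record; all coordination happens through the shuffle, which is exactly why algorithms needing many passes or direct process-to-process communication sit awkwardly on this model.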
HPC, on the other hand, resides primarily in the domain of government labs, universities and companies in the oil and gas and finance industries. Central to HPC is the Message Passing Interface (MPI), a standard that allows a program to run across a cluster of computers while its processes communicate directly with one another.
The chart below shows the similarities between these two technology stacks. At the foundation are the servers, the disks and the network connecting them. On top of that, and essential to both technologies, is the distributed file system, which lets a process running on any server see the same data as every other process. Above that, and again key to both, is the resource manager, which allocates cluster resources to applications on request. Finally, at the top of each stack sits the programming environment: MPI on one side and the Hadoop frameworks on the other. Our focus to date has been integrating the top two layers of each stack by allocating Hadoop resources for use by an MPI application running on the Hadoop cluster.
Integrating Hadoop and MPI brings together the best of two complementary technologies, pairing the data management capabilities of Hadoop with the performance of native MPI applications on the same cluster. Intel and Cloudera plan to provide production support for this integration in future releases of their software, while we continue to explore the possibilities it opens up for our clients.