Analyzing Analytics for Parallelism

Rajesh Bordawekar

IBM T. J. Watson Research Center

The process of identifying, extracting, processing, and integrating information from raw data, and then applying it to solve a problem, is broadly referred to as analytics. The distinguishing feature of an analytics application is the use of mathematical formulations for modeling and processing the raw data, and for applying the extracted information. Although analytics applications have come of age, they have not yet received significant attention from the computer systems and architecture community. A key goal of the tutorial is to provide a broad yet in-depth survey of the emerging field of analytics to the broader systems community.

The tutorial is targeted at computer architects, compiler developers, and system architects. The material presented in this tutorial is based on an ongoing IBM study of accelerating real-world analytics workloads using different types of compute and systems approaches. The primary sources for the tutorial will be two technical reports, Analyzing Analytics and A Survey of Business Analytics Models and Algorithms, and a paper, Project Trident: An Investigation into Integrating Databases, Analytics, and High-Performance Computing. During the course of the tutorial, the participants will also learn both the similarities and differences between traditional (e.g., HPC and transactional) and analytics workloads, and learn how to apply their current knowledge to the emerging analytics field.

We assume that the target audience is familiar with the basics of parallel programming and has had exposure to introductory mathematical modeling and numerical linear algebra. The focus of this tutorial will be on the computational and runtime patterns of the analytics workloads, without going into the details of the underlying elementary mathematical formulations. The content level of the tutorial material will be 80% beginner and 20% intermediate.

The half-day (3 hour) tutorial will have 3 parts: (1) Overview of the analytics workloads, (2) Discussion of key analytics models, and (3) Implications for multi-core architectures and systems. In the first part, we will discuss examples of real-life analytics workloads and examine how their functional goals affect the algorithmic design and implementation. In the second part, we will introduce 13 key analytics models (or Exemplars) that are most widely employed in analytics. For each exemplar, we will discuss the implementation of key algorithms. We will then use this information to identify common computational and runtime patterns, data structures, and data types, and discuss how these could be mapped most effectively onto parallel systems. The final part will explore optimization opportunities for analytics workloads on emerging multi-core architectures and systems software. During this part, we will describe in detail how key analytics algorithms such as clustering, regression, classification, Monte Carlo simulation, and mathematical programming can be efficiently mapped to contemporary computer architectures. We will also describe how these may be parallelized using frameworks on contemporary multi-core processor and accelerator architectures.

Detailed Outline

Analytics Workloads in Practice (20 minutes)

Analytics: A Definition
Analytics at your service
Analytics Exemplars

Key Analytics Models (Exemplars) (120 minutes)

In this section, we introduce 13 key analytics models or exemplars. For every analytics exemplar, we will first summarize the basic ideas, review the core algorithms, and then describe their computational (e.g., key data types and data structures) and runtime patterns. References for further reading will also be provided.

Regression Analysis
Clustering
Nearest Neighbor Search
Association Rule Mining
Neural Networks
Support Vector Machines
Decision Tree Learning
Time-series Processing
Text Analytics
Monte Carlo Methods
Mathematical Programming
Online Analytical Processing (OLAP)
Graph Analytics

Parallelization and Acceleration Opportunities (40 minutes)

In this section, we will first summarize the computational and runtime patterns of the analytics exemplars, and then use this information to discuss the parallelization and acceleration opportunities for the analytics workloads. Next, we will describe in detail a few key analytics models, e.g., clustering, regression analysis, and mathematical programming. We will conclude with a discussion of processor architecture and software stack issues related to efficient mapping and parallelization of the above computational patterns on contemporary parallel computer systems.

Computation and runtime patterns of the Analytics Exemplars
Summary of Parallelization and Acceleration Opportunities
Deep dive: Clustering, Regression, Mathematical Programming, etc.
Impact on Processor Architecture and Software Stacks

References

R. Bordawekar. Project Trident: An Investigation into Integrating Databases, Analytics, and High-Performance Computing. In SC Companion 2012, pages 1326-1328, November 2012.
R. Bordawekar, R. Blainey, and C. Apte. Analyzing Analytics. Technical Report RC25317, IBM T. J. Watson Research Center, 2012. To appear in ACM SIGMOD Record, December 2013.
R. Bordawekar, R. Blainey, C. Apte, and M. McRoberts. Analyzing Analytics Part (1): A Survey of Business Analytics Models and Algorithms. Technical Report RC25186, IBM T. J. Watson Research Center, 2011.