EP Analytics’ expertise and tools help enterprises maximize the return on investment in HPC systems. We assist clients with Performance Characterization, Energy Efficiency, System Design and Emerging Technology Integration.
ARM for HPC
The energy consumption of large-scale HPC and hyper-scale computing systems is becoming a serious concern and has been the subject of increasing attention in the academic literature and popular press. Adoption of the energy-efficient ARM processor architecture for server systems is one approach to dealing with this challenge. In particular, the recent introduction of an ARM architecture with 64-bit capability, essential for high-performance applications, has sparked a deluge of interest in the server and high-performance computing segments. Chips based on the 64-bit ARMv8 specification are now available from Applied Micro and will soon be available from AMD, Broadcom, Qualcomm, and others (EP Analytics has a 64-bit Applied Micro server in its lab today). The Open Compute Project, focused on hyper-scale data centers, has developed an ARM server specification. Research recently published by EP Analytics has shown that fine-grained application analysis can identify performance bottlenecks when porting applications from current x86 servers to ARM-based servers, so that porting efforts can be focused where they most improve the energy efficiency and performance of ARM-based server platforms. Enabling the broad adoption of this type of analysis is critical to realizing the energy efficiency of the ARM platform and requires a robust software tools ecosystem for supporting application analysis and porting efforts.
Over a decade-long period, EP Analytics and its principals have been conducting research in the performance modeling and analysis of HPC systems and developing tools for static and dynamic analysis, modeling, and simulation of such systems. Under a previous Phase I SBIR grant (DOE Award #DE-SC0009497), EP Analytics developed the “EPAX Toolkit,” a static binary analysis tool forming a key foundation for the complete tool suite. Recognizing that maximum utility and adoption of the tools requires ease of use for experienced (not expert) developers, we propose to add a visual interface to our tool suite and to explore integration into third-party tools and integrated development environments.
Related Papers & Presentations
Characterization and Bottleneck Analysis of a 64-bit ARMv8 Platform
Abstract: This paper presents the first comprehensive study of the performance, power and energy efficiency of the Applied Micro X-Gene, the first commercially available 64-bit ARMv8 platform. Our study includes a detailed comparison of the X-Gene to three other architectural design points common in HPC systems. Across these platforms, we perform careful measurements across 400+ workloads, covering different application domains, parallelization models, floating-point precision models, memory intensities, and several other features. We find that the X-Gene has 1.2× better energy consumption than an Intel Sandy Bridge, a design commonly found in HPC installations, while the Sandy Bridge is 2.3× faster.
Precisely quantifying the causes of performance and energy differences between two platforms is a challenging problem. This paper is the first to adopt a statistical framework called Partial Least Squares (PLS) Path Modeling to this problem. PLS Path Modeling allows us to capture complex cause-effect relationships and difficult-to-measure performance concepts relating to the effectiveness of architectural units and subsystems in improving application performance. Using PLS Path Modeling to quantify the causes of the performance differences between X-Gene and Sandy Bridge in the HPC domain, our efforts reveal that the performance of the memory subsystem is the dominant factor.
Michael Laurenzano, Ananta Tiwari, Allyson Cauble-Chantrenne, Adam Jundt, Roy Campbell†, and Laura Carrington
†High Performance Computing Modernization Program, U.S. Dept. of Defense
Accepted to: ISPASS (International Symposium on Performance Analysis of Systems and Software), 2016. Available upon request.
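As a rough illustration of how the two headline ratios in the abstract above relate, the sketch below uses the identity energy = average power × time. It assumes that "1.2× better energy consumption" means the Sandy Bridge uses 1.2× the energy of the X-Gene for the same work, and it treats both figures as if they described a single run, although the paper reports them as aggregates over many workloads. The function name and parameters are hypothetical, and the result is back-of-the-envelope arithmetic, not a figure reported in the paper.

```python
def implied_power_ratio(energy_ratio: float, slowdown: float) -> float:
    """Return P_fast / P_efficient implied by two ratios.

    energy_ratio: E_fast / E_efficient (assumed reading of "1.2x better")
    slowdown:     T_efficient / T_fast (2.3 in the abstract)

    Because E = P * T, we have
    P_fast / P_efficient = (E_fast / E_efficient) * (T_efficient / T_fast).
    """
    if energy_ratio <= 0 or slowdown <= 0:
        raise ValueError("ratios must be positive")
    return energy_ratio * slowdown


# Illustrative use with the abstract's two figures (assumption: both
# ratios are applied to the same hypothetical run).
ratio = implied_power_ratio(energy_ratio=1.2, slowdown=2.3)
print(f"Sandy Bridge / X-Gene average power: {ratio:.2f}x")  # 2.76x
```

Under these assumptions, a platform that is 2.3× slower can still use less energy only if its average power draw is lower by more than that same factor.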
Compute Bottlenecks on the New 64-bit ARM
Abstract: The trifecta of power, performance and programmability has spurred significant interest in the 64-bit ARMv8 platform. These new systems provide energy efficiency, a traditional CPU programming model, and the potential of high performance when enough cores are thrown at the problem. However, it remains unclear how well the ARM architecture will work as a design point for the High Performance Computing market. In this paper, we characterize and investigate the key architectural factors that impact power and performance on a current ARMv8 offering (X-Gene 1) and Intel’s Sandy Bridge processor. Using Principal Component Analysis, multiple linear regression models, and variable importance analysis, we conclude that the CPU frontend has the biggest impact on performance on both the X-Gene and Sandy Bridge processors.
Adam Jundt, Allyson Cauble-Chantrenne, Ananta Tiwari, Joshua Peraza, Michael Laurenzano, and Laura Carrington
Accepted to: E2SC (Energy Efficient Supercomputing), 2015. Available upon request.
Performance and Energy Efficiency Analysis of 64-bit ARM Using GAMESS
Abstract: Power efficiency is one of the key challenges facing the HPC co-design community, sparking interest in the ARM processor architecture as a low-power high-efficiency alternative to the high-powered systems that dominate today. Recent advances in the ARM architecture, including the introduction of 64-bit support, have only fueled more interest in ARM. While ARM-based clusters have proven to be useful for data server applications, their viability for HPC applications requires an in-depth analysis of on-node and inter-node performance. To that end, as a co-design exercise, the viability of a commercially available 64-bit ARM cluster is investigated in terms of performance and energy efficiency with the widely used quantum chemistry package GAMESS. The performance and energy efficiency metrics are also compared to a conventional x86 Intel Ivy Bridge system. A 2:1 Moonshot core to Ivy Bridge core performance ratio is observed for the GAMESS calculation types considered. Doubling the number of cores to complete the execution faster on the 64-bit ARM cluster leads to better energy efficiency compared to the Ivy Bridge system; i.e., a 32-core execution of a GAMESS calculation has approximately the same performance and better energy-to-solution than a 16-core execution of the same calculation on the Ivy Bridge system.
Ananta Tiwari, Kristopher Keipert, Adam Jundt, Joshua Peraza, Sarom S. Leang, Michael Laurenzano, Mark Gordon, and Laura Carrington
Accepted to: Co-HPC (International Workshop on Hardware-Software Co-Design for High Performance Computing), 2015. Available upon request.
ARM in HPC: Presentation at ARM TechCon
A Look at Heterogeneous Architectures in HPC: Presentation at NRL
Characterizing the Performance-Energy Tradeoff of Low-Power ARM Processors in HPC
Abstract: Deploying large numbers of small, low power cores has been gaining traction recently as a design strategy in high performance computing (HPC). The ARM platform that dominates the embedded and mobile computing segments is now being considered as an alternative to high-end x86 processors that largely dominate HPC because peak performance per watt may be substantially improved using off-the-shelf commodity processors. In this work we methodically characterize the performance and energy of HPC computations drawn from a number of problem domains on current ARM and x86 processors. Unsurprisingly, we find that the performance, energy and energy-delay product of applications running on these platforms varies significantly across problem types and inputs. Using static program analysis we further show that this variation can be explained largely in terms of the capabilities of two processor subsystems: floating point/SIMD and the cache/memory hierarchy, and that static analysis of this kind is sufficient to predict which platform is best for a particular application/input pair. In the context of these findings, we evaluate how some of the key architectural changes being made for upcoming 64-bit ARM platforms may impact HPC application performance.
Michael Laurenzano, Ananta Tiwari, Adam Jundt, Joshua Peraza, Laura Carrington, William Ward, Jr.†, and Roy Campbell†
†High Performance Computing Modernization Program, U.S. Dept. of Defense
Accepted to: Euro-Par, 2014. Available at Springer.
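The three metrics named in the abstract above (performance, energy, and energy-delay product) can rank the same pair of platforms differently. The sketch below is a minimal illustration of that point and does not reproduce the paper's method or data: the `Run` class, the `best_by` helper, and every number in it are hypothetical. It uses the standard definitions energy = average power × runtime and energy-delay product (EDP) = energy × runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One measured execution of an application/input pair (hypothetical)."""
    platform: str
    runtime_s: float    # wall-clock time in seconds
    avg_power_w: float  # average power draw in watts

    @property
    def energy_j(self) -> float:
        # Energy-to-solution in joules: average power times runtime.
        return self.avg_power_w * self.runtime_s

    @property
    def edp(self) -> float:
        # Energy-delay product in joule-seconds: penalizes slow runs twice.
        return self.energy_j * self.runtime_s


def best_by(runs, metric: str) -> str:
    """Return the platform with the lowest value of the given metric."""
    return min(runs, key=lambda r: getattr(r, metric)).platform


# Made-up measurements, chosen only to show the metrics disagreeing.
runs = [
    Run("arm", runtime_s=40.0, avg_power_w=5.0),   # 200 J, EDP 8000 J*s
    Run("x86", runtime_s=10.0, avg_power_w=60.0),  # 600 J, EDP 6000 J*s
]

for metric in ("runtime_s", "energy_j", "edp"):
    print(metric, "->", best_by(runs, metric))
```

With these made-up numbers the low-power platform wins on energy alone, while the faster platform wins on both runtime and EDP, which is why the choice of metric matters when deciding which platform is best for a given application/input pair.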