WOSB & WBENC-Certified WBE

Performance Tools

PerfPal

PerfPal is a full-featured performance framework consisting of a suite of easy-to-use HPC performance profiling and modeling tools. Unlike other performance analysis frameworks, PerfPal does not require recompiling the application code to collect performance data. Instead, it uses binary instrumentation to modify the production binaries directly so that they record execution traces. The instrumented binaries can then be run at scale in production computing environments, and the raw traces are analyzed to produce performance reports and viewgraphs. The design of the reports and recommendations is driven by usability lessons that the principals of EP Analytics have learned through extensive, direct engagement with HPC developers in many performance engineering collaborations. Some key components and capabilities of PerfPal are the following:

1. PerfPal’s Perfector component is a lightweight profiler that, for each rank/task, breaks the application execution time into computation, MPI communication and I/O times.

2. PerfPal’s VecMeter component gives precise information on the level of vector unit utilization at loop and function levels.

3. PerfPal’s "Hot-Path" component automatically inserts timers around all functions to help identify key bottlenecks in large HPC codes.

Contact us for more information.

PerfPal's "Hot-Path" Visualization

The "Hot-Path" visualization shows the key execution profiles using a control flow graph; nodes represent functions and directed edges represent function calls. Each node can be annotated with performance data such as cache hit rates, vectorization metrics, instruction mix, function times, etc.
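As a rough illustration of the idea behind such a graph, the sketch below models functions as nodes annotated with metrics and follows the most expensive callee at each step. It is not PerfPal's implementation or data format: the `FunctionNode` class, the `hot_path` helper, the function names, and all timing values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionNode:
    """One node of a call graph (hypothetical structure, not PerfPal's format)."""
    name: str
    inclusive_time: float                         # seconds spent in the function and its callees
    metrics: dict = field(default_factory=dict)   # e.g. cache hit rate, vectorization
    callees: list = field(default_factory=list)   # directed edges: functions this one calls


def hot_path(root):
    """Follow the callee with the largest inclusive time from root down to a leaf."""
    path = [root.name]
    node = root
    while node.callees:
        node = max(node.callees, key=lambda n: n.inclusive_time)
        path.append(node.name)
    return path


# Made-up example graph: main calls time_step and write_checkpoint,
# and time_step calls solve and halo_exchange.
solve = FunctionNode("solve", 6.0, {"vectorized_fraction": 0.35})
halo_exchange = FunctionNode("halo_exchange", 2.5)
write_checkpoint = FunctionNode("write_checkpoint", 1.0)
time_step = FunctionNode("time_step", 8.5, callees=[solve, halo_exchange])
main = FunctionNode("main", 10.0, callees=[time_step, write_checkpoint])

HOT = hot_path(main)
print(" -> ".join(HOT))  # main -> time_step -> solve
```

A real call graph may contain recursion or shared callees; this sketch assumes a simple tree so that the traversal always terminates.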

MPI Profile

PerfPal can generate an execution profile of HPC applications; shown here is the breakdown of overall application time into computational times and communication event times. The latter is further broken down to show time spent in implicit synchronization.

MemInsight

The MemInsight tool-suite employs profiling techniques to trace and analyze data movement behavior and thread-level performance of production HPC codes. The analyzed behavior is presented to the end-users in the form of easy-to-understand and actionable performance reports, viewgraphs and optimization recommendations. Some example reports include:

1. A report that shows code-sections with poor cache usage along with actionable optimization recommendations.
2. Viewgraphs that depict thread-level performance (e.g., work imbalance across threads along with additional information on thread-level overhead).

Contact us for more information.

Perfector

EP Analytics’ Perfector (for Performance Inspector) tool helps maximize the utility of each dollar spent on large-scale HPC system procurements. Perfector achieves this goal by deploying ultra light-weight tools to automatically analyze the communication, computation and Input/Output (I/O) behavior of large HPC codes. The analyzed behavior is presented to the code developers and system administrators in the form of viewgraphs that are intuitive and actionable. Below we highlight the key aspects/features of Perfector.

Collection of performance behavior is transparent to the user and the process only requires making a few simple changes to the job script (e.g., loading a module).

Load imbalance (i.e., disproportionate sharing of work across multiple compute resources) is often one of the main factors that impede the scalability of HPC applications. Perfector collects per-MPI-rank performance statistics that provide load imbalance information, also taking into account the time spent in implicit synchronization during MPI collective operations.
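To make the notion of load imbalance concrete, the following sketch computes one common imbalance metric, (max - mean) / max, from per-rank compute times and shows how time spent waiting in collectives tends to mirror it. The metric is an assumption chosen for illustration and is not necessarily the one Perfector reports; the `load_imbalance` helper and the per-rank numbers are hypothetical.

```python
from statistics import mean


def load_imbalance(per_rank_compute):
    """Return (max - mean) / max over per-rank compute times; 0.0 means perfectly balanced."""
    peak = max(per_rank_compute)
    if peak == 0:
        return 0.0
    return (peak - mean(per_rank_compute)) / peak


# Hypothetical samples: rank -> (compute seconds, seconds waiting inside MPI collectives).
samples = {
    0: (8.0, 2.0),
    1: (9.0, 1.0),
    2: (10.0, 0.0),
    3: (5.0, 5.0),
}

compute_times = [compute for compute, _ in samples.values()]
imbalance = load_imbalance(compute_times)

# The rank with the most compute time is the one the others end up waiting for.
slowest_rank = max(samples, key=lambda r: samples[r][0])
# The rank with the most wait time has the least work to do.
most_idle_rank = max(samples, key=lambda r: samples[r][1])

print(f"imbalance={imbalance:.2f}, slowest rank={slowest_rank}, most idle rank={most_idle_rank}")
```

In this made-up data set, the wait time of each rank equals the gap between its compute time and that of the slowest rank, which is why implicit synchronization time is a useful signal of imbalance.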

Contact us to learn more about licensing Perfector. We offer flexible licensing terms and conditions.

MPI Profile

Perfector can generate an execution profile of HPC applications; shown here is the breakdown of overall application time into computational times and communication event times. The latter is further broken down to show time spent in implicit synchronization.

MPI Profile - Deep Dive

A deep-dive into the MPI profile of specific ranks. Two ranks (9 and 255) show vastly different profiles and Perfector's data helps analyze the root causes.

MPI "Hot" Sites

Perfector can collect information to characterize “hot” MPI call sites, i.e., specific call sites in the source code that account for a majority of the time spent in communication. In the figure, func_10 accounts for 26% of the total communication time.
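The kind of aggregation behind such a ranking can be sketched as follows: sum the communication time recorded at each call site, then express each total as a share of all communication time. The record format, the `hot_sites` helper, and the site labels and timings are hypothetical and do not reflect Perfector's actual trace format or the data in the figure.

```python
from collections import defaultdict


def hot_sites(records, top=3):
    """Rank call sites by their share of total communication time.

    records: iterable of (call_site_label, seconds) pairs (hypothetical format).
    Returns up to `top` (call_site_label, fraction_of_total) pairs, largest first.
    """
    totals = defaultdict(float)
    for site, seconds in records:
        totals[site] += seconds
    total = sum(totals.values())
    if total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(site, secs / total) for site, secs in ranked[:top]]


# Made-up timing records; the same call site can appear many times.
records = [
    ("site_a", 3.0),
    ("site_b", 1.0),
    ("site_a", 1.0),
    ("site_c", 5.0),
]

ranking = hot_sites(records)
for site, share in ranking:
    print(f"{site}: {share:.0%} of communication time")
```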

Binary Analysis

EP Analytics has a long history of developing binary instrumentation (BI) toolkits. PEBIL, a widely used binary instrumentation toolkit for x86-64/Linux, was developed and is currently maintained by our team. EPAX, a BI toolkit for ARM, is currently under development.

Because we design and develop our own BI toolkits and we understand the performance issues faced by HPC codes on modern architectures, we have a unique ability to build optimizations into both the BI toolkits and our performance tools. Such optimizations enable us to analyze HPC codes in production environments.

Contact us for more information.