Performance counter-based strategies to improve data locality on multiprocessor systems: reordering and page migration techniques

Lorenzo del Castillo, Juan Angel

Supervised by:

Francisco Fernández Rivera Supervisor
Juan Carlos Pichel Campos Supervisor

University of defense: Universidade de Santiago de Compostela

Date of defense: 13 January 2012

Examination committee:

Emilio López Zapata Chair
Javier Díaz Bruguera Secretary
Petr Tuma Member
Eduard Ayguadé Parra Member
José Alberto Cardoso Cunha Member

Department:

Departamento de Electrónica e Computación

Type: Thesis

Teseo: 317652

Abstract

In recent years, the computational resources available to science and engineering have evolved considerably. The line that traditionally separated multicomputers from multiprocessors is blurring, and most modern supercomputers now consist of several multicore NUMA (Non-Uniform Memory Access) multiprocessor nodes interconnected by a high-speed network. In this context, data locality becomes critically important for the performance of parallel codes. As systems have grown in complexity, so has the need to understand what happens inside a program. Profiling, understood as a performance monitoring technique that records information about running code, has proven very useful for narrowing down bottlenecks. The hardware performance counters included in the vast majority of modern microprocessors are therefore an essential tool for monitoring the system and gaining insight into it while a program executes. More recently, a new technique has emerged. Precise Event-Based Sampling (PEBS) is a performance counter-based profiling technique that has been enhanced in the Intel Itanium family with respect to its predecessors. The performance counters of these processors are precise enough to return not only the exact address at which an event occurs, but also the latency of that access. This opens the door to new performance techniques based on that information. In this dissertation, we study PEBS techniques for improving the performance of applications on a NUMA, Itanium2-based system. We demonstrate that low-cost PEBS profiling can support strategies that improve the performance of an important group of computational and scientific codes at runtime. In addition, the accurate information provided by the new Event Address Registers (EAR) of the Itanium2 architecture fosters the development of new data allocation strategies.
Along the same lines, we have also developed a series of PEBS-based dynamic page migration strategies. Specifically, two problems are addressed: how to improve the performance of locality optimisation techniques for irregular codes at runtime, focusing on the Sparse Matrix-Vector product kernel, and how to develop strategies for dynamic page migration. The main contributions of this dissertation are:
1. A study of the factors that affect performance, as well as of data and thread allocation policies, on the FINISTERRAE supercomputer, the target platform on which this thesis relies.
2. The implementation of a performance model for FINISTERRAE.
3. The development of hardware counter-based strategies to assist reordering techniques for irregular codes, in order to reduce their cost and improve their behaviour.
4. The development of novel hardware counter-guided dynamic page migration algorithms that take advantage of the new features provided by PEBS.
As a software contribution, we present a user-level page migration framework to monitor, sample and control an application at runtime.