Hardware counter based performance analysis, modelling, and improvement through thread migration in numa systems

  1. García Lorenzo, Oscar
Dirixida por:
  1. Tomás F. Pena Director
  2. José Carlos Cabaleiro Domínguez Director

Universidade de defensa: Universidade de Santiago de Compostela

Fecha de defensa: 11 de abril de 2016

Tribunal:
  1. Emilio Luque Fadón Presidente/a
  2. Patricia González Secretario/a
  3. Petr Tuma Vogal
  4. María Inmaculada García Fernández Vogal
  5. Enrique Salvador Quintana Ortí Vogal
Departamento:
  1. Departamento de Electrónica e Computación

Tipo: Tese

Resumo

These last years have seen an important evolution in the computational resources available in science and engineering. Currently, most high performance systems include several multicore processors and use a NUMA (Non Uniform Memory Access) memory architecture. In this context, data locality becomes a highly important issue for parallel codes performance. It is foreseeable that the complexity as SMP (Symmetric Multiprocessing) NUMA systems increases during the next years. These will increase both the number of cores and the memory complexity, including the various cache levels, which implies memory access latency will depend, increasingly, of the proximity or affinity of the different threads to the memory modules where their data reside. Improving the performance and scalability of parallel codes on multicore architectures may be quite complex. This way, memory management on parallel codes will become more complicated, especially from the point of view of a programmer who wishes to obtain the best performance. Not only this, but the problem worsens in the usual case with different processes in execution simultaneously. Automatically migrating executing threads among the cores and processors, depending on their behaviour, may improve performance of parallel programs. Furthermore, it may allow to simplify their development, since the programmer avoids to explicitly manage locality. Modern microprocessors include registers that give useful information at a low cost, usually known as hardware counters (HCs). HCs are not commonly used due to a lack of tools to easily obtain their data. These HCs, in modern processors, allow to obtain the memory access latency during cache miss resolutions, and even the memory address that leads to the event. This opens the door to the development of new techniques for performance improvement based on this information. A procedure to easily and automatically obtain data about a shared memory parallel code execution on SMP multicore and NUMA systems, to model it using the hardware counters of modern processors, alongside additional information, as the memory access latencies from different threads. This procedure will be used during a parallel program execution, at runtime, to model its performance. This information will be used to improve the efficiency of the execution of said parallel codes automatically and transparently to the user.