Addressing the challenges of new CPU architectures
The Concurrency Forum at CERN
The successful processing of more than 15PB of data recorded by the LHC experiments in 2012 led to extraordinary discoveries in the field of HEP. The CPU hardware that will be at our disposal to treat the total projected data volume after Long Shutdown 1 will not feature higher clock frequencies but rather many cores and massively parallel coprocessors, such as general-purpose graphics processing units (GPGPUs). To prepare for this change in technology, the SFT group is working on a radical change in the software used by the LHC experiments.
Recent developments in CPU architecture signal a new era that will affect every facet of modern technology, from electronic devices to cell phones. Software engineering is undergoing a paradigm shift in order to accommodate new CPU architectures with many cores, in which concurrency will play a more fundamental role in programming languages and libraries. Newer generations of computers have started to exploit parallelism at an unprecedented scale.
Multiple challenges arise for HEP-community software from multi-core processors.
Today, processors with 2, 4, 6 or more cores (multi-core) are in widespread use, and this trend is expected to continue towards chips with hundreds of cores. To take advantage of these capabilities, new algorithms have to be designed and implemented. For the moment, the possible gains from the new computer architectures are limited by the fraction of the software that can be parallelized to run on multiple cores simultaneously, whilst at the same time exploiting the opportunities offered by microscopic parallelism within the CPU. The required re-engineering of existing application code will likely be dramatic and will mark another phase change in the software industry.
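A one-line illustration of this limit is Amdahl's law (a textbook formula, not specific to any HEP application): if a fraction p of the work can be parallelized over n cores, the overall speedup is bounded as

```latex
% Amdahl's law: maximum speedup S with n cores when a fraction p of the work parallelizes
S(n) = \frac{1}{(1 - p) + p/n}, \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}
```

so with, say, 90% of the code parallelized the speedup can never exceed a factor of ten, however many cores are available.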
HEP software will also need to accommodate the new hardware architectures by introducing parallelism wherever possible in order to make efficient use of all the available cores. This implies developing new models and specialized software frameworks that assist scientists in writing algorithms and applications able to exploit this parallelism.
Traditionally, HEP experiments exploit multiple cores by having each core process a different HEP 'event' in parallel; this is an example of a so-called embarrassingly parallel problem and results in speedup factors that scale with the number of cores. However, as already mentioned, a trend towards many (hundreds of) cores on a single socket is expected in the near future, whilst technical limitations on connecting them to shared memory could reduce the amount of memory that can be accessed efficiently by a single core. This is a major problem for the experiments running at the LHC, since planned increases in luminosity are expected to result in more complex events with higher memory requirements. One is led to conclude that in the future we will need to use multiple cores efficiently to process a single event, i.e. move towards finer-grain parallelism in our data processing applications.
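To make the idea of finer-grain parallelism concrete, the sketch below parallelizes the work inside a single event rather than across events, using Intel TBB; the Event, Hit and calibrateHit names are invented for illustration and are not taken from any experiment framework.

```cpp
// Intra-event (finer-grain) parallelism: several cores cooperate on one event,
// so only one event (not one per core) needs to be resident in memory.
// All names here are hypothetical placeholders.
#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>

struct Hit   { double raw; double energy; };
struct Event { std::vector<Hit> hits; };

// Independent per-hit work: safe to run concurrently because each call
// touches only its own Hit.
void calibrateHit(Hit& h) { h.energy = 0.98 * h.raw + 0.01; }

void calibrateEvent(Event& evt) {
    tbb::parallel_for(std::size_t(0), evt.hits.size(),
                      [&](std::size_t i) { calibrateHit(evt.hits[i]); });
}

int main() {
    Event evt;
    evt.hits.assign(10000, Hit{1.0, 0.0});   // toy data
    calibrateEvent(evt);
    return 0;
}
```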
Parallelism needs to be implemented at all levels, from large data production jobs, which can run at several computer centres at the same time, down to micro-parallelism inside the CPU chip. The main goal of this exercise is to reduce the resources used by a single core, in terms of I/O bandwidth, memory requirements and connections to open files, and to achieve better overall efficiency. Moreover, efficient exploitation of vector registers (SIMD), instruction pipelining, multiple instructions per cycle and hyper-threading will be key to reaping the benefits of the new architectures.
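As a small, generic illustration of the SIMD point (not code taken from any LHC framework), the loop below works on contiguous arrays with independent iterations, which is exactly the pattern a compiler can map onto vector registers (e.g. auto-vectorized by g++ or clang++ at -O3):

```cpp
// Hypothetical SIMD-friendly inner loop: contiguous data and independent
// iterations allow the compiler to emit vector instructions.
#include <cstddef>
#include <vector>

void transverseEnergy(const std::vector<double>& energy,
                      const std::vector<double>& sinTheta,
                      std::vector<double>& et) {
    const std::size_t n = energy.size();
    et.resize(n);
    for (std::size_t i = 0; i < n; ++i)
        et[i] = energy[i] * sinTheta[i];   // E_T = E * sin(theta); each iteration is independent
}
```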
Our ability to schedule algorithms to run concurrently depends on the availability of the input data each algorithm requires in order to function correctly. Before moving forward with further parallelization at all levels, from the event level down to the algorithm and sub-algorithm level, one therefore has to look carefully through all the algorithms and map out the data dependencies between them.
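As a toy illustration of dependency-driven scheduling (not the design of any existing framework service; the algorithm names are invented), the sketch below expresses three reconstruction steps as a task graph with Intel TBB: the two steps with no mutual data dependency are free to run concurrently, while the third waits for its input to be produced.

```cpp
// Toy dependency graph: 'tracking' and 'calorimetry' have no mutual data
// dependency and may run concurrently; 'vertexing' consumes the tracking
// output and therefore runs only after it. Names are illustrative.
#include <iostream>
#include <tbb/flow_graph.h>

int main() {
    using namespace tbb::flow;
    graph g;

    continue_node<continue_msg> tracking(g, [](const continue_msg&) {
        std::cout << "tracking\n";    return continue_msg();
    });
    continue_node<continue_msg> calorimetry(g, [](const continue_msg&) {
        std::cout << "calorimetry\n"; return continue_msg();
    });
    continue_node<continue_msg> vertexing(g, [](const continue_msg&) {
        std::cout << "vertexing\n";   return continue_msg();
    });

    make_edge(tracking, vertexing);       // vertexing depends on tracking only

    tracking.try_put(continue_msg());     // start the two independent roots;
    calorimetry.try_put(continue_msg());  // the runtime schedules whatever is ready
    g.wait_for_all();
    return 0;
}
```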
Perhaps one of the most urgent tasks is to address the specific algorithms that take a large fraction of the total processing time. This can be done by making an algorithm fully re-entrant (not easy), so that it can be executed by several threads at the same time, by instantiating several copies of the algorithm, each running concurrently on a different event, or by parallelizing the algorithm internally. A new approach to processing many events in parallel will also be needed in order to avoid the long tails in the distribution of event processing times.
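A minimal sketch of what 'fully re-entrant' means in practice is given below; the interface and class names are invented for illustration. The key property is that all per-event state lives in local variables and in the event being processed, never in data members, so a single instance can safely serve many threads; a stateful algorithm would instead need one copy per thread or per concurrent event.

```cpp
// Re-entrant algorithm sketch: configuration is read-only, per-event state is
// local, so one instance can be shared by many threads. IAlgorithm, Event and
// ClusterFinder are hypothetical names, not from a real framework.
#include <vector>

struct Event {
    std::vector<double> cellEnergies;   // input
    std::vector<double> clusters;       // output, owned by this event only
};

class IAlgorithm {
public:
    virtual ~IAlgorithm() = default;
    virtual void execute(Event& evt) const = 0;   // 'const': no member state is modified
};

class ClusterFinder : public IAlgorithm {
public:
    explicit ClusterFinder(double threshold) : m_threshold(threshold) {}
    void execute(Event& evt) const override {
        for (double e : evt.cellEnergies)          // touches only the event passed in
            if (e > m_threshold) evt.clusters.push_back(e);
    }
private:
    const double m_threshold;                      // fixed configuration, read-only
};
```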
Running experiments have a significant investment in the frameworks of their existing data processing applications (HLT, reconstruction, simulation and analysis), and it is realistic to expect that collaborations will favour incorporating new vectorization and parallelization techniques in an adiabatic manner, whilst preserving the quality of the physics output. In some cases the introduction of these new techniques into the existing code base will have a rather confined, local impact; in others it will be more global. For example, the vectorization of some inner-loop calculation (a 'hot spot') may affect only one concrete algorithm and can be done independently of many other optimizations applicable to the code. On the contrary, running on many concurrent events with the aim of reducing the memory footprint of a multi-threaded application may require rather deep changes in the framework and services used to build the complete application, and therefore has much larger consequences. The goal should therefore be to develop a new set of deliverables, in terms of models and framework services, that can be integrated into the various existing data processing frameworks and that assist physicists in developing their software algorithms and applications.
CERN grid computer farm, where the power of thousands of PCs is combined to crunch data from the LHC. Photograph: David Parker/Science Photo Library
One of the main tasks will be to study and evaluate the various hardware solutions currently available in the context of each of the computational problems we have to deal with. Each parallel hardware option comes with its own constraints and limitations (e.g. memory organization, the required programming language or programming model, etc.), and these all have to be understood in order to take the best decisions and to ensure scalability in the longer term. A major effort is under way to re-engineer both the algorithms and the data structures present in our data processing frameworks and in the reconstruction and simulation toolkits, which together account for tens of millions of lines of code.
Concurrency Forum and prototyping work
The expression of parallelism in HEP software represents a paradigm shift for our field, presenting us with a challenge comparable to the major change introduced during the 1990s, when we evolved from structured towards object-oriented programming. The HEP community has joined forces to address these challenges, and the Software Development for Experiments group (PH-SFT) is playing a role in making ongoing activities visible to the whole community through the creation of the Forum on Concurrent Programming and Frameworks (http://concurrency.web.cern.ch).
The forum was set up at the start of 2012 to share knowledge among interested parties (major laboratories and experiment collaborations) that should work together to develop 'demonstrators' and agree minimally on technology so that they can share code and compare results. The goals are to identify the best tools, technologies (libraries) and models for use in HEP applications, covering all data processing domains: simulation, reconstruction and analysis. A regular series of bi-weekly meetings has been running since January 2012, with a large and growing involvement of the whole community.
With the different demonstrators being developed, we expect to cover all of the technology options that are generally available and in which the community has expressed interest. These studies include the following:
- investigation of new programming languages to harness concurrency
- design and implementation of a new whiteboard service able to schedule algorithms based on their data dependencies
- development of a high performance simulation prototype
- use of task-oriented threading libraries, such as Intel TBB and libdispatch
- exploiting GPGPUs for specific HEP tasks, including triggering and simulation, using specialized languages such as OpenCL and CUDA
- use of virtualisation as a more transparent approach to improving performance on many-core platforms
- use of techniques traditionally employed in high-performance computing, such as OpenMP and MPI
- study of data locality and of the time-ordering of algorithm execution (in particular in Geant4), to measure the effect on the use of CPU caches as a way to improve performance
- multiprocessing approaches involving the sharing of C++ objects amongst processes
- parallelization of likelihood functions for data analysis and of statistical analysis functions found in the ROOT libraries (see the sketch after this list)
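To illustrate the last item concretely, the sketch below parallelizes a negative log-likelihood with a task library (Intel TBB) by partitioning the sum over data points across cores. The Gaussian model and data layout are invented for this example; it is not the actual ROOT or RooFit implementation.

```cpp
// Sketch of a parallel negative log-likelihood for a Gaussian model:
// tbb::parallel_reduce splits the sum over data points into chunks that are
// evaluated concurrently and then combined. Illustrative only.
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

double negLogLikelihood(const std::vector<double>& data, double mu, double sigma) {
    const double norm = std::log(sigma) + 0.5 * std::log(2.0 * 3.141592653589793);
    return tbb::parallel_reduce(
        tbb::blocked_range<std::size_t>(0, data.size()),
        0.0,                                               // identity of the sum
        [&](const tbb::blocked_range<std::size_t>& r, double local) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                const double z = (data[i] - mu) / sigma;
                local += 0.5 * z * z + norm;               // -log of a Gaussian density
            }
            return local;
        },
        std::plus<double>());                              // combine partial sums
}
```

The same pattern, chunking a loop over data points or events and reducing the partial results, applies to several of the other items above.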
The idea is to be able to compare the performance and scalability of each solution under similar conditions and with similar simplifications. This is achieved by defining a number of benchmarks to be used by the demonstrators. Towards the end of last year we had collected sufficient information to start making decisions on which technologies and concurrency models we, as a community, should adopt. We are now entering a serious software development phase, with the aim of having a first complete running experiment application by the end of 2013.