The University of the Witwatersrand High Energy Physics Group is a collaborator in the “A Toroidal LHC Apparatus” (ATLAS) experiment. ATLAS is a particle detector at the Large Hadron Collider (LHC), a particle accelerator at the European Organization for Nuclear Research (CERN) based in Switzerland. The experiment detects the sub-atomic particles produced in high-energy proton-proton collisions.
These collisions are recorded every 50 ns (a rate of 20 MHz), leading to data flows of 10 Pb/s (1 Pb is 10^6 Gb). These data flows are far beyond the capabilities of today’s conventional computing and network technologies. As a result, only a very small fraction of the data (fewer than 1 in 10^6 events) can be kept for further analysis in conventional computing farms. To achieve this rejection, very fast decisions must be made, based on the processing of large data volumes, about which data to keep and which data to transfer onwards. This is accomplished with dedicated devices: high-throughput electronics. The upgraded LHC will provide collisions at rates at least 10 times higher than those of today. This imposes new challenges that require new, more advanced designs. A picture of the Data Center at the LHC is depicted to the right.
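The figures quoted above can be checked with a few lines of integer arithmetic. The sketch below is illustrative only: the variable names are ours, and the retained-rate estimate assumes that kept events have the same average size as rejected ones, which the text does not state.

```python
# Back-of-the-envelope check of the data-rate figures quoted in the text.
# Integer arithmetic is used throughout to avoid floating-point rounding.

BUNCH_SPACING_NS = 50              # time between recorded collisions, in ns
RAW_RATE_BITS_PER_S = 10 * 10**15  # 10 Pb/s of raw detector data
REJECTION_FACTOR = 10**6           # fewer than 1 in 10^6 events is kept

# 1 s = 10^9 ns, so a 50 ns spacing corresponds to 20 MHz.
collision_rate_hz = 10**9 // BUNCH_SPACING_NS

# 1 Pb = 10^15 bits and 1 Gb = 10^9 bits, so 1 Pb = 10^6 Gb.
gigabits_per_petabit = 10**15 // 10**9

# Average amount of data produced per recorded collision.
bits_per_collision = RAW_RATE_BITS_PER_S // collision_rate_hz

# ASSUMPTION: kept events are the same average size as rejected events,
# so the kept data rate scales directly with the kept event fraction.
max_kept_rate_bits_per_s = RAW_RATE_BITS_PER_S // REJECTION_FACTOR

print(f"collision rate      : {collision_rate_hz // 10**6} MHz")
print(f"data per collision  : {bits_per_collision // 10**6} Mb")
print(f"kept rate (at most) : {max_kept_rate_bits_per_s // 10**9} Gb/s")
```

Under that assumption, the trigger has to reduce 10 Pb/s to no more than about 10 Gb/s, which is the scale a conventional computing farm can ingest.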
Processing the vast quantities of data produced by the Square Kilometre Array (SKA) in South Africa will require very high performance central supercomputers capable of 100 petaflops. This is about 50 times more powerful than the current most powerful supercomputer and equivalent to the processing power of about one hundred million PCs. The technological challenges that high-throughput data flows, fast processing and fast decision making impose on data taking at the ATLAS detector today are the same as those facing the SKA.
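The comparison above implies a performance figure for the reference supercomputer and for a single PC. The short sketch below simply divides the numbers quoted in the text; the variable names are ours and the results are implied values, not measurements.

```python
# Implied reference figures behind the SKA comparison, using only the
# numbers quoted in the text (integer arithmetic, FLOPS = operations/s).

SKA_TARGET_FLOPS = 100 * 10**15      # 100 petaflops
FACTOR_OVER_TOP_SUPERCOMPUTER = 50   # "about 50 times more powerful"
EQUIVALENT_PCS = 100 * 10**6         # "about one hundred million PCs"

implied_top_supercomputer_flops = SKA_TARGET_FLOPS // FACTOR_OVER_TOP_SUPERCOMPUTER
implied_pc_flops = SKA_TARGET_FLOPS // EQUIVALENT_PCS

print(f"implied top supercomputer: {implied_top_supercomputer_flops // 10**15} petaflops")
print(f"implied single PC        : {implied_pc_flops // 10**9} gigaflops")
```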
Electrical power is becoming more and more expensive, so the power used by a processing cluster should be minimised. The capital cost of large-scale computing for a big science project is also significant and should likewise be minimised. Processing clusters and supercomputers are typically built from standard server-grade computers based on Intel Xeon or similar processors, which use about 100 W of electrical power per CPU. In some advanced installations, GPU computing is used for special tasks. GPUs use non-standard programming languages and are therefore difficult to program. They also have other drawbacks, such as limited memory and limited communications throughput, and so are not as widespread as conventional CPUs. Depending on the processing tasks required for the data, the CPU's processing power may not be fully utilised, which lowers the processing efficiency relative to the electrical power used.
A solution to the aforementioned issues, namely power usage, cost, computational efficiency and data throughput, can be found in ARM processors. ARM processors are found in almost all mobile phones and tablets. This widespread use is due to the efficiency of the ARM architecture, which combines very low electrical power consumption with substantial computing performance. These massive industries also drive the development of lower-cost, lower-power and faster CPUs. ARM processors typically use on the order of 1 W of power, versus the 100 W of a conventional x86 CPU. Currently an ARM Cortex-A15, one of the fastest ARM processors, is about a quarter as fast as a fast x86 CPU. We therefore propose using ARM processors as the main processing unit of a high-throughput supercomputer for big science projects that is both cost effective and power efficient. There is industry interest in using ARM processors in servers, but to date no attempts have been made to develop a large cluster of ARM processors for use as a high-throughput supercomputer.
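The power argument can be made concrete with the rough figures quoted above (about 1 W per ARM processor, about 100 W per x86 CPU, and a Cortex-A15 running at about a quarter of the speed). The sketch below is a simplified estimate: the function name and its parameters are ours, and it assumes that throughput scales linearly with the number of processors, ignoring memory, interconnect and power-supply overheads.

```python
def compare_efficiency(arm_power_w=1.0, x86_power_w=100.0, arm_relative_speed=0.25):
    """Estimate the performance-per-watt advantage of ARM over x86.

    arm_relative_speed is the speed of one ARM processor as a fraction
    of one x86 CPU. ASSUMPTION: throughput adds up linearly across
    processors and no power is spent outside the processors themselves.
    """
    # Number of ARM processors needed to match one x86 CPU.
    arm_chips_per_x86 = 1.0 / arm_relative_speed
    # Power drawn by that many ARM processors.
    arm_power_for_same_throughput_w = arm_chips_per_x86 * arm_power_w
    # Ratio of (speed / power) for ARM to (speed / power) for x86.
    perf_per_watt_advantage = arm_relative_speed * x86_power_w / arm_power_w
    return {
        "arm_chips_per_x86": arm_chips_per_x86,
        "arm_power_for_same_throughput_w": arm_power_for_same_throughput_w,
        "perf_per_watt_advantage": perf_per_watt_advantage,
    }


if __name__ == "__main__":
    for name, value in compare_efficiency().items():
        print(f"{name}: {value}")
```

With the quoted figures, four ARM processors drawing about 4 W in total would match one 100 W x86 CPU, a factor of 25 in performance per watt. Real clusters will fall short of this because of the overheads the sketch ignores.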
The project being pioneered at Wits already has several members, and the group aims to be at the forefront of high-throughput supercomputing in South Africa. Other universities in South Africa and China, as well as CERN in Switzerland, are also interested and are already contributing to this cutting-edge project. The first step we have taken is an investigation of current technologies, together with literature surveys. The big question we need to answer before moving forward is: which ARM architecture will we choose?
This could be the Cortex-A15, -A7 or -A9, or perhaps a newer, as yet unreleased architecture such as the Cortex-A12 or the Cortex-A50 series, the latter of which supports 64-bit processing. ARM development boards based on several of these architectures have been purchased and are currently being benchmarked comprehensively to help with the decision. Processor performance depends on many aspects, all of which will be tested and compared: integer and floating-point performance, memory bandwidth and error rates, context-switching time (the time it takes the CPU to switch to running a different process) and process forking (where a process makes a copy of itself and both copies then run simultaneously). The processor power consumption will be measured during all tests.
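Two of the quantities listed above, process forking and context switching, can be estimated with very little code. The following is a minimal sketch rather than the benchmark suite used in this project: it assumes a POSIX system such as Linux, the function names are ours, and the pipe round trip is only a proxy for context-switch time, since it also includes system-call overhead and the two processes may run on different cores unless they are pinned to one (for example with taskset).

```python
import os
import time


def time_fork(iterations=200):
    """Return the mean wall-clock seconds for fork + child exit + wait."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            os._exit(0)       # child: exit at once, skipping cleanup handlers
        os.waitpid(pid, 0)    # parent: reap the child before forking again
    return (time.perf_counter() - start) / iterations


def time_pipe_round_trip(iterations=2000):
    """Return the mean seconds for one byte to go parent -> child -> parent."""
    to_child_r, to_child_w = os.pipe()
    to_parent_r, to_parent_w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: echo every byte it receives back to the parent.
        os.close(to_child_w)
        os.close(to_parent_r)
        for _ in range(iterations):
            os.read(to_child_r, 1)
            os.write(to_parent_w, b"x")
        os._exit(0)
    # Parent: send a byte, then block until the echo arrives.
    os.close(to_child_r)
    os.close(to_parent_w)
    start = time.perf_counter()
    for _ in range(iterations):
        os.write(to_child_w, b"x")
        os.read(to_parent_r, 1)
    elapsed = time.perf_counter() - start
    os.waitpid(pid, 0)
    os.close(to_child_w)
    os.close(to_parent_r)
    return elapsed / iterations


if __name__ == "__main__":
    print(f"fork + wait     : {time_fork() * 1e6:.1f} us")
    print(f"pipe round trip : {time_pipe_round_trip() * 1e6:.1f} us")
```

Because the script runs under the Python interpreter, the absolute numbers include interpreter overhead. They are useful mainly for comparing boards that run the same software stack.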
So far we have purchased three different boards: the Cubieboard2 (dual Cortex-A7), the Wandboard (quad Cortex-A9) and the Odroid (quad Cortex-A15 and quad Cortex-A7). These boards have been set up to run Linux. Initially we installed a precompiled version of Fedora on the Cubieboard, which turned out to be incompatible with the Wandboard at the time of writing. This led to the adoption of Ubuntu 12.04 LTS on the Wandboard. We therefore had different operating systems on the boards, which is not ideal when benchmarking. However, we decided this could give us an indication of the operating system performance. Compiling and running a simple Linpack file immediately showed us that the performance was very dependent on the compiler optimisations used, the compiler itself and the fundamentals of each operating system. For example, Ubuntu 12.04 moved over to hardware floating point (the hard-float ABI), which meant the correct flags had to be used when compiling the C file.
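One way to keep track of the compiler dependence described above is to build the same source file at several optimisation levels and time each binary. The sketch below only generates and runs the GCC command lines. The source file name linpack.c, the output names and both function names are hypothetical. -mfloat-abi is GCC's ARM option for selecting the floating-point ABI and must match the ABI of the installed operating system; some toolchains also need a matching -mfpu option, which is left out here.

```python
import subprocess


def gcc_commands(source="linpack.c", float_abi="hard",
                 opt_levels=("-O0", "-O1", "-O2", "-O3")):
    """Build one GCC command line per optimisation level.

    ASSUMPTION: 'source' is a single-file C benchmark that links against
    libm, and 'float_abi' matches the ABI of the operating system.
    """
    commands = []
    for opt in opt_levels:
        output = "linpack" + opt.replace("-", "_")   # e.g. linpack_O2
        commands.append(
            ["gcc", opt, f"-mfloat-abi={float_abi}", source, "-o", output, "-lm"]
        )
    return commands


def build_all(commands):
    """Run each compile command, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands only; call build_all() on an ARM board with GCC.
    for command in gcc_commands():
        print(" ".join(command))
```

Running each resulting binary under the same conditions then isolates the effect of the optimisation level from that of the operating system.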
After initial testing we have decided to compile our own version of Gentoo for all the boards, which eliminates the extra variables that come with the precompiled operating systems. Once this has been achieved, further benchmarks will be run on the boards, giving results that are free of operating system bias. A suitable benchmarking program is currently being investigated. Finding one has turned out to be non-trivial, as much of the available benchmarking software has not been fully ported to ARM architectures. A large amount of time therefore goes into building benchmarks, only to see them fail because a dependency package has not been ported correctly.
Results to follow shortly.