Helmholtz AI computing infrastructure put to the test

As in the previous year, Helmholtz AI researchers joined a benchmarking study to analyze the center’s position in the computing infrastructure market — and the results are finally out!

The rapid development of AI methods and tools can make it difficult to keep up with all available computing options, and even more difficult to identify the best one for a given task. This is why benchmark results are key to comparing and choosing the best available AI hardware. Benchmarking platforms give an overall view of relevant aspects such as performance, environmental impact, efficiency, and training speed.

That’s why, as in the previous year, Helmholtz AI members from the Steinbuch Centre for Computing (SCC) at Karlsruhe Institute of Technology (KIT) and the Jülich Supercomputing Centre (JSC) at Forschungszentrum Jülich have jointly submitted their results to the MLPerf® HPC benchmarking suite. And we are proud to announce that our infrastructures run on some of the best-performing AI chips!

The initiative to submit was jointly coordinated by Helmholtz AI members Daniel Coquelin, Katharina Flügel, and Markus Götz from SCC and Jan Ebert, Chelsea John, and Stefan Kesselheim from JSC. The results cover our two units in those centers: the HoreKa supercomputer at SCC and the JUWELS Booster at JSC. Both run on NVIDIA A100 GPUs, which are among the best-performing chips according to the benchmark. The JUWELS Booster in particular used up to 3,072 NVIDIA A100 GPUs during these measurements.

The MLPerf® HPC benchmarking suite is a great opportunity to fine-tune both code-based and system-based optimization methods and tools. On the CosmoFlow benchmark (Physical Quantity Estimation From Cosmological Image Data), for example, we were able to improve our submission by over 300% compared to last year! While fine-tuning our I/O operations, we discovered ways for our filesystems to deliver read and write performance more rapidly and reliably. Thanks to this, in the recent CosmoFlow benchmark results, featured by IEEE Spectrum [1] and HPCWire [2], HoreKa achieved the runner-up position behind NVIDIA's Selene system and the top spot among academic and research institutions in terms of fastest training time, outcompeting even larger systems like RIKEN's Fugaku.

As the impacts of climate change become more apparent, it is also imperative to be more conscious about our environmental footprint, especially with respect to energy consumption. To that end, the system administrators at HoreKa have enabled the use of the Lenovo XClarity Controller to measure the energy consumption of the compute nodes*. For the submission runs on HoreKa, 1,127.8 kWh were used. This is slightly more than what it takes to drive an average electric car from Portugal to Finland.
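To make the electric-car comparison easier to check, here is a minimal sketch that converts the reported 1,127.8 kWh into a driving distance. The consumption figures (15, 20, and 25 kWh per 100 km) are assumptions chosen for illustration and do not come from the measurements above. The function name `driving_range_km` is likewise hypothetical. The length of the route from Portugal to Finland is not given here, so the sketch does not model it.

```python
# Energy reported for the HoreKa submission runs (taken from the text above).
SUBMISSION_ENERGY_KWH = 1127.8


def driving_range_km(energy_kwh: float, consumption_kwh_per_100km: float) -> float:
    """Return the distance in km an electric car could cover with energy_kwh."""
    if consumption_kwh_per_100km <= 0:
        raise ValueError("consumption must be positive")
    return energy_kwh / consumption_kwh_per_100km * 100.0


if __name__ == "__main__":
    # ASSUMPTION: illustrative consumption values, not taken from the article.
    for consumption in (15.0, 20.0, 25.0):
        distance = driving_range_km(SUBMISSION_ENERGY_KWH, consumption)
        print(f"{consumption:.0f} kWh/100 km -> {distance:,.0f} km")
```

The result depends strongly on the assumed consumption, so the comparison is best read as an order-of-magnitude illustration rather than a precise equivalence.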

The MLPerf® HPC benchmarking suite is vital to determining the utility of our HPC machines for modern workflows. We look forward to submitting again next year!

*This measurement does not include all parts of the system and is not an official MLCommons methodology; however, it provides a lower bound on the energy consumed by our system. As each system is different, these results cannot be directly transferred to any other submission.