
Exploring AI training on the LUMI supercomputer

AI specialists from academia and industry took a deep dive into LUMI's capabilities for advanced AI training during a two-day workshop organized by the LUMI user support team and EuroCC2.
05/06/2024 10:06
Image: AI workshop. Photo: DeiC

Both the LUMI consortium and the EuroCC2 project aim to provide access to world-class HPC facilities and to equip users with the practical skills needed to harness them. On May 29-30, 2024, they joined forces to organize a specialized workshop on AI training using the LUMI supercomputer, titled "Moving your AI training jobs to LUMI: A Hands-On Workshop".

“The aim of EuroCC2 is to develop HPC competences across the sectors of academia, public administration and industry and we are excited to offer this hands-on training opportunity which will enable more users to utilize the power of the LUMI supercomputer,” says Marta Ewa Schulze, project manager at EuroCC Denmark.

Christian Schou Oxvig, LUMI user support team member and senior specialist HPC & AI/ML at DeiC adds:

“With this workshop we aimed to give existing AI researchers an introduction to LUMI for AI, providing them with the knowledge and hands-on experience they need to move their AI training workflows to LUMI. I was impressed with the engagement from the participants and their progress towards running their AI training jobs on LUMI.”

The workshop drew AI specialists from several European countries keen to explore the state of the art in high-performance computing for AI. The two-day event provided a deep dive into LUMI's architecture and capabilities, aiming to equip attendees with practical skills for AI model training on LUMI. It kicked off with an introduction to LUMI and its architecture, highlighting differences from other clusters, such as its use of AMD GPUs and the Slingshot interconnect. Participants were then guided through the LUMI web interface and took part in hands-on sessions with PyTorch in JupyterLab, focusing on the practical limitations and advantages of the interactive interface.

Next, the LUMI batch job system was introduced, and the attendees learned to submit their PyTorch AI training jobs from the command line. They learned to monitor GPU activity and were introduced to the use of Singularity containers for managing their AI software environments on LUMI, including converting conda or pip environments into containers. For this task, the participants used the container tool cotainr, developed by DeiC.

Day two focused on scaling AI training to multiple GPUs, addressing common challenges and solutions for distributed computing. Practical sessions included converting single GPU training jobs to use all GPUs in a node and performing hyper-parameter tuning with Ray using multiple GPUs. The workshop also covered extreme-scale AI with model parallelism and optimizing network performance across multiple nodes.
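To give a flavor of what moving from one GPU to all GPUs in a node involves, the sketch below illustrates the core idea of data-parallel training: each GPU worker (a "rank") processes its own slice of the dataset. It is not taken from the workshop materials. The function name `shard_indices` and the choice of 8 workers are assumptions made for illustration, and the round-robin split is only similar in spirit to what PyTorch's `DistributedSampler` does without shuffling.

```python
# Illustrative sketch only: a round-robin split of sample indices across
# data-parallel workers. The function name and the worker count used in
# the demo are hypothetical and not taken from the workshop materials.
from typing import List


def shard_indices(num_samples: int, world_size: int, rank: int) -> List[int]:
    """Return the sample indices that worker `rank` processes in one epoch."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    # Ceiling division: every worker gets the same number of samples.
    per_worker = -(-num_samples // world_size)
    total = per_worker * world_size
    # Pad by wrapping around to the start of the dataset so that the
    # total number of indices is divisible by the number of workers.
    padded = [i % num_samples for i in range(total)]
    # Worker `rank` takes every `world_size`-th index, starting at `rank`.
    return padded[rank:total:world_size]


if __name__ == "__main__":
    # Assumed setup: 20 training samples shared among 8 GPU workers.
    for rank in range(8):
        print(rank, shard_indices(20, 8, rank))
```

Because every worker receives the same number of samples, all workers finish an epoch in step, which is what allows gradients to be averaged across GPUs after each batch.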

Participants were also introduced to the challenges of handling AI training data on LUMI, with sessions on loading training data from Lustre and LUMI-O. A short introduction to coupling machine learning with “classical” HPC simulations using SmartSim was also given. The event culminated with participants applying their new skills to their own projects, supported by the workshop's instructors.

Among the participants was Eleni Briola, Machine Learning Scientist at the Danish Meteorological Institute (DMI), who signed up for the workshop to familiarize herself with LUMI and to optimize the performance of her machine learning algorithms by running them on this high-performance computing system:

“The workshop was extremely beneficial. The hands-on exercises were particularly valuable, as they provided a deeper understanding of the concepts and practical experience with LUMI.”

With her was her DMI colleague, Research Scientist Irene Livia Kruse, who hopes to use the LUMI supercomputer for AI-based weather research and forecasting in the near future and therefore jumped at the chance to get guidance on the system before starting. Like her colleague, she was very satisfied with the outcome of the workshop:

“The training workshop has been clear and the hands-on tutorials throughout are a brilliant way to figure out in real time if I’ve understood the lectures (and if there’s something I haven’t fully understood, I’ve enjoyed having the opportunity to ask the technical support for help on the spot). Besides learning about the LUMI system and its GPUs I also got acquainted with coding techniques that I can start applying in my day-to-day work”.

 

Want to know more?

All slides, exercises, Q&A and video recordings of the presentations from this workshop can be found in the LUMI training archive: https://lumi-supercomputer.github.io/LUMI-training-materials/ai-20240529/

Stay posted on upcoming events and trainings for LUMI users: Events and Training - LUMI (lumi-supercomputer.eu)