EXAALT Harnesses Kokkos Library to Exceed Exascale Performance Goals

May 3, 2024 — Beyond what the eye can see lies a vast and interconnected network of molecules and their constituent atomic particles—the building blocks of all physical matter. Atomic and molecular interactions play out in choreographed moves that dictate the structure and properties of everything on Earth, from natural materials such as diamond and iron to manufactured materials with engineered attributes—for example, high-heat resistance, ultralow density, or remarkable strength.

MD simulation of helium and neutron damage in tungsten zirconium carbide, modeled using the SNAP machine learning potential in LAMMPS. Helium clusters form at the top surface of the polycrystalline tungsten. The simulation was run on OLCF Frontier using 46,656 AMD MI250X GCDs for approximately one (aggregate) Frontier-day. The scale of this representative microstructure was enabled by SNAP performance improvements. Image credit: Mitch Wood, SNL.

Materials such as these are needed for technology advancement in areas ranging from commercial fusion energy and jet engine manufacturing to memristors, devices whose functionality depends on how their internal structure changes with time.[1]

Molecular dynamics (MD)—an advanced computational method for studying material behavior at the fundamental level—has become a cornerstone of computational science and is a key component of developing materials with enhanced properties. As part of the Department of Energy’s (DOE’s) Exascale Computing Project (ECP), a team of domain scientists, including physicists and chemical engineers, and high-performance computing (HPC) experts from national laboratories and multinational computing technology providers, came together to create the Exascale Atomistics for Accuracy, Length, and Time (EXAALT) application.

EXAALT allows for exploration of molecular interactions over unprecedented combinations of time, length, and accuracy by fully leveraging the power of exascale computing, enabling researchers to glean important insight into the longer-term, engineering-scale behavior of a material as a system evolves.

The combined power of the team’s core competencies led to the development of advanced algorithms, the implementation of key software technologies, and the seamless integration of software and hardware. Together, these advances enabled the application to achieve a nearly 400x performance speedup on just 75 percent of the Oak Ridge Leadership Computing Facility’s Frontier exascale system, compared with a baseline run on the Mira system at the Argonne Leadership Computing Facility, significantly exceeding the ECP goal of 50x. ECP provided a recipe for this success: sustained funding, incentivized cross-discipline collaboration, and a focus on developing the capabilities needed to bring exascale computing to fruition.

Danny Perez, a physicist within the Theoretical Division at Los Alamos National Laboratory and the project’s principal investigator, says, “Typically, writing code is secondary because projects focus on delivering the science. But with ECP, the focus was always on developing the capability.” Through ECP, over the last seven years the team had the opportunity to dig deep into the code, optimize algorithms to make them as fast as possible, and in doing so significantly improve overall performance. Perez continues, “EXAALT came together the way that it did because we had strong relationships between scientists, applications and software developers, and hardware experts and a sustained effort over a long period, which allowed us to deliver something truly exceptional.”

Elegant Code, Complicated Physics

Foundational to EXAALT are three cutting-edge MD computer codes: LAMMPS (Large-Scale Atomic/Molecular Massively Parallel Simulator); ParSplice (Parallel Trajectory Splicing); and LATTE (Los Alamos Transferable Tight-Binding for Energetics). Together, these codes combine classical MD approaches with high-performance implementations of quantum MD to evaluate empirical interatomic potentials and solve the equations of motion for atoms in a system as a function of their spatial relationships to one another.
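
At its core, classical MD repeatedly evaluates interatomic forces and advances Newton’s equations of motion for every atom. A generic velocity-Verlet step, shown below in textbook form rather than as the internals of any of these codes (the `Atoms` struct and `compute_forces` callback are hypothetical names for illustration), gives a sense of the computational pattern:

```cpp
#include <cstddef>
#include <vector>

// Generic velocity-Verlet integration step (textbook form, not the internals
// of LAMMPS, ParSplice, or LATTE). Atoms and ForceFn are hypothetical names
// for illustration; forces would come from an interatomic potential such as
// SNAP.
struct Atoms {
  std::vector<double> x, v, f;  // flattened per-atom, per-dimension arrays
  double mass = 1.0;
};

template <typename ForceFn>
void velocity_verlet_step(Atoms& a, double dt, ForceFn compute_forces) {
  const double half = 0.5 * dt / a.mass;
  for (std::size_t k = 0; k < a.x.size(); ++k) {
    a.v[k] += half * a.f[k];  // half-kick with current forces
    a.x[k] += dt * a.v[k];    // drift positions forward
  }
  compute_forces(a);          // recompute forces at the new positions
  for (std::size_t k = 0; k < a.v.size(); ++k) {
    a.v[k] += half * a.f[k];  // second half-kick with the new forces
  }
}
```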

An interatomic potential known as SNAP (spectral neighbor analysis potential) and its successful implementation within LAMMPS was a key aspect of the EXAALT work. So, too, was integrating the ECP-enhanced Kokkos performance portability library, which provides key abstractions for both compute and memory hierarchy[2] so that the code can run efficiently on the GPU-based architectures of DOE’s exascale machines.
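
To give a flavor of that portability model, here is a minimal illustrative sketch (not EXAALT source code): a Kokkos parallel loop is written once and compiled to a CUDA, HIP, SYCL, or OpenMP backend depending on the target machine.

```cpp
#include <Kokkos_Core.hpp>

// Minimal sketch of the Kokkos single-source model (illustrative, not EXAALT
// code): the same parallel_for compiles to a CUDA, HIP, SYCL, or OpenMP
// backend, and Kokkos::View places the data in that backend's memory space.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n_atoms = 1000;                         // hypothetical size
    Kokkos::View<double*> energy("energy", n_atoms);  // device-resident array

    // One work item per atom; the execution space is fixed at compile time.
    Kokkos::parallel_for("per_atom_energy", n_atoms,
                         KOKKOS_LAMBDA(const int i) {
      energy(i) = 0.5 * i;  // placeholder per-atom work
    });
    Kokkos::fence();  // wait for the device kernel to finish
  }
  Kokkos::finalize();
}
```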

The SNAP algorithm is an elegantly complex piece of code. Developed by Sandia National Laboratories’ (SNL) Aidan Thompson, a LAMMPS and MD expert, SNAP uses machine learning techniques, trained on high-accuracy quantum calculations, to accurately predict the underlying physics of material behavior. It requires deeply nested loops with an irregular loop structure to express the energy of each atom as a linear function of selected bispectrum components of the neighbor atoms; an initial step of this calculation maps the positions of each atom’s neighbors onto a unit sphere in four dimensions.[3]
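
That four-dimensional mapping can be sketched compactly. The version below uses a common parameterization in which a neighbor’s distance sets a polar angle on the 3-sphere; the function name and angle convention (`theta0_max`) are assumptions for illustration, and LAMMPS’s exact parameterization may differ.

```cpp
#include <array>
#include <cmath>

// Illustrative mapping of a neighbor displacement (dx, dy, dz) within cutoff
// r_cut onto the unit sphere in four dimensions, as used by SNAP-style
// bispectrum descriptors. The theta0_max convention is an assumption for
// illustration; LAMMPS's exact parameterization may differ.
std::array<double, 4> map_to_4d_sphere(double dx, double dy, double dz,
                                       double r_cut) {
  const double theta0_max = 0.99 * 3.141592653589793;       // illustrative
  const double r = std::sqrt(dx * dx + dy * dy + dz * dz);  // assumes r > 0
  // The radial distance becomes a polar angle: nearby atoms map close to the
  // pole of the 3-sphere, atoms near the cutoff map far from it.
  const double theta0 = theta0_max * (r / r_cut);
  const double s = std::sin(theta0) / r;
  return {std::cos(theta0), s * dx, s * dy, s * dz};  // unit 4-vector
}
```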

While the code worked fairly well on CPU systems, its complexity made mapping the underlying physics onto new GPU hardware a tricky business. Stan Moore, a LAMMPS–Kokkos lead developer at SNL, says, “Early on in the ECP project, we had what we thought was a solid Kokkos version of the SNAP potential in LAMMPS, but its fraction of peak performance on the GPU-based machines was trending downward as new systems came online, and we really didn’t know how to make it better.”

ECP’s collaboration-centric approach offered opportunities to grow the team and bring in the expertise needed to solve the problem. Through facility-supported hackathons, such as those provided by the Oak Ridge Leadership Computing Facility (OLCF) and the National Energy Research Scientific Computing Center (NERSC), team members, including application engineers from NERSC, NVIDIA, and AMD, came on board to sleuth out the source of the problem and reverse the troubling trend. Ultimately, the realized performance improvement was the work of many hands, who both completely redesigned the SNAP algorithm and improved its GPU compatibility.

Performance of the SNAP EXAALT benchmark on NVIDIA and AMD GPU hardware across different code versions, showing a 24x speedup for an MI250X GCD compared to the same hardware running the baseline version of the code. Image credit: Stan Moore, SNL. Reprinted from Journal of Nuclear Materials, Vol. 594, Lasa, A. et al., “Development of multi-scale computational frameworks to solve fusion materials science challenges,” 155011, 2024, with permission from Elsevier.

Hackathons Round Up Experts

Rahul Gayatri, an application performance specialist at NERSC, was the first to join the team through the NERSC Science Acceleration Program (NESAP)—a collaborative effort in which NERSC partners with code teams, vendors, and other software developers to prepare for advanced architectures and new HPC systems.[4]

At hackathon events, developers typically work on smaller parts of the whole code, in particular the part that is causing an issue. Gayatri recalls, “The team was focused on improving the single-node performance of the SNAP module on NVIDIA GPUs. I was given a proxy app called TestSNAP and was tasked with trying to make it faster. We ended up completely rewriting the code.” These optimizations for TestSNAP were later ported to LAMMPS.

The team tried several optimization strategies, but one called kernel fission proved remarkably beneficial. Rather than having a single large kernel handle all the calculations for an atom, the work was broken up into multiple smaller kernels, each concentrating on completing one stage of the algorithm for all the atoms.[5] Gayatri says, “This allowed us to optimize individual kernels because each of them might have different needs in how they are scheduled on the machines and also allowed us to better utilize the resources of a GPU.”
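
A hedged sketch of kernel fission, with placeholder math far simpler than SNAP’s real stages, shows the shape of the change: one fused per-atom kernel becomes several stage-wise kernels that each sweep all atoms, with intermediates handed off through device memory.

```cpp
#include <Kokkos_Core.hpp>

// Illustrative kernel fission (placeholder math, not the real SNAP stages).
// Fused form: one kernel runs every stage for each atom.
void fused(Kokkos::View<double*> pos, Kokkos::View<double*> force) {
  Kokkos::parallel_for("fused", pos.extent(0), KOKKOS_LAMBDA(const int i) {
    const double a = pos(i) * 2.0;    // stage 1
    const double b = Kokkos::sin(a);  // stage 2
    force(i) = b * b;                 // stage 3
  });
}

// Fissioned form: each stage becomes its own kernel sweeping all atoms, so
// every kernel can get its own launch configuration and occupancy trade-off.
// The tmp view carries intermediates between kernels through device memory.
void fissioned(Kokkos::View<double*> pos, Kokkos::View<double*> force,
               Kokkos::View<double*> tmp) {
  const int n = static_cast<int>(pos.extent(0));
  Kokkos::parallel_for("stage1", n, KOKKOS_LAMBDA(const int i) {
    tmp(i) = pos(i) * 2.0;
  });
  Kokkos::parallel_for("stage2", n, KOKKOS_LAMBDA(const int i) {
    tmp(i) = Kokkos::sin(tmp(i));
  });
  Kokkos::parallel_for("stage3", n, KOKKOS_LAMBDA(const int i) {
    force(i) = tmp(i) * tmp(i);
  });
}
```

The `tmp` view that shuttles intermediates between stages is exactly the kind of extra storage the next paragraph describes.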

However, the optimizations were memory-intensive, requiring additional storage to pass atom-specific intermediate information between kernels. Necessity being the mother of invention, and with ECP providing a solid source of funding and long-term stability, the team had the time and resources to reduce the memory footprint of the newly optimized code through an algorithmic improvement known as an adjoint refactor.
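
A toy illustration of that trade-off, using a made-up energy model rather than SNAP’s actual mathematics (the functions `phi` and the quadratic energy are invented for this sketch): a naive force evaluation stores a derivative for every atom-neighbor pair, while an adjoint-style refactor stores only a small per-atom adjoint and recomputes pair terms on the fly.

```cpp
#include <cmath>
#include <vector>

// Toy adjoint-refactor sketch (made-up energy model, not SNAP's math).
// Model: u_i = sum over neighbors j of phi(r_ij); E = sum_i u_i^2, so the
// force term for pair (i,j) is dE/dr_ij = 2 * u_i * phi'(r_ij).
// A naive version stores phi'(r_ij) for every pair (O(pairs) memory); the
// adjoint version stores only y_i = dE/du_i = 2 * u_i (O(atoms) memory) and
// recomputes the pair terms on the fly in a second pass.
double phi(double r) { return std::exp(-r); }
double dphi(double r) { return -std::exp(-r); }

void adjoint_forces(const std::vector<std::vector<double>>& r,  // r[i][jj]
                    std::vector<std::vector<double>>& f) {      // f[i][jj]
  const std::size_t n = r.size();
  std::vector<double> y(n);                // per-atom adjoint, small
  for (std::size_t i = 0; i < n; ++i) {    // forward pass: u_i, then y_i
    double u = 0.0;
    for (double rij : r[i]) u += phi(rij);
    y[i] = 2.0 * u;                        // dE/du_i
  }
  for (std::size_t i = 0; i < n; ++i)      // adjoint pass: contract on the fly
    for (std::size_t jj = 0; jj < r[i].size(); ++jj)
      f[i][jj] = y[i] * dphi(r[i][jj]);    // no stored per-pair derivatives
}
```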

A subsequent hackathon sponsored by NVIDIA brought the company’s senior compute developer technology engineer Evan Weinberg to the team. Weinberg says, “The team showed me this test code and gave me this 300-page tome on angular momentum and quantum mechanics and had me jump right in.” Weinberg’s educational and professional experience—a blend of physics and software engineering—provided a unique skill set, one that allowed him to understand both the underlying science and the computing structures. He says, “I had hardware and broad optimization experience, but I could also understand the fundamental equations, which allowed me to develop nontrivial optimizations that boosted performance.”

For example, Weinberg identified that a previously unused equation in the SNAP calculation could help simultaneously preserve good data locality and good data access patterns. He says, “The SNAP calculation had used one equation, and I determined that by using this second one, we could traverse what is essentially a graph as it computes.” This revelation provided a breakthrough for using shared memory on GPUs to minimize the number of reads and writes from global memory. Gayatri says, “By using this bispectrum symmetry available in the SNAP algorithm, we could save the atom-specific intermediate results in the shared memory, which allowed us to spend fewer cycles accessing global memory.”[6]
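
In Kokkos, GPU shared memory is exposed as team-level scratch memory. The sketch below shows the general pattern, with illustrative sizes, math, and function names rather than the actual LAMMPS SNAP kernels: each team stages one atom’s intermediate results in scratch and reuses them for many reads instead of returning to global memory.

```cpp
#include <Kokkos_Core.hpp>

// Hedged sketch of staging per-atom intermediates in GPU shared memory via
// Kokkos team scratch (illustrative math and sizes, not the actual LAMMPS
// SNAP kernels). One team per atom; threads reuse the scratch array rather
// than re-reading global memory.
void scratch_example(Kokkos::View<double**> intermediates,  // [atoms][terms]
                     Kokkos::View<double*> result) {
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;
  using scratch_view =
      Kokkos::View<double*,
                   Kokkos::DefaultExecutionSpace::scratch_memory_space,
                   Kokkos::MemoryUnmanaged>;

  const int n_atoms = static_cast<int>(intermediates.extent(0));
  const int n_terms = static_cast<int>(intermediates.extent(1));
  team_policy policy =
      team_policy(n_atoms, Kokkos::AUTO)
          .set_scratch_size(0, Kokkos::PerTeam(scratch_view::shmem_size(n_terms)));

  Kokkos::parallel_for("snap_like", policy,
                       KOKKOS_LAMBDA(const member_type& team) {
    const int atom = team.league_rank();
    scratch_view cache(team.team_scratch(0), n_terms);  // shared memory

    // Load this atom's intermediates into shared memory once...
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n_terms),
                         [&](const int t) { cache(t) = intermediates(atom, t); });
    team.team_barrier();

    // ...then reuse them for many reads without touching global memory.
    double sum = 0.0;
    Kokkos::parallel_reduce(
        Kokkos::TeamThreadRange(team, n_terms),
        [&](const int t, double& acc) { acc += cache(t) * cache(t); }, sum);
    if (team.team_rank() == 0) result(atom) = sum;
  });
}
```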

ECP’s governance structure provided the breathing room the team needed to try something new, fail, and then try again, to ultimately make inroads in overall capability. “ECP emphasized that investment in software and algorithm advancement, in addition to hardware, makes a huge difference in achieving significant jumps in performance,” says Weinberg. “It isn’t just the same code that exceeded the ECP goal; it is a combination of new hardware with changes to software and completely new algorithms, and these types of advances aren’t made in a month.” Indeed, the initial work to optimize the SNAP algorithm—started during the first hackathon—took about a year and a half to complete. Weinberg also recalls working on an optimization idea for about four weeks that, when completed, led to a performance slowdown. “I was so frustrated,” he says. “I went for a run, figured something out, came back, and rewrote the code from scratch. If I’d had only that hypothetical one month, my contribution would have been a bust.”

A Two-Way Street

In the end, the tight connection between the domain scientists, developers, and vendors was essential to EXAALT’s overall success and exemplifies the shared fate aspect of ECP work—all sides gain wisdom, experience, and insight that can propel innovation throughout the network. For example, Kokkos, one of many ECP-enhanced pieces of open-source software, is used in 15 ECP applications and has achieved extensive reach within the broader HPC community. “The Kokkos team helped us resolve issues all along the way, but we also provided feedback to them that helped them improve Kokkos for other applications,” says Moore. “We also had help from experts at Oak Ridge and Argonne who helped optimize SNAP in LAMMPS and write the backends in Kokkos for the different operating systems. ECP was a golden age of LAMMPS development for GPUs.”

Although the team’s initial focus was on optimizing the Kokkos SNAP on NVIDIA GPUs, it expanded to include AMD and Intel collaborators and other experts at Oak Ridge and Argonne national laboratories. Weinberg shared that these close collaborations not only helped the team but also helped to grow the skills of the individual contributors. “ECP allowed me to work directly with the domain scientists and application experts that were developing and running the algorithms. The fact that I could reach out to other members of the team who were doing the most tightly integrated work—and who could throw the metaphorical textbook at me—helped me to do my best work, which allowed me to provide real value to the rest of the team.”

This interconnectedness also fosters cross-pollination of ideas between different project groups. Several EXAALT developers also worked on other ECP applications, and what was learned in one space provided insight in others. Gayatri says, “Developing these relationships made it easier for us to collaborate on other projects to their benefit.”

MD in a SNAP—Frontier Success and More

Along the development path, the team achieved several milestones that marked significant strides in application and software development. “Meeting the EXAALT KPP-1 (Key Performance Parameter) target was our biggest success,” says Moore. “We took all the optimizations generated through ECP and ECP’s CoPA (Co-Design Center for Particle Applications), NERSC, and NVIDIA, and built LAMMPS as a library for Danny to run for EXAALT. He ran the application on 7,000 nodes of Frontier and was able to not only meet but significantly exceed the 50x speedup needed for the benchmark.”

Additionally, in 2021, the team was recognized as one of six finalists for the coveted Association for Computing Machinery (ACM) Gordon Bell Prize. Moore says, “All the updates developed through ECP that went into EXAALT were also applied to the Gordon Bell submission.” Although future fusion and fission reactor processes were the motivating problem for EXAALT’s ECP performance metrics, the team applied a different use case for the submission and achieved unprecedented scaling and unmatched real-world performance of SNAP MD—simulating 1 billion carbon atoms for 1 nanosecond of physical time using ORNL’s Summit supercomputer.[7]

Moving forward, EXAALT is poised to become an indispensable tool for understanding material behavior for a wealth of applications, and this success would not have been possible without ECP and the network of experts it brought together. Moore says, “ECP broke down the barriers between the laboratories and vendors and allowed us to work together toward a common goal to achieve something much bigger than our small individual project.” Perez is quick to point out that while it may look serendipitous that the team came together the way it did, it really happened because ECP provided the structure. “The special relationships forged through ECP and NERSC, under NESAP, and the ECP-supported hackathons with vendors really made these connections and our success possible. When everyone has the same priorities, the rate at which you can make progress is very different. All the ways to measure success are compatible with each other and that makes it much easier for teams to work together. This ‘all hands on deck’ approach pushed by ECP, where everyone involved is pushing toward one goal, has really proved to be something worth doing.”

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.

Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001).

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.


Source: Caryn Meissner, ECP
