EXAALT Harnesses Kokkos Library to Exceed Exascale Performance Goals

May 3, 2024 — Beyond what the eye can see lies a vast and interconnected network of molecules and their constituent atomic particles—the building blocks of all physical matter. Atomic and molecular interactions play out in choreographed moves that dictate the structure and properties of everything on Earth, from natural materials such as diamond and iron to manufactured materials with engineered attributes—for example, high-heat resistance, ultralow density, or remarkable strength.

MD simulation of helium and neutron damage in tungsten zirconium carbide, modeled using the SNAP machine learning potential in LAMMPS. Helium clusters form at the top surface of the polycrystalline tungsten. The simulation was run on OLCF Frontier using 46,656 AMD MI250X GCDs for approximately one (aggregate) Frontier-day. The scale of this representative microstructure was enabled by SNAP performance improvements. Image credit: Mitch Wood, SNL.

Materials such as these are needed for technology advancement in areas ranging from commercial fusion energy and jet engine manufacturing to memristors, devices whose functionality depends on how their internal structure changes with time.[1]

Molecular dynamics (MD)—an advanced computational method for studying material behavior at the fundamental level—has become a cornerstone of computational science and is a key component of developing materials with enhanced properties. As part of the Department of Energy’s (DOE’s) Exascale Computing Project (ECP), a team of domain scientists, including physicists and chemical engineers, and high-performance computing (HPC) experts from national laboratories and multinational computing technology providers, came together to create the Exascale Atomistics for Accuracy, Length, and Time (EXAALT) application.

EXAALT allows for exploration of molecular interactions over unprecedented combinations of time, length, and accuracy by fully leveraging the power of exascale computing, enabling researchers to glean important insight into the longer-term, engineering-scale behavior of a material as a system evolves.

The combined power of the team’s core competencies led to the development of advanced algorithms, the implementation of key software technologies, and the seamless integration of software and hardware. Together, these advances enabled the application to achieve a nearly 400x performance speedup on just 75 percent of the Oak Ridge Leadership Computing Facility’s Frontier exascale system, compared with a baseline run on the Mira system at the Argonne Leadership Computing Facility, significantly exceeding the ECP goal of 50x. ECP provided a recipe for this success: sustained funding, incentivized cross-discipline collaboration, and a focus on developing the capabilities needed to bring exascale computing to fruition.

Danny Perez, a physicist within the Theoretical Division at Los Alamos National Laboratory and the project’s principal investigator, says, “Typically, writing code is secondary because projects focus on delivering the science. But with ECP, the focus was always on developing the capability.” Through ECP, over the last seven years the team had the opportunity to dig deep into the code, optimize algorithms to make them as fast as possible, and in doing so significantly improve overall performance. Perez continues, “EXAALT came together the way that it did because we had strong relationships between scientists, applications and software developers, and hardware experts and a sustained effort over a long period, which allowed us to deliver something truly exceptional.”

Elegant Code, Complicated Physics

Foundational to EXAALT are three cutting-edge MD computer codes: LAMMPS (Large-Scale Atomic/Molecular Massively Parallel Simulator); ParSplice (Parallel Trajectory Splicing); and LATTE (Los Alamos Transferable Tight-Binding for Energetics). Together, these codes combine classical MD approaches with high-performance implementations of quantum MD to evaluate empirical interatomic potentials and solve the equations of motion for atoms in a system as a function of their spatial relationships to one another.
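
At its core, classical MD repeatedly evaluates interatomic forces and advances Newton’s equations of motion for every atom. A generic velocity-Verlet step, shown below in textbook form rather than as the internals of any of these codes (the `Atoms` struct and `compute_forces` callback are hypothetical names for illustration), gives a sense of the computational pattern:

```cpp
#include <cstddef>
#include <vector>

// Generic velocity-Verlet integration step (textbook form, not the internals
// of LAMMPS, ParSplice, or LATTE). Atoms and ForceFn are hypothetical names
// for illustration; forces would come from an interatomic potential such as
// SNAP.
struct Atoms {
  std::vector<double> x, v, f;  // flattened per-atom, per-dimension arrays
  double mass = 1.0;
};

template <typename ForceFn>
void velocity_verlet_step(Atoms& a, double dt, ForceFn compute_forces) {
  const double half = 0.5 * dt / a.mass;
  for (std::size_t k = 0; k < a.x.size(); ++k) {
    a.v[k] += half * a.f[k];  // half-kick with current forces
    a.x[k] += dt * a.v[k];    // drift positions forward
  }
  compute_forces(a);          // recompute forces at the new positions
  for (std::size_t k = 0; k < a.v.size(); ++k) {
    a.v[k] += half * a.f[k];  // second half-kick with the new forces
  }
}
```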

An interatomic potential known as SNAP (spectral neighbor analysis potential) and its successful implementation within LAMMPS was a key aspect of the EXAALT work. So, too, was integrating the ECP-enhanced Kokkos performance portability library, which provides key abstractions for both compute and memory hierarchy[2] so that the code can run efficiently on the GPU-based architectures of DOE’s exascale machines.
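
To give a flavor of that portability model, here is a minimal illustrative sketch (not EXAALT source code): a Kokkos parallel loop is written once and compiled to a CUDA, HIP, SYCL, or OpenMP backend depending on the target machine.

```cpp
#include <Kokkos_Core.hpp>

// Minimal sketch of the Kokkos single-source model (illustrative, not EXAALT
// code): the same parallel_for compiles to a CUDA, HIP, SYCL, or OpenMP
// backend, and Kokkos::View places the data in that backend's memory space.
int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n_atoms = 1000;                         // hypothetical size
    Kokkos::View<double*> energy("energy", n_atoms);  // device-resident array

    // One work item per atom; the execution space is fixed at compile time.
    Kokkos::parallel_for("per_atom_energy", n_atoms,
                         KOKKOS_LAMBDA(const int i) {
      energy(i) = 0.5 * i;  // placeholder per-atom work
    });
    Kokkos::fence();  // wait for the device kernel to finish
  }
  Kokkos::finalize();
}
```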

The SNAP algorithm is an elegantly complex piece of code. Developed by Sandia National Laboratories’ (SNL) Aidan Thompson, a LAMMPS and MD expert, SNAP uses machine learning techniques, trained on high-accuracy quantum calculations, to accurately predict the underlying physics of material behavior. It requires deeply nested loops with an irregular loop structure to express the energy of each atom as a linear function of selected bispectrum components of the neighbor atoms; an initial step of this calculation maps the positions of each atom’s neighbors onto a unit sphere in four dimensions.[3]
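
That four-dimensional mapping can be sketched compactly. The version below uses a common parameterization in which a neighbor’s distance sets a polar angle on the 3-sphere; the function name and angle convention (`theta0_max`) are assumptions for illustration, and LAMMPS’s exact parameterization may differ.

```cpp
#include <array>
#include <cmath>

// Illustrative mapping of a neighbor displacement (dx, dy, dz) within cutoff
// r_cut onto the unit sphere in four dimensions, as used by SNAP-style
// bispectrum descriptors. The theta0_max convention is an assumption for
// illustration; LAMMPS's exact parameterization may differ.
std::array<double, 4> map_to_4d_sphere(double dx, double dy, double dz,
                                       double r_cut) {
  const double theta0_max = 0.99 * 3.141592653589793;       // illustrative
  const double r = std::sqrt(dx * dx + dy * dy + dz * dz);  // assumes r > 0
  // The radial distance becomes a polar angle: nearby atoms map close to the
  // pole of the 3-sphere, atoms near the cutoff map far from it.
  const double theta0 = theta0_max * (r / r_cut);
  const double s = std::sin(theta0) / r;
  return {std::cos(theta0), s * dx, s * dy, s * dz};  // unit 4-vector
}
```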

While the code worked fairly well on CPU systems, its complexity made mapping the underlying physics onto new GPU hardware a tricky business. Stan Moore, a LAMMPS–Kokkos lead developer at SNL, says, “Early on in the ECP project, we had what we thought was a solid Kokkos version of the SNAP potential in LAMMPS, but its fraction of peak performance on the GPU-based machines was trending downward as new systems came online, and we really didn’t know how to make it better.”

ECP’s collaboration-centric approach offered opportunities to grow the team and bring in the expertise needed to solve the problem. Through facility-supported hackathons, such as those provided by the Oak Ridge Leadership Computing Facility (OLCF) and the National Energy Research Scientific Computing Center (NERSC), team members, including application engineers from NERSC, NVIDIA, and AMD, came on board to sleuth out the source of the problem and reverse the troubling trend. Ultimately, the realized performance improvement was the work of many hands, who both completely redesigned the SNAP algorithm and improved its GPU compatibility.

Performance of the SNAP EXAALT benchmark on NVIDIA and AMD GPU hardware across different code versions, showing a 24x speedup for an MI250X GCD compared to the same hardware running the baseline version of the code. Image credit: Stan Moore, SNL. Reprinted from Journal of Nuclear Materials, Vol. 594, Lasa, A. et al., “Development of multi-scale computational frameworks to solve fusion materials science challenges,” 155011, 2024, with permission from Elsevier.

Hackathons Round Up Experts

Rahul Gayatri, an application performance specialist at NERSC, was the first to join the team through the NERSC Science Acceleration Program (NESAP)—a collaborative effort in which NERSC partners with code teams, vendors, and other software developers to prepare for advanced architectures and new HPC systems.[4]

At hackathon events, developers typically work on smaller parts of the whole code, in particular the part that is causing an issue. Gayatri recalls, “The team was focused on improving the single-node performance of the SNAP module on NVIDIA GPUs. I was given a proxy app called TestSNAP and was tasked with trying to make it faster. We ended up completely rewriting the code.” These optimizations for TestSNAP were later ported to LAMMPS.

The team tried several optimization strategies, but one called kernel fission proved remarkably beneficial. Rather than having a single large kernel handle all the calculations for an atom, the work was broken up into multiple smaller kernels, each concentrating on completing one stage of the algorithm for all the atoms.[5] Gayatri says, “This allowed us to optimize individual kernels because each of them might have different needs in how they are scheduled on the machines and also allowed us to better utilize the resources of a GPU.”
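
A hedged sketch of kernel fission, with placeholder math far simpler than SNAP’s real stages, shows the shape of the change: one fused per-atom kernel becomes several stage-wise kernels that each sweep all atoms, with intermediates handed off through device memory.

```cpp
#include <Kokkos_Core.hpp>

// Illustrative kernel fission (placeholder math, not the real SNAP stages).
// Fused form: one kernel runs every stage for each atom.
void fused(Kokkos::View<double*> pos, Kokkos::View<double*> force) {
  Kokkos::parallel_for("fused", pos.extent(0), KOKKOS_LAMBDA(const int i) {
    const double a = pos(i) * 2.0;    // stage 1
    const double b = Kokkos::sin(a);  // stage 2
    force(i) = b * b;                 // stage 3
  });
}

// Fissioned form: each stage becomes its own kernel sweeping all atoms, so
// every kernel can get its own launch configuration and occupancy trade-off.
// The tmp view carries intermediates between kernels through device memory.
void fissioned(Kokkos::View<double*> pos, Kokkos::View<double*> force,
               Kokkos::View<double*> tmp) {
  const int n = static_cast<int>(pos.extent(0));
  Kokkos::parallel_for("stage1", n, KOKKOS_LAMBDA(const int i) {
    tmp(i) = pos(i) * 2.0;
  });
  Kokkos::parallel_for("stage2", n, KOKKOS_LAMBDA(const int i) {
    tmp(i) = Kokkos::sin(tmp(i));
  });
  Kokkos::parallel_for("stage3", n, KOKKOS_LAMBDA(const int i) {
    force(i) = tmp(i) * tmp(i);
  });
}
```

The `tmp` view that shuttles intermediates between stages is exactly the kind of extra storage the next paragraph describes.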

However, the optimizations were memory-intensive, requiring additional storage to pass atom-specific intermediate information between kernels. Necessity being the mother of invention, and with ECP providing a solid source of funding and long-term stability, the team had the time and resources to reduce the memory footprint of the newly optimized code through an algorithmic improvement known as an adjoint refactor.
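
A toy illustration of that trade-off, using a made-up energy model rather than SNAP’s actual mathematics (the functions `phi` and the quadratic energy are invented for this sketch): a naive force evaluation stores a derivative for every atom-neighbor pair, while an adjoint-style refactor stores only a small per-atom adjoint and recomputes pair terms on the fly.

```cpp
#include <cmath>
#include <vector>

// Toy adjoint-refactor sketch (made-up energy model, not SNAP's math).
// Model: u_i = sum over neighbors j of phi(r_ij); E = sum_i u_i^2, so the
// force term for pair (i,j) is dE/dr_ij = 2 * u_i * phi'(r_ij).
// A naive version stores phi'(r_ij) for every pair (O(pairs) memory); the
// adjoint version stores only y_i = dE/du_i = 2 * u_i (O(atoms) memory) and
// recomputes the pair terms on the fly in a second pass.
double phi(double r) { return std::exp(-r); }
double dphi(double r) { return -std::exp(-r); }

void adjoint_forces(const std::vector<std::vector<double>>& r,  // r[i][jj]
                    std::vector<std::vector<double>>& f) {      // f[i][jj]
  const std::size_t n = r.size();
  std::vector<double> y(n);                // per-atom adjoint, small
  for (std::size_t i = 0; i < n; ++i) {    // forward pass: u_i, then y_i
    double u = 0.0;
    for (double rij : r[i]) u += phi(rij);
    y[i] = 2.0 * u;                        // dE/du_i
  }
  for (std::size_t i = 0; i < n; ++i)      // adjoint pass: contract on the fly
    for (std::size_t jj = 0; jj < r[i].size(); ++jj)
      f[i][jj] = y[i] * dphi(r[i][jj]);    // no stored per-pair derivatives
}
```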

A subsequent hackathon sponsored by NVIDIA brought the company’s senior compute developer technology engineer Evan Weinberg to the team. Weinberg says, “The team showed me this test code and gave me this 300-page tome on angular momentum and quantum mechanics and had me jump right in.” Weinberg’s educational and professional experience—a blend of physics and software engineering—provided a unique skill set, one that allowed him to understand both the underlying science and the computing structures. He says, “I had hardware and broad optimization experience, but I could also understand the fundamental equations, which allowed me to develop nontrivial optimizations that boosted performance.”

For example, Weinberg identified that a previously unused equation in the SNAP calculation could help simultaneously preserve good data locality and good data access patterns. He says, “The SNAP calculation had used one equation, and I determined that by using this second one, we could traverse what is essentially a graph as it computes.” This revelation provided a breakthrough for using shared memory on GPUs to minimize the number of reads and writes from global memory. Gayatri says, “By using this bispectrum symmetry available in the SNAP algorithm, we could save the atom-specific intermediate results in the shared memory, which allowed us to spend fewer cycles accessing global memory.”[6]
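
In Kokkos, GPU shared memory is exposed as team-level scratch memory. The sketch below shows the general pattern, with illustrative sizes, math, and function names rather than the actual LAMMPS SNAP kernels: each team stages one atom’s intermediate results in scratch and reuses them for many reads instead of returning to global memory.

```cpp
#include <Kokkos_Core.hpp>

// Hedged sketch of staging per-atom intermediates in GPU shared memory via
// Kokkos team scratch (illustrative math and sizes, not the actual LAMMPS
// SNAP kernels). One team per atom; threads reuse the scratch array rather
// than re-reading global memory.
void scratch_example(Kokkos::View<double**> intermediates,  // [atoms][terms]
                     Kokkos::View<double*> result) {
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;
  using scratch_view =
      Kokkos::View<double*,
                   Kokkos::DefaultExecutionSpace::scratch_memory_space,
                   Kokkos::MemoryUnmanaged>;

  const int n_atoms = static_cast<int>(intermediates.extent(0));
  const int n_terms = static_cast<int>(intermediates.extent(1));
  team_policy policy =
      team_policy(n_atoms, Kokkos::AUTO)
          .set_scratch_size(0, Kokkos::PerTeam(scratch_view::shmem_size(n_terms)));

  Kokkos::parallel_for("snap_like", policy,
                       KOKKOS_LAMBDA(const member_type& team) {
    const int atom = team.league_rank();
    scratch_view cache(team.team_scratch(0), n_terms);  // shared memory

    // Load this atom's intermediates into shared memory once...
    Kokkos::parallel_for(Kokkos::TeamThreadRange(team, n_terms),
                         [&](const int t) { cache(t) = intermediates(atom, t); });
    team.team_barrier();

    // ...then reuse them for many reads without touching global memory.
    double sum = 0.0;
    Kokkos::parallel_reduce(
        Kokkos::TeamThreadRange(team, n_terms),
        [&](const int t, double& acc) { acc += cache(t) * cache(t); }, sum);
    if (team.team_rank() == 0) result(atom) = sum;
  });
}
```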

ECP’s governance structure provided the breathing room the team needed to try something new, fail, and then try again, to ultimately make inroads in overall capability. “ECP emphasized that investment in software and algorithm advancement, in addition to hardware, makes a huge difference in achieving significant jumps in performance,” says Weinberg. “It isn’t just the same code that exceeded the ECP goal; it is a combination of new hardware with changes to software and completely new algorithms, and these types of advances aren’t made in a month.” Indeed, the initial work to optimize the SNAP algorithm—started during the first hackathon—took about a year and a half to complete. Weinberg also recalls working on an optimization idea for about four weeks that, when completed, led to a performance slowdown. “I was so frustrated,” he says. “I went for a run, figured something out, came back, and rewrote the code from scratch. If I’d had only that hypothetical one month, my contribution would have been a bust.”

A Two-Way Street

In the end, the tight connection between the domain scientists, developers, and vendors was essential to EXAALT’s overall success and exemplifies the shared fate aspect of ECP work—all sides gain wisdom, experience, and insight that can propel innovation throughout the network. For example, Kokkos, one of many ECP-enhanced pieces of open-source software, is used in 15 ECP applications and has achieved extensive reach within the broader HPC community. “The Kokkos team helped us resolve issues all along the way, but we also provided feedback to them that helped them improve Kokkos for other applications,” says Moore. “We also had help from experts at Oak Ridge and Argonne who helped optimize SNAP in LAMMPS and write the backends in Kokkos for the different operating systems. ECP was a golden age of LAMMPS development for GPUs.”

Although the team’s initial focus was on optimizing the Kokkos SNAP on NVIDIA GPUs, it expanded to include AMD and Intel collaborators and other experts at Oak Ridge and Argonne national laboratories. Weinberg shared that these close collaborations not only helped the team but also helped to grow the skills of the individual contributors. “ECP allowed me to work directly with the domain scientists and application experts that were developing and running the algorithms. The fact that I could reach out to other members of the team who were doing the most tightly integrated work—and who could throw the metaphorical textbook at me—helped me to do my best work, which allowed me to provide real value to the rest of the team.”

This interconnectedness also fosters cross-pollination of ideas between different project groups. Several EXAALT developers also worked on other ECP applications, and what was learned in one space provided insight in others. Gayatri says, “Developing these relationships made it easier for us to collaborate on other projects to their benefit.”

MD in a SNAP—Frontier Success and More

Along the development path, the team achieved several milestones that marked significant strides in application and software development. “Meeting the EXAALT KPP-1 (Key Performance Parameter) target was our biggest success,” says Moore. “We took all the optimizations generated through ECP and ECP’s CoPA (Co-Design Center for Particle Applications), NERSC, and NVIDIA, and built LAMMPS as a library for Danny to run for EXAALT. He ran the application on 7,000 nodes of Frontier and was able to not only meet but significantly exceed the 50x speedup needed for the benchmark.”

Additionally, in 2021, the team was recognized as one of six finalists for the coveted Association for Computing Machinery (ACM) Gordon Bell Prize. Moore says, “All the updates developed through ECP that went into EXAALT were also applied to the Gordon Bell submission.” Although future fusion and fission reactor processes were the motivating problem for EXAALT’s ECP performance metrics, the team applied a different use case for the submission and achieved unprecedented scaling and unmatched real-world performance of SNAP MD—simulating 1 billion carbon atoms for 1 nanosecond of physical time using ORNL’s Summit supercomputer.[7]

Moving forward, EXAALT is poised to become an indispensable tool for understanding material behavior for a wealth of applications, and this success would not have been possible without ECP and the network of experts it brought together. Moore says, “ECP broke down the barriers between the laboratories and vendors and allowed us to work together toward a common goal to achieve something much bigger than our small individual project.” Perez is quick to point out that while it may look serendipitous that the team came together the way it did, it really happened because ECP provided the structure. “The special relationships forged through ECP and NERSC, under NESAP, and the ECP-supported hackathons with vendors really made these connections and our success possible. When everyone has the same priorities, the rate at which you can make progress is very different. All the ways to measure success are compatible with each other and that makes it much easier for teams to work together. This ‘all hands on deck’ approach pushed by ECP, where everyone involved is pushing toward one goal, has really proved to be something worth doing.”

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA-0003525.

Los Alamos National Laboratory is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001).

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.


Source: Caryn Meissner, ECP
