Argonne’s ALCF Launches Its AI Testbed, Details Systems

November 11, 2021

Nov. 11, 2021 — With an eye toward the future of scientific computing, the Argonne Leadership Computing Facility (ALCF) is building a powerful testbed composed of some of the world’s most advanced artificial intelligence (AI) platforms.

A graphic showing the various systems in the ALCF AI Testbed. Credit: ALCF

Designed to explore the possibilities of bleeding-edge high-performance computing (HPC) architectures, the ALCF AI Testbed will enable the facility and its user community to help define the role of AI accelerators in next-generation scientific machine learning. The ALCF is a U.S. Department of Energy (DOE) Office of Science user facility located at DOE’s Argonne National Laboratory.

Available to the research community beginning in early 2022, the ALCF testbed’s innovative AI platforms will complement Argonne’s next-generation graphics processing unit- (GPU-) accelerated supercomputers, the 44-petaflop Polaris system and the exascale-class Aurora machine, to provide a state-of-the-art computing environment that supports pioneering research at the intersection of AI and HPC.

“The last year and a half have seen Argonne collaborate with a variety of AI-accelerator startups to study the scientific applications these sorts of processors might be used for,” ALCF Director Michael Papka said. “GPUs have an established place in the future of scientific HPC, one cemented by years of dedicated research and well-represented in the first generation of exascale machines. It seems clear that AI accelerators could play just as prominent and expansive a role as GPUs, but the specifics of that role largely have yet to be determined. The scientific community should be active in that determination.”

“The possibilities of ways to approach computing are constantly multiplying, and we want our users to be able to identify the most appropriate workflow for each project and take full advantage of them so as to produce the best possible science and accelerate the rate of discovery,” added Papka.

Offering architectural features designed to support AI and data-centric workloads, the testbed is uniquely well-suited to handle the data produced by large-scale simulation and learning projects, as well as by light sources, telescopes, particle accelerators, and other experimental facilities. Moreover, the testbed components stand to significantly broaden analytic and processing abilities in the project workflows deployed at the ALCF beyond those supported by traditional CPU- and GPU-based machines.

The ALCF testbed also opens the door to further collaborations with Argonne’s Data Science and Learning, Mathematics and Computer Science, and Computational Science Divisions, as well as the broader laboratory community. Such collaborations are of great utility, as they simultaneously deepen scientific discovery while validating and establishing the capabilities of new hardware and software through the use of real data.

The extensive, diverse collaboration with startups is essential for determining how AI accelerators can be applied to scientific research.

“The AI testbed, compared to our other production machines, feels much more like the Wild West,” Papka explained. “It’s very much geared for early adopters and the more adventurously inclined among our user community. This is facility hardware at its most experimental, so while we will certainly stabilize things as much as possible and provide documentation, it’s going to represent an extreme end of the scientific-computing spectrum.”

Nonetheless, a series of DOE-wide town hall meetings over the last two years has helped foster interest in the possibilities of AI for science, inviting a wealth of different perspectives and culminating in an extensive report that highlights the challenges and opportunities of using AI in scientific research.

Diverse Applications

Testbed applications already range from COVID-19 research to multiphysics simulations of massive stars to predicting cancer treatments.

Pandemic research is using AI technologies to address the fundamental biological mechanisms of the SARS-CoV-2 virus and associated COVID-19 disease, while simultaneously targeting the entire viral proteome to identify potential therapeutics.

The CANDLE project, meanwhile, attempts to solve large-scale machine learning problems for three cancer-related pilot applications: predicting drug interactions, predicting the state of molecular dynamics simulations, and predicting cancer phenotypes and treatment trajectories from patient documents.

Bleeding-Edge Systems

“The testbed combines a number of bleeding-edge components, including the Cerebras CS-2, a Graphcore Colossus GC22, a SambaNova DataScale system, a Groq system, and a Habana Gaudi system,” said Venkatram Vishwanath, lead of ALCF’s Data Science Group. “The juxtaposition of these machines is unique to the ALCF, opening the door to AI-driven data science workflows that, for the time being, are effectively unfeasible elsewhere. The AI testbed also provides a unique opportunity for AI vendors to target their system—in terms of both software and hardware—to meet the requirements of scientific AI workloads.”

The Cerebras CS-2 is a wafer-scale deep learning accelerator comprising 850,000 processing cores, each providing 48KB of dedicated SRAM memory for an on-chip total of 40GB and interconnected to optimize bandwidth and latency. Its software platform integrates popular machine learning frameworks such as TensorFlow and PyTorch.
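Those per-core and aggregate figures are consistent with each other, as a quick back-of-the-envelope check (our illustration, not part of the announcement) shows:

```python
# Sanity check: 850,000 cores x 48 KB of SRAM per core should come to
# roughly the 40 GB of on-chip memory quoted for the CS-2.
cores = 850_000
sram_per_core_bytes = 48 * 1024          # treating KB as 1024 bytes (an assumption)
total_bytes = cores * sram_per_core_bytes
total_gb = total_bytes / 1e9             # decimal gigabytes
assert 38 < total_gb < 43                # ~41.8 GB, i.e. "40GB" after rounding
```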

The Graphcore Colossus, designed to provide state-of-the-art performance for training and inference workloads, consists of 1,216 IPU tiles, each of which has an independent core and tightly coupled memory. The Dell DSS8440, the first Graphcore IPU server, features 8 dual-IPU C2 PCIe cards, all connected with IPU-Link technology in an industry-standard 4U server for AI training and inference workloads. The server has two CPU sockets, each with 20 cores, and 768GB of memory.

The SambaNova DataScale system is architected around the next-generation Reconfigurable Dataflow Unit (RDU) processor for optimal dataflow processing and acceleration. The SambaNova is a half-rack system consisting of two nodes, each of which features eight RDUs interconnected to enable model and data parallelism. SambaFlow, its software stack, extracts, optimizes, and maps dataflow graphs to the RDUs from standard machine learning frameworks, including TensorFlow and PyTorch.
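The extract-optimize-map pipeline described for SambaFlow operates on dataflow graphs, in which each operation depends only on the outputs of its inputs. The toy graph and scheduling below are purely illustrative (standard-library Python, not the SambaFlow API):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Toy dataflow graph for y = relu(x @ W + b): each op lists what it depends on.
graph = {
    "matmul": {"x", "W"},
    "add": {"matmul", "b"},
    "relu": {"add"},
}

# A dataflow compiler must place and schedule ops so that every op runs
# after its inputs; a topological order captures that constraint.
order = list(TopologicalSorter(graph).static_order())
assert order.index("matmul") < order.index("add") < order.index("relu")
```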

The Groq Tensor Streaming Processor (TSP) provides a scalable, programmable processing core and memory building block able to achieve 250 TFLOPS in FP16 and 1 PetaOp/s in INT8 performance. The Groq accelerators are PCIe Gen4-based, and multiple accelerators on a single node can be interconnected via a proprietary chip-to-chip interconnect to enable larger models and data parallelism.
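A quick arithmetic note on the quoted peak figures (our observation, not a claim from the article): the INT8 rate is exactly four times the FP16 rate, a common pattern when narrower 8-bit operands pack more operations per cycle.

```python
# Quoted peak rates for the Groq TSP.
fp16_ops_per_s = 250e12   # 250 TFLOPS at FP16
int8_ops_per_s = 1e15     # 1 PetaOp/s at INT8

# INT8 throughput is 4x the FP16 throughput.
assert int8_ops_per_s / fp16_ops_per_s == 4.0
```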

The Habana Gaudi processor features eight fully programmable VLIW SIMD tensor processor cores, integrating ten 100 GbE ports of RDMA over Converged Ethernet (RoCE) into each processor chip to efficiently scale training. The Gaudi system consists of two HLS-1H nodes, each with four Gaudi HL-205 cards. The software stack comprises the SynapseAI stack and provides support for TensorFlow and PyTorch.

A Powerful New Resource

Over the next year, the ALCF will continue to ramp up production use of the AI testbed, thereby solidifying its unique position among the facility’s powerful computing resources.

“The initial projects underway are just the tip of the iceberg,” Papka said. “The machines comprising the testbed will soon be leveraged for work that touches on virtually every discipline. While we don’t know exactly what’s going to happen, these systems will have an important role in shaping the scientific computing landscape.”

The Argonne Leadership Computing Facility provides supercomputing capabilities to the scientific and engineering community to advance fundamental discovery and understanding in a broad range of disciplines. Supported by the U.S. Department of Energy’s (DOE’s) Office of Science, Advanced Scientific Computing Research (ASCR) program, the ALCF is one of two DOE Leadership Computing Facilities in the nation dedicated to open science.

Argonne National Laboratory seeks solutions to pressing national problems in science and technology. The nation’s first national laboratory, Argonne conducts leading-edge basic and applied scientific research in virtually every scientific discipline. Argonne researchers work closely with researchers from hundreds of companies, universities, and federal, state and municipal agencies to help them solve their specific problems, advance America’s scientific leadership and prepare the nation for a better future. With employees from more than 60 nations, Argonne is managed by UChicago Argonne, LLC for the U.S. Department of Energy’s Office of Science.

The U.S. Department of Energy’s Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time. For more information, visit https://energy.gov/science.


Source: Nils Heinonen, Argonne Leadership Computing Facility
