Nebius Open-Sources Soperator to Optimize Slurm for AI and HPC Workloads

September 25, 2024

AMSTERDAM, Sept. 25, 2024 — Nebius, a leading AI infrastructure company, is excited to announce the open-source release of Soperator, the world’s first fully featured Kubernetes operator for Slurm, designed to optimize workload management and orchestration in modern machine-learning (ML) and high-performance computing (HPC) environments.

Soperator was developed by Nebius to merge the power of Slurm, a workload manager built for large-scale HPC clusters, with Kubernetes’ flexible and scalable container orchestration. It delivers simple, efficient job scheduling in compute-intensive environments, particularly for GPU-heavy workloads, making it ideal for ML training and distributed computing tasks.

Narek Tatevosyan, Director of Product Management for the Nebius Cloud Platform, said: “Nebius is rebuilding cloud for the AI age by responding to the challenges that we know AI and ML professionals are facing. Currently there is no workload orchestration product on the market that is specialized for GPU-heavy workloads. By releasing Soperator as an open-source solution, we aim to put a powerful new tool into the hands of the ML and HPC communities.

“We are strong believers in community-driven innovation, and our team has a strong track record of open-sourcing innovative products. We’re excited to see how this technology will continue to evolve and enable AI professionals to focus on enhancing their models and building new products.”

Danila Shtan, Chief Technology Officer at Nebius, added: “By open-sourcing Soperator, we’re not just releasing a tool – we’re standing by our commitment to open-source innovation in an industry where many keep their solutions proprietary. We’re pushing for a cloud-native approach to traditionally conservative HPC workloads, modernizing workload orchestration for GPU-intensive tasks. This strategic initiative reflects our dedication to fostering community collaboration and advancing AI and HPC technologies globally.”

Key features of Soperator include:

  • Enhanced scheduling and orchestration: Soperator provides precise workload distribution across large compute clusters, optimizing GPU resource usage and enabling parallel job execution. This minimizes idle GPU capacity, reduces costs, and facilitates more efficient collaboration, making it a crucial tool for teams working on large-scale ML projects.
  • Fault-tolerant training: Soperator includes a hardware health-check mechanism that monitors GPU status and automatically reallocates resources when hardware issues occur. This improves training stability even in highly distributed environments and reduces the GPU hours required to complete a task.
  • Simplified cluster management: By sharing a single root file system across all cluster nodes, Soperator eliminates the challenge of keeping multi-node installations in an identical state. Together with the accompanying Terraform tooling, this simplifies the user experience, allowing ML teams to focus on their core tasks without extensive DevOps expertise; the sketch below illustrates the declarative approach.
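
Soperator follows the standard Kubernetes operator pattern: a Slurm cluster is described declaratively as a custom resource, and the operator reconciles the running pods, volumes and configuration toward that specification. The short sketch below illustrates the idea using the official Kubernetes Python client; the API group, kind and spec fields shown are illustrative assumptions for demonstration only, not Soperator’s actual CRD schema, which is defined in the project’s GitHub repository.

```python
# Illustrative sketch: declare a hypothetical SlurmCluster custom resource on a
# Kubernetes cluster where a Slurm operator such as Soperator is installed.
# The apiVersion/kind and the spec fields are assumptions made for this example,
# not Soperator's real CRD schema.
from kubernetes import client, config


def create_slurm_cluster(namespace: str = "soperator") -> None:
    config.load_kube_config()  # authenticate using the current kubeconfig context
    api = client.CustomObjectsApi()

    # Declarative description of the desired cluster; the operator is responsible
    # for reconciling actual Kubernetes resources to match this spec.
    manifest = {
        "apiVersion": "slurm.example.com/v1",  # hypothetical group/version
        "kind": "SlurmCluster",                # hypothetical kind
        "metadata": {"name": "ml-training"},
        "spec": {                              # hypothetical fields
            "workers": 4,                      # Slurm worker (compute) nodes
            "gpusPerNode": 8,                  # GPUs exposed on each worker
            "sharedRootFilesystem": True,      # single root FS shared by all nodes
        },
    }

    api.create_namespaced_custom_object(
        group="slurm.example.com",
        version="v1",
        namespace=namespace,
        plural="slurmclusters",
        body=manifest,
    )


if __name__ == "__main__":
    create_slurm_cluster()
```

Once the operator has provisioned the cluster, users interact with it through the familiar Slurm interface, submitting jobs with sbatch or srun exactly as they would on a conventional HPC installation.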

Planned future enhancements include improvements to security and stability, scalability and node management, and upgrades to keep pace with emerging software and hardware.

The first public release of Soperator is available from today as an open-source solution for all ML and HPC professionals on the Nebius GitHub, along with the relevant deployment tools and packages. Nebius also invites anyone who would like to try the solution for ML training or HPC calculations on multi-node GPU installations to get in touch: the company’s solution architects are ready to provide assistance and guidance through installation and deployment in the Nebius environment.

For more information about Soperator, please read the blog post published today on Nebius’s website.

About Nebius

Nebius is a technology company building full-stack infrastructure to service the explosive growth of the global AI industry, including large-scale GPU clusters, cloud platforms, and tools and services for developers. Headquartered in Amsterdam and listed on Nasdaq, the company has a global footprint with R&D hubs across Europe, North America and Israel. Nebius’s core business is an AI-centric cloud platform built for intensive AI workloads. With proprietary cloud software architecture and hardware designed in-house (including servers, racks and data center design), Nebius gives AI builders the compute, storage, managed services and tools they need to build, tune and run their models. An NVIDIA preferred cloud service provider, Nebius offers high-end infrastructure optimized for AI training and inference. The company boasts a team of over 500 skilled engineers, delivering a true hyperscale cloud experience tailored for AI builders. To learn more please visit www.nebius.com.


Source: Nebius
