Networking is the Key to Unlocking GPU/XPU Diversity

Customers Deploying Multiple Generations of GPUs Benefit from Ethernet Fabrics

Introduction
The AI data center market is at an inflection point. Hyperscalers and large enterprises are scaling massively to support AI workloads. Yet as they build toward a world of multiple GPU/XPU generations per supplier, and multiple suppliers, they recognize that they cannot afford to be locked into a single vendor's ecosystem every time they add a new accelerator to their fleet. We see organizations actively pursuing diversification and migrating toward Ethernet to mitigate these risks. Networking is emerging as the critical enabler that can tip the scales in favor of GPU/XPU diversity.

The Drive for Diversification
Organizations looking to lower their AI data center costs are seeking to diversify their supply chains and the accelerators they deploy. Network performance standardization becomes the equalizing factor: if the performance of an AI cluster is consistent regardless of GPU/XPU, then the cost and benefits of the GPU/XPU itself become the determining factor in overall solution cost. For hyperscalers, this creates the ability to mix merchant GPUs and XPUs; for smaller Neoclouds and enterprises, it prepares them for multiple generations of GPUs. Ensuring consistently high network performance with any GPU/XPU shifts power into the hands of the AI infrastructure builders rather than the AI solution vendors.

Customers aren't looking to deploy different accelerators in each rack, but they want confidence that they can select the best or most cost-effective gear for their next deployment. Single-vendor dependence creates risks around availability, cost escalation, and ecosystem control. Diversification is no longer optional; it is a strategic imperative for resilience and economic efficiency.

Adding AMD to the Customer Fleet
As alternatives to NVIDIA gain ground, their viability at scale hinges on robust networking fabrics. For example, AMD's MI450 (and even the MI350), along with forthcoming models, are demonstrating competitive performance in benchmarks and depend on networking fabrics for rack- and pod-level scale. While per-GPU differences persist, the unit of compute continues to expand beyond a single server enclosure to the rack and the pod. Critically, realizing full cluster potential requires advanced networking to bridge any per-GPU gaps, ensuring seamless scalability and performance parity in distributed systems.

AMD has grown its GPU revenue significantly since 2023 and, on its most recent earnings call, guided to significant GPU revenue growth through 2027 (Figure 1).

[Figure 1: AMD GPU revenue growth and guidance, 2023-2027]
The Core Barrier: Performance Measured in Cost Per Million Tokens (CPMT)
The primary hurdle to widespread adoption of alternatives remains performance. In AI terms, this boils down to CPMT, the key metric for evaluating the economic efficiency of generating output from large language models (a simple illustrative calculation follows below). When companies are investing billions of dollars in large AI infrastructure, lowering costs can translate into hundreds of millions of dollars of savings on a single AI deployment. Lowering cost is also the difference between a Neocloud losing money and being profitable, and adding compute efficiently is key to many enterprise objectives as AI becomes embedded in more of their workloads. NVIDIA's integrated stack has set the benchmark for low CPMT, and alternatives must close the gap at the cluster level to compete effectively.
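
To make the metric concrete, here is a minimal sketch of how CPMT can be computed from amortized cluster cost and sustained token throughput. All figures and the function itself are hypothetical assumptions for illustration, not 650 Group data or measured results:

    # Minimal CPMT sketch. All figures are hypothetical assumptions
    # for illustration, not 650 Group data or measured results.

    HOURS_PER_YEAR = 8760

    def cost_per_million_tokens(cluster_capex_usd, amortization_years,
                                opex_usd_per_hour, tokens_per_second,
                                utilization=0.7):
        """Blended dollar cost to generate one million output tokens."""
        # Straight-line amortization of the cluster build cost.
        hourly_capex = cluster_capex_usd / (amortization_years * HOURS_PER_YEAR)
        hourly_cost = hourly_capex + opex_usd_per_hour
        # Useful tokens produced per hour at the assumed utilization.
        tokens_per_hour = tokens_per_second * 3600 * utilization
        return hourly_cost / tokens_per_hour * 1_000_000

    # Hypothetical cluster: $50M capex amortized over 4 years,
    # $2,000/hour power and operations, 2M tokens/second sustained.
    print(f"CPMT: ${cost_per_million_tokens(50e6, 4, 2000, 2e6):.2f}")

Note that the denominator is cluster-wide throughput: any network-induced stall lowers effective utilization and raises CPMT directly, which is why the fabric shows up so prominently in the economics.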

Networking: The Critical Enabler
The interconnect fabric increasingly determines performance in large-scale AI clusters. InfiniBand has long been the gold standard for low-latency, high-bandwidth, lossless networking in AI training and inference. However, Ethernet is rapidly closing the gap and, in many cases, becoming the performance leader. Future generations of Ethernet silicon will be designed for AI from the ground up and will continue to improve performance and scale.

DriveNets Demonstrates Ethernet’s Superiority
Recent production data from DriveNets highlights the shift. DriveNets' Ethernet fabric achieved up to 18% better performance in NCCL benchmarks compared to InfiniBand, with job completion time improvements of 10-30% over traditional Ethernet. These results come from live deployments, showcasing advanced scheduling and congestion control that deliver lossless-like behavior without the constraints of a proprietary fabric.
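
Job completion time (JCT) improvements of this magnitude translate directly into recovered GPU-hours. The sketch below applies the 10-30% range cited above to a hypothetical training job; the job length, cluster size, and pricing are illustrative assumptions, not DriveNets or 650 Group figures:

    # Back-of-envelope savings from faster job completion time (JCT).
    # The 10-30% range is the DriveNets figure cited above; the job
    # length, cluster size, and price are hypothetical assumptions.

    def gpu_hours_saved(baseline_jct_hours, num_gpus, jct_improvement):
        """GPU-hours freed when a job finishes jct_improvement faster."""
        return baseline_jct_hours * jct_improvement * num_gpus

    PRICE_PER_GPU_HOUR = 2.50  # hypothetical rental/amortized rate

    # Hypothetical 72-hour training job on a 4,096-GPU cluster.
    for improvement in (0.10, 0.30):
        saved = gpu_hours_saved(72, 4096, improvement)
        print(f"{improvement:.0%} faster JCT -> {saved:,.0f} GPU-hours, "
              f"${saved * PRICE_PER_GPU_HOUR:,.0f} saved per job")

Under these assumptions, even the low end of the range recovers roughly 29,000 GPU-hours per job, which is why fabric efficiency compounds quickly at cluster scale.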

AMD + DriveNets: A Compelling Path
For AMD customers, the combination of Instinct MI-series GPUs with DriveNets' AI fabric solution and professional services offers a transformative opportunity. This pairing can deliver cluster-level performance while providing a significantly better CPMT proposition. The result is a highly cost-effective, open-standards-based solution that can be added to a customer's existing accelerator fleet.

Market Outlook: Ethernet’s Dominance Accelerates
At 650 Group, we continue to forecast strong growth in AI networking. The market, encompassing Ethernet, InfiniBand, and optics, is on track to exceed $100 billion by 2028, with Ethernet capturing the majority share due to its openness, supply chain diversity, and rapid innovation. As AI workloads evolve from foundational training to reinforcement learning, inference, and multi-accelerator environments, Ethernet will become the dominant fabric.

Diversification and workload specialization are underway, and networking is the linchpin. Organizations that prioritize advanced Ethernet solutions like DriveNets’ will unlock high-performance, cost-optimized AI clusters.

###