Network connectivity continues to be a major bottleneck for AI data centers: up to 33% of elapsed time in AI/ML tasks is wasted waiting on the network, leaving costly GPU resources sitting idle.1
With AI network requirements on an upward trajectory, choosing between InfiniBand and Ethernet for your data center architecture is one of the first steps toward optimizing network performance and helping AI reach its full potential.
What is Ethernet?
As a near-ubiquitous network communications standard, Ethernet should already be familiar to network operators. It facilitates fast, reliable, and secure data transfer between millions of wired and wireless devices in Local Area Networks (LANs) and Wide Area Networks (WANs) worldwide.
Even with its dominance in the networking industry, Ethernet has continued to evolve rather than stagnate. New breakthroughs have greatly improved the standard's ability to support next-generation networks.
What is InfiniBand?
At the opposite end of the spectrum is InfiniBand (IB), a network communications standard designed to address many of Ethernet's perceived deficiencies.
InfiniBand is a lossless fabric with a unique topology that gives it inherent advantages over Ethernet. The newer network standard offers extremely high throughput and extremely low latency when used as an interconnect between network devices, servers and storage. Although InfiniBand adoption has not been nearly as prevalent as that of Ethernet, its growing popularity in the data center coincides with the rise of AI.
What are the fundamental differences between InfiniBand and Ethernet?
The key differences between the two protocols are as follows:
InfiniBand: Dedicated switch fabric
Ethernet: Shared switch fabric
InfiniBand uses a switched fabric in which multiple switches connect network nodes and transfer data in parallel. This balances network traffic and improves the flow of data through the network. In contrast, the shared fabric of standard Ethernet transfers data between devices over a single communication channel, which can limit efficiency, performance, and scalability.
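To make the difference tangible, here is a small back-of-the-envelope sketch. The link speed, transfer size, and transfer count below are assumed values chosen only for illustration, not measurements; the point is how spreading the same transfers across parallel paths shortens completion time compared with a single shared channel.

```python
# Illustrative arithmetic only: the same set of node-to-node transfers
# over one shared channel versus parallel paths through a switched
# fabric. Link speed, transfer size, and count are assumed values.
LINK_GBPS = 400      # assumed per-link speed
TRANSFER_GB = 10     # assumed size of each node-to-node transfer
NUM_TRANSFERS = 8    # assumed concurrent transfers between distinct node pairs

def completion_seconds(parallel_paths: int) -> float:
    """Spread the transfers evenly over equal-speed links and return
    how long the busiest link needs to move its share of the data."""
    transfers_per_path = NUM_TRANSFERS / parallel_paths
    return transfers_per_path * TRANSFER_GB * 8 / LINK_GBPS

print(f"single shared channel  : {completion_seconds(1):.2f} s")
print(f"4 parallel fabric paths: {completion_seconds(4):.2f} s")
```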
InfiniBand: Lossless fabric
Ethernet: Best effort network
One of the advantages InfiniBand has over traditional Ethernet is how it handles network congestion. InfiniBand is designed to be a lossless fabric, meaning it does not drop packets. It uses link-level flow control, which signals the sender whenever the receiving end is congested; the sender then temporarily pauses transmission to avoid dropping packets. This native characteristic improves efficiency and data integrity when training AI models.
Standard Ethernet-based networks, on the other hand, are best-effort networks: the sender keeps transmitting even when the receiving side is congested and its buffers are full. The result is routinely dropped packets, which lowers overall network performance.
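The toy simulation below contrasts the two behaviors. It is not a protocol implementation, and the rates and buffer size are invented purely for illustration: a flow-controlled sender only transmits what the receiver can buffer, while a best-effort sender keeps transmitting and loses whatever overflows.

```python
# Toy simulation of lossless (link-level flow control) versus
# best-effort behavior. Rates and buffer size are illustrative only.
def run(sender_rate: int, drain_rate: int, buffer_size: int,
        flow_control: bool, ticks: int = 100):
    queued = delivered = dropped = pause_ticks = 0
    for _ in range(ticks):
        space = buffer_size - queued
        sent = min(sender_rate, space)
        if flow_control:
            # Lossless: the receiver signals how much it can accept,
            # so the sender simply waits when the buffer is full.
            if sent < sender_rate:
                pause_ticks += 1
        else:
            # Best effort: the sender keeps transmitting at full rate;
            # whatever does not fit is dropped and must be retransmitted
            # by a higher layer (not modeled here).
            dropped += sender_rate - sent
        queued += sent
        drained = min(drain_rate, queued)
        queued -= drained
        delivered += drained
    return delivered, dropped, pause_ticks

for fc in (True, False):
    d, x, p = run(sender_rate=10, drain_rate=7, buffer_size=20, flow_control=fc)
    label = "lossless fabric" if fc else "best effort    "
    print(f"{label}: delivered={d} dropped={x} pause-ticks={p}")
```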
InfiniBand: RDMA protocol
Ethernet: TCP/IP protocol
InfiniBand uses the RDMA (Remote Direct Memory Access) protocol, which allows network devices to transfer data without involving the host CPUs or operating systems. Because each adapter reads from and writes to application memory directly, communication overhead drops sharply, which translates directly into lower latency and higher throughput.
In contrast, standard Ethernet uses TCP/IP (Transmission Control Protocol/Internet Protocol), which requires the two devices to exchange data through their operating systems' network stacks. This extra step slows down the transfer and adds latency to the network.
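The loopback example below, using only Python's standard library, shows that conventional TCP/IP path: every sendall() and recv() is a system call into the operating system's network stack, which is exactly the per-message overhead RDMA avoids by letting the adapter move data directly between registered memory regions on the two hosts. This is an illustrative sketch, not vendor code, and the message size is an arbitrary assumption.

```python
# Minimal TCP/IP loopback transfer with Python's standard library.
# Each sendall()/recv() crosses into the kernel's TCP stack and copies
# data between user space and kernel buffers; this is the overhead that
# RDMA sidesteps by giving the NIC direct access to application memory.
import socket
import threading

PAYLOAD = b"x" * 4096                          # 4 KiB message (arbitrary size)

srv = socket.create_server(("127.0.0.1", 0))   # OS picks a free port
port = srv.getsockname()[1]

def receive():
    conn, _ = srv.accept()
    with conn:
        received = 0
        while received < len(PAYLOAD):
            chunk = conn.recv(65536)           # system call: kernel-to-user copy
            if not chunk:
                break
            received += len(chunk)

t = threading.Thread(target=receive)
t.start()

with socket.create_connection(("127.0.0.1", port)) as cli:
    cli.sendall(PAYLOAD)                       # system call: user-to-kernel copy
t.join()
srv.close()
print(f"moved {len(PAYLOAD)} bytes through the OS network stack")
```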
Ethernet upgrades
As mentioned earlier, Ethernet has evolved in ways that level the playing field with InfiniBand. Upgrades include RoCE (RDMA over Converged Ethernet), which brings RDMA's direct memory-to-memory transfers to Ethernet networks, and PFC (Priority-based Flow Control), which pauses traffic on a per-priority basis to keep buffers from overflowing and dropping packets.
How do the two compare with each other?
Several criteria to consider when choosing between InfiniBand and Ethernet include:
Costs
The costs of InfiniBand tend to be higher than those of Ethernet on average because InfiniBand requires specialized hardware to be installed. That hardware is often proprietary and vendor-locked, which can limit availability and increase costs.
Despite this, InfiniBand can be more cost-effective in the long run. It performs at a higher baseline level and scales further, so fewer major upgrades are needed over time to accommodate growing AI workloads, helping businesses save money on retooling for future AI applications.
Latency
In terms of latency, InfiniBand generally outperforms Ethernet in the data center. With no packet loss and therefore no retransmissions, InfiniBand offers highly efficient data transfer, often with less than 1 microsecond of latency.
AI performance is largely contingent on feeding large language models massive amounts of data in a timely manner. With lower latency, AI models can be more responsive and run real-time applications more smoothly, without major delays. In some performance benchmarks, InfiniBand has outperformed Ethernet by roughly 20% on AI workloads.
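As a rough sketch of why that matters, the back-of-the-envelope calculation below uses assumed latencies and message counts (not measured values) and assumes the messages are latency-bound and serialized, to show how per-message latency accumulates over a long training run in which each step involves many small synchronization messages:

```python
# Back-of-the-envelope only: all latencies and counts below are assumed
# illustrative values, not benchmarks.
FABRIC_LATENCY_US = {"InfiniBand": 1.0, "Ethernet": 5.0}  # assumed per-message latency

SYNCS_PER_STEP = 1_000   # assumed small synchronization messages per training step
TRAIN_STEPS = 100_000    # assumed number of training steps in the job

for fabric, latency_us in FABRIC_LATENCY_US.items():
    total_minutes = SYNCS_PER_STEP * TRAIN_STEPS * latency_us / 1e6 / 60
    print(f"{fabric:10s} ~{latency_us:.0f} us/message -> "
          f"~{total_minutes:.0f} minutes of pure network latency per run")
```

Scaled across thousands of GPUs that sit idle during those waits, even microsecond-level differences compound into meaningful time and cost.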
Network complexity
Businesses also tend to have a high level of familiarity with Ethernet-based devices. Most if not all network operators already have years of experience configuring Ethernet-based systems, so Ethernet is unlikely to throw as many curveballs at an experienced networking team.
InfiniBand can be more complex to set up and manage because it is a lesser-known standard for which niche expertise and specialized personnel are often required. Because InfiniBand runs on a dedicated fabric, the architecture is often more sophisticated to configure, and the routing and topology are more complex. Extensive training may be needed to familiarize personnel with an InfiniBand-based infrastructure.
Versatility
As the most commonly used standard, Ethernet supports a wider range of topologies and offers greater flexibility, so it can be integrated into a more diverse range of data center environments.
Many systems already have Ethernet equipment in place with backwards compatibility. Network operators can simply build on an existing legacy infrastructure without having to buy specialized InfiniBand hardware for growing workloads.
Applications
InfiniBand is almost tailor-made for high-performance computing (HPC) with its high throughput, ultra-low latency, and ability to facilitate more efficient parallel processing and data sharing. These attributes make it particularly suitable to seamlessly run AI/ML workloads. Ethernet's stable performance, widescale adoption, and ease of use make it a perfect fit in LAN and metropolitan networks.
Axiom for AI networking
Taking the factors above into account can help businesses determine which standard and approach are best for their data center. Both InfiniBand and Ethernet can be great options in the right environment; the choice comes down to organizational needs and resources. Axiom supports both InfiniBand and Ethernet deployments to help businesses build a more robust AI data center.
Axiom offers high-speed transceivers, cables, and network equipment engineered for seamless integration into an ML cluster or AI SuperPOD/BasePOD deployment. From 800G QSFP-DD/OSFP and 1.6T OSFP transceivers with riding or integrated heatsinks, to LPO (linear pluggable optics) that deliver the low power consumption a high-performance AI data center demands, Axiom transceivers are ready for deployment in an AI data center infrastructure.
Power AI data centers with Axiom InfiniBand and Ethernet solutions. To learn more about our Ethernet and InfiniBand EDR/HDR/NDR transceivers as well as breakout DAC and DAC/AOC options, contact our team today.
--------------------------------------------------------------------------------------------
1 2022 OCP Keynote