Date: 04/22/24

Axiom network solutions for AI 

AI has risen to prominence in recent years on the back of Generative AI, which can generate content of a quality never before seen at a mainstream level.

At its core, Generative AI is made possible by the data center. Because Generative AI and the data center are functionally intertwined, data centers may need heavy optimization in order to fully unlock Generative AI's potential.

Machine Learning

Generative AI models are based on Machine Learning (ML), a subset of AI in which models learn patterns from data sets without being explicitly programmed with instructions. In Machine Learning, AI models are trained on millions upon millions of data points, which they must absorb, process, and analyze. Due to the sheer amount of data involved in ML algorithms, robust data center environments are key to running AI applications efficiently.
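To make "learning patterns from data" concrete, here is a minimal, illustrative sketch: a one-parameter model that discovers the rule y = 2x purely from example pairs via gradient descent. The training data, learning rate, and epoch count are arbitrary choices for the example.

    // Minimal sketch: the model is never told "multiply by 2"; it infers
    // the parameter w from example (x, y) pairs by reducing its own error.
    #include <cstdio>

    int main() {
        const float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};   // training inputs
        const float y[4] = {2.0f, 4.0f, 6.0f, 8.0f};   // desired outputs

        float w = 0.0f;           // the single learnable parameter
        const float lr = 0.01f;   // learning rate (step size)

        for (int epoch = 0; epoch < 200; ++epoch) {
            float grad = 0.0f;
            for (int i = 0; i < 4; ++i) {
                // Gradient of the squared error (w*x - y)^2 with respect to w.
                grad += 2.0f * (w * x[i] - y[i]) * x[i];
            }
            w -= lr * grad / 4.0f;  // step downhill on the average error
        }
        printf("learned w = %.3f (true rule: y = 2x)\n", w);  // prints ~2.000
        return 0;
    }

A production model has billions of parameters instead of one, which is why training requires the data center scale discussed below.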

Advancing workloads

While AI has the potential to revolutionize entire industries, the traditional data center was not constructed with AI computational workloads in mind. Data centers have traditionally relied on CPUs (Central Processing Units) to do the heavy lifting; however, GPUs (Graphics Processing Units) tend to outclass CPUs when it comes to handling the computation behind machine learning algorithms.

GPUs are better suited to parallelism, running many computations at once. Modern GPUs also include tensor cores, specialized units that accelerate matrix math, allowing them to calculate faster and more efficiently and making them more conducive to the type of computation involved in ML workloads.
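To illustrate what that parallelism looks like in practice, here is a minimal CUDA sketch that adds two arrays of roughly a million elements by assigning one GPU thread per element; the array size and the block size of 256 are arbitrary choices for the example.

    // Minimal sketch of GPU parallelism: one thread per array element,
    // so roughly a million additions run concurrently rather than in a
    // serial loop.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vector_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;  // ~1M elements
        const size_t bytes = n * sizeof(float);

        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);  // unified memory, visible to CPU and GPU
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // Launch enough 256-thread blocks to cover every element at once.
        const int blocks = (n + 255) / 256;
        vector_add<<<blocks, 256>>>(a, b, c, n);
        cudaDeviceSynchronize();

        printf("c[0] = %.1f\n", c[0]);  // prints 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

A CPU typically works through the same job a handful of elements at a time; the GPU schedules thousands of these threads simultaneously, which is exactly the shape of the matrix and vector work inside ML training.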

In response, AI data centers have shifted toward GPU-centric environments to tackle the rigors of AI workloads. Service providers have started to aggregate multiple GPUs per node into Machine Learning/AI clusters, increasing total processing output.

Networking bottleneck

The shift from CPU to GPU-centric clusters addresses many of the processing limitations of traditional data centers, but it comes with an important caveat: the level of network connectivity needed to work with these GPU clusters often exceeds what current data center environments have to offer. 

Running these models puts a strain on current networking infrastructure because of the immense amount of data that must be transferred back and forth in real time.

In a standard cloud deployment, the maximum bandwidth capacity is around 100Gbps. An AI cluster stretches the limits of these data centers by pushing this bandwidth requirement to 30Tbps, a staggering 300 times that of standard data centers.
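As a quick sanity check on that multiplier (1 Tbps = 1,000 Gbps):

    \frac{30\ \text{Tbps}}{100\ \text{Gbps}} = \frac{30{,}000\ \text{Gbps}}{100\ \text{Gbps}} = 300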

Latency is another glaring issue for AI applications. In many cases, data is captured from on-site AI clients, sent to cloud and edge servers, and then returned on-site, over and over again.

All of these issues have created an AI network bottleneck. AI is only as responsive and agile as the network allows it to be, much as a sports car can only travel as fast as traffic and road conditions allow.

Pathways for AI data centers

There are several pathways for building network architectures and data center interconnects that can accommodate the needs of AI. 

InfiniBand vs. Ethernet

InfiniBand

AI networking requires an internal fabric that is not bogged down by networking bottlenecks. InfiniBand is designed for HPC (High-Performance Computing) environments, making it an excellent fit for AI workloads, which typically require high-bandwidth, low-latency deployments.

InfiniBand addresses the latency concern by leveraging a technology called RDMA (Remote Direct Memory Access). Latency in AI networks is often a symptom of the convoluted path data takes to get from point A to point B. RDMA enables two computers to exchange data directly between their main memories without involving either host's processor or operating system. This vastly speeds up the communication process, bolstering data rates while cutting down latency.
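As a rough illustration of the mechanism, the host-side fragment below uses the standard libibverbs API to register a region of ordinary memory so the NIC can serve remote reads and writes against it directly; queue-pair setup and the key exchange with the peer are omitted, and the device choice and buffer size are arbitrary for the example.

    // Sketch of RDMA's first step: registering memory with the NIC so a
    // remote peer can read/write it with no CPU or OS on the data path.
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);        // protection domain

        size_t len = 1 << 20;                         // 1 MiB example buffer
        void *buf = malloc(len);
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
            IBV_ACCESS_REMOTE_WRITE);
        if (!mr) { fprintf(stderr, "registration failed\n"); return 1; }

        // A peer that learns this rkey and the buffer's address can now
        // issue RDMA READ/WRITE operations against it while this host's
        // CPU does other work.
        printf("registered %zu bytes, rkey=0x%x\n", len, mr->rkey);

        ibv_dereg_mr(mr); free(buf);
        ibv_dealloc_pd(pd); ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }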

InfiniBand is also a lossless fabric, meaning it does not drop packets; its credit-based flow control stops senders before a receiving buffer can overflow. InfiniBand offers the best fabric for AI, but it does come with a few drawbacks. It is more expensive and proprietary, which can limit overall network flexibility.

Ethernet

Ethernet for AI has been slowly gaining traction but is still being developed to handle AI network connectivity. Ethernet is slower than InfiniBand in most cases, but its lower costs and ubiquity can make it easier to build around. Ethernet also has the benefit of being disaggregated: hardware from different vendors can interoperate rather than being bound to a single ecosystem, offering greater flexibility.

200G QSFP56 SR / 400G QSFP / 800G OSFP transceivers

200G QSFP56 SR, 400G QSFP and 800G OSFP optical transceivers are ideal for AI networking. These optical transceivers move data back and forth more efficiently to facilitate data transfer in HPC environments, helping power AI operations in real time and meeting the demands of AI models.

QSFP form factor transceivers are also backward compatible with earlier QSFP generations, providing greater flexibility and cost savings. As AI models grow in parameter count, GPU clusters will draw more power and generate more heat. OSFP form factors, however, are built with improved thermal designs that help dissipate the excess heat from higher power consumption, ensuring that thermals never become a performance concern in these data centers.

PAM4 modulation

Most 400G and 800G transceivers utilize PAM4 (4-Level Pulse-Amplitude Modulation), which helps address the bandwidth and latency concerns of AI as well as network complexity. PAM4 uses four signal levels to carry two bits per symbol, versus one bit per symbol for older modulation schemes such as NRZ (Non-Return-to-Zero). At the same baud rate, a PAM4 transceiver therefore delivers twice the data of an NRZ transceiver without adding extra fiber.
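The arithmetic behind that claim: a lane's bit rate R is its symbol (baud) rate B times the bits carried per symbol, which is \log_2 M for M signal levels. The 53.125 GBd lane rate below is a common figure for 400G optics, used here purely as an illustration.

    R = B \cdot \log_2 M
    \text{NRZ:}\ M = 2 \Rightarrow \log_2 2 = 1\ \text{bit/symbol}
    \text{PAM4:}\ M = 4 \Rightarrow \log_2 4 = 2\ \text{bits/symbol}
    53.125\ \text{GBd} \times 2\ \text{bits/symbol} = 106.25\ \text{Gb/s per lane}

Four such lanes provide 425 Gb/s of raw capacity, enough to carry a 400G link with its FEC overhead, as in 400GBASE-DR4.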

Axiom for AI

Axiom is a leader in networking solutions for AI. Axiom offers both InfiniBand and Ethernet options, along with a full transceiver lineup spanning 1G to 800G and more.

As 200G/400G/800G links become standard in AI environments, Axiom 200G QSFP56 SR, 400G OSFP and QSFP-DD, and 800G OSFP transceivers are a seamless fit for AI data centers.

Axiom also offers a Breakout DAC lineup, Axiom's high-speed cable portfolio featuring 100G/200G/400G/800G connections with InfiniBand support. Axiom DAC and AOC cables, such as the 800G OSFP DAC, are ideal for AI data centers.

Axiom's wealth of next-generation data center options addresses a wide range of AI needs: from OSFP and QSFP-DD transceivers to DAC/AOC cables to SFP/SFP+/QSFP+ options and more. Accelerate AI data center performance with Axiom.

 
