Arista Networks delivers holistic AI solutions in collaboration with NVIDIA

Arista Networks has announced, in collaboration with NVIDIA, a technology demonstration for AI data centres that aligns the compute and network domains as a single managed AI entity.

To build optimal Generative AI networks with lower job completion times, customers can configure, manage and monitor AI clusters uniformly across key building blocks including networks, NICs and servers.

The demonstration is a first step towards a multi-vendor, interoperable ecosystem that enables control and coordination between AI networking and AI compute infrastructure.

Need for uniform controls

As the size of AI clusters and Large Language Models (LLMs) grows, the complexity and sheer volume of disparate parts of the puzzle grow apace. GPUs, NICs, switches, optics and cables must all work together to form a holistic network. Customers need uniform controls between their AI servers hosting NICs and GPUs, and the AI network switches at different tiers.

All these elements rely on each other for AI jobs to complete properly, yet they operate independently. This can lead to misconfiguration or misalignment between parts of the overall ecosystem, such as between the NICs and the switch network. Because network issues can be very difficult to diagnose, such mismatches can dramatically impact job completion time.

Large AI clusters also require coordinated congestion management to avoid packet drops and under-utilisation of GPUs, as well as coordinated management and monitoring to optimise compute and network resources in tandem.

Introducing the Arista AI Agent

At the heart of this solution is an Arista EOS-based agent enabling the network and the host to communicate with each other and coordinate configurations to optimise AI clusters.

Using a remote AI agent, EOS running on Arista switches can be extended to directly attached NICs and servers, providing a single point of control and visibility across the AI data centre. The remote AI agent is either hosted directly on an NVIDIA BlueField-3 SuperNIC or runs on the server and collects telemetry from the SuperNIC. It allows EOS, on the network switch, to configure, monitor and debug network problems on the server, ensuring end-to-end network configuration and QoS consistency. AI clusters can now be managed and optimised as a single homogeneous solution.
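To make the idea of end-to-end QoS consistency concrete, the sketch below compares the settings on a switch port with those on the NIC attached to it and reports any field that differs. It is purely illustrative and is not Arista's or NVIDIA's API: the QosSettings class, its fields (pfc_priorities, ecn_enabled, roce_dscp), the find_mismatches helper and the sample values are all assumptions made for this example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class QosSettings:
    """Hypothetical QoS view of one end of a link (illustrative fields only)."""
    pfc_priorities: frozenset  # priorities with priority flow control enabled
    ecn_enabled: bool          # whether ECN marking/handling is switched on
    roce_dscp: int             # DSCP value assumed to carry RDMA traffic


def find_mismatches(switch_port: QosSettings, nic: QosSettings) -> dict:
    """Return {field_name: (switch_value, nic_value)} for each differing field."""
    mismatches = {}
    for f in fields(QosSettings):
        switch_value = getattr(switch_port, f.name)
        nic_value = getattr(nic, f.name)
        if switch_value != nic_value:
            mismatches[f.name] = (switch_value, nic_value)
    return mismatches


# Made-up sample values: the two ends agree on PFC but not on ECN or DSCP.
switch_port = QosSettings(frozenset({3}), ecn_enabled=True, roce_dscp=26)
nic = QosSettings(frozenset({3}), ecn_enabled=False, roce_dscp=24)

report = find_mismatches(switch_port, nic)
for name, (switch_value, nic_value) in report.items():
    print(f"{name}: switch={switch_value} nic={nic_value}")
```

In a real deployment the two sets of values would come from the switch and from the agent on the host or SuperNIC; here they are hard-coded so the comparison logic can be read in isolation.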

“Arista aims to improve efficiency of communication between the discovered network and GPU topology to improve job completion times through co-ordinated orchestration, configuration, validation and monitoring of NVIDIA accelerated compute, NVIDIA SuperNICs and Arista network infrastructure,” said John McCool, Chief Platform Officer, Arista Networks.

End-to-end AI communication and optimisation

This new technology demonstration highlights how an Arista EOS-based remote AI agent allows the combined, interdependent AI cluster to be managed as a single solution. EOS running in the network can now be extended to servers or SuperNICs via remote AI agents to enable instantaneous tracking and reporting of performance degradation or failures between hosts and networks, so they can be rapidly isolated and the impact minimised.

EOS-based network switches are constantly aware of the accurate network topology. Extending EOS down to SuperNICs and servers with the remote AI agent therefore enables coordinated optimisation of end-to-end QoS between all elements in the AI data centre, reducing job completion time.
