LightTrends Newsletter

Optical connectivity in AI clusters

April 2022

LightCounting’s observations from GTC 2022

by Vlad Kozlov

If you have not seen the keynote presentation from Nvidia’s CEO at GTC 2022, you should watch it: https://www.nvidia.com/gtc/keynote/. My personal favorite is the closing video showing a “jazz” version of an AI cluster. It is a lot of fun, and it shows off all the new hardware discussed at the event.

The announcements started with the new H100 GPU, which improves on the A100’s performance by 6x in training applications and by 30x in inference. This is accomplished in part by increasing the bandwidth of the connectivity between GPUs in large clusters.

The figure below shows the HGX H100 system with eight H100 GPUs, interconnected by four NVLink switch chips on the front side of the board. The most interesting part is that the NVLink switches on the HGX board can connect to several other HGX boards via 18 800G ports. This is accomplished with OSFP optical transceivers or AOCs plugged into a mezzanine card (not shown in the figure). The connections are managed by 1RU NVLink “leaf” switches, each equipped with 32 800G ports.

[Figure: HGX H100 board with eight H100 GPUs interconnected by four NVLink switch chips]

The new DGX H100 system combines the HGX H100 with ConnectX-7 NICs, supporting up to ten 400G Ethernet or InfiniBand connections. Adding the total NVLink bandwidth (14.4Tbps) to the 4Tbps of Ethernet/InfiniBand ports, the new DGX system can support up to 18.4Tbps of connectivity – a lot more than any other server on the market.
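
As a quick sanity check on these bandwidth figures, here is a minimal back-of-the-envelope sketch in Python (illustrative only; the port counts and rates are taken from the text above, and the variable names are our own):

    # Aggregate connectivity of one DGX H100 system, using the figures quoted above
    nvlink_ports = 18                            # 800G OSFP NVLink ports per HGX H100 board
    nvlink_tbps = nvlink_ports * 800 / 1000      # 18 x 800G = 14.4 Tbps of NVLink bandwidth
    eth_ib_tbps = 10 * 400 / 1000                # ten 400G Ethernet/InfiniBand ports = 4.0 Tbps
    total_tbps = nvlink_tbps + eth_ib_tbps       # 14.4 + 4.0 = 18.4 Tbps per system
    print(nvlink_tbps, eth_ib_tbps, total_tbps)  # prints: 14.4 4.0 18.4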

It is very likely that the next generation of Nvidia GPU-based systems will need 1.6T optics and this may be just 2 years away.

We reported in our November 2021 research note on the OCP Summit that Meta is constructing very large AI clusters, based on Nvidia’s GPUs, with 200G optical connectivity. These AI clusters form a “back-end” network within Meta’s datacenters, enabling targeted advertising on Facebook, Instagram and other applications running on the “front-end” of the datacenters. Many other Cloud companies also use GPU-based clusters for targeted advertising. This is where all the money is coming from. Expect a lot more investments in AI clusters and the optics supporting them.

The full version of the research note is available to LightCounting clients.

This information is also included in our latest report: High-Speed Ethernet Optics – March 2022.

A detailed analysis of trends in the AI hardware market is part of our December 2021 report: High-Speed Cables, Embedded and Co-Packaged Optics.

Ready to connect with LightCounting?

Enabling effective decision-making based on a unique combination of quantitative and qualitative analysis.
Reach us at info@lightcounting.com
