LightCounting releases a research note on Meta’s AI supercomputer and Photonics West Plenary sessions.
We reported in our November 2021 research note on the OCP summit that Meta is constructing very large AI clusters using 200G optical connectivity. Now we know how large these clusters are: 16,000 interconnected GPUs delivering 5 exaflops of mixed-precision AI performance, as disclosed by Meta on January 24th.
Nvidia wrote in a blog post: “It’s the second time Meta has picked NVIDIA technologies as the base for its research infrastructure. In 2017, Meta built the first generation of this infrastructure for AI research with 22,000 NVIDIA V100 Tensor Core GPUs that handles 35,000 AI training jobs a day. Meta’s early benchmarks showed the new AI cluster can train large models 3x faster and run computer vision jobs 20x faster than the prior system.”
Note all the yellow fiber cables just below the ceiling in the photo above. It is hard to tell from the photo, but many of the cables appear to be routed to stacks of pizza boxes hanging under the ceiling inside bright blue racks. Are these reconfigurable patch panels with optical switches inside?
Google has been using optical switches inside its AI clusters for a few years now. Is Meta following its example? We noticed more collaboration between Meta and Google at the last OCP summit, which is great. It is very likely that Meta can apply some of Google’s expertise in its own AI cluster design. Our latest report, titled “High Speed cables, EOMs and CPO”, discusses future reconfigurable architectures of HPCs and AI clusters.
Yet Meta is only starting to catch up with Amazon and Google in AI. We suspect that the performance of AI clusters built by Amazon and Google is well above that of the current #1 system on the Top 500 Supercomputer list. The development of new AI-powered applications is a very competitive area, so companies disclose very little about the performance of their systems. We do know that both Amazon and Google use 400G and 2x400G connectivity in their AI clusters, compared with 200G in Meta’s, suggesting that these may have twice the performance of Meta’s latest system.
Highlights from plenary presentations at Photonics West include:
LightCounting subscribers can access the full text of this research note by logging into their accounts.