LightCounting releases a research note on OCP Summit 2022
by Vlad Kozlov
OCP Summit 2022 was a productive event for the industry. It offered a rare opportunity to meet with technology experts from the leading Cloud companies and get a glimpse of their plans for the future. These plans are fluid, but there is a clear direction towards deploying more AI hardware and the networks supporting it, while optimizing power efficiency. The industry is gearing up to take more risk and deploy a range of new technologies, from liquid cooling to co-packaged optics (CPO).
OCP has launched its Future Technology Initiative to identify promising new technologies by fostering interactions between the research community and start-ups on one side and the experts at Cloud companies and key suppliers on the other. “Seeding new markets” is one of the current priorities for OCP.
Optics is another new priority for OCP. This year’s summit included a half-day session for the new “Optical track”, hosted by Andy Bechtolsheim. LightCounting was given the honor of moderating the closing panel discussion.
The keynote presentation by Alexis Bjorlin, VP of Infrastructure at Meta, set the summit agenda with a clear focus on AI hardware, software and all supporting technologies, including optical connectivity.
Alexis summarized developments in AI hardware and architectures, and her presentation included the chart below. It clearly shows that progress in DRAM and interconnect bandwidth is lagging far behind advances in compute hardware. This situation has to change.
Grand Teton, Meta’s latest AI platform unveiled at the event, offers 4x the host-GPU bandwidth and 2x the network bandwidth of Meta’s Zion EX, introduced less than two years ago.
The increasing power consumption of AI hardware poses another major challenge. Meta contributed to OCP a new Open Rack v3, designed for both air and liquid cooling. Any new technology that improves the power efficiency of AI hardware and networking has to be looked at seriously, including CPO. Development of CPO was the project that Alexis led in her previous role as GM for optics at Broadcom. LightCounting will be updating its forecast for CPO in December 2022.
Meta’s early missteps in the metaverse have been ridiculed in the media, and the company was forced to admit that it is lagging behind its competition in learning the magic of AI. After a few heads rolled in top management, Alexis was put in charge of re-energizing the company’s AI strategy and the infrastructure supporting it. Given her track record in the optical industry, Meta’s future is in good hands now.
Alexis acknowledged in her keynote speech that optics is still dear to her heart. She is now in a position to take a calculated risk and give new optical technologies, including CPO, a chance. This is a fantastic opportunity for the optical industry, which has been looked down on by the industry's executives for decades. Andy Bechtolsheim is a rare exception, but he has a gift for seeing problems and potential solutions more clearly.
Craig Thompson of Nvidia presented a compelling argument that AI clusters need a 32x increase in the bandwidth of network connectivity. He also pointed out that achieving this goal with the current designs of pluggable optical transceivers is not realistic: it would double the cost of the whole system and add another 20-25% to the power consumption. Craig emphasized that new designs of lasers and modulators are needed to reduce the cost and power of optical connectivity in AI clusters. CPO can potentially reduce the power consumption by 50%, but an additional 10x improvement is necessary to bring more optical connectivity into AI systems. Craig also mentioned that Nvidia is planning to take the lead in the introduction of 200G SerDes and higher speed chip-to-chip connectivity. He expects that NVLink will become the fastest interconnect technology on the market.
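The power figures quoted above can be put side by side with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that the presentation summary does not spell out: that the 50% reduction from CPO applies to the 20-25% added by pluggable optics, and that the additional 10x improvement is measured relative to the CPO figure. The function name and its parameters are hypothetical.

```python
def optics_power_overhead(pluggable_overhead, cpo_reduction=0.5, extra_improvement=10.0):
    """Return the optics power overhead, as a fraction of system power, at each step.

    Assumption: the CPO reduction applies to the overhead of pluggable optics,
    and the extra improvement factor applies on top of the CPO figure.
    """
    with_cpo = pluggable_overhead * (1.0 - cpo_reduction)
    with_extra = with_cpo / extra_improvement
    return {
        "pluggable": pluggable_overhead,
        "cpo": with_cpo,
        "cpo_plus_extra": with_extra,
    }


if __name__ == "__main__":
    # The two ends of the 20-25% range quoted for pluggable transceivers.
    for overhead in (0.20, 0.25):
        steps = optics_power_overhead(overhead)
        print(", ".join(f"{name}: {value:.2%}" for name, value in steps.items()))
```

Under these assumptions, the 20-25% overhead of pluggable optics would fall to 10-12.5% with CPO and to 1-1.25% after a further 10x improvement.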
The full text of the research note is available to LightCounting subscribers at: https://www.lightcounting.com/login