NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary

February 7, 2024

NASDAQ US Information Technology Semiconductors and Semiconductor Equipment special 73 min

Earnings Call Speaker Segments

Gadi Godanyan

executive
#1

Good morning, and welcome to this NVIDIA workshop, Powering Networks for AI Clouds Using Spectrum-X. My name is Gadi Godanyan. I'm based in London and I'm responsible for AI networking across EMEA. Joining me today is Jeff Tantsura. Jeff, do you want to introduce yourself before we begin?

Jeff Tantsura

executive
#2

Yes. I'm distinguished at NVIDIA, working on all things networking related to GPU connectivity, from design to testing to thinking about what's next.

Gadi Godanyan

executive
#3

Perfect. Thank you, Jeff. So I'll start sharing my screen, and we'll go over the slides. We'll have 2 parts for this session. In the first part, I'll do the easy parts, and then Jeff will take the deep-dive, technical hard stuff. If you have any questions, feel free to use the chat. We're here to answer your questions, and we also have a moderator with us. There will also be time at the end of the session dedicated to Q&A, so feel free to use that platform. So yes, let's get started.
In terms of the agenda, we'll start by talking a little bit about the Ethernet ecosystem, giving you information about the components that build the AI networking solutions that we have, several types of solutions. Then we'll dive specifically into Spectrum-X, what it is, giving you an idea of what we've been working on in the last few years since we acquired Mellanox and Cumulus Networks. From there, Jeff will dive into the problem definition: what are the inherent problems with Ethernet-based technologies for AI? He'll take you through the protocols and the mechanisms, the end-to-end capabilities, that avoid the problems that standard Ethernet alone will have. From there, he will talk about NCCL, which is the NVIDIA Collective Communications Library. A very important part of optimizing and getting the best possible performance from your network is understanding how the GPUs communicate with each other, and this is NCCL. And then topologies: different types of topologies, scale and multi-tenancy. One more thing I forgot to mention: I will also talk a little bit about reference architectures and the idea behind reference architectures for NVIDIA, because we know it's a crucial part of understanding the entire end-to-end stack. So let's get right into it.
If you take a look at the evolution of the data center, we all understand that you need to scale the GPUs, you need to scale your compute, in order to meet the high demand of the training jobs that you have in the AI market, right? So at the end of the day, the more you can push your network to do, the better the performance you will get from your end-to-end GPU cluster. On the networking side, we used to be focused on the cloud; you can see it on the bottom left side of the screen in blue. Cloud comes with its own requirements and problems, like multi-tenancy, but at the end of the day, the scale of the workload is really small, and traditional Ethernet fits perfectly into that cloud bucket. But where we're going, and where we have been in the last few years, is actually trying to cater for the growing demand for AI connectivity and AI workloads. In this case, we distinguish between 2 types of deployments. The first one is the generative AI cloud, which is a cloud in a sense, with all of the cloud requirements, but with many more capabilities that need to be embedded into the infrastructure. Most importantly, it is a heavy workload with a very, very high performance requirement for GPU-to-GPU connectivity, so east-west traffic, in a lossless way. This is the part we're focused on today with our Spectrum-X solution. At the same time, we also need to understand that there are different types of requirements and different types of deployments, and we also see AI factories. In a sense, they're not cloud deployments, but all of the GPU-to-GPU connectivity is on steroids, meaning that you want to squeeze out the maximum performance while losing some of the flexibility that a cloud network would give you. This is why, if you look at the performance or the number of GPUs in the cluster, AI factories are usually bigger.
What we see is that customers and partners will, in many cases, deploy both options to cater to different requirements from their customers. So take it as a general guideline, but we've seen a mixture of those things, where usually the AI factory is loaded with InfiniBand and NVLink to get the best performance, and generative AI clouds go with the Spectrum-X solution.
So let's start with the InfiniBand offering. We won't dive too much into it, but just to give you the idea and set the stage: using the Quantum family, actually Quantum-2 switches, that's the second generation, InfiniBand has become the gold standard for large-scale AI. This is what Mellanox has been doing for HPC, and this is what Mellanox and NVIDIA together have been doing in the last 3 years for AI. I won't go too much into the details, but the components are the Quantum-2 switch, which is a 25 terabit switch, with BlueField-3 DPUs. There's an entire software ecosystem that you can see on the right side, which gives you smooth telemetry, configurability, manageability and simulation on top of InfiniBand. And this entire stack, end to end, is here to deliver the best performance possible. InfiniBand as a technology is a pure software-defined network, which makes it ideal for AI workloads. On top of that, because it was a dedicated technology in a sense, we've developed more technologies to keep enhancing GPU-to-GPU connectivity, with things like SHARP, which is in-network computing. So if you're interested in learning more about this part of our technology stack with Quantum switches, please feel free to reach out to me, to us, to your networking reps or your account managers, and we'll be happy to dive into InfiniBand and give you more information on how things work with that technology.
So when we're focusing on AI clouds, the solution that we are talking about here today is the Ethernet-based platform, Spectrum-X. The components it is built on are the Spectrum-4 Ethernet switch and BlueField-3 DPUs. We'll talk about each individual part in a second. And on the right side, all the green stuff, is the software ecosystem; we'll also dive into that and give you a bit of information on each one of these protocols and products so that you better understand what Spectrum-X has been built on. The idea here is that we needed to find a solution to the Ethernet problem when it comes to AI. And the solution is an end-to-end one, meaning that you need the BlueField-3 DPUs and Spectrum-4 to work together with different types of protocols in order to give you, I would say, the closest possible environment to InfiniBand, because InfiniBand has been doing so great in that space. That's what we've been doing in the last few years. That's what we've focused on. So if we used to be a cloud networking company, in a sense we're still a cloud networking company, just for the AI cloud, meaning that we kept all the cloud capabilities that Spectrum-X brings to the table, with enhanced capabilities to deliver the best performance possible for AI. And again, everything is done on top of NCCL, with all of our experts who have been shaping the roadmap for InfiniBand in recent years now doing the same thing to tune our Spectrum-X solution for the best AI performance.
So yes, let's go a bit quicker here. Spectrum-4: this is a 51.2 terabit switch, first to market, a purpose-built switch for AI. This is what we've been working on in the last few years. We made all the necessary changes to our roadmap in order for this switch to be the best switch for AI, from hardware to software. In this specific case, you see the SN5600, which is the 64-port 800 gig model.
We have other variants as well. This is the model that we see in the picture, and this is the one that is incorporated into our reference architecture designs. Because of the high radix of the switch, it gives you a very, very good footprint and a large scale, which Jeff will talk about later on. But again, very high radix gives you the ability to be very efficient with your network.
The second component, as we mentioned, is BlueField-3. It's important to understand that this is a SuperNIC, meaning that it's a smaller form factor [indiscernible] with less power and fewer requirements. At the end of the day, each one of your GPUs in an 8-GPU system can push, and will push, 400 gig, meaning that for each GPU you need a dedicated card. And this is even without talking about storage, out-of-band connectivity and so on and so forth. So those cards, the SuperNICs, are meant to be very high performance on one hand, giving you the ability to run Spectrum-X and also push 400 gig, and on the other hand to have as low a footprint as possible so that the servers can efficiently fit all of them. We can see some examples here with different types of OPNs for the cards and also different information on the NIC. Again, BlueField by itself is a whole other story. There are so many more capabilities that you can add to the card on top of just the pure performance of Spectrum-X. And if you're interested, just like InfiniBand, this is a completely different topic, and we can take it and have a deep dive going forward. Also, at the end of the session, or even now, you can look at the resources section. In there, there is a webinar that we ran a week or 2 ago, solely focused on BlueField-3 and its capabilities in cloud environments, what we can do with it. So I really encourage you to go and watch this video on demand.
When it comes to the software stack, this is what we see on top of our Ethernet-based solution. So if we have the Spectrum-4 ASIC that we discussed, and we talked about the SN5600 as a platform, on top of that we will see different types of operating systems. The most common one is Cumulus Linux. We also have open NOSes, which, if you're interested, is a different discussion, like SONiC or even SAI, with some big vendors that are deploying it at large scale. And then on top of that, we have other tools like NVIDIA AIR or NetQ that give you the ability to monitor, to debug, to configure and to simulate your environment, which we'll talk about in a second.
When it comes to Cumulus Linux, this is a company that NVIDIA acquired about, I think, 4 years ago. It's the flagship NOS in our eyes, meaning that this is the most important network operating system that we have. We have other options, but this is where we focus. So Spectrum-X always leads with Cumulus Linux, and then the other operating systems are also being incorporated into Spectrum-X as a next-phase scenario. So Cumulus is definitely leading the charge. It's a Linux-based solution, pure Linux, meaning that the box actually acts like any other Linux server in the environment, giving you a lot of flexibility, a lot of customization and a lot of power in automation. And this is proven in large deployments that we have at scale with cloud providers across the globe. Yes, I think that's enough for this one.
In terms of NetQ, this is a visibility and validation tool that we have had for a few years now. We've done a lot of work, and we keep adding capabilities which are, of course, tailored for AI. Trying to debug an AI cluster, trying to get visibility into an AI cluster, is very different from doing so in a plain cloud environment. And this is what NetQ has been doing in the last few years.
It gives you the best telemetry, which is tailor-made for AI workloads and AI clusters. NetQ is your eyes on the network in terms of visibility and validation, meaning you can do things like go over the entire fabric and make sure that everything is configured in a certain way, or make a configuration change and see whether it affected any other areas of the network, like BGP neighbors and so on and so forth. It's important to understand that NetQ monitors both the GPU and the switches, so it keeps end-to-end visibility of your network.
And the last component is NVIDIA AIR. This is our digital twin SaaS solution, which is getting more and more traction. It's being more and more intertwined with other products in the whole NVIDIA ecosystem. The idea behind this product is that everything which is physical can also be simulated, and everything that you can do in simulation can save you time, can save you from human error and can give you a much better understanding of your physical deployment. So there are several use cases for NVIDIA AIR. Right now, we're talking about Spectrum-X. We also have another workshop, a face-to-face workshop, where we go deeper into Spectrum-X and do some whiteboarding. In AIR, you will see a demo, sorry, not a demo, a live environment of Spectrum-X that you can run with us in the room together, where we'll have you configure and build your environment. So this is a sales tool or a training tool. But also, if you order something, let's say, even before the switches and the network equipment arrive, you can go into AIR, spin up an environment, fine-tune it to match the environment that you have, and then build all the configuration and all the information even before the first switches arrive on site.
And once the switches arrive on site, AIR can also work at large scale; again, we're talking about hundreds of switches and hundreds of GPUs without any performance issues. It gives you the ability to keep a digital twin of your network, meaning that every configuration change, every upgrade, every switch that you're adding to the network, you first do in AIR. You make sure that everything is smooth and there are no issues. You can also integrate it with NetQ for validations. And only then, when you feel safe, do you push this change into the real environment, just like the software CI/CD pipelines that you run in your organization. So AIR is a very important tool for us, and it's getting more and more investment from NVIDIA. You'll see it more and more across the board, not only for networking.
So I gave you the idea about the components that we have. Now, why should you care? Why should you keep on listening and getting to the deep dive? The value proposition behind our Ethernet solution, which is the first platform on the market purpose-built for AI, is first the fact that we give you the highest performance. This is what we've been doing in the last few years. We took all the networking engineers, we took all of the AI experts, and we put them together in a room. We built a cluster for them [indiscernible]. What you need is to fine-tune all of those little parameters, which can take ages, in order to get the best performance for each and every scenario when you're dealing with large-scale AI workloads. And we can only do that because we're NVIDIA, because we have all this information in-house, we have all this expertise. So that's combining cloud with the best performance possible. The second thing is fastest time to AI and zero risk. How do we achieve that?
Everything that we recommend our customers deploy, we've done ourselves, meaning that we stick to a reference architecture with specific building blocks that we've tested ourselves, so we know what the expected performance is. We know how to configure it. Our networking PS teams know how to deploy it, because even before we sold it or suggested it as a solution, we built it ourselves in our labs. So a reference architecture is our way to reduce the risk and shorten the time to AI. And those 2 together, at the end of the day, lead to better return on investment, meaning that if your network is more efficient and the performance is better, you can squeeze more out of the GPUs, and the cluster that you bought gives you faster training and higher performance at the end of the day.
On the right side, you see an example here of Spectrum-X networking for AI. We did 2 different variants. You see the L40S option with our OVX servers and also the one with HGX. I want to quickly take the L40S as an example of a reference architecture, explain what we see there and then hand it over to Jeff. So when you're looking at the reference architecture of Spectrum-X with L40S in this scenario, what we've done is we said, okay, we want to accelerate AI with the highest performance possible at massive scale. Before we can go and deploy that, we need to test it ourselves. We need to make sure that every component is working. We cannot test everything, so we needed to find the right solution that we think gives you the best performance possible with the least amount of drawbacks, build it, qualify it, test it, and then build a document around it to tell our customers and partners how to do it themselves. So the component in this document is the L40S server. In this example, you can see [indiscernible] PCI-based cards, but there are also the SM48 cards in the system.
And then on the right side, again, it's the Spectrum-4 switch and a BlueField-3 SuperNIC. So this is what the server looks like, the server architecture. You can see the number of GPUs in an 8-way server, with the number of BlueField cards for east-west and north-south. At the end of the day, this is what every server in that specific reference architecture needs to hold. On top of that, outside of the server, we have the other components. Again, like the BlueField: what capabilities do we need to get from BlueField, and what are the benefits of adding each one of those components to the reference architecture. And at the end of the day, what we can tell you is that when we ran this server, this is the benchmarking that we ran and this is the performance that we got. If you're getting anything under this performance level, we should work together to find out why you're not getting the best performance possible. And because we've tested it ourselves, we can tell you exactly how to fix it. And again, everything we have in the reference architecture also comes with simulation: you can go into AIR, spin up the live environment that we tested, and work with it [indiscernible].
When it comes to the reference architecture design itself, you can see here, as I mentioned earlier, that we're working in units, so in a pod-like environment, and you can easily scale it up or down according to need from the reference architecture. This is an example of 2 GPU clusters that you have here, with 16 SUs with 32 servers in each, and we will tell you exactly how to connect it, where to connect it, how many switches, what the board will look like, how many storage nodes you need, what the connectivity to storage is and so on and so forth. And then you can take this modular configuration of the reference architecture, fine-tune it and fit it to the needs that you have with a customer or a partner.
Everything here is, of course, being done through our OEM partners. We cannot do anything by ourselves. Our storage and OEM partners have been all over this solution, meaning that you can get it from Dell, HP, Lenovo or Supermicro. All of them are either very close to GA or will be GA in the next quarter, depending on the OEM. But all of them will offer Spectrum-X for H100 and also L40S systems as part of their solutions going forward.
And just to give you an example of how we tested the RA and how the OEMs work with us, you can see an example here of the largest cluster for Ethernet-based generative AI that we have. It's called Israel-1, and it will be the most powerful supercomputer in Israel, with a peak performance of [indiscernible]. This is done in collaboration with Dell: HGX H100 servers, a combination of 256 of those resulting in 2K GPUs, 80 Spectrum-4 switches and then 2,560 BlueField-3 SuperNICs and regular DPUs, both for east-west and north-south connectivity. At the end of the day, no other networking company has the ability to spin up such a cluster. Because we're NVIDIA, we have those clusters for research and for advancing humankind with AI breakthroughs. But at the same time, we built it at the scale of an AI cloud. So all of the problems [indiscernible]. And we're already doing so. We started building this one at the end of last year. Every problem that we find, we fix there; every performance tuning that we want to do at large scale, we are doing [indiscernible]. So the RA is actually tested on a large environment, and this is the power of our reference architecture.
So hopefully, you understand why Spectrum-X is important for you at this point, and why you would like to learn it or deploy it. But now let's understand what's behind the scenes: what the problem with traditional Ethernet is and how Spectrum-X actually solves it. For that, I will give the floor to Jeff. So Jeff, you can now share your screen.

Jeff Tantsura

executive
#4

Okay. Please confirm you can see my slides.

Gadi Godanyan

executive
#5

Yes, we can.

Jeff Tantsura

executive
#6

And can you hear me? Great. So everyone, let's deep dive into the technicalities of Spectrum-X and, actually, the motivation behind building it. We will also go a bit into machine learning semantics and how they relate to networking. So the problem definition, really: if you look at traditional data centers, their main characteristic is that they're very generic. They, in a way [indiscernible], right? So they carry a mix of a variety of workloads. They have got really high [indiscernible], or many different flows, UDP, TCP. There is a lot of variability in headers, which allows you to load-balance really efficiently using basic [indiscernible] load balancing. Fabric utilization is usually pretty low. That's because, in most cases, it's simpler and cheaper to throw hardware at a problem than to optimize the network, right?
On the other side, if you look at the AI fabric characteristics, usually there are very few, very high-bandwidth flows, and I'll explain to you later why. A single GPU can drive a 400 gigabit per second QP. That's eventually translated into a flow when it becomes [indiscernible], right? So from this perspective, doing per-flow load balancing is not going to work. You'll get collisions, you'll drop packets, and your performance will go below 10%, which is completely unacceptable. The operations are very, very sensitive to jitter because you are always waiting for the last GPU to finish. Imagine you've got 10,000 GPUs working on a particular problem. If the last GPU is slow because of networking, you are losing all the rest. In other words, the other 9,999 GPUs are going to be idle, waiting for a single GPU to finish up. So the ability to provide nonblocking communication between GPUs is mandatory. Since we also want to be cost-optimized, ideally we'd like not to over-provision the network.
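The point about per-flow load balancing and collisions can be illustrated with a small sketch. It assumes, for illustration only, that each flow is hashed onto one of several equal-cost uplinks uniformly at random; real switches hash packet header fields deterministically, so the function names and the uniform-random model here are hypothetical simplifications rather than a description of any particular switch.

```python
import random


def exact_collision_probability(num_flows, num_links):
    """Probability that at least two flows share a link (birthday problem)."""
    if num_flows > num_links:
        return 1.0
    p_no_collision = 1.0
    for i in range(num_flows):
        p_no_collision *= (num_links - i) / num_links
    return 1.0 - p_no_collision


def simulated_collision_probability(num_flows, num_links, trials=20_000, seed=0):
    """Monte Carlo estimate of the same probability, using random placement."""
    rng = random.Random(seed)
    collided = 0
    for _ in range(trials):
        links = [rng.randrange(num_links) for _ in range(num_flows)]
        if len(set(links)) < num_flows:  # some link carries 2+ flows
            collided += 1
    return collided / trials


if __name__ == "__main__":
    # Eight line-rate flows spread over eight uplinks.
    print(exact_collision_probability(8, 8))
    print(simulated_collision_probability(8, 8))
```

Under this model, eight line-rate flows hashed onto eight uplinks share a link about 99.8% of the time, and any shared link is asked to carry twice its capacity. With thousands of small flows the same hashing averages out, which is why it works in a generic data center and not here.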
Since we know that each GPU can generate the full line rate of the network interface attached to it, so the GPU can generate 400 gigabit per second and the BlueField SuperNIC attached to it can generate or sustain 400 gigabit per second, the network must allow us to do the same, right? So those are very fundamental differences from a basic data center, even if we are talking about something like hyperscale data centers, so think Amazon, Azure or CIN, right? Back-end networks, the networks specifically built and designed for machine learning, are very different in terms of requirements and characteristics. The content of this presentation is really focused on those requirements and how we are addressing them for customers. The problem statement is pretty much that what you use today in your regular data center is not good enough for AI. You are spending a huge amount of CapEx and OpEx to acquire the hardware and to operate the hardware. So if you cannot utilize it to 99.9%, you're leaving money on the table. And this is what we are trying to prevent here.
So Spectrum-X has been developed and optimized for one thing: AI networking. It's an end-to-end platform that has a number of components we are going to go over here. Gadi gave you a high-level overview; I will go a bit slower. Practically, the goal is to get the network out of the way, because GPUs are communicating at really high speed and operations are collective, so there are many, many GPUs performing an operation. If the network is in the way, if GPUs have to wait to perform re-drive operations because of networking, you've just lost your GPU time. So a quick overview of what Spectrum-X is; again, I'm trying to summarize what you've learned from Gadi. It's the Spectrum-4 ASIC, the Spectrum-4 switch. We are using the Cumulus NOS as of today. We are working on SONiC; it will be announced later on. The BlueField SuperNIC is the endpoint; it is the NIC that is attached to the GPU.
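The cost of waiting on the slowest GPU in a collective step can be put into numbers with a short sketch. It assumes a synchronous step in which every GPU must wait for the last one to finish; the helper names and the timing values are hypothetical and chosen only for illustration.

```python
def step_time(per_gpu_times):
    """A synchronous collective step ends when the slowest GPU finishes."""
    return max(per_gpu_times)


def idle_fraction(per_gpu_times):
    """Share of total GPU time spent waiting for the slowest participant."""
    slowest = step_time(per_gpu_times)
    busy = sum(per_gpu_times)
    return 1.0 - busy / (slowest * len(per_gpu_times))


if __name__ == "__main__":
    # Hypothetical numbers: 9,999 GPUs finish in 1.0 time unit,
    # one GPU is delayed by the network and needs 1.5.
    times = [1.0] * 9_999 + [1.5]
    print(step_time(times))
    print(idle_fraction(times))
```

With these made-up numbers, one delayed GPU out of 10,000 stretches the step by 50% and leaves about a third of the cluster's GPU time idle.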
And again, for the SuperNIC to be able to provide nonblocking access to the fabric, it must match the speed of the GPU. So if we have 8 GPUs that are H100s, for example, or H200s, which are 400 gig as well, we always match the number of GPUs to the number of SuperNICs. So an HGX server, as Gadi showed, has 8 GPUs. It will have 8 SuperNICs. That's the rule of thumb. If you are trying to save money here, you are doing completely the wrong thing, because eventually it's going to be the GPU that waits for the network. That's the last thing we want.
Features: it's really tuned for AI [indiscernible]. Since it's an Ethernet IP fabric, we are using [indiscernible], which is RDMA over IP/UDP, and the fine-tuning is done to provide nonblocking capacity over the fabric for GPUs to communicate. Fabrics are configured to be lossless for a variety of reasons. RDMA is sensitive to packet loss. Some operations have implicit expectations that no packet is lost. So fabrics are configured lossless. You might have read all the white papers, probably the most famous one from Microsoft in 2016, about some issues with regard to PFC. Seven years later, there's another paper published by the same people, so 71 people from Microsoft, saying we've solved all the problems. PFC is working. PFC is deployed everywhere in Azure, and all the problems that we have experienced, sorry, that Azure has experienced before, are solved by watchdogs and by proper configuration. SONiC as used in Azure today has all the parameters built in. So our advice to you: please use lossless. Don't listen to [indiscernible]; it's usually coming from the wrong place. The problem of lossless has, in effect, been solved. It's working, it's deployed at scale and it's operated at scale. We ourselves tested lossless as well. All the tests we do in Israel-1 are on a lossless network. So our advice to you, unless you have very particular reasons which I'm not aware of: go lossless.
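The requirement for nonblocking access to the fabric, with one SuperNIC per GPU, has a direct consequence for how many endpoints a fabric of a given depth can carry. The sketch below uses the textbook sizing of a non-oversubscribed folded Clos (fat tree), where every switch below the top tier splits its ports half down and half up. The function name is hypothetical, and the radix of 128 in the usage lines is an assumption: it treats a 64-port 800 gig switch as 128 ports of 400 gig, which the talk does not state explicitly.

```python
def max_endpoints(radix, tiers):
    """Endpoints supported by a nonblocking folded Clos of identical switches.

    Switches below the top tier use half their ports toward endpoints (or the
    tier below) and half toward the tier above; top-tier switches use all
    ports downward. This yields radix * (radix / 2) ** (tiers - 1).
    """
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    if radix < 2 or radix % 2:
        raise ValueError("radix must be an even number of ports")
    return radix * (radix // 2) ** (tiers - 1)


if __name__ == "__main__":
    # Assumed radix: 64 x 800G ports used as 128 x 400G ports.
    print(max_endpoints(128, 2))  # two tiers (leaf and spine)
    print(max_endpoints(128, 3))  # three tiers
```

Under that assumed radix, two tiers top out at 8,192 GPU-facing ports at 400 gig, so a larger cluster, or one forced into smaller building blocks, needs a third tier.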
Practically, we'll get again into more detail about why we advise you to build the fabric in particular ways. If you are building something that is up to 8K GPUs, usually we advise using two-tier fabrics of [indiscernible]. If you need to build something larger, or your power infrastructure forces you to build smaller pods, we advise you to use three-tier fabrics, so [indiscernible]. Again, we'll go into more details.
So let's take a look at the components of Spectrum-X. Let's start with congestion control. DCQCN was designed around 2013, mostly for storage workloads, right? It's a great protocol. Again, it's deployed today at huge scale. It works really well. It reacts very well to congestion. However, the convergence speed is medium. What do I mean by convergence? When there is a failure, you are going to experience queue build-up on the switch. You will start marking packets with ECN, as this is pretty much how ECN works, both for IP/TCP and for anything else that supports ECN. The packets get to the receiver, and the receiver sees that there's an ECN-marked packet. It creates [indiscernible] and sends it back; the sender receives it and reduces the rate, right? So suddenly, we are sending less traffic. The question is really how long it takes you to find the optimum transmit rate so you are not leaving money on the table. In other words, you are not undershooting or overshooting in terms of how much you reduce it. DCQCN from this perspective is somewhat slow, which is a feature built into the congestion control mechanism that DCQCN uses. If you are looking at bandwidth requirements for a machine learning cluster, as we said, we really want to run at a sustained rate of around maybe 93%, 95%. So it's very, very high. If we are sending at a low rate for too long, we are actually reducing the rate for no reason. And that's why we have developed a new congestion control algorithm, called ZTR-CC. It has a number of reasons to exist. One of them is that it is completely zero touch.
It's done from the host to the host and back to the host, so it's completely over the top. We know how difficult it can be to fine-tune DCQCN, from RED min/max parameters to probabilities to cable lengths. It's a complex set of parameters. ZTR-CC removes all of this. It works from the beginning. The only thing you need to do on the switch is to allow basic ECN marking and make sure the packets that go back don't get stuck, because you want to receive notification of congestion as soon as possible. So we advise you to map returning packets into a higher queue. In our designs, we usually put them into the CS6 queue, where they share the strict-priority queue with control traffic, right? So you make sure that they get back as soon as possible. They don't get stuck behind the data packets.
The basic idea of ZTR-CC is that the packets are transmitted with high-fidelity timestamps, and there are a number of timestamps. When they are received, they are timestamped again and sent back. So the sender can look into the timestamps and decide what the round-trip time is. If the round-trip time is within expectations, meaning there's no congestion, you don't need to react. If we see RTT increasing, that would mean there's either congestion on the network or the NIC is slowing down because of [indiscernible]. Not sure why it's moving. So practically, we added ECN to that to increase the fidelity of the signaling, because if there's congestion on the fabric, together with increased RTT you'll also see ECN marking. If it is the NIC, you will see only the RTT increase. The algorithm is much more complex; I'm trying to give you the logic of why we did it. Its convergence time is very, very fast, right? So it's much more suitable for AI workloads. And this is the congestion control mechanism we use today in Spectrum-X.
A very, very important part of machine learning clusters is the communication library, because this is really the heart and the brain of the system.
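The decision logic behind the timestamp-based congestion control described above can be sketched in a few lines. This is only an illustration of the reasoning as stated in the talk (RTT within expectations means no reaction; RTT growth together with ECN marks points to fabric congestion; RTT growth alone points to the endpoint). The speaker notes that the real algorithm is much more complex, so the class name, the thresholds and the multiplicative-decrease, additive-increase rate rule below are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class RateController:
    """Toy sender-side controller; every constant here is an assumption."""
    line_rate_gbps: float = 400.0
    base_rtt_us: float = 10.0      # expected RTT on an uncongested path
    rtt_tolerance: float = 1.2     # RTT above base * tolerance counts as inflated
    decrease_factor: float = 0.8   # multiplicative decrease when RTT is inflated
    increase_gbps: float = 20.0    # additive recovery step when RTT is normal
    min_rate_gbps: float = 1.0
    rate_gbps: float = 400.0

    def on_feedback(self, rtt_us: float, ecn_marked: bool) -> str:
        """Update the send rate from one RTT sample and classify the cause."""
        rtt_inflated = rtt_us > self.base_rtt_us * self.rtt_tolerance
        if not rtt_inflated:
            # RTT within expectations: no congestion, recover toward line rate.
            self.rate_gbps = min(self.line_rate_gbps,
                                 self.rate_gbps + self.increase_gbps)
            return "none"
        # RTT inflated: back off, and use ECN to tell where the delay comes from.
        self.rate_gbps = max(self.min_rate_gbps,
                             self.rate_gbps * self.decrease_factor)
        return "fabric congestion" if ecn_marked else "endpoint slowdown"


if __name__ == "__main__":
    ctrl = RateController()
    for rtt, ecn in [(10.0, False), (20.0, True), (20.0, False), (10.0, False)]:
        cause = ctrl.on_feedback(rtt, ecn)
        print(f"rtt={rtt}us ecn={ecn} -> {cause}, rate={ctrl.rate_gbps:.0f} Gb/s")
```

The design point the sketch tries to capture is that all state lives on the sender, so the switches only need to mark ECN and return feedback quickly; nothing per-flow has to be tuned in the fabric.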
The communication library decides on operations, how they are distributed, how traffic flows are sent and many more things. At NVIDIA, we use NCCL. It's pretty much the de facto standard in machine learning clusters; it stands for NVIDIA Collective Communications Library. It's supported across all the GPUs that we produce, from consumer-grade to H100s. It's an open-source project, and the code is available on GitHub. Take a look. We usually ship 3 or 4 releases a year. There's a lot of innovation going on there.
So why do we need communication libraries? When we start using more than 1 GPU, and it's a given that we need more than 1 GPU because of the size of the models, we need to allow GPUs to communicate with each other. There are different types of operations; we'll go into each very, very soon. In order to do it efficiently, you need something very intelligent that can decide how to send traffic, to whom to send traffic, how to react to delays or lack of acknowledgment and so forth. NCCL is topology-aware. When we say topology-aware, we don't really mean the detailed topology of the fabric itself, but the topology of the server. Because what you have is not just GPUs: you've got PCIe switches, you've got NVLink infrastructure, and eventually you have the outside network, which is either Ethernet or InfiniBand; from NCCL's perspective, they are pretty much the same. We'll look into NVLink in a second. Practically, NCCL understands the differences, and it has priorities with regard to how to communicate with other GPUs, because it understands the performance of each network. Since it knows the topology, it creates particular flows for different collective operations. So let's take a look at them. And again, this is fundamental to how we communicate, the size of the communication domain and the way operations are performed.
Data parallelism is the most common type of parallelism. It's used pretty much everywhere. Again, it's the simplest one. What you do is you just split the batch.
And then you perform the operation on all of it. After you are done, you do backward propagation, and you share gradients. It's a reasonably simple operation. However, in size, it's really large. So here, we are talking about flows of 100 gigs potentially. This is what creates these really huge single flows that might choke your fabric if you don't load balance properly. The way the operation works, after you've done the computations, you need to do all-reduce, or in more modern workloads, all-gather plus reduce-scatter to get all the information across to everybody else, right? So in order to do so, you wait for everybody to exchange all the information. So -- and again, if some part of your fabric is slower than another, everyone else is going to wait. Pipeline parallelism allows further data aggregation and an increased number of GPUs that participate. The communication pattern is point-to-point. And when used with data parallelism, you also need to do a sub all-reduce across all the GPUs that do DP. It's pretty small. It's very latency-sensitive because now you need to synchronize pipelines, which is a subpart of your total GPU processing. So again, later on, we'll see how it affects the network. Tensor parallelism is yet another way to further distribute and parallelize computations. It is sensitive to latency to the degree that we never do these operations over Ethernet or InfiniBand networks. Those operations are always done over NVLink. And then expert parallelism, which is a reasonably new type of collective. It's usually used with MoEs, or mixture of experts. It's a medium size of flows. However, it uses all-to-all. So everyone talks to everybody else in order to figure out what GPU to choose. And in itself, this traffic pattern creates some problems, which we optimize in NCCL. We'll get to that. So let's try to summarize what I explained. With a single GPU, there's no networking, right? Everything is performed in a single GPU. You compute your parameters.
You go backwards. You have your gradients. You are done. So not interesting for us. Let's take a look at data parallelism. And this is, as we said, the most common type of parallelization, which allows you to build really, really large domains, potentially of tens of thousands of GPUs. So what we do here, we split the batch across all GPUs, so we run the same model on all GPUs. And after we are done, we go backwards and we share all the local gradients. What's needed here is to increase batch size, because now we've got more GPUs. There's a dependency between batch size and accuracy of the computation. So at some point, when the batch gets too large, your computation gets less precise and it will actually require more compute time. So it's very important to understand how to position data parallelism versus other types. Pipeline parallelism, as we said, is a further step. It allows us to increase the number of GPUs because now we are not putting the same workload on all GPUs, we are splitting layers. So each GPU will run a subpart of a model and then eventually -- I'm not sure why it's moving -- and then we actually synchronize them. So practically, the interesting part for networking here, those are reasonably small flows. So they're single-digit megabytes, but they are very sensitive to jitter, because now we need to synchronize pipelines, not just workflows. So it's very important to provide the best networking for those kinds of collective operations as well as to reduce the domain, because if the domain is too large and you need to go to another type of network, so you go to another port, potentially you're running out of time. And then tensor parallelism, it's yet another way -- now we are distributing tensors, which is the smallest part, and it's a very, very complex process in itself. As we said, since now every GPU performs operations on tensors, we need to use all-gather to get all the data to everybody else, so we can move on. This operation is very, very time-sensitive.
So again, we are talking about single-digit nanoseconds, and we only do it over the internal network, which is NVLink. So let's take a look at all of these technologies to see how we implement so-called 3D parallelism. So it's all 3 techniques put together in order to scale. So here, we are looking at a GPT-like LLM with 1 trillion parameters. So if you look at DP, it's actually done across 3,000 GPUs, right? It's important that we group things, again because we are using 3D parallelization, we are not only using data parallelism. So the data parallelism domain size is 3,000 GPUs, all the GPUs in the cluster. Now pipeline parallelism, which is more sensitive to jitter, is done on a domain of 6 nodes. So it's 48 GPUs. Now tensor parallelism, which is the most sensitive one, is done on 1 node across 8 GPUs. And thus, all GPUs are communicating with each other over NVLink. They are not using the external network. That's fundamental to know. Now let's get to rail optimization and why this is important. Non-rail-optimized topology is what you build today in the data center: leaf/spine, super-spine. The most fundamental property of a non-rail-optimized design is that you've got a top-of-the-rack switch, which is in your rack, and all your servers are connecting to it. What it means, if you need to go to another rack, as leaf/spine topology would do, you'll need to go to the spine to get to another rack. So it allows you to use copper within the rack, which is cheap and much easier to cable. However, it lowers your performance, and potentially, if multi-server communication is needed, which is the case in machine learning, you'll get higher latency and your spines potentially will get more traffic, up to the degree they can get congested. And it doesn't get congested in Spectrum-X, but it could in any other network. So let's take a look at rail-optimized topology. It's really defined by how GPUs are connected and how NCCL sees the topology.
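
Going back to the 3D parallelism example above (tensor parallelism across the 8 GPUs of 1 node, pipeline parallelism across 6 nodes, data parallelism across the whole cluster), the grouping can be sketched as simple rank arithmetic. This is only an illustration under stated assumptions: the cluster size of 3,072 is an assumption picked because it divides evenly by 48 (the talk says roughly 3,000), and the rank layout with the tensor-parallel index varying fastest is one common convention, not something the talk specifies.

```python
TP = 8             # tensor-parallel group: 1 node, 8 GPUs over NVLink (from the talk)
PP = 6             # pipeline-parallel domain: 6 nodes (from the talk)
TOTAL_GPUS = 3072  # ASSUMPTION: the talk says "3,000"; 3,072 divides evenly by TP * PP

GPUS_PER_REPLICA = TP * PP         # 48 GPUs hold one full copy of the model
DP = TOTAL_GPUS // GPUS_PER_REPLICA  # number of model replicas exchanging gradients


def coords(rank: int) -> tuple[int, int, int]:
    """Map a global GPU rank to (dp, pp, tp) indices.

    Assumed layout: tp index varies fastest, then pp, then dp.
    """
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // GPUS_PER_REPLICA
    return dp, pp, tp


if __name__ == "__main__":
    print("GPUs per model replica:", GPUS_PER_REPLICA)
    print("data-parallel replicas:", DP)
    for r in (0, 7, 8, 48, TOTAL_GPUS - 1):
        print(f"rank {r:4d} -> (dp, pp, tp) = {coords(r)}")
```

Under these assumptions, ranks 0 to 7 share a node and talk over NVLink only, ranks 0 to 47 form one pipeline, and ranks that differ only in their `dp` index exchange gradients across the external network.
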
So the main rule here: all GPUs of the same rank on each server go to the same switch. So if you have 8 servers in a rack and all of them have GPU #1, all those GPUs will go to the same switch. In rack #2, you'll have again 8 servers with 8 GPUs #1, and they'll go again to the same switch. So in this case, there is no ToR switch. There's a switch that we sometimes call MoR, which is middle of the row, or EoR, which is end of the row. Practically, there's a switch that's located somewhere in the middle of your racks, and you use optics to connect to the switch. What does it allow you to do? It allows all GPUs of the same rank to communicate over a single switch. So you don't need to go to the spine to get to another rack. It's single-hop communication, and you only need to go up to the MoR to get down. So there's no congestion. There is no collision. The traffic pattern is pretty much 1:1. And it is very, very fundamental. It gives you exactly the same latency, same behavior, 0 jitter and allows much faster communication. And there are a couple of features implemented that make it even more valuable. So let's talk about PXN and scale-up vs scale-out. Scale-up is a technique to increase the number of resources or capacity within the server. Scale-out is a technology to increase the number of devices in parallel. So I've been talking about NVLink a lot. What is NVLink? NVLink is a low-latency, high-speed technology that's implemented today within the DGX server, so within a server that has GPUs. It has 9x more bandwidth than Ethernet. So each endpoint can generate 3.6 tera of traffic point-to-point as compared to 400-gig if you use a NIC, right? So we would like to use NVLink as much as we can because it's much faster than the outside network. By doing rail optimization, we've already optimized the outside network. So when we do all-reduce, all-gather, it goes over the same rail, but now we are going to use only a single switch.
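
The rail rule described above (every GPU of the same rank connects to the same switch, regardless of rack) can be sketched as a small mapping. This is an illustrative model only: the `rail-N` switch names, the `(rack, server, gpu)` tuples and the hop counts of 1 and 3 are assumptions used to contrast the two designs, and the model covers a single group of rail switches rather than a full multi-tier fabric.

```python
def rail_switch(local_gpu_rank: int) -> str:
    """Rail-optimized rule: GPU rank N on every server connects to switch rail-N."""
    return f"rail-{local_gpu_rank}"


def switches_traversed(src: tuple[int, int, int],
                       dst: tuple[int, int, int],
                       design: str) -> int:
    """Count switches between two GPUs on different servers.

    Endpoints are (rack, server, local_gpu_rank).
    'tor'  : each rack has its own top-of-rack switch (non-rail-optimized).
    'rail' : each GPU rank has its own switch shared across racks.
    """
    src_rack, _, src_gpu = src
    dst_rack, _, dst_gpu = dst
    if design == "rail":
        # Same rank -> same switch; different rank -> leaf, spine, leaf.
        return 1 if rail_switch(src_gpu) == rail_switch(dst_gpu) else 3
    if design == "tor":
        # Same rack -> same ToR; different rack -> leaf, spine, leaf.
        return 1 if src_rack == dst_rack else 3
    raise ValueError(f"unknown design: {design}")


if __name__ == "__main__":
    a = (0, 0, 1)   # rack 0, server 0, GPU #1
    b = (1, 3, 1)   # rack 1, server 3, GPU #1
    print("ToR design :", switches_traversed(a, b, "tor"), "switches")
    print("Rail design:", switches_traversed(a, b, "rail"), "switch")
```

For collectives that keep traffic between GPUs of the same rank, the rail design stays on one switch even when the servers sit in different racks, which is the property the speaker highlights.
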
We are not going to use 3 switches in order to communicate. Now as we said, we see proliferation of all-to-all communication because of mixture of experts and some other technologies. So traditionally, when you do all-to-all, and all-to-all means that everybody is talking to everybody else, you would go out to the external network and talk to everybody else. So you will fully saturate all your network. With the feature we call PXN, which was implemented in NCCL 2.12 about a year ago, what we are doing now is we are not traversing the external network to get from one rail to another. We are using the NVLink switch, or the NVLink topology internal to the server, to traverse to the destination rail. So as you see here in the picture, when there is PXN, we are not going to go out. We are going to use NVLink to get, in this case, from GPU 0 to GPU 3 and then use the rail-optimized topology to get to GPU 3 on another server. So what have we saved? We've saved bandwidth because now we are using NVLink. We have saved a number of hops traversed, because GPU 3 can communicate with GPU 3 in another server over the same rail. So it's actually on the same switch, right? So please do build your machine learning clusters rail-optimized. There's a huge amount of improvement in terms of performance, in terms of latency. And when you look at it from the total cost of ownership perspective, a kind of high-level view, your intuition might tell you it's going to be much more expensive, because instead of using copper in the rack, I'm going to use optics to the middle-of-the-row switch. This is true. However, what you might also do is build copper infrastructure between leaves and spines. Now you are looking into limitations with regards to how many cables you can put, as now you are cross-wiring across potentially 2 or 3 racks, given that you can go 2, 3 meters with active copper potentially.
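
Going back to the PXN path described above (GPU 0 on one server reaching GPU 3 on another), the difference with and without PXN can be sketched as a path computation. This is a toy model, not NCCL code: the `path` function, the hop labels and the assumption that a cross-rail transfer without PXN crosses leaf, spine and leaf are illustrative only.

```python
def path(src: tuple[int, int], dst: tuple[int, int], pxn: bool) -> list[str]:
    """Return the hops between two GPUs given as (server, local_gpu_rank).

    Assumes a rail-optimized fabric where GPU rank N uses switch rail-N.
    """
    src_server, src_gpu = src
    dst_server, dst_gpu = dst
    if src_server == dst_server:
        # Same server: stay on NVLink.
        return [f"nvlink:gpu{src_gpu}->gpu{dst_gpu}"]
    if src_gpu == dst_gpu:
        # Same rank on another server: one rail switch.
        return [f"rail-{src_gpu}"]
    if pxn:
        # Hop over NVLink to the local GPU on the destination rail first,
        # then cross the fabric on that single rail switch.
        return [f"nvlink:gpu{src_gpu}->gpu{dst_gpu}", f"rail-{dst_gpu}"]
    # Without PXN: leave on the source rail and cross the spine.
    return [f"rail-{src_gpu}", "spine", f"rail-{dst_gpu}"]


def switches_crossed(hops: list[str]) -> int:
    """Count only network switches, not NVLink hops."""
    return sum(1 for h in hops if not h.startswith("nvlink"))


if __name__ == "__main__":
    src, dst = (0, 0), (1, 3)  # server 0 GPU 0 -> server 1 GPU 3
    for enabled in (False, True):
        hops = path(src, dst, pxn=enabled)
        print(f"PXN={enabled}: {hops} ({switches_crossed(hops)} switch(es))")
```

In this model, enabling PXN turns a three-switch path into one NVLink hop plus a single rail switch, matching the saving in hops the speaker describes.
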
Practically, it will allow you to keep your cost structure very similar, because now you replace all of the downstream links with optical, so server to MoR, but MoR to spines are copper links. So your cost structure is actually the same, because we are using the same number of downlinks and uplinks so as not to create any congestion in the network. So these are very serious considerations when you are designing your data center. Let's talk about adaptive routing. So as we said, in a traditional data center you have high entropy, a lot of flows, so you can actually do per-flow load balancing pretty well. But usually, your deviation amounts are 5%, 6%, 7%. You go into the AI workload space, you do data parallelization, you create huge 100-gig flows. And again, we use queue pairs in RDMA that usually are translated to a single flow because the UDP destination port is set to RoCEv2. So we're only left with the 16-bit space in the UDP source port. And this is what we use for entropy. Since the mapping is 1:1, you'll see very few flows. But practically, if you do per-flow load balancing, you will eventually hash 2 large flows onto the same link. So you'll try to push 800-gig over a 400-gig link. You'll start buffering, you will start dropping traffic. Performance is gone. What we do with adaptive routing in Spectrum-X, we do much more granular load balancing. So it's not per flow. It could be as granular as load balancing per packet. Why can we do it? We understand which flows can be reordered and which flows cannot be reordered. Very importantly, for flows that can be reordered, the BlueField SuperNIC provides the right data placement. So DDP as a technology, again, it's always RDMA, remember. So you're actually reading or writing memory. We know where to put the packet in memory, because you receive a pointer, if you want, right? So you don't need to buffer it, as some other technologies do, in order to do reordering before you give the data to the application.
You can place the data directly into memory because you've got a pointer into the memory allocation. So this allows you 0 copy, low latency, no buffer, the ability to receive packets out of order and place them into memory. So a combination of adaptive routing and DDP allows you to run your fabric really, really hot, 95% plus, without a single packet drop, without delays, without queues. This is kind of a fundamental feature of Spectrum-X. And if you look at what we measured here, what you see here is an example -- actually, same fabric, same switches, one with AR enabled, another one with AR disabled. So when no AR is enabled and you've got an AI workload with low entropy, as you start receiving additional flows, there's a high degree of probability to put them on the same link. So you see that when this happens, you start to victimize new flows -- so the old flow might actually take the full link because it's not larger than 400 gig. However, when the next flow is hashed onto the same link, you suddenly see performance going to half and then even lower. So you start buffering and eventually dropping traffic. If you use AR, since all packets are distributed evenly across all uplinks, you can go up to very close to the theoretical limit. And then it results in a much shorter job completion time and a higher quality, higher capacity network, right? So the difference could be as much as 50%. Adaptive routing as such is a technology local to a switch. So it's aware of the use of uplinks. It's aware of queue occupancy on uplinks. It's not aware of anything that's happening somewhere further in the network. What we recently implemented in Spectrum-X is resilient adaptive routing. So now we can signal from the remote side of the fabric if there's some asymmetry or there's a failure and we have less capacity. And when this happens, we rebalance the traffic, so the healthy part of the fabric is getting more traffic and the affected part of the fabric is getting less traffic.
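
Going back to the low-entropy problem described above, a rough calculation shows why per-flow hashing collides so easily when there are only a few large flows. The sketch assumes each flow is hashed uniformly and independently onto one of the uplinks (an idealization of ECMP hashing) and compares that with an idealized even per-packet spread; the function names and the example of 8 uplinks are assumptions for illustration, not measurements from the talk.

```python
from math import prod


def p_collision(flows: int, uplinks: int) -> float:
    """Probability that at least two flows land on the same uplink,
    assuming uniform, independent per-flow hashing."""
    if flows > uplinks:
        return 1.0
    p_all_distinct = prod((uplinks - i) / uplinks for i in range(flows))
    return 1.0 - p_all_distinct


def per_packet_load(flow_rates_gbps: list[float], uplinks: int) -> float:
    """Load per uplink if every packet is spread evenly over all uplinks."""
    return sum(flow_rates_gbps) / uplinks


if __name__ == "__main__":
    uplinks = 8  # assumed number of uplinks on a leaf switch
    for n in (2, 4, 6):
        print(f"{n} flows over {uplinks} uplinks: "
              f"P(collision) = {p_collision(n, uplinks):.3f}")
    # Two 400-gig flows hashed to one 400-gig link would offer 800 gig to it;
    # spread per packet over 8 uplinks, each link carries far less.
    print("per-packet load:", per_packet_load([400.0, 400.0], uplinks), "Gb/s per uplink")
```

Even with only a handful of flows the collision probability grows quickly under these assumptions, which is the situation where two large flows end up sharing a single link while other uplinks sit idle.
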
This feature works in combination with congestion control, so when there is a failure, you reduce the transmit rate, but then because you're rebalancing, you converge very fast. In other words, you are not leaving bandwidth on the table. And if you look at the comparison, again, exactly on the same fabric with the [ RAR ] feature enabled, the larger the failure on the remote side, the more advantage there is to using this kind of technology that combines local behavior with AR on the switch plus remote failure notifications, which allows you to rebalance and converge very, very fast. So again, more on reference architecture, and what we advise you to do. If you build a fabric that's no larger than 8K GPUs, really 8K endpoints, which allows you to have 2 Tier fabrics, you can save on the number of interconnects because you don't need to dedicate links for upstream connectivity to super-spines. So 2 Tier is a good choice. Make sure you have no further plan to extend your network because it's very, very difficult to have a 2 Tier fabric and then suddenly to build 3 Tier on top of it. It's pretty much a complete forklift, right? So if you plan to build a fabric that's smaller and you are fine with managing a single structure, this is the most cost-optimized topology, and it can scale up to 8K endpoints. So 3 Tier fabric, or leaf/spine/super-spine: if you are going to use 8K or more GPUs, this is a no-brainer. Otherwise, you just don't have enough ports. However, there are more considerations. This allows you to build a PoD structure. PoD stands for point of delivery. And in the hyperscaler world at least, this is how you build your infrastructure. You increase the number of PoDs and the new version of the PoD. So if you want to change hardware, if you want to change the feature set, you've got your self-contained structure, that is the PoD, and you can do it there. And there your NNI is very, very clean to the super-spine level.
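
The 8K-endpoint limit for a 2 Tier fabric mentioned above is consistent with standard non-blocking folded-Clos arithmetic, sketched here. The talk does not state the switch radix; the value of 128 ports is an assumption chosen because it reproduces the 8K figure, and the formulas are the textbook ones for non-blocking leaf/spine and fat-tree fabrics rather than anything specific to Spectrum-X.

```python
def max_endpoints(radix: int, tiers: int) -> int:
    """Maximum endpoints of a non-blocking folded-Clos fabric.

    2 tiers: each leaf uses radix/2 ports down and radix/2 up, and each
             spine port connects one leaf -> radix * radix / 2 endpoints.
    3 tiers: classic fat-tree result -> radix**3 / 4 endpoints.
    """
    if tiers == 2:
        return radix ** 2 // 2
    if tiers == 3:
        return radix ** 3 // 4
    raise ValueError("only 2 or 3 tiers are modelled here")


if __name__ == "__main__":
    radix = 128  # ASSUMPTION: ports per switch, not stated in the talk
    for tiers in (2, 3):
        print(f"{tiers} Tier, radix {radix}: {max_endpoints(radix, tiers)} endpoints")
```

With this assumed radix, a 2 Tier fabric tops out at 8,192 endpoints, which is why anything larger needs a super-spine layer.
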
So again, if you are building a larger fabric, if your management structure is scattered over PoDs, you might have some very physical limitations. You might have limitations with regards to power supplies and cooling, which really means you will build separate rooms or separate parts of your data center that can cater to GPU workloads, right? Remember, GPUs consume much more power and cooling than regular servers. So your density is going to be lower. Placement within the data center, how you group, is very, very important. So in this case, this might lead you to 3 Tier anyway because you will have physical limitations as to the size of groups and how you place them. So the super-spine level or groups of super-spines will allow you to provide this physical infrastructure with a common interconnect level. So again, network design -- proper network design [indiscernible] is very, very important. Put some thinking into it, how you're going to do it. Think about not only what you do today, but what you are going to be doing 3 years from now. Think about power consumption and cooling, because the next generation of GPUs is going to potentially consume more electricity. So your density might decrease. You might use a different type of hardware. So all of these are very, very important considerations, and the span of your physical design might be different than in regular data centers. So multi-tenancy on Spectrum-X: obviously, if you are a cloud, you want multi-tenancy. This is how you make money. So in Spectrum-X, multi-tenancy is provided by BGP EVPN, a technology we all know and love. BGP EVPN could be implemented on the switches themselves. We advise using a route Type 5-only infrastructure, so routed BGP EVPN only, no route Type 2, 1 and 4. We do support bridged EVPN because some people really require it because of how they provision their tenants and whatnot. Practically, if you want to build the most optimized infrastructure, do routing only. So route Type 5-only infrastructure.
If you don't want to do it on the switches, you have the BlueField SuperNIC, which has local ports, local memory and the capacity and ability to do the overlay from the host. So if you look at any hyperscale cloud, they usually do virtualization on the hosts, because they have a NIC, FPGA, SuperNIC, something that has more than just basic traffic-sending capabilities, that allows you to provision tenants independently of the core infrastructure. So we provide exactly the same capabilities on Spectrum-X. So your choice is, you can do network overlay or host-to-host overlay. This is how we fine-tune, and we'll go into testing in a second. So we start with hardware testing. Really, stuff needs to connect and talk to each other. We look at basic RDMA operations, so reads and writes. Then we optimize with AI primitives, which is really NCCL operations. So each layer has its own suite of tools to figure out how well it performs at a particular level. And then the top of the pyramid of needs is really running the model itself. So the faster you can train the model, the better everything is, from GPU to network to any infrastructural components of it. So this is what really matters to end consumers. The faster you can train the model, the faster you can start using it in your business, right, the faster you can retrain it. So let's summarize what Spectrum-X is. It's a networking platform that's optimized and tuned for AI, allows you to run the fabric at up to 95-plus percent, and allows you to program your congestion control in the ways you like. We didn't talk about it, but the infrastructure is programmable to the degree that you get APIs -- a layer of APIs to completely program your own congestion control if you wish to do so. And many clouds do so. They have got enough brilliant people to do something better than we do. So if you want to program your congestion control, we give you APIs to do so. Traditional fabric: loosely coupled, built for TCP/UDP applications.
Load balancing is done by 3/5-tuple hashing, good for, again, a whole variety of different flows, different characteristics and a low-utilization fabric. If you go to high utilization, low entropy, it doesn't work anymore. And CC is something standard you get, [indiscernible] or anything else you use, which is good for kind of basic fabrics, but not good enough for AI fabrics. Let's take a look at the evaluation of technologies, what we test and what we see, right? So again, we've got kind of very basic tests and we go up in evaluation. So we'll start with basic RDMA bisection bandwidth and latency, writing and reading on both local and remote sides. And we're using overclock. So it's kind of self-contained, unidimensional, that's very easy. Then we go into the NCCL case, and we test all the operations, so point-to-point, any-to-any and any variation of this. What we also do, we introduce noise. What does noise really mean? There are more sources of traffic that inject traffic randomly into the fabric. Why do we do this? We want to make sure that a workload that is run by a particular tenant or subpart of the fabric is not affected by anything else, the so-called noisy neighbor syndrome, right? So we guarantee performance. And then we'll look into end-to-end training times, how long it takes and how we optimize actual training, given that it's sort of in operation and even if you take basic LLM -- so we've been doing LLM for a couple of years, from kind of 3D parallelization to FSDP, which is an optimization of 3D parallelization. Now there's mixed precision endpoints and a lot of stuff that wasn't there before. So all this testing is done to prove that what is actually exposed to end customers, the amount of time to train the model, is decreasing. And again, the faster you do it, the more you can do with it. So this is RDMA bisection bandwidth. The basic test is 1 QP, so 1 flow. You could see here the theoretical peak, which is 380-gig on a 400-gig network.
We are getting to pretty much the same value as the theoretical peak. And it's 4.6x higher than what you would get on a non-Spectrum-X IP Ethernet fabric. From a latency perspective, we see very similar results, because there is no congestion in the network. There is no queue buildup. The traffic is evenly load balanced across all ports available. The latency, on average, is 4.5x lower than if you don't use Spectrum-X. We are going up a layer now into NCCL testing with noise. And the amount of noise is really significant. We've got 22 participating nodes in a particular operation, in this case, all-to-all, and 52 nodes injecting traffic to create noise and trying to disturb the workload. So in this case, you could see that the bandwidth of Spectrum-X is 1.5x higher than that of non-Spectrum-X. And in NCCL, all-to-all -- sorry, all-reduce in this case, with a similar degree of noise, you see that it's about 2.5x higher than traditional or non-Spectrum-X traffic. And then the last, probably most important thing here: we are testing real workloads or real training models, right? So you see NeMo LLM with 43 billion parameters. It's 1.2x faster on Spectrum-X than otherwise. Very importantly, as we go towards FSDP, which is more -- so it's an addition, again, on top of what you can do with data parallelization, with fully sharded infrastructure. You see that now the advantage of Spectrum-X, it's 1.7x faster than if you would do it on a non-Spectrum infrastructure. And that's all from my side, and if there are any questions, I'm going to switch to Q&A. Where are we? We are here. Yes.

Gadi Godanyan

executive
#7

Thank you, Jeff. Okay. So just right before the Q&A, I just want to thank you all for joining today and remind you that we are here and the account managers are also here for you in case you want to dive a bit deeper and get more information. As I mentioned earlier, in the resource tab, you should also be able to see our BlueField session. It's called Accelerate AI Cloud Computing, and also the same session for Spectrum-X, which was recorded as part of this workshop. So feel free to look at that. We'll provide the recording of the session today for you to watch at any time that you want. Last thing I would say, we're also working on a follow-up workshop with white-boarding and using NVIDIA Air, logging into the switches, building your own environment and running your own configuration. This workshop will be done physically, face-to-face, with partners and customers, whoever wants to and feels the need to go deeper and is looking to deploy it sometime in the future. So please feel free to reach out to me or to your account managers, asking for the second workshop on networking for AI, and we will be here to work together with you and collaborate. Now I'll stay on mute, and we'll be here in the chat, answering a few of your questions. Thank you for joining us today. Have a lovely rest of the day.

Jeff Tantsura

executive
#8

So are we going to questions now or what...

Gadi Godanyan

executive
#9

Yes. So we'll use the chat for questions. Yes. So you can just unmute, close the video. And if we find any specific question, we can broadcast it verbally. But yes, let's see.

Jeff Tantsura

executive
#10

So verbally, it's much easier than typing. So any official documentation on rail? Absolutely, if you go to the NVIDIA website and you look for NCCL, there is a description of how NCCL works. There is a description of the topology knowledge. So it's all publicly available. [indiscernible] with adaptive routing. Congestion control is end-to-end. So it's from NIC to NIC. Adaptive routing is happening on the switches. So as such, there are no direct dependencies, and that's done very intentionally. If you start building dependencies between different technologies, the complexity is astonishing. So we know that congestion control will kick in RTT-based, pretty much within 2 RTTs, right? So you know, almost immediately, there is congestion. Adaptive routing as such will react to local conditions on the switch. So when we see the congestion on the switch, as well as receiving a signal from the remote side of the fabric that there's reduced bandwidth, which takes time to propagate because it's in the control plane. So practically, there is no tight coupling, but the whole idea of this 24/7 testing is to understand how the different features interact with each other and fine-tune them such that they don't get in each other's way, but contribute to each other's better performance. I think we talked about RTT [ probe and ] packet. So adaptive routing is not interacting with them. Cisco partnership: you will see a further announcement in a couple of weeks. We are not going to talk about this here. And I don't see any more questions.

Gadi Godanyan

executive
#11

Okay. Thank you, Jeff. I think with that we're happy to close the session. Thank you, everybody, for your time today.

Jeff Tantsura

executive
#12

Thanks, everyone.
