NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary
September 11, 2023
Earnings Call Speaker Segments
Operator
Ladies and gentlemen, the program is about to begin. [Operator Instructions] At this time, it is my pleasure to turn the program over to your host, Vivek Arya.
Vivek Arya (Analyst)
Thank you so much, and good day, everyone. Glad you could join us for this afternoon keynote session. I'm really delighted and honored to have Ian Buck, General Manager and Vice President of NVIDIA's accelerated computing business. Also, importantly, he is the inventor of CUDA, which is a key operating system underlying every NVIDIA accelerator. So I'm really glad to have some time with him so he can share his perspective. So Ian, I'll turn it over to you. I think you have one opening remark.
But what I would really love to do is get your perspective on how the requirements for AI hardware have changed throughout your tenure at NVIDIA. We always talk about hardware, but sometimes we forget that a very key part of that is the software ecosystem. So could you give us a perspective on how NVIDIA's software capability has really helped to cement your dominance on the hardware side in AI?
Ian Buck (Executive)
Yes. Thank you, and it's a pleasure being with you here this morning. And of course, as a reminder, this presentation contains forward-looking statements, and investors are always advised to read our reports filed with the SEC for information related to risks and uncertainties facing our business. Yes, we've been working on accelerated computing for quite some time. In fact, it dates all the way back to 2006, when we first introduced CUDA. Initially, the goal was to address how to program this new kind of architecture, this new kind of processor, one that had reached a level of programmability beyond just playing video games and rendering beautiful pictures and could become a computing platform. It is a place where we can accelerate not every workload (to this day, we always want to make sure we have the best CPUs matched with our GPUs, in the right configurations and the right ratios) but the portions of the computation that can be accelerated, which are typically highly data parallel, massively parallel or just compute intensive. We work with the community to figure out how to accelerate those workloads and run them on an architecture that's designed for high-compute, high-throughput needs. It started, of course, with high-performance computing, a community that uses computers, in some cases supercomputers, to simulate nature, to simulate physics, to simulate a problem that can't easily be studied in a wet lab or under a microscope, or that plays out at the scale of the earth or the cosmos, where a computer can be a digital instrument, an instrument of science. From 2006 up until that first AI moment in 2012, we made our platform available as a software platform. We made CUDA available with every one of our GPUs, including the gaming and graphics GPUs, the ones everyone had in their workstations, their laptops and their PCs.
And that was, of course, before a lot of the cloud traction. By building a software platform that engaged developers, rather than strictly a hardware platform that defined an ISA, we met the developers where they were. So it made it very easy for researchers, PhD students, engineers and companies to take their NVIDIA GPU, download CUDA for free, along with all the libraries and software that have been developed over time, and figure out how to apply it to their problem: to port their code, whether it be C, Fortran and, today, Python, Java and others, and move that [indiscernible] portion over. That decision up front to make it a software platform in combination with a hardware platform was really important for a couple of reasons. First, it met the developers where they were. We didn't have to wait for others to build a software system around it; frankly, it would have been difficult for them to do so and would have taken them a long time, given the bootstrapping problem. Second, it expanded the innovation space. We can innovate at the hardware layer, at the compiler layer, the system software layer, the library layer. And of course, everyone else has the opportunity to contribute as well, so the performance delivered over time is the compounding of all of those innovations in the hardware, the system driver, the developer software and, of course, all the libraries on top. If you track that progress over time, it's quite dramatic. That's the benefit of accelerated computing. It allows compounding value to be delivered. It also allows NVIDIA to innovate at an extreme clip. We are not constrained by the interface at these lower levels, like instruction sets; we're only constrained by the problems we think we can address up here.
And if that requires us to change our architecture, to change our instruction set, to build a totally different kind of GPU, or to build a GPU that can talk to other GPUs over NVLink and scale across the GPUs in a system, the GPUs in a rack or the entire data center, then because we have defined the interface up here and how we engage, we can do all that and do it at an extremely rapid clip. That allows our engineers to produce new GPU architectures roughly every 2 years now, in some cases sooner. It allows us to think differently about how CPUs and GPUs want to be connected, and it also allows us to expand to the entire data center being our canvas for innovation. So that first decision, I think, to basically think about it from a different engagement point up here, has allowed us to really innovate and move quickly, and we invite everyone else to participate in that ecosystem. And we've been doing it, well, I guess I'm approaching about 20 years at NVIDIA now. So there you go.
Vivek Arya (Analyst)
What part of that software stack, Ian, is substitutable? For example, in the early days, it made a lot of sense, right, to couple the two. But now you have so many other people who are also involved in the ecosystem, whether it's the hyperscalers or the software R&D teams of many of your hardware competitors. So what part of your software ecosystem is substitutable? Can I take an application written for NVIDIA and find a way to port it over to somebody else's hardware, for example, using a combination of these third-party tools and other open source software?
Ian Buck (Executive)
Yes. Great question, and I get asked this a lot. Certainly, it is possible to take 1 workload or 1 AI model or 1 specific algorithm and get it working on anyone's hardware platform. What makes it hard is to make it a platform for continuous optimization and evolution, a platform that can run all the workloads that one would run inside of a data center. So today, if you look at our software stack, we have, of course, multiple hardware platforms, ranging from PCI cards that are 70 watts and fit in any server [indiscernible], to larger 300-watt PCI cards, up to HGX-based boards, which have multiple GPUs talking over NVLink. And we have even shared how we can scale GPUs effectively to an entire rack or even a row. Then on top of that, you have, of course, the system software, all the compilers and all the libraries, which then get integrated into all of the open ecosystem. These include the hyperscalers' software like PyTorch, software like paxml, and the wonderful part about AI is that it's so open that we can all innovate together in that ecosystem. So it's certainly possible to spike different implementations of different models into those stacks. What makes it hard is that those platforms I've mentioned need to run all the different workloads that operate today across the entire data center. You don't build a data center for 1 model. You're going to run a data center to do large language models, to do all of generative AI, as well as the other data science and other use cases you need. You also want to accelerate end to end. Often, I'll see someone inspect a particular layer or a particular model. But to deploy an AI service, we have to do all of the ingestion and data prep, run the query, run the model and produce the output and, in some cases, perform multiple other stages of AI, like when I want to have it talk back to me instead of just replying in text. That's also now being done in AI.
The other part is that it has to be a platform for innovation, because large language models and generative AI are not standing still. A few years ago, I'd still be talking to you about ResNets, about convolutional neural networks, about U-Nets and other recommender things. These things are still important. But with so many people innovating inside of LLMs and some form of generative AI, models are being innovated there at a clip that's way faster than we're actually producing new architectures. So in order to be a platform for that, and of course you have to be investing at data center scale, which is a huge capital investment and takes lots of time, you need to be a platform where you can trust that the innovations happening in generative AI are going to run really well. And again, that comes down to the end-to-end performance and optimizations that we're trying to make. Certainly, you can port one model. To run all the models and be the innovation platform is a much more challenging task, and one that requires a connection, benchmarking and all of those customers giving you that input in order to keep improving your platform over time. And we find optimizations from everywhere. One of the benefits and fun parts about working at NVIDIA is that we get to work with all the different AI companies. So we get to optimize the layers of the stack that matter. There isn't just one part of the stack that can simply be replaced in order to port. You really have to get to the end-to-end workload. And again, it is possible, but it's challenging to make it sustainable.
Vivek Arya (Analyst)
Now let's talk about generative AI. Obviously, it has caught everyone by surprise in a good way, right? And demand still seems to be exploding. So maybe talk first about training and then generative AI inference. On the training side, it seems like every day somebody is launching yet another large language model, and NVIDIA dominates the market for training a lot of those models. Do you see a point at which we get to some kind of cliff or maturation in demand for training? And do you think that as people start to look at optimizing the size of these models, that actually somehow puts pressure on the demand for training hardware? How sustainable is the demand for AI training when we are already producing so many large language models?
Ian Buck (Executive)
Yes. Large language models are different. What made them so, and why are they so large? That's one question you could ask. Unlike computer vision models in the past or the simpler recommender models, large language models are effective because they're typically interacting directly with humans. And in order to do that, they need to understand human knowledge. So one of the reasons why GPT is so large is that it's trying to interact with us using the corpus of human understanding. They download the Internet, if you will, and they teach it what humans know so we can have a starting baseline, a foundational model that captures human understanding and knowledge. That is obviously much larger than what you would need for a computer vision model, which is still very important but can be trained on a set of images and eventually come to know what those images are. So they tend to be very large models. They also tend to be foundation models for specialization: you can specialize for different workloads, starting from that foundational model and moving toward your own data. So you're starting from something that understands humans, or understands how to interact with humans, and then you take it to your proprietary data and are able to interact with that data, to ask questions of it, and of course leverage the general capability. So when you ask about the capacity and how this is going to grow over time, this is it. It is how you interact with computers, with the cloud, with your data. And that's immensely valuable.
It's immensely valuable for improving how customers want to interact with companies, and for people who are helping customers, who want an assistant sitting right with them so they can ask it questions and get prompted information from a knowledge base and other sources to deliver a better experience. Large language models also enable recommenders: people who provide content, whether in your news feed or in e-commerce, can get the right words in the right context shared with you. So it literally touches every part of e-commerce and of company interactions with customers, and it is sort of the answer to understanding the decades of big data we've been living in. Does this tail off? I think it becomes a continuous space for innovation, just across the board. There's not going to be 1 model to rule them all. There'll be a large diversity of different models based upon the innovations that are going to continue in this space, and also specialization across all of these fields. And by the way, we're seeing it in health care, in science, in drug discovery. A large language model doesn't have to use just the language of humans; it could be the language of biology or physics or material science as well. And so what is the growth vector, what does it look like? It's the rate at which innovators are adding, defining and inventing new optimization techniques and new kinds of models. It may start from these heroic, amazing models coming from people like OpenAI with the GPT models and what we're seeing from there. But much of this research is being published, or the models are being published, and they then influence and create alternatives or derivatives. So that is the scale we should be thinking about for generative AI and large language models. And the scope isn't necessarily the size of the model per se.
They're going to remain large in the sense that they have to remain large in order to have a baseline level of foundational intelligence. Really, the scale will grow as more and more industries, more and more companies and the rest of enterprise adopt this technique for how they interact with customers' data and apply it to their businesses. Certainly, the hyperscalers were the first to jump on it. They obviously had the talent, the capital and the ability to basically invent much of this technology side by side with NVIDIA. I got to experience that, and it was a fascinating experience. They continue to do so, and continue to push the limits and figure out how to apply it. Basically, we can see them starting to scale AI across their businesses, and now it's branching out to the rest of enterprise, the rest of the industry, and you're seeing a whole tier of more cloud offerings. We're seeing specialty regional GPU data centers popping up everywhere to serve the market; they operate differently, a little more agile, perhaps a bit smaller, but they can be more focused. And then there is a long list of middleware, end solutions and software companies that are trying to help enterprises and other companies deploy this technology across the board. So there's definitely a broadening of the large language model ecosystem. The adaptation of generative AI and language models to business is really the scaling factor that we have experienced, and that will continue for sure.
Vivek Arya (Analyst)
Now a similar question, but applied to the generative AI inference side: what is NVIDIA's strategy for generative AI inference? The perception is that on the training side, the company dominates, but most of the products are very expensive. So when it comes to really scaling generative AI inference, which is really, I think, the way your customers will monetize that at the end of the day, how are you going to help them monetize it? What does your product pipeline look like to help them with gen AI inference? And does the competitive landscape change as you move from training to inference?
Ian Buck (Executive)
So thank you for that question. I think people often get a little bit confused, perhaps. Certainly, the starting point for deploying some of these models is the training clusters. So customers will stand up infrastructure, previously A100 HGX systems. These systems are designed for 8 GPUs, NVLink connected, for the maximum possible performance, and of course have InfiniBand to scale across an entire data center. Today, that's being deployed with Hopper [indiscernible]. That is the natural platform to do inference on, since training and inference are highly related: in order to train a model, you have to first infer, then calculate the error and apply the error back to the model to make it smarter. The first step of training is inference, and it happens repeatedly. So it is natural that customers are deploying their inference models on their training clusters, on their HGX systems. It's not the only place we see inference. We see inference happening across the spectrum, all the way down to the L4 GPU, which, I should have brought one, is a 72-watt GPU. It's a half-height, half-length card, about candy bar size. It's smaller than my phone and fits in any server, so any server with a PCI slot can now become an accelerated server. In fact, we have seen the clouds adopt it, and the OEMs and the rest of the ecosystem, because it's great for inferencing. It has video encode and decode capability, so we're seeing it used for smart city applications and image processing. You can also run small LLMs for recommenders or small tasks. And we also see it used for generative AI, for image generation, for running Stable Diffusion-like models. And it's at a price point that's very comparable to CPUs, so in many cases it offers a much better TCO than a CPU running the same model. We have plenty of material on that.
If you need to go up a click, you have the L40, which is a full-sized PCI card and is often used for larger inferencing and fine-tuning tasks. Taking an existing foundation model and fine-tuning it, doing that sort of last-mile specialization for your data and workload, is a much lighter task than what needs the larger training cluster, and it can be done on an L40 or L40S PCIe-based server, again available with every OEM system. So these provide different price points and different capabilities, and you can click all the way up to an NVLink-connected system. On NVLink-connected systems, we often see people running on a single node. There, you have a model of a certain size that needs to execute within a certain latency, say, to be interactive, half a second of response latency for Q&A, for example. By connecting them with NVLink, we can basically take 8 GPUs and turn them into 1 GPU to run the model that much faster and provide that real-time latency. So our inference platform consists of many choices to optimize for TCO, for workload and for delivering performance, and in the case of inference, it usually is about data center throughput at a certain latency. That's important. The other part of the road map is software. I want to go back to that because it's easy to look at a benchmark result, see a bar chart and assume that's the speed of the hardware, but what is often underreported in the numbers is the investment NVIDIA makes in the software stack for inference. You can actually find even more optimizations than you can in training, because in inference you're at the last mile, so you can optimize the model further, beyond what is perhaps possible in training.
For Hopper, for example, we've just released, actually last week, a new piece of software called TensorRT-LLM. TensorRT is our optimizing compiler for inference, and [indiscernible] LLM version. The optimizations we made in that software just in the last month doubled Hopper's performance on inference. That came through a whole bunch of optimizations, both optimizing for the Tensor Core, that's H100 using 8-bit floating point, and improving the scheduling and execution software that manages the GPU's resources to increase its effective throughput and computational efficiency. It's a really hard task. You're trying to optimize by using reduced precision and by serving all different sizes of requests, from quick Q&A to summarization tasks to "write me a long e-mail" or "generate a full PowerPoint." A data center running Hopper, or a data center running inference generally, is going to be asked to do all those things, and getting that to run efficiently, handling that workload and keeping the GPUs 100% utilized is actually pretty hard. It takes mathematical, statistical, AI, system software and even hardware-level optimization. So we will continue to do that. Just in the last month, we've doubled our performance on Hopper for inference, and we'll continue to do so. And you'll see that as we continue.
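[Editor's illustration, not part of the spoken remarks] The point made in this answer, that the first step of training is inference, followed by calculating the error and applying it back to the model, can be sketched in a few lines of plain Python. This is a toy one-parameter model with hypothetical data and a hypothetical learning rate, not a description of how NVIDIA's software or any real framework implements training.

```python
# Toy model y = w * x, trained by gradient descent on squared error.
# All names, data and the learning rate are illustrative assumptions.

def infer(w, x):
    """Forward pass: the same computation that is used to serve the model."""
    return w * x

def train_step(w, x, target, lr):
    prediction = infer(w, x)       # step 1: inference
    error = prediction - target    # step 2: calculate the error
    grad = 2.0 * error * x         # gradient of error**2 with respect to w
    return w - lr * grad           # step 3: apply the error back to the model

def train(w, samples, epochs=50, lr=0.05):
    # Repeat inference + correction over the data many times.
    for _ in range(epochs):
        for x, target in samples:
            w = train_step(w, x, target, lr)
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # hypothetical data, true w is 2
w = train(0.0, samples)
print(round(w, 6), round(infer(w, 4.0), 6))
```

Because `train_step` calls `infer` on every sample, hardware and software that speed up inference also speed up each training iteration, which is the relationship described in the answer above.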
Vivek Arya (Analyst)
Ian, do you think that the industry has the right cost structure for generative AI inference at scale? I see that more as a user: when I go to, take your pick of search engine, right, with Bard or ChatGPT or what have you, even when we put in queries today, it takes several seconds to get an answer, right? It's a very different experience from what we are used to in a traditional search engine. So do you think the industry is there? Today, it seems like everyone is training a lot of things and trying a lot of things, but do you think the industry actually has the cost structure to take generative AI and scale the inference side? I imagine that's what it will take to really grow this industry in a very sustainable way over the next several years.
Ian Buck (Executive)
Yes. It's a great question. Today, most of the live inferencing you're experiencing is, of course, on previous-generation GPUs. That's just naturally what it was originally developed, optimized and deployed on. And many of our large customers are just now bringing on their Hopper variants. In terms of performance, compared with where we were with A100 a year ago, today it's about 8x improved going to H100. That, again, is a bump from the hardware side and activating those capabilities, and another bump from the software side. So I expect that interaction you're experiencing to get better and to get more intelligent. I think there's a fixed latency that we all want to experience, and then it becomes a question of the size and capability of the model that can fit in that latency window. So it can be a process of continuous improvement. You asked about search. Can every search you type in take advantage of this, or be fully optimized, if it takes this long? There are aspects of generative AI and language models that are already being used today that you may not know about. When you type in a search, they're not using those words literally to index in. They actually apply language models to generate a more optimized query string, if you will, to search on, based on your history and other things. So we are seeing aspects of that. We also see things like Transformers and large language model technologies being applied in last-mile recommender systems. As they get down to the last 100 documents or pieces of information they want to understand or ingest to produce a result, can I run a smaller, constrained Transformer-based model to provide that last-mile recommendation from the tens or hundreds or thousands, whatever I can afford, in that last mile of the recommender? So you are seeing some of that technology being deployed today, and being deployed on GPUs today.
The next click up, of course, will be having a richer experience with search. I expect to see more of that with Hopper. It may take a few clicks. With every generation of our GPUs, with our invention of new software optimization techniques and with every invention by the community, whatever the next Llama 2 or GPT is, we bring down the cost of inference. Hopper, coming from A100, brought it down by 8x, and TCO is also on the order of 5x. You compound that with continuous software improvement, and compound it with new model and algorithm techniques, and there's an order of magnitude more capability that's going to be available to everyone. And the best part is that it's on the GPUs they've already purchased. It's already there. In fact, the performance we're delivering with every one of these new pieces of software, or the performance that's possible with more optimized AI algorithms and models, is free: continued investment and improvement in that TCO, in that performance and in that experience. So it's a fascinating time. It's super busy. We see new innovations come in all the time. And it's definitely keeping NVIDIA and the community busy continuously optimizing the platform.
Vivek Arya (Analyst)
Right. I wanted to get your perspective now, Ian, on the competitive landscape. When we look at the demand profile for NVIDIA's accelerated products, it's like tens of billions, right, and expected to increase next year. Doesn't that give a lot more incentive to your hyperscale customers to create more custom ASIC solutions? 1 customer already has the TPU product, they've had a custom solution for a long time, and for the others there are a lot of headlines about them wanting to have internal solutions. So first of all, what is the right positioning of your product versus their internal solutions? Do they use one for one kind of workload and one for the other? Or does it become a greater competitive threat for NVIDIA going forward?
Ian Buck (Executive)
One way to look at this would be what just happened at the GCP Next conference, Google's conference, I think it was about 2 weeks ago. They announced the new variant of their processor that day in their keynote. And in that same keynote, Jensen joined them on stage and talked about all the innovation that we're doing together with Google at GCP: not just new instances, with the announced general availability of their A3 instance, but also the integration of GPUs into their Vertex AI platform and many of the research innovations that are happening on GPUs inside of Google and elsewhere. It gives you an example of how, while hyperscalers absolutely have the means to invest, optimize and build something tailored for workloads that are obviously important to their business, they continue to partner deeply with NVIDIA, our GPUs and our software teams, as 2 big companies advancing what we can do together and partnering on many of the software platforms to continue to innovate. And you see NVIDIA out there as an open platform, of course available in every cloud, and as an open software ecosystem to help advance the state of the art in AI, in data science and in accelerated computing holistically. That lift comes from almost 20 years of investing in a software developer ecosystem. You will also continue to see some of the hyperscalers, of course, building their own silicon if they have the means to, to optimize for specific workloads that they can focus on for their businesses. But they still remain in close connection with NVIDIA because they see the opportunity not just to serve a broader ecosystem but also to innovate and be a platform for accelerated computing across the board. That is something we're quite comfortable with, it's been a good partnership, and I think it was really evident in that keynote.
Vivek Arya (Analyst)
All right. Do you see that changing at all as we move more towards generative AI, where the cost of training is so expensive and the cost of inference is also going to be quite expensive? Do you think it increases their desire to bring on more ASIC solutions than they have in the past?
Ian Buck (Executive)
That's a choice for them, to figure out where they want to optimize and invest. One thing to note is that NVIDIA is spending and investing billions in R&D to optimize for generative AI, for training and inference at scale. With every generation of our GPU, with every generation of our InfiniBand interconnect and CX networking technology, with every innovation in NVLink, those things bring down the TCO, increase performance dramatically and also bring down the cost of training. Now, they're obviously motivated to scale up what they can possibly do in order to develop something uniquely advanced or uniquely new or different that they can capitalize on. But by working with NVIDIA, they can basically leverage the billions of dollars of investment that we're making in the core workloads of training and deploying inference for large language models and generative AI. And it's a question for them of where they're going to decide to optimize and take that step further to do something that may be different but doesn't necessarily take advantage of all the time, energy and investment that NVIDIA is making. So that's a trade-off that they have to consider and make. Regardless, we're going to continue to innovate at the pace that they'll need, to the benefit of them and the entire community. So we will continue to see those things happen. I'm sure it would make sense in some cases, but our focus doesn't change. It continues to be to swarm and innovate to increase our performance, to lower cost and also to increase capability going forward for generative AI and large language models.
Vivek Arya (Analyst)
Makes sense. The next topic I wanted to approach, Ian, is this emerging class of kind of converged CPU and GPU products, for example Grace Hopper, and your competitor is also announcing some of its own products. So what are the pros and cons of using those kinds of products? I don't know whether converged CPU-GPU is the right way to refer to them, but how do they stack up against the more discrete solution, where I'm just using standard x86 CPUs with 1 or many GPUs? What are the pros and cons of moving to this kind of converged architecture?
Ian Buck (Executive)
Yes. We, and the community, have been optimizing for accelerated computing and AI for about 20 years. We've moved a huge amount of computation to the GPU at this point. So for many workloads, including much of generative AI, 95% to 99% of the computing is done on the GPUs, which are communicating directly with each other over [ NVLink ] or across InfiniBand and never touch the CPU, and all of the CPU workload is either small or can be optimized or done in parallel and overlapped with the GPU computation. Of course, you have to have a high-performance CPU there to do the other tasks, usually around data prep, scheduling, managing and coordinating the execution. And every time we increase our GPU performance, we need to make sure that our CPU performance keeps up so we don't have bottlenecks [indiscernible]. There are ways to manage that. One, of course, is using the best possible CPUs, which we encourage and do use ourselves. You can also adjust the ratio of CPUs to GPUs. Today, if you look at our DGX system, it's 2 CPUs for 8 GPUs, but we can go 1 to 4, we can do 1 to 8, we can of course do 2 to 1, or in Grace Hopper, we went all the way to 1 to 1, next to each other. That's 1 angle. The other part, though, is about convergence. What happens when you combine CPUs and GPUs and do something different from a traditional [indiscernible] architecture, where the CPU is sitting over here and you go over PCIe to a GPU over there? First, by bringing the 2 together, you can dramatically improve the bandwidth, the communication between those 2 processors, today to hundreds of gigabytes a second versus the 60 or maybe 100 gigabytes a second from a PCIe connection. You can also be much more coherent, so you can bring the 2 memory systems together.
The memory on the GPU, which today we ship an 80-gigabyte HBM GPU, we're going to -- and we have announced going up to 144 gigabytes per GPU. But you can connect it to Grace. And because the connection is so fast, the 600 gigabytes of GPU memory around the CPU basically becomes a combined fast memory platform for even larger models, basically effectively making a 600-gigabyte GPU. This activates certain different -- both large language models with a single platform, a single GPU-CPU complex, and it opens up new avenues for new kinds of workload acceleration, especially working on large data. Applications like vector databases, applications like graph neural networks, which are used a lot in finance and fraud and e-commerce, also used for recommenders. These are very large data sets that often want to either be run on -- they can run [indiscernible] a day across many GPUs but could be run more perhaps optimally or at a different TCO point by having a much larger GPU like Grace Hopper, 600 gigabytes combined in 1 because they've been tied together. The third thing about convergence is that it allows us -- it's another vector for innovation. We can add things to a CPU that can optimize for the workloads we already know about or other opportunities we see in the future to innovate in the CPU ecosystem -- in the CPU space, in addition to the GPU and in addition to networking at data center scale. And that -- you see that in the work we're doing with our DGX GH200. Connecting even more GPUs together, having an excellent CPU-GPU ratio, having the NVLink, having the large memory, really gives a vision of the future of infrastructure for generative AI, one where you have basically 256 GPUs all connected to NVLink, fully backed by 256 Grace CPUs.
And because it's all NVLink, it effectively acts as 1 exaflop GPU, which is an amazing generative AI platform for both training and extreme large language model inference where we might need multiple GPUs connected optimally for the [indiscernible]. So it's really the sum of those three things. Larger -- it provides larger memory as a starting point for a building block. It's a great scale-out platform for inference as a result. Grace Hopper fits in any server. It's a complete complex for CPU, GPU and memory. It allows us to play with the ratios and explore different ratios for different kinds of workloads with CPU, GPU, and it's an innovation space. Some of the innovations that we've made in Grace, and while we're using an ARM-based core, the SOC architecture of Grace and how those cores can talk to each other is quite powerful and it's showing up in many of our benchmarks, and it provides a great companion to those compute bridge workloads that NVIDIA has been focused on for the last 2 decades.
Vivek Arya
analystGot it. I know we only have a few minutes left, but I wanted to get your take on the last 2 questions, Ian. One of which is -- what's the role of the networking stack in the optimized generative AI cluster? So how much of an advantage does NVIDIA have because you're able to leverage InfiniBand? But when that InfiniBand changes over to Ethernet, then does it mean conversely you lose some of that advantage also because hyperscalers want to move more to Ethernet? So first, what is the role of that networking as part of the cluster? And does anything change when it moves from InfiniBand to Ethernet?
Ian Buck
executiveYes. Great question. So there's basically 3 interconnects at this point that are a choice on how to design and deploy AI. There's NVLink, which previously was how GPUs can talk directly inside of a system, now going more to rack and room scale. You have InfiniBand, which was originally developed for HPC and for the supercomputing industry, the lowest possible latency at data center scale. And it's really what's designed for that. And then, of course, you have Ethernet, an established industry standard, designed, of course, for manageability and capability, and it comes with a rich ecosystem of all the features that not just enterprises, but the cloud need in order to do a managed and software-defined infrastructure. What you will see is, of course, NVLink will continue to be very closely tied to the innovations we'll be making inside of our own GPUs. And there, we -- it's as fast as we can go because we know bringing those GPUs -- as GPUs get faster, we want to connect as quickly as possible in order to continue to allow them to operate as one. And to get the lowest possible latency for inference on some of these giant models, you need to be doing techniques around model parallelism, which have extremely high intercommunication requirements so that basically, it can split the model this way instead of just that way to decrease latency. As InfiniBand also continues to grow, its design point, of course, is the lowest possible and shortest latency, as well as providing the excellent bandwidth that it does. And as a result, we see that it does provide a significant performance improvement over leveraging perhaps RoCE, a converged Ethernet stack, which still has a lot of the management -- which can deliver comparable performance. In fact, we support many clusters and many deployments in our cloud at scale, with Ethernet, with RoCE.
And it works great. For the best possible performance, InfiniBand gets that extra click up, and that basically comes from its HPC heritage of having lowest latency with high bandwidth. And we can do other optimizations as well, such as in-network computation. We can actually do some math inside the switch, in the network fabric, with InfiniBand. I fully expect Ethernet -- and we are working with the community to actually improve Ethernet's performance as well, which is great because it comes with all that manageability and those software refinements. And the 2 -- all 3 will exist in the system for a while and continue to get to -- as 3 layers of performance and scale and, of course, requirements between our reliability or manageability or security or enterprise deployment versus maximum possible performance over time. The road map will continue. Performance will go up. I expect it to be staggered, but they'll continue to learn and absorb from each other those technologies.
Vivek Arya
analystGot it. And I would love to get your perspective on where we are in terms of rolling out generative AI. Because when we look at applications, they seem to be in their infancy, right? There are not that many applications. But when we look at just the rate of the growth of your data center business, that seems to be a very big proportion of what is the total spending pie, right? So what gives you pause in thinking about we are already such a big part of the spending pie? How sustainable is this growth rate for NVIDIA over the next several years?
Ian Buck
executiveSo it's a fascinating question to think about. Today, if you think about where we are and the growth we're experiencing right now, it's people taking their existing data centers and make -- optimizing them to incorporate more and more GPUs, more and more LLMs and generative AI workloads, and that may be coming from the hyperscalers themselves, enterprises wanting to get on board using the cloud, for example, or now seeing the GPU regional specialty providers also standing up infrastructure. But it's largely going into the data centers that already exist because you can't just build data centers overnight. It takes 2 years plus to build up the infrastructure. What I see is the world looking to pivot their -- how they're building data centers in the future. And we're seeing really exciting growth in -- they all realize they need to build out more capacity and, of course, be able to build not just the data centers they had before, which were generic in nature, perhaps more CPU focused because that's the majority of the servers going in, and now building, everyone from hyperscale to regional to on-prem, to basically building out GPU data centers at scale. So if I look at the growth of data center build-out, you can kind of see the opportunity for OEMs continuing to grow beyond what's being able to -- in some cases, quite literally cram into data centers they already have and then to establish where we are today versus the size of the opportunity, the size of the market just from a data center footprint growth capacity. We've gone from being a corner of the data center into being what data centers are now being designed for, which is really exciting, and it gives me the confidence in the continued growth of our business to see how much companies are investing, the world is investing and building out that infrastructure for all the different demand, all the different needs.
Vivek Arya
analystExcellent. So on that exciting note, Ian, thank you so much for taking the time to be with us, sharing your perspective. Really appreciate it, and thanks to everyone who joined this webcast. I've got another 45 questions on the chat. So I'll see if I can work with Simona to help answer some of those questions. But really, thank you so much, Ian, for the time. It's immensely useful to get your perspective. Thank you so much.
Ian Buck
executiveAlways a pleasure and thank you very much.
Vivek Arya
analystPleasure. Take care. Thank you.