NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary
December 5, 2023
Earnings Call Speaker Segments
Amr Elmeleegy
Good morning, good afternoon and good evening, everyone, depending on where you are around the world. Thank you so much for tuning in to today's webinar, where we focus on helping you move your enterprise AI use cases from the development stage to production with the NVIDIA full-stack inference platform. My name is Amr Elmeleegy, I'm a Product Marketer here at NVIDIA in our data center AI group, and I'm thrilled to be one of your presenters on today's webinar. In about 30 minutes or so, I will be handing off to my colleague, Phoebe Lee. Phoebe is a Product Marketing Manager in the NVIDIA Enterprise AI Software team. She will be walking you through the production runtime offering of the NVIDIA inference platform. As a reminder, this is the first installment of our NVIDIA AI inferencing webinar series. And in just a couple of days, on December 7, we will be having our second webinar, where we will be doing a 30-minute demo of the NVIDIA inference platform. So for all the developers out there on the webinar today that want to see actual code, I definitely recommend you sign up and register for that second webinar on December 7, and you will find a link to it in the resource list at the bottom of your screen. Now in terms of our agenda for today, we pulled together here what we hope is exciting and informative content for everybody. We will start from the very top by explaining what AI inferencing is and where it sits in the end-to-end AI work stream. Next, we will explore some of the challenges that enterprises face today as they try to take their AI inferencing projects from the development stage through production. We will then spend the majority of our time today during this webinar on bullet point #3, which is a deep dive into the NVIDIA AI inference platform, explaining exactly what it is, what it does and what value it delivers to your enterprise.
And then finally, we will wrap up with some useful links that can help you get started with the platform, so you can put any learnings from today's webinar directly to use. Now we have a large number of attendees on today's webinar from a very diverse set of industries and different corporate backgrounds and roles. So before we dive into the technological aspects of the platform, I thought maybe it would be beneficial to start with a very simple introduction of what AI inferencing is. And really, the easiest way to answer that question is to just simply look around us at some mainstream digital experiences that have permeated our lives in just the last 2 years, where we are essentially just interacting with an AI inferencing engine. And I'm sure everyone on the call recognizes and relates to the pictures and the images that you see here on the slide. We've all used voice assistants in the past. We've all been in, or at least seen on the news, autonomous driving vehicles. And for those of you with a little bit of an artistic bent, perhaps some of you have ventured into the world of Stable Diffusion and image generation. All of these and many others are examples of AI inferencing, where a model or an algorithm that has been trained on a very large dataset is essentially now stepping into the real world to solve real-world business challenges on datasets that it has never seen before. A typical end-to-end AI work stream or process consists of 2 distinct but very interconnected phases. There is the AI training phase that you see on the left-hand side of the screen and the AI inferencing phase on the right-hand side of the screen. The training phase, of course, is what has been getting a lot of media attention lately, primarily because of the very fast advancements in AI networks and transformers that have produced some very, very large language models that are now pushing the boundaries of a trillion parameters.
However, what we are seeing today from an enterprise perspective is a massive shift in attention and focus from the training phase to the inference phase. And that is primarily driven by the explosion in the number of available datasets and the number of trained models that are now available on open source marketplaces. Just as an example, there are over 80,000 datasets today on the Hugging Face platform and more than 400,000 models on Hugging Face. And what that does is it simply takes a lot of the burden and a lot of the pressure off of enterprises from having to invest heavily in building their own models and building their own datasets. They're now thinking about, how do I take some of these datasets existing out there in the market and how can I take some of these trained models, perhaps fine-tune them on some internal ERP or CRM datasets that are sitting within my organization? And then really focus on putting this model into production in a scalable, cost-effective way that delivers a solid customer experience and user experience for my business users. And the inference phase is also where a lot of enterprises are seeing the recurring cost of their AI workloads. So that's also one of the reasons driving a lot of the shift in focus from the training phase to the inference phase. Now at this point, I just would like to take a quick second to get to know some of you on the webinar today. If you can please just very quickly respond to this easy survey question and just tell us a little bit about your experience with AI inferencing in your enterprise. Do you already have a standardized inference platform deployed within your organization? Are you developing bespoke models, one-off models for AI inferencing? Or perhaps you're still experimenting, you're still in the training phase or in the fine-tuning phase and still trying to get into the inference phase?
So just take a few seconds and help us get to know you a little bit better. Now here at NVIDIA, we are incredibly fortunate to have had the opportunity to work with some of the world's largest enterprise brands and companies, helping them solve their AI inferencing needs and challenges, including companies like Amazon, which is using NVIDIA's inference platform to power its Amazon.com e-commerce platform as well as its Amazon Music platform. Microsoft is also using the NVIDIA inference platform to power its Bing search engine as well as its document translation services that are part of the Microsoft Office suite of solutions. And we've also had the opportunity to work with financial institutions like American Express and Wealthsimple that are using NVIDIA's inference platform to detect fraud across millions of transactions. We've also similarly had the opportunity to work with manufacturing companies like Siemens on predictive maintenance use cases as well as autonomous driving use cases with automotive companies like NIO. Now I just want to share very quickly some of the responses from the users today. So 17% of you have selected that you have a standardized enterprise-wide approach to inferencing. That's great. That's awesome to hear. 15% are using bespoke models, 23% are still in the training phase. 32% have not started AI projects yet within their organization and then 11% have not been successful at taking or deploying AI models from pilot to production. That's a great distribution actually. So we have a very wide and diverse audience today. I hope everybody is going to be walking away with some sort of learning. Even for those of you that have standardized enterprise-wide approaches, that is great. Would love for you to stay tuned because we have some interesting pieces of the platform that might supplement that. And for those of you developing bespoke models, I think we also have a lot of information to share with you today.
So thank you for sharing this with us. Now as I mentioned, here at NVIDIA, we've had the opportunity to work with some of the world's very largest brands like Amazon, Microsoft and Siemens. And through these interactions, we have identified 6 main common challenges that enterprises today face when trying to deploy inferencing workloads in production. Now none of these challenges are unique to AI inferencing. A lot of these are challenges that organizations and IT departments face across a wide array of applications and workloads. But there is a particular nuance to AI inference that adds some complexity to these challenges. So the first challenge is, of course, latency. Most enterprises today that are venturing into the AI space are looking at multiple applications across a very wide and diverse set of use cases. Some of these applications might be customer-facing, which means that they have a direct impact on the revenue of the organization, on customer satisfaction, on customer churn. Other use cases might be back-office related, looking at optimizing some internal business processes in finance, HR, supply chain, et cetera. Each one of these use cases has its own latency requirements and criteria. So being able, as an enterprise organization and as an IT department, to find a unified inference platform that can intelligently understand the latency requirements of each one of these applications and serve them simultaneously becomes a challenge for a lot of companies. The second challenge is interoperability. Of course, each one of these use cases and models that you might be deploying in production could have been developed with a specific AI framework. Some models are developed using PyTorch. Others are developed using TensorFlow. Yet others are developed using other AI frameworks like OpenVINO, for example.
So once again, finding a unified inference platform that is able to work with and serve all the different models regardless of their back-end AI framework becomes a challenge for a lot of organizations. The third element is integration. Many organizations today have invested heavily in building large internal teams and skill sets that are certified on particular pieces of technology. Some companies have invested very heavily in certifying their IT departments on Google Cloud Vertex AI. Others have certified their teams and developers on Amazon AWS SageMaker. Others could be using Oracle's data science AI platform or Microsoft Azure ML Studio, amongst many others. So being able to find an inference platform that is deeply integrated into these AI tools, so that you don't have to reinvest in certifying your teams again and reinvest in additional learning, setup time and skill sets, becomes critical for a lot of organizations. And then, of course, at the very bottom, scalability, efficiency and cost effectiveness. As I mentioned, these are challenges across every enterprise workload, not just AI inferencing, but because the AI topic is very new for a lot of organizations, a lot of companies simply don't have a forecast of what the demand will be on some of these AI use cases. In some cases, they might start with 100 predictions per week, then very quickly scale up over the next few months to reach millions of predictions. So being able to find an inference platform that can scale over time to meet this additional demand while at the same time still being highly utilized and cost-effective becomes a challenge for a lot of enterprises. So these are just some of the challenges that we are seeing as we work with our customers and that we have built our inference platform to help solve. Now what we will discuss over the remaining part of the webinar today is how the NVIDIA inference platform addresses a lot of these enterprise challenges.
And really, the main takeaway that I hope all of you walk away with today is that addressing these inference challenges really requires a whole-stack approach. And the NVIDIA inference platform is really the only platform out there in the market that is purpose-built for NVIDIA GPUs with a full-stack approach to addressing these challenges. In addition, the platform is also fully interoperable, which means that you can deploy that platform in any cloud service provider, you can use it with any AI framework and you can also use it with any target hardware. You could use it with GPUs or you could use it with CPUs as well. It's, of course, engineered to serve large language models, specifically for latency-critical applications. And it is today being adopted by leading enterprises and Fortune 100 companies around the world, and we just mentioned a few of them earlier. And then last but not least, the platform is now made available to you free of charge as open source software, but we're also layering enterprise support on top of that platform to help you with security requirements and production requirements, and we're going to dive deeper into that on today's call. So these are just the 5 main takeaways that I hope everybody would walk away from the webinar with before we dive into additional details. Now as I alluded to earlier, the NVIDIA inference platform really takes a full-stack approach to solving enterprise AI inferencing requirements, and we will dive into each one of these 6 layers of the platform in the next slides, but I just wanted to provide you with a quick rundown, just in case some of you on the call might be interested in a particular challenge, so you know exactly where it's going to sit in the stack. At the very bottom of the stack, you have, of course, the inferencing hardware accelerators.
These are NVIDIA's GPUs that offer massive parallel processing capabilities, along with dedicated engines specific for AI workloads like the transformer engines, and they also offer high power efficiency. Sitting immediately on top of the hardware accelerators are the compilers and the runtime libraries that introduce some very specific optimizations and memory management capabilities to ensure that your AI model can run and take full advantage of the underlying hardware and the underlying accelerators that it is running on top of. Next come the model servers that are purpose-built for AI inferencing and can intelligently understand the latency and throughput requirements of your application and then serve it concurrently with other models. And of course, everything is packaged in environments that can be deeply integrated into your existing IT environment, MLOps tools, cloud service providers and so on. And then finally, as we mentioned, it's one thing to build 1 or 2 models in an experimentation phase in your organization, it's a completely different challenge to put 20 models in production for long periods of time, especially if these models are customer-facing and are revenue generating for your organization. That's what we're going to be addressing with the production runtime layer of the stack that we're going to discuss, offering security, enterprise support as well as reliability for your workload. Now starting from the very bottom of the stack. We, of course, have the NVIDIA GPUs that offer a complete portfolio of hardware accelerators depending on your use case and depending on your requirements. I'm going to go through these GPUs, starting from the very right-hand side of the slide, with our entry-level universal AI inferencing GPU, the L4 GPU with 24 gigabytes of memory.
This GPU is optimized for a wide variety of workloads starting, of course, from AI inferencing, but it also extends out into other workloads like video, RAPIDS and virtual workstations. The L4 GPU delivers 120x better performance compared to CPUs and is also tremendously cost-effective with a thermal design power of only 72 watts. The L4s are ideal for AI use cases, including computer vision AI models, think Microsoft's ResNet-50 image classification model or the DALL-E image generation model. And it's also ideal for large language models under 5 billion parameters, like the GPT-3 Ada with 350 million parameters and the GPT-3 Babbage with 3 billion parameters. Next, immediately to the left of the L4 GPU is the L40S GPU. This is based on our Ada Lovelace architecture, and it's the most powerful universal GPU for the data center today, with 48 gigabytes of memory, 18,000-plus cores and a dedicated transformer engine for specific matrix multiply operations that are very common in AI workloads today and in transformer models. The L40S can be used for both inferencing as well as training, along with a diverse set of accelerated compute workloads like 3D graphics, rendering and other video applications. The L40S can be used today for inferencing of large language models up to 175 billion parameters like the GPT-3 davinci model, for example. Immediately to the left of the L40S is our H100. This is based on our Hopper architecture, and it is followed immediately or succeeded by the H200 GPU. These GPUs come with the transformer engine and with our fourth generation of Tensor Cores, delivering unprecedented performance, scalability and security for literally every workload, from training of very large language models all the way to inferencing, delivering 30x performance gains on large language models of 175 billion parameters like the GPT-3 davinci.
The H200 that we just announced a few weeks ago raises the bar even further, delivering twice the performance gains on inferencing compared to the H100 for the Llama model. And lastly, of course, we offer the NVIDIA GH200 Grace Hopper Superchip, which combines a CPU processor based on the Arm Neoverse architecture with the Hopper GPU. The GH200 Superchip delivers 284x more performance compared to CPUs to address large language models, recommender systems, vector databases and more. As you can see, we offer a whole portfolio of hardware accelerators and GPUs for your AI workloads and needs today that are part of the NVIDIA inference platform. Now the next layer up in the NVIDIA inference stack is our NVIDIA TensorRT, which is a software development kit designed specifically to accelerate inference of trained AI models on NVIDIA's GPUs. So as you can see in the visual that I've included at the top over here, TensorRT takes your trained model, your trained network, which consists of the network definition as well as the trained parameters, and then applies optimization techniques to it like layer and tensor fusion, kernel auto-tuning to select the best algorithm, and time fusion in case the model is an RNN, or recurrent neural network. The output it produces is a highly optimized runtime engine that can run on NVIDIA GPUs with high performance and low latency. We also just recently announced, in that same layer, the general availability of TensorRT-LLM, which builds on top of TensorRT, adding specific optimizations for large language models. Most notably, TensorRT-LLM supports multi-GPU and multi-node inference, allowing you to scale out your inference workload across multiple GPUs and multiple nodes. This is particularly important for large language models that might not fit on a single GPU.
And it also introduces in-flight batching, paged attention and memory bandwidth optimization, all with the goal of helping you increase the utilization of your GPU and lower the latency of your application. And on the GitHub repository, we have also made available fully optimized and ready-to-run compiled versions of some of the most popular large language models that are in production today. So you don't have to optimize these models yourself. You can simply make a simple Python API call and pull some of these optimized models like the Llama 2 model, Falcon, Mosaic, BLOOM and dozens of other models. Now again, the ultimate goal of TensorRT and TensorRT-LLM is to help you improve the utilization and the throughput of your models and decrease your latency. You can see here on the bar chart an example of the latency, performance and TCO gains that you can achieve by compiling your model using TRT or TRT-LLM, depending on the model. What you're seeing here at the very top of the slide is the optimization of a 6 billion parameter GPT model and a Llama 2 model that we're running on an H100 system, which is one of our most powerful GPUs. But after optimizing them with TensorRT and TensorRT-LLM, we were able to extract even 2x more performance from the GPU. Of course, compiling your model using TRT can also help you lower your energy use and reduce your total cost of ownership by between 3x and 5x, because now you can use your GPU more efficiently with techniques like in-flight batching and more efficient memory management. So instead of having to spin up additional virtual machines or additional servers for your models, you can utilize the same virtual machines and the same servers that you have for additional models, reducing TCO and becoming more memory and energy efficient. Now once you've optimized your model, the real fun, at least in my opinion, begins, which is now taking that optimized model and deploying it into production.
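To make the in-flight batching idea mentioned above more concrete, here is a minimal toy simulation. It is only a sketch under stated assumptions: it does not call TensorRT-LLM, the function names are hypothetical, and each request is reduced to a number of decode steps (assumed to be at least 1). It compares a static batch, which waits for its longest request to finish, with an in-flight batch, which refills a slot as soon as a request completes.

```python
from typing import List


def static_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Steps needed when each batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        total += max(lengths[start:start + batch_size])
    return total


def inflight_batching_steps(lengths: List[int], batch_size: int) -> int:
    """Steps needed when a finished request frees its slot immediately."""
    pending = list(lengths)   # requests waiting in arrival order
    active: List[int] = []    # remaining steps for requests in the batch
    steps = 0
    while pending or active:
        # Fill any free slots with waiting requests before the next step.
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        # Every active request advances one step; finished ones leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: one long request and three short ones, 2 slots.
REQUEST_LENGTHS = [8, 2, 2, 2]
static_total = static_batching_steps(REQUEST_LENGTHS, batch_size=2)
inflight_total = inflight_batching_steps(REQUEST_LENGTHS, batch_size=2)
print(static_total, inflight_total)
```

In this toy workload the short requests no longer wait behind the long one, which is the utilization effect the speaker attributes to in-flight batching. Real schedulers also account for memory and token budgets, which this sketch ignores.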
And this is really where the NVIDIA Triton open source inference server comes into play. Triton is a very popular inference server. It's been downloaded hundreds of thousands of times since its release in 2018. As I mentioned, it's an open source application built from the ground up to serve AI models in highly complex production environments that could be running dozens of models concurrently and serving millions of predictions. I was personally speaking to one of our Triton customers just last week, and they told me that since deploying Triton, they were able to reduce their time to production for AI models from 3 months to 15 minutes. Now I'm not saying from 3 months to 3 weeks or from 3 months to 3 days, I'm saying from 3 months to 15 minutes. That's how fast we can accelerate time to market for your models. And the reason behind that is it eliminates the need to build bespoke AI inferencing platforms and frameworks for each and every model that you deploy. So you can simply take your model and deploy it onto your existing AI inferencing platform powered by Triton. Triton supports multiple frameworks like TensorFlow, PyTorch, XGBoost, OpenVINO and many others. It also supports different AI models: deep learning models, tree-based models, model ensembles and large language models. And it supports different query types: real-time, batch or streaming for audio and video applications. And most importantly, it runs both on GPUs as well as on CPUs. Now I just want to take a few quick minutes and double-click into 3 very unique features of Triton. The first is concurrent model execution. Triton's unique architecture allows multiple models to run concurrently on the same GPU and to run in parallel. This feature becomes tremendously useful, particularly if you're working with small models, where you can fit multiple models on the same GPU.
When deploying Triton in production, you can simply specify how many parallel executions of a model you would like to allow. And then if an incoming inference request arrives at your GPU and the GPU is busy serving a different request, it will pass that request to a second instance of the model that is running on the same GPU, thus reducing latency and increasing the throughput of your GPU. Now again, the goal of this feature is to really allow you to extract the most utilization from your hardware. And this is particularly, as I mentioned, helpful for small models like the ResNet-50, which you see here on this slide, where you can fit multiple instances of that model on an entry-level 16-gigabyte GPU. So this is a feature that we're seeing a lot of enterprises take advantage of today in their production environments. Now the second feature I just want to double-click on is the model analyzer. Now in a production environment, AI applications can have very specific latency and throughput requirements. And within IT departments, there are typically 2 main levers that you can adjust to impact how long an incoming request is sitting in the inference server before it's processed by the GPU. Primarily, these are increasing or decreasing the batch size of the inference request or loading more instances of the model on the GPU, which we saw in the last slide. These are primarily the 2 levers that you have control over. Now playing around with these 2 variables or 2 levers sometimes requires a lot of trial and error and can be laborious, especially if you're working with a large number of models. And this is really where model analyzer comes into play. It helps you solve that challenge. You can simply provide model analyzer with the constraint of your model. In this particular case that you're looking at here on the slide, we provided model analyzer with a 10-millisecond latency constraint.
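The selection step behind that kind of sweep can be sketched in a few lines. This is an illustration only: it does not invoke the Model Analyzer tool, the `Measurement` type and `best_config` function are hypothetical names, and the latency and throughput numbers are made up rather than taken from the slide. It simply picks, among measured combinations of the two levers (batch size and instance count), the one with the highest throughput that still meets a latency budget.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Measurement:
    """One hypothetical sweep result for a (batch size, instance count) pair."""
    batch_size: int
    instance_count: int
    latency_ms: float
    throughput: float  # inferences per second


def best_config(measurements: Iterable[Measurement],
                latency_budget_ms: float) -> Optional[Measurement]:
    """Return the highest-throughput measurement within the latency budget."""
    feasible = [m for m in measurements if m.latency_ms <= latency_budget_ms]
    if not feasible:
        return None  # no configuration satisfies the constraint
    return max(feasible, key=lambda m: m.throughput)


# Made-up sweep over four configurations, for illustration only.
SWEEP = [
    Measurement(batch_size=1, instance_count=1, latency_ms=3.0, throughput=330.0),
    Measurement(batch_size=4, instance_count=1, latency_ms=7.5, throughput=530.0),
    Measurement(batch_size=4, instance_count=2, latency_ms=9.0, throughput=880.0),
    Measurement(batch_size=8, instance_count=2, latency_ms=14.0, throughput=1140.0),
]

choice = best_config(SWEEP, latency_budget_ms=10.0)
print(choice)
```

With these invented numbers, the fastest configuration overall is rejected because it breaks the 10-millisecond budget, and the best feasible one is chosen instead.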
And then model analyzer will just run sweeps through different model configurations and tell you what the optimal deployment of your model would be. In this particular case, what you're seeing on the right-hand side is the model analyzer running sweeps through 4 different configurations, and it's showing you that the blue line that you see is providing the highest throughput for your GPU while still maintaining that 10-millisecond constraint that you see in the vertical blue line on the graph. So again, this is one of the features that we see very frequently in production environments today, and customers are using it to help reduce the time it takes to move their models from the experimentation phase into production. The last feature I want to double-click on for Triton is model ensembles. Now many production AI use cases require a pipeline of different machine learning models to be executed in parallel to deliver on the final use case or the final output. A very popular example is conversational AI. Typically, when you're deploying a conversational AI chatbot, it consists of 3 different models: an automatic speech recognition model that converts the input audio to text, then a large language model that interprets that text, and then a text-to-speech model that converts the text from the large language model back into audio for the user. This is a typical example of a model ensemble. Another example is very popular in computer vision, where when you load an image into a computer vision model, typically there might be some preprocessing steps that need to be executed, like resizing the image, cropping it, maybe changing its format and so on.
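The conversational AI pipeline described above can be sketched as a simple chain of stages. This is a minimal illustration with stand-in functions, not Triton code: the three stage functions are hypothetical placeholders that do no real speech or language processing, and Triton itself expresses an ensemble in the model configuration rather than in application code like this.

```python
from typing import Callable, Sequence

Stage = Callable[[object], object]


def run_pipeline(stages: Sequence[Stage], payload: object) -> object:
    """Feed the output of each stage into the next one, in order."""
    for stage in stages:
        payload = stage(payload)
    return payload


def speech_to_text(audio: bytes) -> str:
    # Stand-in for an automatic speech recognition model.
    return audio.decode("utf-8")


def language_model(text: str) -> str:
    # Stand-in for a large language model producing a reply.
    return f"You said: {text}"


def text_to_speech(text: str) -> bytes:
    # Stand-in for a text-to-speech model.
    return text.encode("utf-8")


reply = run_pipeline([speech_to_text, language_model, text_to_speech], b"hello")
print(reply)
```

The point of the sketch is the plumbing: every stage has to receive the previous stage's output in the right form, which is the coordination work an ensemble feature is meant to take off your hands.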
Triton has a model ensemble feature that automates that process for you, allowing you to focus more on the use case that you're building and not all the back-end plumbing that is needed to coordinate a lot of these preprocessing steps or the synchronization and the workflow between these different models. Again, this is one of the unique features of Triton that we are seeing being heavily used in production environments today. Now lastly, and personally my favorite feature of Triton, is that it's just simply super easy to install and super easy to deploy. You can literally install it right now on your computer as we speak. If you simply have a Docker Engine installed on your workstation, you can literally execute these 3 lines of code that you see on the slide here today, and you would have an image classification model deployed, powered by Triton server, and you can even test it and play around with it. In this particular example, all these 3 lines of code do is pull the repository from GitHub and run a simple bash script that downloads the DenseNet image classification model. You deploy the Docker container, that's the second command, and then you run the inference on the Triton server. It's really as simple as that. And it also runs on a CPU-based workstation. So if you don't have GPUs, all you need to do is edit that simple flag that you see in the second command line that says GPUs equal 1, just remove that flag, and you would have a Triton server deployed on your workstation that you can start experimenting with and using. It's literally as simple as that. Now at this point in time, I just want to take a quick pause and get another pulse check or pulse read of the room, and I'm just curious, how many of you here attending the webinar today have used or are using any of the NVIDIA products in the inference stack that we covered so far?
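The transcript does not reproduce the commands shown on the slide, so the following is only a sketch of what the second command (starting the server container) might look like, and how dropping the GPU flag gives a CPU-only launch. The helper name `triton_docker_command` is hypothetical, the image tag is a placeholder you would replace with a real release tag, and the flags are assumed from Triton's public quickstart rather than taken from the webinar.

```python
from typing import List, Optional


def triton_docker_command(model_repository: str,
                          image_tag: str,
                          gpus: Optional[int] = 1) -> List[str]:
    """Build a `docker run` argument list that starts a Triton server.

    Passing gpus=None omits the --gpus flag for a CPU-only workstation.
    """
    command = ["docker", "run", "--rm", "--net=host"]
    if gpus is not None:
        command.append(f"--gpus={gpus}")
    command += [
        "-v", f"{model_repository}:/models",       # mount the model repository
        f"nvcr.io/nvidia/tritonserver:{image_tag}",
        "tritonserver", "--model-repository=/models",
    ]
    return command


# "<xx.yy>-py3" is a placeholder tag, not a real release.
gpu_command = triton_docker_command("/tmp/model_repository", "<xx.yy>-py3")
cpu_command = triton_docker_command("/tmp/model_repository", "<xx.yy>-py3", gpus=None)
print(" ".join(gpu_command))
print(" ".join(cpu_command))
```

The two printed commands differ only by the `--gpus=1` argument, which matches the speaker's point that removing a single flag is enough to try the server on a machine without a GPU.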
If you could just take a quick second and answer this question, and then I'll, of course, share the results anonymously in a couple of slides. I'll just give you maybe 5 seconds for everybody to enter their responses. Okay. Now sitting, of course, on top of the model server is the NVIDIA NeMo framework. Now everything we talked about so far assumes that you already have a trained model that you want to put into production. But if you're still building that model, if you're still fine-tuning, if you're still looking at use cases related to retrieval augmented generation, or you maybe want to take a trained model from an open-source library and change some of the loss functions in that model or introduce some changes to it, or if you're simply still trying to build the use case itself, then the NeMo framework comes in. Our NeMo framework is an end-to-end cloud-native framework built to help you customize and deploy generative AI models anywhere, on-premise, cloud, et cetera. And it also includes training and inferencing frameworks. So everything we talked about, from the Triton Inference Server to the TensorRT and TensorRT-LLM compilers, is already baked into the NeMo framework as well. And it also comes with guardrailing toolkits, data curation tools and pretrained models, offering you a very easy and cost-effective way to adopt generative AI. So again, if you're still building your model, then I definitely encourage you to think about the NeMo framework and some of the components and the libraries that we offer in there. Now lastly, before I hand off to my colleague, Phoebe, I just want to talk about one more important feature -- actually, before we do that, let's look at some of the results that came back from the last survey question. So only 6.5% of you, okay. So very few folks are using the Triton Inference Server. So hopefully, you can go try out the 3 lines of code I just shared and maybe we can push this percentage up a little bit further.
3% use the Triton Inference Server and TensorRT in production. 0% use TensorRT-LLM, so I definitely encourage you to take a look at it. It's available on GitHub today. And once again, there are tons of tutorials and how-to guides available for you on GitHub. So again, hopefully, after this webinar, we can push that number up a little bit. 25% use an alternative inference solution. I would be very curious to learn about some of these solutions. Perhaps you're using a native AI framework solution out there. These are great solutions. Would love to learn a little bit more. So perhaps you can share some of that in the Q&A section. And then 64% of you are new to inferencing and have not used any inference solution. So hopefully, this webinar is useful and informative and can help you get started with some of our technologies. Now as I mentioned, before I hand off to my colleague, Phoebe, I just want to talk a little bit about the integration of the NVIDIA inference platform into existing MLOps tools and particularly existing cloud service providers. As you can see here, the NVIDIA inference platform is deeply integrated across all main cloud service providers. And we've listed the providers here in alphabetical order. And you can see from the very bottom of the slide the wide and broad support of NVIDIA GPUs across the various cloud service providers. In fact, there's a total of more than 90 GPU-powered configurations of virtual machines available to you today collectively across these different cloud providers that are powered by NVIDIA GPUs, covering more than 5 generations of our GPUs. You'll see at the very bottom as well that we've had some very interesting recent announcements just over the last few weeks of cloud providers bringing NVIDIA's latest generation of H100, H200, L40 and L40S GPUs to their clouds and to their users.
Sitting, of course, on top of the GPUs is the NVIDIA Triton Inference Server, which is today deeply integrated across all of the different cloud provider AI tools: AWS SageMaker, Google Vertex AI, Azure ML Studio or Oracle Data Science. It's literally as easy as adding a couple of lines of code to your deployment using these various AI tools, and you can have a managed endpoint running, powered by Triton, on these cloud providers. And we're going to dive a lot deeper into that particular area in the December 7 webinar. So once again, I encourage you to register if you're curious or if you've standardized on any one of these cloud providers or are using their AI tools. And then lastly, of course, at the very top, we make virtual machine images available, powered by NVIDIA GPUs, with all the dependencies and all new operating systems bundled together, available on the cloud marketplaces of these cloud providers. And once again, we're going to double-click into the cloud deployment options of our inference platform a lot more on our next webinar. So with that, I'd love to hand off to my colleague, Phoebe, to talk to us about the last but most important layer of the inference stack, and that's the production runtime. Phoebe, over to you.
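[Editor's sketch] To make the idea of an endpoint "powered by Triton" more concrete, the snippet below builds the JSON body and URL for one inference call in the KServe v2 HTTP/REST format that Triton exposes. It only constructs the request; nothing is sent. The base URL, model name and input tensor name are hypothetical placeholders, not values from the webinar, and a managed cloud endpoint may wrap or route this request differently.

```python
import json
import math


def infer_url(base_url: str, model_name: str) -> str:
    """Return the v2 inference URL for a model on a Triton-style server."""
    return f"{base_url.rstrip('/')}/v2/models/{model_name}/infer"


def build_infer_request(input_name, shape, datatype, data):
    """Build a v2 inference request body for a single input tensor.

    `data` is the tensor flattened in row-major order; its length must
    match the number of elements implied by `shape`.
    """
    if math.prod(shape) != len(data):
        raise ValueError("data length does not match shape")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(data),
            }
        ]
    }


# Hypothetical values for illustration only.
url = infer_url("http://localhost:8000/", "my_model")
payload = build_infer_request("INPUT0", (1, 4), "FP32", [0.1, 0.2, 0.3, 0.4])

print(url)
print(json.dumps(payload, indent=2))
```

Sending `payload` as the body of an HTTP POST to `url` would be the actual inference call; the input name, shape and datatype have to match whatever the deployed model's configuration declares.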
Phoebe Lee
executiveThanks, Amr. Can you hear me okay? Just want to double-check.
Amr Elmeleegy
executiveI can from my side.
Phoebe Lee
executiveAwesome. Thanks, Amr. My name is Phoebe Lee, Product Marketing at NVIDIA, and I'm very happy to join Amr today to share NVIDIA solutions for inference. As you heard from Amr, inference is really where AI models are put to work and make predictions. It is a crucial process for enterprises who bet their business on AI because it allows businesses to address challenges, answer questions and make evidence-based decisions to advance their business. Yes, inference is hard. Many discrete, diverse pieces must work together in harmony from beginning to end to reach a successful inference deployment. And on top of that, the complexity of maintaining the security and stability of an AI software stack with ever-increasing dependencies is a massive undertaking. So that's why NVIDIA created NVIDIA AI Enterprise, the software platform that serves as the production runtime for mission-critical AI applications. As you can see on this slide, this diagram shows what's included in NVIDIA AI Enterprise and how the components work together for inference. NVIDIA AI Enterprise is an end-to-end cloud-native software platform that streamlines development to deployment for production AI applications. It includes the foundational libraries and model serving frameworks for inference, such as TensorRT, TensorRT-LLM and Triton Inference Server, as well as the application frameworks that you can see on the diagram, which allow enterprises and customers to build their AI applications, and also the infrastructure management software to scale AI deployments. So I'd like to double-click on why NVIDIA AI Enterprise is our runtime solution for inference. As we all know, open source software plays a very important role in AI adoption because it allows developers to easily consume complex algorithms developed by a broad community. However, the diverse range of software components and associated dependencies makes maintaining a reliable AI software stack a serious endeavor.
Let's take NVIDIA Triton Inference Server as an example and zoom in on its software package dependencies. As you can see on the right-hand side of this slide, just this one single tool, Triton Inference Server, can consist of 556 third-party software packages and 6 NVIDIA software packages. And as you can imagine, if you integrate any of these tools into your enterprise application, the complexity of the entire software stack would dramatically expand, and so do the vulnerabilities. So maintaining a secure, stable software stack is a very heavy-lifting task: continuously scanning for Common Vulnerabilities and Exposures, or CVEs for short, identifying the impacted components and fixing the security vulnerabilities. Any single change to the software stack, such as a security patch or a regular software update, can cause API breaks as well as application failures and downtime. Within NVIDIA AI Enterprise, we roll out the software in multiple branches, and I want to double-click on the production branches, which are exclusive to NVIDIA AI Enterprise. The production branches are a purpose-built AI software stack to support the full AI workflow from model optimization to deployment. A new production branch is released every 6 months with a 9-month life cycle, allowing 3 months of overlap between 2 production branches for transition. Within the 9-month life cycle, NVIDIA continuously monitors critical and high-severity vulnerabilities and exposures and releases monthly security patches. And by doing so, the AI frameworks, libraries, models and tools included in the production branches are secure and stable, eliminating the risk of breaking the API while managing the complexities inherent in today's open source-based AI software stack. The production branches also include explanatory security advisory information such as vulnerability details and remediation guidance.
So this really allows our customers to have peace of mind when they're building AI and infusing AI into their business operations. As mentioned earlier, NVIDIA AI Enterprise is designed for production inference. There are 3 key value propositions for why you should choose NVIDIA AI Enterprise for your production runtime. First, accelerated computing increases productivity while lowering TCO. The value of accelerated computing, as Amr just covered earlier, is to do more with less. As we continue to evolve AI software, even without changing anything in your hardware, the advances in AI software are able to deliver more throughput over time and continue lowering TCO compared to CPU-only platforms. This enables organizations to see the business value much faster at a fraction of the cost. Second, enterprise-grade security, stability, manageability and support. As we just discussed, the NVIDIA AI Enterprise production branches provide security and stability. NVIDIA AI Enterprise also includes infrastructure management software to scale your AI deployment, as well as enterprise support for your entire AI journey and production deployments, through the full backing of NVIDIA experts globally with SLA response times. And lastly, cloud native and certified to run everywhere. As you saw with all the deployment options that Amr just shared earlier, NVIDIA AI Enterprise is actually exactly the same. Since NVIDIA AI Enterprise is containerized, this is exactly how enterprises can have a consistent environment irrespective of where they choose to deploy. This also enables multi-cloud or hybrid cloud. NVIDIA AI Enterprise is available and certified to run everywhere. It could be from CSP marketplaces, OEM systems, your own data center, or even the NVIDIA DGX platform and your preferred MLOps platforms. All right. Now I'm going to turn it back to Amr to cover how to get started.
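[Editor's sketch] The production branch cadence described in this segment (a new branch every 6 months, each supported for 9 months) can be checked with a little date arithmetic. The snippet below is only an illustration of that arithmetic: the first release date is a hypothetical placeholder, and the helper names are made up for this example rather than taken from any NVIDIA tool.

```python
from datetime import date

RELEASE_INTERVAL_MONTHS = 6  # from the talk: a new branch every 6 months
LIFECYCLE_MONTHS = 9         # from the talk: each branch has a 9-month life cycle


def add_months(d: date, months: int) -> date:
    """Return the first day of the month `months` after the month of `d`."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, 1)


def branch_windows(first_release: date, count: int):
    """List (start, end_of_support) pairs for `count` consecutive branches."""
    windows = []
    for i in range(count):
        start = add_months(first_release, i * RELEASE_INTERVAL_MONTHS)
        windows.append((start, add_months(start, LIFECYCLE_MONTHS)))
    return windows


def overlap_months(previous, following) -> int:
    """Months during which two consecutive branches are both supported."""
    prev_end, next_start = previous[1], following[0]
    months = (prev_end.year - next_start.year) * 12 + (prev_end.month - next_start.month)
    return max(0, months)


# Hypothetical first release date, chosen only for illustration.
windows = branch_windows(date(2024, 1, 1), 3)
for start, end in windows:
    print(f"branch released {start}, supported until {end}")
print("overlap in months:", overlap_months(windows[0], windows[1]))  # 3
```

With a 6-month interval and a 9-month life cycle, each branch stays supported for 3 months after its successor ships, which is the transition window mentioned in the talk.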
Amr Elmeleegy
executiveAwesome. Thank you so much, Phoebe, for this helpful overview of the NVIDIA AI Enterprise suite of solutions and offerings. So team, once again, we hope that you found the content for today's -- I'm just turning on my camera here. We hope you found the content for today's webinar useful and informative. As I mentioned earlier, there are 5 things I want you to walk away with today, starting with the fact that solving AI inferencing challenges in the enterprise really requires a full-stack approach. The NVIDIA inference platform is the only platform that is built from the ground up specifically for NVIDIA GPUs, but it also runs on non-NVIDIA hardware, it runs on CPUs, it supports multiple models, it runs on different clouds, it supports different AI frameworks, it's widely used across large enterprises today, and it's also delivered with the NVIDIA AI Enterprise support and security offerings that Phoebe just walked through. And here at NVIDIA, we really invest very heavily in creating tutorials, how-to guides and technical documentation for all our products and solutions. So I just wanted to include here a few links to help you get started with the NVIDIA inference platform. On the right-hand side, you'll find links to the NGC suite of solutions and catalogs that include fully containerized Docker solutions for the Triton Inference Server as well as everything else we've talked about today, from the NeMo framework to TensorRT and TensorRT-LLM. You can start immediately downloading these Docker containers and testing them out within your organization, or if you need additional support, on the left-hand side you'll find links that will take you to the NVIDIA AI Enterprise offering. So you can also interact and engage with us, and we'll come in and help you deploy some of these applications and software into your organization to meet your specific requirements and specific architectural needs.
Now I think with that, we are at the end of our webinar today. We really want to give some time for any questions that you might have. So at this point in time, we're just going to open it up for Q&A.
Amr Elmeleegy
executiveSo let me navigate here to the question section. Q&A, here we go. Okay. So we have a large number of questions here. Okay. Just skimming through these very quickly. So there is a question here, is NVIDIA finding that at this phase of deployment or change with strategic partners and customers that are more incremental projects and delivering incremental versus broader organizationally and transformative? Look, if I understood your question well, it's whether or not we're seeing customers leverage the platform to deploy incremental projects. That is really the purpose of the platform: to help organizations add additional AI models to their portfolio of use cases and applications in as short an amount of time as possible. And as I mentioned earlier, our customers are telling us that after deploying the inference platform, they're able to reduce their time to market for their AI models from 3 months to literally 15 minutes. And the reason they're able to do that is, every time a new model comes into the organization or they want to push it to production, they do not need to spin up or build up an entire dedicated inference framework just for that particular model. They can deploy a single inference platform within their organization. And then any time a model comes, they just add that model incrementally onto the platform. And the platform is intelligent enough to understand the requirements of the model. It's interoperable enough to work with all the different models, and that helps organizations accelerate the time to value for their models. So hopefully, if I understood that question, I was able to respond to it. Another question coming in is, what is the cost, if any, associated with the use of Triton Inference Server software only? Is the cost primarily on the hardware GPU side? So the answer to that question is, obviously, it depends.
As I mentioned, Triton is open source software, which means it is free of charge. You can simply go and deploy it right now after the webinar with the 3 lines of code that I shared and discussed. So it's free of charge. It can run on CPUs and GPUs. So if you have CPU servers that you want to use for your use case, you can deploy it right away. That means there's no additional cost, neither on the Triton server side nor on the CPU side. If you want to run it on GPUs and you don't have GPUs within your organization, then you will have to go spin up virtual machines powered by GPUs in the cloud. So the essential cost that you're going to experience is the cost of the GPU-powered virtual machine. And that, of course, will depend on which GPU you select, which configuration of virtual machine and which cloud provider you are using. So if you don't have GPUs internally within your organization and you want to deploy your model on a GPU, then yes, the cost will be the virtual machine powered by the GPU. And of course, if you want to layer support from NVIDIA on top, then the NVIDIA AI Enterprise offering will also be a charge item in your solution. And there are obviously opportunities to dive a little bit deeper into that if you are interested. So please feel free to reach out, and we can help with that as well.
Phoebe Lee
executiveYes, Amr, I just want to quickly add a point to your answer. So NVIDIA is -- we love open source. So for developers, if you're trying to do a POC, I think going to GitHub or NGC and downloading Triton Inference Server is a great way to get started. But once the enterprise gets to the point that they need to scale out AI, they need to run business on top of the application that's built on the Triton Inference Server, that will be the phase where we recommend the paid software, which is NVIDIA AI Enterprise, to ensure the stability and enterprise support needed to maintain uptime for your application and to drive awareness and -- sorry, to drive advantage to your business.
Amr Elmeleegy
executiveAwesome. Thank you so much, Phoebe. So there is one question here I just want to double-click on: could you speak to the forward and backward compatibility across old and new GPU hardware, new revisions of TensorRT and lower layers of the stack? I think this is a very important question. Here at NVIDIA, we put a tremendous amount of attention and effort into ensuring that all the software and solutions we build are backward compatible with our GPUs. If you go to the TensorRT GitHub page, you will find that we specifically call out all the GPUs that are supported by TensorRT and TensorRT-LLM and the lower layers of our software stack. So we made this information very publicly accessible. It's all documented in the tutorials and on the links that I shared earlier. And once again, we pay tremendous attention here at NVIDIA to topics like backward compatibility. So hopefully, that answers your question. I don't know them off the top of my head, but it's all documented on our GitHub repositories, and I would be happy to follow up with you as well. If you have a particular GPU in mind and you want to ensure that it's compatible with TensorRT, please drop it in the Q&A or we can take that offline afterwards as well. So with that, we've got only 1 minute left. So let me see here, and of course, we're going to get back to everybody whose questions we didn't have the opportunity to answer. We're going to follow up with you afterwards, but let me see one last question. Can we run the model online with GPU available? If I understand this question correctly, it's, can we run the model in the cloud with GPUs? The answer is yes. As I mentioned, the inference platform is deeply integrated across all cloud providers. So yes, you can run it online. And yes, you can run it on a GPU cloud virtual machine. So the answer for that was easy. It's a yes. With that, I think we're at the top of the hour.
So I want to thank you, everybody, again, for sticking along all the way to the end of the webinar. If you haven't registered for the December 7th webinar, please make sure to do so. We're going to cover a lot of interesting content there. With that, thank you so much, and we're going to end today's session.