NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary

February 24, 2023

NASDAQ US Information Technology Semiconductors and Semiconductor Equipment special 60 min

Earnings Call Speaker Segments

Margaret Amori

executive

#1

All right. We're going to get started in just 3 seconds. Why don't we go ahead and jump in? Thank you, everyone. Thank you so much for joining us today for our webcast, How to Optimize AI Models for Faster Inference. Before we begin, we just wanted to cover a few quick housekeeping items. At the bottom of your screen, you can find various widgets. Once open on the screen, the widgets are resizable and movable. [Operator Instructions] A copy of today's slide deck and additional materials are available in the resource list. You can actually download any resources or bookmark any links that you find useful. Now without further ado, let's go ahead and get started. I'm one of your hosts today. My name is Margaret Amori, and I lead the North American region for the Inception program here at NVIDIA.

Dhruv Singal

executive

#2

And I'm Dhruv Singal, Solutions Architect for AI and Accelerated Inference at NVIDIA. Our main focus today is to delve into the challenges that come with developing AI models as well as exploring the wide range of use cases they cover. We will also be introducing the new class of generative AI models. We're excited to showcase how NVIDIA's cutting-edge tools and SDKs can expedite the development process and boost the performance of your models, ultimately saving you time and ensuring your product success.

Margaret Amori

executive

#3

And we'll talk about how some of these tools are actually being adopted and deployed today through some of our Inception members and talk about the great outcomes and benefits that they're seeing. And then I'll also spend a few minutes to talk about the Inception program here at NVIDIA. We'll leave a couple of minutes at the very end in the event that we haven't answered all your questions via the chat. We'll make sure to get to all of them. Thank you so much for joining us. And now, Dhruv, why don't we get started?

Dhruv Singal

executive

#4

Thanks, Margaret. Developing an AI solution involves making many decisions with the model being at its core. For developers who are just getting started, deciding where to begin can be challenging. Should they begin with the data, the model's architecture or the training framework? To get started, developers must first select and learn one or more frameworks, such as TensorFlow and PyTorch. Next, they need to choose a suitable model, an architecture that fits their application requirements. With so many models available, such as YOLO, EfficientDet, Detectron or GPT, making the right choice can be a daunting task. Once a model architecture is chosen, the developer must collect and label large data sets for training. However, this process can be costly and time consuming, potentially impacting the product's time to market. Once the model is created, it must be optimized for reduced inference latency and increased throughput. This optimization helps to reduce hardware and deployment costs while also providing a better user experience. Different platforms require different optimizations, making it challenging to have a single solution for all. Additionally, for various use cases, developers may need to rinse and repeat the entire process for each model. The complexities involved in creating AI applications are vast. To tackle this, developers need to be strategic and informed in the decision-making, constantly researching and testing various options to ensure that they are using the best solution for their needs. Today, we're going to talk through exactly how developers can solve these problems with the help of NVIDIA. We'll start with the strong foundation in the NVIDIA AI platform. Next, we'll dig into SDKs, including NVIDIA Triton and TensorRT, the core building blocks of an accelerated solution. 
Then we'll go into specifics, learning more about NVIDIA SDKs and tools for data preparation, training models, optimizing trained models and finally, deploying the models for inference at scale. We'll learn how to use TensorRT to optimize AI models and the various flexible ways to add it to an existing DevOps pipeline. We'll also learn how to optimize large language models using NVIDIA FasterTransformer and the order-of-magnitude performance uplift it provides. Last, we'll talk about NVIDIA Triton Inference Server and how it lets developers deploy their models at scale with framework flexibility, features to maximize throughput and minimize latency, and health checks, all through a standard API. We'll end with a couple of anecdotes showing how NVIDIA Inception helped companies accelerate their products using NVIDIA SDKs and tools by improving performance and reliability and lowering cost. Let's delve deeper into the NVIDIA AI platform. The foundation of this platform is its hardware stack, which provides unparalleled performance across cloud, data center, edge and embedded environments. NVIDIA offers a unified software/hardware stack that supports deep learning training and inference. Built on top of this accelerated infrastructure are orchestration libraries that enable seamless scalability. Above this layer, developers work with core development frameworks, such as PyTorch, TensorFlow, NVIDIA TensorRT, NVIDIA Triton Inference Server and NVIDIA TAO Toolkit. By utilizing these core development features, developers can create AI workflows using NVIDIA SDKs, NVIDIA pretrained models and NVIDIA tools that optimize and deploy at scale. Together, these components form the NVIDIA AI platform. Next, let's learn about the end-to-end AI workflows made possible by NVIDIA SDKs, highlighting the advantages of the NVIDIA platform in deploying AI rapidly for production. 
The first step in the end-to-end AI workflow involves preparing the data prior to training the neural networks. NVIDIA RAPIDS is a data science tool that significantly reduces ETL processes from hours to seconds. Optimized for GPU acceleration, RAPIDS on an NVIDIA A100 GPU can perform 70x faster than a CPU-only configuration, and it can also be 20x more cost effective than a similarly configured CPU-only server. After data preparation, the model can be trained using GPU-accelerated computational frameworks, like PyTorch and TensorFlow, at scale. For conversational AI models, like those for TTS, ASR and NLP, NeMo and NeMo Megatron can be used. If you need to retrain and optimize a model for your specific data set, NVIDIA provides the TAO (train, adapt, optimize) Toolkit, a simplified AI training toolkit that reduces training time and your time to market. TAO abstracts away the deep learning framework complexity, allowing developers to leverage the power of transfer learning to fine-tune NVIDIA pretrained models with their custom data in hours rather than months. Once the model is trained, it can be deployed for inference. TensorRT can be utilized to optimize the model, resulting in faster model execution than in the native AI framework. For transformer-based models like large language models, NVIDIA FasterTransformer provides a recipe for optimized inference. Lastly, NVIDIA Triton Inference Server allows for scaling the deployment of the model, automating all deployment steps like model loading, request batching, and multi-GPU and multi-model instances. With an NVIDIA A100, a 266x performance increase is possible compared to running the same workload on the same server with just the CPU. This enables you to focus solely on getting AI deployed at scale. 
The MLPerf training and inference benchmark represents popular AI workloads, which include tasks such as image classification, detection, segmentation, medical imaging, ASR, NLP, recommendation systems and reinforcement learning. NVIDIA is the only platform capable of running all MLPerf workloads and has achieved first place ranking across all of them, showcasing its versatility and top-notch performance. Furthermore, with architectural improvements, these workloads run up to 6.7x faster on the new H100 GPUs and 2.5x faster on the existing A100s due to optimizations enabled by software. Let us look at some emerging models not yet covered by MLPerf. A transformer model is a neural network that learns context and thus meaning by tracking relationships in sequential data, like the words in this sentence. Transformer models apply an evolving set of mathematical techniques called attention or self-attention to detect subtle ways even distant data elements in a series influence and depend on each other. An important advantage of the transformer architecture over other neural networks, such as RNNs, is that context around a word over a longer distance is captured in a more efficient way. Also, unlike RNNs, the input data doesn't need to be processed sequentially, enabling parallelism. First described in a 2017 paper from Google, transformers are among the newest and one of the most powerful classes of models invented to date. Stanford researchers called transformers foundation models in an August 2021 paper because they see them driving a paradigm shift in AI. "The sheer scale and scope of foundation models over the last few years have stretched our imagination of what is possible," they wrote. Transformers have come to revolutionize the world of sequence-to-sequence modeling and started to percolate into a number of exciting applications where the encoder/decoder model provides advantages over other neural network architectures. 
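As a rough illustration of the self-attention idea described above, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes the queries, keys and values are all the raw token vectors themselves (no learned projection matrices, no multiple heads), which is a simplification made for this example and not how a production transformer is built. The toy `tokens` values are invented.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy "token embeddings" (made up for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Every output row depends on every input position at once, and each row can be computed independently of the others, which is the property that makes this computation parallel-friendly.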
Hinging on the concept of attention and self-attention, transformers are able to provide more context in a given sequence, improving results for important natural language processing tasks, such as speech recognition or machine translation. In comparison to other sequential models, transformers are able to provide improved results as they are able to use parallelism, letting us make the most of our GPUs, pushing performance with great results. Transformers are now doing more than just improving existing model architectures. They are also leading to the rise of a new type of AI that can broadly be classified as generative AI. Some examples are text generation with large language models like GPT-3 that can summarize, translate and write code; image and video generation with diffusion models like DALL-E; and accelerating life sciences and drug discovery with models like MegaMolBART. The NVIDIA AI platform supports past, current and future GPU infrastructure to future-proof your AI platform and maximize GPU utilization. This allows customers to turn their existing infrastructure into AI factories across multiple GPU architectures (Volta, Turing, Ampere, Hopper) and manage long-term support for their AI investment, whether on-prem or in the cloud. Inference is complex. Let's talk about how NVIDIA TensorRT and Triton Inference Server simplify deploying models at scale while maximizing throughput and minimizing latency. NVIDIA TensorRT takes models from any training framework and optimizes them to run across NVIDIA hardware like our data center and workstation cards as well as the NVIDIA Jetson AGX and DLA. Why should we use TensorRT with NVIDIA GPUs? TensorRT works with the frameworks we already know, like TensorFlow, PyTorch and ONNX, among others. TensorRT compiles your models with the architecture of your hardware in mind. 
For example, GPUs now have multiple memory hierarchies and different code paths and accelerators for floating point, half-precision floating point, integer and double operations. Tensor Cores are specialty cores built for multi-precision computing and inference. There are several algorithms for matrix multiplication. TensorRT, knowing the core, cache and memory layout of a GPU, times several algorithms for each operation in your neural network, stitching together the final compiled model using the algorithms that are fastest for your hardware. In doing so, it ensures optimal performance for models on your hardware. TensorRT also works seamlessly as a Triton back end, providing the best end-to-end performance when used with Triton. Here are the performance benefits of using Triton and TensorRT with NVIDIA GPUs compared to a CPU-only server. For computer vision models like EfficientDet, using NVIDIA GPUs and TensorRT leads to 36x better performance, reducing latency to under 7 milliseconds per frame. For ASR models like QuartzNet, you're looking at a 583x uplift in performance, bringing latency under the 100 milliseconds required for a natural-feeling conversational AI. Similarly, TensorRT and NVIDIA GPUs lead to an order-of-magnitude improvement compared to CPUs on tasks like NLP, text to speech, recommender systems and reinforcement learning. There are 2 recommended ways to optimize a model with TensorRT. The first one is through ONNX. ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators (the building blocks of machine learning and deep learning models) and a common file format to enable AI developers to use models with a variety of frameworks, tools, run times and compilers. The model needs to be converted from your deep learning framework into the ONNX format, which can then be converted to a TensorRT engine through the TensorRT optimizer. 
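The timing-based selection described above (try several algorithms for an operation, keep whichever runs fastest on the hardware at hand) can be sketched in plain Python. This is only an analogy: the two matrix-multiplication variants and the `pick_fastest` helper are hypothetical names invented for this example, and TensorRT's actual tactic selection works on GPU kernels, not Python functions.

```python
import time

def matmul_naive(a, b):
    """Textbook triple loop: index into b column by column."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def matmul_transposed(a, b):
    """Transpose b once, then take row-by-row dot products."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def pick_fastest(candidates, a, b, repeats=3):
    """Time each candidate on the real inputs and return the quickest one."""
    timings = {}
    for name, fn in candidates.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn(a, b)
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    winner = min(timings, key=timings.get)
    return winner, timings

candidates = {"naive": matmul_naive, "transposed": matmul_transposed}
a = [[float(i + j) for j in range(32)] for i in range(32)]
b = [[float(i - j) for j in range(32)] for i in range(32)]
winner, timings = pick_fastest(candidates, a, b)
print(winner, timings)
```

Which variant wins depends on the machine, which is the point: the choice is measured on the target rather than fixed in advance.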
The generated engines can then be used through the TensorRT run time in your code. The second one uses TensorRT in-framework optimizations. Here, we can optimize models in PyTorch or TensorFlow by adding one line of code. TensorRT finds subgraphs in the computational graph of the model and replaces them with TensorRT subgraphs. Instead of the tedious work involved with converting your model to ONNX and then TensorRT engines, you can leverage the speedup with a single line of code. This method allows TensorRT to accelerate arbitrary and novel model architectures with fallback to the framework. The TensorRT subgraphs have all the optimizations of TensorRT, allow execution in TF32, FP16 and quantized INT8, and require no change to the existing developer workflow. Once the model is optimized, inference can be performed in the training framework, like PyTorch or TensorFlow. It is more efficient as you no longer have to deal with potential memory copies or glue code to stitch together TensorRT engines with the framework-specific code. Lastly, TensorRT graphs and subgraphs can be serialized in the same format as the models themselves, like TorchScript and SavedModel. NVIDIA Triton Inference Server is an open source SDK that simplifies deploying your model for inference at scale with a number of features. It supports all major framework back ends like TensorFlow, PyTorch, ONNX, OpenVINO, TensorRT and a Python back end for any model that exists as Python code. In this way, Triton meets you where you are: continue developing in your language or framework of choice, or, since Triton is open source, you can also build your own framework back end. It supports GPUs, CPUs and other accelerators like AWS Inferentia for hybrid deployments and mixed workloads. The client libraries let you communicate with Triton to send requests over standard HTTP and gRPC through C++, Python, Java or anything else you'd like to use. 
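To make the "standard HTTP" point concrete, here is a minimal sketch of building an inference request body by hand, using only the Python standard library. It assumes Triton's HTTP endpoint follows the KServe v2 inference protocol (`POST /v2/models/<name>/infer`); the model name `my_model` and the tensor name `INPUT__0` are hypothetical placeholders, and in practice the official client libraries would normally build this for you.

```python
import json

def build_infer_request(model_name, input_name, shape, data, datatype="FP32"):
    """Return the URL path and JSON body for a single-input inference call."""
    url_path = f"/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            {
                "name": input_name,    # must match the model's input tensor name
                "shape": shape,        # e.g. [batch, features]
                "datatype": datatype,  # protocol type string, e.g. FP32 or INT64
                "data": data,          # flattened tensor contents
            }
        ]
    }
    return url_path, json.dumps(body)

# Hypothetical model and tensor names, for illustration only.
path, payload = build_infer_request("my_model", "INPUT__0", [1, 4], [0.1, 0.2, 0.3, 0.4])
print(path)
print(payload)
```

The resulting string could be sent with any HTTP client to a running server; nothing here contacts a server, so the sketch runs without one.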
Triton has features for maximizing throughput and minimizing latency like concurrent model execution, multiple model instances and dynamic batching. Model Analyzer profiles all the different configurations, letting you choose the best one for your throughput, utilization and latency constraints. Finally, Triton provides health checks and metrics through the client API and Prometheus. Transformers have grown up. Large transformers perform better, as shown by GPT-3, which has revolutionized search, writing, question answering and other text-generation tasks. It has 175 billion parameters, up from 1.5 billion for GPT-2. There are even bigger models. An example of that is the Megatron NLG model, a 530 billion parameter model revealed in 2021. Today, many AI engineers are working on trillion-parameter transformers and applications for them. To be able to use these large language models for inference, we need to optimize them. This is where FasterTransformer comes in. By optimizing the transformer encoder and decoder and with kernel auto tuning, it provides an order-of-magnitude acceleration for large language models like GPT, GPT-J, GPT-NeoX and T5. Most large language models are transformers with an encoder and decoder and a task-specific head. This allows the flexibility of using a lot of model architectures and sizes for the same task based on the task and the amount of compute available. Question answering has many applications, like chatbots for customer support, FAQ bots for enterprise and improving search by retrieving results from knowledge bases. For question answering, you might choose a BERT model to choose spans within the text corpus that contain the answer to your question, or you might choose to use a GPT-3 model that reads the passage and generates the answer from scratch using its own knowledge and memory. 
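The span-choosing style of question answering described above can be sketched as follows. The sketch assumes a model has already produced one "start" score and one "end" score per token; the scores and tokens below are invented for illustration, and `best_answer_span` is a hypothetical helper, not part of any library.

```python
def best_answer_span(start_scores, end_scores, max_len=10):
    """Pick the (start, end) token pair with the highest combined score,
    requiring start <= end and a span of at most max_len tokens."""
    best = (0, 0)
    best_score = float("-inf")
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best, best_score

# Made-up per-token scores, as if produced for the question "Who released Triton?"
tokens = ["Triton", "was", "released", "by", "NVIDIA", "."]
start_scores = [0.1, 0.0, 0.2, 0.3, 2.5, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.2, 2.0, 0.4]

(span_start, span_end), score = best_answer_span(start_scores, end_scores)
answer = " ".join(tokens[span_start:span_end + 1])
print(answer)
```

The answer is always copied out of the passage, which is the contrast with a generative model that writes its answer token by token.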
Similarly, for summarization, you might want to use a model like BART to write simple summaries, or something larger like Pegasus that writes summaries with extrapolation based on what it already knows. Translation can either support just one language pair or be multilingual, like BLOOM, which is a GPT-3 class model but trained on 46 languages and 13 programming languages. Therefore, using the appropriate task head, it can even be used for multilingual translation, code generation and even code translation. Code generation is a very interesting task since code differs from language in a few ways. One example is how code has reserved keywords and custom symbols. Using a tokenizer or model trained on language to generate code fails since it doesn't understand the importance of such keywords or tries to associate names to variable and symbol names that don't generally appear in the language. For that reason, you have to use models with much larger memory and knowledge, unless you constrain the domain to something like auto completion instead. You can also retrain these models and their tokenizers with a coding task in mind. Now let's look at how some Inception partners have used these tools to improve their product. Segmind, the generative AI optimization platform, wanted to accelerate diffusion models. At a high level, diffusion models work by destroying training data by adding noise and then learning to recover the data by reversing this noising process. In other words, diffusion models can generate coherent images from noise. Diffusion models train by adding noise to images, which the model then learns to remove. The model then applies this denoising process to random seeds to generate realistic images. Combined with text-to-image guidance, these models can be used to create a near-infinite variety of images from text alone by conditioning the image generation process. 
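The add-noise-then-remove-it idea described above can be sketched numerically. This example assumes one common formulation (a DDPM-style forward process with a linear noise schedule); the talk does not say which formulation any particular model uses, and the schedule constants and the tiny 4-value "image" are illustrative assumptions. A real model would have to predict the noise; here the true noise is passed back in just to show that the step is invertible.

```python
import math
import random

def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factor for each step of a linear schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean signal with Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def remove_noise(xt, eps, alpha_bar):
    """Invert the forward step, given the noise that was added."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, eps)]

rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)
x0 = [0.5, -0.25, 1.0, 0.0]  # stand-in for pixel values
xt, eps = add_noise(x0, alpha_bars[500], rng)
recovered = remove_noise(xt, eps, alpha_bars[500])
```

By the last step almost none of the original signal remains, which is why generation can start from pure random noise.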
Inputs from embeddings like CLIP can guide the seeds to provide powerful text-to-image capabilities. Diffusion models can complete various tasks, including image generation, image noising, image denoising, in-painting, out-painting and bit diffusion. Popular diffusion models include OpenAI's DALL-E 2, Google's Imagen and Stability AI's Stable Diffusion. Running these models in PyTorch consumes over 5 gigabytes of VRAM and requires nearly 5 seconds to generate an image even on leading-class hardware. Diagnostics also showed low hardware utilization. This showed us that there was plenty of room for optimization as the hardware wasn't being tasked to its maximum. There existed a bottleneck, which could have been the computation in the denoising step or the memory bandwidth involved in the self-attention step in the transformer. Working with NVIDIA, Segmind converted the diffusion model into separate TensorRT engines, fusing layers and operations, choosing optimal compute kernels and optimizing the memory copy for the attention mechanism. Separating the model into separate TensorRT engines helped split the compilation into multiple steps, allowing the optimizer to optimize each to the fullest within the GPU's memory size. Finally, the model was deployed on Triton using the Python back end. Using the Python back end let us avoid the performance cost of deploying the 3 TensorRT engines on 3 different back ends and made the entire deployment much simpler and more flexible for the developers. This sped up how fast they could release their product. Furthermore, due to these optimizations, the model ran more than 8x faster, consuming less than half the memory. This resulted in a 16x improvement in TCO for Segmind as they were able to fit their 50-plus models on 16x fewer GPUs while improving latency. Another Inception company, Tabnine, was working on code generation with GPT-class models. 
They noticed low levels of hardware utilization and slow inference, leading to unacceptable latency. This was a common pattern, but the fix here was not TensorRT. It was FasterTransformer. Working with NVIDIA, Tabnine optimized their text encoder to work more efficiently on GPUs. Finally, they accelerated their model with FasterTransformer's kernel auto tuning and deployed it on Triton with the FasterTransformer back end. As a result, they were able to get their models running over 40% faster. Using Triton Model Analyzer, Tabnine was also able to profile their solution. Model Analyzer demonstrated that the model could be optimized for minimal latency while maintaining excellent throughput. The solution involves multiple model instances, concurrent execution and dynamic batching. Multiple model instances let you instantiate multiple instances of the same model if there's enough memory available. In doing so, you can serve multiple requests for the same model without modifying it to add batching. For models that support batching, dynamic batching lets you decide a latency threshold. Any requests that take longer than that threshold get served by the model on their own. Any requests that come within the threshold get batched together. Therefore, all of your requests get served by the model within the threshold. And with dynamic batching, you're able to maximize throughput when multiple requests come in within the threshold. Concurrent execution uses CUDA streams to let you serve multiple models from the same Triton instance. For example, if you have an application that uses AI to generate art, you can serve a text-to-image model as well as an in-painting model with the same GPU and Triton instance. These huge improvements came from features that are native to Triton and available to everyone, from initial deployment, thorough analysis and finally, optimization and deployment with Triton and FasterTransformer. 
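The latency-threshold batching described above can be illustrated with a small simulation. This is a simplified model written for this example, not Triton's actual scheduler: it assumes a batch "opens" when its first request arrives and closes once the next request would exceed the waiting threshold or the batch is full. The function name, parameters and arrival times are all hypothetical.

```python
def form_batches(arrival_times_ms, max_queue_delay_ms, max_batch_size):
    """Group request arrival times into batches.

    A request joins the open batch if it arrives within max_queue_delay_ms
    of that batch's first request and the batch still has room; otherwise
    the open batch is dispatched and a new one starts.
    """
    batches, current = [], []
    for t in sorted(arrival_times_ms):
        window_closed = bool(current) and (t - current[0] > max_queue_delay_ms)
        if window_closed or len(current) == max_batch_size:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Made-up arrival times in milliseconds.
arrivals = [0, 1, 2, 10, 11, 30]
batches = form_batches(arrivals, max_queue_delay_ms=5, max_batch_size=4)
print(batches)  # [[0, 1, 2], [10, 11], [30]]
```

Requests that arrive close together share one model execution, while an isolated request (the one at 30 ms) is served alone, so no request waits longer than the threshold just to fill a batch.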
This is an ideal combination for anyone deploying text generation and code completion models. Let's look at some resources to get you started using these tools to accelerate your solutions. For TensorRT, we have 2 developer blogs showing how to optimize your models. In the first post, Speeding Up Deep Learning Inference with TensorRT, we define an image segmentation model with PyTorch. We then convert it to ONNX and optimize it with TensorRT. Finally, we use the optimized model for inference, adding batching and profiling to see the improvements. We also experiment with mixed precision for better performance. This should be of interest to those working with image networks but is also a good introduction to the workflow of converting models from deep learning frameworks like PyTorch and TensorFlow to TensorRT. The second post, Optimizing and Serving Models with NVIDIA TensorRT and NVIDIA Triton, covers multiple ways to convert a model from a deep learning framework like PyTorch and TensorFlow to TensorRT. It starts off looking at trtexec for converting ONNX to TensorRT, Torch-TensorRT for accelerating PyTorch models with TensorRT within PyTorch, and TensorFlow-TRT for accelerating TensorFlow models with TensorRT within TensorFlow. Then it looks at the steps involved in deploying the generated models with Triton Inference Server and writing a configuration file that tells Triton about the model to serve along with the inference configuration, like the maximum batch size and number of model instances, among others. Finally, it shows how one can write a client using the Triton client library to perform inference with the Triton server using simple HTTP or gRPC calls. This should be of interest to those wishing to use TensorRT in-framework optimizations, using a single line of code to leverage all the TensorRT optimizations in the deep learning framework of their choice. 
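For a feel of what the model configuration file mentioned above contains, here is a small sketch that renders a partial `config.pbtxt` as text. The field names used (`name`, `platform`, `max_batch_size`, `instance_group`, `dynamic_batching`, `max_queue_delay_microseconds`) are Triton model-configuration fields; the helper `render_triton_config`, the model name `my_model` and the specific values are assumptions for illustration. Input and output tensor definitions are deliberately left out, so this is not a complete configuration for every back end.

```python
def render_triton_config(name, platform, max_batch_size, instance_count, max_queue_delay_us):
    """Render a partial Triton model configuration as protobuf text."""
    return (
        f'name: "{name}"\n'
        f'platform: "{platform}"\n'
        f"max_batch_size: {max_batch_size}\n"
        "instance_group [\n"
        f"  {{ count: {instance_count} kind: KIND_GPU }}\n"
        "]\n"
        "dynamic_batching {\n"
        f"  max_queue_delay_microseconds: {max_queue_delay_us}\n"
        "}\n"
    )

# Hypothetical values: 2 GPU instances, batches of up to 8, 100 microsecond queue delay.
config_text = render_triton_config("my_model", "tensorrt_plan", 8, 2, 100)
print(config_text)
```

The text would be saved as `config.pbtxt` in the model's directory inside the model repository that the server is pointed at.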
Similarly, for Triton, we have 2 blog posts showing how to deploy your models. In the first post, we go through the steps of setting up a simple Triton server to serve a BART model. We then use the Triton Model Analyzer to run an automatic sweep across configurables like batch size, concurrency and number of models per GPU, getting the throughput and latency achieved for each. Finally, we add additional constraints like latency, maximum memory usage and GPU utilization to find the optimal configuration for those constraints. Triton Model Analyzer enables finding model-serving configurations that satisfy the application SLAs and requirements of various stakeholders. This should be of interest to those working with simpler language models like BERT and BART and those working with Hugging Face models. But it's also a good introduction to the various configurations Triton provides for inference and how to use them. In the second post, we go deeper into some Triton features like dynamic batching, Model Analyzer, Amazon SageMaker integration and multi-GPU, multi-node inference with NCCL and Magnum IO for topology-aware communication for model parallelism. This is useful when contiguous model layers are split across multiple GPUs and when individual layers are themselves split across multiple GPUs. This should be of interest to those working on multi-GPU and multi-node inference and is a good introduction to model parallelism for inference. Finally, for FasterTransformer, we recently published a blog about accelerating GPT and T5 models with the FasterTransformer back end and Triton Inference Server. In it, we learn the benefits of using FasterTransformer for inference of large language models. We dig into the supported models and the optimizations provided by FasterTransformer, like layer fusion, memory optimization, model parallelism, mixed precision and kernel auto tuning. 
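The "sweep, then filter by constraints" workflow described above can be sketched in a few lines. The measurements below are invented placeholder numbers, not results from any real profiling run, and `best_config` is a hypothetical helper that only mimics the kind of selection a profiling tool performs.

```python
def best_config(measurements, max_latency_ms, max_memory_mb):
    """Among configurations that meet the constraints, return the one
    with the highest throughput, or None if nothing qualifies."""
    feasible = [
        m for m in measurements
        if m["p99_latency_ms"] <= max_latency_ms and m["gpu_memory_mb"] <= max_memory_mb
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda m: m["throughput_infer_per_s"])

# Placeholder sweep results (made up for illustration).
measurements = [
    {"batch_size": 1, "instances": 1, "throughput_infer_per_s": 120,
     "p99_latency_ms": 9, "gpu_memory_mb": 1500},
    {"batch_size": 8, "instances": 1, "throughput_infer_per_s": 480,
     "p99_latency_ms": 22, "gpu_memory_mb": 2100},
    {"batch_size": 8, "instances": 2, "throughput_infer_per_s": 700,
     "p99_latency_ms": 35, "gpu_memory_mb": 4000},
    {"batch_size": 16, "instances": 2, "throughput_infer_per_s": 900,
     "p99_latency_ms": 60, "gpu_memory_mb": 5200},
]

choice = best_config(measurements, max_latency_ms=40, max_memory_mb=4500)
print(choice)
```

With a 40 ms latency budget, the highest-throughput row is excluded and the selection falls to the 2-instance, batch-size-8 row, which shows how the "best" configuration depends on the constraints supplied.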
We also show the multifold performance gain from using FasterTransformer for T5 and GPT-J models compared to standard PyTorch. This should be of interest to those working with large language models and is a good introduction to all the ways to optimize transformer models. All the SDK scripts and resources I just mentioned are available through developer.nvidia.com. Here are some ways you can connect with us after the webinar. You can join our developer program to access the tools covered in this webinar and technical training to get you started. If you're a developer, ISV or an enterprise company, you can also contact us at the e-mail address provided on screen. We are happy to answer any questions and connect you with the right resources here at NVIDIA. We also have our GTC conference coming up, from March 20 to 23. This is where you can hear about our latest technology announcements, attend sessions and get hands-on technical training. Here are a few sessions listed on the slide that may be of interest to you around generative AI. Now I will pass to Margaret to talk about the Inception program.

Margaret Amori

executive

#5

All right. Thanks, Dhruv. So now we're going to do a bit of a transition. Dhruv told you all about our inferencing tools that are broadly available for all types of developers. And now I want to give you a couple more examples of how some of these solutions are being deployed within our start-up partners today. The first example is an Inception member called Ex-Human. Ex-Human has a popular chatbot called Botify AI. With any type of consumer-facing entertainment app, latency is critical, especially when it comes to language interactions. Too much delay just kills the whole user experience. They are using a popular model called GPT-J with 6 billion parameters to generate language for their chatbot. And one challenge for them, as Dhruv talked about earlier, is the memory requirements for these models. They had originally been using DeepSpeed optimizations but then found that Triton had a FasterTransformer back end which fit their use case perfectly. So using NVIDIA Triton with the FasterTransformer back end, their memory requirements went down by about 1/3, from 23 gigs to 15, and their median latency went from 2 seconds to 0.67 seconds, which was a huge boost for their users when engaging on their application. We've enjoyed partnering with Ex-Human in the Inception program. We've written a couple of blogs featuring their work. So if you're interested in learning more, I would invite you to go check out the NVIDIA corporate blog and read about them. Another example is a company called NLP Cloud. NLP Cloud, out of France, is a provider of high-performance speech AI models as a service. They've got about 25 different large language models which they use to serve up services such as text generation, summarization, sentiment analysis, speech recognition and more. Running these models in production efficiently across multiple clouds was a big challenge for them, from complexity to response time to cost. 
They tested many options and ultimately decided to use Triton Inference Server as well as TensorRT. After moving to Triton, they found that they could process 10 requests at a time on a single A100 GPU, which was twice as much as they were able to do before, effectively cutting their inference time in half. And as far as reducing complexity goes, thanks to the FasterTransformer back end, they are able to automate complex tasks, like splitting up their model across multiple GPUs. And another benefit was reducing their response times by 1/3. They can process requests in as little as half a second with Triton on an A100 GPU. And then finally, I'll just plug that they have a great blog as well that they publish on their website. For anyone interested, it's nlpcloud.com. They provide a lot of really great practical kind of how-to advice on building and serving large language models. So what do these 2 startups have in common? They're both members of the NVIDIA Inception program for start-ups. We have the privilege of working with some of the world's top technology providers. Here's just a snapshot of a few of the incredible Inception members out of the 13,000 active in the program today. You may notice that who we've highlighted here, a lot of notable start-ups at the forefront of using the foundation model architectures that Dhruv was talking about earlier, who are applying these generative AI use cases. We've seen how generative AI has caught the world's imagination. What I think is going to be so transformational about this is the way that it's being applied to so many different use cases across so many different industries. And there are new ones coming all the time. Not even included here, but already very prevalent are music and audio generation or drug discovery, for example. And I'll just pick out 2 startups that we work with very closely in the program that are having a lot of success with Triton and TensorRT. 
Cohere, a leader in NLP and chatbot technology, is getting up to 4x speedups on inference using Triton on their custom large language models. And Writer, who is focused on enterprise-quality text generation, is being used by Twitter's employees, for example, to be able to write in a voice that's authentic to the company's brand. Because they use Triton, Writer is able to achieve 3x lower latency and up to 4x greater throughput compared to what they were doing before. Hugging Face, WOMBO, OctoML, there are so many great generative AI start-ups in the Inception program that are having huge success with these same inferencing tools and approaches. And speaking of Hugging Face, they and OpenAI are some of our original generative AI Inception members. They've been with us in the program almost since the beginning. And let me actually tell you a little bit more about the program. Our CEO and founder, Jen-Hsun Huang, fully understands the potential of the start-up. And even though it's been decades since he founded NVIDIA and we ourselves were a start-up, believe it or not, we still retain a lot of the same characteristics of a start-up: the ability to innovate, to be agile, to put the mission first, for example. So it's no surprise that 7 years ago, NVIDIA put in place the Inception program. And here's Inception at a glance. We represent over 13,000 start-ups from over 100 different countries around the world, and our members have raised over $94 billion in cumulative funding. Despite having such a large program, we're still rapidly growing, partly because of the huge range of use cases and technologies where GPUs, DPUs and accelerated networking are relevant. And fundamentally, what do we do in the program? We want to help those start-ups that are building on our platforms. We want to help them build faster. We want to help them accelerate their growth and ultimately scale their reach in the market. 
And we do this effectively through offering a variety of programmatic benefits. Building a company with very few resources is an immense challenge for startups. We get that. So we've designed the benefits in a way that there's something in here for everyone regardless of your stage, whether you're stealth or Series D. Our early-stage start-ups tend to really enjoy benefits such as access to free cloud credits, discounted training or connections to our VC community, whereas our growth-stage or more mature start-ups take advantage of our lab environments, our discounts on GPUs and the ability to co-sell with our own sales team or be featured in different marketing programs. And all of our start-ups get a ton of value from the engineering support that we offer. Here's a great quote which points out just that. "Helping product teams add new features, new capabilities, perhaps reducing costs or helping start-ups get to market faster." That is what we are all about in the Inception program. And we're super grateful to have phenomenal partners like Graphistry to work with. We do get questions a lot about some of the marketing programs where we engage our members. We've got a lot of options when it comes to marketing. Taking a look at some of them, from the corporate blog or the dev blog to the podcast, a lot of different industry publications and sales assets, for example, these are great resources to leverage. The podcasts tend to be heard between 25,000 and 35,000 times each. The blogs, especially the corporate blog, tend to get shared all over social media by NVIDIA employees. And the best part is these are completely free for you, and they massively extend your reach in the market. So how do you get selected for some of these marketing opportunities if you're already in the program? We're always looking to feature those start-ups that are having success with our software tools, not just our GPUs and hardware.
Tools such as Triton Inference Server or TensorRT, for example, or one of the hundreds of other developer tools freely available from NVIDIA. We've got our pre-trained models perhaps or RAPIDS libraries for accelerated data science. If you're in the program and you're leveraging our SDKs, our frameworks, our libraries and having success, we want to hear from you. So I would recommend reaching out to your Inception partner manager. We can advocate for you and look for promotion opportunities on your behalf. Of course, at the end of the day, every start-up wants to be able to connect with potential customers. We regularly hold these types of exclusive connection events. They're typically invite-only, and the way that you get invited is through having a relationship with your partner manager so that we know about you. We know what you're building, how you're using NVIDIA, we know whether you're launched, and what type of customers you're trying to reach ultimately. And this way, we can be looking for the right opportunities to showcase and invite you. Now one of the things that we love is our GTC conference. We love to highlight our Inception members at GTC. And one of the ways we do this is through the keynote. Our keynote is delivered by Jen-Hsun. And over the past few years, the keynote has been viewed over 40 million times. So it's just a really incredible way for a small start-up to get a massive lift in the market and a lot of brand exposure. And here are some of the great start-ups that were featured in the keynote last year. And finally, we've heard some quotes from founders. Here's one from the VC community that I love. Many of you are familiar with In-Q-Tel, a premier VC firm. We appreciate the close partnership that we've built with their team over the years and this very kind quote. And with that, I really hope that I've convinced you to sign up for the program if you're a start-up.
And we hope that we've also convinced you to check out both TensorRT and Triton Inference Server. And I'll end here with just plugging the GTC conference again, March 20 through 23. I hope that you'll join us. There are a number of really fantastic sessions. If you enjoy this type of material, there's a lot more there. Dhruv talked about a couple. Here are 4 more to consider that are specifically oriented for start-ups and those looking to build gen AI solutions.

Margaret Amori

executive

#6

And it looks like we've got about 10, 15 minutes remaining to answer some additional questions. I know we got through quite a few via the chat during the session, but happy to answer the remaining. And if anyone has a question that they haven't asked yet, please feel free to put it in the chat. All right. So why don't we go ahead and transition over to our live Q&A. There were a couple of really great questions that we thought warranted a better, more thorough discussion versus just, like, a quick answer. So we saved a couple of those, and we've got Dhruv here standing by to answer them. I don't know, Dhruv, if you wanted to go in any particular order, if you wanted to just kind of pick one and start with that, but handing over to you to do that.

Dhruv Singal

executive

#7

Let's see. There's one question about using quantization with Torch-TRT and TensorRT. Yes, TensorRT supports quantization in both cases: either you have a model trained with Quantization Aware Training, which has quantization layers in it, or you have a model that was not trained with Quantization Aware Training but that you still want to quantize afterward. You can do either with TensorRT. All you need is a calibration file. It gets generated when TensorRT runs through the model with some inputs in order to figure out the scaling factor needed for quantization. With Torch-TRT, it's much easier to do this because it can run the inference through the PyTorch model when figuring out the scaling factor. The other question is how can one create a pipeline of models using Triton, like an ensemble? Triton supports ensembles. And I believe you had a question about ensembles earlier on. An ensemble is basically when the first model turns its input into an output, and that output becomes the input of a second model. Doing this, you create a pipeline of models where the output of one is the input of another. Triton supports ensemble models in 2 different ways. The easiest way to do it is to use a Python backend to tie other backends together. The Triton Python backend lets you host any model or any Python code as an inference target for the client. Let's say you have a pipeline that first runs a classification model on an image and then runs a detection model on it. Actually, you know what, let's think of a more generative AI example. Let's say you have a model that generates an image, and then you next have a model that upscales the image.
So what you might do is have the Python backend defined as a pipeline, and you might have 2 other models in the Triton server, one for image generation, the other for upscaling. Your client would call the Python backend pipeline model, and then the pipeline model within Triton would first call the image generation model with the input it received from the client. It would get the output buffer from the image generation model and then send that buffer within the pipeline Python backend as the input when it next calls the upscaling model backend. And then it would take the output from the upscaling model and return that through the pipeline Python backend to the client. So your client would see it only as a pipeline. It would send a prompt, like a text prompt, and it would receive back an upscaled image. The pipeline Python backend within Triton would take care of receiving the input buffers from the client, converting them to GPU tensors, sending them to your text-to-image model, retrieving the outputs from the text-to-image model, converting them to input tensors for the upscaling model, getting the results of that back and then finally sending those back to your client. That's one way to do ensemble models. Someone rightly pointed out that some of the models in the pipeline can be a bottleneck. And there are a couple of approaches to solving it. Triton Model Analyzer should tell you what the latency versus throughput trade-off is when you're deploying these models. So if there's a model that's a bottleneck, what you can do is either have multiple instances of that model, so it's not a bottleneck anymore; or if the bottleneck is computational, such that the model doesn't occupy an entire GPU when it's in use, you could have a few instances of the model, each on separate GPUs, which you can also do in Triton, so it's no longer a bottleneck.
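
The data flow described above can be sketched in plain Python. This is only an illustration of the pattern, not Triton's API: the `generate_image` and `upscale_image` functions and the `PipelineModel` class are hypothetical stand-ins for the two hosted models and the pipeline model, and the "images" are small nested lists rather than GPU tensors.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for two models hosted side by side in one server.
# Real models would run on the GPU; these only mimic the flow of data.
Image = List[List[int]]


def generate_image(prompt: str) -> Image:
    """Toy text-to-image model: returns a 2x2 image derived from the prompt."""
    seed = len(prompt)
    return [[seed, seed + 1], [seed + 2, seed + 3]]


def upscale_image(image: Image, factor: int = 2) -> Image:
    """Toy upscaler: nearest-neighbour upscaling by an integer factor."""
    upscaled: Image = []
    for row in image:
        wide_row = [pixel for pixel in row for _ in range(factor)]
        upscaled.extend([list(wide_row) for _ in range(factor)])
    return upscaled


class PipelineModel:
    """Plays the role of the pipeline model: the only model the client calls."""

    def __init__(self, stages: Dict[str, Callable]):
        self.stages = stages

    def infer(self, prompt: str) -> Image:
        # Step 1: forward the client's input to the image generation model.
        image = self.stages["image_generation"](prompt)
        # Step 2: pass that output buffer to the upscaling model as its input.
        upscaled = self.stages["upscaling"](image)
        # Step 3: return only the final result to the client.
        return upscaled


pipeline = PipelineModel(
    {"image_generation": generate_image, "upscaling": upscale_image}
)
result = pipeline.infer("a red fox")
print(len(result), "x", len(result[0]))
```

The client only ever sees `pipeline.infer`; the intermediate image never leaves the server side, which is the point of tying the models together inside one pipeline model.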

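On the quantization question at the start of that answer, the idea of running sample inputs through a model to find a scaling factor can be sketched as follows. This is a simplified max-absolute-value rule for symmetric INT8, assumed here for illustration only; it is not TensorRT's actual calibrator, and the function names and sample values are hypothetical.

```python
from typing import Iterable, List


def calibrate_scale(calibration_batches: Iterable[List[float]]) -> float:
    """Return a symmetric INT8 scale from the largest magnitude observed."""
    max_abs = 0.0
    for batch in calibration_batches:
        for value in batch:
            max_abs = max(max_abs, abs(value))
    if max_abs == 0.0:
        raise ValueError("calibration data contains only zeros")
    # Map the largest observed magnitude onto the largest INT8 code.
    return max_abs / 127.0


def quantize(values: List[float], scale: float) -> List[int]:
    """Map floats to INT8 codes, clamping to the symmetric range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


# Activations recorded while running a few sample inputs through the model.
calibration_batches = [[0.5, -1.0, 0.25], [2.54, -0.1, 1.27]]
scale = calibrate_scale(calibration_batches)

# Values beyond the calibrated range (10.0 here) saturate at the largest code.
codes = quantize([2.54, -1.0, 0.0, 10.0], scale)
print(scale, codes, dequantize(codes, scale))
```

The sample inputs matter: if they do not cover the range the model sees in production, real values get clamped the way `10.0` does here.
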
Margaret Amori

executive

#8

Hey, this is probably a good segue to another question. You probably touched on a couple of these things, but can you talk a little bit more about how Triton Inference Server will allow you to set up kind of load balancing of your models?

Dhruv Singal

executive

#9

Oh, I see that question.

Margaret Amori

executive

#10

I don't know if it's specific to workstations or maybe just kind of in general, but that was the question.

Dhruv Singal

executive

#11

Let's imagine a single workstation with, like, 2 A6000s. So automatic model serving. That's one of the core features that I like about Triton, which is it has gotten really good at letting you bring your own model for the most part. So in order to deploy a model with Triton, here's what you need. You need the model and you need to be able to tell Triton about the model. So you create a model repository. You put your model file in there and you write a config.pbtxt. In recent versions of Triton, you no longer need to write a config.pbtxt. Essentially, the config.pbtxt tells Triton, hey, this is the backend that we'll use, which could be the Python backend, the PyTorch backend, the TensorFlow backend, the ONNX backend or a different one like TensorRT. You need to tell it the input buffer names, buffer sizes and buffer data types and the output buffer names, buffer sizes and buffer data types. There are other parameters you can add in the config to optimize it, like ragged batching or dynamic batching, but that's what you need. But Triton can now figure all of those out from the model file itself. So for automatic model serving, what you could do is have Triton pull down the model file from S3 and automatically generate the configuration file when deploying it. For load balancing, you might have to actually write a config file, unless you go for the Triton managed service. You cannot have automatic balancing, but what you can do is run the Triton Model Analyzer for the models that you plan to deploy to figure out the latency versus throughput trade-offs and get a bunch of graphs for GPU utilization and GPU memory utilization at a given throughput and a given latency.
Using those graphs, you can figure out a configuration for Triton that meets the needs of your stakeholders and then write a configuration file that meets those needs. Obviously, Triton now supports multiple GPUs. It supports multiple model instances across multiple GPUs. And these are all simple tunables that you can add into your configuration file. I think that answers that question. I'm just going to wait around for a little bit to see if we get any others. Else, we can probably call it.
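
As a rough sketch of what that answer describes, the script below lays out a model repository and writes a config.pbtxt with a backend, input and output tensors, and an instance group spanning two GPUs. The model name, tensor names, shapes, batch size and instance count are all assumptions picked for illustration, and the model file is an empty placeholder; the right values for a real deployment would come from the model itself and from Model Analyzer results.

```python
import tempfile
from pathlib import Path

# Hypothetical model name, tensor names and shapes, for illustration only.
# instance_group asks for 2 instances on each of GPUs 0 and 1.
CONFIG_PBTXT = """\
name: "image_classifier"
backend: "onnxruntime"
max_batch_size: 16
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
"""


def create_model_repository(
    root: Path, model_name: str, config_text: str, version: int = 1
) -> Path:
    """Create <root>/<model_name>/config.pbtxt and a numbered version folder."""
    model_dir = root / model_name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(config_text)
    # Empty placeholder; a real deployment would copy the exported model here.
    (version_dir / "model.onnx").touch()
    return model_dir


repo_root = Path(tempfile.mkdtemp(prefix="model_repository_"))
model_dir = create_model_repository(repo_root, "image_classifier", CONFIG_PBTXT)

# Print the resulting layout relative to the repository root.
for path in sorted(repo_root.rglob("*")):
    print(path.relative_to(repo_root))
```

A server would then be pointed at the repository root, for example with `tritonserver --model-repository=<path>`.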

Margaret Amori

executive

#12

There was an earlier question about what kind of models can I train with TAO? You had talked about TAO earlier.

Dhruv Singal

executive

#13

Right. TAO supports -- the number of models it supports is increasing constantly. But last I checked, it supports most of the common computer vision and conversational AI models. There is a link to the models that it supports that I can try to put into the chat. Let's see.

Margaret Amori

executive

#14

I can add it. I can add it in the chat.

Dhruv Singal

executive

#15

Okay. Let me put this in the -- okay. I have put it in the chat.

Margaret Amori

executive

#16

Oh, okay.

Dhruv Singal

executive

#17

So those are the models that TAO supports. TAO also supports exporting your models in their optimized form, which is .etlt, or for an optimized backend, which is TensorRT. The models that TAO supports are automatically converted to TensorRT, if you'd like, and they're very easy to deploy with Triton.

Margaret Amori

executive

#18

Can I read you another question, Dhruv?

Dhruv Singal

executive

#19

Sure.

Margaret Amori

executive

#20

What is the difference between latency and throughput optimization? Wouldn't lower latency have higher throughput?

Dhruv Singal

executive

#21

Sure. So lower latency does mean higher throughput, but that's only for a single inference case. If you have a single inference and you have higher throughput, you'll obviously have lower latency as well. But when you have a model deployed and you have 100 inference requests coming in, if you have your batch size as 100, you'll be maximizing throughput. It'll be 1 inference pass across the entire GPU. However, the request that appeared first in the 100 requests will have terrible latency, because it will have to wait for the 100th request to appear before it gets batched and sent through the GPU. In this case, your throughput will be really, really high, but your latency will be really high, too. So your latency will be really bad. As an alternative, you could say, hey, infer my requests as soon as they come. In this case, your throughput would be really bad because your first inference will be going through the GPU and your GPU utilization might be something like 4%. And then the second inference comes in, third inference comes in. Triton helps with this by letting you set up dynamic batching where you can say something like, hey, I want you to batch inferences up to 10, or -- well, let's do a multiple of 8. Batch up to 16 inferences, but never let an inference take more than 20 milliseconds to get batched before you send it to a GPU. What this will do is make sure that all the inferences that come in within a 20-millisecond interval get batched together and sent to the GPU. But any requests that are outside of that get sent in on their own so that no request takes longer than 20 milliseconds to serve. And so your latency is always suitable for your stakeholders. Looks like we have a question about T5 models. So someone would like to optimize a T5 model for inference and run it on short text samples.
But if they only want the model output, not a specific task, how would they go about optimizing the model? There are some ways to optimize a model based on where it's coming from. If it's a native T5 model from Hugging Face, you can look at FasterTransformer and then deploy it with NVIDIA Triton Inference Server. If it's a more custom model, then you might look at something like the Python backend within Triton, and then optimize the model either with ONNX or PyTorch just-in-time, Torch-TRT or Hugging Face Optimum. For most models I would actually recommend Torch-TRT over Hugging Face Optimum and ONNX Runtime, just because the ONNX Runtime will be using TensorRT or something like that underneath and Hugging Face Optimum has a lot more configurables that make it harder to convert a newer class of models into an optimized format. Torch-TRT is simpler to understand in the sense that all it does is take the computation graph, find the subgraphs that can be accelerated with TensorRT and replace them with TensorRT subgraphs. So overall, it's a much simpler workflow to understand and the performance is better as well. Another question. How do I run multiple models on a single GPU with Triton? So this one's fairly easy. You just schedule them. Whether you write your configuration file or not, when you launch Triton with a single GPU and load multiple models, the models get loaded onto that GPU with Triton. Some things to take care of here would be to make sure that all the models that you're trying to load into Triton fit into the GPU frame buffer for the GPU that you're using. If not, model loading might fail. You should also make sure that if you do have access to multiple GPUs, you schedule your models on multiple GPUs based on what the Triton Model Analyzer tells you about how to deploy them. So you could have something like 1 model per GPU or you could have 2 models per GPU.
With 2 GPUs, you can be serving 4 model instances. Looks like we're right about on time.
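
The dynamic batching rule from earlier in this answer (batch up to 16 requests, but wait at most 20 milliseconds) can be illustrated with a small simulation. This is a simplified model written for this example, not Triton's scheduler: it ignores inference time and GPU availability, and `form_batches` and the arrival times are hypothetical.

```python
from typing import List, Tuple


def form_batches(
    arrivals_ms: List[float], max_batch_size: int, max_delay_ms: float
) -> List[Tuple[float, List[float]]]:
    """Group sorted arrival times into (dispatch_time, members) batches."""
    batches: List[Tuple[float, List[float]]] = []
    i = 0
    while i < len(arrivals_ms):
        # The first request in a batch starts the waiting window.
        deadline = arrivals_ms[i] + max_delay_ms
        members = [arrivals_ms[i]]
        j = i + 1
        while (
            j < len(arrivals_ms)
            and len(members) < max_batch_size
            and arrivals_ms[j] <= deadline
        ):
            members.append(arrivals_ms[j])
            j += 1
        # A full batch leaves as soon as its last member arrives;
        # a partial batch leaves when the first member's window expires.
        dispatch = members[-1] if len(members) == max_batch_size else deadline
        batches.append((dispatch, members))
        i = j
    return batches


def worst_queue_wait(batches: List[Tuple[float, List[float]]]) -> float:
    """Longest time any request spent waiting to be batched."""
    return max(dispatch - m for dispatch, members in batches for m in members)


arrivals = [0.0, 2.0, 5.0, 19.0, 30.0, 100.0]
batches = form_batches(arrivals, max_batch_size=16, max_delay_ms=20.0)
for dispatch, members in batches:
    print(f"dispatch at {dispatch} ms with {len(members)} request(s)")
print("worst queue wait:", worst_queue_wait(batches), "ms")
```

In a Triton model configuration, the corresponding knobs are `max_batch_size` and the `dynamic_batching` block with `max_queue_delay_microseconds`; the values 16 and 20 milliseconds here simply mirror the numbers used in the answer.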

Margaret Amori

executive

#22

Yes. I was going to say I think that might be about all the time we've got for questions, unfortunately. But I want to again thank everyone for spending some time with us on this webcast. I hope you found it informative. Don't forget you've got access to all the resources in that panel. So the links, the presentation should be in there, and you can download that. We hope that you'll join us for GTC, March 20 through the 23rd. And Dhruv, thank you so much for coming and sharing all your knowledge and expertise with us. Really appreciate it.

Dhruv Singal

executive

#23

Thanks, Margaret. Anytime.

Margaret Amori

executive

#24

All right. Have a great rest of the day, everyone. Bye-bye.

Dhruv Singal

executive

#25

Bye.
