NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary

February 27, 2025

NASDAQ US Information Technology Semiconductors and Semiconductor Equipment special 94 min

Earnings Call Speaker Segments

Kavita Aroor

executive
#1

Good afternoon, good morning, good evening, depends on which part of the region and world is joining us today. I welcome you all to the NVIDIA webinar on NIM Microservices. This is Kavita Aroor. I head the Developer & Startup Marketing for this part of the region, NVIDIA South Asia. In today's webinar, you're going to pretty much get the landscape of generative AI, understanding the latest trend and challenges in deploying your generative AI models. And also, I would see this webinar as a game changer for you. And you would possibly understand from the speaker today how you can significantly improve the performance and efficiencies of your GenAI models and applications that you're building. Finally, I think there's going to be a lot of demos and real-world use cases that the speaker is going to take you through. So hang in for that and engage with most of those. As some of the ground rules for today's session, I would like to take your attention to the navigation bar on the right side of the console. You would possibly would love to see who your speaker is today, so go and explore his profile under the speaker bio. You would also have the option of Q&A. So if you have any questions around the webinar, beyond the webinar or NVIDIA stack and our technology, feel free to ask your question there. There's also a feedback and survey tab. We would love to have your feedback at the end of the webinar, so feel free and go and share your experience there. Within the Resources section, there are ultimate amount of resources in terms of GitHub resources, in terms of the blog stories, articles you have for NVIDIA NIM Microservices. Anything around the topic of the webinar, you will find there. So please go and explore the resources under that section. Within the Q&A chat window, we have the solution architect team present there to answer some of your questions. So during the webinar, don't hesitate to ask any of your questions to our moderators who are just aligned to give you the feedback or give you the right answers for your questions. So engage with them under the Resources section and under the Q&A section. Beyond this, one of the other aspects I want to get started with this before I introduce Bharat. We have all got you engaged to the DLI promotion. And this webinar is going to give you an opportunity to go and explore the DLI courses. And Bharat will talk more about that. But look out for the QR code and the URL under the Resources section to sign up and gain access to that free DLI course for you. With this, I welcome and introduce our speaker for today, Bharat. Bharat is a very seasoned senior architect within the NVIDIA solution architect team at India. He has a lot of experience around and his deep -- and his expertise is around large language models plus multimodal AI as well as RAG. And his proficiency, I would say, is more of engaging with the customers, providing them the end-to-end solutions and also optimizing GPU performance through the effective resource management and cost optimization strategies. Bharat also comes with a larger background of MLOps and cloud architecture. Beyond this, he is available on LinkedIn. He has a great fan following on LinkedIn as well. Feel free to explore and join him on that part and engage and ask him any questions you might have on NVIDIA stack. With this, I open up the forum and hand over the session to Bharat. Bharat, over to you. Thank you.

Bharat Giddwani

executive
#2

Thank you, Kavita, for the nice introduction. So -- let's get started with the session, team. So we are going to talk about accelerate your AI development life cycle with NVIDIA Microservices, NIM Microservices. By NIM, we mean NVIDIA Inference Microservices. This is somewhat a new technology we have recent -- we have released last year that helps you optimize your generative AI and conversational AI pipelines up to the mark on NVIDIA GPU hardware. Along with me, we have a few moderators, as Kavita mentioned, who would be able to answer your queries in the chat. Feel free to ask them during the webinar. And at the end of the webinar, we'll also take some of the questions and discuss it live by going live as well. So with that, let me just talk about the agenda today. Today, initially, I'll talk about the basic introduction about generative AI, like how it started and what all you can do now with generative AI. Moving forward, we'll talk about it -- moving forward, I'll talk about NIM Microservices that explains about the basic overview to advanced architecture and then how you can deploy it across different platforms, be it in the cloud service providers, data centers or any workstation. And moving forward, we'll talk about what kind of performance benefits you can get with NIMs, and we'll also talk about recent development that we are doing with NIMS called as NIM Blueprints that helps you identify what all types of modern applications who can build the agentic workflows you can build with NIM Microservices. There are certain examples. We have enabled it. Feel free to test it out. It's all available on our website. We'll then share the resources where you can access all of them. And finally, we'll talk about certain custom demos that we have built in India and also a case study with one of the customers that is being successful in using NIMs and other framework that allows them to build the entire ecosystem of their generative AI application. In between, I'll also take you through some of the experimentation, like how easy it is to just download on NIM and run it, and then what kind of complex application you can build. So a few live demos as well we'll talk about here and see how it works. So with that, let me just start with the video. [Presentation]

Bharat Giddwani

executive
#3

So you have seen the idea about the NIMs, like how it actually simplifies your entire workflow of GenAI building. Majorly, we'll talk about the nuances in the later slides. But as you could have seen that this was one of our blueprint from the NIM Agent Blueprint suite that we have called as digital human that allowed you to create avatar, which involves multiple sub-NIMs working in simultaneous way in an accelerated shift mode like some speech to text, text to speech, audio to face; then on top of it, large language models, vision language models and whatnot to make the entire pipeline look interactive and conversational. If anyone wants to create such pipeline, this could be one of the accessible NIMs available on the Agent Blueprint suite. Now moving forward, let's just give a basic introduction about generative AI. It started with ChatGPT in like last 3 years back around when you might have noticed that about within just 2 months, people have started seeing the usage of AI has boomed, right? This was one of the first application of AI that crossed 100 million users within just 2 months. But -- and then people realized that this can be a possibility to make it into production, can be used in various different industries. And then where from 2023 to 2024, people started experimenting with lots of different applications around AI and started building open source technologies likes of models by Meta, by [ Dense ] and Whatnot and Mistral. And they produced equally competent models in the domain. Alongside with that, people also started creating the optimized frameworks to build and deploy these applications into production that could help achieve masses. And this year, all you could see, there are lots of open source technologies available now to make it -- that are widely available to audiences. But now, is it easy? The answer could be yes, the answer could be no. So to make it a reality, NVIDIA has created a lot of different suites of frameworks. One of them is NIMs that helps you deploy generative AI application, be it in the domain of text to text, image to video, video to image or text to image, any domain or even in the domain of health care, we are enabling all these open source models into its accelerated form with various open source available backends like TensorRT-LLM, vLLM, SGLang and Whatnot to create their optimal version, but also in a much more easier way so that you could be able to host them in no time and also generate industry-specific endpoints that can help you create end applications much more easier. So let's just understand more about them. First, as I just already explained, NIMs or the -- so there can be 2 ways you can deploy any application, either through you can take a managed generative AI service, which are working -- which are highly accurate. Lots of providers are providing them and also easy to use. Whereas the other way could be around like you deploy your own model from the open source and try to build the entire stack of optimization to serving to then writing industry-standard API endpoints so that the end data scientist would be able to create an application or the software engineer can create the final software stack. That leads to lots of time consumption. So to make it much more easier, we make it possible with NVIDIA NIMs. So I'll just take a quick overview of how it looks like. This is the base architecture of NVIDIA NIMs. It works on all the different kinds of platform providers, be it hyperscalers or local CSPs. Or you can make it work on even your own hardware, which consists of NVIDIA GPUs. But for the -- we always make sure that performance-wise, we always make sure from the performance-wise, you are able to achieve the right optimal results so that you can have the lowest possible TCO and best ROI. Alongside with that, we also make sure that you have the industry standard endpoints coming with NIMs. Someone is mentioning that do it use Triton, do it use not. In few of the cases, yes, it does use and in few cases, it doesn't. We always make sure wherever we are getting the best performance, we would be able to achieve that. And the most important part is, it is available in the form of containers so that it is portable and shareable to anyone, any other device and easily scalable as well; which also alongside just the containerization, we also make sure you're also able to get the right metrics for you to benchmark the numbers as well as for you to scale it further to the next level by providing the metric -- the system-level metrics so that you can be able to auto-scale your containers and ports in your actual deployments. So to summarize, what is the difference between NVIDIA NIM and do-it-yourself way? NVIDIA NIM can help you reduce your time to market. In just 5 minutes, you would be able to pull a container along with the optimized engine model that is available for most of the well-known models in the domain. If I talk about specific LLMs likes of Mistral, Qwen, Meta, Meta's Llama, StarCoder, et cetera, and some of the NVIDIA's own models are available. You would be easily able to pull and run them with ease. Alongside with that, we make sure that you get the industry-standard API endpoints. It's not just in the domain of LLMs. If I talk about translation or speech to text, the API endpoints will look very much similar to what you are currently already using with the hosted APIs. Plus it supports -- it has support for additional features, which are crucial nowadays to build agentic workflows to build pipelines, which uses custom models. So you would be easily able to fine-tune a adapter model like using LoRa. And it is easily able to support hundreds of adapters depending upon the GPU memory. In your deployment case, you just have to tune a model with any of the open source method like NeMo framework or so. Once you have the adapter model trained, you can simply add it in the directory where NIM is downloaded, and it will automatically load it. And on the client side, you just have to add one more flag with the folder name where actually the model is deployed. And you would be able to run your custom inference with LoRa. We are trying to add more features along with that. You'll see in the coming releases. Apart from that, you can also find there are techniques like function calling, agentic calls are available within it, which helps you run the entire process of tool calling agentic workflows with ease. And it's available in the documentation, you would be able to see it's very much similar to what you're currently already using, be it like Llama Stack API or the hosted providers are providing. So this makes the entire process easy and available for go-to-market. So now as you already know, many star software stacks are being used in building such kind of application, likes of Triton Inference Server, TensorRT, vLLM, then there are libraries for FastAPI and whatnot. And also the right version of CUDA, cuDNN, cuBLAS, all these will also be required to make sure the entire process works well. And this maintenance is high, highly important and cumbersome. So to make sure we run it properly and all our dependencies working well, we do a lot of testing, and we also manage these as containers in one of our platform called as NGC. So you'll see, for every new release, we have everything updated and tested well before we publish it. So I'd like to take you to a page and also, I would suggest if you are already not -- if you have not already tried it out, do try out tools, try that out to log in into the page and see how it looks like. So there is a page called as NGC. If you just simply search ngc.nvidia.com. Full form of NGC is NVIDIA GPU Cloud. It's not a cloud platform, but basically a platform where you will find all the cloud-related frameworks, which involves containers and the models and the resources that we host for you to quickly run your test on your own environment or any other cloud service provider, be it hyperscalers or local CSPs. So if you just simply search a NIM, so you can already see there are a lot of NIMs that are available on the screen. Initially, you would also be in the welcome guest mode, but I would suggest you to log in. If you want to pull any of the container from NGC, you have to log in once, then there is a setup guide available within it. You have to follow that, generate an API key. That API key you'd be using to download the models. Once downloaded, you can easily run it on your own system without connecting to the Internet also. So there are more -- not just containers, also the Helm charts are maintained. So if I just simply search NIM container, say Llama, I have to name it. So if I just simply take Llama 3's SQL coder 8 billion model, here you can see when I log in, you will be able to see that. But when I log in, you'll see all the different versions of it will be available. I'll just quickly do it. So you can see we have released the 4 versions of the StarCoder Llama 3 container. So once you pull this container, this is the image name and once you pull it and alongside with that, you will also be able to download the model's right profile based on your hardware -- system hardware that you have. So I'll cover this in detail in the next slides. Let me just stop sharing and then go back to the next slide section. So when you -- so we will cover more about like how basically we have divided the different engines, different optimized models for different hardware in the back end. Meanwhile, before moving forward, if you -- on your screen, you might be seeing a poll. If you can answer this question, we'll be able to gather more information like how you are participating. So it would be great if you answer this question, and we'll talk more about like how its architecture looks like in the next slides. So we'll wait for 30 seconds. So it's a multiple choice question. You need to select 3 benefits of the NIMs. Awesome. So moving forward, let's see the answer. Most of you are correct. The first 3 are the right answers where you can deploy it anywhere on any hardware, speed time to market is low as well as the -- it provides that industry-standard API endpoints for you to get started. Now moving forward, let's understand again the nuances, what all things are involved within NIM. It's not just an optimized container, optimized engine file. It is more than that. It involves the entire life cycle, be it if you have a system with drivers already installed. You can use Kubernetes or the Docker way to run a NIM. And on top of it, alongside with that, it provides you the industry-standard API endpoints and the usage of all the necessary NVIDIA software stack, be it TensorRT, TensorRT-LLM, Triton Inference Server, depending upon the NIM that you are using, be it text to text, image to video or which domain of the model you are using. It uses the right frameworks, right libraries and optimizes the model and runs it on your environment. It can run on one GPU to multi-GPU because it has the support of Tensor parallelism, pipeline parallelism and so on available based on your hardware that you are using accordingly. Alongside with that, you can attach multi-LoRa adapters to it so that you can get different responses from different users if they are using a different adapter altogether. And at the end, it provides you the feature of extracting all the metrics and logs, be it time to first token, end-to-end latency, request per second or the system-level logs so that you can visualize them in Prometheus or Grafana-like environment. And this is the overall architecture of where user will be just sending a request to the industry-standard API endpoints like where it can be in the usage of LangChain, LlamaIndex, OpenAI directly or curl command. And then it will be going to a cloud-native environment where it is hosted through either Kubernetes, Docker and all the metrics are coming in. And in the back end, we have also optimized the preprocessing and post-processing steps on the GPU and try to make them parallel as much as possible. You'll see a few more updates in NVIDIA GTC with this -- for this architecture, where you'll see a lot of new improvements are coming in. We'll share the link in the end and to attend that and find out the latest updates and the latest features. But this is the overall architecture. It currently supports 2 backends. You'll see more backends are available for optimization step for your large language model. And we also make sure that all the optimizations are already predone for the respective hardware. And the optimizations are available with 2 engines, one is TensorRT-LLM. You just go and search and find out more information about it. You'll get to know it's one of the optimization engine that helps you run, like optimize your models in the terms of Cavy caching and then parallelization techniques and then precision calibration and so on. Alongside with that, we have vLLM support as well available for day 0 model deployments. And at the end, you would be easily able to run both chat completions and also, you'd be able to see the metrics from the NIMs. So this is the overall overview. And I already talked about the LoRa concept, that once you have the model fine-tuned with one of the techniques, be it installed, the NeMo or so on, you would be easily able to deploy it with your current NIM deployment. And it all depends upon how many GPUs you are using, you can easily be able to scale accordingly. And each user with different LoRa adapter can be able to get the different results. One great important features we have added recently is if you are fine-tuned your own custom Llama 3 model or open source any model, you would be easily able to optimize it with TensorRT-LLM. The steps are mentioned there in the documentation, and you would be able to transfer the customized model as well into the NIM architecture and with just single-click deployments. Finally, one more feature is tool calling and usage of Llama Stack APIs within NIMs. So you would be able to do the function calling features, create agentic workflows and talk to your -- talk to the real-time web pages as well with the usage of NIM and without you writing any additional code on top. This is how actually it identifies which system to -- which particular profile to use, whether it is TensorRT-LLM or vLLM, whether it should be FP8 precision, BF16 precision or FP16 precision. We make sure for each GPU, we have all the types of profiles available. And based on the architecture and the customers' need, they can select the type of profile from the same docker run command, which pulls the container and runs it. But you have flexibility to select it. By default, it automatically selects the FP8 precision if you are using H100 or L40S. If you are using A100s or A series of the GPUs, it will pull the BF16 format of the container. And that way, you would be able to get the offer. And mostly, it was the throughput, high-throughput profiles. But if you have a requirement where you want lower latency, there are profiles available for the lower latency as well. So accordingly, you can test based on your use cases. Now one more quick question from the slides before we just discussed. If you can answer. Again, there are 3 answers to it. Let me know. So we'll wait for around 30 more seconds. Okay. So now let's see how it looks like. Again, most of you are correct, the first, second and the fourth option is more suitable one, but it can be the third as well, but it's not just about chatbots. It's more than that. You can even work with NIMs in other generative Ai applications. That's why I have not selected that as the answer. So let me again reshare my screen and now talk to about how actually you will find out in real world, it works, right? So how actually it will be looking like. So let me share it. Again, as I showed you, from NGC, you'd be able to pull the container. First thing first, you need to set up your environment and setup guide is present here. Go, feel free to explore that. You would find out mechanism to generate an API key so that you would be able to pull a container from NGC. And you have to just do Docker login and also set up NGC CLI command. Once you have done that, you would be able to pull any of the container in your docker environment. So this is one of the systems I am just using to explain you how it looks like. So the simple -- the command is very much simple as well as similar to what you might be currently already using, simple docker run command with the mention of the number of GPUs as we are currently using a Llama 3.3 70-billion model which, in this case, requires 4 GPUs because we have created the profiles for TP with TP4 enabled, which is Tensor Parallelism 4. You need to have at least 4 GPUs to make it work. You need to mention the name of the profile. Once you go to the documentation page of the NIMs, you'll be able to find out like how you can find out what all profiles are available for my GPU hardware for this particular NIM container. I'll show that as well in the documentation page. But on high level, you have to first mention the profile name. In this case, I'm using a vLLM BF16-formatted PP4 and PP1 profile. Then you have to mention your NGC API key, which we just generated. If you go to the setup, you will find out the ways to generate it. And finally, where you want to store these engine files and optimized preprocessing and post-processing modules entirely. Once you have provided the path, you would be able to launch it. Generally, by default, launches on 8,000 ports, so you can map it to any port as you want and mention of your actual image name and that. So I've already done that setup. You can see the logs. So it will look something like this. Takes some time, like hardly 30 seconds, and then it would be able to deploy. So if you just see NVIDIA SMI, it will come up in a few seconds. So as you can see, it uses Triton in the back end for the usage. So it has started initializing the model. Once you are pulling it for the first time, it will download the model as well. In my case, it has already downloaded. So download time is obviously something you need to take care of for the first time. Once you have downloaded, you can be easily able to scale. So we are now able to see that it has occupied the system memory and it has downloaded -- it has deployed the complete model. Now you would be able to do inference with it with ease. You don't have to do any other kind of optimization because the engines that we have put has already done that. So I've created a simple Streamlit UI for showing a demo. So if I just say hi, currently just LLM is working. You can see the speed is pretty good. The response time is well even for a model likes of Llama 3 70 billion. Now on top of it, if you want to do further. So I want to show further optimization with NIMs. It's not just about LLMs, as I told you initially, it's all about other domains of the model as well we have enabled. So I'll just show you a demo for a RAG where I have already encapsulated certain documents. You can also, in this case, I have also added a feature for adding a few more documents. So I'll just add one random document, which is a scheme document from Indian government. There are multiple schemes available. So I've just taken this PDF. As you can see that this document is also not rightly actualized. It's -- we have to apply certain kind of OCR or certain kind of ways to build the final text. So that is being done within the itself in the back end. Again, for that as well, we are using vision language model as a NIM for converting this images into the right text format. So I build the index. So I'll just clear the previous chat history, and I'll start asking question. So you can see it has now -- it can now answer more, not just related to a basic hi, hello, what a normal LLM could do. Here, it can be able to answer the questions related to the document. So if I just ask it related to, say, for example, something. So for example this. So it has generated the response from the document. And in this case, I'm using majorly 3 big NIM models. One is Llama 3.3 70 billion, which is LLM for the generation task. Before that, for embedding generation and then re-ranking task as well, I'm using 2 different NIM models available on NIM documentation page. I'll show that where you can find all these information. One is Llama 3.2 1 billion model-based embedding model we have released, which is pretty accurate, has the total limit up to 8k to ingest and also supports dynamic embedding so that it's not -- you don't have to generate only the larger embedding size. Currently, I'm using 2048 as the embedding size, but you can reduce it to 512 or 1024 and so on to accelerate the performance further. And the re-ranker model is again based on 1 billion Llama 3.2 model. We have released these all 3 models along with the vector database accelerated on GPU is working and giving you this final results in an accelerated way. Along with that, you can add more types of NIMs, be it in the domain of speech translation, text to speech and whatnot. So just a quick example for the speech recognition as well. I think I need to enable -- so let's -- I'll show the demo at the end again for the speech-to-text part by enabling that first. So this was an overview of the workflows and the demos. Now how and where you can find this documentation that I was talking about. We just have to search NIM documentations NVIDIA. This is the way I search. You have to go to the NVIDIA NIM. There are multiple NIM subsections are available. The most important one today we are focusing on is large language model. Here, you'll find all the information that I'm talking about and how you can use tool calling, adapters, et cetera, et cetera. Apart from that, if you are also interested to build something like RAG, there's something called NeMo Retriever retriever. The NIMs available for embedding and re-rankers are available. We have support for 3 to 4 models, which mostly we have built with high accuracy. Alongside with that, there are models in the Guardrails domain as well if you are applying -- if you want to apply at the end some guardrailing to reduce the topic control and content safety, you want to apply and then jailbreaking, then you can do this with NIM itself. Now moving back to one more page called as build.nvidia.com. So there is one more page called as build.nvidia.com where the goal of this page is to explore and discover how NIM actually works and will it be suitable for my use case or not. So on this page, you'll be able to find that all these models that I'm talking about, we have already hosted in our environment for you to come and have a feel about it and explore how -- what all features would be available. When you do it on yourself, on your own system from a cloud service provider or on your workstations. You can see in the reasoning section, all the LLMs are hosted here, starting with DeepSeek to Llama 3.3 and so on. And many other models are available. You just have to click Explore, and you would be able to see all of them. In the domain of vision as well, some of the models are available for you to try it out and work with. So just taking one example, if you open the Llama 3.3 70 billion, you can see a chat window on the left side, and you can see the response is quick. Alongside with that, I was always talking about the industry-standard API endpoint support it has, right? So it has all the types of response formats available, be it with OpenAI. You can simply instead of using -- so in this case, instead of using integrated API and nvidia.com, you can use your IP address and port combination, along with your API key. Once you have pulled the model, you don't need to even use it. And the model that you have selected, is it 3.3? Is it Qwen or something else? Then once you have -- once you run it in this, you would be able to do the inference with simple client.chat completion or client.completion normally. Apart from this, we also make sure that we have support for all the well-known frameworks that people are using for building their agent or a RAG use cases. like supply chain, you'll find the support in LlamaIndex, Crew.ai and whatnot and even Haystack so that you can -- you don't have to write your own wrapper and then run the application. Plus on top of it, if you don't want to work directly with Python and you want to use it in the shell script or the simple curl command, you can interact with the model with that as well. So this is available for all the different domains as well for you to try out and experiment. Feel free to go to the build.nvidia.com page and try out these experiments, and then work accordingly and try to pull the necessary container on your own system. So let me go back to the slides. Okay, NIM deploy. So I was always talking about like you can deploy it with Kubernetes, you can deploy it with Helm chart as well. So we have a good resource guide available for you to try it out on almost all the hyperscalers. And there are certain nuances you need to take care of, like what kind of VMI I need to create and how -- what kind of system I should be having on the cloud providers so that I'd be able to run it. So all the information related to that is available in the GitHub pages of NIM deploy. I'll show that link as well once I reshare my screen. But you would be able to find all -- pretty much all the information there. Plus, if you are doing -- if you have faced any issue, feel free to reach out to the to the NVIDIA forums, where you would be able to get the responses done within 24 hours, like how do to deploy it on your own environment if you are facing an issue. Apart from that, all this is working with something called as NIM Operator. So it's another library on top of NIM for LLM. In the same documentation page I showed you, there is NIM Operator as well, which works on top of Kubernetes, create automated life cycle for you to deploy these LLMs or speech or any other domain models effectively with this operator. So it's like Kubernetes operator, but specifically targeted for the NIMs to improve and easify the process. This is the overall architecture that helps you extract all the metrics. And then for the auto-scale, there are ports that you have deployed. It also has the features to save all the information in the cache, and you would be able to accordingly run the complete life cycle. You don't even have to use Docker to pull the models, it all can be done with container-based Kubernetes. And you can just apply the Kubernetes secrets, all the information and then it would be able to run the NIMs and then create a service out of it. And then you would be able to host it with your own. You don't have to do it with the support of any additional open source way. So another question, like how can you access the NIMs ? If you can answer. So there can be more than one answer again for this question as well. Now let's see what's the answer. Yes, almost everyone is correct that everywhere you can run the NIM, if you -- until unless you have NVIDIA GPUs there, you'd be able to run them efficiently. Now the performance highlights, quick performance highlights. You'd be able to gain a lot of performance gain. So we have seen recently, we released on more blog. You just go and check it out with the latest GPU that we have -- sorry, with the latest GPU that we have released recently, B200 was able to achieve up to 25x performance with NIM on top of it. Same goes for the DeepSeek model. Right now, again, you can see in this case, if you don't use the optimized version of the pipelines, you would not be able to get the best TCO for your end applications. It's always suggested to go with the optimal version of it. One of them is sure to optimize it through NIMS. Now again, it was just till now about the models. Now let's understand apart from the models individually, which is text only or visual only or speech only, what else we can do, what else pipelines are available within NIMs? So there are -- there is something called as NIMS Blueprint that we have made it available. That is a reference example workflow for certain industry domains, be it health care or conversational workflow domain, be it like call center or so on, we have created certain blueprints. Alongside with that certain helpful developer blueprints, also we have created called as such as NV-Ingest is one of them that helps you extract the documents, if it is in PDF, PPTs or any other format in well-structured way all images, all tables and text will be extracted properly with a suite of different NIMs working simultaneous to each other. Let's see how it looks like. So there are multiple blueprints available. One I already showed you initially, which is digital human blueprint, that allows you to create avatars, real-looking avatars with different models in place. One I talked about multimedia PDF data extraction called as NV-Ingest for you to extract the information from a PDF that can be used for creating a multimodal RAG. Then blueprints in the domain of drug discovery and protein structure, Omniverse and other domains are also available for you to consume and run it effectively. This is the overall pipeline for digital human for customer service, which uses 6 to 7 different models, all working simultaneously. You can choose to use open source models or closed source. We have enabled it for both. Then this is the PDF extraction called as NV-Ingest pipeline, which takes the PDF as input, uses YOLOX model, which is a custom tuned model by NVIDIA to extract the images and charts separately. And then if it is charts, then it will be sending to another vision language model like called as DePlot by Google to identify the information from the chart and then use certain other models likes of cached and PaddleOCR to extract a few more information that can be stored as text and that can be used for further processing and creating another version of the document that is much more rich in text and can be used for building any RAG pipeline, which with the help of same embedding or re-rank model that we were using. And alongside with that, other processing other images like some native images can be done with models like Llama 3.2, vision models or so on. And you can extract out the final logic. It's all basically a reference example, you can modify and create your own version of it and use it accordingly. There are lots of customers already using this and have adopted the features. So now final poll question for today, and then we'll move to some more demos and then we'll ask -- answer your queries accordingly. Where are the NIMs and NIM Blueprints available to try it out? I just showed you one website where you can try out the hosted NIMs. I'll just check, let's see what you can answer. And just sharing, the answer is it is just a single-choice answer. It doesn't have multiple responses. So hopefully, you'll get a better response. By try out, I mean, for you to just without any hosting on your own environment, you can try out the page and get the feel of it. Okay. Let's see. Some of you have wrong answer. The answer should be build.nvidia.com, where I showed you the website where all the reasoning, vision models, visual models all are available. But most of you are correct. So there you can go and explore how these models can work from the client end, so that you would be easily able to run them as well into your environment when you want. And for that, you would need the access of NGC because the containers are present there for you to put. Apart from NIMs, we are also created a suite of frameworks, suite of libraries for different tasks for the entire ecosystem of generative AI applications. Starting with data preparation, we have something called as NeMo Curator that helps you extract out the necessary data from the suite of Internet scale data that you might have accessed by applying deduplication, applying certain filtering techniques like subdomain filtering, quality filtering or applying PII reduction, all these can be GPU accelerated and can be applied to your trillion-scale tokens and can extract out the necessary documents for your pretraining or fine-tuning task accordingly. So this is one library we have available for you to try out and explore. If you are building something of your own, building a model of your own, this can be very much helpful for you to create the pipeline. Now on top of it, we have something called as NeMo, also called as NeMo-Aligner and NeMo Customizer for you to pretrain and fine-tune these models. So I was talking about LoRa pretraining with NeMo -- sorry, LoRa fine-tuning with NeMo. So within this NeMo framework, you have the techniques, all the necessary optimization ways available like sub-tensor parallelism, pipeline parallelism, context parallelism, sequence parallelism. Then apart from the parallelism, techniques like selective activation, checkpointing and so on are supported within this framework. With just one config change, you would be able to tune the training process with ease. And once you have the checkpoint available, we have the evaluation mechanism available with the open source datasets for you to find out whether you have improved it effectively or not. And then we have already talked about NeMo Retriever within that embedding and re-ranker models are present. You can check out the documentation. And on top, there is another framework called as NeMo Guardrails that helps you apply jailbreak, topic control and also helps you provide a proper flow to the pipeline. If you want to create something like chatbot or so on, you can restrict the hallucinations in the models' responses with the help of NeMo Guardrails. And finally, as we already discussed, you can deploy it with NIMs. So let's see the screen share where all these presents. We'll also share it with you. So you can simply go and search for NeMo GitHub, although a better such would be NeMo framework. If you search NeMo framework here, all the different links. Complete documentation is available. You can see the release notes. All the user guide is present. Within this, if you scroll down, you'll find NeMo Aligner, Nemo Curator, Guardrails, everything is available. All the GitHubs are maintained separately. If you see the NeMo Guardrails, you'll find the information about how you can apply all the safety checks after -- during the deployment phase for the NeMo Curator, all the filtering techniques and extracting out the rich data you can do with NeMo Curator. So this all you can explore for building the entire life cycle. Let's come back to the case study and the demo. So this one quick demo I have created for Indian languages so that you can feel like with the fine-tuning and the retaining techniques that are available with NeMo, you can create your own models in the domain of speech, text to speech and LLMs and then also deploy them easily with and get the good performance. So let's see how it looks like. [Presentation]

Bharat Giddwani

executive
#4

So hope you get the idea. Basically, you could see the demo has the streaming speech to text, streaming text to speech as well as the LLM all working together. And along with LLM also, there are embedding retriever and re-ranking models were there for you to extract out the information from a vector database and also produce the final results. So this all was working fairly and effectively with ease so that you can also create something similar experimentation, I was wanted to show. So for this particular demo, this is a case study you have to -- let me go back. So the next slide would be muted by default for you. If you want to -- for you to listen, you have to come on your screen. And on the left side of the screen, there is a speaker option would be there. So just enable that and you would be able to listen to that. This is the case study we just cracked for the Indic and the South Asia work, which we have done with Tech Mahindra. [Presentation]

Bharat Giddwani

executive
#5

So that's it with the case study, but now let's see where all you can find all these information about how to experience the NIMs, then how to extract out all the data. So there's one simple typo here. This is the build.nvidia.com page where you can find and experience all the NIMs. Then to download it, you need to access the NGC hub. And then there is a Hugging Face page, where you'll find out all the models. In the previous video, you might have heard about in the Nemotron model. So these models, then there are models related to quality filtering, domain filtering, then certain LLM models that we fine-tune, all are available on Hugging Face NVIDIA. Go there and you would be able to find out those models that you can access and use it in your premises, if you want to work with that. Now one more interesting and exciting thing we need to offer for all of you is that we are providing free DLI access to you. One of -- so these DLIs are charged from $30 to up to $90 generally. But if you have accessed this workshop, you would be able to find out either of one of the DLI for free, use it. It's just you have to first log in into this page and find out the DLI or the workshop that interests you most. And choose one of that, and you would be able to log in it free of cost if you have already registered with this email ID in this particular workshop. So I would wait for 30 seconds for you to log in into this page. You can select one of the workshop and you would be able to gain the access for free accordingly. So let's access it and we'll wait for 30 seconds, and then we'll share one more exciting information post this slide. And at the end, we'll go through a discussion session with one of my colleagues about certain important NIM questions. All the complete agenda is present in the respective course as well. So if you are interested in adversarial machine learning or RAG or Omniverse, accordingly select that and log in. But with your email ID, you can only select one of the course, not more than that. Okay. So with the interest of time, I'll just move forward. Hope you all have taken the link. Otherwise, we will anyway share it with you. So now you can find out all the resources again on this page called as developer.nvidia.com, either just type it out on the Google or you can again take the access from this QR code. And all the information related to start-ups and ISVs, partnerships that we have and the software stack from the hundreds of domains that you might have noticed just in previous slides are available in this page. Now finally, NVIDIA GTC is coming on March 17 to 21. It is available both in person and virtually. With this link, you'll be able to register for free and as well as you would be able to gain lots of session insights that you would be finding out interesting. Later as well after GTC, if you don't have time during this period, if you register with this link, at the end of the GTC as well, you would be able to listen all the session that happened during that period. It also will cover lots of different advancements in the NIMs itself that would help you even further optimize things or if you make it efficient. So feel free to log in with this link and further have exciting news enabled in your [indiscernible]. We will share certain links after the session as well for you to get the registrations done for, and you will see this particular session as well as the other sessions in GTC that you might have encountered. So I'll wait for 30 seconds for you to scan this code. If you face any issue, let us know in the chat. And then we will move towards the Q&A section. Within GTC's page as well, you'd be able to find we have made lots of filtering available, like if you are targeting certain industry and certain domain. For example, you want to listen more towards the research topics, you want to know more about the optimization topics, you can accordingly filter the talks and add to your calendar so that you will get the notification as well in your -- during the session. Post that as well, you'll be able to see them, but that link is different. We'll share it with you once you register here and after the session done. Any questions regarding this, let us know in the chat. Is it slide is not working for everyone? I can see some people are mentioning slide is not loading. Okay. Still, I'll just go back and come back to the slide. For some of you, if you don't see it, you can refresh it at your side and then see maybe it will be visible. Okay, great. So one more minute, then we'll start with the Q&A, and then we'll end the session and we'll share the links to the registered email IDs. And then you'll be able to gain the access for both DLI as well as the on-demand recording of the session.

Bharat Giddwani

executive
#6

Okay. Sounds good. Then let's start with the Q&A round now. So along with me. I have my colleague, Utkarsh Uppal, who will be my co-presenter and discuss -- who'd be discussing about the question and answers that we have selected during this period. And let's talk more about that in detail.

Utkarsh Uppal

executive
#7

Yes, Bharat, can you hear me?

Bharat Giddwani

executive
#8

Yes.

Utkarsh Uppal

executive
#9

Okay. So a few of the things, right, based on what you were covering today, can you also sort of explain me the process of is it possible to integrate LoRa adapters with NIM framework? And if yes, firstly, can you just touch base on what exactly is the meaning of a LoRa adapter and how you can integrate that in certain NIM framework?

Bharat Giddwani

executive
#10

Sure. So if I give a quick example, for example, you might be generally using these models with most of the -- mostly with the frameworks likes of LangChain, LlamaIndex and so, right, where you don't actually find out the way to use a custom-tuned model in their API endpoints, in their like API endpoint classes basically, right? So for example, with NIMs, we have -- if I talk about LangChain specifically, we created something called as LangChain AI NVIDIA AI endpoints that helps you import ChatNVIDIA as the class and then that can be used for inference. Now if you want to use native models, it is pretty much simple. You can just mention the hosted URL and the name of the model. But alongside with that, if you want to use a specific adapter model, right, which you might have tuned with libraries likes of Onslaught, Hugging Face tuning libraries or NeMo framework, you would get an adapter module. If you're using LoRa, you might have noticed you will get another module that you would be -- you would have to use it in the folder where you have downloaded the NIM container or the NIM containers model actually. There, if you just host this model alongside with the actual LLM, you would be able to use it with the -- you would be able to use that in the LangChain or OpenAI code call itself. You just have to add one more parameter to it that will be calling the particular directory where you have hosted it. So it's pretty much easy with NIMs to access from the client as well as to host it during the inference. You don't even have to rerun the docker run command for the hosting process. Once NIM is hosted, you just have to load it in the directory, and then you have to restart the docker and that's it, you would be able to see the model is present within that and easily be able to infer it. So let me see if I find any client code also that would make you learn more. Meanwhile, you can answer the next -- we can talk about the next question, and I'll show that as well by resharing the screen at the end. So Utkarsh, there can be one more question, which I would like to ask you. So there were things like questions like how basically these modules works, and identify which hardware has the best suitable optimized version of the NIM profile, so available. So if you can answer that question, like how different types of optimization techniques are available within NIMs that it uses from the back end likes of TensorRT-LLM and how it identifies which should be taken for the specific GPU or the hardware from the user.

Utkarsh Uppal

executive
#11

Yes. So basically, and I also see a few questions in the chat around similar grounds. Essentially, if I were to deploy any model inside a NIM, right, and this is around some of the doubts which folks are asking in the chat. One thing we want to clarify that NIM from NVIDIA's perspective is saying that, hey, this is the best way of this is how you deploy LLM application in production, right? This is an inference architecture for LLMs from our perspective, which we believe that this is the best practice, which you should be following for production scenarios. That is what our NIM is, right? Now inside a NIM, essentially at the core will be a model, assume a Llama model, can be a Qwen model, can be Mistral, can be any model, right, can be a Hindi model as well. Now what we do is like based on the preview what -- from a hardware and optimization perspective, we work around with different back ends. For example, there are open source tools like TensorRT-LLM, vLLM, right? These are different compilers of optimizing these LLM architectures, making them ready for inference deployments. So what we do is since NIM -- inside the NIM architecture, right, we support both these compilers when we are optimizing. Let's suppose I want to deploy Llama 3.1 8B model, right? I want to just optimize that model. What we'll do is we'll run a lot of sweep search and we'll ensure that the specific kernels or that particular model or think of it like a component of the models are optimized for a specific hardware. So you will see your H100 profile, which we recommend running on H100 GPU. There will be A100 profile which we recommend running on a A100 GPU. There will be L40S [ like enables ] architecture-based profile, which will be compatible with the L40S GPU. For certain GPUs for example, H200, right, there will be separate profiles for those. The lower-grade GPUs like the previous generation, for example, Ampere might not support FP8, like it doesn't support FP8 precision, whereas the Hopper generation supports FP8. So we'll ensure that the engine and the corresponding optimizations are how they are supposed to for one particular hardware, and that's what we -- in fact, that's what we sort of bundle when you download that particular engine, what you were showing from the NGC. So yes, on the similar ground, right, let's assume I have downloaded the engine, I want to deploy on H100 particular Llama model. I wanted to ask like if you covered around Kubernetes Helm chart. Is there any metric exportations also like, for example, if I am working with any enterprise, I myself am an enterprise want to deploy the NIM and want to export certain metrics, is that supported in NIM for deployment?

Bharat Giddwani

executive
#12

It's a very good question. So yes, we have a lot of different observability features available within NIMs for you to explore. For example, let me show it by sharing the screen as well. So this is available within the documentation page itself, you can see. You can use these metrics, which will be coming on one of the port host, and that can be used in Prometheus or Grafana. On this page, you can see the information about Prometheus, Grafana and visualization that you'll see. You'll find out the count latency values in all the 3 ways: time to first token, inter-token latencies or end-to-end latencies, depending upon your use case to use case. If it is a chat use case, you might be interested in time to first token latency. If you want to identify the number of counts of input tokens and output-generated tokens for you to benchmark it, all these can be extracted as well as can be used to charge, say, for your own customers or for your own use, you can identify the feature, identify the usage. All these is available at the metrics endpoint. So it can be even extracted with the curl command, just wherever the model is deployed. By default, it is deployed on 8,000 port. If you simply add slash v1 metrics, you will get insights for all of them and identify if there is any bug during the deployment or what ways I can use to auto-scale the port and that you can write it with NIM operators further to work with. Apart from that, it can be used in -- like you can do the benchmark in 2 ways with NIMs. One, we have a suggested way of benchmarking tool available, called it GenAI-Perf here that you can use to do a benchmarking. You can see on the right side, GenAI-Perf is mentioned. And the whole information about what actually request per second means and how it does the performance, everything is visually available there. Apart from that, if you are by default using open source base of doing benchmarking like LLMPerfs, these also can be used and you would be able to benchmark your figures as well and accordingly define it during the Kubernetes deployment that you have for automatically scaling your pipelines. Yes, I think this is good for this. Let me know about something related to...

Utkarsh Uppal

executive
#13

So one more thing, Bharat, if you could show the blueprints, different blueprints, for example, the RAG blueprint. the vLLM blueprint, and the digital human blueprint. If possible, could you share the links, just share the screen?

Bharat Giddwani

executive
#14

Sure, this is a good question. I missed to show it in the build.nvidia.com page, but it is again available there only. I showed the model individually, which is Llama 3.3 at that time. But on top, if you can see, apart from models, we have blueprints also hosted here. So if you go here, you'll find a lot of different use cases or the reference examples available that are easily customizable for your own environment. You want to use some of our models, some of your own. Those things can be also done. For example, one pipeline like PDF to podcast, just upload a PDF. And then it does all the ingestion, run the all -- entire ingestion pipeline to extract out the necessary information. And then it runs an LLM to identify what can be sent to speaker 1, what can be sent to a speaker 2. And then accordingly, it will go to either ElevenLabs TTS or your own hosted TTS like either with NIM Riva or so on. You then would be able to get the final voice, which will be talking to each other, like 2 speakers talking to each other, so one example. And the entire architecture is also mentioned here. It's not just one single LLM call, the entire list of prompts which will go to the list of different models in simultaneous way. Obviously, you can make sure to use the one that you have and accordingly modify the pipeline. Then I talked about multimodal PDF extraction. So examples are available here like -- and it's -- the good part about this workflow is it can run it in a synchronous way as well as in parallel way. So you -- if you have a suite like a big PDF pool, it would be able to extract the text, table and the image description in textual form, in both markdown as well as JSON format for your RAG to consume it efficiently. Then there are others as well. So I'll just quickly show a quick demo. If it's working sometimes, obviously, it's hosted at our environment and multiple are using. It might not be working directly. But you can see there are 2 ways of digital humans also available 3D avatar, which the demos you also saw and the 2D avatar, both is available depending upon the avatar, according the page is present. In this recording, you won't be able to listen the voice, but you can ask any question. You can allow to voice and ask any question to Aria. So this avatar will come and accordingly answer this. You might not be able to listen it. But in your case, if you try it out at your end, it would be audible. But due to this platform's limitation, sometimes you won't be able to listen it. But here, you will be able to find all the blueprint for you to experiment and then accordingly optimize. We are also partnering with lots of libraries and the companies like LangChain, LlamaIndex and CrewAI to create such agentic workflows. Yes, Utkarsh?

Utkarsh Uppal

executive
#15

Yes, I think that's all. Most of it [indiscernible] and the material [ and the resources too, both in there ].

Bharat Giddwani

executive
#16

Thank you, everyone. And if you missed to register to GTC, I'll just launch this slide again. If you can -- if you have not done it yet, just go to the link and try to do it now. I think we can close in after 1 minute. We'll wait for 1 minute for everyone to log in. So there are questions like, can we use it in Windows? NIMs, right, you can use it. There are ways to return, but you have to use WSL in that case. So the resources that I have shared about both in the slides about how to deploy it with cloud service providers. There as well, you'll find the information about if you're using WSL-based type VM or native Linux, how to do it, that is available. And even in the documentation of the native NIMs, it is available. One more link I can share or the website. Let me show you that in the screen, it's although available mostly. I think it's not shared properly. Many of the -- many of you are asking like how to evaluate RAG and what's our best suggestion to create RAG use cases. One blueprint area you can explore. Apart from that, we also maintain a lot of different industry example add value to it by creating a lot of different use cases. You can go to this GitHub called generative AI examples by NVIDIA. There within RAG section and within community section, you'll find a lot of different examples are present. Plus within RAG, you will also find ways not just to do RAG, but also evaluate it and observe it. So the observability features and the evaluation features are explained in this notebooks and the repository. You would be able to explore how effectively you can find out the entire process is going and accordingly, optimize that. So the complete traceability is maintained here. Thank you. Yes, we can now close the session. Thank you, everyone, for attending the session.

This call discussed

For developers and AI pipelines

Programmatic access to NVIDIA Corporation earnings transcripts and 32,000+ others is available through the EarningsCalls.dev REST API. Plans from $24.99/month — full transcripts, speaker segments, full-text search, and the recently-added /api/v1/transcripts/recent polling endpoint for ETL pipelines.