NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary
December 14, 2023
Earnings Call Speaker Segments
Unknown Executive
Good morning and welcome to the webinar: Accelerate AI Model Inference at Scale for Financial Services. Before we begin, we wanted to cover a few housekeeping items. First, at the bottom of your screen, there are multiple application widgets that you can use. All of the widgets are resizable and movable. If you have any questions during the webcast, you can submit them through the Q&A widget. We will try to answer these throughout and at the end of the event. A copy of today's slide deck and additional help materials are available in the resources list, and we encourage you to download any resources or bookmark any links that you may find useful. You can find answers to some common technical issues in the Help widget at the bottom of your screen. And finally, an on-demand version of the webcast will be available approximately 1 hour after the webcast and can be accessed using the same audience link that was sent to you earlier. So thanks again for joining us today. And now I'll pass it over to Shankar.
Shankar Chandrasekaran
Great. This is the agenda for today's webinar. We'll look at accelerated computing and why it is needed, then cover NVIDIA's AI inference platform. Then we'll look at AI in financial services, with a specific focus on two use cases: transaction fraud, and contact center and back-office operations. Then we'll cover the benefits of the NVIDIA AI Enterprise software suite, and then resources to get started. NVIDIA is a platform company with a full-stack solution for AI training and inference. It starts with the processors: GPU, CPU and networking. On top of that, we have many drivers and SDKs that help us take advantage of the underlying compute resources for our workloads. Then we have thousands of applications that have been accelerated. And we have many SDKs and AI models for different use cases in different industries. And there is an ecosystem of developers who are using, contributing and providing feedback, which helps anybody develop applications for AI, HPC and many other areas on the platform. Now let's look at the need for accelerated computing. We'll look at three reasons why we need accelerated computing for AI. The first reason is that, on an accelerated computing platform, application performance increases exponentially, and that's what the chart on the left-hand side shows. This is not possible on a non-accelerated platform. The second reason is that data centers are power-limited, and accelerated computing can get us more out of the same power budget. The third reason is that models are growing exponentially. And if you look at the newer class of large language models, which are based on transformers, they are growing at an even faster rate than the other models. So these are the three reasons why we need accelerated computing, because this is how we can actually get real value out of our AI. And to take advantage of accelerated computing, NVIDIA offers a full-stack platform for AI for enterprises.
And what we see in this slide is the NVIDIA AI Enterprise software suite. It has all the software required, from infrastructure optimization to management and orchestration, and then the actual AI development and deployment tools and some very use case-specific app frameworks. And of course, we have some pretrained models as well. And this software stack can run workloads in the cloud and in the data center, and inference can happen at the edge or even on embedded devices. Okay. Now let's look at the actual AI life cycle very briefly. I'm sure all of us are familiar with this workflow. We start with data prep and data processing; then training at scale; then we optimize the model; and finally, we deploy and run the model at scale, right? This is a typical AI workflow. And once it's running in production, then at some point, we get new data and feed it back into the cycle, right? So that's how we retrain the models with the latest data. Now, given our webinar is on inference: inference is very important in this overall workflow because that's where the AI is actually put to work, and this is where we actually see the value from AI. So inference is the mission-critical production phase of AI. This is where AI can deliver successful outcomes under latency thresholds and other SLAs if done correctly. Okay. So we'll look at the key inference software from the NVIDIA inference platform before we dive into the actual use cases in financial services. The first one is Triton Inference Server. Triton Inference Server is inference serving software for fast, scalable and simplified inference serving. The way it achieves all of that is by doing all the things that you see here in this chart, starting with support for any framework.
So regardless of whether it's a machine learning or deep learning model, it supports all the popular frameworks, like TensorFlow, PyTorch and XGBoost; intermediate formats like ONNX; inference frameworks like TensorRT; even basic Python and more. By doing this, it allows our data scientists to choose whatever framework they need to develop and train the models, and then helps in production by streamlining model execution across these frameworks. It also supports multi-GPU, multi-node inference for large language models. The second benefit of Triton is that it can handle different types of processing, whether it is real-time processing or offline batch, or streaming input such as video or audio. It also supports pipelines. Because today, if you look at any AI application, any actual AI pipeline, it's not a single model that does the work. We have preprocessing, we have postprocessing, and there are many models that work in sequence, or some in parallel, for a specific inference. And so it supports that pipeline. The third benefit is that Triton can be used to run models on any platform. It supports CPUs and GPUs. It runs on various operating systems and, of course, in the cloud, on-prem, at the edge and on embedded devices. So essentially, it provides us a standardized way to deploy, run and scale AI models. And it works with many DevOps and MLOps tools, like Kubernetes, KServe and MLOps platforms in the cloud and on-prem. And this is how it's able to scale the models based on demand. It's able to offer all of these benefits without leaving any performance on the table. It provides the best performance on GPUs and CPUs, and it has unique capabilities like dynamic batching and concurrent execution. Thereby, it not only provides very high throughput with low latency, it also increases the utilization of the GPU. So essentially, it maximizes the investment, maximizing the ROI from those compute resources.
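To make the idea of dynamic batching mentioned above more concrete, here is a minimal sketch in plain Python. It is a simplified simulation, not Triton's actual scheduler: the `Request` class, the `dynamic_batches` function and the arrival times are all hypothetical, and the two parameters (a maximum batch size and a maximum queue delay) are assumptions chosen to mirror the general idea of trading a small wait for larger batches.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    request_id: int
    arrival_ms: float  # time the request reached the server


def dynamic_batches(requests: List[Request], max_batch_size: int,
                    max_queue_delay_ms: float) -> List[List[Request]]:
    """Group requests into batches (illustrative only).

    A batch is closed when it is full, or when the next request arrives
    more than max_queue_delay_ms after the first request in the batch.
    """
    batches: List[List[Request]] = []
    current: List[Request] = []
    for req in sorted(requests, key=lambda r: r.arrival_ms):
        if current:
            waited = req.arrival_ms - current[0].arrival_ms
            if len(current) >= max_batch_size or waited > max_queue_delay_ms:
                batches.append(current)
                current = []
        current.append(req)
    if current:
        batches.append(current)
    return batches


# Hypothetical traffic: a burst of five requests, then a straggler.
arrivals = [0.0, 0.2, 0.4, 0.5, 0.9, 5.0]
reqs = [Request(i, t) for i, t in enumerate(arrivals)]
batches = dynamic_batches(reqs, max_batch_size=4, max_queue_delay_ms=1.0)
batch_sizes = [len(b) for b in batches]
print(batch_sizes)
```

In this toy run, the burst is served mostly as one batch of four, which is where the throughput and utilization gains would come from; the straggler is served alone rather than being held back, which keeps latency bounded.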
The next key software is NVIDIA TensorRT. TensorRT is an SDK for optimized deep learning inference. It serves two purposes. First, it's a compiler. It compiles deep learning models, like CNNs, RNNs and transformers, into a highly performant, optimized format. Second, it's a runtime, and it runs those optimized formats very efficiently on GPUs. TensorRT is, of course, supported in Triton Inference Server as a back end. TensorRT uses many different mechanisms and algorithms to optimize the model, and you can see the main ones listed here. It uses reduced precision to get better performance without sacrificing accuracy. And it has techniques like layer fusion, tensor fusion, kernel auto-tuning, efficient use of memory and many more, right? This is how a just-trained model can become a highly performant, optimized model for GPUs using TensorRT. And the optimized model can run anywhere, right? It runs as part of Triton. So you can run it in the cloud, on-prem or even on an embedded device like Jetson. Okay. So let's look at it from a quantitative point of view: why accelerated inference? We'll do this using 3 charts. Essentially, what performance gains are we getting with accelerated inference across 3 workloads, right: ResNet-50, SSD-Large and BERT-Large. This is from the MLPerf benchmark. And clearly, we can see across the 3 workloads that running the model on a GPU, in this case an A100, with Triton and a TensorRT-optimized model, gives us an improvement in inference performance, measured in throughput, of anywhere from 7x to 73x compared to running the same model on a CPU, right? So what this means is we can do more inferences per second, or in a given time, compared to running on a CPU, right, which is a non-accelerated platform. By doing more in a given time, we can actually achieve cost benefits when operating at a reasonable scale.
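The link between throughput and cost can be worked through with simple arithmetic. The sketch below is illustrative only: the CPU throughput and both hourly prices are hypothetical numbers, not figures from the webinar or from any price list, and the speedup uses the low end of the 7x to 73x range quoted above.

```python
def cost_per_million(throughput_per_s: float, hourly_price: float) -> float:
    """Cost to serve one million inferences at a steady throughput."""
    inferences_per_hour = throughput_per_s * 3600
    return hourly_price / inferences_per_hour * 1_000_000


# Hypothetical numbers, for illustration only.
cpu_throughput = 100.0   # inferences per second on a CPU instance (assumed)
speedup = 7.0            # low end of the 7x to 73x range quoted above
gpu_throughput = cpu_throughput * speedup
cpu_price = 1.00         # assumed price per hour for the CPU instance
gpu_price = 3.00         # assumed price per hour for the GPU instance

cpu_cost = cost_per_million(cpu_throughput, cpu_price)
gpu_cost = cost_per_million(gpu_throughput, gpu_price)
print(round(cpu_cost, 2), round(gpu_cost, 2))
```

Under these assumptions the accelerated instance is cheaper per inference whenever the speedup exceeds the price ratio between the two instances, and only if there is enough traffic to keep it busy, which is the "reasonable scale" condition.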
Now I'm going to hand this webinar to my colleague, Pahal, and he will walk us through the financial services use cases.
Pahal Patangia
Thanks, Shankar. Now let's have a look at some of the key trends when it comes to AI in financial services and the kind of disruptive effect it can have in bringing positive outcomes within the industry. So as we stand today, AI is at an inflection point, and we are seeing 3 key trends within financial services in particular. The first one is the adoption of deep learning, and that has only been accentuated by the recent rise of generative AI and the popularity of the transformer architecture. What this has done is align executives and ML practitioners alike so that they can hop onto this bandwagon and bring on new use cases where AI can be embedded across the enterprise, in different sub-verticals and different lines of business. The second trend we are seeing is that Moore's Law is dead. The expectation that compute performance doubles every 18 to 24 months has become antiquated. So what's important is that, with the rise of complex models of large sizes, as well as the large-scale data sets which are prevalent in the industry today, there is a greater need for an accelerated compute platform to do AI. And the third trend which we are seeing is that AI has become more democratized than ever. If you have the right platform, the right data, the right skill set and the business use cases, then there is greater alignment and incentive for a business to do AI today. We recently did a survey of 500 executives within financial services, and 83% of financial services leaders said AI is key to their success. And how they plan to achieve that is through the 3 outcomes which are outlined on this slide. The first is accurately predicting risk across different lines of business. The second is making the organization and workforce more efficient. And the third is reducing costs [ overboard ].
This makes it super important to embed AI across different use cases in the financial institution if we are to achieve these outcomes. Now let's understand the AI use cases within different sub-verticals of financial services. So if you think of banking and fintech, you would see that AI has an impact across different touch points in a customer's journey and life cycle. At the point of acquisition, when the customer first interacts with the bank, there are ID verification and KYC checks. As the customer goes further into the funnel, if they are applying for credit or a loan, there are underwriting models in place to determine the risk of the customer. And as the customer gets onboarded onto the books, the bank manages the risk of the customer, leveraging default prediction models and collection models, and also builds marketing and personalization models to give a next-best offer or to upsell or cross-sell products across its product suite. And in all these processes, if the customer wants to interface with the bank directly, there are use cases within contact center AI and virtual assistants which come into play. All in all, there's a holistic penetration of AI within the banking ecosystem. Now coming to the payments side. The biggest use cases which we see are in credit card fraud detection, in terms of transaction fraud, which is very prevalent. Not to forget the anti-money laundering checks, which are important to determine whether a payment or a transaction is linked to legitimate sources or not. And moving further into payments, settlements and payments reconciliation, there is a new rule for T+1 settlement coming in, with which banks and financial institutions have to comply. So there are a lot of prominent use cases for AI there as well.
And, going back to the credit card side, not to forget credit card limit authorization and limit prediction models, which run on top of each customer's risk profile. So moving on to capital markets. Particularly in capital markets, we see a lot of unstructured data being leveraged to determine the risk factors. So be it ESG investing or sentiment analysis; the LIBOR transition which is happening these days; analyzing and summarizing 10-K and 10-Q reports; or building reports from an equity research perspective, you would see different use cases of AI being embedded across the capital markets domain. And insurance in particular has pretty much similar use cases to banking and fintech in terms of determining the risk and the price ratios. One thing which stands out is detecting claims and damages from pictures. So that is an industry which uses a lot of computer vision models to determine anomalies as well as to determine payouts for damages. So all in all, across all these subdomains, what remains common is cybersecurity, to help financial institutions fight bad actors. So now you see that the range of AI models varies in complexity across different subverticals and different use cases, and training them is done critically on an accelerated computing platform. Today, we'll look particularly into use cases where we talk about serving these complex models and what the stack looks like in terms of AI workflow for a select number of use cases. So the first use case which we'll deep dive into is credit card transaction fraud. So let's understand what a fraud detection workflow looks like from an AI pipeline lens. An incoming transaction could originate either from a terminal or online. And the transaction request then goes through a set of sanity checks: whether the CVV is correctly entered or not, or whether the details are input correctly or not.
And once it passes that, the request is then processed to add more features, using the existing set of information it has, to make it more predictive and more suitable to be fed to the ML model. And as this processed data goes to the next stage, it is passed to a set of rules which are defined by the business, based on heuristics or some business logic which is in line with the risk strategy. It could be some fact checks, it could be blocked accounts, it could be sanctioned accounts, et cetera, which have to be filtered through a set of such predefined rules. And then once the requests have been processed through these rules, they are fed to a set of models, which could be either ML models or more complex models like deep learning models, graph neural networks and transformers, or an ensemble of all or some of these kinds of models. And what these trained models would do is process each request and produce a probability score, which is the score indicating the probability of a transaction being fraudulent. And based on a certain yardstick decided by the risk teams or the business, the transaction could be classified as approved or declined. Or, in some gray-area cases, it is sent to manual experts to review further and then make a decision. So all in all, this is broadly the entire workflow for transaction fraud detection. And what's important to note here is that Triton interplays with each of the different facets of this workflow. Be it processing incoming data requests, incorporating rules via business logic scripting, hosting different kinds of ML or DL models, hosting multiple copies of these models, or hosting them in an ensemble fashion, Triton has the ability to manage all of these as a single-point serving solution. And that is why it is a strong fit for serving inference requests in use cases like credit card transaction fraud.
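The workflow described above (sanity checks, feature enrichment, business rules, an ensemble of models, then a thresholded decision) can be sketched in a few lines of plain Python. Everything here is hypothetical and for illustration only: the `Transaction` fields, the block list, the two toy scoring functions standing in for trained models, and the thresholds are assumptions, and none of this uses Triton's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Transaction:
    card_id: str
    amount: float
    cvv_ok: bool
    channel: str  # "terminal" or "online"


BLOCKED_CARDS = {"card-blocked"}  # hypothetical business rule: block list


def enrich(txn: Transaction) -> Dict[str, float]:
    """Feature enrichment step: derive model inputs from the raw request."""
    return {
        "amount": txn.amount,
        "is_online": 1.0 if txn.channel == "online" else 0.0,
    }


# Toy stand-ins for trained models; each returns a fraud probability.
def model_a(features: Dict[str, float]) -> float:
    return min(1.0, features["amount"] / 10_000)


def model_b(features: Dict[str, float]) -> float:
    return 0.6 if features["is_online"] else 0.2


MODELS: List[Callable[[Dict[str, float]], float]] = [model_a, model_b]


def decide(txn: Transaction, approve_below: float = 0.3,
           decline_above: float = 0.7) -> str:
    if not txn.cvv_ok:                      # sanity check
        return "declined"
    if txn.card_id in BLOCKED_CARDS:        # predefined business rule
        return "declined"
    features = enrich(txn)
    score = sum(m(features) for m in MODELS) / len(MODELS)  # simple ensemble
    if score < approve_below:
        return "approved"
    if score > decline_above:
        return "declined"
    return "manual_review"                  # gray area goes to human experts


print(decide(Transaction("card-1", 50.0, True, "terminal")))
print(decide(Transaction("card-2", 9000.0, True, "online")))
print(decide(Transaction("card-3", 4000.0, True, "online")))
```

Averaging the model scores is just one possible way to combine an ensemble; in practice the combination and the two thresholds would be set by the risk team.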
Now traditionally, fraud detection has been performed solely using heuristic rules, and that was quite a reactive approach when it comes to detecting fraud and reducing your fraud losses. Why is that so? Because fraudsters today have become increasingly sophisticated and smart in their ways and are continuously finding new avenues to master the current banking system. And that is where the need to move beyond rules alone, and to use them in combination with advanced ML and deep learning models, came into play. So what you see today is that a fraud detection pipeline is largely a combination of rules plus simple ML models, advanced models like XGBoost, and even more complex deep learning models like transformers, graph neural networks, et cetera. So you would see an ensemble of all of these in a typical pipeline. And where Triton really shines is that it supports multiple model architectures from different frameworks, be it PyTorch or TensorFlow. It has dynamic batching capabilities for multiple requests coming in through the serving solution. It can host multiple copies of the same model to do some testing. And you can create custom back ends to incorporate certain logic or certain criteria to route your request from one model to another in a pipeline. So these are very interesting features, leading to some high-level technical KPIs which would be of interest to an engineering audience. So high throughput, meaning a large number of requests processed in a fraction of a second. And the latency for serving these requests, in the case of fraud detection, is in milliseconds. And you can scale it out using Kubernetes and, at the same time, make sure that your machine utilization, which is that of your GPU machines, is kept to the maximum. And all the while, it means that you are saving costs in your fraud detection pipeline setup for inference.
And using AI models hosted within Triton Inference Server is critical to bringing some great business outcomes in terms of better fraud detection efficiency for the organization. A greater number of requests are processed per second. And there's a lower case load for manual reviewers. That is largely attributed to the models being accurate and serving the right decisions in the least amount of time. It also means that your infrastructure costs and your fraud losses are kept to a minimum, and it all leads to a growing set of happy customers who are not wrongly denied a legitimate transaction. So these are great outcomes from a business perspective which are enabled by Triton. We have a really cool case study with one of our really great partners, American Express, for credit card fraud detection. Amex manages a volume of more than 8 billion credit card transactions per year. And they were building deep learning models for fraud detection using the NVIDIA AI platform. What they found was that, by training those models, their fraud detection accuracy improved by 6%. And at the same time, when they were serving these AI models, the latency which they observed was less than 2 milliseconds in production. So this is the kind of low-latency, high-throughput inference for complex, large models which we are talking about, at scale, from one of the largest payment processors in the United States. So it's a really good testament to the capabilities of Triton and the impact it can have on the bottom line in financial services. And imagine, at the scale of Amex, how much that reduced latency would have resulted in improved customer experience, increased convenience and ultimately more customers being retained on the platform. The savings and the bottom-line impact are huge.
The next use case which we'll look into is the use of AI models within a contact center and how it helps with back-office operations in financial services institutions. Here, we will dig into the details of a typical contact center, how it addresses customer interactions and different workflows, and how it manages back-office operations. And we'll see how AI models come into play in the entire process. So the customer tries to engage with the contact center; it could be for support, it could be for troubleshooting, or it could be for completion of a certain business process task. What happens is the customer request is redirected. So we see the issue categorization use case coming in. Now the agent will work with the customer to address the problem. It could be an agent working in a Q&A conversation fashion with the goal of retrieving some information from an enterprise knowledge base, where retrieval models and search models come into play, answering questions with context in mind. Or, in another case, a customer may have uploaded some documents to process for a loan application, for example. So those have to be processed and made sense of, or certain values need to be extracted from them, from tables or from different fields. It's also crucial how smoothly an agent arrives at a conclusion, which is influenced by models trained on past transcripts of conversations with customers. So we see that, in the entire contact center, what sits at the heart of all of it is LLMs, and there is a requirement to serve a large number of requests in real time. As we saw in the case of the contact center, the response time becomes a pretty important business KPI. It is related to both the efficiency of the contact center and the happiness of the end customer. So Triton's capabilities, like low-latency and high-throughput inference, become extremely critical here.
Now the LLMs which are served for these use cases are trained on a large corpus of financial documents, financial interactions with customers, et cetera. And these can quickly get very large in size and may not fit on a single machine. That is where Triton's integration with TensorRT comes into play, which enables the models to become lighter for inference without much loss in accuracy. And secondly, Triton also brings the capability to scale out inference across multiple GPUs and nodes, if needed, depending on the model size. Another point to note is that these workflows have a lot of rules and business logic scripted into the process automation, and it is important that those work in conjunction with the model inference process while serving requests. So Triton's capability of business logic scripting becomes vital here. And lastly, we should not forget the energy efficiency and infrastructure savings which Triton brings by maintaining maximum machine utilization, all while running inference on large-scale models in large volumes. So in terms of business outcomes, we are pretty much aware that agents worldwide are overwhelmed with customer queries. Especially in this uncertain financial climate, customer interaction volumes are higher than ever. So it has become increasingly important that agents are augmented with AI so that they can become more efficient. And what these models do in general is make their life easier and ultimately enable them to produce more output. Secondly, this ultimately results in a higher number of queries resolved in a given time frame, a metric which is very crucial to a contact center's success, if you will. And on the customer side, this would mean that they walk away happy after the conversation or the interaction. It fosters more trust in the relationship which they have with the financial institution, and satisfaction levels remain high.
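A rough back-of-the-envelope calculation shows why large models may not fit on a single GPU and why reduced precision helps. This is a sketch under stated assumptions: the 70-billion-parameter model size, the 80 GB of GPU memory and the 80% usable fraction are hypothetical, and the estimate counts only the model weights, ignoring activations and other runtime memory.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, precision: str) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus(num_params: float, precision: str, gpu_memory_gb: float,
             usable_fraction: float = 0.8) -> int:
    """Lower bound on GPUs needed to hold the weights.

    usable_fraction is an assumed share of GPU memory left for weights
    after reserving room for activations and other runtime buffers.
    """
    usable = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(num_params, precision) / usable)


params = 70e9  # hypothetical 70-billion-parameter model
for precision in ("fp32", "fp16", "int8"):
    print(precision,
          weight_memory_gb(params, precision),
          min_gpus(params, precision, gpu_memory_gb=80))
```

Under these assumptions, halving the bytes per parameter roughly halves the number of GPUs the weights must be spread across, which is the sense in which an optimized model becomes "lighter" for inference.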
And from a financial institution's perspective, this would imply that happier customers will continue to be sticky, and that will result in a greater lifetime value for the customer and a greater impact on the bank's or the financial institution's bottom line, which is a win-win for all the parties. So we have seen the theoretical aspects of how large-scale inference would play out in a contact center setup. But I think it's a good idea to also understand what it means from a real-life industry perspective. And we have a great story with an industry behemoth, Airtel in India, which is the second-largest provider of wireless communication services there. Airtel's contact centers manage more than 200,000 calls per day. And they are leveraging automatic speech recognition models to detect what the customers are saying so that it can be a smoother conversation, both from the customer side and from the employee side. And it reduces the back and forth which can happen in a conversation due to misunderstanding or miscommunication. Now these models in particular were served using Triton Inference Server on GPUs. And what Airtel saw was 3x the throughput compared to their original serving solution, which meant that a higher number of requests could be sent through their ASR models, making their entire workflow more efficient and allowing a better experience for both their employees and their customers. So it's a really great industry example of how Triton Inference Server has enabled an enterprise like Airtel to benefit all of its stakeholders, I must say. Now let's pass on to Shankar, where we can learn how we can leverage Triton as a part of the NVIDIA AI Enterprise platform on premises, in the cloud or in a hybrid environment setup.
Shankar Chandrasekaran
Great. Thanks, Pahal. Now that we've looked at how accelerated inference transforms the 2 use cases that we saw, and many more, let's look at the overall platform benefits, the benefits that we get from NVIDIA AI Enterprise software. Great. Inference is a production activity. The models are run in production, they are part of the business applications, and mission-critical business processes use those applications and models. So it's essential that we treat them like any other enterprise-grade software, and NVIDIA AI Enterprise does that. So we'll look at benefits in these 5 different areas. The first is security and reliability. The first thing it provides is that it scans the software containers for vulnerabilities and also notifies users of potential vulnerabilities. And as these bugs and vulnerabilities get fixed, we get those fixes on the production branches. And the benefit of being on a production branch is API compatibility. So we can continue to consume these security patches and bug fixes without having to worry about API breakage and other incompatibilities. The second area where it provides value is some advanced features. These are features exclusive to NVIDIA AI Enterprise, features like AI workflows. Think of an AI workflow as a blueprint for using multiple AI models, along with other components, to solve a specific business problem. The second exclusive feature is that there are some pretrained models available that we can start with. And then there are solutions like Triton Management Service, which is a very useful service for managing a large-scale model deployment. And these kinds of services are exclusive to the NVIDIA AI Enterprise subscription. The third area in which it provides value is certification. Our production applications, including AI, run on established platforms and tools.
And so this software, as part of NVIDIA AI Enterprise, is certified on some of the popular platforms, whether it's VMware vSphere or Red Hat OpenShift, or on the public cloud providers. And so certification gives us the peace of mind that, hey, this has been tested and validated on those platforms. The fourth area where it provides value is the ongoing delivery of enhancements and features. So patches, whether bug fixes or security updates, and feature upgrades or even feature enhancements, are provided as part of this. And again, keep in mind that we have these production branches, right? So we can consume these updates and upgrades without having to worry about rework, revalidation and reverification of our solution or application. The fifth area is technical support. This is very important because it is in production. So if for some reason something goes wrong, if something fails, then being on an NVIDIA AI Enterprise subscription allows us to call NVIDIA experts to help us troubleshoot and get to a quicker remediation. And there are different types of support available. So these are the 5 areas where we get value from an NVIDIA AI Enterprise subscription. And these are essential for any production application, and inference is exactly that. Okay. Let's have a quick deep dive into the API stability that I mentioned. We'll actually start on the right-hand side, right? What we see here is that, on the same platform (here, I think we are using the H100 as an illustration), on the same GPU, the model or the application gets better performance because we constantly tune and optimize software like TensorRT, for example, right? So it's the same model, and here we are basically comparing MLPerf Inference 2.1 to 3.0. For those models, like DLRM, BERT, ResNet and so on, we get higher performance on the same GPU with the newer version of the software.
And this is because we are constantly innovating in our software to provide better optimization and better performance. And at the same time, as we do this, we make sure that, in the production branch, all the dependencies are taken care of. So we can continue to consume this innovation and, at the same time, not have to rework our application to get those benefits, right? So that's the benefit of the API stability that is part of NVIDIA AI Enterprise. Now let's look at the availability of NVIDIA AI Enterprise on the CSP marketplaces. There are multiple ways to consume NVIDIA AI Enterprise on public cloud service providers. The first is on demand, where we can go to any of the marketplaces that we see here on this slide and just start using it. The second way, which I think is how the majority of us would consume it, is a private offer. With this, we can assess our usage, or the demand that we have for the various use cases, and then have a private offer for NVIDIA AI Enterprise on the CSP instances. There is also a third way, which is bring your own license. This is where we could purchase the NVIDIA AI Enterprise license through a channel partner and use that license on a certified cloud instance. So there are multiple ways in which NVIDIA AI Enterprise can be used on CSPs. And of course, we can also use it on-premises, in our own data centers. Okay. We are coming to the end of our presentation, and we'll look at resources for getting started. There are 3 ways to start. The first is NVIDIA LaunchPad. NVIDIA LaunchPad offers free hands-on labs hosted on NVIDIA infrastructure. So we can just go there and sign up, and there are many labs available across different use cases. It's a very easy way to get started, to get experience and to sample the software that we talked about in this webinar on NVIDIA-hosted infrastructure. We have free access for 3 days.
So let's say we liked it and we want to extend the evaluation or maybe run our own custom workloads on our infrastructure, we could do that. The second way is to sign up for an on-premises evaluation for 90 days. And the third way is, as I mentioned on the previous slide, that we can access it on demand from the public cloud marketplaces. The on-demand offerings are, of course, priced per GPU per hour, and this is probably a quick way to go and try it on the cloud of our choice. And in any case, talk to your NVIDIA account representative today, and they can help guide you on plan, use case and strategy. Because as Pahal mentioned, we covered 2 use cases here, but accelerated computing also applies to so many different use cases, right? And so it's a good idea to talk to them so that they can help us come up with a plan and strategy and eventually get started with NVIDIA AI Enterprise. Okay. Great. So this completes our presentation, and we'll go to the Q&A session and see if there are any questions that have to be answered. I know we have been answering many of them in the chat, and let's see if there are any more questions to be answered. And thank you so much for listening.
Unknown Executive
executiveGreat. Thank you, Shankar. Yes, there have been a lot of really awesome questions today from the audience. So give us a few minutes to look them over, and then we can dive right in.
Shankar Chandrasekaran
executiveAll right, great. So this is the Q&A. We've been answering the questions in the chat, but I think we'll take a few of the important ones and answer them live again. I'll start with the first one. That was a question on ensembles or pipelines: how does Triton orchestrate the pipeline? This is an important question because in modern inference, everything is a pipeline, right? Inference requires multiple models, with preprocessing and postprocessing, all working together to solve for that one specific request. And as I put in the chat, there are 2 ways of doing it in Triton. One is through an ensemble model, the other one is called business logic scripting. Each provides its own benefits and is good for different use cases, but it's very flexible. We could construct a pipeline on GPUs or CPUs, with some components running on GPUs and some components running on CPUs. It's a very flexible way to construct the pipeline using Triton. And this is what every inference today needs. So it's absolutely possible, and Triton gives a very flexible way to construct pipelines, from very simple to complex, for all kinds of use cases. Good question. Let's take the next one. This one, I'm going to hand over to my colleague, Neil. There was a question on, is it better to run one Triton per GPU? Or should I be running Triton on a server which has got multiple GPUs, right? How do we make that decision? So Neil, can you answer that? I know we answered it in the chat, but can you go over it again live?
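The two pipeline styles named in the answer above, an ensemble model and business logic scripting (BLS), can be pictured with a small plain-Python sketch. This is only an illustration of the data flow, not Triton's API: the stage functions (`preprocess`, `score`, `second_opinion`, `postprocess`), the weights and the threshold are all hypothetical. The fixed chain stands in for an ensemble, where the steps are wired together ahead of time; the scripted version stands in for BLS, where code decides at request time which models to call.

```python
from typing import Any, Callable, Dict, List

# All stages below are hypothetical stand-ins, not Triton calls.
def preprocess(txn: Dict[str, float]) -> List[float]:
    # Scale the raw amount; pass the hour of day through unchanged.
    return [txn["amount"] / 1000.0, txn["hour"]]

def score(features: List[float]) -> float:
    # Toy linear "model"; a real pipeline would call a served model here.
    weights = [0.5, 0.01]
    return sum(w * x for w, x in zip(weights, features))

def second_opinion(features: List[float]) -> float:
    # A second, made-up model that is only worth calling on suspicious requests.
    return 0.9 * features[0]

def postprocess(raw_score: float) -> Dict[str, Any]:
    # Turn the raw score into a decision using an arbitrary threshold.
    return {"score": raw_score, "flag": raw_score > 1.0}

def run_chain(steps: List[Callable[[Any], Any]], request: Any) -> Any:
    # Ensemble-style: a fixed sequence, each step feeding the next.
    result = request
    for step in steps:
        result = step(result)
    return result

def run_with_logic(txn: Dict[str, float]) -> Dict[str, Any]:
    # BLS-style: ordinary code decides whether the second model runs at all.
    features = preprocess(txn)
    raw = score(features)
    if raw > 1.0:
        raw = max(raw, second_opinion(features))
    return postprocess(raw)

chained = run_chain([preprocess, score, postprocess], {"amount": 2500.0, "hour": 3.0})
scripted_high = run_with_logic({"amount": 2500.0, "hour": 3.0})
scripted_low = run_with_logic({"amount": 100.0, "hour": 12.0})
print(chained, scripted_high, scripted_low)
```

The fixed chain is the simpler of the two to reason about; the scripted form is the one to reach for when a step should run only under some condition.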
Unknown Executive
executiveYes, yes, absolutely. It's a really good question. And it's going to depend a lot on the specifics of what you're looking for from your system design. There are benefits to handling it both ways. So I would say one of the benefits of having a separate Triton server per GPU is that it's much simpler to orchestrate, especially if you're using some kind of container orchestration system like Kubernetes. You can do things like auto-scaling your servers when they're tied to a single GPU. You also get a bit of fault tolerance, so that if the workloads that are running on 1 GPU fail for some reason, then the workloads running on the other GPUs won't run into any issues. So those are some of the benefits of having a separate Triton server per GPU. On the flip side, you can have one Triton server orchestrate models across multiple GPUs. This can be especially useful if, for example, you have a multi-GPU model that's too big to fit on a single GPU, or if you have a pipeline of models like Shankar was talking about, either in BLS or through a model ensemble, that needs to be split across multiple GPUs. Within Triton, you can specify which GPUs any of your particular models need to be allocated to. So you certainly can control multiple GPUs with a single Triton server. At the end of the day, it's going to depend on exactly what you're looking for and some of the specifics of your application design.
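For the single-server, multi-GPU case described above, GPU placement is set per model in that model's configuration file through an `instance_group` entry. The sketch below is a small Python helper that renders such an entry as text. The helper name and the model names in the example layout are hypothetical; the `instance_group`, `count`, `kind: KIND_GPU` and `gpus` fields are taken from Triton's model configuration format, and should be checked against the documentation for the Triton version in use.

```python
from typing import Dict, Sequence

def instance_group(gpus: Sequence[int], count: int = 1) -> str:
    """Render an instance_group entry that asks for `count` instances
    of a model on each of the listed GPU device IDs."""
    gpu_list = ", ".join(str(g) for g in gpus)
    return (
        "instance_group [\n"
        "  {\n"
        f"    count: {count}\n"
        "    kind: KIND_GPU\n"
        f"    gpus: [ {gpu_list} ]\n"
        "  }\n"
        "]\n"
    )

# Hypothetical layout: model names and GPU assignments are made up.
layout: Dict[str, Sequence[int]] = {
    "fraud_trees": [0],
    "text_encoder": [1, 2],
}

for model_name, gpu_ids in layout.items():
    print(f"# config.pbtxt fragment for {model_name}")
    print(instance_group(gpu_ids))

text = instance_group([0, 1], count=2)
```

With one Triton server per GPU, the same effect usually comes from the orchestrator exposing a single device to each container, so no per-model placement is needed.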
Shankar Chandrasekaran
executiveGreat. Thanks, Neil. I'm going to pick the next one; this is for Pahal. Pahal, in financial services, explainability is a big deal. Could you please tell us the importance and what we can do there?
Pahal Patangia
executiveAbsolutely. As regulators have determined, it is pretty critical for every financial services firm which has a fiduciary responsibility to provide explainability in their decisions. Where Triton plays in the stack is particularly with tree-based models, where there is an integration with Shapley values. Triton has a back end called Triton FIL, which is the Forest Inference Library, for serving tree-based models. And what happens there is FIL integrates with Shapley values, which run on GPUs. So within Triton, if you are serving a pipeline which is running on tree-based models, like XGBoost, like GBM, et cetera, you can invoke the FIL back end. And using the FIL back end, you can spit out the Shapley values in production.
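To make concrete what a Shapley value is in this setting, here is a brute-force Python sketch on a toy decision tree. It is not the FIL back end's algorithm and makes no Triton calls: the tree, the feature names, the transaction and the baseline are all invented for illustration. Each feature's value is its average marginal contribution to the score as features are switched from the baseline to the actual transaction, taken over every ordering of the features.

```python
from itertools import permutations
from typing import Callable, List, Sequence

def shapley_values(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values by enumerating all feature orderings.
    Cost grows factorially, so this is only for tiny examples."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        prev = f(current)
        for i in order:
            current[i] = x[i]       # switch feature i to its real value
            new = f(current)
            phi[i] += new - prev    # marginal contribution in this ordering
            prev = new
    return [p / len(orderings) for p in phi]

def toy_tree(v: Sequence[float]) -> float:
    # Hypothetical fraud-score tree on (amount, hour, is_foreign).
    amount, hour, is_foreign = v
    if amount > 1000:
        return 0.9 if is_foreign else 0.6
    return 0.3 if hour < 6 else 0.1

txn = [2500.0, 3.0, 1.0]        # made-up transaction to explain
baseline = [100.0, 12.0, 0.0]   # made-up "typical" transaction
phi = shapley_values(toy_tree, txn, baseline)
for name, value in zip(["amount", "hour", "is_foreign"], phi):
    print(f"{name}: {value:+.4f}")
```

The contributions add up to the difference between the score for the transaction and the score for the baseline, which is the property that makes Shapley values useful for explaining an individual decision.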
Shankar Chandrasekaran
executiveOkay. Thanks, Pahal. Yes, and I see one question here. Does it make sense to deploy a hybrid infrastructure, on-prem plus the CSP for on-demand resources? Yes, I mean, it depends on your business need. And yes, that's a possibility. And the way Triton and the NVIDIA inference platform help is, it's the same stack that we'll be using on-prem and in the cloud. In fact, Triton is already integrated into many of the CSP services, like SageMaker or Azure ML or Vertex AI and others. And so if we are standardizing our inferencing on Triton, it's going to give us the same experience on-prem or on the cloud. So it's a great way to start building. And the hybrid really comes down to what our use cases are, right? We may want to deploy certain applications on-prem and certain applications on the cloud, or in another case we may want it on demand, meaning we normally run on-prem, but for certain cases we want to extend to the cloud. So it's possible, but I think implementing that would go beyond the scope of what Triton provides, because you'll have to look at the infrastructure as a whole. But it's absolutely a possibility, and we have a lot of customers that actually do that. Okay. Great. So that brings us to the end of this webinar. Thanks so much for all the questions, and thanks again for watching. Have a good day.