NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary

November 17, 2023

NASDAQ US Information Technology Semiconductors and Semiconductor Equipment special 68 min

Earnings Call Speaker Segments

David Taubenheim

executive
#1

Welcome to the Fast Path to developing with Large Language Models. I'm David Taubenheim, a senior solutions engineer at NVIDIA developer programs. Let's begin. Here's today's agenda. We're going to start off by looking at large language models in a wider context to give you some familiarity if you're not already. We'll also do the demonstration of an App that's taking advantage of the capabilities of large language models to help with the problem we all have. We'll then look at using large language model APIs, an example. We'll then move on to prompt engineering and show you the considerations of building your promts for your application. We'll move ahead to using large language model workflow frameworks and do a bit of an analysis of a few different ones for you. And then finally, we'll talk about how we can use one of those frameworks to combine large language models with your data. Therefore, your data is being used as part of the prompt. Historically, language models have been trained for specific tasks. Things like text classification or entity extraction, where you're trying to find the names of people, places or things, question answering. But then in 2017, the large model revolution began in earnest. And this was powered by transformer models. That's a particular type of deep-learning architecture that specializes in processing sequence of data points or tokens where tokens are numbers are presented in words or parts of words. These transformer architectures use self-attention to figure out which parts of the sequence can help interpret the other parts of the sequence. And this came about in a paper written by Google and University of Toronto researchers called "Attention is all you need" It's a famous paper and Large Language Model evolution. Now we have these much larger models trained on extraordinary quantities of data, billions and trillions of tokens. And looking over at the tree on the right, we can see that there's been an explosion of models. So we just want to point out that some of the ones that you might have heard of, like GPT4 or Llama aren't the only ones there. There might be some that are specifically good for what you want. We'll talk about how to pick some of those in a little while. I also wanted to point out the branches in this evolution tree. Lots of the attention has gone to the GPT decoder only branch. We're going to be showing uses of the encoder only branch as well. And there are even cases of the encoder and decoder branch be useful like human language translation. But the complexity of the models isn't really reflected in the complexity of the APIs, which have begun to converge substantially. So don't worry, we're going to show -- you the audience how to access this field. As for the models are built with unsupervised learning, and they have proven to be very effective at next token prediction. Sure enough. Looking over the left, you can kind of see what I mean. The sky is the phrase that's going into the large language model with log probability is a blue clear, usually the and less than coming out. Of course, Blue has the most likely probability being the least negative and the highest number, the most positive number. So that is the most likely next word. And this goes on and on, and that's how we can observe these applications like ChatGPT that are predicting on the next word, but doing so with such correctness and fluidity that it sounds like a knowledgeable person is telling you the answer that you queuried about. [indiscernible] are called foundation models. When we pretrained them like this on unlabeled data sets, and they can be tuned later to a bunch of specialized applications. And some of those might be what we talked about earlier, the traditional natural language processing tasks or it can learn general or other domain-specific knowledge. Or it can perform new tasks with few or no examples. So in that case, the large language model is a scale-up architecture that can perform a lot of various large language tasks like summarizing, translating even composing new content, and that's what gives it it's generative name. Let's hop into an example where we're going to use a large language model to help us triage our company e-mail. Imagine the case of the fictitious company Melodious, who manufactures musical instruments and audio equipment. Their issue is that they have hundreds of e-mails that come in every day from customers with various needs from urgent needs to non urgent needs from repairs to complements. This is left for somebody to go through and somebody would have to also assign which of these customer service representatives the e-mail would need to be handled by. But what if we can redo this and think about it in a more modern way that takes advantage of large language models abilities. Now our inbox looks quite a bit different. We have a description of the problem rather than just the subject of the e-mail. And when we click into one of those e-mails, we see that a few characteristics have been sussed out of the e-mail by the generative AI, by the large language model. What product e-mails about, representative, who should handle it, the tone, a summary of the issue and then an assignment of priority. In this case, it's the most urgent response. And so we see that there are several e-mails here, all that are urgent ,not urgent or in some cases, some e-mails don't need a response right now, so let's not take time working on them if we're very busy. We can also look at which customers support representatives have which e-mails. So Chris looks like he's going to be plenty busy. We can also choose products. So we can go down just the same way that we did with the customer service representatives through the list and pick and sort by those particular instruments. Now let's say that there's an issue that needs to be further researched by the customer service representatives. They can click on research issue and what pops up is a summary of the issue and then the summary of the search results into our assets at Melodious. We'll talk about how this works a little bit later, but what you saw here happen was that the summary was created on the fly based on the customer's e-mail and then the sources that were found that are very similar in semantic value to the question that came in from the customer. So these resources are intended to help the customer service response address the issue the customer is bringing. How did that demonstration work? How are we able to use a large language model to parse through those e-mails and triage them for us? Well, what we had going in is a semi-structured text input just containing an e-mail body. That e-mail body is then added to a prompt that gives a large language model a task to do with the e-mail body. We call that large language model through an API. And then finally, the large language models output was requested to be in the JSON format. And sure enough, that's our outlook. Next, let's consider how that large language model is functioning and think step-by-step through what we need to do this, starting with calling the API. Let's go through an example with OpenAI's ChatGPT. The first thing that we do is to import packages and looking over toward the right, we see that, yes, we did import the OpenAI package. We're making use of [indiscernible] MV to help us find and then utilize the information inside our and file that contains our open AI, API key. Let's choose GPT3.5 Turbo with a temperature of 0.9 from 0 to 1 for temperature, 1 being more random and kind of more creative and maybe beyond what you need and 0 being less so. Some other key parameters too might be Top-K and P or repetition penalty. Those depend on which model you're using. Next, we write the prompt. The prompt is write a [indiscernible] about large language models, and we use that prompt as part of a greater message that goes to the large language model in our framework. Finally, we call the API with the create method. And the response is in the response variable. And if we print out the contents of the response variable, we see endless worlds unfold, giant minds, best text arrays, wisdom from the void, and it gives me chills. What are some of the factors that you need to consider when you're selecting a large language model, One of the important ones is to take a look at benchmark scores on a relevant benchmark. On the left, you'll see a table with different task types like reasoning or reading comprehension question answering, math, coding and so on. So if your application is using a large language model for reasoning, you might, for example, use HellaSwag or for coding, you might use HumanEval and MBPP. You also will need a lot of data, especially if we are doing our pretraining. Later, we can fine-tune with far less. We also need to think about the kinds of evaluation or validation tests that will run on the model when it becomes part of our system. Latency is very important to and for instance, latency is the amount of time it takes for us to start getting our answer back out of the language models. And of course, for something conversational, we might need something that's shorter than 1/3, of a second. The cost of deployment or the use or the price per tokens is going to eventually end up costing you a certain amount. So you want it to be able to think through costs, typical responses or typical prompts that go into language model, how much context can the language model remember at a time that's called the context size. And then licensing terms, which normally is something that you might gloss over, but I recommend not doing that here because in some cases, these models are not for commercial use, but are for research use. There are models available for commercial use or you can train your own. We also have to think about the main specificity. And when we start to train models that are very domain specific, we find that we can get as good a performance from a smaller model that's trained on that narrower field than with a much larger model that's trained more generally and of course, going to cost more, not only in terms of the hardware resources to instantiate but also in the use of energy and the cost of deployment. Looking at the HellaSwag benchmark column, we can see how the Falcon 180B model will perform compared to the Falcon 40B model. They're both highlighted in yellow. And then in the HellaSwag benchmark highlighted in green, we see that for the 180B model, we have a performance of 89 points. And for the 40B, 40 billion parameter model, we have 85 points. So for 4.5x more size, you gain 3 points on the performance. That's something to consider. Are those 3 points worth nearly a 4 or 5 time size model which will impact the cost. That's an important consideration for you when you're designing the system. Okay. Let's shift gears a little bit and talk about prompt engineering. Having a good prompt is very important if you expect good results from a large language model. So let's talk about a few of the different methodologies. One of them is called Zero-shot. So in this case, we're asking the foundation model to perform a task with no in prompt example. So what is the capital of France. That question goes into the large language model and if all goes well, A, Paris pops out, tries to follow the format. We have a Q and then we have A. What's nice about Zero-shot prompting is that it gives you a lower token count. So you remember, we were talking about potential token costs or token memory in the context memory. So we have a lower token count. We want to be efficient, giving us more space for the context. But in some cases, a zero-shot prompt isn't enough for a model to give you what you want. With a few shot prompts on the other hand, we provide examples as some context for the foundation model that's relevant to the task. And in this case, we would give it some examples of capitals and answers. What is the capital of Spain, answer Madrid. Italy, answer Rome. And we noticed also that we are asking for the answer in a particular format that looks like a python dictionary. So when it comes time for us to ask what the capital of France is, we get a response back in the correct format. The answer is Paris and in the dictionary format. So the -- and the responses are then better aligned, especially in terms of formatting. And in general, they have higher accuracy, these few shot prompts give higher accuracy on complex questions. On some models, including large models, the newer ones as well. Few-shot prompting may not be necessary, Zero shot prompting may get you the result that you need. So I would suggest trying both and figuring out which one is giving you the answer more reliably that you need. We can also use a prompt to generate synthetic test data. Do you remember in the demo that you just saw, we had about 100 e-mails that came in that were complaining or complementary about a whole range of products and over a whole bunch of names. Here's the prompt that was used to generate that. We see that contains variables customer, product and feedback along with some instructions. For example, when you write e-mails you get right to the point and avoid pleasantries, like I hope the email finds you well or I hope you're having a great day, start with the subject line and such. And so we wanted to be concise. And if we take a look at what's produced. I'm just cutting the body here, we have a pretty good e-mail that's catching the things that we ask for. The product is the CG Series Grand P&O, the customer, that's [indiscernible] at the very bottom. And then feedback. Feedback, exceptional quality of sound, exceeded my expectations, and then a thank you. So it's important when you use a large language model for synthetic test data for SDG to check the model's license. In some cases, at a company, you may not be permitted to use the model and that includes to use it to create a synthetic data. There also might be prohibitions on using the output of one language model to train another language model. Taking a look at the prompt on the right, you'll notice that it is asking the large language model to think logically, step by step to help the customer service representative. And certainly, there are 5 steps there, including specific output format to answer in. And then at the very bottom, in Blue, we see that the body of the e-mail is going to be appended to these instructions. So the instructions plus the body of the e-mail comprise what we call our Triage prompt. This is called a chain of thought prompt in the sense that we're asking a large language model to reason through a process step by step, certainly, adding something like let's think step-by-step or let's think about this logically into your prompt has been shown to improve the result from some models of large language model, definitely worth trying. And you can supply those specific steps if there's a consistent process that you need to have your large language model run through, in our case, indeed, it was sorting through a big pile of e-mail. The prompt on the left, along with the e-mail body that is appended to it produces the result on the right. And we can see that the large language model is reasoning through each of the steps. And step one determines that it's about a specific product, that piano. Then what the issue is. In this case, it's praised for the quality. The tone of the e-mail is positive. And then we don't need an urgent response anyway because the customer isn't expressing a problem or something we need to correct quick, and then finally, Step 5 is outputting the customer name the product, the category, the summary, the tone and then the response urgency like we asked for. When it's your turn to design a prompt, you have to go through the process of deciding if it's going to be a Zero shot, a Few shot, a Chain of thought prompts or in something like this, which is a long Zero-shot prompt that's shown on the right. You'll notice that the more aligned or sophisticated model is and one of those like GPT3.5 or 4, then the fewer explicit cues it will typically need, and in this case, we can see that even though it's a fairly lengthy prompt, it's still a zero-shot prompt because we're giving no examples, although we're giving a lot of instructions. Now some of those prompt elements that you should consider going in your prompt would contain Role. So a dictated job along with the descriptive adjective or two, and we can see that's an efficient administrative assistant. Instructions. That's step by step, what you want done. You can use action verbs to make this better, determine, determine and classify, write a one sentence summary, organizer your answers. The context is the relevant background info into the prompt. For example, in this case, we're a musical instrument company, and we received e-mail, for example. Then the output format can be almost anything you can dream up, including something custom. In this case, we're asking for JSON object with the following keys: name, product, category, summary tone and urgency. Our prompt also needs to be exacting in what we want. We don't have to worry about being too brief. The more exacting we are, the more exacting the result will be. And of course, you can imagine the kinds of things you'll need to avoid vagueness, unfounded assumptions or topics that are just too broad. The outlet format of the large language model can be almost anything that you can imagine. You're prompt can specify the output to JSON, CSV, HTML, markdown and even code, that list is always growing. And so if you take a look over on the right, this was one of the instructions from the previous example to organize our answers into JSON object. We see that the JSON was formulated correctly, but indeed, it is just text. It's not an actual object that's output by a large language model's text. So we perform a simple conversion step for a structured format like that. Now even these high-end LLMs can sometimes result in an imperfect format, you can try tuning, but I also strongly urge that you add some air checking in your code to make sure the output is in the format that you expect. And if it's not that it's somehow corrected or perhaps re-requested in a slightly different way. Tools are available to help keep our LLM based application in its lane. Something to add boundaries to ensure that the large language model is not performing in kind of undesirable behavior. These are called guardrails and toxicity checks. So if we have an enterprise application, like the one that we were just working on for the Melodious music store, we might have a user who asks something in an e-mail. And if we take a look at the green path through the check boxes in the diagram, that user's request, makes it all the way through the guardrails of NeMo. This is NVIDIA's LLM, then the app tokens like link chain is what we use for some of the examples, we're going to be showing you in a bit. We'll make the call to the LLM, possibly, we have to access a third-party app to get a result back to, back in to link chain. And then finally, the response that comes back, maybe all terrible in red and can't actually go back to the user like it is or is mostly right and needs just some modifications in order to make it through, and that's shown in blue NeMo Guardrails use a co-link pattern established in a configuration file, but other systems would configure their boundaries and behaviors a little bit differently. The upshot is that when it comes to topical safety, we want these Guardrails to focus interactions with the specific domain like something in a music store, not a grocery store, not [indiscernible] opinion, not the weather, safety to prevent hallucinations if the LLM is producing undesirable results or something toxic, we would need to also perform a toxicity check on the input and output. And then finally, Security. We don't want necessarily our user to be able to access everything about a third-party App simply through an e-mail, which goes into a prompt. You might be wondering how you can evaluate how your large language models is doing in its application. And I'll say that the evaluation type that you're going to perform depends on the data. So for example, if we have structured data we could use a large language model to help bolster our currently existing structured data by generating more structured data. And so we could generate a data test set, including inputs and known outputs. We run the large language model on those inputs. Then we compare the large language model outputs to our test outputs and perform scoring. It's a little bit different with unstructured data generation though. That might be text like those e-mails that we had. We synthetically generated the e-mails in the first demo from all of the customers. Next-generation auto completion, summarization. So these are things that have many possible good answers. There's no one perfect exact answer. There are many good answers. In this case, you would simply take those unstructured inputs, run those through your large language model and then either a human or an AI will be applying a rubic to determine how well the evaluation went. Similarly, one thing that we do internally on my team is to supply users with AB testing, so that two different model candidates or two different output candidates make it to the user's eyes and then they decide which one is better then vote on it. And that kind of AB testing has allowed us to implement internal systems that perform really well on unstructured data. Next with API calls, prompting guardrails and evaluation all behind us, let's take a look at another function that large language models can help us with in our e-mail app demo. And that is researching based on the customer's e-mail and on up-to-the-minute company content. Let's go back to the e-mail application and see how we can help one of our customers. Noah has a [ indiscernible ] whose keys are sticking and that makes it difficult for him to play smoothly. We see, that's the summary, as inferred by the generative AI LLM and that we would be the customer service representative that would handle the call. Let's go ahead and pretendedly and research that issue. Okay. So we can see the summary of our search results as it just filed in. We are using that summary and using another large language model, albeit much smaller than GPT to determine what are called the embeddings of this. It's the Symantec Embeddings. What does this sentence boil down to semantically meeting-wise? Once we have that embeddings vector, we can then compare it to embeddings vectors for each of these assets shown on the right, blogpost, press releases and so on. So when we recall the original issue and then we find the passages that are relevant to our solution, those can go into the prompt of the large language models, the large language model's input once more and ask for a summary of the search result and sure enough it's here. It looks like it says that the [indiscernible] commonly does experience some issues with dried cork joints and sticking it's suggesting some powdered graphite and so on. What if we had an e-mail, but we -- it wasn't in our box. It's just something we wanted to quickly triage. Let's try it. So let's go to research, and we're going to paste in an e-mail that I asked ChatGPT to write for me based on having a problem with my MP stage keyboard, I'm about to have a performance, but my [ Wawa Joystick ] is not working right with [ Mini ], what can we do? Let's go ahead and search that. So instantly, what just happened there was that, my e-mail was boiled down to a summary. The summary was vectorized into it's embeddings. The embeddings were compared to the embeddings of all of our solutions for everything and the most similar ones were saved to be produced by the ChatGPT large language model to produce this summary and possible solutions. So how did that work again? Well, this involved a couple of steps. The first step was data preparation. We had our documents to be input and processed for later searching and retrieval. Don't forget that our documents are all synthetic. The company and those products don't really exist. But in any case, these are the input documents that are going to be used as sources of technical information or problem resolution. And we process those documents by breaking them down into smaller pieces. We'll be discussing this in some detail with code in a bit. We convert those bits, I should say, those smaller passages of text into vectors, then we store those vectors as embeddings in a vector store. Then when it comes time for somebody to use our application, their actions will result in a prompt, that will cause document retrieval so we find the most similar documents and then we retrieve them back into their text. And that gets tasked into the large language model with an API call asking for a technical summary of the problem of the solution to the problem and a description of the problem as well. And then that text comes out and what you saw scrolling past. Let's switch gears and look at the code behind how the demo app works in terms of accessing the large language model and preparing the data, and then let's also compare a few different frameworks. Developing these kinds of systems can become complex pretty quickly. And so we look for ways to simplify our development. One way is to take advantage of the modularity and flexibility of certain frameworks. One framework that we used in this demonstration that you saw earlier and that we use in our own development is link chain. Taking a look at the code. You can see that in Python, link chain allows us to open up an LLM object and specify which model it is. And then we can inject the parameters into standard chat messages using a definition like shown here. we're going to be writing a poem where in a given topic, in a given language and with a given large language model, one of the benefits to using this kind of framework is that it comes with components. So link chain components to help you not only build the chain and work with different chains of prompts but also gives you the ability to instantiate agents, which is something that's talked about later. How to use memory, vector DB storage and indices, how to load in documents. We haven't talked too much about that but we're about to. The example on the left is a simple stand-alone use case, something that you probably wouldn't need link change for actually. But when it comes to more complex graph like chains, then we want to think about using link chain's facilities. So on the right, you'll see that we do put in a swappable large language model. just like we did in the stand-alone case. But on the right, we developed a system prompt and a human prompt. The system prompt is what our large language model system is supposed to do. And then the human template normally contains the request putting those things together, the system prompt and human prompt makes a complete full prompt. Then the next step is to connect the chain, which you see at the bottom right,[ Cobox ]. We can also flow more parameters into the chain so that they can be used by potentially multiple prompts, for example, with this run chain prompt, we can use it again and again, changing the topic and the language. And of course, it's possible to take the output of this chain and feed it into other chains, thus, creating a complex flow. You've seen link chain in action in the demonstration that we gave earlier; and it is an excellent framework, but it's certainly not the only game in town despite it being a source, having a large community, lots of integrations and even enterprise tools to help you. Two more frameworks that are available are Haystack and [ group tape ] each with their own target purposes and advantages. Haystack, for example, developed by [ indiscernible ] is open source, and it has a lot of resources to help you with scaled search and retrieval also the evaluation of pipelines, so you can tell how your whole system is evaluating. Remember, we talked about evaluation earlier. It's also deployable as a rest API. [ Group tape ] can be deployed open source or in a managed way with commercial support. It also is optimized for scalability and cloud deployments, containing resources for encryption, access control and security. And in the example shown here, we've created a small toy example that is asking the LLM to write a 4-line poem about a topic in French. So all 3 of these examples are doing this. The first task is to create the LLM object. So the three different frameworks do offer different functions that are actually quite similar just want to select the right model for your need and then passing in its API. It's pretty easy then to create a function passing some arguments. And in our case, that's the topic of a poem. We used the LLM object that we had previously instantiated and then define the correct prompts and then run it to get back the output. In this case, it's the poem in French. In fact, in Haystack, we give a further example of where we take the output in French and then translate it using another large language model to [ Group tape ]. We also added the ability for context to be read into our prompt. So in this case, we are going to load a PDF that contain some information that we're going to use as we, as the large language model is composing the poem and then the pipeline is run. [indiscernible] has components to help you build a vector database as well. Vector database is going to be handy and, in fact, necessary for you to perform similarity search. Similarity search is looking through a set of documents or a base of text to try to find passages that are similar to the query. You can see that this is different than keyword search, which is looking for an exact character match. The process here is to input our documents and shown in a code here. We're using the link chain base loader function, which is going to allow us to point directly at Wikipedia, the poetry page, and take the text into a variable text loader. Next step is to process the text into chunks. And sometimes this is also referred to as splitting. There are a few different ways to split our long text from that web page or any other source into these chunks. But the one that we're showing here is the recursive character text player. It's going to achieve the 300 character chunk size by naturally looking for markers that may suggest a change in semantics or meaning for example, a couple of slash ends to signify a new paragraph or the start of a new section in the text. The chunks are then output into the variable that you see here and passed into vector version of vectors where we use large language embedding models specifically, this is distinct from a GPT model that is letting us chat or a query. Instead, this is producing a set of 768 embeddings. These are values that represent the data as determined by the embedding model. We call this a vector, and we can have a whole database of vectors based on parts of a larger text body, or individual documents that are sentences. One way to do this is to use the FAISS function, FAISS and this is the Facebook AI similarity search function, where we pass in our chunks then we pass in the reference to our embedding model and what comes out is our Vector database. Later, we'll need to retrieve. And so we have to make sure to set up a retriever object that's going to be looking at our vector database and we can pass on a text query. Can you help me on defining the big picture of the [ Tetrameter ] metric. And so what happens here is that, that query is then itself embedded and compared to every vector that was embedded from our larger document and the most similar ones rise to the top in similarity and become the sorted results of our search. So at this point, your gears may already be turning, taking up ways that you could take advantage of having a large vector database, full of information that's ready to use with your large language model. That's what we're going to talk about next. What is retrieval augmented generation. And why would we combine a large language models and encapsulated knowledge with your data, but one of the first reasons is that while large language models have been trained on large amounts of data, they may have been created without data or in topics that don't really fit your application. For example, getting a summary of some confidential enterprise info or private medical data come to mind, to retrain a model with newer data can be long and costly. The new data may become irrelevant over time anyway. Besides, if you expect to add private data into training, you would have to ensure that the model does not make it out into the public. Also adding in real-time new specific data on the fly, does not remove the burden of a limited large language models context window. So then the RAG concept is simply to ingest data from an existing database or web pages or a specific document like the latest text document on a topic that you created not too recently to be part of a database. That way, you can use to retrieve relevant information and the workflow of your application either providing some of the context or when it's necessary to acquire factual information to generate an answer. So then summing up a few benefits of RAG over a standard language model alone, are access to external knowledge you can pull relevant information from a fixed data set during the retrieval phase. Another is answer diversity by retrieving different passages, RAG can produce a variety of answers based on the external data interacting with it. RAG can also give you structured responses by pulling information from structured databases. And that can give you more concise pin-pointed answers sometimes and then traceability of responses. Answers derived from database entries can be traced back to the resources and that gives a level of transparency. Thinking about the workflow that we followed in the research app demo. That was the one where we clicked on the research button and then the LLM synthesized a technical support summary for us from technical information that was available in documentation blocks. So now let's consider a schematic of a RAG workflow. When the quarries arrive, a retrieval process goes to fetch the relevant data through the framework. And we know once again that the data could already be part of a database or could have been loaded at that moment. Then the retrieved info is combined in the prompt and since the large language model, which, in turn, produces a response that considers info contained in both. Note that this RAG workflow also incorporates guardrails for both the prompt and the response. One of the first steps is to vectorize and embed our input. And putting it simply, that means that embedding is a method to convert an input. And in this case, that's text, but it can be images, videos, whatever, into numbers to comprise a numerical vector. Important here is that the embedding has performed, taking into account the content rather the context and the meaning, which means that the same word in 2 different sentences could be encoded differently based on their semantic use. And thinking about the example here, the gray box. We have a query who will lead the construction team one chunk says the construction team found lead in the paint and the other chunk says that Aussie has been picked to lead the group, so we can see that if we want to find a more similar chunk it would have to do with leadership and teams or groups and less to do with the chemical substance, so Chunk 2 would be more akin to the original query. All right. Going back then, once finished, when you take these vectors and we store them in a vector database, with the idea that the closer 2 vectors are, the closer is their similarity and also their meaning. But to be fair, we have to note that similarity doesn't imply relevance. And that's why, in some cases, keyword search approaches can offer better results than vector [ DB 1s ]. You got to know your application. In any case, it's becoming common to use embeddings and similarity search with vector databases and semantric retrieval for different use cases, like classification or topic discovery. And if we take a look at the clusters on the left, we can use this as an example. On my team, we were interested in better understanding feedback we got about GTC, which is free form and unstructured. So looking at the image, you'll notice that we have these clusters of points. These arose from plotting the embedding vectors and then projecting down into 2 dimensions where here, each point represents a feedback messages text. These clusters turn out to be thematic, sharing the common topic like scientific computing or [ CJI ], and so this is a good way to visualize semantic distance. Let's now think step by step to bringing new data into our application for the LLM to process. One good question is, what can the LLM ingest how much? As a reminder, today's LLMs can really only ingest a limited number of tokens. And today, that's in the 10 to low hundreds of thousands. So to ensure that an LLM can ingest our data, we need our data to be split into chunks that fit the context windows size limit. So turning to the code. You see that pie PDF loader grabs the text unencrypted PDF. But I should point out that other loaders could have handled things like JSON or CSVs or so many other types of data. Either way, elicit document objects has returned. And for this example, let's imagine, it has a length of 20 pages. You can note that -- could have different vectors representing these various segments. So then afterwards, when we retrieve the information, the retrieval function, make it back pieces of the files text, and that's more useful in simply being returned 20 pages, the entirety of the PDF file's text. Then after loading, we need to break up the text into smaller pieces to capture semantics of the document and to improve the potential text passage relevance during our search. And that will also allow for the limited context window that we have. In the initial splitter shown on the left with 500 characters as the chunk size. That's a large chunk size that could help us find say, an idea more holistically inside one of our files. But on the right, we have a chunk size of 30 characters. And that smaller chunk size allows for more fine grain searches. We also know that there's a chunk of overlap to make sure that we don't miss any semantic concepts as we slide our window along our text. In the second stage, the chunking process is somewhat deceivingly not such a straightforward task to highlight the complexity, let's consider two different kinds of data, a French novel and an English tech document. As you may know, some languages are more [ robust ] than others. So when we're dealing with content of different languages, you may have to consider the alphabets or the recurrence of words or sounds in that language. And besides some languages are more direct than others being more efficient, given a meanings per character. And then finally, a tech document is less for [indiscernible] and more straightforward than a long poetic description and French literature. All this goes to say that chunking by splitting by character count alone may not be enough to extract a meaningful piece of text. The same is true if you have a technical document, let's say, a code example. Going to the next line, like using the [ /Ninfo ] may be sufficient to indicate the switch of a topic. An approximate conclusion here would be then that depending on the kind of data, tax PDF markdown, et cetera, the kind of fields, the documents related to like tech specs or historical summary or business report language used and so on, you will have to consider different chunk sizes even though it would be ideal to find some elusive standard one. One piece of device do experiment and do consider parameters. We just mentioned to determine the right chunking size. Try it out. And just to highlight it once more, given the same kind of input but by modifying the chunk size parameters, we do end up with either a more or less meaningful vector. Understanding how to preprocess your data to be subsequently embedded is of utmost concern. So let's take a look at some other means to make the injestion of data more efficient. One is to use RAG sub query chaining and cascading. Many inputs have multiple topics and using a single embedding for such an input really dilutes the ability to retrieve the topic-relevant chunks. We can use an LLM to generate retrievable queries, identify relevant info and return chunks and then combine those into the final prompts, we may be able to further optimize by parallelizing part of the RAG process. Before we go into the block diagram on the left, after a complex query is decomposed into sub queries, both so quarries are processed simultaneously and subsequently combined to produce the output. We do love to paralyze processing, reordering or reranking. Here's an example you can retrieve more results than you ultimately desire via an efficient search like embedding distances, then use something like a pairwise evaluation to select from that list. Or completely different, apply the maximum marginal relevance MMR approach, score on both the quarries relevance and the result's diversity should reduce redundancy. And then finally, another technique is to embed data with their associated Metadata. That's kind of a context awareness. Let's illustrate that, looking at the code example on the top right. When looking through the chunks of text to embed, adding information from headers like in a markdown document will emphasize the meaning and the context of the sentence. So that function markdown hetero text split gives us the ability to combine the text file with header information when looking at the sentence. In this case customizing a model using parameter efficient customization. The admitted data will help the language -- the large language model to understand the context of the sentence and possibly it's relevance. And for instance, adding the document date so that the large language model can determine which chunk to prioritize if the two conflict. Earlier, we described a simple but realistic case with our demo about incoming e-mail at our fictitious company. We further expressed several ideas on how it could be extended or improve via various techniques. And though simple, we wanted to highlight a real enterprise grade level of UI design and back-end API end points, leveraging some of our internal tools as well as others from the open source community. So finally, the code shown is from the research functionality of our demo and it's surprisingly brief, considering all that it is doing. Of course, that's largely thanks to the framework and the API. The workflow is to retrieve the documents, then filter by the top and then feed those into the large language model to summarize. The LLM chain here contains the stuff documents chain which will concatenate together documents to feed into the LLM as the context within the overall prompt, then a large language model takes in all of its information and forms a natural language response. Before we wrap up, I just wanted to invite you to come explore the NVIDIA AI foundation models, including Nemotron-3, Code Llama, Niva, Stable Defusion XL, Llama-2 and [ clip ]. Here, I use NEMO LLM service in the left to generate a story about an Egyptian goddess who's a cat. Then on the right, I asked Niva the NeMo vision and Language Assistant to analyze the synthetic image that I made about that cat. And sure enough, it understands that this cat is sitting on a couch in what appears to be an ancient Egyptian setting, very well done. Let's review the information that we covered today. We first discussed the core concepts of large language model architecture and foundation models before moving on to the factors for selecting between and evaluating large language model APIs. We then moved on to prompt engineering basics and covered a few workflow frameworks for LLMs before taking a look at retrieval augmented generation or RAG. We also presented 2 demonstrations showing how you could use these principles for an e-mail triage application. Before we go, I'd like to thank my colleagues, [indiscernible], Chris Milroy and Chris [ bang ] for their many contributions to this session. Thank you. Thanks again for joining today. I'd like to take this opportunity to welcome you to continue to add questions, and we'll go through some of your questions live now. You can continue to submit them even as we continue to answer them. You may submit them using the same way that you have been.

David Taubenheim

executive
#2

All right. So one question that popped up is regarding the performance and turnaround time for the vector and embedding search. What can we do for scale? So this embeddings model is going to be a smaller model than the large language model that we're using for triage and analysis or whatever your application may be. That embedded model can run very quickly usually in terms of milliseconds to get an answer. So in -- just wondering if I'm having some audio issues now, somebody is reporting that.

Operator

operator
#3

We are fine. Keep going. Thank you.

David Taubenheim

executive
#4

Thank you. Yes. Good. Awesome. So when we use the smaller models to determine the embeddings, it happens very quickly, often a small fraction of what it takes to get the turnaround from the large language model back. So in that case, the latency is very low. Turnaround times, again, can be in the single milliseconds or even quicker for some activities. Let's see, what is quantization? It's another question that popped up. We can, you can imagine that if we have a long data word. So we have maybe 32 bits that define a word of our data, it would take more resources to compute a stream of 32-bit words than it would with say, 16, 8 or 4 bit words. So if we have quantization, we are taking some of the paths in the neural network down from a large number of bits like 32, down to something like 8 or 4. And that means that we could pack in more computations in the GPU fabric than we can if we are computing just a whole bunch of 32bit words. The downside of this is that sometimes it can affect the accuracy of the model, but if it's done well and perhaps even train, you can train with these quantized parameters and you're aware of them while you're training then what can happen is that your performance delta might actually shrink to variably 0 in which you get back is a faster model that has less latency, higher throughput and that also might be smaller in the GPU so that you get higher GPU utilization and an enterprise-type deployment where you need to have many of these running at the same time. All right. So on to a question about link chain. The question reads, I've heard link chain is great for building a proof of concept, but is terrible for use in production. What are our thoughts on that? And how does it compare to Haystack or Llama index? So it's hard for me not to read that as Yama index like in Spanish and Llama index. So we did a bit of a comparison during the presentation, and it's hard to do a comparison because pipelines and chains can do the same thing. You can make them do that. But in some degree, link chain and Llama index, Haystack, [indiscernible] they have different purposes with different needs. What I will say is that for link chain, we have link serve, for example, I'm going to put that here and send it out to you. Link serve helps you build your LLM application and deploy runnable as chains, has a rest API. And it's something to take a look at, just to make sure that before link chain is dismissed by something that could be used in production, let's see the resources that are available for it. I'm going to send this out to everybody now the link for that. One question that also popped up is our too long prompts, really not an issue. Don't LLMs tend to focus more in the beginning and the end of the input? So perhaps I should have been a bit clearer during the actual talk. For example, if you have a very long prompt, maybe even up to 128k tokens, then sure. If you have something that's important to you at the beginning of that context, a bunch of random stuff that's supposed to be taken into consideration and then maybe something else towards the end of it. How will, how the large language model know where it's supposed to pick from those. So in that case, I guess, long prompts can be an issue, but what I was getting at in the statement that I made during the presentation, was if you are creating kind of a human generated prompt to perform an action, here on the side of more specificity even if it makes the prompts a little bit longer. I certainly didn't mean to suggest that if you make the prompts completely maxed out to fill the entire capability from some of the newer language models that, that would be fine. I was speaking more about when you're creating a prompt by yourself to perform a task, be more specific and some more words are okay. Let me find another one here. So are RAG and fine-tuning office often used together to boost LLM domain-specific performance? So you can. It's hard to say how often that happens. I think it's maybe more instructive to think about what each of these two processes does and how they differ. With fine-tuning, you're taking additional samples or examples. And then you're changing the model parameters or weights to accommodate that new or different updated or corrected information. And so Yes, in that sense, that would boost a large language model's domain specific performance if the fine-tuning data was from that domain-specific place. But the way that RAG does this is different in the sense that if you have a store of the kind of data that you want to use to make something domain specific, you're analyzing it, using embedding vectors. And so what happens is that the results from a quick embedded search, embedded vector search produces some statements that get stuffed into the prompt and then included in the prompt that the LLM is going to process. And the fact that, that additional context was made available by the RAG algorithm will if it's done right, boost the LLM domain-specific performance. So you can see these are two things that are separate. Now if you do them together, you might even get an even better result. But I will mention that as new data comes in, continuously fine-tuning can really start to eat up your budget. So it might be better to consider the performance as very, very good if you were using RAG. And it might be perfect if you're using fine tuning, but then there's the cost in that, and you probably would not fine-tune on daily basis either. All right. Let's see. Would it be possible to enhance the accuracy simply by moving a model from the CNN so kind of solutional neural network to or RNN or RNN to LLM.? Without looking at the benchmarks of which CNN or RNN or even LLMs to use is, it's hard to say. What I will say, though, is that depending on your application, it might be more efficient to use a smaller model like an RNN or CNN to perform with the same performance for that particular neuro task that you'd expect from the LLM. I have not yet seen the kind of performance datas that we normally use for benchmarking CNNs or RNNs for their tasks, whether they're image processing, text processing or audio processing for LLMs. We do know that some LLMs now can handle image input, but I don't know how accurate that is compared to some of these CNNs or RNNs, I would say tread carefully here, but just understand that a narrow CNN or RNN may provide a much more efficient answer than the LLM. All right. So we're also getting a question about whether RAG can be multimodal. So not just text or even text in the PDF but also images, video or sound? Sure. as long as the output of the result is something that an LLM can process, then definitely, think of the case, for example, where maybe you have a bunch of images that have been vectorized. You want to find the one with the -- that is most similar so you would run the RAG process, getting the embedding vectors in co-sign distance or whatever and then finding the 2 or 4 images that are the most similar. If you were then to pass them into an LLMs that can handle that image input, then perhaps it can do additional processing on those for you. So you would use the embedding method with RAG to quickly narrow down the images to just a handful that might be the 1 you're looking for. And then perhaps the LLM is the 1 that can look for the things that you're searching for in the image to make sure that it is the best 1 that you wanted out of the search. Let's see, for coding applications, is it worth it to pay more to have access to longer context versions of GPT 4? This is, again, one of those examples that it kind of depends on your application, but for coding applications, we have to remember that the longer the context, so maybe the more code you can stuff in the context window of an also would suffer from the same phenomenon that happens when you pass a lot of human readable text into it. If you take 128,000k tokens as an input that's something like 300 pages of a regular printed book you're going to be introducing a lot more opportunity for an LLM to get wrong, the things that it should be pulling that are important to the query because there's more available. On the other hand, if you were able to use a smaller context window for your application, you're more likely to get a result back that is going to be nearer to what you're looking for. So is it worth it to pay more to have access to a longer context version? The answer is maybe. And depending on how much code you're passing in, I think that's going to help you make your choice. I think maybe a rule of thumb would be to use the smallest context window you can get away with. Can I train and LLMs model myself? Is a question that popped in as well. So yes, you could do it but it would require a lot of compute resources and a lot of knowledge to do something like this for yourself. And we are talking about one of these large language models, of course, something that has billions or even a trillion parameters is normally something that's done by a team of engineers or scientists or professionals that are familiar with in the processes and the particular data sets that were used to do so. Something that might be better to consider doing is to see if you can fine-tune LLM yourself. And the answer there is, depending on the API and depending on the LLMs framework itself. You might be able to fine-tune it. And of course, that takes just a few examples. Another consideration is that you may not need to even tune the LLM at all. You might be able to do prompt tuning or prompt engineering or efficient tuning in order to add a very small model, for example, to the beginning of your large language model that is helping adapt, it's called an adapter, helping adapt some questions or some input to the large language model and changing the behavior of the large language model by modifying the prompt is coming in, in order to get a more accurate or better result. So when you say train an LLM model, there are different things to look at, and I think that it's important to consider those before just running off and wanting to retrain a large language model at its core. Let me go through these questions and look for a few more. Okay. So one of the questions that's coming in is, is Link Chain only necessary for RAG or are there other applications too? So one chain that you can make is RAG. But there are other chains that you can make too to become a chatbot or to handle another application as well. So RAG is just one possibility. And indeed, there are multiple implementations of RAG. And so it could, RAG could be customized to a particular need. Now one way to think about customization is if you need to perform multiple subquiries, your chain for your application can do them linearly or in parallel depending on your needs or re-ranking, you might be able to get different RAG results if you re-rank. So RAG could be written as part of a chain. So I hope that I didn't give the impression that RAG is the only reason you do is link chain. It is one of many and really infinite applications that you could perform with that or any other framework as well. All right. Let me get a couple of other questions here. The large language models responses are like receiving text on an old 9,600 bot modem, and there's a start-up delay, any comments? I think that may be referring to the example with research in example application debt that I showed. In that application, we have a connection setup with the LLM between LLM and the web page, so that as soon as the text is rendered or produced by the LLM model, it flows as it's coming into the web page. And if you use some of the other available LLMs with their own ChatUIs, you probably noticed that as well. ChatGPT, for example, you type a question or a prompt and you can see the text fold in, whether that's flowing in on your screen or on your device. It runs right across the front. So, the delay is because it takes a while for the LLM to start producing the original, the very first parts of the response. And the reason why that's particularly of concern to a lot of people who are using these large language models in an ever more complex and sophisticated apps is that if we move to something that needs to be conversational, that pause needs to be less than a half a second and more hopefully, less than about 1/3 of a second for the conversation to see human and natural. So that delay is important. And what are some ways that we can shorten the delay. We can make our LLMs faster by running on faster infrastructure, and we can also optimize our large language models for how they're implemented by performing, say, quantization or reducing the number of parameters that will reduce both the latency and then having increase the speed at which the text flows in. So that's the reason why that's happening now. And you'll notice over the next year or 2 that not only will that initial delay the latency, get shorter but also you'll notice that the text is flowing faster as the optimizations continue. And another question here as well. Do these snippets reflect the new OpenAI API? So something happened there This presentation that you just witnessed was already produced in flight for the venue that we're at right now, when the standard was updated. And so these snippets don't yet reflect the new OpenAI API, but they still should work, and you're welcome to note the differences between them and the new version of the API in the future, hopefully, we won't have a coincidence line up like that quite again, where there wasn't time to make any changes to the presentation. Okay. Well, once again, thanks for joining us at LLM Developer Day. An on-demand version of this webcast will be available approximately 1 hour after its event and can be accessed using the same link. Thank you again for coming. And I'd like to invite you to watch the rest of the session is today. Thank you.

For developers and AI pipelines

Programmatic access to NVIDIA Corporation earnings transcripts and 32,000+ others is available through the EarningsCalls.dev REST API. Plans from $24.99/month — full transcripts, speaker segments, full-text search, and the recently-added /api/v1/transcripts/recent polling endpoint for ETL pipelines.