NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary
August 10, 2023
Earnings Call Speaker Segments
Unknown Executive
Thank you for joining this webinar on the Compute Graph Framework. I hope you have had a chance to watch the previous webinar on the System Task Manager, or STM. If not, it is available for replay on demand, and it provides essential foundational knowledge of STM and DriveWorks. The latest DriveWorks software release includes the Compute Graph Framework, or CGF. Developers can use CGF to express their autonomous driving application as nodes of a directed acyclic graph. This structure provides an intuitive approach to organizing complex AV applications, enabling developers to specify unique nodes once and reuse them as required. It combines the versatility and comprehensiveness of such graph-based frameworks with the deterministic execution and scheduling required for AV development. This webinar will demonstrate how developers can easily express components of the AV stack as blocks and thus develop self-driving applications more efficiently. This webinar is the second of a 2-part series focused on the System Task Manager, STM, and the Compute Graph Framework, CGF. Both components are part of the DRIVE SDK, which is composed of the DriveWorks middleware and the DRIVE operating system. DriveWorks provides a comprehensive library of modules, developer tools and reference applications that take advantage of the computing power of DRIVE AGX, aiming to achieve the maximum throughput limits of the computer and enabling real-time self-driving applications. Below DriveWorks, we have the DRIVE operating system, or DRIVE OS: the foundational software stack consisting of the NVIDIA hypervisor, an embedded real-time operating system, libraries like CUDA and TensorRT, and other modules that provide access to the DRIVE hardware engines. In this session, we will take a deeper dive into the Compute Graph Framework layer of [ Paybox ]. The Compute Graph Framework, or CGF, works on top of the System Task Manager, or STM.
It enables developers to express applications as computational graphs comprised of several nodes that are then passed to STM for deterministic and optimal scheduling. For more details on the System Task Manager, or STM, please check out the previous webinar in the series. Let me dive right in. Here's a quick look at the agenda. We'll begin with an overview of the development workflow using CGF [Audio Gap] flow and run time. We'll then explore the fundamental concepts involved in CGF, nodes and graphlets, and the various tools available to us to manipulate nodes and graphlets. Next, we'll look at a moderately complex pipeline built in CGF, a review of various [Audio Gap] Finally, we'll wrap up with an overview of how all this fits into the overall safety concept. [Audio Gap] development workflow using CGF. When initially designing an application, one need not even implement any code until one wants to actually run the application. A fundamental design principle of CGF is the use of the dataflow programming model. The dataflow programming model involves modeling an application as a series of processing steps, with the emphasis on the movement of data through the various processing [Audio Gap] this workflow, a few tools are available, the primary tool of interest being the DW Graph UI. DW Graph UI is a graphical IDE that enables developers to define nodes, combine them into graphlets and describe even the most complex application as a pipeline within the Compute Graph Framework. [indiscernible] architect has described an application as a pipeline, that is, a set of nodes interconnected in a graph, what then remains is to implement the code corresponding to each logical block, that is, each node in the graph. A node here is the smallest logical block of processing being modeled. A graphlet is a bunch of such interconnected nodes.
An application pipeline or graph consists of anywhere from a couple of such nodes for a simple demo to several hundred graphlets, themselves composed of an equally large number of nodes within them. Each complete application pipeline is described by a single top-level application JSON file that contains references to the various graphlets and nodes in the whole pipeline. We also have tool-assisted generation of the skeleton code associated with each node. This simplifies the creation of a fully functional code base that represents the logical implementation of the complete application pipeline. The logical operations to be carried out in each node can then be implemented one by one within the auto-generated skeleton or complete [Audio Gap] Eventually, it is compiled to be executed on the target. Also note that for each node, the expected worst-case execution time, WCET, is specified. This information is passed to STM, which enables it to generate an optimal run-time schedule for the application. Let's now look at what the flow of execution looks like at run time. Remember from the previous slide the top-level application JSON file, the one that describes the entire application pipeline and contains references to the graphlets, nodes and other design-time artifacts. That application JSON file is passed to a tool called the launcher. The launcher parses the application [indiscernible] references to the various nodes involved [Audio Gap] and creates the necessary run-time processes in which to host and execute the logic within the individual nodes. The loader is an otherwise empty container process. The execution logic of various nodes can be loaded within individual instances of the loader. Additionally, the STM Master, the orchestrator within the System Task Manager, or STM, is also launched.
It is responsible for [indiscernible] tight scheduling and execution of the application pipeline. A compiled schedule, based on the worst-case execution times, the WCETs, of the individual node passes and the dependencies between the individual nodes extracted from the application graph, is passed to the STM Master. It then uses it to ensure that the appropriate execution logic within individual nodes is executed exactly as per the deterministic schedule. This concludes the overview of the CGF workflow at run time. [Audio Gap] of the basic concept of CGF node [Audio Gap] passes of node. It is a collection of a bunch of various bits of information combined into one entity. [Audio Gap] block of the pipeline, the smallest unit of our execution pipeline. A node can accept inputs and can provide outputs through the node's ports. A node has passes within it, which are the fundamental units of execution logic. The logic that the node is going to execute is referred to as the node's passes. Once a node is defined, multiple instances of the node can be used in a pipeline. Each time a node is used, we may wish to customize it a bit. Such customizations can be specified using the parameters of a node. This concludes the quick overview of the CGF node [Audio Gap] passes and parameters. [Audio Gap] development workflow that we saw earlier in the session, with the same laid out as an iterative step-by-step procedure. [indiscernible] involves creating individual nodes, composing graphlets using multiple nodes and repeating this until we have defined the entire application pipeline as a graph. Then we typically generate the skeleton code for each node, add the actual logic in the respective passes of the various nodes and compile it. We then proceed to specify the individual worst-case execution times for each of the node's passes and obtain a deterministic schedule for the entire application pipeline using STM. Next, we execute the compiled executables as per the schedule on the target.
[indiscernible] unit and repeat this entire cycle. While this is the overall development cycle, there is another shorter workflow, one that does not even involve writing any code. Notice that when we create a bunch of nodes and link a few instances of them together to form a graphlet or a pipeline, we are doing all this using the DW Graph UI without actually writing any code. Next, we can simply jump to specifying the application, including the nodes, the processes they operate in and their individual worst-case execution times. All of this still does not involve writing any code. And without having written any code just yet, with just the specifications in JSON and YAML files, the CGF framework allows us to compile a schedule that satisfies all the constraints: constraints in terms of data dependencies between two nodes, constraints on which node can run on which execution engine, for example, a particular CPU core, and constraints in the form of the worst-case execution time for each pass. With just all this, we are now able to visualize our hypothetical execution schedule. This ideal execution of the currently defined pipeline on a timeline is obtained without having to actually run it. Once we are happy with this, we can then go ahead and add the logic within each node: implement the code for each of the node's passes, profile it and tune it, and keep repeating this until the execution times fit within the worst-case execution times as per the initial design. Note that during this workflow, if at any point we discover that there are significant challenges with the initial pipeline design, we are free to tweak it. This can be anything from simply tweaking the worst-case execution time thresholds to modifying the graph by reorganizing the dependencies between nodes to further optimize the execution. Let us now explore how nodes and graphlets are represented in CGF. A node is primarily described by the following information.
A set of ports that are the endpoints for either incoming or outgoing data, and a set of passes that are executed to perform the desired processing. On the left, please see an example of a node in the DW Graph UI, a feature detector node. We can see that the feature detector node has 3 input ports, pyramid, predicted features and NCC scores, and a single output port, [ which are GPU ]. Also, the feature detector node has 3 passes: setup, detect new features and teardown. Currently, it is a convention to have setup and teardown passes for each node, plus one or more additional passes, the detect new features pass in this case, for the actual logic to be associated with the node. In the context of this node, the incoming data that this node operates on is depicted by the input ports. The logical operations this node performs are depicted by the passes, and the outgoing data or results generated by this node are depicted by the output port. Next, on the right, we see an example of a graphlet in the DW Graph UI. This graphlet consists of 3 nodes: a pyramid node, a feature tracker node and a feature detector node. The very same node that we initially saw on the left is now also seen as part of this graphlet on the right. This graphlet on the right has 1 input port and 3 output ports, and it has 3 nodes interconnected within it. The incoming data to this graphlet is fed into the pyramid node, whose output is then fed into the feature detector and feature tracker nodes. The detector node also needs 3 additional intermediate inputs from the feature tracker node. The collection of various results from each of these nodes is finally combined and fed to the output ports of this graphlet. So whenever this graphlet is used as part of a larger graph, it takes 1 input and provides 3 outputs. At any time, for a quick review of these concepts, please refer to the respective sections on the node and the graphlet in the NVIDIA DriveWorks documentation.
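The node and graphlet concepts described above can be modeled in a few lines. The following is only a minimal sketch in Python, not the DriveWorks API: the class names, the port spellings, the `DETECTED_FEATURES` output port and the one-node `Preprocessing` graphlet are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Smallest unit of processing: named ports plus an ordered list of passes."""
    name: str
    input_ports: tuple[str, ...]
    output_ports: tuple[str, ...]
    passes: tuple[str, ...]


@dataclass
class Graphlet:
    """Node instances plus the connections between their ports. No passes."""
    name: str
    input_ports: tuple[str, ...]
    output_ports: tuple[str, ...]
    nodes: dict[str, Node] = field(default_factory=dict)
    connections: list[tuple[str, str]] = field(default_factory=list)

    def add(self, instance_name: str, node: Node) -> None:
        self.nodes[instance_name] = node

    def connect(self, src: str, dst: str) -> None:
        """Endpoints are 'instance.PORT'; a bare name is a port of the graphlet."""
        self._check(src, is_source=True)
        self._check(dst, is_source=False)
        self.connections.append((src, dst))

    def _check(self, endpoint: str, is_source: bool) -> None:
        if "." not in endpoint:
            # Inside the graphlet, its inputs act as sources and its outputs as sinks.
            port = endpoint
            ports = self.input_ports if is_source else self.output_ports
        else:
            instance, port = endpoint.split(".", 1)
            node = self.nodes[instance]
            ports = node.output_ports if is_source else node.input_ports
        if port not in ports:
            raise ValueError(f"unknown port: {endpoint}")


# Feature detector node from the slide; the output port name is hypothetical.
detector = Node(
    name="FeatureDetector",
    input_ports=("PYRAMID", "PREDICTED_FEATURES", "NCC_SCORES"),
    output_ports=("DETECTED_FEATURES",),
    passes=("SETUP", "DETECT_NEW_FEATURES", "TEARDOWN"),
)

graphlet = Graphlet("Preprocessing", input_ports=("PYRAMID",), output_ports=("FEATURES",))
graphlet.add("detector", detector)
graphlet.connect("PYRAMID", "detector.PYRAMID")
graphlet.connect("detector.DETECTED_FEATURES", "FEATURES")
```

Keeping the passes on the node and only the connections on the graphlet mirrors the split the speaker describes: the node carries the executable logic, and the graphlet only describes the wiring.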
Unknown Executive
Let me just open one of the CGF demos, which is available in the PDK. So tomorrow, later, any time you want to check this out, you will be able to see this yourself. You can navigate to the DriveWorks installation directory and look at this path, CGF demo. You can just search for this file and open it in the Graph UI editor. So what we were looking at, let's zoom in. As you can see, it is a somewhat complicated graph. We'll go into the meaning of this later. But what we were seeing in the screenshots -- so let me open a camera pipeline. And here, this is what we are seeing as input, output. We go one more level deeper, and this is what we had seen in the slide. So if you notice, there is a lot of complex information here. Let me just give one quick overview. If you notice here, there are multiple boxes, so many boxes here, connected. This starts from the input side and goes to the output, and this is basically one whole pipeline as part of the CGF demo. Now, each box that you see on screen is not necessarily a node. It can be a node or it can be a graphlet. In fact, if you look at the screen, everything that you see here is a graphlet, which basically means this is not the end. So if you want to look inside this graphlet, you can double-click it and it opens another tab. And in that tab, it shows us the camera pipeline. So here, we looked at cameras. There are multiple camera pipelines, 3, 2, 1, 0, right? You double-click any of these and it will open in a new tab. Did you see? So what we have done actually is we have defined a graphlet called camera pipeline. And in our processing, there are multiple such recurring parts, which basically means that, as part of this logic, we'll have some camera sensor logic and then camera preprocessing logic. But we want to do this for maybe 4 different sets of cameras. It is the same logic, but with different data, different inputs. So instead of creating those same nodes 4 times, which becomes very complicated, right?
Just create this graphlet one time. And every time you want to create a new set of these 4 or 5 nodes, you can just drag and drop that, and you'll get one more instance. So for example, there is the camera pipeline graphlet here. Just to show it, I am doing this, so we get one more camera pipeline here, and we can make this camera pipeline #5. Then we can take this input here, create one more, take this output here, do all that, okay? So that's something which we can do. And it is not just one level deep. Again, if you look at how the camera pipeline is created, the camera pipeline also is not made out of 2 nodes. These 2 components that we see on screen are again graphlets, indicated by this icon with multiple nodes in it. So if we were to look at camera preprocessing -- now look at camera preprocessing, right, that contains the actual final-level nodes. So this is what we had seen in the slide: pyramid node, feature tracker node, feature detector node. And if you double-click this, you can actually go into that final lowest level, which is the node. So the feature detector node that we had seen is this one. So though we went from the top, from the outermost layer down to the lowermost layer, the node, in actual development this is how it looks. Somebody will think, okay, I want to have some logic which does feature detection. So I can have a node like this. How will I use this feature detection? I will also combine it with something called feature tracking, pyramid nodes and all that, and I create this set, the common recurring logic. And this common recurring logic, I want to use multiple times. I want to use it four times, so I can then drag and drop that here. Similarly, you start from the lowest, simplest node and keep building the logic until you arrive at the final top-level view that we see here.
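The reuse described in this demo, one graphlet definition instantiated several times and nested several levels deep, can be sketched as a small expansion routine. This is an illustration in Python under stated assumptions, not the way DriveWorks resolves graphlets: the type names, the instance names and the single `CameraNode` inside the sensor graphlet are hypothetical.

```python
# Each graphlet type lists its children as (instance name, type name).
# Any type that has no entry here is treated as a leaf node.
DEFINITIONS = {
    "CameraPipeline": [
        ("sensor", "CameraSensor"),
        ("preprocessing", "CameraPreprocessing"),
    ],
    "CameraSensor": [("camera", "CameraNode")],  # contents assumed
    "CameraPreprocessing": [
        ("pyramid", "PyramidNode"),
        ("tracker", "FeatureTrackerNode"),
        ("detector", "FeatureDetectorNode"),
    ],
}


def flatten(instance: str, type_name: str, definitions: dict) -> list[str]:
    """Expand one instance into the fully qualified names of its leaf nodes."""
    children = definitions.get(type_name)
    if children is None:
        return [instance]  # a node: nothing further to expand
    leaves = []
    for child_name, child_type in children:
        leaves.extend(flatten(f"{instance}.{child_name}", child_type, definitions))
    return leaves


# The camera pipeline graphlet is defined once and instantiated four times.
top_level = [(f"cameraPipeline{i}", "CameraPipeline") for i in range(4)]
leaves = [
    leaf
    for instance, type_name in top_level
    for leaf in flatten(instance, type_name, DEFINITIONS)
]
```

Adding a fifth camera pipeline, as done in the demo by drag and drop, amounts to one more entry in `top_level`; none of the definitions change.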
Unknown Executive
Let's now look at how we can create custom nodes and graphlets in the DW Graph UI IDE.
Unknown Executive
So what we are doing here is we have created something like an empty node. Let's go ahead and start by trying to define a new node. When we try to define a new node, an empty node, it has no input ports, no output ports, no logic passes, nothing. But as part of defining this, we can see that whatever we create here gets saved as a very open-format JSON file that anyone can write or modify by hand, but we have the GUI to make it very, very easy and maintainable. So let me start by creating a node called hello node. And okay, to give an example, this is what we are aiming for. I have something called hellonode.node.JSON, which takes one input and one output. The input is an integer, current count; the output is next count. And it has three passes: setup, processing and teardown. You can imagine this is something that we have written by hand. Now if we want to create the same in the DW Graph UI, let us see how it looks and let us see how the output JSON would look. We start with an empty node, call it hello node, and in the input ports we go ahead and define -- what is that? Yes, we want to call it current count. And we want the data type to be, let's say, int. Similarly, we want to define one output port called next count, and we want that to be an integer as well. And then we want to create -- oops, sorry, I forgot to add them. Let me just add current count. As you can see, the input port current count is now showing. Let me quickly create an output called next count and give it a data type of int, right. Now you can see next count. And then let us quickly create three passes called setup, process and teardown. So the interest here is not to actually write any logic, but just to see how the GUI editor lets us create the JSON files simply. So we created this. This is what we wanted. We have defined it, and let's go ahead and save this.
I'm just going to overwrite the manually written file that we just had, hellonode.node.JSON. By convention, any node that we define will be saved with a .node.JSON extension, and any graphlet that we create in the GUI editor will be saved with .graphlet.JSON. This is a node. We give it the name hellonode.JSON and save it. And now if you see this, right, the editor is saying, yes, the file got changed, but it is just the exact same output, which has now been created using the GUI editor. You saw this, right? There is an input port, there is an output port, and setup, process, teardown. So the DW Graph UI editor and a lot of other tools, like nodestub, which generates code from such templates, all understand this JSON file format, this JSON schema. It is open, and the schema is actually available as part of the PDK, which means creating additional tools or integrating this into any of your existing tools and workflows should also be possible. There is no lock-in involved. You are free to use the Graph UI editor, you are okay to write this by hand, or you are free to create or use any other tools by extending them with this schema. Similarly, let's try and save this as one more file. We want something called world node, just to show how to create a graphlet. So let me go ahead and just save this as world node. Basically, I don't want to change anything. I'll just create the same exact details in world node. Just the name is changed. It also has current count, next count and three passes. Let me go ahead and save this as worldnode.JSON. And now what we want to do is create a new graphlet. So let's say we want to create this. Let's see, where were we? Yes. So hellonode.node.JSON and worldnode.node.JSON. We can drag and drop them here. Okay. And we can say that from the input port, we want to have this current node -- current count. And we have this world node. So let's just say hello's next count goes to the input of world. And then we have an output port.
And so we can connect to this, and that's what I was trying to show here. Once we connect all this, this is how it looks. And then we save this as helloworldgraphlet.JSON. And let's look at that: what does it look like? If we save this, we have a JSON file which defines the graphlet. It is very similar to the node JSON. If you see, it also has input ports and output ports. But whereas the node JSON has passes, which actually contain the placeholders for the logic, in the graphlet there is no pass. A graphlet doesn't execute anything on its own. It just describes the connections between the multiple nodes. The nodes have the actual passes, the actual executable logic, under them. That is why we have something called connections and not passes. And in a connection, it says what the source is and what the destination is. As you can see, for each source, there can be multiple destinations. But in this simple example, there is only a one-to-one mapping, which is why we don't have an [ RF ]. But if we had many nodes, like we saw in the CGF demo, one source could be connected to multiple nodes. The output of one node can be fed as input to multiple nodes.
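As a rough illustration of the graphlet file discussed above, the sketch below builds a comparable structure in Python and serializes it to JSON. The key names (`inputPorts`, `subcomponents`, `connections`, `src`, `dests`) and the port spellings are assumptions for this example only; the real schema ships with the PDK and should be taken from there.

```python
import json

# Hypothetical layout: ports, referenced node files, and connections. No passes.
hello_world_graphlet = {
    "name": "HelloWorldGraphlet",
    "inputPorts": {"CURRENT_COUNT": {"type": "int"}},
    "outputPorts": {"NEXT_COUNT": {"type": "int"}},
    "subcomponents": {
        "hello": "hellonode.node.JSON",
        "world": "worldnode.node.JSON",
    },
    "connections": [
        {"src": "CURRENT_COUNT", "dests": ["hello.CURRENT_COUNT"]},
        {"src": "hello.NEXT_COUNT", "dests": ["world.CURRENT_COUNT"]},
        {"src": "world.NEXT_COUNT", "dests": ["NEXT_COUNT"]},
    ],
}


def fan_out(graphlet: dict) -> dict[str, int]:
    """Number of destinations fed by each source endpoint."""
    return {c["src"]: len(c["dests"]) for c in graphlet["connections"]}


def unknown_instances(graphlet: dict) -> set[str]:
    """Instance names used in connections but not declared as subcomponents."""
    used = set()
    for connection in graphlet["connections"]:
        for endpoint in [connection["src"], *connection["dests"]]:
            if "." in endpoint:
                used.add(endpoint.split(".", 1)[0])
    return used - set(graphlet["subcomponents"])


serialized = json.dumps(hello_world_graphlet, indent=2)
```

Storing the destinations as a list per source keeps the one-to-one case and the one-to-many case in the same shape, so a source that feeds several nodes only needs more entries in `dests`.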
Unknown Executive
Next, let us look at a tool, nodestub, which enables us to quickly generate complete, compilable skeleton code for any node that we define. Once we have defined a node and have a node JSON file, we can instantly obtain template code for it using this nodestub tool. Let's see how.
Unknown Executive
So if you see, right, we have created something called worldnode.JSON and hellonode.JSON. We pass these two JSON files to nodestub, and we say: generate the C++ implementation in this directory and use this as the base class. There are a couple of other node base classes available. Currently, I'm planning to use ExceptionSafeProcessNode. When I run this command on hello node, since I've already created the files, it is just being very careful and asking if I want to overwrite. Yes, I want to overwrite. And I want to do this for worldnode.JSON as well. So as you can see, for every single node that we pass, it creates 4 files. There are 2 node class files and 2 implementation files. And if my only concern is getting to work quickly without changing anything else, just adding my logic and saving it, I can just modify the final implementation C++ file, and I don't need to worry about anything else. Yes. So this is the IMPL, the implementation C++ file. All the standard boilerplate code, the template code to make it compile and work, is present. But as you can see, in the process pass, there is absolutely no logic here. It will just return success without doing anything. So if I want to put any logic here, I'll just go ahead and implement it here.
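The idea of a generated skeleton whose passes simply return success until logic is filled in can be illustrated as follows. This is a Python sketch of the concept only: the real tool emits C++ files, and `make_stub`, the `SUCCESS` constant and the increment logic in `process` are all hypothetical.

```python
SUCCESS = 0  # stand-in status code for this sketch


def make_stub(description: dict) -> type:
    """Build a class with one do-nothing method per pass in the description."""

    def make_pass(pass_name: str):
        def run(self):
            return SUCCESS  # generated stub: no logic yet
        run.__name__ = pass_name
        return run

    methods = {name: make_pass(name) for name in description["passes"]}
    return type(description["name"] + "Stub", (), methods)


HelloNodeStub = make_stub(
    {"name": "HelloNode", "passes": ["setup", "process", "teardown"]}
)


class HelloNode(HelloNodeStub):
    """Hand-written part: only the pass that needs logic is overridden."""

    def __init__(self) -> None:
        self.current_count = 0   # stands in for the input port
        self.next_count = None   # stands in for the output port

    def process(self):
        # Example logic, assumed for illustration: count up by one.
        self.next_count = self.current_count + 1
        return SUCCESS


node = HelloNode()
node.current_count = 41
status = [node.setup(), node.process(), node.teardown()]
```

As in the generated C++ skeleton, the untouched passes keep returning success, so the node stays runnable while logic is added one pass at a time.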
Unknown Executive
While implementing the actual logic within a node's pass, note that there are already a bunch of helper functions available. For example, to access the data being processed by the node, helper functions are available to access a node's ports and parameters. This is explained in a lot more detail, along with example code, within the CGF documentation, which is part of the DriveWorks documentation. Once we have implemented the custom logic for any custom node we have created, we compile it to obtain an executable library with the custom node's implementation in it, for example, the implementation of the hello node and world node that we created earlier. Subsequently, whenever we specify an application JSON that references the hello and world nodes in its execution pipeline, CGF knows at run time to load the custom nodes library, which contains the implementation of the executable logic within the passes of the hello and world nodes. That logic will be executed as per the predefined schedule of the application pipeline whenever the hello or world nodes need to be activated. Let's look at an application in the context of CGF. Conceptually, this is what an application looks like. It is defined as an execution graph or pipeline that is made up of a bunch of graphlets, which, in turn, are made up of other simpler graphlets and nodes. This just defines the data dependencies. We also need to specify additional execution constraints. So how do we actually go about doing all this? We describe an application using an application JSON file. First of all, we add a reference to the top-level application graph or pipeline. You know that [indiscernible] is internally composed of additional graphlets and nodes. Second, we specify the relevant details required for deterministic execution of the application. This includes the execution engines, such as the GPU and any specific CPU cores, on which each of the passes is allowed to be executed.
Also, for each group of passes, we specify the periodicity with which [indiscernible]. All this information is key for STM, the System Task Manager, [Audio Gap] come up with an optimal and deterministic execution schedule. Finally, we specify the expected worst-case execution times, or WCETs, for each of the passes of all the nodes involved in the application. We specify all the WCETs in a single YAML file and add a reference to it in the application JSON. NVIDIA DriveWorks ships with the CGF demo sample. It was developed by following all the steps that we have briefly covered today. It uses 4 camera streams and DNN-based object detection, and its implementation is based on the sample application, sample object detector tracker. It uses the [ Tesamorelin ] and the sample application to perform object detection on 4 camera recordings. As briefly mentioned earlier, once an application is defined in terms of an execution graph or pipeline, along with its execution constraints and worst-case execution times, the System Task Manager can generate an optimal deterministic schedule. This deterministic schedule can be visualized even without actually having to execute the application. A way to do that is by using the STM schedule tool. Here, we can see the ideal execution timeline, with each block representing a distinct pass of a node involved in the application. Here is a screenshot of the CGF demo application in action. For more details on customizing it and running it, please refer to the NVIDIA DriveWorks documentation. Here, at the very end of the design and development workflow, we focus on profiling the execution of the application to identify any bottlenecks and hotspots, in order to optimize the execution and resource usage. NVIDIA Nsight and NVTX are extremely handy for obtaining an overview of the system as well as pinpointing any resource contention.
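To make the idea of deriving a schedule from dependencies, execution engines and WCETs concrete before any node logic exists, here is a small greedy sketch in Python. It is not STM's algorithm, and the pass names, engines and WCET values are invented for illustration; it only shows that the three kinds of constraints are enough to lay passes out on a timeline.

```python
from graphlib import TopologicalSorter


def schedule(passes: dict) -> dict[str, tuple[int, int]]:
    """Place each pass at the earliest time its inputs and its engine allow.

    passes maps a pass name to {"wcet_us": int, "engine": str, "deps": [names]}.
    Returns pass name -> (start_us, end_us), assuming every pass takes its WCET.
    """
    order = TopologicalSorter(
        {name: spec["deps"] for name, spec in passes.items()}
    ).static_order()
    engine_free: dict[str, int] = {}
    timeline: dict[str, tuple[int, int]] = {}
    for name in order:
        spec = passes[name]
        # A pass may start only after every pass it depends on has finished...
        ready = max((timeline[dep][1] for dep in spec["deps"]), default=0)
        # ...and only once its execution engine is free.
        start = max(ready, engine_free.get(spec["engine"], 0))
        end = start + spec["wcet_us"]
        timeline[name] = (start, end)
        engine_free[spec["engine"]] = end
    return timeline


# Hypothetical passes and WCETs, in microseconds.
PASSES = {
    "pyramid.build": {"wcet_us": 300, "engine": "GPU", "deps": []},
    "tracker.track": {"wcet_us": 500, "engine": "GPU", "deps": ["pyramid.build"]},
    "detector.detect": {
        "wcet_us": 400,
        "engine": "GPU",
        "deps": ["pyramid.build", "tracker.track"],
    },
    "logger.write": {"wcet_us": 100, "engine": "CPU0", "deps": ["pyramid.build"]},
}

timeline = schedule(PASSES)
```

Because every duration is the declared WCET, the resulting timeline is the ideal schedule the speaker refers to: a pass that later runs faster than its WCET still fits, while one that runs longer is a deviation from the plan.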
Using CGF, in addition to the built-in traces and logs that exist as part of the framework itself, one can add custom tracing points to any custom code one may have implemented. This ensures that any profiling information collected from the execution of the final implementation provides an integrated and intuitive understanding of the behavior of the application on the system in various conditions. [indiscernible] execution of such an application on a certified functionally safe system is constantly monitored without any additional inputs from the application. During run time, the underlying System Task Manager, or STM, is responsible for the deterministic execution of the application as per the predetermined schedule. It keeps track of the actual execution of the application and ensures that the program flow monitor, or PFM, is immediately aware of any deviation in the actual execution timeline. If such exceptional behavior occurs, the PFM can subsequently take any necessary corrective actions. To start developing with DRIVE OS and DriveWorks, here are a few simple steps. You can participate in our webinars, as you did today, and in the hands-on training sessions. View the tutorials and watch our DRIVE Labs series. Please go to the DRIVE developer page for more information. If you are not part of our DRIVE developer program for DRIVE AGX, then I strongly encourage you to join. Please visit the DRIVE developer page. Once you are part of the DRIVE developer program, you can download and install DriveWorks. On the DRIVE documentation page, you'll find the DRIVE OS and DriveWorks SDK reference guides, which describe the APIs and workflows for the various topics we discussed today. Once the software is installed, it is time to start developing. I recommend that you review the DRIVE OS and DriveWorks samples, which are fully functional demonstrations along with their respective source code. They are a great starting point for your development.
After familiarizing yourself with the samples, you can start developing your own software. We want to build a strong ecosystem of developers and would like to hear from you. The NVIDIA developer program provides helpful resources such as forums to post questions and interact with other developers. Let us know what you are creating to have your success story featured in our partner pages. With this, I'll now open the floor for any questions.
Unknown Executive
Hey, thanks for joining this webinar. I'll now begin answering some of the questions we have received. We seem to have a bunch of interesting questions around the worst-case execution time. There's one about what happens when the worst-case execution time is exceeded. Basically, the underlying scheduler, the System Task Manager, or STM, notices that the predefined schedule has been violated and ensures that the component responsible for the overall safety of the system is notified about it. As we saw in one of the slides on DRIVE OS, this is the program flow monitor, or PFM, and it will be notified about which executable missed the deadline and what the extent of the deviation from the schedule was. The PFM is then expected to assess the severity of the deviation from the ideal execution timeline and decide on the appropriate course of action. Another question we have is: how do we support high ASIL integrity? Is it mandatory to use PFM? The program flow monitor, or PFM, acts as the supervisor component, and that would be developed to the highest ASIL levels, which then means that the actual functional components, the executable logic, the functional logic, can be QM or ASIL B. And the system decomposition, the safety decomposition, would be such that, since we have redundancy of execution, either in hardware or software, and we have a higher-level supervisor component, that's how we can target the highest ASIL levels. Moving on. We have a question on DW Graph UI. When an application contains several graphlets, does the application have one process or multiple processes? The process configuration is completely orthogonal to the graphlets, the hierarchy of graphlets and nodes. Multiple graphlets or nodes can be executed within a single process, or they can be split across multiple processes.
It's up to the system architect, the system designer, to make this choice, depending upon the knowledge of what executable logic runs best where. Let's say we want to assign a set of executable logic, a bunch of nodes, to a particular CPU core; that is definitely possible irrespective of how we have implemented the code, whether across multiple nodes or multiple graphlets. So yes, completely independent and free to choose. What else? Do we have any other questions? So here's one. What if we have 2 nodes running on different [ torrents ]? Could we do network-based communication between the nodes? The answer is yes. That's definitely possible. However, I can't think of a sample in DriveWorks that covers this or explains it in detail. That's something we can definitely add going ahead. So yes, it is definitely possible to communicate across multiple [ torrents ]. Looks like we have all the questions done. So yes, thanks for joining this webinar, and we hope you found it useful. You'll now receive an e-mail with a link to the replay of the webinar along with the slides shown today. Yes, hope you have a good day or night, depending on where you are. Thanks again. Bye-bye.