NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary
February 29, 2024
Earnings Call Speaker Segments
Unknown Executive
executiveHi, everyone. Thanks for joining us today for our webinar on Optimizing Multi-task Model Inference for Autonomous Vehicles. Before we begin, we wanted to cover a few housekeeping items. In the upper right corner, you should see a More Information button. If you click this, you'll see a few links expand, including a feedback survey. Please take a moment to provide us your feedback as it helps tailor future webinars. [Operator Instructions] A copy of today's slide deck and additional help materials are also available on the resource list. We encourage you to download any resources or bookmark any of the websites that you may find useful. Here are some tips to help make this event as best as it can be. To maximize the quality of the audio stream, please close any open applications aside from your browser window. Also, refreshing your browser can fix a lot of the many issues. Sometimes if your audio cuts out or it's not loud enough or you notice any lags, that's a good, simple, quick technique to fix it. Now without further ado, I'll turn the event over to our speakers to begin the presentation.
Le An
executiveHello, everyone. Welcome to our webinar on Optimizing Inference for Multi-task Models for the NVIDIA DRIVE platform. My name is Le An from NVIDIA, and I work on model development and inference optimization for automotive use cases. Today, my colleague, Yuchao, and I will talk about how to strategically deploy a multitask model on the NVIDIA DRIVE Orin platform by utilizing different deep learning compute resources for efficient inference. In the following, we will first briefly go over the basic concept of a multitask model and some related work in autonomous driving. Before we dive into details about the deployment of a multitask model, we will give an overview of the NVIDIA DRIVE Orin platform and how we can leverage different compute resources such as the GPU and the Deep Learning Accelerator, DLA, for better efficiency. This will be illustrated by a sample network, and my colleague, Yuchao, will explain in detail the end-to-end workflow of model training, model quantization, model compilation and implementation of the inference application. We will show that the proposed deployment strategies can greatly improve the latency and throughput for this kind of multitask network. So conventionally, a deep learning model is designed for a single task, such as a ResNet for image classification. However, we are dealing with much more complex applications where many tasks need to be performed. Applying an individual model for each single task is no longer efficient or scalable. In some latency-critical applications such as autonomous driving or on embedded platforms, limited compute resources are available, so having many single-task models may not even be feasible. On the other hand, multitask models perform different tasks within a single network. In a typical multitask model, a backbone is shared by different tasks, and different task heads are responsible for processing the shared features and producing task-specific output.
The tasks in multitask models are also closely related. From a learning perspective, a lot of the tasks can [ improve ] through mutually beneficial training signals, leading to better overall accuracy. Nowadays, multitask learning is widely adopted in different domains such as computer vision, natural language processing and so on. In recent years, multitask models have become popular in autonomous driving, especially in perception tasks. Perception tasks such as object detection and tracking, [ depth ] estimation and semantic segmentation are highly correlated. Therefore, putting those tasks into a multitask framework is beneficial for both training and deployment. Apart from perception, downstream tasks such as planning and control can also be integrated into a multitask learning framework. One representative work is a multitask attention network for end-to-end autonomous driving. This model is composed of a shared encoder backbone, one decoder for depth estimation, one decoder for semantic segmentation, one classifier for traffic lights, as well as a driving module that predicts steer, throttle and brake. The output is a control signal to drive the car. To improve the performance of multitask models, our recent work proposed a so-called pretrain, adapt and fine-tune paradigm for general multitask learning. In the adapt stage, learnable multi-scale adapters dynamically adjust the pretrained model weights with supervision from multitask objectives. This approach is validated in our use case for object detection, semantic segmentation and drivable area segmentation, and it shows very competitive results. In another recent work, a planning-oriented end-to-end model called UniAD is proposed. This model has 4 stages: backbone feature extraction, perception, prediction and planning. In the perception part, the inputs to the tracking and mapping modules are the bird's-eye view features.
The output of perception is directly consumed by the prediction module, which later provides its output as input to the planning module to predict the future trajectory of the car. This is a very interesting work that combines all essential tasks into a single multitask end-to-end model with state-of-the-art results on public datasets. So given the popularity of multitask models in autonomous driving, in this talk, let's see how we can better deploy a multitask network on DRIVE Orin. For demonstration, we construct a simple multitask model. Specifically, our model takes a single image as input, which goes through a shared backbone encoder. There are 2 task heads on top of the backbone. One head is a depth decoder for depth estimation, and the other is a segmentation decoder for semantic segmentation from the input image. So before we talk about how to efficiently deploy the multitask model on NVIDIA DRIVE Orin, let's first go over the hardware configuration of the NVIDIA DRIVE Orin platform. The NVIDIA DRIVE Orin SoC is an embedded supercomputing platform that can process data from camera, lidar and radar sensors in order to support applications such as autonomous driving, in-cabin functions and driver monitoring, as well as other safety features. The DRIVE Orin SoC is based on the Ampere architecture with third-generation Tensor Cores. For deep learning, it consists of a GPU with 16 streaming multiprocessors and 2 instances of the so-called Deep Learning Accelerator. Both can be used for inference. In total, DRIVE Orin can deliver up to 254 INT8 TOPS, and it can be scaled to support everything from Level 2+ systems all the way up to Level 5 fully autonomous vehicles. Now let's take a closer look at the Deep Learning Accelerator on DRIVE Orin. DRIVE Orin features up to 2 second-generation DLAs. Regarding deep learning performance, the 2 DLAs can deliver up to 87 TOPS.
DLA is a fixed-function accelerator, and it supports many layers such as convolution, deconvolution, fully connected, different types of activation, pooling, normalization and so on. DLA is complementary to the GPU in the sense that while the GPU delivers most of the TOPS in high-power profiles, DLA is very good at power efficiency. DLA supports mixed precision, such as FP16 and INT8. However, note that DLA is mainly designed for INT8 inference, so we encourage users to run networks in INT8 precision wherever possible on DLA. For more information on DLA, you can refer to the TensorRT developer guide, which has a dedicated chapter on working with DLA where you can find more details about DLA layer support, data format support and so on. NVIDIA also open-sourced a GitHub repository on DLA in which you can find more details, including how DLA works, performance benchmark numbers and some useful tools and scripts. Now going back to our sample multitask model. In this case, we have 2 task heads, so we're going to deploy the depth head on one DLA and the segmentation head on the other DLA. The backbone, which is heavier and needs more compute than the heads, is deployed on the GPU. For accuracy considerations in feature extraction, we run the backbone in FP16 precision. On DLA, we perform quantized inference in INT8 precision. In this setup, we free the GPU from doing inference for the heads. This can be beneficial in 2 ways. Either the saved GPU resources can be repurposed for other tasks, or the inference can be pipelined such that while DLA is processing the heads for the current frame, the GPU can already work on feature extraction for the next frame. In practice, we split the original model into backbone and heads. This can be done with ONNX tools such as GraphSurgeon. Next, I will hand over to my colleague, Yuchao, and he will dive into the details of the implementation.
Yuchao Jin
executiveHi. I'm Yuchao Jin from NVIDIA. I will present more detailed information about our implementation. For the encoder, as a common feature extractor for both tasks, we use the Mix Transformer from SegFormer. This is a backbone inspired by the popular Vision Transformer and optimized for semantic segmentation. Compared to Vision Transformer, this backbone has 4 stages and can produce multi-level feature maps like convolutional networks. Multi-scale features at different resolutions are critical for dense prediction tasks such as semantic segmentation and depth estimation, where objects and scenes vary in size. Mix Transformer employs an overlapped patch merging process, which is used to preserve local continuity around patches. With a good choice of the stride and padding of the overlapped patches, it can produce features with the same size as the non-overlapping process. Within each stage of this backbone, there are efficient self-attention modules. For automotive use cases, this is beneficial since the main computation bottleneck of many other transformer-based encoders is the self-attention layer, whose computational complexity is quadratic with respect to the input image size, while in autonomous driving, large image resolutions are frequently used. The efficient self-attention modules implement the sequence reduction process introduced in Pyramid Vision Transformer to reduce the total computational complexity by a constant factor, which is a hyperparameter set for Stage 1 to Stage 4 inside the encoder. Also note that in this backbone, we do not use positional encoding as typically seen in other Vision Transformers. This is also beneficial because the resolution of the positional encoding in Vision Transformer is fixed. When the resolution of a test image is different from the training set, the positional encoding needs to be interpolated, and this often leads to dropped accuracy.
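As a rough illustration of the overlapped patch merging and sequence reduction points above, the sketch below applies the standard convolution output-size formula and counts query-key score entries. The kernel, stride and padding values (7, 4, 3) and the reduction ratio of 64 are the Stage 1 settings reported in the SegFormer paper rather than values stated in this talk, and the 1024 x 2048 input size and the function names are illustrative assumptions.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_score_entries(num_tokens: int, reduction_ratio: int = 1) -> int:
    """Number of query-key scores when the key sequence is shortened by reduction_ratio."""
    return num_tokens * (num_tokens // reduction_ratio)


height, width = 1024, 2048  # illustrative input resolution (assumption)

# Overlapped patch merging (K=7, S=4, P=3) vs. non-overlapped patches (K=4, S=4, P=0).
overlapped = (conv_output_size(height, 7, 4, 3), conv_output_size(width, 7, 4, 3))
non_overlapped = (conv_output_size(height, 4, 4, 0), conv_output_size(width, 4, 4, 0))

# Stage 1 token count and attention cost with and without sequence reduction.
tokens = overlapped[0] * overlapped[1]
full_cost = attention_score_entries(tokens)
reduced_cost = attention_score_entries(tokens, reduction_ratio=64)

print(overlapped, non_overlapped)
print(full_cost // reduced_cost)
```

Both patch settings yield the same feature map size, and the reduction ratio lowers the attention cost by that same constant factor, which is why it matters most at the high-resolution early stages.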
Instead of using positional encoding, we use the SegFormer Mix-FFN module, and this is empirically found to be sufficient to provide positional information for transformers. For the semantic segmentation decoder, we also adopted the design from SegFormer. This decoder only consists of multilayer perceptron layers, upsampling layers and activation layers. In our implementation, we further reduce the complexity by replacing the concat operator with [ add ]. This makes our segmentation head very lightweight and easy to deploy on DRIVE platforms. For the depth estimation head, we use a progressive decoder. This decoder receives outputs from all transformer blocks in the encoder and combines them in a progressive manner with a sequence of upsampling and convolution, and the size of the final output depth map from the decoder is equal to the input size. For training, we adopt a semi-supervised learning strategy to make better use of raw data [ without ] the ground-truth labels from different datasets. We apply 2 strategies to create pseudo labels for these datasets to do semi-supervised learning. The first strategy is the so-called online process, in which we generate pseudo labels during the training of the multitask model. This process is lightweight and easy to integrate into existing workflows. The second strategy is an offline process where we employ a larger teacher model to create high-quality pseudo labels. To better leverage the generated pseudo labels, we use a discriminator to selectively choose reliable pseudo labels for the training process. For more details on multitask model training, you can refer to our previous GTC talk on this topic. As we may know, lower precision leads to a smaller memory footprint and more math operations per second. On the Orin platform, the trillion operations per second, or TOPS, for INT8 is twice that of FP16. This is because we can utilize the faster and cheaper INT8 Tensor Cores.
This is especially useful for compute-intensive operators such as convolution or MLP. Lower memory usage leads to a lower memory bandwidth requirement and can result in faster inference. Also, embedded platforms usually have limited memory, so a data type like INT8 can also help us save precious space. As mentioned in previous slides, if we want to better utilize the DLA hardware, we should quantize both heads into INT8. Here, quantize means casting a floating-point number into an 8-bit integer, ranging from minus 128 to 127. To achieve this, we need to normalize and rescale the floating-point value with a specific scale. In this work, we follow the post-training quantization procedure. In general, the workflow is to collect a set of inputs as calibration data, feed them into the pretrained network, collect statistical information for each layer and come up with a scale for each of them, just like the chart at the bottom of this slide. We can use APIs from TensorRT to obtain the calibration cache files. TensorRT already provides some choices such as min-max or entropy-based calibration. You may try each of them and take the one with the least accuracy drop. The scales for activations and weights are saved and used in the following inference stage, so we don't have to recalibrate every time. Alternatively, Quantization-Aware Training, or QAT, can also be applied. It is a training technique to boost the performance of quantized networks. Basically, it makes the training procedure aware of the error introduced by quantization. Ideally, this additional step makes the network more robust to quantization and leads to less accuracy drop when we quantize. This is a whole area that is too large to be covered in today's webinar. In short, if your model suffers from a performance drop after trying different calibration methods, you may consider using QAT techniques, and it might solve your problem.
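The post-training quantization idea described above can be sketched in a few lines. This is a simplified, symmetric min-max scheme written for illustration only; it is not TensorRT's calibrator, and the function names and calibration values are made up.

```python
def minmax_scale(calibration_values):
    """Symmetric min-max calibration: map the largest magnitude seen to 127."""
    amax = max(abs(v) for v in calibration_values)
    return amax / 127.0 if amax > 0 else 1.0


def quantize(value, scale):
    """Rescale, round to the nearest integer, then clamp to the INT8 range."""
    q = round(value / scale)  # note: Python rounds exact .5 ties to the even integer
    return max(-128, min(127, q))


def dequantize(q, scale):
    """Map an INT8 value back to an approximate floating-point value."""
    return q * scale


calibration = [-2.0, -0.5, 0.0, 0.75, 1.27]  # made-up activations
scale = minmax_scale(calibration)
quantized = [quantize(v, scale) for v in calibration]
restored = [dequantize(q, scale) for q in quantized]

print(scale, quantized)
```

Values inside the calibrated range come back with an error of at most half a scale step, while values outside it saturate at the ends of the INT8 range, which is why the choice of calibration data and method affects accuracy.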
You can refer to the second link in the footnote for a more detailed explanation of QAT. To fully utilize the power of the target board, we should build the engine on it. In this case, TensorRT is able to profile [ kernels ] at run time and choose the best tactic with the lowest run time. Only with the actual [ environment ] can we collect the most precise run time results. NVIDIA already provides the tool trtexec for building and evaluating models on the Orin platform. To build a GPU engine from an ONNX model file, you can run trtexec with [indiscernible]. As you may notice, we specified the output format as FP16 CHW32. Here, FP16 means we output the tensor with data type FP16. This is because the backbone is already running in FP16, and we can directly output the feature in the same precision. This not only saves some memory traffic but also eliminates the data conversion at the end of the inference. Otherwise, TensorRT will insert an additional data type conversion from FP16 to the default precision, which is actually FP32. The output layout is CHW32, the 32-wide channel vectorized row-major format. This is a specially designed layout that takes memory access coherence into account. This memory layout avoids memory reorganization, since we pass the output feature to DLA with the same layout. To build a DLA loadable, you can add the useDLACore and buildDLAStandalone [indiscernible] to the command line. A loadable is different from the engine we mentioned above. A loadable is designed to run outside of TensorRT. That's why we call it standalone. We can use cuDLA to load it and do the inference. For more details, you can refer to the link in the footnote. As we mentioned before, DLA is mainly designed for and focused on INT8 inference, so we changed the precision flag from FP16 to INT8. Also, we specified the input and output formats for better performance.
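To picture the CHW32 layout mentioned above, the sketch below computes where an element lands in memory, following the TensorRT documentation's description of this format as equivalent to an array of shape [(C+31)/32][H][W][32] for a single image. The batch dimension is omitted and the function names are illustrative assumptions.

```python
def chw32_offset(c: int, h: int, w: int, height: int, width: int) -> int:
    """Element offset of (c, h, w) in an array laid out as [C/32][H][W][32]."""
    block, lane = divmod(c, 32)  # which 32-channel block, and position inside it
    return ((block * height + h) * width + w) * 32 + lane


def chw32_num_elements(channels: int, height: int, width: int) -> int:
    """Total elements stored, with channels padded up to a multiple of 32."""
    blocks = (channels + 31) // 32
    return blocks * height * width * 32


# A small 40-channel, 4 x 8 feature map as an example.
print(chw32_offset(1, 0, 0, 4, 8))    # neighbouring channel, same pixel
print(chw32_offset(0, 0, 1, 4, 8))    # same channel, next pixel
print(chw32_num_elements(40, 4, 8))   # padded storage size
```

Because 32 consecutive channels of one pixel sit next to each other, producer and consumer can exchange the buffer without a reformat step as long as both agree on this layout.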
Different input and output memory layouts lead to different tactic choices, so choose the layout properly to minimize the latency. trtexec will first load and parse our provided ONNX model. Then the parsed model is sent to the DLA compiler. The compiler finally produces a loadable, which is stored in a binary file. Please be aware that TensorRT cannot load and process a DLA loadable file directly. After we build a DLA loadable with trtexec, we would like to load it and do the inference in our application. We can do so through cuDLA. cuDLA is an extension of CUDA. It helps us use DLA just like how we interact with the GPU. cuDLA [ wraps ] the low-level operations with the NVIDIA DLA driver, and it also natively supports CUDA streams and CUDA events. We can initialize and submit a DLA task to a CUDA stream, just like what we can do with CUDA kernels or TensorRT engines, even though GPU and DLA are totally different hardware. cuDLA also provides similar semantics for error handling. You can use cudlaGetLastError just as we use cudaGetLastError. With all the advantages above, cuDLA is almost the only choice for rapid prototyping. For more information, we attached several useful links in the footnote. In our application, we break the whole network into 3 parts: backbone, segmentation head and depth estimation head. We assign the backbone to the GPU and the 2 heads to the DLAs individually. Since the 2 heads rely on the backbone's [ output ] feature, we must schedule the DLA tasks after the GPU finishes its work. Otherwise, DLA can't get the correct feature map. So here is a high-level design of how we schedule the pipeline. When Frame T arrives, the GPU consumes Frame T while the DLA cores are still working on the feature map from the previous Frame T minus 1. After the GPU finishes Frame T, that triggers a signal for the DLAs, indicating that the GPU has finished its work. And then the GPU can start working on the next frame, [indiscernible].
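The schedule described above can be checked with a small timing simulation. It is only a sketch: the stage latencies are hypothetical round numbers, the two heads are modeled as one DLA stage whose latency is that of the slower head, and the function name is made up.

```python
def simulate_pipeline(num_frames, gpu_ms, dla_ms):
    """Return (gpu_done, dla_done) completion times for a 2-stage pipeline."""
    gpu_done, dla_done = [], []
    gpu_free = dla_free = 0.0
    for _ in range(num_frames):
        gpu_free += gpu_ms                # GPU processes frames back to back
        start = max(gpu_free, dla_free)   # heads need the features and a free DLA
        dla_free = start + dla_ms
        gpu_done.append(gpu_free)
        dla_done.append(dla_free)
    return gpu_done, dla_done


gpu_ms, dla_ms = 30.0, 20.0  # hypothetical stage latencies in milliseconds
gpu_done, dla_done = simulate_pipeline(4, gpu_ms, dla_ms)

pipelined_total = dla_done[-1]                  # time until the last frame is done
sequential_total = 4 * (gpu_ms + dla_ms)        # same stages without overlap
steady_interval = dla_done[-1] - dla_done[-2]   # time between finished frames

print(dla_done, pipelined_total, sequential_total, steady_interval)
```

Once the pipeline is full, a finished frame comes out every `max(gpu_ms, dla_ms)` milliseconds, so the slower stage sets the throughput.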
This forms a 2-stage pipeline to pursue better hardware utilization and better throughput. The overall throughput is determined by the longest stage, in our case, the backbone. For different backbones and different tasks, the split strategy might be different. Developers should consider multiple factors. As the overall run time is bounded by the slowest part, we should consider the run time of each part individually. Also, DLA is designed for power-efficient inference, which means in some specific cases, such as custom operators, it is easier to run them on GPU rather than DLA. For some tasks, we may want to keep the network running in FP16 for better results, which means this network may not be suitable for DLA either. Memory copy and reformatting between sub-networks can also be a problem. This overhead might be large if the feature map is huge. With all these factors as guidelines, we can decide the task assignment. A good separation plan should introduce minimum memory and reformat overhead while maintaining a good load balance between tasks, so that no hardware sits idle and wastes time. Allocating memory for DLA is just the same as for TensorRT. All you have to do is to claim a chunk of GPU memory with cudaMalloc and register it to DLA with [indiscernible] register. In this way, DLA will recognize and consume the data from the given address. In this slide, you can see 2 pointers. The GPU pointer is the start address of the GPU memory you just allocated. The DLA pointer is the address DLA uses internally, as GPU and DLA have separate memory mapping tables in hardware. From the user's view, you can consider the GPU pointer as the input buffer. You only need to interact with the GPU buffer, and cuDLA will help you with the underlying data mapping and transfer from memory to the DLA internal buffer. For TensorRT, we can either allocate GPU memory with cudaMalloc manually or reuse the existing memory.
Then users can assign the pointer as a buffer for each input or output tensor with the setTensorAddress API. Buffer allocation and management can be important. A careful design such as zero-copy can save both memory and time. In our case, both heads running on DLA use the same input feature map, so we assign the same feature map buffer pointer to both DLAs, and the 2 DLAs share the same feature map from the backbone. In this way, we save the memory for duplication, and we convert the feature map to INT8 only once. To submit a task with TensorRT, assuming we have already initialized the TensorRT context, we can submit a task to a specific stream with enqueueV3, just one line, and everything is handled by TensorRT. For cuDLA, we set up the cudlaTask structure and use cudlaSubmitTask to launch it, just like TensorRT. In this structure, the module handle points to the DLA loadable we just initialized and loaded. The input tensors and output tensors are pointers we acquired from cudlaMemRegister. Wait events and signal events are for standalone mode with the DLA driver. Since we use cuDLA to interact, we can simply ignore these 2 fields. In our application, these parameters stay the same the whole time. Do remember to provide a custom stream to these function calls. Otherwise, they will be submitted to the default stream and may cause unexpected behavior. For example, some synchronous API calls may block other kernels from running and may break the order we want. To recap our design: the backbone runs in FP16, and the heads run in INT8. Since DLA is running standalone and its input data type is already set to INT8, we have to manually quantize the output feature before DLA can directly consume it. Before submitting the 2 DLA tasks, we convert the feature map from floating-point values into INT8 manually with a given kernel.
This kernel is naively implemented and acts as a simple sample to show how to do quantization ourselves. The first step is to rescale the input floating-point value from an arbitrary range to a limited range. Then we clamp it to the INT8 range and round the floating-point value to an integer. This is an element-wise operation, so the indexing in this kernel is fairly simple. As we can see, each thread only writes back one single 8-bit number, so it is possible to optimize it with vectorization. Say [indiscernible] handle four numbers, and the write-back can be finished with 1 single 32-bit transaction. To maintain execution order between GPU and DLAs, we can use CUDA streams and CUDA events. In CUDA, API calls can be either synchronous or asynchronous. For example, cudaMemcpy is a synchronous API. When this API returns, the memory copy has been submitted and finished. There is also cudaMemcpyAsync, which is the asynchronous counterpart of cudaMemcpy. If we call this API, it returns immediately. At this point, the copy task has only been submitted to the stream, and the actual copy may not have finished or even started. It depends on the status of the stream. So now we step back a little bit. What is a CUDA stream? A CUDA stream is a special structure that describes a queue of device jobs. The host pushes jobs into the queue and returns immediately for the next actions, while at the same time, the GPU tries to acquire and schedule work from the [ streams ] when the hardware is free. Operations in the same stream are launched in first in, first out order. This order is guaranteed by the CUDA driver. But remember, by design, we want GPU and DLA busy at the same time. If we want to parallelize our tasks, we must put them in different streams. Now the problem becomes how we can maintain order across different streams. To achieve this, we can use CUDA events. A CUDA event can be considered as a signal.
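The vectorized write-back idea mentioned earlier for the quantization kernel, four 8-bit results stored as one 32-bit word, can be illustrated on the host side. This is only an analogy in Python using the standard struct module; in a CUDA kernel the same idea would typically use a 4-byte vector type such as char4, and the helper names here are made up.

```python
import struct


def pack_int8x4(values):
    """Pack four INT8 values into one little-endian 32-bit word."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack_int8x4(word):
    """Recover the four INT8 values from a 32-bit word."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


word = pack_int8x4([1, -1, 127, -128])
print(hex(word), unpack_int8x4(word))
```

One 32-bit store replaces four separate 8-bit stores, which reduces the number of memory transactions for the same amount of data.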
For a CUDA event, you can either put it into a stream with cudaEventRecord or wait on a specific event with cudaStreamWaitEvent. In our specific case, assume we have a GPU stream and a DLA stream. From the CPU perspective, the actions are: enqueue the backbone to the GPU stream, record the event in the GPU stream, let the DLA stream wait on the event and, in the end, submit the DLA task to the DLA stream. The GPU starts the inference right after we enqueue it and works on the input. After a while, when the GPU finishes the task, it executes the next item in the stream, which is the cudaEventRecord. At this point, the event is marked as occurred. From the DLA perspective, once we call cudaStreamWaitEvent, it checks whether the CUDA event has occurred. If not, DLA just waits. After a while, when the GPU finally finishes the job and changes the status of this event, DLA is finally able to move on to the next item in the stream, which is cudlaSubmitTask. So in this way, we ensure that the DLA task always comes after the GPU finishes its job. Now let's take a look at some inference results. Here are 4 visualizations of the segmentation and depth estimation results for both heads from our inference application on Orin. As we mentioned before, the backbone inference is done in FP16 precision, and the inference for both heads is done in INT8 precision. These images were randomly chosen from the Cityscapes dataset. The left image is the input, segmentation is in the middle and depth estimation is on the right. From a qualitative point of view, you can see the segmentation results match what we see in the input images. For example, please see the image in the top right. Pedestrians and motorcycles are categorized correctly with quantized inference on Orin. From our internal study, we also noticed that there is negligible accuracy loss for the segmentation task with INT8 precision.
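The record-then-wait ordering described earlier in this segment can be mimicked on the host with Python threads. This is an analogy, not the CUDA API: each thread stands in for one stream's queue, `threading.Event` plays the role of the CUDA event, and the names are illustrative. In real CUDA code the host does not block on cudaStreamWaitEvent; only the waiting stream does.

```python
import threading

log = []
backbone_done = threading.Event()  # stands in for the CUDA event


def gpu_stream():
    log.append("gpu: backbone inference")  # the enqueued backbone work
    backbone_done.set()                     # analogous to cudaEventRecord


def dla_stream(name):
    backbone_done.wait()                    # analogous to cudaStreamWaitEvent
    log.append(f"{name}: head inference")   # analogous to cudlaSubmitTask


# The DLA workers are started first on purpose: they still run after the GPU.
workers = [
    threading.Thread(target=dla_stream, args=("dla0",)),
    threading.Thread(target=dla_stream, args=("dla1",)),
    threading.Thread(target=gpu_stream),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

print(log)
```

Whatever order the workers are launched in, the backbone entry always appears first in the log, while the two heads may finish in either order relative to each other.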
Generally speaking, classification tasks are less prone to quantization error, and thus they are great candidates for deployment in INT8 precision. Intuitively speaking, this is because classification tasks focus more on the relationship between classes. As long as our target class has the largest confidence score, the result can be considered correct even though the actual numbers in the output are totally different from FP32. As for depth estimation, the inference with INT8 precision still shows good results. As you can see, objects from near to far are outlined decently. However, there can be quantization error in the output, represented by some [ blocking ] artifacts in some fine details. One way to improve this is to apply mixed precision. Specifically, we can choose some of the layers to run in FP16 precision on DLA. This may impact the latency but may help with accuracy greatly. Mixed precision for inference is a case-by-case choice, and it is out of the scope of this webinar. We plan to investigate it further in the future. Here are the benchmark results acquired on the Orin-X platform with DRIVE OS 6.0.9.0, CUDA 11.4 and TensorRT 8.6.12.4. Please be aware that except for the numbers in row 2, all the other numbers are measured and reported with the trtexec command line tool. In row 1, the whole network runs in FP16. All the workload is running on the GPU, the inference time is 38.923 milliseconds and the throughput is 25.692 frames per second. We will consider this our baseline result. Row 2 is the pipelined application for the whole network. That means the FP16 backbone on the GPU and the INT8 heads on each DLA, as we designed in previous slides. In this setting, the inference time is 29.237 milliseconds, and the throughput is 34.203 frames per second. Compared with row 1, the improvement is significant. It's 33%. Rows 3, 4 and 5 are the profiling results for each part separately.
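The relationship between the latencies and throughputs quoted for rows 1 and 2 can be checked with a short calculation. The sketch assumes throughput is the reciprocal of the reported per-frame time; the variable and function names are illustrative.

```python
def fps_from_latency(latency_ms: float) -> float:
    """Throughput in frames per second when one frame completes every latency_ms."""
    return 1000.0 / latency_ms


baseline_ms = 38.923   # row 1: whole network on GPU in FP16
pipelined_ms = 29.237  # row 2: FP16 backbone on GPU, INT8 heads on DLA

baseline_fps = fps_from_latency(baseline_ms)
pipelined_fps = fps_from_latency(pipelined_ms)
improvement = pipelined_fps / baseline_fps - 1.0

print(f"{baseline_fps:.3f} fps -> {pipelined_fps:.3f} fps ({improvement:.0%})")
```

Under that assumption, 29.237 milliseconds per frame corresponds to roughly 34.2 frames per second, which is consistent with the 33% improvement over the 25.692 frames per second baseline.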
Row 3 is for the backbone only in FP16, and rows 4 and 5 are for each head in INT8 on DLA. The backbone costs 28.136 milliseconds for each frame. It is slightly smaller than row 2. The first reason is that trtexec doesn't require preprocessing and the [indiscernible] compensation we introduced between DLA and GPU. The second reason is that although DLA is separate hardware, DLA and GPU share memory bandwidth. So when DLA and GPU are working at the same time, the memory traffic is under higher pressure, which leads to a minor run time increase. The segmentation and depth estimation heads take 4 feature maps at different scales, and their latencies are 20.749 milliseconds and 17.555 milliseconds, respectively. Both of them run faster than the backbone, so the overall run time is dominated by the backbone run time. In this slide, we are showing some profiling results. This result is captured with Nsight Systems, a profiling tool provided by NVIDIA. It can be used on multiple platforms and generates a report for every corner of the system. This report can be visualized with the Nsight Systems graphical user interface on other platforms, so you don't have to plug a monitor into Orin to see the result. You can also list the statistical report without the GUI. Developers can use it to understand how much time each kernel or CUDA API call takes, as well as hardware utilization information. Each colored rectangle in the screenshot represents a specific event. It can be either a specific CUDA kernel or a CUDA event call. For example, in this screenshot, the blue blocks are CUDA kernel calls, such as convolution, [indiscernible] normalization and other stuff. The orange blocks are TensorRT markers. They are a good indicator of how long each frame takes in total for a specific TensorRT engine. The tiny yellow blocks under the TensorRT blocks are TensorRT layer blocks, and the cyan blocks are memory events such as memset or memcpy.
The longer the rectangle is, the more time it consumes. So with this tool and this report, we can easily locate the bottleneck of our system. Developers can also add custom ranges or events with the [ NVTX ] APIs. This means you can mark your ranges even on the CPU. Nsight Systems is a useful tool for performance analysis. If you want to dive into a specific kernel, especially when you see 1 or 2 kernels that are super slow and contribute most of the run time, you may want to try Nsight Compute for detailed kernel-level information. Nsight Compute helps you collect more low-level information such as the number of [indiscernible] conflicts and the memory traffic between L1 and L2, shared memory and global memory. You can also see the instruction stall status. It also generates a roofline chart. With all this information, developers can decide on potential optimization directions. Let's go back to our screenshot. This screenshot is from profiling the whole network on GPU only with FP16. Events and ranges are categorized on the left. In this picture, you can see DRAM traffic at the top, GPU utilization in the middle and CUDA kernel calls at the bottom, as well as TensorRT inference time. But there's no DLA. There is no row for DLA, which means DLA is idle in this setting. Please remember, for profiling on the Orin platform and seeing all the information we have here, you should make sure that Nsight Systems captures all sections by adding specific [ flags ]. You can do so by adding -t cuda, cublas, cudla, [ nvtx ], [indiscernible] accelerators and [indiscernible] media. Also, if you want to see memory traffic profiling, you can enable it by setting the flag [indiscernible] metrics to true on the Orin platform. So this slide is different from the previous one. It contains the profiling results of our pipelined application on Orin.
Compared with the previous slide, we focus on the rows in the SoC metrics where, now that DLA is enabled, we can see DLA utilization. The GPU is running at the same time. Remember that all heads are running on DLA and the heads depend on the GPU output features. From the profiler results, we can see that DLA is waiting for the GPU. You can see these blank spaces. That is because DLA is waiting on the CUDA event we specified. Only after the GPU finishes its work will DLA start on the following tasks. Also look at the DBB DRAM bandwidth row. Compared with the previous screenshot, there is some memory traffic in this row. This is the extra memory traffic toward DLA, and it aligns with our benchmark numbers. Since we offload some of the workload to DLA, we achieve around 33% overall throughput improvement without any adjustment, pruning or retraining of the network. We have just talked about our training methods and experiments. To recap, we have developed a novel approach to multitask training that combines several strategies to enhance the overall performance of the model. We utilize both online and offline label generation processes to train the model, and we can use them individually or jointly, depending on the training budget. In addition, we use a pseudo label selection approach via a discriminator to improve the quality of the model and reduce the impact of noisy data. These approaches yield models that generalize better across tasks. In the experiments, we showed how to use depth data to improve segmentation prediction. Similarly, we can use segmentation data to help improve depth prediction. Our training strategy can also be easily applied to other tasks. Moving forward, we plan to further optimize the model design to make it hardware-friendly and more efficient. We aim to deploy it on the NVIDIA DRIVE platform by utilizing multiple devices to perform inference in order to achieve low latency and high throughput. 
To give an example, the model can be partitioned into an encoder part and decoder parts so that the relatively heavier encoder can run on the GPU, where we have more computational capacity, and the relatively lightweight decoders can run on the deep learning accelerators, DLA, to take advantage of their power efficiency. Those different devices can cooperate in a producer-consumer pattern, such that while the DLAs are processing the feature maps from the encoder, the GPU is free to process the next frame instead of idling. In this way, we can fully utilize the DRIVE platform and further reduce the inference time for multitask models. For our future work, we will have more follow-ups such as open source samples and webinars. So please stay tuned.
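[Editor's note] The producer-consumer pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not the speakers' implementation: the names run_pipeline and steady_state_fps are hypothetical, plain Python callables stand in for the GPU backbone and the DLA heads, and a bounded queue stands in for the event-based synchronization between the devices.

```python
import queue
import threading

def run_pipeline(frames, backbone, heads, depth=2):
    """Overlap a producer (stand-in for the GPU backbone) with a consumer
    (stand-in for the DLA heads) using a bounded queue.
    All names here are hypothetical, for illustration only."""
    feature_queue = queue.Queue(maxsize=depth)  # producer blocks if the consumer falls behind
    sentinel = object()                         # marks the end of the stream
    results = []

    def producer():
        for frame in frames:
            feature_queue.put(backbone(frame))  # "GPU" hands features to the "DLA" side
        feature_queue.put(sentinel)

    def consumer():
        while True:
            features = feature_queue.get()
            if features is sentinel:
                break
            # Every head consumes the same shared features.
            results.append({name: head(features) for name, head in heads.items()})

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def steady_state_fps(stage_latencies_ms):
    """Idealized pipelined throughput: one frame per slowest-stage latency.
    Ignores synchronization overhead and shared memory bandwidth contention."""
    return 1000.0 / max(stage_latencies_ms)

# Stand-in workloads; a real application would enqueue TensorRT engines instead.
demo = run_pipeline(
    frames=[1, 2, 3],
    backbone=lambda x: x * 10,
    heads={"segmentation": lambda f: f + 1, "depth": lambda f: f - 1},
)
```

With the per-stage latencies quoted earlier in the talk (28.136, 20.749 and 17.555 milliseconds), steady_state_fps gives about 35.5 frames per second as an idealized upper bound, which reflects the point that the backbone dominates the overall run time.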
Operator
operatorGreat. Thank you for that wonderful presentation. [Operator Instructions] Let's get started now, and we'll give our presenters a moment to look over the questions. Whenever you're ready, please feel free to begin.
Le An
executiveThanks, everyone, for attending our webinar. So here's the Q&A session. We did receive some very interesting questions that I would like to answer here. There's one question: "Can we alternate between GPU and DLA when running different parts of the network?" The answer is yes, it can be done, but there can be some overhead. For example, reformatting may be required when you pass tensors between GPU and DLA. So the suggestion is to try to use data formats commonly supported on both DLA and GPU to avoid reformatting as much as possible. There's another question: "What kinds of networks can run on DLA?" Basically, if you want to run the entire model on DLA, the DLA itself already supports many backbones such as ResNet, EfficientNet and MobileNet, and there are also some [indiscernible], like [indiscernible] SSD, and fully convolutional networks for segmentation, et cetera. Some models among them may benefit from very high utilization, such as ResNet18 and ResNet34. On the other hand, if you are using TensorRT for inference, you can basically run any network as long as it is supported by TensorRT. The unsupported layers will fall back to the GPU and only the eligible layers will run on DLA. Another question is, "Does the DLA support transformer models?" As of now, the transformer model is not fully supported by DLA due to some unsupported operators. Feature enhancements and performance improvements may come with future releases of the DLA software and TensorRT, so please stay tuned on that. The other question is about how to generate the calibration cache file. Basically, the calibration cache file can be generated with TensorRT's calibrator. To do that, you will need a calibration set in order to derive the scales in the calibration file. Typically, a few hundred images would be a good starting point for a calibration set. 
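[Editor's note] To illustrate what "deriving the scales" from a calibration set means, here is a minimal sketch of symmetric max-abs calibration in plain Python. It is an assumption-laden illustration only: TensorRT's own calibrators (for example its entropy and min-max calibrators) use their own algorithms and file format, and the function names below are hypothetical.

```python
def max_abs_scale(calibration_batches):
    """Derive one symmetric INT8 scale from the largest absolute activation
    observed over the calibration set (max-abs calibration; hypothetical helper,
    not the algorithm TensorRT's calibrators use)."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return amax / 127.0 if amax > 0 else 1.0  # avoid a zero scale for all-zero data

def quantize(values, scale):
    """Map real values to INT8 codes, clamping to the symmetric range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(codes, scale):
    """Map INT8 codes back to approximate real values."""
    return [c * scale for c in codes]

# Toy "activations" standing in for tensors collected while running calibration images.
scale = max_abs_scale([[1.0, -127.0], [64.0]])
codes = quantize([3.2, -200.0, 126.6], scale)  # values outside the calibrated range saturate
```

The sketch also shows why the calibration set matters: any activation larger than what was seen during calibration saturates at the edge of the INT8 range, so the set should be representative of real inputs.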
Another question is, "How do we evaluate the accuracy of DLA inference?" You can compare the inference results to those from the training framework, such as PyTorch. In the training framework, the forward pass is typically done in full precision, so you may observe some accuracy drop with lower precision inference in FP16 or INT8 on DLA. If this happens, you can try to find out where the gap starts, and then you may choose to use a higher precision for that layer. One more question is, "Is the latency on DLA [ different on this state ]?" Generally speaking, DLA inference has minimal variance from run to run, given the same hardware configuration and software platform. However, DLA and GPU both share the system DRAM, and while your network runs on a DLA there can be other workloads running on the GPU or on the other instances of DLA at the same time, so the inference may be bottlenecked by the run [indiscernible]. So those are all the questions I have. Now I'm passing it to my colleague, Yuchao, who may want to answer some other questions.
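[Editor's note] The accuracy check described in this answer, comparing low-precision outputs against the training framework and finding where the gap starts, can be sketched as follows. This assumes you have already dumped per-layer outputs from both sides in execution order; how to obtain those dumps is outside this sketch, and the function names and the tolerance are hypothetical.

```python
def max_abs_error(reference, candidate):
    """Largest element-wise absolute difference between two flat output lists."""
    return max(abs(r - c) for r, c in zip(reference, candidate))

def first_divergent_layer(reference_outputs, candidate_outputs, tolerance):
    """Return the name of the first layer (in execution order) whose low-precision
    output differs from the full-precision reference by more than `tolerance`,
    or None if every layer stays within tolerance.
    Both arguments map layer name -> flat list of floats (hypothetical format)."""
    for name, reference in reference_outputs.items():  # dicts keep insertion order
        if max_abs_error(reference, candidate_outputs[name]) > tolerance:
            return name
    return None

# Toy data: the full-precision reference versus a lower-precision run.
fp32 = {"conv1": [1.0, 2.0], "conv2": [0.5, 0.25], "head": [3.0]}
int8 = {"conv1": [1.0, 2.0], "conv2": [0.5, 0.75], "head": [9.0]}
suspect = first_divergent_layer(fp32, int8, tolerance=0.1)
```

The layer returned is a candidate for running at higher precision, as suggested in the answer; later layers usually diverge as well simply because they consume its output.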
Yuchao Jin
executiveOkay. Thank you, Le. Here are some interesting questions on my side. The first is, "How much development effort do we need to integrate [indiscernible]?" Basically, I would say it's pretty much similar to just using [indiscernible] kernels. So it's fairly easy, I believe. The only thing I would like to point out is memory management. It might require more attention to avoid reading some [ L&D ] buffer or maybe an illegal memory access. The other one is about how to better understand kernels at a low level. We mentioned Nsight Systems. Nsight Systems is mostly on the system side. It will help us capture the API calls and kernel calls as well as any information about them. So if someone wants to understand how a kernel works at a low level, I would recommend Nsight Compute. It's really a tool to help you understand how a kernel works at a low level. The other question is, "I think many people are interested in detection tasks." So yes, detection definitely also works with DLA and GPU. For example, with a 2-stage network, you may consider putting ROI pooling or ROI align on the GPU for efficiency. In that case, maybe the backbone can go on DLA. On the other hand, for a [ one-stage ] network like [ CenterPoint ], maybe we can consider putting the head on DLA and the backbone on GPU. It depends on the workload and your timing budget. Anyway, it can be very flexible. You can use DLA for either the encoder or the decoder. And if someone is specifically interested in YOLOv5, we do have one called cuDLA [indiscernible] on GitHub. It contains all [ end-to-end ] steps, such as [ on computations ], a C++ application and results like [ derivatization ], to bring YOLOv5 to the Orin platform. I think that's all the questions on my side. I don't know if there are any other questions?
Le An
executiveNo, I don't have any further questions.
Yuchao Jin
executiveOkay, cool. And I think that will be the last question for today's webinar, and thank you. Thanks for your attendance. Thank you so much.