NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary
February 28, 2024
Earnings Call Speaker Segments
Unknown Executive
Hi, everyone. Thanks for joining us today for our webinar on Optimizing Multi-task Model Inference for Autonomous Vehicles. Before we begin, we want to cover a few housekeeping items. In the upper right corner, you should see a More Information button. If you click it, you'll see a few links expand, including a feedback survey. Please take a moment to give us your feedback, as it helps us tailor future webinars. [Operator Instructions] A copy of today's slide deck and additional help materials are also available in the resource list. We encourage you to download any resources or bookmark any of the websites that you find useful. Here are some tips to help make this event as good as it can be. To maximize the quality of the audio stream, please close any open applications aside from your browser window. Also, refreshing your browser can fix many issues. If your audio cuts out, is not loud enough, or you notice any lag, that's a good, simple, quick way to fix it. Now, without further ado, I'll turn the event over to our speakers to begin the presentation.
Le An
Hello, everyone. Welcome to our webinar on Optimizing Inference for Multi-task Models on the NVIDIA DRIVE platform. My name is Le An from NVIDIA, and I work on model development and inference optimization for automotive use cases. Today, my colleague Yuchao and I will talk about how to strategically deploy multi-task models on the NVIDIA DRIVE Orin platform by using different deep learning compute resources for efficient inference. We will first briefly go over the basic concept of multi-task models and some related work in autonomous driving. Before we dive into the details of deploying a multi-task model, we will give an overview of the NVIDIA DRIVE Orin platform and how we can leverage different compute resources, such as the GPU and the deep learning accelerator, DLA, for better efficiency. This will be illustrated with a sample network, and my colleague Yuchao will explain in detail the end-to-end workflow of model training, model quantization, model compilation and implementation of the inference application. We will show that the proposed deployment strategies can greatly improve latency and throughput for this kind of multi-task network. Conventionally, a deep learning model is designed for a single task, such as a ResNet for image classification. However, we are dealing with much more complex applications where many tasks need to be performed. Applying an individual model to each single task is no longer efficient or scalable. In some latency-critical applications, such as autonomous driving on embedded platforms, only limited compute resources are available, so having many single-task models may not even be feasible. Multi-task models, on the other hand, perform different tasks within a single network. In a typical multi-task model, a backbone is shared by different tasks, and different task heads are responsible for processing the shared features and producing task-specific outputs.
The tasks in multi-task models are also often closely related. From a learning perspective, many of the tasks can improve through mutually beneficial training signals, leading to better overall accuracy. Nowadays, multi-task learning is widely adopted in different domains such as computer vision, natural language processing and so on. In recent years, multi-task models have become popular in autonomous driving, especially for perception tasks. Perception tasks such as [indiscernible] detection and tracking, task estimation and [indiscernible] segmentation are highly correlated. Therefore, putting those tasks into a multi-task framework is beneficial for both training and deployment. Apart from perception, downstream tasks such as planning and control can also be integrated into a multi-task learning framework. One representative work is the multi-task attention network for end-to-end autonomous driving. This model is composed of a shared encoder backbone, one decoder for depth estimation, one decoder for semantic segmentation, one classifier for traffic lights, and a driving module that predicts steer, throttle and brake. The output is a control signal to drive the car. To improve the performance of multi-task models, our recent work proposed a so-called pretrain, adapt and fine-tune paradigm for general multi-task learning. In the adapt stage, learnable multi-scale adapters dynamically adjust the pre-trained model weights with supervision from multi-task objectives. This approach is validated in our use case for object detection, semantic segmentation and drivable area segmentation, and it shows very competitive results. In another recent work, a planning-oriented end-to-end model called UniAD is proposed. This model has 4 stages: backbone feature extraction, perception, prediction and planning. In the perception part, the inputs to the tracking and mapping modules are bird's-eye view features.
The output of perception is directly consumed by the prediction module, which in turn provides its output as input to the planning module to predict the future trajectory of the car. This is a very interesting work that combines all essential tasks into a single multi-task end-to-end model with state-of-the-art results on public datasets. So given the popularity of multi-task models in autonomous driving, in this talk, let's see how we can better deploy a multi-task network outside [indiscernible]. For demonstration, we construct a simple multi-task model. Specifically, our model takes a single image as input, which goes through a shared backbone encoder. There are 2 tasks on top of the backbone. One task head is a depth decoder for depth estimation, and the other is a segmentation decoder for semantic segmentation of the input image. Before we talk about how to efficiently [indiscernible] a multi-task model on NVIDIA DRIVE Orin, let's first go over the hardware configuration of NVIDIA DRIVE Orin. The NVIDIA DRIVE Orin SoC is an embedded supercomputing platform that can process data from camera, LIDAR and radar sensors in order to support applications such as autonomous driving, in-cabin functions and driver monitoring, as well as other safety features. The DRIVE Orin SoC is based on the Ampere architecture with third-generation tensor cores. For deep learning, it consists of a GPU with 16 streaming multiprocessors and 2 instances of the so-called deep learning accelerator. Both can be used for inference. In total, DRIVE Orin can deliver up to 254 int8 TOPS, and it can be scaled to support anything from Level 2+ systems all the way up to Level 5 fully autonomous vehicles. Now let's take a closer look at the deep learning accelerator on DRIVE Orin. DRIVE Orin features up to 2 second-generation DLAs. In terms of deep learning performance, the 2 DLAs can deliver up to 87 TOPS.
DLA is a fixed-function accelerator and supports many layers such as convolution, deconvolution, fully connected, different types of activation, pooling, normalization and so on. DLA is complementary to the GPU in the sense that, while the GPU delivers the most TOPS in high-power profiles, DLA is very good at power efficiency. DLA supports mixed precisions such as FP16 and int8. However, note that DLA is mainly designed for int8 inference, so we encourage users to run networks in int8 precision on DLA wherever possible. For more information on DLA, you can refer to the TensorRT developer guide, which has a dedicated chapter on working with DLA where you can find more details about DLA layer support, data format support and so on. NVIDIA has also open sourced a GitHub repository on DLA in which you can find more details, including how DLA works, performance benchmark numbers and some useful tools and scripts. Now, going back to our sample multi-task model. In this case, we have 2 task heads. We are going to deploy the depth head on one DLA and the segmentation head on the other DLA. The backbone, which is heavier and needs more compute than the heads, is deployed on the GPU. For accuracy in feature extraction, we run the backbone in FP16 precision. On DLA, we perform quantized inference in int8 precision. In this setup, we free the GPU from doing inference for the heads. This can be beneficial in 2 ways. Either the saved GPU resources can be repurposed for other tasks, or the inference can be pipelined such that, while the DLAs are processing the current frame, the GPU can already work on feature extraction for the next frame. In practice, we split our model into backbone and heads. This can be done with ONNX tools such as ONNX GraphSurgeon. Next, I will hand over to my colleague, Yuchao, and he will dive into the details of the implementation.
Yuchao Jin
Hi, I'm Yuchao Jin from NVIDIA. I'll present more detailed information about our implementation. For the encoder, which serves as a common feature extractor for both tasks, we use the Mix Transformer from SegFormer. This is a backbone inspired by the popular Vision Transformer and optimized for semantic segmentation. Compared to the Vision Transformer, this backbone has 4 stages and can produce multi-level feature maps like convolutional networks. Multi-scale features at different resolutions are critical for dense prediction tasks such as semantic segmentation and depth estimation, where objects and scenes vary in size. The Mix Transformer employs an overlapped patch merging process, which is used to preserve local continuity around patches. With a good choice of stride and padding for the overlapping patches, it can produce features with the same size as a non-overlapping process. Within each stage of this backbone, there are efficient self-attention modules. For automotive use cases, this is beneficial because the main computation bottleneck of vanilla transformer-based encoders is the self-attention layer, whose computational complexity is quadratic, O(N²), with respect to the input image size, while large image resolutions are frequently used in autonomous driving. The efficient self-attention modules implement the sequence reduction process, which was introduced in the Pyramid Vision Transformer, to reduce the total computational complexity by a constant factor. This factor is a hyperparameter set for Stage 1 to Stage 4 inside the encoder. Also note that in this backbone, we do not use positional encoding as typically seen in other vision transformers. This is also beneficial because the resolution of the positional encoding in a vision transformer is fixed. When the resolution of a test image is different from that of the training set, the positional encoding needs to be interpolated, and this often leads to a drop in accuracy.
Instead of using positional encoding, we use SegFormer's Mix-FFN module, which is empirically found to be sufficient to provide positional information for transformers. For the semantic segmentation decoder, we also adopted the design from SegFormer. This decoder consists only of multilayer perceptron layers, upsampling layers and activation layers. In our implementation, we further reduce the complexity by replacing the concat operator with add. This makes our segmentation head very lightweight and easy to deploy on DRIVE platforms. For the depth estimation head, we use a progressive decoder. This decoder receives outputs from all transformer blocks in the encoder and combines them in a progressive manner with a sequence of upsampling and convolution layers. The size of the final output depth map from the decoder is equal to the input size. For training, we adopt a semi-supervised learning strategy to make better use of raw data without ground truth labels from different datasets. We apply 2 strategies to create pseudo-labels for this data for semi-supervised learning. The first strategy is the so-called online process, in which we generate pseudo-labels during the training of the multi-task model. This process is lightweight and easy to integrate into existing workflows. The second strategy is an offline process in which we employ a larger teacher model to create high-quality pseudo-labels. To better leverage the generated pseudo-labels, we use a discriminator to selectively choose reliable pseudo-labels as ground truth for the training process. For more details on multi-task model training, you can refer to our previous GTC talk on this topic. As we know, lower precision leads to a smaller memory footprint and more math operations per second. On the Orin platform, the number of operations per second for int8 is twice that of FP16. This is because we can use the faster and cheaper int8 tensor cores.
This is especially useful for compute-intensive operators such as convolution or MLP. Lower memory usage leads to a lower memory bandwidth requirement and can result in faster inference. Also, embedded platforms usually have limited memory, so a data type like int8 can also help us save precious space. As mentioned in previous slides, if we want to make better use of the DLA hardware, we should quantize both heads to int8. Here, quantizing means casting a floating-point number into an 8-bit integer ranging from minus 128 to 127. To achieve this, we need to normalize and rescale the floating-point value with a specific scale. In this work, we follow the post-training quantization procedure. In general, the workflow is to collect a set of inputs as calibration data, feed them into the pretrained network, collect statistical information for each layer and come up with a scale for each of them, just like the chart at the bottom of this slide. We can use APIs from TensorRT to obtain the calibration cache files. TensorRT already provides some choices, such as min/max or entropy-based calibration. You may try each of them and take the one with the least accuracy drop. The scales for activations and weights are saved and used in the subsequent inference stage, so we don't have to recalibrate every time. Alternatively, quantization-aware training can also be applied. It is a training technique to boost a quantized network's performance. Basically, it makes the training procedure aware of the error introduced by quantization. Generally, this additional step makes the network more robust to quantization and leads to a smaller accuracy drop when we quantize. This is a whole area that is too large to be covered in today's webinar. In short, if your model suffers from a performance drop after trying different calibration methods, you may consider using quantization-aware training techniques, and they might solve your problem.
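The post-training quantization arithmetic described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming symmetric quantization with max calibration. The function names (`calibrate_max_scale`, `quantize`, `dequantize`) and the sample values are made up for this example; they are not TensorRT APIs, and TensorRT's own calibrators are more elaborate.

```python
def calibrate_max_scale(calibration_values):
    """Max calibration: map the largest observed magnitude onto 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(x, scale):
    """Rescale, round to the nearest integer, then clamp to the int8 range."""
    q = round(x / scale)
    return max(-128, min(127, q))


def dequantize(q, scale):
    """Map an int8 code back to an approximate floating-point value."""
    return q * scale


# Hypothetical activations collected while running calibration inputs.
calibration_values = [-2.0, -0.7, 0.1, 0.5, 1.3]
scale = calibrate_max_scale(calibration_values)

for x in [-2.0, 0.5, 10.0]:
    q = quantize(x, scale)
    print(f"{x:6.2f} -> {q:4d} -> {dequantize(q, scale):.4f}")
```

Values inside the calibrated range come back with an error of at most half a scale step, while a value far outside it (10.0 here) saturates at 127. That saturation is the kind of error a poorly chosen scale introduces.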
You can refer to the second link in the footnote for a more detailed explanation of quantization-aware training. To fully use the power of the target board, we should build the engine on it. In this case, TensorRT is able to profile kernels at run time and choose the tactic with the lowest run time. Only with actual [indiscernible], we will be able to collect the most precise runtime results. NVIDIA already provides a tool, trtexec, for building and evaluating models on the Orin platform. To build a GPU engine from an ONNX model file, you can run trtexec with these flags. As you may notice, we set the output format to FP16 CHW32. Here, FP16 means the output tensor has data type FP16. This is because the backbone is already running in FP16, and we can directly use the output feature in the same precision. This not only saves some memory traffic, but also eliminates the data conversion at the end of the inference. Otherwise, TensorRT will insert an additional data type conversion from FP16 to the default precision, which is FP32. The output layout will be CHW32, a 32-wide channel-vectorized row-major format. This is a specially designed layout that takes memory access coherence into account. We choose this memory layout to avoid memory reorganization, since we will pass the output feature to DLA with the same layout. To build a DLA loadable, you can add the useDLACore and buildDLAStandalone flags to the command line. A loadable is different from the engine we mentioned above. A loadable is designed to run outside of TensorRT. That's why we call it standalone. We can use cuDLA to load it and do the inference. For more details, you can refer to the link in the footnote. As we mentioned before, DLA is mainly designed for and focused on int8 inference, so we changed the precision flag from FP16 to int8. We also specified the input and output formats for better performance.
Different input and output memory layouts lead to different tactic choices, so choose the layout carefully to minimize latency. trtexec first loads and parses the ONNX model we provide. Then the parsed model is sent to the DLA compiler. The compiler finally produces a loadable, which is stored in a binary file. Please be aware that TensorRT cannot load and process a DLA loadable file directly. After we build a DLA loadable with trtexec, we would like to load it and do the inference in our application. We can do so through cuDLA. cuDLA is an extension of CUDA. It helps us use DLA just like how we interact with the GPU. cuDLA wraps low-level operations with NvMedia and the DLA driver, and it also natively supports CUDA streams and CUDA events. We can initialize and submit a DLA task to a CUDA stream, just like what we do with CUDA kernels or TensorRT engines, even though GPU and DLA are totally different hardware. cuDLA also provides similar semantics for error handling. You can use cudlaGetLastError just as we use cudaGetLastError. With all the advantages above, cuDLA is almost the only choice for rapid prototyping. For more information, we attach several useful links in the footnote. To be precise, we break the whole network into 3 parts: backbone, segmentation head and depth estimation head. We assign the backbone to the GPU and the 2 heads to one DLA each. Since the two heads rely on the backbone's output feature, we must schedule the DLA tasks after the GPU finishes its work. Otherwise, the DLAs can't get the correct feature map. So here is a high-level design of how we schedule the pipeline. When frame T arrives, the GPU consumes frame T while the DLA cores are still working on the feature map from the previous frame, T minus 1. After the GPU finishes frame T, it triggers a signal for the DLAs, indicating that the GPU has finished its work. Then the GPU can start working on the next frame, T+1.
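The schedule just described can be sketched as a small timing model in Python. This is a simplified illustration, not the speakers' application: the function name `simulate_pipeline` and the stage times are hypothetical, the two heads are folded into a single DLA stage, and buffering between the stages is assumed to be unlimited.

```python
def simulate_pipeline(num_frames, gpu_ms, dla_ms):
    """Return the completion time (ms) of each frame in a GPU -> DLA pipeline."""
    gpu_free_at = 0.0   # when the GPU can start its next backbone pass
    dla_free_at = 0.0   # when the DLA can start its next head pass
    completion_times = []
    for _ in range(num_frames):
        gpu_done = gpu_free_at + gpu_ms
        gpu_free_at = gpu_done                  # GPU moves on to frame T+1 right away
        dla_start = max(gpu_done, dla_free_at)  # DLA waits for the GPU's signal
        dla_free_at = dla_start + dla_ms
        completion_times.append(dla_free_at)
    return completion_times


# Hypothetical stage times, chosen only to make the arithmetic easy to follow.
times = simulate_pipeline(num_frames=4, gpu_ms=30.0, dla_ms=20.0)
intervals = [later - earlier for earlier, later in zip(times, times[1:])]
print(times)      # completion time of each frame
print(intervals)  # gap between consecutive frames in steady state
```

With these made-up numbers a frame completes every 30 ms, the duration of the slower stage, rather than every 50 ms as it would if the two stages ran back to back on every frame.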
This forms a 2-stage pipeline that pursues better hardware utilization and better throughput. The overall throughput is determined by the longest stage, in our case the backbone. For different backbones and different tasks, the split strategy might be different. Developers should consider multiple factors. As the overall run time is determined by the slowest part, we should consider the run time of each part individually. Also, DLA is designed for power-efficient inference, which means that in some specific cases, such as custom operators, it is easier to run them on the GPU rather than on DLA. For some tasks, we may want to keep the network running in FP16 for better results, which means this network may not be suitable for DLA either. Memory copy and reformatting between sub-networks can also be a problem. This overhead might be large if the feature map is huge. With all these factors as guidelines, we can decide the task assignment. A good separation plan should introduce minimal memory and reformatting overhead while maintaining a good load balance between tasks, so that none of the hardware sits idle and wastes time. Allocating memory for DLA is just the same as for TensorRT. All you have to do is claim a chunk of GPU memory with CUDA and register it to DLA with cudlaMemRegister. In this way, DLA will recognize and consume the data from the given address. In the slide, you can see 2 pointers. The GPU pointer is the start address of the GPU memory you just allocated. The DLA pointer is the address that DLA uses internally, as GPU and DLA have separate memory mapping tables in hardware. From the user's point of view, you can consider the GPU pointer as the input buffer. You only need to interact with the GPU buffer, and cuDLA will handle the underlying data mapping and transfer from memory to the DLA internal buffer. For TensorRT, we can either allocate GPU memory with CUDA manually or reuse existing memory.
Then users can assign the pointer as a buffer for each input or output tensor with the setTensorAddress API. Buffer allocation and management can be important. A careful design such as zero-copy can save both memory and time. In our case, both heads running on DLA use the same input feature map. So we assign one feature map buffer pointer to both DLAs, and the 2 DLAs share the same feature map from the backbone. Essentially, we save the memory for duplication, and we only need to convert the feature map to int8 once. To submit a task with TensorRT, assuming we have already initialized the TensorRT context, we can submit a task to a specific stream with enqueueV3, just one line, and everything is handled by TensorRT. For cuDLA, we can set up the cudlaTask structure and use cudlaSubmitTask to launch it, just like with TensorRT. In this structure, the module handle points to the loadable DLA engine we just initialized and loaded. The input tensor and output tensor fields are the pointers we acquired from cudlaMemRegister. Wait events and signal events are for standalone mode with the DLA driver. Since we use cuDLA to interact with the DLA, we can simply ignore these 2 fields. In our application, these parameters stay the same the whole time. Do remember to provide a custom stream to these function calls. Otherwise, they will be submitted to the default stream instead and may cause unexpected behavior. For example, some synchronous API calls may block other kernels from running and may break the order we want. To recap our design, the backbone runs in FP16 and the heads run in int8. Since DLA is running standalone and the input data type is already set to int8, we have to manually quantize the output feature before DLA can directly consume it. Before submitting the 2 DLA tasks, we convert the feature map from floating-point values into int8 manually with a given kernel.
This kernel is naively implemented and acts as a simple sample to show how to do quantization ourselves. The first step is to rescale the input floating-point value from an arbitrary range to a limited range. Then we clip it to the int8 range and round the floating-point value to an integer. This is an element-wise operation, so the indexing in this kernel is fairly simple. As we can see, each thread writes back only a single 8-bit number. So it is possible to optimize it with vectorization: say, each thread handles four numbers, and the write-back can be finished with one single 32-bit transaction. To communicate and set the order between GPU and DLAs, we can use CUDA streams and CUDA events. In CUDA, API calls can be either synchronous or asynchronous. For example, cudaMemcpy is a synchronous API. When this API returns, the memory copy has been submitted and finished. By contrast, cudaMemcpyAsync is the asynchronous counterpart of cudaMemcpy. If we call this API, it returns immediately. At this point, the copy task has only been submitted to the stream. The actual copy may not have finished, or may not even have started. It depends on the state of the stream. So now we step back a little bit. What is a CUDA stream? A CUDA stream is a special structure that describes a queue of device jobs. The host pushes jobs into the queue and returns immediately for its next actions. At the same time, the GPU tries to acquire and schedule work from streams when the hardware is free. Operations in the same stream are launched in first in, first out order. This order is guaranteed by the CUDA driver. But remember, by design, we want to keep GPU and DLA busy at the same time. If we want to parallelize our tasks, we must put them in different streams. Now the problem becomes how we can maintain order across different streams. To achieve this, we can use CUDA events. A CUDA event can be considered a signal. With a CUDA event, you can either put it into a stream with cudaEventRecord or wait on a specific event with cudaStreamWaitEvent.
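The stream and event semantics described above can be mimicked on the host with ordinary threads. The following Python sketch is only an analogy, not CUDA code: the `ToyStream` class is hypothetical, each toy stream is a FIFO queue drained by one worker thread, and a `threading.Event` stands in for a CUDA event.

```python
import queue
import threading


class ToyStream:
    """Toy stand-in for a CUDA stream: a FIFO queue drained by one worker thread."""

    def __init__(self):
        self._jobs = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            job = self._jobs.get()
            if job is None:      # sentinel: no more work
                break
            job()                # jobs in one stream run strictly in FIFO order

    def enqueue(self, job):
        self._jobs.put(job)      # returns immediately, like an asynchronous API call

    def close(self):
        self._jobs.put(None)
        self._worker.join()


log = []
backbone_done = threading.Event()
gpu_stream, dla_stream = ToyStream(), ToyStream()

# Host side: each call only queues work and returns right away.
gpu_stream.enqueue(lambda: log.append("gpu backbone"))  # backbone inference
gpu_stream.enqueue(backbone_done.set)                   # plays the role of an event record
dla_stream.enqueue(backbone_done.wait)                  # plays the role of a stream wait
dla_stream.enqueue(lambda: log.append("dla heads"))     # DLA task submission

gpu_stream.close()
dla_stream.close()
print(log)
```

The two toy streams run concurrently, yet the DLA job cannot start until the event is set, which only happens after the backbone job has finished. This mirrors the ordering guarantee the talk relies on.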
In our specific case, assuming we have a GPU stream and a DLA stream, from the CPU's perspective the actions are: enqueue the backbone to the GPU stream, record an event in the GPU stream, let the DLA stream wait on the event and finally submit the DLA task to the DLA stream. The GPU starts the inference right after we enqueue it and keeps working on it. After a while, when the GPU finishes the task, it executes the next item in the stream, which is a cudaEventRecord. At this point, the event's status changes to occurred. From the DLA's perspective, we call cudaStreamWaitEvent, which checks whether the CUDA event is in the occurred status. If not, the DLA just waits. After a while, when the GPU finally finishes the job and changes the status of this event, the DLA is finally able to move on to the next item in the stream, which is cudlaSubmitTask. In this way, we ensure that the DLA task always comes after the GPU finishes its job. Now let's take a look at some inference results. Here are four visualizations of the segmentation and depth estimation results from both heads with our inference application. As we mentioned before, the backbone inference runs in FP16 precision and the inference for both heads runs in int8 precision. These images were randomly chosen from the Cityscapes dataset. The left image is the input, segmentation is in the middle, and depth estimation is on the right. From a quality point of view, you can see that the segmentation results match what we see in the input images. For example, please see the image in the top right. Pedestrians and motorcycles are categorized correctly with quantized inference on Orin. From our internal study, we also noticed that there is negligible accuracy loss for the segmentation task with int8 precision. Generally speaking, classification tasks are less prone to quantization error, and thus they are great candidates for deployment in int8 precision.
Intuitively speaking, this is because classification tasks focus more on the relationship between classes. As long as our target class has the largest confidence score, the result can be considered correct even though the actual numbers in the output are totally different from those with FP32. For depth estimation, the inference with int8 precision still shows good results. As you can see, objects from near to far are outlined decently. However, there can be quantization error in the output, represented by some blocking artifacts in some fine details. One way to improve this is to apply mixed precision. Specifically, we can choose some of the layers to run in FP16 precision on DLA. This may impact latency but may help greatly with accuracy. Mixed precision for inference is a case-by-case choice, and it is out of scope for this webinar. We plan to investigate it further in the future. Here are the benchmark results acquired on the Orin-X platform with Drive OS 6.0.9.0, CUDA 11.4 and TensorRT 8.6.12.4. Please be aware that, except for the numbers in row 2, all the other numbers are measured and reported with the trtexec command line tool. Row 1 is a model that has all the parts of the whole network in FP16. All the workload runs on the GPU, the inference time is 38.93 milliseconds and the throughput is 25.692 frames per second. We consider this our baseline result. Row 2 is the pipelined application for the whole network. That means the FP16 backbone on the GPU and one int8 head on each DLA, as we designed in previous slides. In this setting, the inference time is 29.237 milliseconds and the throughput is 34.303 frames per second. Compared with row 1, the improvement is significant, about 33%. Rows 3, 4 and 5 are the profiling results for each part separately. Row 3 is for the backbone only in FP16, and rows 4 and 5 are for each head in int8 on DLA. The backbone costs 28.136 milliseconds for each frame. This is slightly lower than the latency in row 2.
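The figures quoted for rows 1 to 3 can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the talk; the variable names are made up, and the throughput ceiling assumes that the standalone backbone time is the limiting stage of the pipeline.

```python
# Figures quoted in the talk, as measured by the speakers on DRIVE Orin.
baseline_ms, baseline_fps = 38.93, 25.692    # row 1: whole network, FP16, GPU only
pipeline_ms, pipeline_fps = 29.237, 34.303   # row 2: FP16 backbone on GPU, int8 heads on DLA
backbone_ms = 28.136                         # row 3: backbone alone, FP16, GPU

throughput_gain = pipeline_fps / baseline_fps - 1.0
latency_ratio = baseline_ms / pipeline_ms
# If the backbone is the slowest stage, the pipeline cannot deliver more
# than one frame per backbone pass.
fps_ceiling = 1000.0 / backbone_ms

print(f"throughput gain: {throughput_gain:.1%}")
print(f"latency ratio:   {latency_ratio:.2f}x")
print(f"backbone-bound throughput ceiling: {fps_ceiling:.2f} fps")
```

The throughput gain works out to roughly a third, in line with the 33% quoted, and the measured 34.303 frames per second sits just under the ceiling implied by the backbone time, which is consistent with the backbone being the limiting stage.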
The backbone alone is slightly faster than the pipelined run in row 2 for two reasons. The first reason is that trtexec does not require the preprocessing and the explicit quantization step that we introduced between GPU and DLA. The second reason is that, although DLA is separate hardware, DLA and GPU share memory bandwidth. So when DLA and GPU are working at the same time, the memory traffic is under higher pressure, which leads to a minor runtime increase. The segmentation and depth estimation heads take 4 feature maps at different scales, and their latency is 20.749 milliseconds and 17.555 milliseconds, respectively. Both of them run faster than the backbone. So overall, the full run time is dominated by the backbone run time. In this slide, we are showing the profiling results. This result is captured with Nsight Systems, a profiling tool provided by NVIDIA. It can be used on multiple platforms and generates a report for every corner of the system. This report can be visualized with the Nsight Systems graphical user interface on other platforms, so you don't have to plug a monitor into Orin to see the result. You can also list the statistical report from the command line without the GUI. Developers can use it to understand how much time each kernel or CUDA API call takes, as well as hardware utilization information. Each colored rectangle in the screenshot represents a specific event. It can be either a specific CUDA kernel or a CUDA event call. For example, in this screenshot, the blue blocks are CUDA kernel calls, such as convolution, batch normalization and others. The orange blocks are TensorRT markers. They are a good indicator of how long each frame takes in total for a specific TensorRT engine. The tiny yellow blocks under the TensorRT blocks are TensorRT layer blocks, and the cyan blocks are memory events such as memset or memcpy. The longer the rectangle is, the more time it consumes. So with this tool and this report, we can easily locate the bottleneck of our system. Developers can also add custom ranges or events with NVTX APIs.
This means you can mark your ranges even on the CPU. Nsight Systems is a useful tool for performance analysis. If you want to dive into a specific kernel, especially when you see that 1 or 2 kernels are very slow and contribute most of the run time, you may want to try Nsight Compute for detailed kernel-level information. Nsight Compute helps you collect more low-level information, such as the number of bank conflicts and the memory traffic between L1, L2, shared memory and global memory. You can also see instruction stall status. It also generates a roofline chart. With all this information, developers can decide on potential optimization directions. Let's go back to our screenshot. This screenshot is from profiling the whole network on the GPU only with FP16. Events and ranges are categorized on the left. In this picture, you can see DRAM traffic at the top, GPU utilization in the middle and CUDA kernel calls at the bottom, as well as TensorRT inference time. But there's no DLA. No row is for DLA, which means DLA is idle with these settings. Please remember that, when profiling on the Orin platform, in order to see all the information we have here, you need to tell Nsight Systems to capture all sections by adding specific flags. You can do so by adding the trace flag (-t) with CUDA, [ Kumba ], cuDLA, NVTX, Tegra accelerators and NvMedia. Also, if you want to see memory traffic profiling, you can enable it by setting SoC metrics to true on the Orin platform. This slide is different from the previous one and contains the profiling results of our pipelined application on Orin. Compared with the previous slide, we focus on the extra rows in SoC metrics, where, now that we enable DLA, we can see DLA utilization. The GPU is running at the same time. Remember that all heads are running on DLA and each head depends on the GPU output feature. From the profiling results, we can see that DLA is waiting for the GPU. You can see these blank spaces.
It is because DLA is waiting on the CUDA event we specified. Only after GPU finishes its work will DLA start on the following tasks. Also note the row for DLA DRAM bandwidth. Compared with the previous screenshot, there is some memory traffic in this row. This is the extra memory traffic toward DLA, and it is aligned with our benchmark numbers. Since we offload some of the workload to DLA, we achieved around 33% overall throughput improvement without any adjustment, pruning or retraining of the network. We have just talked about our training methods and experiments. To recap, we have developed a novel approach to multi-task training that involves a combination of strategies to enhance the overall performance of the model. We utilize both online and offline label generation processes to train the model, and we can use them individually or jointly, depending on the training budget. In addition, we use a pseudo-label selection approach via a discriminator to improve the quality of the model and reduce the impact of noisy data. These result in improved models with better generalizability across tasks. In the experiments, we showed how to use depth data to improve segmentation prediction. Similarly, we can use segmentation data to help improve depth prediction. Our training strategy can also be easily applied to other tasks. Moving forward, we plan to further optimize the model design to make it hardware friendly and more efficient. We aim to deploy it on the NVIDIA DRIVE platform by utilizing multiple devices to perform inference in order to achieve low latency and high throughput. To give an example, the model can be partitioned into an encoder part and decoder parts, so that the relatively heavier encoder can run on GPU, where we have more computational capacity, and the relatively lightweight decoders can run on the deep learning accelerators (DLAs) to take advantage of their power efficiency. 
Those different devices can cooperate in a producer-consumer pattern, such that when the DLAs are processing the feature maps from the encoder, the GPU is free to process the next frame instead of idling. In this way, we can make full use of the DRIVE platform and further reduce the inference time for multi-task models. For our future work, we will have more follow-ups such as open source samples and webinars. So please stay tuned.
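As a rough illustration of the producer-consumer cooperation described above, here is a minimal Python sketch. It is an assumption-laden toy, not the speakers' implementation: plain threads and a bounded queue stand in for the GPU and DLA, and `run_pipeline`, `toy_backbone` and `toy_heads` are hypothetical names, not TensorRT or DRIVE APIs.

```python
import queue
import threading


def run_pipeline(num_frames, backbone, heads, max_in_flight=2):
    """Run a backbone stage (producer) and a heads stage (consumer) concurrently."""
    # Bounded queue: limits how many feature buffers are in flight at once.
    features = queue.Queue(maxsize=max_in_flight)
    outputs = []

    def producer():
        # Stand-in for the GPU stage: emits one set of features per frame.
        for frame_id in range(num_frames):
            features.put((frame_id, backbone(frame_id)))
        features.put(None)  # sentinel: no more frames

    def consumer():
        # Stand-in for the DLA stage: blocks until features are available.
        while True:
            item = features.get()
            if item is None:
                break
            frame_id, feat = item
            outputs.append((frame_id, heads(feat)))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outputs


# Toy stand-ins for the real engines (hypothetical, for illustration only).
def toy_backbone(frame_id):
    return [frame_id, frame_id + 1]


def toy_heads(feat):
    return {"seg": sum(feat), "depth": max(feat)}


results = run_pipeline(4, toy_backbone, toy_heads)
print(results)
```

While the consumer works on frame N, the producer is already free to start frame N+1, which is the overlap the talk credits for the throughput gain. The bounded queue keeps the producer from running arbitrarily far ahead of the consumer.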
Unknown Executive
executive[Operator Instructions]
Le An
executiveHi, everyone. Thanks for attending our webinar. There are a couple of questions that I found very interesting, so I would like to answer them verbally. There's one question about how the inference speed on DLA compares to that of the GPU. Okay. Generally speaking, the latency for a network on one DLA will be higher compared to GPU, because DLA has lower theoretical math throughput in terms of TOPS than the GPU. However, as we showed, you can reduce your overall latency by distributing the deep learning workload and non-deep learning workload across DLA and GPU. DLA is especially suitable for applications where there are requirements on consistency of workload latency and throughput. Yes. There's another question about what networks are recommended to run on DLA. Basically, if you are using TensorRT for inference, you can run any network as long as the network is supported by TensorRT. You can choose to allocate to the layers on a [indiscernible]. If you want to run the entire model on DLA, DLA currently supports many backbones, such as ResNet, EfficientNet and MobileNet, or application-specific networks such as YOLO and SSD for object detection and [ SDA ] for semantic segmentation. There are some models which can be very efficient on DLA, for example ResNet-18, ResNet-34 and ResNet-50, so they will have very high utilization. And networks in the mobile category, such as MobileNet, will also have very low latency on DLA. So you can choose to deploy whatever networks that can be run on DLA to [indiscernible]. Okay. So now I will pass to my colleague, Yuchao, and he will also answer some questions from his end.
Yuchao Jin
executiveI've also seen some interesting questions. One of them is whether the current MTM procedure can run detection tasks. The answer is definitely yes. For example, with a two-stage network, some of them use ROI pooling or align, so I would like to consider putting that on GPU first [indiscernible]. In that case, we may want to execute the backbone on DLA and the head on GPU, and we are all set. So it is possible. And for a one-stage network, like CenterPoint, we may also consider splitting it similarly, so the backbone can be on DLA and the head on GPU. It depends on the workload and the budget for computation cost. This placement can be very flexible. Another question is how much effort we may need. It will be fairly low effort; it's just like chaining kernels with CUDA streams and CUDA events. The only thing I think we should be cautious about is memory management. We should pay attention to it to avoid reading from ready buffer or [indiscernible] Let's see if there are more questions.
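The memory-management caution above can be sketched with a small double-buffering example. This is only an analogy under stated assumptions: Python `threading.Event` objects stand in for CUDA events, two threads stand in for the GPU and DLA, and all names (`buffers`, `ready`, `free`, `producer`, `consumer`) are hypothetical rather than taken from the speakers' application.

```python
import threading

NUM_BUFFERS = 2
buffers = [None] * NUM_BUFFERS
# ready[i] is set when buffer i holds fresh data; free[i] when it may be overwritten.
ready = [threading.Event() for _ in range(NUM_BUFFERS)]
free = [threading.Event() for _ in range(NUM_BUFFERS)]
for event in free:
    event.set()  # all buffers start out writable


def producer(num_frames):
    for frame in range(num_frames):
        i = frame % NUM_BUFFERS
        free[i].wait()           # do not overwrite data the consumer still needs
        free[i].clear()
        buffers[i] = frame * 10  # stand-in for the backbone output
        ready[i].set()           # analogous to recording an event after the write


def consumer(num_frames, out):
    for frame in range(num_frames):
        i = frame % NUM_BUFFERS
        ready[i].wait()          # analogous to waiting on that event before reading
        ready[i].clear()
        out.append(buffers[i])
        free[i].set()            # hand the buffer back to the producer


consumed = []
num_frames = 6
producer_thread = threading.Thread(target=producer, args=(num_frames,))
consumer_thread = threading.Thread(target=consumer, args=(num_frames, consumed))
producer_thread.start()
consumer_thread.start()
producer_thread.join()
consumer_thread.join()
print(consumed)
```

Each buffer is guarded in both directions: the consumer never reads before the write has been signaled, and the producer never overwrites a buffer that has not been consumed. With two buffers, the producer can fill one while the consumer reads the other.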
Unknown Executive
executiveAll right, everyone. Thank you for joining our webinar. We appreciate you joining. We have a couple of questions still in the queue, so we'll try to answer some of those via chat as well. You will get an e-mail with a replay -- link to a replay of this, and you will also be able to download the slides through that e-mail. We hope that you'll join us at GTC this upcoming March. You can learn more about it at nvidia.com/gtc, and we hope you can join us for future webinars. Thank you.