NVIDIA Corporation (NVDA) Earnings Call Transcript & Summary

February 1, 2024

NASDAQ | US | Information Technology | Semiconductors and Semiconductor Equipment | Special | 38 min

Earnings Call Speaker Segments

Itay Ozery

executive
#1

Hi there. Welcome to this webinar on how to accelerate AI cloud computing. My name is Itay Ozery, and I run product marketing for networking at NVIDIA. Here are the main topics for today. Obviously, AI is garnering a lot of attention these days. I would like us to dive into the infrastructure side of things. We'll start off by covering the unique attributes of a generative AI cloud infrastructure. Then we'll talk about NVIDIA accelerated computing and look at BlueField DPUs and their role in AI cloud data centers. For the bulk of the time, I want us to cover the main BlueField capabilities and use cases in this new class of AI cloud infrastructure. Just before we start, please feel free to use the chat throughout the session to post questions and view answers. With that, let's jump in. AI, and specifically generative AI, is where all the buzz is. Generative AI is an emerging field of artificial intelligence that enables the creation of new data and content. The release of ChatGPT in late 2022 helped uncover some of the potential that AI has for business and society. ChatGPT soon became the fastest-growing app in Internet history. This iPhone moment of AI, as we like to think of it, has created a sense of urgency for companies to reimagine their products and business models. To give you an idea of the potential that generative AI has, Goldman Sachs projects that generative AI tools could drive a 7% increase, or almost $7 trillion, in global GDP over the next decade. With so many opportunities on the horizon, companies are racing to integrate AI into their products and operations. This is all good, of course, but it doesn't come without challenges, as you know. Many companies face challenges when it comes to integrating AI, and with generative AI, even more so. I've listed here some of the challenges that hinder AI adoption for businesses. First and foremost is the need to manage and operate complex computing infrastructure at scale.
It includes GPUs, networking and storage, and requires deep full-stack expertise in accelerated computing. The second challenge is security. AI data centers face security threats similar to, if not greater than, those facing general-purpose data centers. And as organizations adopt generative AI, they become exposed to a new range of security threats, which could lead to false results and much more than that. The third challenge is the need to meet strict data management requirements. Training large language models requires extensive and diverse data as the input and a robust infrastructure so that the GPUs can use this data effectively. But training LLMs on traditional storage technologies that cannot provide high throughput and low latency slows down the training process and can cause bottlenecks. So we looked at various challenges that hinder enterprises from adopting generative AI. Now let's look at how NVIDIA addresses those challenges. Generative AI is enabled through NVIDIA's accelerated computing. In fact, ChatGPT from OpenAI was developed on NVIDIA accelerated computing infrastructure. NVIDIA pioneered accelerated computing almost 20 years ago to solve problems beyond the capabilities of conventional computers. It's a full-stack, data center-scale platform that enables every business to become an AI business. NVIDIA's accelerated computing is designed to abstract the complexity away from its users and is built on GPUs, CPUs, DPUs and accelerated networking. Now let's take a closer look at NVIDIA DPUs and accelerated networking. NVIDIA BlueField-3 is an advanced infrastructure computing platform used to power generative AI cloud infrastructure. It has 400 gigabits per second of network bandwidth with RDMA and SDN acceleration capabilities and more. It provides programmable computing, including Arm cores for offloading applications from the CPU and 16 data path accelerator cores that provide additional programmability and acceleration on the data path.
We talked about the need to secure generative AI infrastructure. BlueField-3 features a range of zero trust security capabilities. It accelerates next-generation firewalls, micro-segmentation and more. It also enables composable storage capabilities like NVMe over Fabrics and NVMe/TCP acceleration. It ushers in the era of AI cloud computing, with leading cloud service providers adopting BlueField-3 for a variety of use cases. BlueField is a foundational component in the NVIDIA accelerated computing stack that underpins generative AI. Accelerated computing operates at data center scale. Generative AI applications run simultaneously on hundreds and even thousands of GPU nodes. This makes network connectivity a critical piece of any AI cloud infrastructure, but not all networks are created equal. AI workloads have unique characteristics when it comes to networking. They use high-bandwidth RDMA flows, require very low latency and are very bursty in nature. Cloud networks, on the other hand, are built for general-purpose applications and use TCP. They are oversubscribed by design to achieve economies of scale. This is why AI workloads require a dedicated network as the AI compute fabric. So an optimal AI cloud infrastructure would consist of at least 2 network fabrics. The first is a dedicated network for the AI compute fabric, also known as east-west. This is represented here in green. This network provides high-speed connectivity between GPUs that exchange data. The east-west network can be InfiniBand or Ethernet. The second is a general-purpose cloud network, known as north-south. This network is used for things like cloud provisioning, storage connectivity, user access and more. The north-south network is almost always an Ethernet network. This optimized networking model ensures peak AI performance using the east-west network, and cloud manageability and fast user access through the north-south network.
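To make the oversubscription point above concrete, here is a minimal Python sketch. The function name and all port counts are hypothetical and were not given in the webinar. The ratio uses the common definition of host-facing bandwidth divided by uplink bandwidth for a single switch, so a value above 1.0 means hosts can offer more traffic than the uplinks can carry.

```python
def oversubscription_ratio(down_ports: int, down_gbps: float,
                           up_ports: int, up_gbps: float) -> float:
    """Host-facing bandwidth divided by uplink bandwidth for one switch."""
    if up_ports <= 0 or up_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return (down_ports * down_gbps) / (up_ports * up_gbps)


# Hypothetical general-purpose cloud leaf: 48 x 25G to hosts, 6 x 100G uplinks.
cloud_leaf = oversubscription_ratio(48, 25, 6, 100)

# Hypothetical AI compute (east-west) leaf: 32 x 400G down, 32 x 400G up.
ai_leaf = oversubscription_ratio(32, 400, 32, 400)

print(f"cloud leaf {cloud_leaf:.1f}:1, AI leaf {ai_leaf:.1f}:1")
```

In this illustration the cloud leaf is oversubscribed 2:1, which is acceptable for bursty TCP traffic but not for sustained RDMA flows between GPUs, while the AI leaf is non-blocking at 1:1.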
Going back now to BlueField-3, we've recognized the unique attributes of AI cloud networking and created 2 product offerings: the BlueField-3 DPU and the BlueField-3 SuperNIC. Both the DPU and the SuperNIC have compute and network capabilities, but each product is optimized for a distinct set of workloads. On the left, we can see the BlueField-3 DPU, which is best at data center infrastructure processing. This includes things like software-defined networking, storage, encryption and more. On the right, we have the BlueField-3 SuperNIC, which is new and purpose-built for accelerating networks for AI compute workloads. The DPU is optimized for the north-south network in AI clouds. It has powerful computing so it can offload workloads from the CPU. The SuperNIC, however, is optimized for providing extremely fast network connectivity between GPUs, so it has powerful networking and a lean, power-efficient design. NVIDIA offers a wide range of accelerated computing systems. Here, we have the HGX H100 and OVX L40S, both optimized for large-scale AI training and inference. These systems are designed to meet the most demanding performance requirements of generative AI applications. They are also optimized for ensuring cloud manageability and security. To that end, we have integrated BlueField-3 DPUs and SuperNICs in every system. The HGX H100 typically runs 8 GPUs and has one BlueField DPU with a total of 8 BlueField SuperNICs. That is one SuperNIC for every GPU in the box. OVX L40S has 4 L40S GPUs and uses one BlueField DPU with 2 SuperNICs. Each SuperNIC serves 2 GPUs in a 1:2 ratio. Now if we take a closer look, we can see an illustration of how the DPUs and SuperNICs are integrated in these flagship systems. The DPU provides north-south network connectivity, while the SuperNIC provides east-west connectivity. Let's double-click now on the role of the BlueField DPU in accelerating AI cloud computing. AI cloud data centers need to be performant, multi-tenant, secure and elastic.
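The per-server ratios above lend themselves to a quick sizing calculation. The following Python sketch is illustrative only: the class and function names and the 128-server cluster size are hypothetical, and only the HGX H100 figures (8 GPUs, one SuperNIC per GPU, one DPU per server) come from the talk.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerProfile:
    name: str
    gpus: int
    gpus_per_supernic: int  # 1 means a 1:1 ratio, 2 means a 1:2 ratio
    dpus: int = 1           # one DPU per server for the north-south network

    @property
    def supernics(self) -> int:
        # Round up so that every GPU is covered by a SuperNIC.
        return math.ceil(self.gpus / self.gpus_per_supernic)


def cluster_totals(profile: ServerProfile, servers: int) -> dict:
    """Scale the per-server device counts to a cluster of identical servers."""
    return {
        "gpus": profile.gpus * servers,
        "supernics": profile.supernics * servers,
        "dpus": profile.dpus * servers,
    }


# HGX H100 as described in the talk: 8 GPUs, one SuperNIC per GPU, one DPU.
hgx = ServerProfile("HGX H100", gpus=8, gpus_per_supernic=1)

print(hgx.supernics)                     # SuperNICs in one server
print(cluster_totals(hgx, servers=128))  # hypothetical 128-server cluster
```

Changing `gpus_per_supernic` to 2 models the 1:2 ratio, where each SuperNIC serves two GPUs.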
Integrating a BlueField DPU into every AI server enables organizations to securely deploy and operate accelerated AI compute clouds. I've listed here the main use cases for BlueField in AI servers, and we'll dive now into each and every one. And it really starts with accelerated cloud networking. AI clouds require high-performance networking to meet the explosive user demand for generative AI. They also need to provide efficient data access during training operations. But we know that traditional software-defined networking solutions that run on the CPU alone cannot deliver the network performance and scale needed for modern AI clouds. In addition, many AI cloud data centers are designed as bare-metal clouds. This means that the cloud service provider is not able to deploy an SDN software stack on the AI compute nodes, which limits the functionality and performance of the infrastructure. We address that by using BlueField in every AI server. BlueField-3 DPUs provide up to 400-gigabit line speed and low-latency connectivity with zero CPU utilization. Multi-tenancy is a critical piece of any AI cloud environment. With BlueField, the SDN software stack runs on the DPU, as opposed to running on the host as in traditional cloud environments. Out of the box, BlueField supports 2 paths to create secure multi-tenant AI cloud networks. The first is SDN-based OVS/OVN acceleration. The second is an EVPN-VXLAN-based solution. While both paths provide VPC networking, each takes a different approach. SDN centralizes control in an SDN controller, while EVPN-VXLAN distributes control using a BGP-based control plane with MAC learning. The integration of BlueField in every AI server enables the creation of virtual private cloud networks, also known as VPCs, with strong tenant isolation. The diagram in the middle shows what the stack looks like. What's special about BlueField is that it provides software-defined, hardware-accelerated networking.
As you can see in the background, a traditional server runs the OVN/OVS stack on the host, while with BlueField, the software stack is completely offloaded and accelerated on the DPU and fully isolated from the host. Then looking at the diagram on the right, BlueField's VPC network acceleration can be enhanced with additional accelerators. Some notable examples include IPsec acceleration for encrypted network connectivity, RoCE acceleration for data storage connectivity and precision timing. So in summary, BlueField enables secure and robust VPC networking for AI clouds. It provides out-of-the-box support for multi-tenancy with either SDN or EVPN control planes. It is designed to deliver software-defined hardware acceleration with cloud network connectivity and full programmability and extensibility. Now let's shift gears a little and talk about how BlueField enables zero trust security in AI cloud environments. Earlier in this session, we talked about how organizations face challenges in securing their AI cloud environments. We know that modern AI cloud data centers are exposed to a wide range of cybersecurity attacks. I want us to look at some interesting data points on the cyber threat landscape. Insider threats are threats originating within an organization or a data center. Those threats are on the rise, with an average cost of $50 million per single incident. Another costly type of attack is the software supply chain attack. The cost of these attacks could exceed $46 billion this year and almost double by 2026. And the global average cost of a data breach is increasing this year, as it has over the past 17 years. So these types of attacks pose risks to organizations as they deploy and operationalize their AI cloud infrastructure. With BlueField, organizations can create zero trust AI data centers, which would help stop those attacks. Since I know that zero trust is an overused term, let me explain what we mean by that. As mentioned, BlueField has integrated compute capabilities.
It also has its own memory interface and runs an operating system. BlueField DPUs can run all kinds of security controls, fully isolated from the host CPUs. Think of a next-generation firewall or a micro-segmentation application. You can actually run it on BlueField, fully isolated from the CPU. This moves the security controls away from where tenants' workloads are running and onto a dedicated compute platform. On top of that, BlueField can operate in zero trust mode. This mode restricts any access from the host to BlueField. This type of zero trust architecture enhances the security posture of the cloud infrastructure because it helps isolate the data center and security control plane from tenant applications. A second use case for BlueField is how it distributes security to every node. Perimeter firewall solutions are no longer equipped to provide protection for modern data centers. Many AI cloud data centers deploy bare-metal cloud nodes to take full advantage of the infrastructure. In those types of environments, the cloud service provider is not able to install any software on the compute node because it is now in the hands of the tenant. The challenge with this cloud offering is that service providers cannot enforce security at the node level. Some CSPs enforce security policies on the top-of-rack switch, but that is very limited and is also operationally complex. With BlueField in every AI server, cloud service providers can now enforce a distributed, fine-grained security policy. Instead of using things like access lists in the switch to allow or block connectivity, BlueField can be programmed to implement those actions in hardware and at line speed, so not only is it more secure and advanced, it's also more performant. The third use case is around data security for AI environments. By integrating BlueField with the cloud orchestration system, BlueField can detect unauthorized access to the data store.
The upper diagram illustrates how BlueField enhances the data center security posture by blocking an unauthorized data request. On the left, User A successfully authenticates and gains access to the data store. On the right, BlueField is deployed in every host and is integrated with the data center scheduling systems. This makes it aware of the user workload placement. With this, BlueField can detect a request to authenticate as User B on node 2 as fraudulent and block the connection. The lower diagram shows how BlueField stores secrets and uses those to allow the host to securely access the file system, providing another level of protection. Summarizing this section, we've looked at 3 use cases for how BlueField DPUs create zero trust AI cloud data centers. BlueField enables a zero trust architecture by providing a dedicated compute platform to run the data center security control plane. It transforms bare-metal cloud security, enabling cloud builders to offer bare-metal service while also enforcing distributed, fine-grained security policies. Lastly, BlueField can enhance data security by identifying fraudulent data access requests. Next, we're going to look at AI cloud storage. Large language models are trained on vast amounts of data to learn patterns, relationships and language understanding. The availability of extensive and diverse training data is a key factor in achieving the impressive language capabilities demonstrated by these models. Let's look at some of the complexities that organizations face with data handling in AI data centers. Software-defined storage technologies, which offer great flexibility and agility, do not perform very well for workloads such as AI training and high-performance computing. These modern workloads require massive retrieval and storage of data, far beyond what SDS could deliver running over CPUs. Many AI environments have addressed this challenge by installing high-speed storage in every server.
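The placement-aware data access check from the security section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BlueField's actual mechanism: the names `DataRequest` and `is_request_allowed`, the node labels, and the placement table are all hypothetical, and the table stands in for what a data center scheduler would report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRequest:
    claimed_user: str  # identity the request authenticates as
    source_node: str   # node the request actually originates from


def is_request_allowed(request: DataRequest,
                       placement: dict[str, set[str]]) -> bool:
    """Allow a request only if the claimed user has a workload on the source node."""
    allowed_nodes = placement.get(request.claimed_user, set())
    return request.source_node in allowed_nodes


# Hypothetical scheduler view: which nodes each user's workloads run on.
placement = {"user_a": {"node1"}, "user_b": {"node3"}}

legit = DataRequest(claimed_user="user_a", source_node="node1")
spoofed = DataRequest(claimed_user="user_b", source_node="node2")

print(is_request_allowed(legit, placement))    # user_a is placed on node1
print(is_request_allowed(spoofed, placement))  # user_b has no workload on node2
```

The point of the design is that a credential alone is not enough: the request is also checked against where the scheduler placed the workload, so a stolen identity used from the wrong node is rejected.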
While local storage such as NVMe does offer performance advantages over software-defined storage, it is limited by the hardware platform capacity and cannot effectively scale. This limitation becomes more pronounced as organizations aim to process larger data sets and perform computationally intensive tasks. In addition, managing and maintaining local storage on every node in large-scale AI environments is a complex undertaking. With storage hardware distributed across hundreds of nodes, organizations face operational challenges in ensuring consistent performance, data integrity and system stability. These are some key data-related complexities for organizations as they attempt to run generative AI workloads at scale. BlueField offers several key functionalities that enhance data management for generative AI workloads. Powered by the BlueField SNAP technology, it offloads storage and networking tasks from the CPU, accelerating data movement and enabling intelligent storage functions. BlueField emulates network storage as if it were local storage by presenting an NVMe or virtio block device on the PCIe interface, allowing the host to perceive it as a standard block device without being aware of its network-based nature. With BlueField SNAP, AI workloads running on the host systems can efficiently fetch and store data at extremely fast rates. The BlueField SNAP technology combines advanced functionality from software-defined storage with performance surpassing that of local storage, offering the best of both worlds. Another interesting use case revolves around how BlueField SNAP enables efficient data operations. With BlueField SNAP, cloud operators can expand storage capacity for workloads within seconds, eliminating the need for time-consuming manual processes such as removing a node from the cluster and physically installing hardware. This enables seamless scalability of storage instances and optimized cloud operations, which offer great flexibility and performance.
Another notable advantage is the ability to boot compute nodes directly from a remote storage volume, eliminating the reliance on local storage. This feature provides enhanced flexibility for swapping images across multiple nodes while ensuring data integrity and consistency throughout the entire data center. With a BlueField DPU in every AI server, an AI cloud can seamlessly deploy a centralized storage infrastructure and serve it without compromising on application performance. Centralized storage offers numerous operational advantages. For instance, with data stored centrally, operational teams can easily manage and safeguard data, implement modern backup and recovery methods and monitor its usage over the network via the DPU. BlueField DPUs also provide a comprehensive suite of data security features, including encryption for both data in flight and data at rest, coupled with solid encryption key management and data integrity checks. Rather than having unfiltered access to the drives, hosts now access them through a restricted interface, enhancing overall security. This design safeguards [ further ] configuration and data. Encryption keys are managed by agents within secure DPUs, separating them from the host. NVIDIA BlueField, powered by the BlueField SNAP technology, revolutionizes data management for generative AI workloads. By offloading storage and networking tasks from the CPU, it accelerates data movement and introduces intelligent storage functions. BlueField emulates network storage as if it were local storage, providing the host systems with a standard block device while hiding its network-based nature. With the ability to efficiently fetch and store data at incredibly fast rates, BlueField combines the advanced functionality of software-defined storage with performance surpassing that of local storage.
Additionally, it offers seamless scalability, optimized data operations and centralized storage deployment, enabling data center operators to expand storage capacity and boot compute nodes directly from remote storage volumes. With robust data security capabilities, including encryption, key management and data integrity checks, BlueField ensures enhanced data protection and integrity for AI workloads while maintaining high levels of performance. So far, we looked at networking, security and storage as key workloads and use cases for integrating BlueField DPUs in AI cloud data centers. The fourth one is rather different. It centers around how modern cloud computing can take advantage of BlueField to make the GPU computing infrastructure scalable and elastic. Let's take a closer look. As organizations infuse AI into their business, they struggle with the operational aspects of managing the AI infrastructure. Accelerated computing is a full-stack, data center-scale challenge. From specialized hardware to system software to platforms and AI frameworks, every layer of the stack needs to be optimized. There is no one-size-fits-all here. It's very different from traditional general-purpose cloud computing, where enterprise applications can be easily adapted to serve different organizations. Another challenge faced by large organizations is the transient nature of AI workloads. In many cases, the path to monetization requires organizations to continuously train and improve their AI models. What's more, large organizations have different departments and project teams, each developing their own AI models and apps. This often results in over- or under-provisioning of IT resources, which has tremendous economic impact, especially if you consider the cost of those specialized resources. These are 3 challenges that make organizations struggle when they attempt to operationalize generative AI in their business. This is where BlueField can help.
BlueField accelerates time to market for generative AI, enabling organizations to operationalize AI much quicker. Let's understand this better. By deploying BlueField in server infrastructure, data center provisioning becomes significantly faster and more efficient. This means reduced bring-up times and smoother execution of workloads, enabling organizations to focus on their core objectives. Traditionally, server provisioning is network-based, using protocols such as iPXE. Deploying BlueField enables booting over block storage, which makes installation much faster and also error-free. This saves many hours at data center scale. Network provisioning entails configuring all the switches, including MLAGs, subnets, gateway addresses and more. This work gets radically simplified with BlueField, specifically with BlueField HBN. With HBN, the configuration can be the same everywhere, using unnumbered IPs with BGP, making the top-of-rack switch configuration much simpler and more streamlined. With rapid provisioning, BlueField helps operationalize AI clouds faster, which accelerates time to market with generative AI. Earlier, we discussed how generative AI workloads are transient. Organizations need flexible and robust infrastructure that can address changing needs rapidly. With BlueField, the AI data center becomes elastic, which means it can be repurposed and allocated to users very quickly. A large enterprise can have various AI initiatives. Users can be different teams or application owners, sometimes within the same organization. Each user may present different workload requirements for the infrastructure: the compute type (bare metal, virtual machines or Kubernetes containers), the operating system, and the scale in terms of the number of nodes, storage capacity and more. BlueField streamlines the reprovisioning process of the underlying infrastructure, including servers, network and storage resources.
As an example, let's look at network provisioning. In traditional environments, every data center server needs special config at the top-of-rack level. Now imagine a new user or workload appears. The IT team managing the infrastructure needs to redo the config, which is an extremely time-consuming effort at data center scale. With BlueField, no network config change is required. The new workload will launch seamlessly, saving weeks of work and accelerating time to market. To recap, modern AI data centers need to become elastic and embrace rapid change. BlueField enables elastic computing by transforming a traditional computing environment into a software-defined, hardware-accelerated data center, which again accelerates time to market for generative AI. The third piece is scalability. Organizations need scalable infrastructure so they can scale their workloads over time. This is especially true in generative AI, which is a very new field that is getting a lot of interest and investment. With BlueField, organizations can effectively scale every aspect of the infrastructure, from compute to storage and networking. For compute, BlueField is integrated across the NVIDIA AI systems and platforms. By adding or allocating more servers to an existing cluster or workload, organizations can add GPU compute capacity and take on the most demanding workloads. For storage, powered by BlueField SNAP, adding storage instances to an existing workload is extremely easy, and it can be accomplished at run time. In terms of networking, BlueField provides robust, low-latency network connectivity at speeds up to 400 gigabits per second. BlueField is being integrated into the NVIDIA reference architecture, which is a turnkey AI data center with world-class computing, software, tools, expertise and continuous innovation delivered seamlessly. Allow me to summarize here. BlueField accelerates time to market for generative AI by streamlining data center deployment and operation.
It speeds up data center provisioning and creates elastic, scalable and high-performance computing infrastructure for AI. This concludes our webinar on how to accelerate AI cloud computing with BlueField DPUs. We discussed the unique attributes of an AI cloud infrastructure and how BlueField DPUs are foundational to the NVIDIA accelerated computing platform. For the bulk of the time, we looked at various use cases for BlueField DPUs in deploying and operating NVIDIA AI clouds. Accelerated networking, zero trust security, storage composability and elastic computing are critical in every modern AI cloud infrastructure. I would like to invite you to join the NVIDIA virtual workshop on February 7 on how to power networks for AI clouds. This one will get more hands-on with some of the AI cloud network technologies and solutions from NVIDIA. Note that our team is still here and answering questions in the chat. With that, I would like to thank you for joining this webinar. And have a wonderful rest of your day.
