Intel Corporation (INTC) Earnings Call Transcript & Summary
April 19, 2023
Earnings Call Speaker Segments
David Shaw
executive
Welcome to the Tech Decoded Software Developer Technical Webinar Series. Thanks for tuning in to today's episode, Deploying Optimized Cloud-Native AI Workloads on AWS. I'd like to introduce today's speaker, Eduardo Alvarez. Eduardo is a senior AI Solutions Engineer at Intel, specializing in applied deep learning and AI solution design. He will be joined today by Kelli Belcher, an AI Solutions Engineer and developer evangelist for the oneAPI AI Analytics Toolkit. My name is David Shaw, and I'll be your host today. A quick note about our new platform. You can access the biographies of our speakers at the top of the left-hand column. The following section contains downloadable resources, including a copy of today's presentation. Across the bottom of the screen is a series of icons that allow you to access the slides, Q&A, survey and closed captioning. You can also resize and move the separate windows to best fit your view. Hover over the video window to access closed captioning. If you have any issues with the presentation, sound, slides or anything else, note it in the Q&A window, and our producer will help you. You can post questions at any time in the chat window. Our Q&A moderator will respond in real time, and we'll answer questions live at the end of the workshop presentation, time permitting. And with that, Eduardo, I'll hand it off to you.
Eduardo Alvarez
executive
All right. Hello, everybody. Hope you guys can hear me okay. I'll wait till the slides go up. I think we're ready to go, so let's get started. Thank you, everybody, for joining us today. I'm very excited to give this talk, and I appreciate all the work from the webinar team in setting up this great event. I want to start off by setting the expectations for this talk. The first thing I want to say is that this is not a talk about how to optimize Kubernetes, but realistically about how you can partner distributed architectures based on Kubernetes with Intel AI software tools and Intel hardware to accelerate your application and get the most out of it. What I want to highlight are things like what you could be leaving on the table if you don't consider the right software on the right hardware. I also want you to be able to go with your friends to the bar, or go back to your family at the dinner table, and talk about Kubernetes like you're an expert. But let's take a step back here and talk about one of the main focuses, which is deploying within the context of cloud native. The application that we'll look at today was built with cloud native in mind on the AWS cloud. We'll talk a little bit about the architecture that we use and the different services that we utilize. And then we'll have an isolated demo at the end where we'll actually test out the application and do a little bit of auditing of the Kubernetes infrastructure, so you get an idea of what it looks like to deploy one of these applications. I wanted to start this talk by talking about the AI tools that we considered within what we call Intel cloud optimization modules.
So Intel cloud optimization modules are open source code bases with codified Intel software accelerations meant to run cloud natively in 1 of the 3 top CSPs, so Google Cloud, Azure and AWS. With the demo that you've seen today, we have versions of it coming out in Azure and GCP very soon. And this is the stack that you can expect. So from the bottom to the top, we're working with various generations of Xeon. Currently, the demo you'll see today is with third-generation Xeon, but we expect to have future Intel cloud optimization modules that leverage fourth-generation Xeon, previously known as Sapphire Rapids and the Ponte Vecchio GPUs. Part of that stack and the key part of that stack is the oneAPI lower-level libraries where -- that we leverage for accelerations in the abstracted higher-level packages like Intel's extension for PyTorch, Scikit-learn extension and various others that we expect to use. And then one key component here is that DevOps layer. Part of the effort of Intel cloud optimization modules, in this case, for Kubernetes within the context of AI is to make sure that we take it further down the pipeline closer to end-to-end and show you guys that you can use some of these tools like Kubernetes, Kubeflow, Docker, Terraform and API design and creation and management packages like FastAPI, which is what we'll be using -- which is what we've used in the implementation that you'll see today. You can see here that we have quite a few workloads expected for Intel cloud optimization modules. So the first couple that are out today are leveraging XGBoost within the context of a distributed Kubernetes application. A little bit of an asterisk there is that the training is not distributed, the infrastructure of the application is distributed across the compute available within the cluster. So we've designed 3 different endpoints that we'll look at today to handle an end-to-end pipeline like data -- from data processing all the way through training and inference. 
Another one that's available today is one that allows you to build -- implement oneAPI AI kit libraries inside of the inside of SageMaker. So both of them that are out right now are AWS-focused, but we do have a couple coming out very soon for Azure. And then we'll be looking at different workloads like distributed training for computer vision and natural language processing. So all of these will come out as open source code bases in GitHub that you can access, you can clone. And you'll be able to adapt for your own implementations or pick and choose what you -- what makes sense to you and be able to leverage and for your particular workloads or particular applications. So let's take a step back for a second. Now that we've looked at what Intel cloud optimizations are and what you can expect, one of the key kind of ideologies or things that we want to make sure that is communicated properly through these efforts that we put together through Intel cloud optimization modules and other things that our team does is this concept of becoming a compute aware AI/ML developer. So I'll talk about this from my personal experience. My -- I come from a start-up background in energy tech. And my experience was very -- probably very similar to a lot of experiences of AI/ML engineers or data scientists or data engineers or ML ops engineers who come from a different background and are not traditionally trained in software and don't come -- don't have a computer science degree, for example. It's -- you essentially adapted a skill set, but in some ways, you haven't built some of the foundations that some other professionals have. And one of those things is you haven't gone through potentially courses and looked at subject matter associated with compute optimization. And this is one of the things that I -- that we see is extremely valuable is trying to become more of a compute aware AI/ML developer that you leave less and less on the table every time. 
So some of the things I wish I knew about when I was developing applications as a software engineering manager and as a data scientist in a previous job was knowing that there are optimized libraries and software packages that you can use that run on specific hardware more efficiently than others and can give you more bang for your buck to increase your ROI. This is something that as a developer is going to be increasingly more important as we move away from -- in this new world of working with -- mostly working with foundational models and doing more fine-tuning than training from scratch and trying to actually work on the deployment and the productization of AI rather than the vanilla or the ground-up development and design of architectures. We're going to be looking at skill sets that are able to take these workflows, these workloads to the next level and make them more efficient, make them more cost efficient. So skills like this knowing how to pair the right sort of that hardware going to become increasingly important. Another thing I wish I knew was that there are accelerators or instruction set architectures inside of general compute like CPUs that allow you to accelerate AI workloads. We tend to think that that's only something that comes in GPUs or other accelerators, but things like AVX-512 or advanced matrix extensions in the new Sapphire Rapids fourth-gen Xeon are things that can really help bridge that gap and make a strong argument for leveraging the CPUs for training as opposed to always relying on the GPU and accelerators. And other things were -- that I wish I knew about were things like quantization, model distillation, accuracy aware model compression, things like this that can help models become more efficient post training so that you can accelerate at the point of inference and just get more bang for your buck in that sense. 
So like I said, my previous MO as a developer was very much pick hardware based on core count and memory and just run with it. And I would just start writing my code. And I think after coming -- I don't think, I know after coming to Intel, I just look at things so much differently. And I encourage everybody else to do that as well. Whether you're using Intel or using another hardware provider, it's important to consider validating your cores, memory -- look at your memory networking speeds, look at the optimized packages that you can leverage, partner that with the right hardware, look at the right lines of code that you need to include in your software. At times, it might surprise you, it might just be a couple lines of code. It might not have to be too much. And you can get a significant -- significant benefits and uplift from very little work on the development side. And at the end, make sure that you're always auditing your core utilization benchmarking and using -- and making sure you're using proper resource allocation when doing your software development and preparing your workloads. And before I move on to the rest of the talk, let's go ahead and bust this myth real quick, which is typically, we consider that the best -- or a lot of people think that the best hardware for machine learning workloads is the GPU. But the truth is that the best hardware for AI/ML workloads depends on the workload. Historically, CPUs have always been the king of inference because of their ability to be lower -- they have lower cost, more accessibility, you can scale them up because of the general availability that you could have in public or private clouds. And typically, the deep training -- sorry, the deep learning crown has gone to GPUs because of their ability to do very efficient matrix operations. 
But when you start to look at things like a dedicated instruction set architectures and just general instructions at the hardware level for CPUs that are starting to give you some of the tooling available and accelerators trying to bridge that gap, you can start trying to figure out what makes most sense to you from a performance cost basis, right, price performance basis. Okay. So let's talk a little bit about the suite here. So I'm sure if you've attended many of our -- any of our talks in the past, you'll be familiar with the suite. From the bottom to the top, the Intel AI suite covers quite a bit of the data model training and deployment kind of end-to-end scope of AI/ML tooling. If we look at how we're going to leverage that in today's implementation, we're going to be looking at an example of loan default risk prediction. So we're going to be -- you can imagine this is a scenario where you are sitting at home and you're applying for a loan, and you put in all your information. That information gets sent to the server. The server puts your information through this model, and then it comes out with the probability of how likely it is for you to default on this loan. You can imagine that depending on the popularity of your service, you might need to scale up or scale down the amount of resources you have backing this particular service. And that's what we'll talk about today within the context of Kubernetes and a distributed architecture is how to handle varying workloads and how that -- how, in this example, using XGBoost and converting it into Daal4py format to get up to 2 -- 2.25x speed up during inference can help you optimize that -- your compute and your resources to handle ever-growing demand for popular applications like these. One thing I'll say before we move on is you'll be able to find the code base for this particular application. And this is the -- that -- what you see on the right there in the image is the name space for the GitHub repo. 
We'll go ahead and make sure to include it in the chat so that you guys have access to it before we conclude the talk. And each one of these repos come -- recipes. So these little images on the left are PDF recipes where you can download them. And you'll be able to see within the context of the workload and the different software packages and the hardware, what we're trying to highlight in each one of these intel cloud optimization modules, kind of get you -- make it a little bit easier to come up to speed with what we're trying to do here. All right. So I promised that I'd give you something to take away from here, where you could sit down with some friends at a bar or you could go back at the dinner table and explain Kubernetes to your friends or at least be able to communicate with friends who are DevOps engineers or do this for a living. But we're going to talk about it at a level where I think it's quite satisfactory for AI/ML engineers like ourselves to understand this workload even if we're not the ones implementing it. Okay. So okay. So when we think about Kubernetes, we can think of the entire scope of -- or the architecture of a Kubernetes, the anatomy of Kubernetes application like this. We can break it down to the Kubernetes service, which is going to basically manage our entire -- is going to manage our clusters, and we think of our cluster as a set of workers on the right side there. And then on the bottom in yellow, we see pods. Pods are basically -- you can think of individual instances of our -- of an application. So we can deploy a Docker container into each one of these pods, and then these pods are deployed on to the workers. So then you have an application running in a distributed fashion on the nodes available inside of your cluster. Your workers in the cluster. On the left side, at bare bones, we're looking at the deployment.yaml files and the service.yaml files. 
Those are the ones that -- those are configuration files that we use to define various aspects of our application. We can use them to define conditions like limiting the CPU utilization on a per worker basis. We can limit the -- how much each request your server can utilize from your available compute resources. And using your service.yaml file, we can configure things like a load balancer, which is what we'll do today to be able to handle the various -- the varying load across our available compute. So like I mentioned a second ago that when you deploy your application or when you configure your deployment, typically what will happen is you'll take -- your pods will be deployed onto your worker nodes. And then you have in each -- for example, in this case, we have one pod running per worker, okay? And what that enables is that as -- we have increasing levels of requests coming into our server. We're able to handle those requests and distribute them across the workers that have the capacity to handle them. So one really nice thing about Kubernetes, and I'll use an analogy that I think might resonate with everybody is like, let's say, you're hosting a dinner party, okay? And initially, you invited 2 other people. It's you and your significant others. So a total of 4 people. You set out 4 plates. You have a single table that can hold 4 people and enough food for 4 people, okay? But what happens when your family members who live down the street go knock on your door and say, "Hey, we came over for dinner. Sorry for not letting you know. Do you have room for us?" Kubernetes is like if you had a cupboard or had a closet where you could pull out another table, another set of silverware, it automatically like pull like leftovers out of your fridge, heat them up and serve them up for those people. And then let's say that those people get done very quickly. 
The first couple gets done very quickly, and they decide to leave because they have to go pick up their kid from daycare or for wherever they are, and then you no longer need that table, you can scale down those resources, put away the table, put it back in the closet, put away the food, freeze it, leave it as leftovers, and now you're back to your original capacity. The ability to scale up and scale down is extremely important and one of the key attractive components of Kubernetes and being able to manage that -- those resources in an efficient way. Now when we talk about a load balancer, which is what we configure as our service here, is this -- the analogy that I would use is if you're at a restaurant, and let's say that you have 6 people sitting at a table. And let's say, your load balancer is the waiter. And the waiter's job is to make sure that everybody's water is topped off every single time, make sure that nobody is missing, has an empty glass, okay? So let's say, for example, one of your guests at your table drinks all their water. Your job is to come fill it back up and make sure that they're at capacity, and nobody is just kind of empty handed. And that -- and essentially, what would happen is in a case where you had to auto scale your infrastructure would be like if all the glass are full and now you need to -- but you need to get rid of -- you came with water anyways in your jug, and you need to fill it up somewhere. It's like filling up somebody's glasses like back up or something like that. It's not a perfect analogy, but I hope it gives you an idea of how a load balancer essentially tries to make sure that nobody has too much but that the capacity is filled and that we're leveraging our resources effectively. Okay. In the case where we have our resources, for example, go down, let's say, for example, that one of your workers failed for whatever reason. So you had a node that just completely crashed for whatever reason. 
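The waiter analogy above can be made concrete with a toy dispatcher. This is a minimal sketch, assuming a simple "send the next request to the least-busy worker" policy; it is not how the AWS load balancer is implemented, and the worker and request names are hypothetical.

```python
def dispatch(requests, workers):
    """Assign each request to the worker with the fewest requests so far."""
    load = {worker: 0 for worker in workers}
    assignment = {}
    for request in requests:
        # min() scans workers in order, so ties go to the earliest worker.
        target = min(workers, key=lambda w: load[w])
        load[target] += 1
        assignment[request] = target
    return assignment, load


workers = ["worker-a", "worker-b", "worker-c"]  # hypothetical node names
requests = [f"req-{i}" for i in range(7)]       # hypothetical request ids
assignment, load = dispatch(requests, workers)
print(load)  # {'worker-a': 3, 'worker-b': 2, 'worker-c': 2}
```

With seven requests and three workers, no worker ends up more than one request ahead of another, which is the "nobody has too much, but capacity is filled" behavior the analogy is reaching for.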
Because of the conditions that you've established across your Kubernetes configurations, your service as well as your deployment will define how you need to address these kind of situations. Like for example, if you need to have at least 3 clusters running at all times and you need to -- and you have the load requirements for those 3 clusters and you don't need to scale them up or down, then initially what you would do is just spin up -- your cluster would automatically spin up an additional node for you to be able to then redeploy your -- a version of your applications. So another pod basically onto that worker and then be able to handle the additional load. This is a very high-level introduction to Kubernetes and how you would handle the load of an ever -- kind of like an application that has either varying loads or potentially has some kind of failure in nodes. So this is just a general explanation of that. When we look at those configuration files that I mentioned, if you look at the first one, which would be when we talk about Kubernetes on AWS, we're working with the Elastic Kubernetes Service or EKS. The simplest way to deploy your Kubernetes service is with a Kubernetes cluster configuration script. So this is a simple yaml script, where here at the top, we have where we specified the region we want to run our cluster on the version of EKS we want to run. And then at the bottom, with our managed node group, we specify the capacity and the instance type that we're going to use. So the instance type there, m6i.large is a third-generation Xeon. It's important to consider that if you're going to be trying to run -- if you're going to try to run this workload and you're going to try to use that conversion from XGBoost to Daal4py, you're going to have to run on Intel hardware. So a recommendation is m6i's are the third-generation Xeons. Once they become available, the R7iz's are going to be the fourth-generation Xeons on AWS. 
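The cluster configuration script described above (region, version, then a managed node group with capacity and instance type) could look roughly like the following. This is a sketch only: it builds the eksctl-style `ClusterConfig` keys as a Python dict rather than reproducing the YAML on the slide. The cluster name, region and version are assumptions, the instance type and starting capacity of 3 come from the talk, and the minimum of 2 and maximum of 6 mirror the numbers quoted later in the demo.

```python
import json


def cluster_config(name, region, version, instance_type, desired, minimum, maximum):
    """Return an eksctl-style ClusterConfig as a plain dict."""
    if not minimum <= desired <= maximum:
        raise ValueError("desired capacity must lie between min and max")
    return {
        "apiVersion": "eksctl.io/v1alpha5",
        "kind": "ClusterConfig",
        "metadata": {"name": name, "region": region, "version": version},
        "managedNodeGroups": [
            {
                "name": f"{name}-nodes",
                "instanceType": instance_type,
                "desiredCapacity": desired,
                "minSize": minimum,
                "maxSize": maximum,
            }
        ],
    }


config = cluster_config(
    name="loan-default-demo",   # hypothetical cluster name
    region="us-east-1",         # assumption: region is not stated in the talk
    version="1.25",             # assumption: version is not stated in the talk
    instance_type="m6i.large",  # third-generation Xeon instance named in the talk
    desired=3,
    minimum=2,
    maximum=6,
)
print(json.dumps(config, indent=2))
```

The range check is there because a desired capacity outside the minimum and maximum is an easy mistake to make when editing these values by hand.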
You recommend running on either of those and just making sure that your capacity in terms of CPU count and memory are appropriate for the workload you're going to run. For your deployment file, we have 3 different locations, 3 different components that I wanted to highlight. So the first one is if you look at the very top of the file where it says for the key kind, we're saying that this is a deployment, which is a type of Kubernetes resource. Then just under that in the replica section, we're specifying the number of pods that we want to spin up. Now we can then configure something called the horizontal pod auto scaler, which is going to allow us to spin up and create more pods if we need them to handle more load. And then -- and if we have the underlying capacity to spin up additional clusters, then we can redistribute those across -- not clusters, across more nodes. But in the beginning, in this implementation, we're just starting with 3 pods. And in the previous slide, we started with a capacity of 3. So we're going to have one pod per node. And that's exactly what we specify in this next section, which said -- the one that says the third blue box, the topology spread constraints. The configuration there basically says, I want to have an even distribution of my pods across my nodes. But if for whatever reason, I have less nodes than I have pods, for example, or more pods than I have nodes, then just distribute them anyway as best you can. So let's say, for example, I only had 2 nodes, but I had 3 pods that I need to get out. You might end up with 2 pods on 1 node and 1 pod on the other. And the section just below that is where we specify the application. So in the image section for the image key under the containers, that's where we'll specify the image that we're going to be running inside of one of these pods, okay. And then lastly, the service. When we look at the service in the first blue box, that is the type of Kubernetes resource that we're spinning up. 
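Two ideas from the deployment file discussed above can be sketched in a few lines: the "spread evenly, but schedule anyway if you can't" behavior of the topology spread constraints, and the replica calculation documented for the horizontal pod autoscaler (desired replicas = ceil(current replicas × current metric / target metric)). The placement function is a toy stand-in for the real scheduler, and the node names and metric values are hypothetical.

```python
import math


def spread_pods(num_pods, nodes):
    """Place pods one at a time on the node that currently holds the fewest."""
    placement = {node: 0 for node in nodes}
    for _ in range(num_pods):
        target = min(nodes, key=lambda n: placement[n])
        placement[target] += 1
    return placement


def skew(placement):
    """Difference between the fullest and emptiest node."""
    return max(placement.values()) - min(placement.values())


def desired_replicas(current_replicas, current_metric, target_metric):
    """Replica count rule documented for the horizontal pod autoscaler."""
    return math.ceil(current_replicas * current_metric / target_metric)


three_nodes = spread_pods(3, ["node-1", "node-2", "node-3"])
two_nodes = spread_pods(3, ["node-1", "node-2"])
print(three_nodes, skew(three_nodes))  # {'node-1': 1, 'node-2': 1, 'node-3': 1} 0
print(two_nodes, skew(two_nodes))      # {'node-1': 2, 'node-2': 1} 1

# Hypothetical numbers: 3 pods averaging 90% CPU against a 60% target.
print(desired_replicas(3, 90, 60))     # 5
```

The two-node case reproduces the example from the talk: with 3 pods and only 2 nodes, one node gets 2 pods and the other gets 1.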
In the next section, we specify the ports. We want to specify which port our load balancer is listening on and which port it's passing information through to the pod that's running inside of our nodes. In the last section, in the third blue box, we have the type of service. There are various types of Kubernetes services, and I encourage you to try different ones and see what works for you. I think the most important thing to consider is the level of security and access that you want to give the outside world when it comes to interacting with your applications. The load balancer tends to give the greatest level of access. So basically anybody, in some ways, could send a request to an application that runs behind the load balancer, okay? It just depends on how you set up your security groups as well on AWS. Okay. So that's Kubernetes. I hope that gave you at least a taste of what Kubernetes is and a baseline for everybody if you try to implement this application on your own. Like I said, we'll share the link to the GitHub repo so that you can clone it. Otherwise, there's a really nice tutorial, both in the README inside of the repo and on Medium, where we have a tutorial that we published. And you can expect a video soon that walks through the entire application and shows you how to implement it on your own. But I hope that gave you a good baseline so that when we go into the demo, everybody more or less understands what is going on. I don't want to leave you guys too in the dark there if you've never seen or known anything about Kubernetes until this talk. So let's look a little bit at the solution architecture in AWS. This is essentially what the application looks like. Our loan default application consists of 3 endpoints: one for data processing, another for training and the third for inference.
That application we've built an image from it and pushed that image to the Elastic Container Registry on AWS. Then we take that -- we pull the image from the registry and when we deploy our application on the Elastic Kubernetes Service. The cluster is essentially a set of EC2 instances, which is AWS' equivalents of discrete compute like VMs on Azure or GCP. Then we use the S3 as our object storage. So this is where we're storing all of our data, where usually essentially is kind of like a data lake or a data warehouse. And then we're using a load balancer, so the outside world and the clients can interact with the rest of the -- with the entire application. So we'll have a demo on that at the end of the talk. So just a quick highlight of each one of these services. Like I mentioned, ECR is where we're storing our images. On the Elastic Kubernetes Service is what's helping us manage the Kubernetes infrastructure on the AWS cloud. The EC2 instances are independent nodes within our clusters. So in this case, we have 3 nodes that we've deployed inside of our cluster. We have the load balancer, which is what's kind of the interface between the outside world and our application. And then lastly, we have our S3 bucket, which is where we're storing all of our data, our model and essentially using it as our data lake, object store and our model repository. One thing you'll notice here is we don't have a database layer. That's beyond the scope of the Intel cloud optimization implementations, but you're welcome to leverage something like Amazon's RDS or Athena or anything like that to try to build a database layer to kind of underpin your application to make it more, I guess, production ready. Okay. So let's go ahead, and I'm going to stop the slides now. And I'm going to go ahead and take the screen, and we're going to look very quickly at our infrastructure that has been spun up. We're also going to run a couple of tests to interact with our application, okay? 
So I'll go ahead and share my screen now, and we're going to move on to the live demo portion of the talk. Okay. All right. So Vicky or anybody else on the call, if you guys can just let me know that you can see the demo screen and that we moved on from the slides. Let's see. Okay. Perfect. All right. Thank you very much. All right. So what I've done here is, on the left side, I've SSH'd into an EC2 instance in the cloud, and I'm going to be interacting with the various resources that we have deployed. I'm going to be using 2 different tools. One is called eksctl, which is the command line interface for Elastic Kubernetes Service, and the other is kubectl, which is the command line interface for Kubernetes. I'm going to start off by using this command, eksctl get cluster, to look at what I have available to me. You can see here that this is the cluster that we spun up for the webinar: EKS cluster loan default webinar. Next I'm going to look at the resources available inside of this cluster. So I'm going to run this command, eksctl get nodegroup --cluster, and I'm going to give it the name of this cluster here. And we can see what we spun up. It's a little bit hard to see because it's all jumbled up. Let me stretch this out. So we can see here that I have a cluster that is currently active, and the parameters for this cluster are that there's a minimum of 2 nodes spun up at any given time and a maximum of 6 nodes that I'm willing to spin up at any given time. The type of node is this m6i.xlarge, which, I might be mistaken, but it is the third-generation Xeon with, I believe, 4 vCPUs and 16 gigabytes of RAM, okay? So those are the resources that we have spun up in this cluster.
So now let's take a look at -- I'm going to expand this back up to the right because I think it will be easier for us to diagnose this and give you a better idea of what we're looking at if we have that bigger screen. So I'm going to use KubeCTL now to diagnose what's available inside of our Kubernetes applications. So what does our deployment look like? That's essentially what we're going to look at now. So I'm going to get all of my resources, Kubernetes application. We can see here that we have 3 different pods running like we configured in the previous application. So each one of these pods is running one per node that we saw -- that we just saw. Here's our Kubernetes service. You can see that it's a type load balancer, and this is the external IP for that load balancer. So actually, on your end, I don't know how quickly you are typing, but you could send a quick get request to this external IP, and you should be able to ping that server if you really wanted to. We can see here that our deployment has 3 pods, and all 3 pods are ready. And then this is our replica set. Our replica set is essentially what defines how many replicas of our applications are running at a given time. And it says that we have - we desired 3, and we have 3 and 3 are ready for use, okay? So now that we have -- I'm going to go ahead and rerun that because I'm going to need some of the information from that page. Now that we made sure that our Kubernetes application is well set up and that we have everything we need, we can go ahead and take a look at some of the individual resources inside of the Kubernetes deployment. So one thing we're going to do is run this command, KubeCTL describe, and we're going to look at the parameters of one of these pods, okay? You'll notice that I'm providing this -n parameter here. That's a name space. The name space is essentially a way for you to be able to manage in a more discrete way the different resources you deploy onto a cluster. 
So if you have multiple applications running on the same cluster, you can divide them by name spaces essentially. Okay. So this is a lot of information that just popped up, but I'm just going to highlight a few pieces of it. So we can see here that we have this particular pod running within this name space, running the service accounts. We can see the date that we started up the pod. We can also see the node that it's running on. So we can actually see the EC2 instance that this particular pod is running in. So this particular instance of our application running on a very specific EC2 instance at this moment. So matching what we had requested of our -- during deployment of our application. We can also see the image that we build. So we can see that we have -- we pulled the latest version of this ICOM pilot, which is the image that we have in our ECR repository. And we also see the image ID and container ID for that, okay? Another thing that we can see is the various tokens and roles that we've assigned to this node, so that has the right permissions to interact with other S3 resources. And then what I think is one of the most useful pieces in terms of diagnosing whether things are working or not working is taking a look at the event log at the bottom is -- so we can see that, yes, we were able to successfully assign the -- this particular version of the application to this EC2 instance. We were able to pull the image from the ECR and deploy that image onto that -- within that pod, within that instance, and we can -- we see that the container was properly created. Okay. That's an example of how you'd go about diagnosing or taking a look at what's taking place inside of your Kubernetes application. Another thing that we can look at and we can describe is our Kubernetes service. So we can do a KubeCTL describe, you can kind of start to see a pattern here using that describe functionality from KubeCTL to diagnose and take a look at your application. 
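The inspection commands used in the demo so far follow a small, repeatable pattern, which the sketch below collects as argument lists. It only builds and prints the commands and does not run them. The cluster, namespace, pod and service names are hypothetical placeholders, since the exact names on screen are not captured in the transcript.

```python
import shlex


def eksctl_get_clusters():
    """List EKS clusters visible to the current credentials."""
    return ["eksctl", "get", "cluster"]


def eksctl_get_nodegroups(cluster):
    """List the node groups that belong to one cluster."""
    return ["eksctl", "get", "nodegroup", "--cluster", cluster]


def kubectl_get_all(namespace):
    """List pods, services, deployments and replica sets in a namespace."""
    return ["kubectl", "get", "all", "-n", namespace]


def kubectl_describe(kind, name, namespace):
    """Show details and recent events for one resource."""
    return ["kubectl", "describe", kind, name, "-n", namespace]


commands = [
    eksctl_get_clusters(),
    eksctl_get_nodegroups("my-eks-cluster"),              # hypothetical cluster name
    kubectl_get_all("loan-app"),                          # hypothetical namespace
    kubectl_describe("pod", "loan-app-pod", "loan-app"),  # hypothetical pod name
    kubectl_describe("service", "loan-app-service", "loan-app"),
]
for command in commands:
    print(shlex.join(command))
```

To actually execute one of these from Python, a list can be passed to `subprocess.run(command, check=True)` on a machine where both CLIs are installed and configured.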
But now that we've run that there, we can see that we can get some information about our service. You can see here that, again, it's operating within the same namespace, and it's of type LoadBalancer. We can see the ingress and endpoints available for this particular load balancer, which essentially are where we're feeding our information through to our application. So basically, this is where it's listening to the world, and then this is where it's going to be sending all of your requests to the underlying resources, okay? And again, we can see the events on here where it spun up the load balancer. Okay. So now let's try to put my money where my mouth is and go ahead and give you a quick demo of what this looks like. So I mentioned earlier that we could go ahead. And I'll do this by hand very quickly, and then I have them prewritten in a text document on [indiscernible]. So I'm not here just typing out a bunch of commands. But I do want to do one of these by hand, so you can see how you could try this on your end when you deploy it and when you're testing out your application at a development phase. Obviously, these are things that you would be doing. If your application was in production, that's a little bit different of a circumstance. But we're going to be using curl to make a request to this endpoint. So we're going to go ahead and ping our server to make sure it's still running. And this being a live demo, anything can happen, but let's go ahead and give it a shot. So we're going to take the external IP, and this is where we're going to send our request. I forgot the http:// at the beginning. And we're going to send this through port 8080, and the endpoint we're going to hit is called ping. So we're going to run that, and perfect. This is what we should see. We should see the reply message that the server is running. So that is essentially a key component of making sure that we have the right things going. 
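As a rough illustration of the health check described above, the same request can be sketched in Python instead of curl. This is only a sketch: the port (8080) and the endpoint name (ping) come from the demo, the host is a placeholder, and because the exact response body is not shown in the talk, the helper only reports whether the server answered with HTTP 200.

```python
from urllib.error import URLError
from urllib.request import urlopen


def build_url(host: str, port: int, endpoint: str) -> str:
    """Build the plain-HTTP URL for an endpoint behind the load balancer."""
    return f"http://{host}:{port}/{endpoint.lstrip('/')}"


def ping(host: str, port: int = 8080, timeout: float = 5.0) -> bool:
    """Return True if GET /ping answers with HTTP 200, False otherwise."""
    try:
        with urlopen(build_url(host, port, "ping"), timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx status.
        return False
```

Calling ping with the external IP reported for the service would play the same role as the curl command typed by hand in the demo.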
So now that I have that, let's go ahead and do some interesting things here and actually run this application. So the first thing we're going to do is we're going to process some data. So we're going to take some data, and we're going to perform some transformations on it using some Scikit-learn functionality and accelerate it using the Scikit-learn extension. The main thing we're accelerating here using the Scikit-learn extension is the train/test split component of dividing up our data. You can see here that this curl command is a little bit more complicated. The payload that we're sending to our backend is this right here, where we're specifying our S3 bucket. We're specifying where our raw data is stored, which is called credit risk data set, and we're going to be processing 400,000 lines of data, okay? So this might take a second, but let's go ahead and send this request to the server and give it 1 second as it creates our data. Okay. Perfect. So our data was successfully processed and saved. And, like I said, to put my money where my mouth is, I will go over to the AWS console if we have time to show you where that data was stored in the S3 bucket so that you guys -- you guys don't know me. I want to make sure you guys believe me, right? So let me go ahead and put in the next one. So the next one is where we'll actually be training the model. And what's happening here is we are taking a model, we're training an XGBoost classifier, and then we're converting it to daal4py format. That daal4py conversion is just one line of code. I'll show that in a second once we're done running these commands so that you have an idea of where to find that in the code and where that's happening. And once it's [indiscernible] is converted to daal4py format, we stick that model in S3. Again, we're using S3 as kind of like our general model registry or data lake, everything. So that's what's happening there. So we'll go ahead and run this. 
And what we'll get is also a response with some metrics such as the different accuracy scores and recall scores for our model. So let's give it a second while that model is trained and saved into the S3 bucket. Okay. Perfect. So we can see here the message from the server is model has been trained successfully. We see our validation scores, our precision, our recall, F1, et cetera, for the model. And yes, everything looks quite good in terms of our scores for this initial model training. Let's go ahead and send the server some data, okay? So the way we send the data is we send the data as individual samples as JSON -- a list of JSON dictionaries. So you'll see that in a second once I paste the command into the command line. Okay. I'm going to -- so this is a much larger command here, of course. And what I will say about this one is that typically how your server would process this is you would have maybe the -- the table that the client fills out, maybe that goes first into some kind of database and then you send -- you go from database to your predictions or your prediction service or you would go straight from the user and send this information to your backend directly to the prediction service in JSON format. Okay. So it's very fast at that point. So we sent 2 different samples for 2 different clients. And happy days for these clients because they're both at least -- actually, my apologies. So if it's -- no, correct, yes. So these people would be, I think -- okay. Yes, sorry. I wanted to make sure that I got the nomenclature properly, and I didn't confuse you guys. So when we have a true and a high probability that means that it's highly likely that the individual will default on their loan. So actually, it's not a good day for these people. It's really a sad day for these people because now they're not going to be approved for the loans. But for the sake of our implementation, that doesn't necessarily impact us too much. So we'll get over it. But yes. 
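To make the shape of that inference request concrete, here is a minimal sketch of a payload built as a list of JSON dictionaries, plus a helper that reads a result the way the talk describes it (True with a high probability means the applicant is likely to default). The feature names, the "samples" wrapper key, and the response fields are all assumptions for illustration; the real schema is defined in the repository.

```python
import json

# Hypothetical feature names and values; the real schema lives in the repo.
samples = [
    {"age": 35, "income": 52000, "loan_amount": 12000},
    {"age": 48, "income": 31000, "loan_amount": 25000},
]

# The wrapper key "samples" is an assumption, not taken from the talk.
payload = json.dumps({"samples": samples})


def describe(prediction: bool, probability: float) -> str:
    """Interpret one result: True means a default is predicted."""
    outcome = "likely to default" if prediction else "unlikely to default"
    return f"{outcome} (p={probability:.2f})"
```

The payload string is what would be sent as the body of the request to the prediction endpoint, one dictionary per client.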
So that is essentially the entire workflow. So we've hit all 3 endpoints all the way from model -- well, 4 endpoints. We pinged the server to make sure it's healthy. We processed our data. We've saved all of our data into S3 and passed all the data and transformation files to our model training. We've trained our XGBoost classifier and converted it to daal4py format. And now we've performed inference by sending a payload to our server. And now we have an actual output and tangible evidence that this works, right? And hopefully, this is quite clear in terms of what we're trying to accomplish here from an end-to-end point of view. Now we'll pivot back over to our Kubernetes stuff, and we're going to go ahead and take a look at a Kubernetes pod log. And what that's going to show us is what actually happened inside of our pods while we were sending all of those commands. So we can do a kubectl get all again. So we're going to go ahead and -- a typo there. And we're going to go ahead and look at a log inside of one of these pods. So we're going to do kubectl log, and we're going to give it this pod, and we're going to look at the log inside of this particular pod. We still need to provide a namespace, though. So I need to do that, logs. Okay. So we can see here this particular log; let me just clear this and run it again so that's easier for you guys to see. So we can see here for this particular pod that our server was spun up. It's running on port 5000. And we can see here that this particular pod received 2 different GET requests. So this pod actually wasn't the one that was doing our processing, but it was the one that was pinged when we checked the health of our server. So let's go ahead and check the other pods to look for the one that was actually doing our processing, how about that? Let's go ahead and replace this particular pod with the second pod that we deployed, that one. Okay. Let's run this. 
Okay. Now we've got something to work with. So I'm just running that again and clearing out my terminal so that's easier to see. We can see here that this is the log of what's actually happening inside of my pod. So my pod, for example, here has received the request to process our data set. And so here, for example, okay, I've loaded the data set. I'm generating 400,000 rows of data. I'm creating this column transformer. I'm saving the column transformer, save data. Now here, I've received a request to train a model. Here I am training the model. Once I've trained the model and saved the model to S3, here are the metrics associated with the model training. And then here, now I've received the prediction request, and we can see here starting daal4py inference, okay? So starting the inference with daal4py, completing it, exporting predictions, and here are the outputs of my predictions. So I hope that's enough evidence for you guys to believe me that it's not magic, and it is working. Some of this stuff can be a little bit of a black box when you're working in the terminal. That's the only reason I say that kind of jokingly. So one of the other things we can do is run that log, but actually follow that service. So now, by adding that [indiscernible] follow command at the end, we're actually going to be able to follow this log in real time. So what I'm going to do is I'm actually going to send this request a couple of times on the right side. And you can see that that first one didn't go to this pod in this node, but the second one did. I'll see what happens when I run it again. So that one came to this one. I just ran another one, but it went to another pod. And again, I'm just kind of trying to showcase the fact that this is a distributed application. The load balancer is doing its job of distributing that to different nodes depending on where the workload is. 
A lot of the time, you might get repeat requests in a single node. Sometimes you won't. It just depends on how you've configured the limits on how much compute each request can access from the specific node. Okay. So now that we're done with this piece, I have one more thing I want to show you guys, and then we're going to be done for today, and we'll go to Q&A. So I actually want to show you guys in the code what this looks like. So when you go and clone this repo, this is what you'll see. You'll see 3 different folders. The Kubernetes folder contains all of your Kubernetes files, okay? So you're going to have everything there. And one thing we didn't actually do in this demo is use the pod autoscaler, but we do have a pod autoscaler configuration in this repo. And in the README for the repo, you'll find instructions on how to deploy that. Inside of the app folder, this is the actual application that's packaged up, and we have the Dockerfile that allows you to do that. So the Dockerfile is based on this Scikit-learn Intel [indiscernible] image that we pull. And we install the dependencies, and the Docker image, when it's deployed and we build a container from it, automatically runs the server, okay? That server file is here. So that server file, like I mentioned, is a FastAPI server. And there are 4 endpoints. We have a ping endpoint; a data endpoint, which does the data processing; a train endpoint, which does the model training; and the prediction endpoint, which is our ML prediction service, which handles all of the inference. Okay. So the last thing I'll share is inside of the loan default folder within the app folder. This is where all the kind of ML magic and the data processing magic is happening. The model.py module is actually just a model class, okay? 
And I'll highlight very quickly that one of the methods inside of this class, the train method, is where we're doing our training of our XGBoost classifier and then converting it to daal4py format. So like I mentioned in the beginning of my talk, this comes back to that concept of being a compute-aware AI/ML developer. This is just one line of code that I added here. That didn't take a lot of work. And had I not done that, I'd be missing out on an approximately 2.25x performance boost during inference, which means that I can handle approximately 2x as many inference requests, which means I might need half as much compute to handle the same amount of load on my infrastructure. So these are the kind of things that you can start thinking about if you're not already thinking about them today. And like I said, just one line of code to convert to daal4py format in this case. When I go over to my prediction endpoint, there's also one more line of code that I add. So I use daal4py classification prediction. This right here is a method in the daal4py library. And again, one line of code essentially that I'm using to then leverage daal4py during inference. So a total of 2 lines of code that you need to add to check that box of being more compute aware and matching the right software to the right hardware when you're trying to build out your applications. And like I said, within the context of a distributed application, which is designed to handle varying types of load, let's say, for example, that on a, I don't know, Tuesday, at 3 in the morning, maybe we're getting 100 requests per hour. But let's say then it's, I don't know, Thursday afternoon, and a lot of people are trying to apply for loans, and you've got, I don't know, 10,000 requests coming in. You could handle -- again, it just depends on the infrastructure and the rest of the things that you're running. 
Theoretically, you could handle almost twice as many inference requests with the same amount of compute resources as you would otherwise. And that's money in the bank, right? That's directly tied to ROI: how much money you're getting out of this application and what your return on your investment is. So again, just kind of closing the loop with that idea. Two lines of code, within the context of the distributed, high-load applications that we typically build with Kubernetes. There are things that you might be leaving on the table if you're not considering that connection between the right software and the right hardware. So that's it for my talk today. I think I pushed it really close to the end here. I'm not sure how much time we'll have for questions, but I'm happy to address any that we have in the chat. So Kelli and team, I'm ready for that when you are.
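The capacity argument above can be worked through with a few lines of arithmetic. This is only a sketch under stated assumptions: the per-pod baseline of 2,000 requests per hour is hypothetical, the peak of 10,000 requests is treated as an hourly figure, and the 2.25x factor is the inference speedup quoted in the talk applied as if it translated directly into throughput.

```python
import math


def pods_needed(requests_per_hour: float, baseline_capacity: float,
                speedup: float = 1.0) -> int:
    """Pods required when each pod serves baseline_capacity * speedup requests per hour."""
    return math.ceil(requests_per_hour / (baseline_capacity * speedup))


# Hypothetical baseline: one pod serves 2,000 requests per hour without daal4py.
BASELINE = 2000
PEAK = 10_000  # the Thursday-afternoon figure, assumed to be per hour

print(pods_needed(PEAK, BASELINE))        # stock inference
print(pods_needed(PEAK, BASELINE, 2.25))  # with the quoted 2.25x speedup
```

Under these assumptions the peak needs 5 pods without the speedup and 3 with it, which is the "roughly half the compute" point made in the talk; real numbers depend on the rest of the infrastructure.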
Kelli Belcher
Thanks, Eduardo. There was a question: could you remind us what type of instance you used and why you selected it?
Eduardo Alvarez
Yes, of course. Let me stop sharing -- so the type of instance that I selected was an m6i.xlarge instance. The reason I selected that particular instance, like I mentioned earlier in the talk, is you need to be running on Xeon, on Intel hardware, to be able to leverage these accelerations. So if you're running on something else, the 2 lines of code that I just mentioned won't give you the acceleration and won't be compatible with other types of hardware.
Kelli Belcher
And that also leads into our next question about deep learning boost and how that is helping?
Eduardo Alvarez
So I'm actually -- I'm not sure what deep learning boost is referring to. So let's maybe consider that question at the level of the accelerators that I mentioned are available, for example, in Sapphire Rapids, like AMX, so Advanced Matrix Extensions. In terms of boosting your deep learning performance, the 2 main things AMX brings are its ability to handle more data in a more efficient way and also to perform matrix operations in a more efficient way as well. So again, moving a little bit closer to the kind of optimizations that we see in accelerators but within a CPU architecture.
Kelli Belcher
Okay. Thanks. And where did you specify the number of pods and nodes that you were running?
Eduardo Alvarez
Yes. So I can go ahead and share my screen again just to show where that is because if you guys are trying to implement this on your own, I want to make sure that you have that on your end, and that's quite clear. So let me share my screen one more time and go back to that file. Okay. Kelli, just give me a thumbs up when you can see my screen.
Kelli Belcher
Yes, I can see it.
Eduardo Alvarez
Okay. Awesome. All right. So the number of pods is specified here, inside of the deployment file. So this replica set that gets generated through this particular spec essentially tells my deployment that I need to have 3 copies of my application running. So it's one replica set, so it's one version of the application running across 3 different pods, and that's what this specifies here. And the other thing I'd add there is if you do try to implement the pod autoscaler, there's this parameter here. I know I didn't go into this file, by the way. So like I said, it's in the README for the repository. You'll be able to find the information there on how to implement this. So in this case, I've set the minimum number of replicas I want running at any given time to 1 and the maximum to 6. So this allows my pod autoscaler to increase or decrease the number of pods depending on the load.
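The two settings described in that answer can be sketched as Kubernetes manifests written out as Python dictionaries (Kubernetes accepts JSON manifests as well as YAML). This is an illustrative sketch, not the files from the repository: the application name, the image placeholder, and the CPU-utilization scaling metric are assumptions; only the replica count of 3 and the autoscaler bounds of 1 and 6 come from the talk.

```python
import json

APP = "loan-default-app"  # hypothetical name, not taken from the repo

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": APP},
    "spec": {
        "replicas": 3,  # three pods, as in the demo
        "selector": {"matchLabels": {"app": APP}},
        "template": {
            "metadata": {"labels": {"app": APP}},
            # IMAGE_URI is a placeholder for the image stored in ECR.
            "spec": {"containers": [{"name": APP, "image": "IMAGE_URI"}]},
        },
    },
}

autoscaler = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": APP},
    "spec": {
        "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": APP},
        "minReplicas": 1,
        "maxReplicas": 6,
        # The scaling metric and target are assumptions; the talk does not state them.
        "metrics": [{
            "type": "Resource",
            "resource": {
                "name": "cpu",
                "target": {"type": "Utilization", "averageUtilization": 50},
            },
        }],
    },
}

print(json.dumps([deployment, autoscaler], indent=2))
```

With an autoscaler like this in place, the fixed replica count acts only as the starting point, and the number of pods then moves between the minimum and maximum as load changes.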
Kelli Belcher
Okay. Thank you, Eduardo. I think that was all the questions we had.
Eduardo Alvarez
Awesome. Perfect. Thank you so much, Kelli. Appreciate it. So I guess we'll sign off for now. Well, thank you guys so much for joining us today. Kelli and I had a great time preparing this talk for you guys. And we sincerely hope that you'll try out the Intel cloud optimization modules. Like I said, keep an eye out for the cloud optimization modules that will be coming out in the near future for distributed training workloads. That's going to be super popular with the current AI climate, where we're looking at the larger models, not necessarily training from scratch, but fine-tuning models. We're going to be looking at some of those workloads in those future optimization modules and teaching you how to do those within the context of the cloud and leveraging the right software with the right hardware. So before we sign off, I would encourage you guys to please complete the survey and give us some feedback. Tell us what you thought about the webinar. You can scan the QR code that's right there or give us any feedback. I don't know if Vicky or anybody else on the call would like to say anything before we sign off, but thank you guys for your time.
David Shaw
Thanks, Eduardo and Kelli, for today's webinar. A recording of the webinar will be e-mailed to you in the next few weeks. If you'd like to watch a replay, you can do so at any time using your attendee link. The Intel academic program for oneAPI collaborates with academia worldwide to enable the next generation of developers, scientists and engineers to advance accelerated computing. This program includes numerous programs and materials, including a certification program for educators. Students can also become oneAPI ambassadors, and developers can join the Intel software innovator program. Our oneAPI Center of Excellence focuses on accelerating the open, standards-based programming model by enabling widely used codes. Learn more about all the offerings by downloading the materials from the resource list on the left side of the screen. Also, a quick reminder to complete and submit the short survey. It's your chance to help shape the series, tell us what you want to hear about and what we can do better. Thanks to all of our attendees for your questions and engagement. Please check out our event website (see the link in the resource box) for a full list of upcoming and past webinars and workshops. Thanks for joining us today, and we'll see you next time.