Intel Corporation (INTC) Earnings Call Transcript & Summary
February 15, 2023
Earnings Call Speaker Segments
Austin Webb
executive
And welcome to Tech.Decoded, Intel's technical webinar series for software developers. Thanks for tuning in to today's episode, Optimize End-to-End Transformer Performance on the Latest Intel Xeon Processors. I'd like to introduce today's speakers. Julien Simon is the Chief Evangelist at Hugging Face. He recently spent 6 years at Amazon Web Services, where he was the Global Technical Evangelist for AI and machine learning. Julien will be joined today by Ke Ding, Principal Engineer and Engineering Director at Intel. Ke is responsible for applied machine learning, end-to-end workload development, framework optimization and new AI technology exploration for Intel's latest Xeon CPU and GPU platforms. My name is Austin Webb, and I will be your host. A quick note about our new platform. You can access the biographies of our speakers in the top left-hand column. The following section contains downloadable resources, including a copy of today's presentation. Across the bottom of the screen, you see a series of icons. These give you access to the slides, the Q&A and the survey. You can also resize and move around all of these windows to best fit your view. If you hover over the video box, you can turn on closed captioning. If you have any issues with the presentation sound, slides or anything else, please note it in the Q&A and a producer will help you out. Our Q&A moderator will be going over these questions in real time. And at the very end of the presentation, they will go over them once again, time permitting, of course. Due to the pandemic, we are hosting these webinars from the confines of our own homes. So please bear with us should we face any technical issues, dogs barking or children crying. And with that, Julien, I will hand it off to you.
Julien Simon
attendee
Thank you, and good morning, good afternoon, good evening, everybody, wherever you are. It's a pleasure to get a chance to talk to you today, and thank you to Intel for giving me the opportunity. So we're going to talk about transformer models and how to make them fast on Intel Xeon CPUs. Before we dive into the code and some practical demos, let's spend a minute explaining why we even talk about transformers today. The transformer architecture exploded onto the stage at the end of 2017, beginning of 2018, with the launch of the Google BERT model, which established new state-of-the-art benchmarks for natural language processing tasks. Since then, not a week has gone by without a new transformer-based model showing up and, again, improving on state-of-the-art benchmarks for natural language processing, computer vision, speech or traditional machine learning tasks. And so models like BERT, BART, GPT-2, GPT-3, our very own BLOOM model at Hugging Face, and of course vision models like the Vision Transformer, CLIP and Stable Diffusion, and audio models like Whisper, Wav2Vec2 and thousands more are really redefining the way we work with machine learning and deep learning, generally improving on benchmarks and enabling new use cases in ways that just weren't possible before. And so traditional deep learning models like CNNs, LSTMs and RNNs are losing steam in a way and are being gradually replaced with transformer models. This trend was picked up by the State of AI Report in 2021 and 2022, which called transformers a general-purpose architecture for machine learning. And in the Kaggle Data Science Survey last year, or I should say 2 years ago, we already saw that transformers were gaining adoption while traditional deep learning architectures were actually losing steam. And in this year's, or rather last year's, Kaggle Data Science Survey, over 60% of machine learning practitioners report that they already use transformers.
So given that those models are only a few years old, this is very, very fast adoption. And when we look at transformer usage in research papers, and this is actually a slide from the 2022 State of AI Report, we see that a few years ago, transformers were applied mostly to natural language processing tasks, right, over 80%. The other modalities were very, very tiny. Now, 2 years later, NLP is of course still the majority task at 41%. But we see that all the other modalities have grown, and computer vision is pretty spectacular, going from 2% to 22%. And if you add Willow on top of that, it's almost 30%. So researchers are increasingly applying transformer models to different modalities and also to multiple modalities, and we'll see an example of that in one of our demos. This is very promising, and it's just another proof that transformer models are indeed becoming the de facto standard for deep learning. Hugging Face has been supporting the growth of transformers literally since their inception. People call us the GitHub of machine learning, which I think is a fair analogy. Just like you go to GitHub to find code for your projects and share your own code with the community, hundreds of thousands of users visit the Hugging Face hub, our main website, every day to find models for their projects and, of course, share their own models. This slide is really the slide that's never in sync, because our community keeps adding models at a crazy pace. But roughly, we have about 130,000 models on the hub today and about 20,000 datasets. And it's not just transformers. Of course, we have a lot of transformers, but we have integrations with traditional machine learning libraries like Keras, spaCy, scikit-learn, fast.ai and more, meaning you can push those models to the hub, share them with the community and enjoy the hub features for those non-transformer models as well.
We have over 10,000 organizations, from Google and Microsoft, of course Intel, and many more, all the way to research labs, open source projects and, of course, individual developers and machine learning engineers, sharing models on the hub, with way more than 100,000 users daily and way more than 1 million downloads. So it's a pretty large-scale website and a growing community. And in fact, our main open source project, which is called the Transformers library and which makes it super easy to download, train and predict with all the transformer models, is just one of the fastest-growing open source projects ever. I think we just passed the 80,000 star mark, which is amazing. Thanks, everybody. If you haven't starred the project yet, there's still time to do it. And what's even more impressive is that our popularity measured in GitHub stars is growing faster than pretty much any other machine learning project out there, including amazing projects like PyTorch, Keras and others. And we're even growing faster than Kubernetes, which I still find mind-boggling. If you've never seen the Transformers library before, this is really the shortest, simplest example I could come up with. It's just 2 or 3 lines of code. I'm creating a pipeline for object detection, a computer vision task, using a pretty fancy model from Facebook. I'm passing an image to this object detection pipeline and I'm printing predictions. In this case, for object detection, the predictions are of course class labels and bounding boxes around the different objects that have been detected. So motorcycle, backpack and person. And that's all it takes. So you really don't need any machine learning skills to do this. You just need to understand what object detection is, and voilà, as we say, right? So in just a few lines of code, you can add state-of-the-art object detection to your application. And the same goes for all the other task types for NLP, computer vision, speech, et cetera.
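To make the three-line pipeline example concrete, here is a minimal sketch. The talk only says "a pretty fancy model from Facebook", so the checkpoint name `facebook/detr-resnet-50` is an assumption, and the helper `format_detections` and the sample predictions are hypothetical and only mirror the shape of the pipeline output (a list of dicts with `score`, `label` and `box`).

```python
from typing import Dict, List


def format_detections(predictions: List[Dict], min_score: float = 0.9) -> List[str]:
    """Keep confident detections and render them as 'label (score)' strings."""
    kept = [p for p in predictions if p["score"] >= min_score]
    kept.sort(key=lambda p: p["score"], reverse=True)
    return [f"{p['label']} ({p['score']:.2f})" for p in kept]


def detect_objects(image_path: str) -> List[Dict]:
    """Run an object-detection pipeline on one image.

    The import is lazy because transformers (plus a backend such as PyTorch)
    is a heavy optional dependency; the model is downloaded on first use.
    """
    from transformers import pipeline

    # Assumption: DETR is used here as one example of an object-detection
    # model published by Facebook; the talk does not name the checkpoint.
    detector = pipeline("object-detection", model="facebook/detr-resnet-50")
    return detector(image_path)


# Hypothetical predictions, shaped like the pipeline output, for illustration.
sample_predictions = [
    {"score": 0.99, "label": "motorcycle",
     "box": {"xmin": 10, "ymin": 40, "xmax": 300, "ymax": 260}},
    {"score": 0.95, "label": "person",
     "box": {"xmin": 90, "ymin": 5, "xmax": 210, "ymax": 240}},
    {"score": 0.42, "label": "backpack",
     "box": {"xmin": 120, "ymin": 60, "xmax": 170, "ymax": 130}},
]
print(format_detections(sample_predictions))
```

With a real image you would call `format_detections(detect_objects("photo.jpg"))`, where `photo.jpg` is a placeholder path.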
So we're trying to be as developer-friendly as we can. And the reason why I'm here today is that, for a while now, we've been partnering with Intel and Intel AI to make transformers even greater on Intel platforms. You can read all about the partnership in the blog post mentioned on this slide. What we're really trying to do here is collaborate on making transformers as fast as we can on training platforms and on inference platforms. And we're going to show you both today. This is a great partnership. And again, thank you very much to Intel for giving me the opportunity to speak with you today. The main artifact of this partnership and collaboration is an open-source library called Optimum Intel. Optimum is an open-source library by Hugging Face that accelerates transformers on different platforms. And as you can imagine, Optimum Intel is the Intel version of this library. You can find all of it on GitHub, as you would expect. It supports Intel acceleration libraries like the Intel Extension for PyTorch, IPEX, which we'll use today, and the Intel Neural Compressor, which implements different techniques to compress and shrink models. And of course, Intel OpenVINO, which optimizes models for a wide range of Intel platforms all the way to edge devices. And maybe you think, well, that's crazy complicated. Transformer models are already complicated. So now you're talking about shrinking them, and do I need to do some hardware-specific development? Well, no. All you need to do is something similar to what we have here. This is an example of quantizing a model, a text classification model, for Intel platforms. And really, the only thing you have to do is start from your vanilla transformers code, so the red lines that you see here, and replace the red lines with the green lines, so literally replacing a couple of objects from the transformers library with objects from the Optimum Intel library, in this case OpenVINO objects.
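The red-lines-to-green-lines swap described here can be sketched as follows. This is not the code from the slide: the checkpoint name is an assumption, and the sketch only shows swapping the model class for its OpenVINO counterpart, not the quantization step itself. The `export=True` argument is the spelling used by recent Optimum Intel releases; older releases used a different keyword, so check the version you have installed.

```python
# Assumption: any text-classification checkpoint from the hub would do here.
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"


def load_classifier(use_openvino: bool = True):
    """Build a text-classification pipeline, optionally backed by OpenVINO.

    Imports are lazy because transformers and optimum-intel are heavy
    optional dependencies.
    """
    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    if use_openvino:
        # "Green lines": the Optimum Intel class replaces the vanilla one.
        from optimum.intel import OVModelForSequenceClassification

        model = OVModelForSequenceClassification.from_pretrained(MODEL_ID, export=True)
    else:
        # "Red lines": the vanilla transformers class.
        from transformers import AutoModelForSequenceClassification

        model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```

Calling `load_classifier()("This webinar is great")` would return a list with a label and a score; the rest of the application code stays the same whichever branch is used.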
And that's it. So it's super fast. You can do this in minutes, and running this code will automatically optimize your model with OpenVINO for the underlying platform, right? Again, you don't need machine learning skills and you definitely don't need hardware skills to do this. And you will enjoy a pretty nice speedup. But the big news really is the launch, a few weeks ago, of the new generation of Intel Xeon Scalable CPUs. Intel has those cool code names, so you need to get familiar with them. This new generation is called Sapphire Rapids, okay? It improves on the third generation, which was called Ice Lake. And the main thing, as far as machine learning geeks are concerned, is the introduction of Intel AMX, which stands for Advanced Matrix Extensions. The name says it all. It's a built-in accelerator for matrix operations and specifically for matrix multiply and accumulate, which means multiplying two matrices and adding the result to a third matrix. That's a common operation. It's the bread and butter of deep learning. And so having custom silicon to accelerate those operations should deliver pretty cool speedups. What this means is that, thanks to that acceleration, it's now possible to deploy large transformer models on CPU. With previous CPU generations, it was more challenging, and prediction latency could be problematic for large models. Now, with this hardware acceleration, you can actually beef up the models that you deploy on CPU, and you will still get good performance, as we will show you. You can even train models on CPU. And I'm sure a lot of you go, what? Generally, we train on GPU. And that's okay. GPUs are useful. But now, again thanks to this hardware acceleration, it's possible to train small and even midsized models. And when I say train here, I mean fine-tune on CPU. And again, we'll show you a demo of that, fine-tuning a model with a distributed cluster of CPUs.
And you'll probably be surprised by the performance that we get. Generally, I think this is good news for machine learning practitioners because CPU servers are generally easier to procure and repurpose. One week, you could be using them for distributed training, and the next, you could be using them for databases or web caches or web servers and whatnot. They're flexible. You can use them for anything, which unfortunately is not really the case for GPU servers, which are highly specialized. If you use large fleets of CPU servers, whether in the cloud or on-prem, it's easier to cost-optimize them and scale them. You just have [ Tor ] server types to work with. And again, that makes your ops life generally easier, okay? So this is really great news. Before we start looking at the demos, let's zoom in on Intel AMX, which again is the really great news when it comes to those new Sapphire Rapids chips. What AMX brings is, first of all, a set of new 2-dimensional registers, okay, as in hardware registers, because of course, if we want to multiply matrices, we need to have 2D registers. They're called tiles, and you can use them to store submatrices from a larger matrix, right? We have 8 tiles, 1 kilobyte each, okay? So now we can work with 8 kilobytes of matrix data, so to speak, at a time. Obviously, to work on those tiles, we have an accelerator that implements the matrix multiply and accumulate operation for int8 and BF16 data types. That lets us work with floating point models as well as integer models, potentially for quantization. This is supported by PyTorch natively. And if you add the Intel Extension for PyTorch, IPEX, you will get even more performance, and that requires very minimal changes to your code, as we will see. So that's the AMX bit that's implemented in the CPU. But again, don't worry: using IPEX and PyTorch, you don't even need to understand what's going on under the hood.
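To make the multiply-and-accumulate idea concrete, here is a minimal sketch in plain Python of computing C += A x B one small block at a time, which is the general pattern tile registers are meant for. It illustrates the arithmetic only and is not how AMX is programmed; the tile size of 2 and the function name are arbitrary choices for the example.

```python
from typing import List

Matrix = List[List[float]]


def tiled_matmul_accumulate(a: Matrix, b: Matrix, c: Matrix, tile: int = 2) -> Matrix:
    """Compute C += A @ B in place, one tile-sized block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    for i0 in range(0, n, tile):          # block of rows of A and C
        for j0 in range(0, m, tile):      # block of columns of B and C
            for k0 in range(0, k, tile):  # block along the shared dimension
                # One "tile multiply-accumulate": a small block of A times a
                # small block of B, added onto the matching block of C.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
c = [[1, 1], [1, 1]]
print(tiled_matmul_accumulate(a, b, c))
```

The result is the same as an untiled product plus the initial contents of C; tiling only changes the order in which the partial sums are formed.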
You can just enjoy the speedup. Speaking of which, this is the kind of performance that you can expect: up to 10x inference speedup compared to the previous generation of Xeon CPUs, so compared to Ice Lake. Of course, from one model to the next, from one task type to the next, from one batch size to the next, your mileage may vary a little bit. But as you can see here, running ResNet and then transformers and then other computer vision models, we generally see a 5 to 10x inference speedup. And that's huge, right? That's huge because it means you go from 100 milliseconds to 20 or maybe even 10 milliseconds. So that's a game changer when it comes to the user experience and the application performance, okay? So these are numbers, but how about we run a first demo? Let me share my screen before we go and look at the numbers. Here we go. All right. You should all see my screen now. I want to show you a first example of running transformer inference with Sapphire Rapids, okay? What you see here is just a bunch of terminals. I'm actually using AWS instances. To be precise, I'm using instances from the R7iz family. Yes, family names are getting a little bit complicated on Amazon EC2. The R7iz family is based on Sapphire Rapids CPUs. The best way to check this is to call lscpu, and we see all those nice CPU flags, and we see the AMX flags here, telling us, yes, we have tiles, yes, we have int8, and yes, we have BF16 operations, okay? So that's good news. That's exactly what we need. Let me run the demo first because it takes a minute to run. And while it runs, we're going to look quickly at the code. So here I'm just enabling my virtual environment and running my script. Okay. It's going to run for a minute. Let me show you what we're doing here. What we're doing is benchmarking different machine learning models. I'm actually only benchmarking DistilBERT in the interest of time.
But with the same code, you can benchmark other NLP models like BERT-Base, BART base, et cetera. And I'm actually benchmarking them on text classification, okay? To try and get a real-life picture of the performance of those models, I'm predicting a short sentence, which is actually a [indiscernible] with you here. I'm predicting the same sentence as an 8-sequence batch. And then I'm predicting a long sequence, a much longer token sequence, which I believe is about 128 tokens. The short one is about 16. And again, I'm batching the long sequence, okay? So we've got a short sentence and a long sentence, each with batches. And I'm measuring the prediction time and computing both the mean and the P99 percentile. So hopefully, that's close enough to real life. What I'm doing for each model is very simple. I'm building a pipeline for sentiment analysis, right? And then I benchmark the inference time 4 times, for the short sentence, the long sentence and the batched versions, okay? And this is the vanilla pipeline, right? So here, I'm using vanilla PyTorch, and that's okay. I'm already getting a nice speedup compared to Ice Lake CPUs. But then, using Optimum Intel, model compilation and the bfloat16 data type for inference, I do the same, right? I create the optimized pipeline and predict the same things all over again. And that's what we see running here. And again, I am just running DistilBERT in the interest of time. What we see generally is, for the short sequence, about 5 millisecond latency, right? There's a bit of an artifact here on P99; it should be a little bit lower. For the long sequence, I get about 9, and then, of course, batching takes a little longer, okay? So that's already pretty fast for CPU inference. Even if you don't use Optimum Intel and bfloat16 and all that good stuff, you already get single-digit millisecond latency for those models, right? But then if you switch to Optimum Intel, look at what you get, right?
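The measurement approach described here, timing repeated predictions and reporting the mean and the P99 latency, can be sketched with the standard library alone. This is not the script from the demo: the function names, the warm-up count and the nearest-rank percentile definition are assumptions, and the workload in the usage line is a stand-in for a real pipeline call.

```python
import math
import statistics
import time
from typing import Callable, List, Tuple


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(predict: Callable[[], object], runs: int = 1000, warmup: int = 10) -> Tuple[float, float]:
    """Return (mean, P99) latency in milliseconds for a zero-argument callable."""
    for _ in range(warmup):  # warm-up calls are not measured
        predict()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return statistics.mean(latencies_ms), percentile(latencies_ms, 99)


# Stand-in workload; with a real pipeline this would be
# something like: lambda: classifier(sentence)
mean_ms, p99_ms = benchmark(lambda: sum(range(10_000)), runs=200)
print(f"mean {mean_ms:.3f} ms, P99 {p99_ms:.3f} ms")
```

Reporting P99 alongside the mean matters because a few slow predictions barely move the mean but are what users notice as tail latency.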
Now we're sub 2 milliseconds, okay? So that's almost a 3x speedup here, okay? And even with long sentences, we're at 5 millisecond latency, okay? And the batched versions also run very fast, okay? So that's all it took, right? Install IPEX, install Optimum Intel, use the optimized pipeline and literally get close to 1 millisecond latency for short sequences and about 5 millisecond latency for long sequences, right? So looking at numbers, again, right? We should be back to slides. This is what we see, right? This is actually from a blog post that you will find on the Hugging Face blog at huggingface.co/blog. We can see the difference: the blue bars are the Ice Lake instance, the red bars are the Sapphire Rapids instance with the vanilla pipeline, and the yellow bars are the Sapphire Rapids instance with the Optimum pipeline, right? And you can see that from one generation to the next, we get more than a 3x speedup for short sequences, which is really cool. The blog post has detailed numbers also for BERT-Base and RoBERTa. So go and run that code on your own models and see what you get. But you can see how nice the speedup is with such minimal effort. And I'm lazy, so I just love getting that performance by just changing a couple of lines of code, right? So what about weird models like Stable Diffusion? Do we also get a speedup with those crazy text-to-image models? Again, let's give it a try. Let me share my screen again. And this time, we'll go to the Hugging Face hub, okay? This is a really, really cool Hugging Face space built by Intel, okay? I just need to enter my Hugging Face token to log into this. Yes, I agree, okay? And what this shows you is generating an image with Stable Diffusion simultaneously on an Ice Lake machine and on a Sapphire Rapids machine. Let's just run this one: a photo of an astronaut riding a horse on Mars, why not? Okay, generate the image. So the top panel here is generating on Sapphire Rapids.
And it's done. Wow, 5 seconds or something for CPU-based Stable Diffusion. Try this on your local machine. And of course, on the Ice Lake machine, it takes a bit longer. From one run to the next, it could generally be anywhere from 5x to 7x slower. So let's see, it should show up. Yes, there you go, 33 seconds. So about a 6x speedup is right. Again, when we say you can deploy large models on CPU, here's a good example, right? Stable Diffusion is a reasonably large model. And yet, you can generate high-quality images in 5 seconds, right? So that's a pretty good number. Again, you can go and try this with your own images. The space name is Intel, Stable Diffusion Side by Side, on the Hugging Face hub, okay? All right. Let's continue with our story here. As you can see, inference is enjoying very, very nice speedups thanks to Sapphire Rapids. Now what about training? Well, these are, again, benchmarks from Intel on fine-tuning models. We see a DistilBERT model fine-tuned on the Internet Movie Database movie review dataset, then DistilBERT on the SST-2 dataset, and we have a computer vision model trained on a medical dataset, okay? The blue bars are the Sapphire Rapids instance, and the green bars are the A100 GPU, okay? These are training times in minutes, so lower is better, okay? We can see that, obviously, GPUs are still quite a bit faster here on the DistilBERT example. But the ratio is not as detrimental to CPUs as it used to be. So if you have no particular incentive to train extremely fast, CPUs can be an extremely interesting option and quite a cost-effective one, because maybe they're 2.5x slower here or something, but they could be quite a bit cheaper. And so again, if you look at the cost-performance ratio, it's quite possible that CPUs are a much better option.
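The cost-performance argument can be written down as simple arithmetic. In this sketch, the 2.5x slowdown is the rough figure quoted in the talk, while the hourly price is a purely hypothetical placeholder and not a real quote from any provider; substitute your own numbers.

```python
def training_cost(minutes: float, price_per_hour: float) -> float:
    """Cost of one training job billed by the hour, prorated by the minute."""
    return minutes / 60 * price_per_hour


def break_even_cpu_price(gpu_price_per_hour: float, cpu_slowdown: float) -> float:
    """Hourly CPU price below which the CPU job is cheaper than the GPU job.

    If the CPU job takes `cpu_slowdown` times longer, it costs less overall
    as long as its hourly price is below the GPU price divided by that factor.
    """
    return gpu_price_per_hour / cpu_slowdown


# Hypothetical placeholder price, for illustration only.
GPU_PRICE_PER_HOUR = 10.0
CPU_SLOWDOWN = 2.5  # rough ratio mentioned in the talk

limit = break_even_cpu_price(GPU_PRICE_PER_HOUR, CPU_SLOWDOWN)
print(f"CPU training is cheaper below {limit:.2f} per hour")
```

This ignores start-up time, idle capacity and whether the servers can be reused for other workloads, which are exactly the factors raised in the talk in favor of CPUs.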
If we look at this particular one, again, we see a small advantage in favor of GPUs, but do you honestly care whether you train in 0.45 minutes or 0.7 minutes, especially if that CPU instance is much more cost-effective? Training times don't tell the whole story. You have to look at cost performance, and at whether there's any actual business benefit in training faster. A lot of times, there isn't. So you could just wait a little longer, it makes no difference, and save a whole lot of money, right? And looking at this one, we see, in fact, that on this particular model and dataset, the CPU training is a little bit faster than the GPU training. So there you go, the gap is definitely closing. And if you look at cost performance, not just training times, you could have a very nice surprise if you try CPUs. Obviously, I encourage you to try the new Sapphire Rapids CPUs for your fine-tuning jobs. So let's give it a shot. First, we're going to do something pretty cool. We're going to do few-shot learning. Let me share my screen again, and you'll see why few-shot learning is actually a great candidate for CPU training, okay? Let me clear this stuff out of the way. Let's just move to the other environment. And again, let me launch this thing, and while it runs for a minute or 2, I'm going to explain it. Okay. So what are we doing here? Well, we are doing few-shot learning for text classification. So what is few-shot learning? Before you even train a model, you need a dataset, right, that's obvious. And that means, of course, extracting that data from whatever back end it lives in, cleaning it and labeling it, okay? And we know from experience that we need very large labeled datasets for supervised learning, okay? That's time-consuming, that's expensive and, let's face it, it's not very fun to label data.
So few-shot learning is a new technique that lets you train, or I should say fine-tune, a model with just a few samples. And when I say a few, what do I mean? Well, in this example, we are using 8, you heard that right, 8 labeled samples per class. My dataset is the Yelp polarity dataset, so it has 2 classes. We have highly polarized restaurant reviews or doctor reviews or business reviews. So I'm just taking 8 labeled samples from that dataset for each class, as you can see here, okay? That means I'm just taking 16 random samples from that training dataset. And that's all I'm going to use, right? So I'm literally pretending I only have 16 labeled samples to train my model, okay? And then, starting from a sentence transformer model from the hub and using the SetFit library, I create a SetFit trainer object, passing the model, the training set, the evaluation set, my metric, my batch size, how many iterations I want to use, which actually impacts the number of sentence pairs that the model will generate from my 16 samples to augment the dataset and then learn from it, and the number of epochs. And then I just call train, okay? So again, imagine yourself labeling 16 samples and running this, right? That's what we're doing here. Let me switch to the left-hand side here. So 16 training samples, 10,000 evaluation samples, okay? And our evaluation is running. I'm hoping it's not going to take too long, okay? Those 16 samples are actually combined in different ways to generate 512 actual training samples. If you are not familiar with few-shot learning and SetFit, I encourage you, again, to read the SetFit blog post on our Hugging Face blog, okay? And we see we train one epoch in 2 minutes and 25 seconds, right? So that's why I think few-shot learning, and SetFit in particular, is a great fit, no pun intended, for CPU training, because we're using so little data to begin with.
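How 16 labeled samples can turn into 512 training pairs can be illustrated with a simplified sketch of contrastive pair generation. This is not SetFit's actual implementation: the function name is hypothetical, and the value of 16 iterations is an assumption chosen because it reproduces the 512 figure (16 samples x 2 pairs x 16 iterations); the talk does not state the setting used.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str, float]


def generate_pairs(texts: List[str], labels: List[int], iterations: int, seed: int = 0) -> List[Pair]:
    """For every sample and iteration, emit one same-class and one cross-class pair.

    Same-class pairs get target 1.0 (should embed close together), cross-class
    pairs get 0.0. Requires at least two classes and two samples per class.
    """
    rng = random.Random(seed)
    pairs: List[Pair] = []
    for _ in range(iterations):
        for i, label in enumerate(labels):
            same = [j for j, other in enumerate(labels) if other == label and j != i]
            different = [j for j, other in enumerate(labels) if other != label]
            pairs.append((texts[i], texts[rng.choice(same)], 1.0))
            pairs.append((texts[i], texts[rng.choice(different)], 0.0))
    return pairs


# 8 placeholder reviews per class, standing in for the 16 labeled samples.
texts = [f"review {i}" for i in range(16)]
labels = [0] * 8 + [1] * 8
pairs = generate_pairs(texts, labels, iterations=16)
print(len(pairs))
```

The point of the sketch is the multiplication effect: the number of pairs grows with the iteration count, so a handful of labeled samples still yields hundreds of training examples for the sentence embedding model.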
In a couple of minutes, we can build a high-accuracy model, right? Sure, we could train on GPU, and maybe it would train a bit faster, although GPU start-up times could very well mean that we're actually training slower than this, who knows. But then again, the cost-performance ratio here is going to be much better than on a GPU instance, okay? So we should get accuracy anywhere between 93% and 94%. I will just wait for this thing to complete. If not, we'll just move on. In my slide, the actual example that I ran when I was preparing for this gave me 94.6% accuracy. So I think that's, there we go, 93.68. It wanted to contradict me. So how often do you train a 93% or 94% accuracy model in 2 minutes on a CPU? I would think not every day. So you get everything here. You don't need to label a ton of data. You don't need to fire up a complex GPU environment and fight with CUDA drivers. Just grab a CPU server and train. And thanks to SetFit and few-shot learning, you can get very good accuracy, right? Again, your mileage will vary depending on datasets and models, et cetera. But I think this example worked very well for me, and it made total sense to run it on CPU, right? The last demo I want to show you is distributed training, actual distributed training. So again, let me reset my environment, and I'll show you what we're going to do here. Okay. Here we go, and we don't need this one anymore. Maybe you're wondering what those windows at the bottom here are. Those two top windows are the same machine. They're actually running on the master instance that we're going to use to launch distributed training from. And those 3 windows, as you can see from the IP addresses, are 3 separate nodes, okay? So in total, my training cluster has 4 nodes, okay? The main node and the 3 secondary nodes. And they're all the same. They're all the same instance type, again from the R7iz family.
So in this example, we're going to train. Let me set up the environment really quick, okay? And this one, I don't want to hide anything, is just the environment variables, right, the IP address for the master node and a couple of environment variables for the Intel distributed communication library, okay? So we're going to train, or fine-tune, a question answering transformer on those 4 nodes, okay? First of all, let's try 2 nodes before we go and scale. I want to run 2 training processes per node, and I want a total of 4 processes, okay? So I'm going to be training on the main node and one of the secondary nodes. What I'm going to do here is a bit of a mouthful, so let me break it down for you. Okay. This is what we're doing. And maybe I can close this, I guess we don't need it. We are using the run_qa script from the transformers examples, okay? And we're going to fine-tune the DistilBERT model on the SQuAD question answering dataset, okay? We don't want to use CUDA because there is no GPU on this machine. But what we want to do is, of course, use the Intel Extension for PyTorch, which is already installed, and we want to use the BF16 format, okay? So let's try and run this. If you're interested in the actual setup of the distributed cluster, again, it's all in the blog post, and you have the URL on the slides. You can find it, again, at huggingface.co/blog. And so what we see here is, obviously, the main node starting the job. And on node one, let's call it node one, we see 2 additional Python processes running there, right? Let's try and keep an eye on the training time. So on those 2 nodes, it's about, okay, let's call it 14.5 minutes. Yes, I guess we can agree to that. And let's not wait 13 minutes, okay.
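The script options described here can be sketched as a small argument builder. The flag names (`--no_cuda`, `--use_ipex`, `--bf16`) are the ones the transformers training arguments accepted around the time of this talk and may differ in later releases. The checkpoint name and the output directory are assumptions, and the distributed launcher and the communication library settings are deliberately left out because the talk does not spell them out.

```python
from typing import List


def run_qa_args(output_dir: str) -> List[str]:
    """Arguments for the run_qa example script: DistilBERT on SQuAD, CPU only, IPEX + BF16."""
    return [
        "run_qa.py",
        "--model_name_or_path", "distilbert-base-uncased",  # assumption: base checkpoint
        "--dataset_name", "squad",
        "--do_train",
        "--no_cuda",    # there is no GPU on these instances
        "--use_ipex",   # enable the Intel Extension for PyTorch
        "--bf16",       # train in bfloat16
        "--output_dir", output_dir,
    ]


def total_processes(nodes: int, processes_per_node: int = 2) -> int:
    """Total number of training processes across the cluster."""
    return nodes * processes_per_node


print(" ".join(run_qa_args("/tmp/qa_output")))  # /tmp/qa_output is a placeholder path
print(total_processes(2), total_processes(3), total_processes(4))
```

The second helper mirrors the demo's setup of 2 processes per node, which gives 4, 6 and 8 processes on 2, 3 and 4 nodes.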
Let's just make a mental note of that time, okay, about 14 minutes. And now let's go and try maybe 6 processes, okay? So obviously, I'm going to use the next node here. Now we see 2 processes on node 1 and 2 processes on node 2. And of course, we still have 2 processes on the main node. And we see now that the training time is just under 11 minutes, 10:30. It's going to converge in a minute or 2, okay? If you plot this, you will see near-linear scaling, meaning I'm adding a node and I get almost 100% efficiency thanks to this new node, right? And now let's try and add another node, so that will be 8 processes, and run this thing again. Now we should see 2 Python processes on every node: node 1, node 2, node 3 plus, of course, the 2 on the main node. And now we see the training time is going to be just under 8 minutes, 7:30, okay, 7:40, 7:30, okay? And again, trust me, if you plot this, you will see near-linear scaling, right? I stopped at 4 nodes because I couldn't get my hands on more, if you want to know. But I'm pretty confident that if I tried more nodes, and potentially bigger datasets to have longer training times, I would keep seeing linear scaling here. What this means for you as a machine learning practitioner is that if you have a bunch of CPU servers in your data center, or CPU instances on AWS or on another cloud, you can put them to work very easily. Again, the setup is not very difficult. All the steps are in the blog post. You can really do it in a few minutes. And you can add nodes very easily to that cluster. So if you have underutilized CPU servers in your platform, you can easily put them to good work running distributed training and distributed fine-tuning, just like that. And chances are you have way more CPU servers lying around than you have GPU servers lying around. So you could have 16 or 32 servers that you can use maybe for a day or 2 and then repurpose, like I said, for other workloads, just like that.
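The near-linear scaling claim can be checked with a few lines of arithmetic on the timings read off the screen in this demo (about 14.5 minutes on 2 nodes, 10:30 on 3 nodes and 7:30 on 4 nodes). The helper name is hypothetical, and the 2-node run is used as the baseline because no single-node time is given.

```python
def scaling_efficiency(base_nodes: int, base_minutes: float, nodes: int, minutes: float) -> float:
    """Measured speedup over the baseline divided by the ideal (linear) speedup."""
    speedup = base_minutes / minutes
    ideal = nodes / base_nodes
    return speedup / ideal


# Approximate training times in minutes, as quoted during the demo.
timings = {2: 14.5, 3: 10.5, 4: 7.5}

for nodes, minutes in timings.items():
    efficiency = scaling_efficiency(2, timings[2], nodes, minutes)
    print(f"{nodes} nodes: {minutes} min, efficiency {efficiency:.0%}")
```

An efficiency of 100% would mean that doubling the nodes exactly halves the training time; values slightly below that are what "near-linear" means here.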
So I think it's a very, very flexible and very cost-effective way to run training. I just wish I could use a few more servers, but that will come. We'll try to build a bigger cluster; maybe we can do that in the future, okay? All right. So that thing is going to train in just 7 minutes and 30 seconds, which again, from a human perspective, is way faster than you really need it to be, right? Again, how long would it take you to set up a GPU machine and a GPU environment? Way more than 7 minutes. So again, try it out, and you will certainly be surprised by the performance that you get here. All right. I'm almost done. Let me quickly tell you about Hugging Face in general, as a quick recap, or maybe as a first intro if you've never heard of us. We have way more than 100,000 models on the hub. We have over 20,000 datasets as well. You can use our open source libraries like Transformers and Optimum, which we looked at today, to train, predict and optimize them. We have additional libraries like Accelerate for distributed training at a lower level on different hardware platforms. You can use the Diffusers library for Stable Diffusion models, et cetera. And we have a no-code AutoML service called AutoTrain that lets you train models for NLP and computer vision tasks in just a few clicks. So very, very interesting. Spaces, which I'm pretty sure you've seen, and we saw that cool Intel space with the side-by-side diffusion models, lets you build and host machine learning demos using frameworks like Gradio or Streamlit. And then when it comes to deploying in production, you can deploy anywhere using the open source libraries and build your container or your prediction API any way you like. Or we have a production-grade, scalable and secure service called Inference Endpoints that runs on AWS and Azure. I highly encourage you to check it out.
We got really good feedback from customers. In just a few clicks or one API call, you can deploy any model from the hub on secure, scalable cloud infrastructure. So we think we mostly solved model deployment with this one, but go and check it out, right? Last but not least, we have partnerships with AWS on Amazon SageMaker. So Hugging Face is a first-party framework there. And we also have a partnership with Microsoft on Azure, and you'll find the Hugging Face endpoints in the Azure marketplace. And again, you can, in a few clicks, deploy models from the hub to managed infrastructure on Azure, okay? Well, that's really what I wanted to tell you. So here are a few links to get you started. I guess it's the one you want to screenshot. So the tasks page, to understand what all those transformer tasks are, if you're not sure what zero-shot classification or semantic segmentation are. Well, we can explain it in plain English. We have a really, really cool course, completely free, that takes you from Hugging Face newbie to Hugging Face expert, I think, in no time, with plenty of code to run. If you have questions, you can go and ask them in the Hugging Face forum. And if you'd like commercial support, consulting, custom training and generally, if you need help with bringing high-quality transformers to production, we can support you, and we have an expert team of machine learning engineers and researchers to solve all your problems there. Go and check out the Intel page on the Hugging Face hub as well. They have lots of interesting models for you to work with. If you're interested in the hardware side of things and the Optimum work we're doing with them, you have some more links here. And of course, we're happy to be featured on the Intel website with, again, [ deep ] content, right? And if you'd like to stay in touch, you can find me on Twitter while it's still running. LinkedIn, of course, happy to connect there.
I have a blog on Medium and a YouTube channel with lots of Intel-related videos as well. So I hope to see you there, and I'm looking forward to questions. Well, that's it for me. Thank you very, very much, again, Ke and the Intel team for having me today. Thank you, everybody, for listening to me. I hope you learned a few things. And I think now we have some time for questions, right, Ke?
Ke Ding
executiveYes. Yes. Thanks, Julien. An excellent presentation and such great demos, right? And so my name is Ke Ding, I'm Principal AI Engineer and Engineering Director at Intel. So I will be your moderator today for the next Q&A session. So while Julien was presenting, we already received many excellent questions. So we will now address them one by one. And please continue to submit through our platform. So if your question is not answered through our live session, we'll address it offline after the webinar, okay? Now let's go to the first question. Yes, so people ask what are the advantages of training or running inference with transformer models on Intel hardware, right? Isn't running them on GPU 10x faster at least?
Julien Simon
attendeeProbably not. I think we showed some benchmarks where that 10x number is just not a reality anymore. Of course, like I said, depending on datasets, depending on models, depending on batch sizes and training configurations, your mileage will vary. But I think that 10x number for a lot of models is a thing of the past. And again, people -- I understand why people look at training time as a key metric, right? And it's certainly an important metric. But as an organization, the more important topic is the total cost of ownership and the cost-performance ratio. A lot of times, you -- there is no strong evidence that training faster benefits your organization and your business workflow. I mean, yes, there are some particular use cases where you need to retrain models every hour and model freshness is super important, but that's such a tiny minority. I mean a lot of folks, they train once a month. So it doesn't really matter, or sometimes less than that, right, once a quarter, because it's such a large effort then to get the model to production and to get it approved by the compliance teams, et cetera. So they don't have the luxury of training every hour and deploying every hour. So then it doesn't really matter if you're training in 2 hours or if you're training in 4 hours, especially if you factor in the cost of the underlying hardware. So again, I understand the question, I understand the concern for training fast. But if you zoom out, you will see that for a lot of workloads, a lot of models, the better answer is actually to train, and especially to fine-tune, on CPU. It will make no difference to your agility and certainly, it will benefit your budget, your infrastructure budget. But again, that's just me. Go and run your own tests, but I think in a lot of situations, CPUs are very strong contenders.
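To make the cost-performance argument above concrete, here is a small Python sketch of the underlying arithmetic. It only uses the "2 hours versus 4 hours" figures mentioned in the answer; the function names are hypothetical, and the hourly prices in the usage lines are placeholder units chosen for illustration, not real quotes for any hardware.

```python
def run_cost(hours: float, hourly_price: float) -> float:
    """Cost of one training run: wall-clock hours times price per hour."""
    return hours * hourly_price

def break_even_price_ratio(fast_hours: float, slow_hours: float) -> float:
    """Price ratio (slow platform / fast platform) at which both runs cost the same.

    The slower platform is cheaper per run whenever its hourly price,
    divided by the faster platform's hourly price, is below this value.
    """
    return fast_hours / slow_hours

# A 2-hour run versus a 4-hour run, as in the example from the talk.
ratio = break_even_price_ratio(fast_hours=2.0, slow_hours=4.0)
print(f"Slower platform wins on cost if its hourly price is under {ratio:.0%} of the faster one")

# Placeholder prices in arbitrary units (assumptions, not real pricing).
fast_price, slow_price = 3.0, 1.0
print("fast run cost:", run_cost(2.0, fast_price))
print("slow run cost:", run_cost(4.0, slow_price))
```

The point of the sketch is only that training time alone does not settle the question: the slower run is the cheaper one as soon as the price ratio falls below the break-even value.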
Ke Ding
executiveYes. And thanks, Julien. So we talked a lot about training and fine-tuning, but how about inference?
Julien Simon
attendeeWell, for inference, I would be even more of a believer. And I mean, it's nothing new. I remember even 4 or 5 years ago, I was already -- I was working for AWS, no big secret. And I was already running some benchmarks, particularly for natural language processing, RNNs, LSTMs, et cetera, models that didn't exhibit a very high level of parallelism. And they did already run faster on CPU than on GPU. So yes, you needed to tweak and you needed to know what you were doing. But you could already get great performance on CPU platforms at much lower cost. And now I think with Ice Lake and Sapphire Rapids even more so, like I said, the gap is closing. And I think for NLP, in particular, NLP models, as a CTO, I would absolutely try CPU platforms first. I would only go to GPU if my business partners showed me that their KPIs were negatively impacted by high latency on CPU platforms. For large models, larger models, big computer vision models, I think there's still a little bit of a way to go for CPU inference. But you saw with the Stable Diffusion example that it's doable. I mean you can generate images in 5 seconds. So maybe that's just fine for your business use case. If you need to generate in 1 second, then okay, use GPUs. But again, each use case needs to be looked at independently. What are the training requirements, what are the inference requirements, what's the budget, what's the incentive to train faster or run inference faster? Is there any -- what's the KPI, what's the business KPI that actually shows that going faster helps? All these are great questions. And if you answer them in good faith, you will see that a lot of times you can use CPUs for inference and not look back.
Ke Ding
executiveYes. Julien, during your presentation, you talked a lot about CPU-based transformer models, right? So accuracy is good and the performance is good, right? Does it mean support for transformers on other Intel platforms is not planned at all?
Julien Simon
attendeeI'm sorry. Is it not?
Ke Ding
executiveIt's not planned at all. So like we have other platforms, let's say, like we have Gaudi, and we have GPU and do we have any demand for the...
Julien Simon
attendeeWell, Gaudi is a whole different story. And I think it's great. A few years back, you had GPUs, and that's it, okay? And fair enough, GPUs are awesome, don't get me wrong. And the company that's building them is awesome too. No worries there. But as an engineer, I just love to have choice. I love to have the freedom of choice. If you give me only one tool, then I'll use that because I don't have another option. But you know what they say, if all you have is a hammer, then everything looks like a nail. And I think, and no offense to anybody, feel free to disagree. I think there are a lot of practitioners who actually see things that way. They think GPU is the only option out there. So they run everything on GPU. Sometimes it makes sense, sometimes it doesn't, and they end up spending way more than they should. So now there's more diversity. We still have great GPUs. We have Habana Gaudi, which is really great. And we covered it in another Intel webinar. And now we have Intel CPUs and the new Xeons joining the party, being very strong contenders for inference and even becoming contenders for small to midsized fine-tuning. So that's great because for us, engineers, we can rightsize our infrastructure to the use case. And again, one size never fits all. So do your homework, experiment, and you will be surprised at the performance and the cost optimization that you can achieve. And it's great news. If you optimize your budget, then you could say there's more money for Friday night beer or there's more money for machine learning projects or hiring new people. So efficiency is important. Energy consumption is important, et cetera, et cetera. I think we all need to be aware of that. And I, for one, am very happy to have more tools in the toolbox and be able to find the one that fits my needs exactly.
Ke Ding
executiveYes. Thanks, Julien. Let's go to the next question. So you showed the multi-node distributed fine-tuning, right, that example. So in that example, we use BF16, right? You mentioned BF16. Do you know if training suffers from, like, BF16 or whatever, like convergence...
Julien Simon
attendeeThat's a really good question. So we all know FP32, right? It's been around forever. So it's a 32-bit format, high accuracy, et cetera, et cetera. So FP16 was introduced because it was faster, okay? And so we could train and infer faster with FP16 and with minimal loss of accuracy. But the problem with FP16 is it has fewer -- it can represent a smaller range of values than FP32, right? So the dynamic range of values is smaller. And so that means we can have overflow issues when we switch from FP32 to FP16. So to fix that, BF16 was introduced. So it's still a 16-bit format, but it can represent the same range of values as FP32 with coarse-grained -- coarser granularity, excuse me, but still the same range. So BF16 is very interesting for inference. Of course, again, you should run your own tests. But we ran those benchmarks on FP32 versus FP16 versus BF16. And the blanket statement is no, you're not going to lose any accuracy. So I would still recommend that you train -- if you can train with FP32, that's great for initial training, maybe fine-tune with FP16 to fine-tune faster. But for inference, BF16 is a very interesting format, especially again with that AMX extension in the new chips. So if you do FP16 right now on Sapphire Rapids, you're not going to get AMX. If you do BF16, you will get it. So run your tests, compare the different versions of the model, see if you see any meaningful difference in accuracy and if you see a different distribution of predicted values. Maybe you do, maybe you don't, I can't say, but I think the consensus is no, you're not going to see a lot of differences.
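The range-versus-granularity trade-off described above can be seen with a few lines of standard-library Python. This is a minimal sketch, not how any framework implements these formats: FP16 is emulated with the `struct` module's IEEE half-precision format, and BF16 is emulated by rounding a float32 bit pattern to its top 16 bits. The helper names `to_fp16` and `to_bf16` are hypothetical.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; values out of range become infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range (max is 65504).
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x: float) -> float:
    """Round x to the nearest BF16 value by keeping the top 16 bits of float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 16 bits that are dropped.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Range: 70000 overflows FP16 but stays finite in BF16.
print("FP16(70000) =", to_fp16(70000.0))
print("BF16(70000) =", to_bf16(70000.0))

# Granularity: FP16 keeps more mantissa bits than BF16 near 1.0.
print("FP16(1.001) =", to_fp16(1.001))
print("BF16(1.001) =", to_bf16(1.001))
```

The first pair illustrates the overflow risk when moving from FP32 to FP16; the second shows the coarser granularity BF16 accepts in exchange for keeping the FP32 exponent range.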
Ke Ding
executiveOkay. And I received many, many questions. And so let's go to the next one. So the next one will be very simple. So which framework enables the [ multi ] instance training that was shown as part of the demo?
Julien Simon
attendeeSo all the models that I used were PyTorch. And the reason for this is because the most popular transformer models are PyTorch models these days. No, we don't discriminate against TensorFlow at all. We like TensorFlow too. But it still happens that a lot of research teams use PyTorch and the most interesting models are PyTorch models. And of course, Intel has this great Intel PyTorch extension, and it fits very well together. But obviously, yes, you could do the same with the TensorFlow transformers.
Ke Ding
executiveYes. So the multiple framework support, right? So it's [indiscernible] transformer, [indiscernible] everything so that the developer experience, whatever the underlying framework, TensorFlow or PyTorch, will be the same. Okay. Let's move to the next one. So could you explain briefly the transformer compared to the basic [ Xeon ] models? What is the major difference? Why are transformers becoming so popular?
Julien Simon
attendeeDo we have another 2 hours? Okay. No, just kidding. The basic building block of transformers is a mechanism called attention. And there is a super famous paper called Attention Is All You Need. And it is a research paper. So it's a bit of an involved read, but I really encourage everybody to give it a shot. And in a nutshell, attention is a mechanism that helps learn the context to the left, right? So in the past of the current token you're working with in the sequence, and to the right, okay? And so you can look. If you're trying to, let's say, translate from English to French, okay. You don't translate word for word. That would sound very awkward. So you look at the French sentence, okay? And when you translate a word, so English to French, I said, okay, so you look at the English sentence and you look at one given word. And of course, you look at the words that come after it and the words that come before to pick -- to get some context and pick the right translation, okay? So the attention mechanism automates this process at scale so that every single token in the sequence is evaluated in the context of all the tokens that precede it and all the tokens that follow it, okay? And so this makes it easy, I should say, efficient to learn very long sequences and very long dependencies, okay? And that's why transformer models are so efficient, because they're good at extracting context from long sequences, both with past tokens and future tokens, which is something that typically RNNs and LSTMs struggle to do. And the crazy thing about transformers is here, I'm talking about text, but the same story applies to images. And we quickly mentioned the Vision Transformer, and the Vision Transformer breaks down an image into small patches and then flattens those patches and treats them as sequences. So literally pixel sequences. And so now they become tokens again and the context can be extracted as well. So that's really the core of transformers.
This attention mechanism makes it efficient to work with long sequences and understand really long-term dependencies in the future and in the past. That's the 5-minute explanation. It's as good or as bad as I can make it. But yes, I really encourage you to go and read the Attention Is All You Need paper. If that's too involved, you'll find lots of good blogs that break it down for -- in plain English. Or maybe you can ask ChatGPT to explain it in plain English, who knows? Maybe that works?
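For readers who want to see the idea above in code, here is a minimal pure-Python sketch of scaled dot-product self-attention. It is deliberately simplified and is not how production models implement it: it assumes the token vectors themselves act as queries, keys and values (no learned projection matrices, no multiple heads), and the toy 2-dimensional "embeddings" are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Simplified self-attention: tokens serve as queries, keys and values.

    Returns (weights, outputs). weights[i][j] is how much token i attends
    to token j, for every j before AND after i in the sequence.
    """
    d = len(tokens[0])
    weights, outputs = [], []
    for q in tokens:
        # Score the current token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # New representation: weighted mix of all token vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return weights, outputs

# Toy 3-token sequence with made-up 2-dimensional vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = self_attention(tokens)
for i, row in enumerate(weights):
    print(f"token {i} attends to:", [round(w, 3) for w in row])
```

Each row of `weights` is non-zero for every position, which is the sense in which each token is evaluated in the context of both earlier and later tokens. Real transformer layers add learned projections, multiple heads and positional information on top of this.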
Ke Ding
executiveYes. That's a good idea. Thanks, Julien. I think we are already 8 minutes beyond our time. So once again, thank you so much for presenting this great work done by the Hugging Face and Intel joint team. And I hope today's session is helpful for all the NLP transformer developers. And so we welcome any kind of contribution to our project. And so don't hesitate to contact us through our GitHub issues, through e-mail, Intel Developer Zone, Hugging Face Community or maybe even your support channels, right? And that's it. Now I'll turn it back over to Austin to conclude our talk today.
Austin Webb
executiveThank you, Julien and Ke for the presentation today. A recording of this webinar will be e-mailed out within the next 2 weeks. You can view a replay at any time using your attendee link. Also a quick reminder to please be sure and complete the short survey. It's your chance to help shape the series, tell us what you want to hear about and what we can do better. Thanks to all of our attendees for your great questions and engagement. You can check out our event website. There is a link in the resource box. This will have a full list of all of our upcoming and past workshops and webinars. Thanks so much for joining us today, and we will see you next time.