Appen Limited (APX) Earnings Call Transcript & Summary

May 20, 2021

Australian Securities Exchange | AU | Information Technology | IT Services | Investor Day | 167 min

Earnings Call Speaker Segments

Mark Brayan

executive

#1

Hello. Good morning, good afternoon, good evening, everybody, and welcome to Appen's Investor Technology Day. Thank you very much for joining us. Our agenda today covers a number of topics. I'll provide the briefest of introductions, and then I'll hand it over to my colleagues to give you, first of all, an update on the AI market; and secondly, an update on our technology, including some demonstrations. We'll have a period for Q&A at the end of it, and you'll be able to answer -- or sorry, enter your questions via the app. And then we'll close and we'll be all done prior to 2:00 p.m. Sydney time today. I'm joined today by Wilson Pang, our CTO, who's coming live from California; and Ryan Kolln, our Head of Corporate Development. Both Wilson and Ryan are far more interesting than me to talk to, and they'll be doing the bulk of the talk today. Wilson is an engineer with an extensive background in search and artificial intelligence. He worked for IBM and for eBay for many years. At eBay specifically, he worked on search, and that gives him a lot of expertise in artificial intelligence. And then he was Chief Data Officer for Ctrip, a travel company, where he built many, many models to help that business use data more effectively and grow and thrive. Ryan Kolln, also an engineer, has worked for telcos here and -- or sorry, in Australia and in the U.S. and has had a stint with the Boston Consulting Group advising technology companies on growth strategy. Today, our theme is all about our transformation into being an AI-powered provider of AI data and solutions. Our talk today will tell you why this is important and how we are going about it. First of all, though, just to recap on some of the things we covered yesterday, how we got to this point and the evolution of our business. When I joined Appen 6 years ago, we were a leading provider of language data. 
We've evolved over that time to be a provider not only of language data but also of training data for all AI use cases, including all AI data types, so speech, natural language, text, relevance, image, video, three-dimensional data, including LiDAR, for example. So we've moved quite a lot from our initial position as a provider of language data. We're also evolving our service model -- sorry, our delivery model, essentially service-led now to being more product-led, and you'll see a lot about our products today. From a revenue perspective, we are moving over time to do more committed revenue rather than just all project revenue, and that obviously goes to revenue visibility and earnings quality. From a customer perspective, we are still very concentrated to our largest customers, as many of you know. But we're working to win many new customers to diversify our customer base over time. Yesterday, we announced a change to our organizational structure, from one that is functional to one that is more aligned to our customer cohorts. We now have 4 P&L customer-facing business units: Our global business unit, which serves our 5 largest tech companies, the U.S. technology giants; our enterprise business unit and then our business units in China and the government sector. And then finally, yesterday, we announced some changes to our reporting. We were reporting in -- by data modality, relevance and speech and image, in Australian dollars. Now we're reporting more by our customer segments and other strategic areas of interest such as our new markets, which include the Enterprise, China and Government segment but also the revenue that flows through our products from our major customers as well. To look back at the passage of time, when I joined the business, again, the majority of our revenue came from the global customers. And we provided services to them essentially on their platforms. We acquired the Butler Hill Group in 2011 and then Leapforce in 2017. 
And with Leapforce, we also gained Appen Connect, our crowd management program, that helps us manage the crowd resources at scale for our customers. In 2019, we made an important acquisition with Figure Eight, which gave us our own annotation platform. And this provides a number of opportunities for us: First, to sell to customers that don't have their own annotation and data preparation technologies. Secondly, it gives us the opportunity to do more types of work, so the platform covers all data modalities. And finally, we use it ourselves to improve the efficiency of our own operations. Most recently, we invested in the expansion of our business beyond our global customers with the addition of the business units I just mentioned, enterprise, Government and China, all of which is fueled by our platform and all of which requires the technology that we've invested -- that we've acquired and invested in over the last few years. So increasingly, we'll be a product-led organization. Our products give us the opportunity for scale, for quality, for productivity and underpin the growth of the business going forward. We are also, as discussed yesterday, increasingly customer-centric with our 4 customer-facing business units. That's not the topic for today. It's more the topic around product. And on that, I'll hand it over to Ryan to take it from here and provide the AI market update. Thank you, Ryan.

Ryan Kolln

executive

#2

Thanks, Mark. I'm going to talk about the AI market today, but I'm going to start by talking about the AI application life cycle. So this is a useful grounding for us to think about and particularly the role that we play with our customers to help them build AI-enabled applications. So on the left-hand side, we see a very typical view of how a customer might build an AI application. It all starts with the business need, so the hypothesis around what the application is going to deliver. But typically, the first step is to collect and bring together the available data to base the model on. So that could be in-house data or it could be data that is collected bespoke for the application. The second part is the preparation of the data. So it's one thing to have the data. It needs to be in the right format, with the right labels, for the AI models to be trained on the data. So that, for our side, typically involves a lot of data labeling, and it's a big part of the role we play. So once the engineers have data that is ready for the model build, the next step is to build the model. So this step typically involves the selection of modeling techniques. There is a wide variety of different approaches that can be used to train models. But once that's selected, they apply the training data and build the model. Next is testing. So once the model has been built, does the outcome of the model meet the requirements and support the business need? And in the case that it does, then the model will be put through to deployment. So that's actually when it's put into the application, deployed in the real world. There's an interesting side loop here: for some applications that may be of high criticality, where the confidence of the model is not where it needs to be, there'll be a human in the loop. So that is effectively where low confidence predictions are routed to humans who will make the decision and then that closes the loop. 
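[Editorial illustration, not part of the spoken remarks.] The human-in-the-loop side loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the `Prediction` type, the `route` function and the 0.9 threshold are hypothetical names and values chosen for illustration, not anything Appen or its customers are known to use.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model confidence in [0, 1] (assumed scale)


def route(predictions, threshold=0.9):
    """Split predictions into auto-accepted ones and ones sent to human review.

    The threshold is an illustrative assumption; a real system would tune it
    to the criticality of the application.
    """
    auto, review = [], []
    for p in predictions:
        # Low-confidence predictions go to a person, who makes the decision.
        (auto if p.confidence >= threshold else review).append(p)
    return auto, review


preds = [
    Prediction("a", "car", 0.97),
    Prediction("b", "car", 0.55),
    Prediction("c", "pedestrian", 0.91),
]
auto, review = route(preds)
print("auto-accepted:", [p.item_id for p in auto])
print("human review:", [p.item_id for p in review])
```

The human decisions collected from the review queue can then be fed back as new labeled examples, which is the sense in which the loop is closed.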
Monitoring is very important in AI models. There's an adage that it's not a question of if a model will degrade over time, it's when. We'll talk about this a bit more later. And once models get to that point, that it's degraded below acceptable performance, it goes back to acquiring more data. This is a bit stylized. In reality, it's a lot messier, to be honest. There are many iteration loops that can occur. The most common ones are around the testing phase. So an engineer will build a model. They'll test the application. It may or may not work. If it doesn't work or if it doesn't get the performance that they're looking for, they'll either acquire more data, prepare and label more data, or try different model building techniques. And we'll talk through some of the differences across those approaches later today. So to simplify the AI life cycle and the AI model development approach, an AI model consists of 2 main parts: The model instructions and training data, where the model instructions are an architecture for the model to learn. So it's not saying, here's the output. It's saying, here's a guideline for how the model should learn once training data is applied. And this can be as little as 10 lines of code in some instances. The next important part for training -- for model development is the training data. So these are the examples that the model learns from. And it's typically the more and the higher quality training data, the better. So I think it's helpful to contrast AI development to normal software or traditional software development. So in a traditional software development sense, you'll have an idea of the outcome that you're looking to get to. And you write code, and it's deterministic. And by deterministic, it means every time that application is run with the same set of inputs, it's going to deliver the same outputs. You test the code. You deploy and monitor. And if there are any changes, you rewrite or you edit the code. 
In AI model development, it's different. The labeling of the data is the really important part, so that training data composition. The provision of the instructions, i.e., the architecture that we spoke about, is used to write the code and then the model is tested. So you can see the difference. In traditional software development, writing the code is the most important step. In AI model development, it's the gathering and the labeling of high-quality training data. So it's interesting to think about what is an AI model. And I've put up an example here because AI models, they're really this indecipherable set of nodes, weights and biases that when you look at it from an outside-in standpoint, it makes absolutely no sense. So that's why when you hear about AI explainability and model debugging being really difficult, it's because the actual code that has been written as part of the modeling process is this highly complex system that's very difficult to debug. But the important part, and the bit that is able to be debugged to improve quality, is the training data. We've spoken about training data in past presentations. And obviously, it's core to our business. But breaking it down to really simple components, training data consists of 3 things. So firstly, it's the file. So you can think about that as the example. So it could be an image or a text file or a snippet of audio. Then there are attributes to the file. So it's really important as part of the training process to assign meaning to the file. So let's say the file was used for autonomous vehicles. The box drawn surrounding a car, saying within these pixels there is a car, that is the attribute of the file. The next is the attributes of the label. So this is the metadata, what time it was labeled, who it was labeled by, under what conditions. So we'll go through and step through a few examples of what training data actually looks like. 
So on the left-hand side here, this is an example of our LiDAR annotation tool. So LiDAR is used in autonomous vehicles. And you can think about it similar to a radar, where it's -- a pulse is sent out and received. And what it does, it allows the sensor to measure distance and different rough shapes in a 3D point cloud environment. You can see on the top left of that image what a standard camera sees. And in the dark blue points, that is the LiDAR frame. So in this instance, the task for our annotator has been, can you draw a 3D cuboid? So a cube around this car in the frame. On the right-hand side is the label. And this is what's called a JSON file, which is the actual -- the meaning. And I've highlighted a few sections here. So that first section to highlight in light red is the center of the cuboid. So it's saying in this dimension in space, there's a cuboid. The next part is the height, width and depth of the cuboid in meters. So in this space, there's a cuboid, and it's roughly 1.9 meters high by 1.9 meters wide and 4-and-a-bit meters long. So that's the cuboid. And then it's saying within that space, there's a car, right? Super -- so when we think about how this trains the system in the model, when there's a representation of these types of cuboids in a real-world environment, because it's been trained to look for cuboids, it will be able to say, "Okay, I know that there's a car in the space. I know the dimensions of the car, and I know how far it is away from me." Obviously, super important for autonomous driving. The third highlighted section in this space is interesting, and this is saying that in a 2D image, so in the image in the top left-hand corner, there is also a car. So it's actually -- in this LiDAR frame, it's blending together 2 different types of sensors into 1 set of training data. And this is called a sensor fusion. 
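[Editorial illustration, not part of the spoken remarks.] To make the label structure concrete, here is a minimal Python sketch of a cuboid annotation of the kind described above. The JSON field names, the file name, the center coordinates, the 2D box and the 4.2 m length are all assumptions made for illustration (the speaker only says "4-and-a-bit meters"); the actual schema shown on the slide is not reproduced in the transcript.

```python
import json
import math

# Hypothetical label for one LiDAR frame: the file, the attributes of the file
# (a cuboid plus a matching 2D box), and metadata about the label itself.
label_json = """
{
  "file": "frame_000123.pcd",
  "annotations": [
    {
      "class": "car",
      "cuboid": {
        "center": {"x": 12.0, "y": -3.0, "z": 0.9},
        "dimensions": {"height": 1.9, "width": 1.9, "length": 4.2}
      },
      "box_2d": {"x_min": 410, "y_min": 220, "x_max": 560, "y_max": 330}
    }
  ],
  "label_metadata": {"annotator_id": "worker_42", "labeled_at": "2021-05-20T01:00:00Z"}
}
"""

record = json.loads(label_json)
ann = record["annotations"][0]
center = ann["cuboid"]["center"]
dims = ann["cuboid"]["dimensions"]

# Distance from the sensor (assumed to sit at the origin) to the cuboid center.
distance_m = math.sqrt(center["x"] ** 2 + center["y"] ** 2 + center["z"] ** 2)
# Volume of the cuboid in cubic meters.
volume_m3 = dims["height"] * dims["width"] * dims["length"]

print(f"{ann['class']}: {distance_m:.1f} m away, {volume_m3:.1f} m^3")
```

Carrying the 3D cuboid and the 2D camera box in the same record is one simple way to represent the sensor fusion mentioned above.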
So a simple example, but you get the idea that this can be exceptionally complex, particularly when you're looking at hundreds of different objects in a frame, could be vehicles, pedestrians, bicycles, stationary -- other stationary objects. So it gets quite difficult to label, and the JSON file or the annotation is very difficult in itself. Another example of training data here is speech, so spoken audio. On the left-hand side is an example of our speech annotation tool. And here, you've got just one speaker. So a quite -- 2 speakers, sorry, a quite simple example, speaker B saying a few things, and then you've got speaker A. On the right-hand side, you see the JSON file. And it's a little bit simpler than the LiDAR frame, as expected. At this start time in the file and this end time, here are the words which are being spoken, all right? What's a little bit complex here and really important for the training of voice recognition systems is the noise associated with it. So you'll see that there's an insertion of certain types of noise. Here, we've kept it quite simple, just to say noise. But it can include quite specific things like a cough or a sneeze or an um. It's very important for training speech recognition systems. So training data quality is really important, and I think it's intuitive. As we think through the development of an AI model, it's providing a lot of examples. If those examples are wrong or not representative of the real-world state, the model is not going to perform that well. So low-quality training data leads to low-performing models. The thing is, though, that poor-quality training data is not always obvious, and there's many different types of quality issues with training data. And let's step through these a little bit. So we see 3 main buckets of problems. The first is where there's been an error in the labeling process. So this is on a specific label, there's something wrong with it. 
The next is more about the composition of the training data, and unbalanced training data is a really big issue. And that's where you may have overrepresented in some areas and underrepresented in others. And then that leads to a suboptimal performance in the models. The third is bias in the labeling process. So it may be that there's no errors in the labeling and it's balanced, but the individuals who have performed the labeling may have certain bias that leads, again, to suboptimal outcomes. And we'll go through some examples of all of these. So a really simple example here. Let's say, the task for the contributor is to draw a box around the cows, and that was the instructions provided. What we might be looking for, or in this instance, is a bounding box or a tight bounding box. It means that there's not a lot of space between the image of the cow and the box around these 3 cows. So the left is pretty clear. The cow in the middle is pretty clear. The one on the right is a little bit trickier because it's occluded. You can't see all of the cow. But the intent is that we just see -- the box is drawn around just the part of the cow that we can see. First type of labeling error that may occur is just for whatever reason, the contributor missed the cow on the right. And obviously, that's not great and pretty clear that, that's an error. The next could be the accuracy around the bounding box fit. So here, the contributor has been a little bit generous in the space that's provided outside around the cow. And while it seems somewhat trivial, it's actually really important because the way that models are trained, it's on pixel by pixel. So it needs to be as accurate as possible to get that best level of prediction. The next problem might be a misinterpretation of the instructions or bad instructions. So we said that the task was to draw a box around the cows. This -- it's not necessarily an incorrect interpretation by the crowd worker. 
But if you've got a few thousand labels where there's a box around each cow and then a few thousand where it's a box around all the cows, you can quickly see how this could lead to problems. We spoke about the cow on the right-hand side being occluded. Another potential error would be that the crowd worker could assume the length of the cow and draw the box around what it thinks the cow would look like and the actual size of the cow. But again, this isn't the desired outcome that we'll be looking for. So a bunch of errors in the labeling process, errors and misinterpretation. Another big issue in training data is -- it's what's called class imbalance. So class imbalance, you can think of it as: we don't have a representative set of examples in the training data. And we'll just talk through this a little bit more. So on the top, let's assume that we were building a model that was going to recognize cows. You put in a photo, and it returns the breed of the cow. If the training data was limited to the top row, it would be probably quite good at recognizing dairy cows on green grass with a blue background. As soon as you put in a different set of cows, so on the bottom left, you've got some white cows on pretty brown grass, some dairy cows on snow, I think the third one's a yak and then the right is a Texas Longhorn, and I think that's a bull rather than a cow, you quickly see how limiting the training data would have a significant impact on performance for this particular type of model. And class imbalance, this is a very simple and straightforward example, but this is a really big issue for the performance of high-quality models. Another type of class imbalance is around data recency. So we mentioned before that all models degrade over time, and that's because the real-world environment continues to evolve. 
And training data, unless you refresh it, it's static, and it represents a point in time. I've got an example here around a search result -- or the search result returns for corona. So obviously, in May 2021, corona, there's a lot of news articles and statistics around cases. If you did that same search result in April 2019, the top return is Liquorland, right? So you start to get an idea around how important recency is. I mean, this is an extreme version. But it is a problem for a lot of training data, particularly where the real-world environment is continuing to evolve and continuing to change. The next example talks about bias. And another stylistic example here, let's assume that one was trying to build a model to identify breakfast foods. And you asked a set of workers who are based in the U.S., "Can you look at each of these photos and tag which one is a breakfast food versus not?" So on the left, you've got black pudding, which is from what I hear quite acceptable in the U.K. for breakfast. In the middle is a hagelslag, which is sprinkles on toast from the Netherlands. And the right-hand side, you've got kind of our VEGEMITE. But someone in the U.S., probably unlikely that they would get these right. So it's a form of bias, and it's not intentional bias. It's just bias in -- because the crowd worker is not representative of all of humanity to represent all of the different types of breakfast foods that we see. So a lot of data sets require specific knowledge and/or context for accurate labeling. So we spoke about this equation earlier that an AI model is model instructions plus training data. What's really important is that a good AI model requires the model instructions plus high-quality training data. And our role in this AI life cycle is to deliver that high-quality training data. And we'll talk a lot more, particularly in Wilson's section, around how we're leveraging technology to do that. 
We -- the training data market is continuing to evolve, very quickly in some circumstances. So what we want to do now is talk through some of the trends that we see more specific to training data, and then we'll move on to some observations on the model development market overall. So there are 5 major things that we want to talk through today. So the first is that high-quality data remains a major roadblock for the development of AI. The second is that AI use cases are becoming narrower. And by narrower, we mean more specific, and this has implications on training data and then how training data is being used. We'll talk about the shift from model-centric to data-centric AI, which is a focus on -- more on how to improve the quality of data, less on different modeling techniques. Fourth, as AI models become more mainstream and more in the production systems for a lot of companies, there's an emerging need for training data operations. And then finally, and this is something we spoke about before, using AI in the data -- training data preparation space is increasing. And we'll talk a bit about this now and also on Wilson's section. So in terms of this, the first trend, data remains a major obstacle for AI. So there was a survey completed recently by O'Reilly and looking at talking to people who actually built AI models and AIs in the -- in production systems. So the first -- if you look at the first largest bottleneck, it's skilled people and hiring the right people. The second is the lack of data or data quality issues. And then third, it's identifying the right use case. Fourth is culture. So you can see about those 4 major segments, which represent roughly about 60% of the total bottlenecks, the one that's actually related to the development of the model is training data quality. So that remains a huge issue and something that's -- this has been fairly consistent over the last few years. So it's a big issue. 
A lot of AI practitioners expend a lot of time preparing data. On the right-hand side is a quote from Airbnb, who are one of the, I would say, more advanced players in having AI production models at scale. So they did some research and discovered that nearly 70% of the time that a data scientist spends on developing the model is not the modeling piece. It's actually collecting data and feature engineering, so extracting the features. So you can think labeling the data. So there's a huge amount of time being spent on data collection and data preparation for AI. The next trend we see is that AI is becoming narrower, right? We'll talk quickly through a few examples that we have supported at Appen as illustrative areas, but we see this across the board. So the first example, we've supported a biz-speak model. So the challenge is that someone wanted to build a model to suggest improvements to common biz-speak. So you can think about when you've written something in an e-mail, there's a suggestion, hey, this looks a bit biz-speak-like, here's an alternative. If you think about the challenge with this, biz-speak is highly nuanced. There's regional differences. There's context. It's a very difficult linguistic task to solve. So our task was go out, collect a lot of biz-speak and understand the intent and provide suggested alternatives. And we had to do this at very large scale and with a lot of context involved in it. The next example of narrower AI is related to personal training. So there is a big push now to use computer vision as a way to suggest training regimes and to monitor the performance of the person that's doing the actual exercises. The challenge is that a person's movement changes with age, particularly as people get older, they might be limited in their movements, et cetera. So one of the tasks that we were asked to do is capture and annotate videos of seniors doing somersaults. So this is an actual task that we supported. 
So you can start to get the idea of how specific some of the data collection work is that we do, and it goes back to that class imbalance issue that we spoke about, that it needs to be representative of even the extreme version of seniors doing somersaults. The last version is about long-tail languages. So COVID created a unique challenge where there was a lot of information that needed to be shared digitally in almost real time around the globe. And this included some specific languages where there may be only -- not a lot of people who natively speak that language. So the translation tech didn't support all of the languages. So we worked in a consortium with a lot of other large tech players to go and collect and annotate some very long-tail languages to make sure that the information about COVID was being disseminated not just for the common languages but across the world. So really -- a really important project for us, that last one. We spoke again about this good AI model equals model instructions and high-quality training data. But there is this question around, okay, if I'm an engineer and I'm looking to improve performance in my models, should I focus on spending time around the model instructions or training data? And this is a bit of a long-standing question in the AI community. There's a quite respected AI practitioner, Andrew Ng, who has a company, Landing AI. And he tried to answer this question. So he had built a model to detect defects in steel sheeting. So a computer vision model, it takes photos of steel and automatically identifies, is that a defect or is that a piece of dirt, et cetera. They'd built a model, and they got to a baseline performance of 76.2%. He then split his team into 2 tasks. So one was he got a group of people and said, "Hey, go out and improve the code. 
Get the latest research possible from the largest tech players and do whatever you can to apply this new architecture and this new model code to the existing data and see how you can improve it." He got another set of his team to go and improve the data. "So let's not change the code. Let's just go ahead and collect more data, improve the labels, improve the quality of the data." And you can see the difference here. And this is one example. Improving the code had almost -- well, no impact on the performance of the model, whereas improving the data had a really significant uplift. And the average human performance for these types of tasks was 90%. So they actually got it to above human performance in identifying steel defects. Again, one example, but more of an illustrative view of how there's this performance improvement benefit from looking at the training data composition. Another example here is the performance of a competition called ImageNet. So ImageNet is a bit of the gold standard competition in computer vision where there's -- you've got a few million labeled examples. And the task is to create a fairly general computer vision model where you load an image and it will tell you what's contained within the image. Over the past 7 or 8 years, starting with that core data set, the performance of the model is able to get to 86.5%. And these are serious heavy-hitters who are investing time in this competition. By providing extra training data, you can see, particularly in the outer years, there's been a significant uplift in the performance of ImageNet. So another example of the benefit of more training data and how that leads to better accuracy in model performance. So this is really -- comes to this shift from model-centric to data-centric AI. In a model-centric world, AI engineers will use the available data and try to develop models that compensate for any noise or inaccuracies in the data. 
So you can think about it as you hold the data fixed and you try to improve the model. Data-centric AI flips that on its head. So it's all about improving the volume and/or the quality of the data, the training data that's used to train the model. And then you try some different models, but the focus is on improving the data. So it's holding the model fixed and iteratively improving the data. And what we see occurring in the model development world is the shift from model-centric, where the constraint has been, here's the data I have, to a data-centric view where there's a lot more focus being placed on how do I improve the data, how do I expand my data sets and enrich the data. So this is the shift that we spoke about, from model-centric to data-centric AI. The fourth trend we see is the need for training data management. So we've spoken about the AI life cycle. And there's a really important part here particularly that we focus on, which is the data collection, labeling and preparation piece. And in the pink down the right-hand side, the kind of the tasks that we see, and doing these tasks are really important. But there's an entire set of capabilities that are emerging around how to support the development and the management of training data. So things like version control get really important. If you build an AI model on one set of training data and the training data has changed, you'll never get that same performance again. So managing the version of the training data is really important for, one, experimentation, so you can figure out what composition of training data worked better than others; but also traceability, so if there are issues in real-world production, they can be linked back to the core training data set, which can be quickly identified. Training data security is another important issue that's emerging. We spoke before about the difference in traditional software development being code and AI being data-centric. 
If you were a hacker, you can go in and change the code in traditional software. In the new world of AI, data is the most important piece. So placing security around the training data becomes really super important, and there's going to be a lot of focus in that. There are a whole bunch of other issues, but those are just 2 examples around how there's an ecosystem being built around the management and the controls of training data. Finally and one that's really important for us and where we're placing a lot of focus is applying automation to the labeling process. So there's 3 main buckets of automation that make sense for data labeling. So the first is prelabeling. So this is where AI performs an initial pass on the annotation. And the worker is doing more of a check and a correction of the prelabeling, if correction is required. So it's still human-annotated data. Humans have done that validation and that correction, but we're using AI to speed up that process by having the first pass. So it significantly reduces annotation time. We also see a fairly positive quality uplift through prelabeling. The next is what we call speed labeling. So this is where AI is used to assist the crowd worker in the labeling process. So you can think about this similar to an auto-complete function, where it's humans plus AI working together to get to a fast outcome and a higher quality outcome. Finally, we use automation in the labeling process in what we call smart validators. So this is when the crowd worker has completed the annotation. There will be a layer of checking that completed work prior to sending the file back and moving onto the next stage of the annotation. So the benefit of validators is, of course, it improves the quality of the model. But it also acts as a guide for the crowd workers around how they may want to do things differently in the future to get to a better performance. So we've spoken a fair bit about training data. 
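[Editorial illustration, not part of the spoken remarks.] The idea of a validator that checks a completed annotation before it is submitted can be sketched as follows. This is only a sketch under stated assumptions: the `validate` function, its checks and the 0.5 agreement threshold are hypothetical, and it uses a standard intersection-over-union (IoU) comparison against a model prelabel rather than anything confirmed about Appen's smart validators.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def validate(box, prelabel, image_size, min_iou=0.5):
    """Return a list of issues for the worker to review; an empty list means pass."""
    issues = []
    width, height = image_size
    x1, y1, x2, y2 = box
    if x2 <= x1 or y2 <= y1:
        issues.append("empty box")
    if x1 < 0 or y1 < 0 or x2 > width or y2 > height:
        issues.append("box outside image")
    # A large disagreement with the AI prelabel is flagged, not auto-rejected:
    # the human annotator may well be the one who is right.
    if prelabel is not None and iou(box, prelabel) < min_iou:
        issues.append("disagrees with prelabel")
    return issues


prelabel = (12, 8, 108, 112)
tight = validate((10, 10, 110, 110), prelabel, image_size=(640, 480))
loose = validate((0, 0, 300, 300), prelabel, image_size=(640, 480))
print("tight box issues:", tight)
print("loose box issues:", loose)
```

Returning the list of issues to the worker, rather than silently rejecting the work, reflects the point above that validators also act as a guide for how to annotate better next time.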
We'll move on now to some observations around training data -- sorry, some of the modeling techniques and how that is evolving. So one of the things which I think is really important to understand is that AI-enabled applications typically involve a large number of models that rely on a large number of modeling techniques. So the examples here are for a voice interface system. So think about your favorite at-home voice interactive product. There are 3 main blocks, technical blocks, and these are in the dark blue on the left-hand side, where you've got language processing. So this is the models that hear what you're saying. Well, there's a wake word, then they listen to what you're saying. That's processed then to text, right? So this is the speech-to-text component. The next is intent handling. So it's one thing to transcribe audio to text. The other is the natural language understanding component, which is highly complex and requires a lot of different types of models. So that's -- one is understanding the intent, but it's also then matching to the knowledge of the system. And finally, it's the response generation. And the response typically involves different types of responses. So one is the spoken audio that is returned to you. So in the -- let's say, you wanted to start a timer on your phone. It would be -- the voice interface system would be, "Okay, I've started the timer." But it also needs to then go into the application and then start the timer, so the actual activity that's involved. On the right-hand side, and I know this is very hard to see, but all of these smaller boxes are different types of techniques and algorithms that are used for each of those processes. So you can see that there's a lot of different models that are required to be brought all together to deliver an AI-enabled application. And in the real world, what we see is that there's not one modeling technique that's typically used end-to-end for a model. 
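The three blocks of the voice interface described above (language processing, intent handling, response generation) can be sketched as a simple pipeline. This is a toy illustration only: the wake word, the keyword-based intent rule, and the `timer.start` action name are assumptions, and the speech-to-text step is replaced by a stand-in that receives text directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    spoken_text: str  # what the device says back
    action: str       # what the device actually does in the application


def speech_to_text(audio: str) -> str:
    """Stand-in for a speech-to-text model: the 'audio' is already a transcript here."""
    return audio.strip().lower()


def handle_intent(text: str) -> str:
    """Toy natural language understanding: map the text to an intent name."""
    if "timer" in text and ("start" in text or "set" in text):
        return "start_timer"
    return "unknown"


def generate_response(intent: str) -> Response:
    """Produce both the spoken reply and the action to run."""
    if intent == "start_timer":
        return Response("Okay, I've started the timer.", "timer.start")
    return Response("Sorry, I didn't catch that.", "none")


def voice_interface(audio: str, wake_word: str = "hey device") -> Optional[Response]:
    """Run the three blocks in order; ignore input that lacks the wake word."""
    text = speech_to_text(audio)
    if not text.startswith(wake_word):
        return None
    command = text[len(wake_word):].strip(" ,")
    return generate_response(handle_intent(command))


reply = voice_interface("Hey device, start a timer")
ignored = voice_interface("start a timer")
```

In a production system each of these functions would be one or more trained models rather than a keyword rule, which is the point the speaker makes about many models being brought together.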
So across the top here are some techniques. This is not exhaustive. You've got transfer learning. Transfer learning is when you take a piece of a model that's been trained on something else and kind of slotted in, which will get you some benefit this side of the model. Self-supervised learning, it's a modeling technique where there's no training or there's no data annotation required for the training. And then the next 3 are examples of supervised learning, where the first is it might be what we call off-the-shelf data, which is data that can be bought from a marketplace or it's already data that's applicable but not specific -- not necessarily -- totally specific for the application. Supervised learning, that's using AI-assisted human annotation. And then there may be a requirement for supervised learning, that's where it's human-annotated only. The composition of how the different modeling techniques are brought together varies, as you can imagine. So in the example of a U.S. English chatbot for a bank, U.S. English is a very common language. Retail banking is quite a common industry. So there might be a fair amount of models that can be used for transfer learning, specific techniques for self-supervised learning and even a set of off-the-shelf data that might be specific for U.S. English and retail banking. And that will get a long way in the model development stage. Then you start to get into more specifics. So for a specific bank, it's going to have different product sets that it's called differently, terms and conditions, a whole set of company-specific taxonomies. And that's where data needs to be collected. In this case, supervised learning, using AI-assisted annotations might get a long way. And there might be some requirement at the end for human-annotated data, where AI-assisted modeling hasn't been developed yet. As you work down the specificity, kind of the next example is a French chatbot. 
You start to become more reliant on custom data collection and custom data annotations just because the existing models and the existing off-the-shelf data don't exist. The third example is a Qatari-Arabic chatbot for a marine insurance company. So you start to get the idea that more specific AI models don't have the luxury of a lot of pre-existing work that's been used. So it starts to get very custom in the type of data that's -- and the techniques that are being used. I've been saying that -- we spoke about a limited number of techniques. There's a huge amount of research being put into new AI approaches. So on the left-hand side, these are the number of papers which are being posted to arXiv, which is a quite common place for researchers and other academics to post their papers. And these numbers on the chart are in thousands. So in 2019, there was almost 30,000 publications on AI posted to arXiv. On the right-hand side, there's a lot of research being done by new teams. And it's -- we're in a nascent industry, and it's emerging very quickly. So there's a lot of forward progression in the types of modeling techniques that are being used. What we see, though, is the popular AI techniques that are used actually in mature AI practices still rely on human involvement. So here on the left-hand side, it's for a bunch of companies that were surveyed, well over 3,500. What are the different modeling techniques that they are using? And you'll see the first is supervised learning. So that's where examples are being provided and examples that have meaning assigned. Deep learning, a subset of supervised learning but another way that -- where humans are required for the preparation of the labeling process. Human-in-the-loop and active learning and knowledge base and knowledge graphs, so these are all different techniques where a level of human annotation is required. 
So while there's a lot of research being put in advancing how AI evolves, we see in the real world that humans are still playing a big role in the creation of high-quality data training -- training data. So Mark mentioned, and we've been on a journey, right? We're moving from this transformation from an -- into an AI-powered provider of AI data and solutions. We've gone from data types being language-focused to very AI-focused and supporting a wide variety of use cases, our service-led delivery model to something that relies heavily on our products. And this comes with a shift from project-based or more committed revenue. Our customers have been concentrated. Our products have allowed us to support a greater diversity of customers. And then Mark has spoken and -- more yesterday around the org structure and reporting. And this evolution has not occurred overnight. We've been on a journey. It's through the acquisition of Butler Hill and Leapforce. Appen Connect become -- became a really important -- a really important part of our tooling and our infrastructure. Phase 2 was we acquired Figure Eight, and that gave us a very strong set of capabilities. But we've continued to invest and evolve our products. And this has led us to being very focused on a product-led expansion. Products are really important, but it is one part of the capabilities that we offer. And it's the combination of our crowd of well over 1 million strong, our deep internal expertise on how to deliver high-quality training data and the product that's the real differentiator. We're going to focus today a lot on the product. And where -- Wilson will talk more about this. I'm going to give a quick intro, but there's 5 main components to our product suite. The first is Appen Connect. So Appen Connect is our product that manages our global crowd workforce and does a lot of the matching from the crowd to the task. So it's a really smart marketplace that matches the global crowd with projects. 
We're applying a lot of AI and building a lot of smarts to make that as seamless as possible. The next is the Appen data annotation platform. This is the real engine of the company where the crowd workers complete their task. And our customers can set up and monitor performance and create real, bespoke annotation tasks for our crowd workers. Then we've got a set of new products that we're really excited about and are going to make a real step change in the performance of the business. So the first is Appen Intelligence. Appen Intelligence is the set of models that we use to improve automation throughout the business. So this includes like what we've spoken about in the annotation process. So how do we improve the productivity of our crowd workforce and deliver better quality for our customers. But it also includes processes to manage the crowd and our -- those workforce tools. So it's a really big part of what we're focusing on. The next is In-Platform Audit. Wilson will talk a lot more about this, but it's -- In-Platform Audit enables our customers to understand the composition of their training data better. We spoke a lot about class imbalance and quality errors. These can be very hard to diagnose and navigate when you've got data sets of hundreds of thousands of images as an example. So the In-Platform Audit is a way for our customers to really easily navigate and narrow in on areas that need to be addressed and where performance needs to be improved or more data might need to be collected and brought into the system. Finally, and this is one which I think is super exciting, is Appen Mobile. So a really great mobile interface that serves a couple of purposes. So one, it's a way for customers -- or sorry, crowd workers to engage with us. So log into the system, identify what jobs are available to them. And secondly, it serves as a different form factor for data collection and annotation. 
There are a bunch of features in the mobile-specific domain that aren't available for desktop. So things like location-specific and other sensors which are inherent in mobiles but not in other areas. So again, Wilson will talk through all of these, but I'll -- this is just a quick intro. What's really valuable is, though, that these products create a huge amount of value for our customers. So first, we've spoken a lot about AI-augmented data labeling and collection. So that really improves the speed, quality, scale and unit economics of the work that we do. AI-enabled crowd management, so it increases our internal productivity and the experience of our crowd. We've got a lot of expertise in the company, and we're trying as much as we can to productize that expertise and build that into our products. So it automates a lot of the high-quality work that we're able to do. We've got a lot of in-built crowd management features, and this reduces risk for customers, particularly those that are looking at different crowd solutions and thinking about how they work with a very large crowd. Then finally, we spoke about the combination of the crowd with technology. That's a real competitive differentiator for us and enables us to do a lot of the work to solve some of those problems that we spoke about early on around data quality, diversity and bias. So we'll have a break now. Wilson's got a really exciting presentation and set of demos. After that, we'll go into some Q&A. But we'll leave now and come back in around 25 minutes. [Break]

Mark Brayan

executive

#3

Hello, and welcome back. We'll take you over to Wilson shortly. But before that, just a brief recap of some of the things that Ryan spoke about. He took us through the evolution of the company from a language service provider to an AI data provider, from a services-led company to a product-led company. He also then took us through the importance of training data and the importance of quality, in particular. He mentioned the number of different techniques that are used and the number of different training data types that are available. Overall, we're in a complex and evolving space, and that requires a rich set of technologies. And we'd like to take you through those now. So I'll hand you over to Wilson, who's in our Bay Area location, and he's pleased to join us via the technology. Take it away, Wilson.

Wilson Pang

executive

#4

Thank you, Mark, and welcome back, everyone. Ryan just shared that the AI industry is moving from model-centric to data-centric AI. To support data-centric AI, we have evolved our product suite significantly. We've upgraded existing products to give our customers and crowd a better experience. We built new products to support new use cases and brought in large machine learning capabilities to drive efficiency and unit economics. We now have an intelligence platform with a lot of automation capabilities, and a human only needs to be involved when necessary. Let's take a look. We have 2 existing products, Appen Connect and Appen Data Annotation Platform. Appen Connect is the platform where we match our global crowd to annotation tasks. Appen Data Annotation Platform is the platform where the crowd can deliver tasks. They can collect data. They can annotate data. It's also the platform where our customers, they can manage their tasks in a self-service manner. Both Appen Connect and Appen Data Annotation Platform have evolved a lot with a lot of new features, better experience, a lot of the new -- a lot of AI capabilities. We also developed 3 new products. This is what excites me the most: Appen Intelligence, In-Platform Audit and Appen Mobile. They make a huge difference to our business, our customers and our crowd of contributors already. Appen Intelligence, this includes the proprietary machine learning models to empower other products. It has models to automate the labeling tasks. It also has models to automate crowd management tasks. To support data-centric AI, just to collect data and annotate that data is not enough. In-Platform Audit helps the data scientists to really analyze the training data so that they can understand the quality, distribution and potential bias from the data. It is essential. You probably already heard it from Ryan. It is really essential to get the data right so that they can have better AI performance. 
Last but not least, Appen Mobile is our new mobile app. It upgrades the crowd experience to help them to do different types of data collection tasks. It also helps Appen to increase our reach to an even broader crowd group. So the AI data industry values quality, speed, scalability, security and unit economics. Our product suite can support all of them and really keep our business ahead of the competition. Now let's look at some details of those different products. Let's first look at Appen Connect. Appen Connect is used by over 1 million crowd workers as well as the Appen internal teams. Project managers set up projects and tasks. Crowd workers find projects and deliver tasks. There are 2 major focuses for Appen Connect. Number one is the efficiency and the scalability. We optimize the user experience so that both the crowd workers and our internal team members, they can be very efficient and unnecessary effort can be saved. Number two is automation. We want to automate the project management effort as much as possible, so that the platform manages the crowd rather than a human. Let's look at a very typical project life cycle. A project manager, they will set up a project and then source the workers or candidates to work on their projects. They help those workers to ramp up their skills and pass their qualification, then the workers can start work on the project. After a worker starts work on the project, the project manager needs to really track the progress, track their productivity, their quality progress. And if the worker bumps into any issue, the project manager needs to support them to fix those issues. Meanwhile, the project manager also needs to detect the fraudulent users constantly and kick those fraudulent users out of the projects. So you can see it's a pretty complicated flow and life cycle, and some of those tasks are very time-consuming. 
Sourcing candidates, supporting workers when they bump into issues and also doing the fraud detection, those tasks can take a lot of human effort. We are using Appen Intelligence to automate them and also make Appen Connect an intelligent marketplace. Let's look at the automation of the task to source crowd workers. Within Appen Intelligence, we have built a crowd DNA, which contains a lot of crowd data, their behavior data, their project histories, their skills, their quality and productivity data. Based on those data, we build machine learning models to recommend workers on projects or recommend projects for workers. We also build a machine learning model to detect fraudulent users. With those AI capabilities, once the project manager finishes setting up the project, Appen Intelligence can understand the sourcing requirements and then find those workers for the project, and then send a personalized notification to the worker. And if the worker is interested, they will apply for the project. Once the worker has applied, Appen Intelligence will further screen them to check if they are eligible or if that is a potential fraud. Based on that information, Appen Intelligence passes them or fails them. If the worker passes the auto-screening, then they will be activated to the project automatically. So you can see also the steps here in green color, this can be done by Appen Intelligence. And steps in the green color, those are mainly the steps from the contributor. So with Appen Intelligence, with this automation, this is going to save a huge amount of effort for project managers. They only need to get involved when Appen Intelligence, when our machine learning model, is not sure about the decision. The majority of the time, those tasks all get automated. Let's look at another example, fraud detection. Given Appen Connect is a marketplace, there can be fraudulent users. If you look at the example at the left side, those 2 accounts, they are from the same IP. 
And one user normally works from 3:00 a.m. to 5:00 a.m. It is very suspicious, and it can be a fraudulent user. It is key to remove those users so that the project quality is not compromised, and we don't really pay unnecessary costs. However, you can also understand that analyzing the activity from over 1 million workers is not possible for a human. Fraud detection models from Appen Intelligence can help us to do the job. It processes more than 1 million users every day and handles 200-plus signals for every user. And the fraud detection model, the accuracy is pretty good. It's around 95%. So those models are used in a lot of places. They check users during the new user registration. They're used to screen the project application. They also run in the back end all the time to detect any suspicious activity. Fraud detection with machine learning not only automates a huge amount of human effort, but also works at a scale a human cannot -- just cannot handle, right? Talking about like 1 million workers is just so hard for a human to check every day. Appen Connect, as you can see, it creates huge value for our customers and our crowd. It connects customers with our global crowd. It automates a lot of project management work and reduces the overhead costs. It enables our business to scale and support the future growth. Future investment focuses on 2 areas. First, we will continue to optimize and make a very good user experience for both the crowd and also the internal team members. Second, we will just continue to add more automation so that the human effort becomes less and less. Now let's take a look at Appen Data Annotation Platform. The majority of the data annotation platforms in this AI data industry only focus on certain areas. Some focus on computer vision. Some focus on audio and language. 
While the Appen Data Annotation Platform has the breadth and depth to support all kinds of use cases, it has tools to support different types of data collection, tools to do content relevance, tools to annotate audio and text data, tools to support image, video and 3D point cloud data processing. Meanwhile, no matter how many tools you have, there will always be some special customer need you haven't heard before. So we also have part of our tool called job designer, and it will help the customers to design a new tool very easily. The job designer also provides a programming language called CML, which is loved by the developers, so they can program a pretty powerful tool pretty quickly. So those annotation tools, they are pretty powerful, and they work really well in a single task. But some AI data use cases are very complicated, and you need multiple steps and different operations to get the data right. [ Simple ] operation can be a human-labeling job or a machine learning model or a script to process the data. Appen workflow enables those use cases. It fits all those different operations into a flexible workflow. Let's see how it works. Please help to play the first demo video. [Presentation]
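The idea of chaining different kinds of operations (a script, a machine learning model, a human labeling job) into one workflow can be sketched as follows. This is a generic illustration, not the Appen workflow product or its API: every function name, the record fields, and the `HUMAN_ANSWERS` lookup standing in for a crowd worker are hypothetical.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Operation = Callable[[Record], Record]

# Hypothetical stand-in for answers that a human worker would submit.
HUMAN_ANSWERS = {"r1": "greeting"}


def script_clean_text(record: Record) -> Record:
    """Script step: collapse whitespace and lowercase the text."""
    record["text"] = " ".join(record["text"].split()).lower()
    return record


def model_detect_language(record: Record) -> Record:
    """Model step: toy language detector (a real step would call a trained model)."""
    record["language"] = "en" if record["text"].isascii() else "unknown"
    return record


def human_label(record: Record) -> Record:
    """Human step: use the worker's answer, or mark the record for review."""
    record["label"] = HUMAN_ANSWERS.get(record["id"], "needs_review")
    return record


def run_workflow(record: Record, operations: List[Operation]) -> Record:
    """Pass a copy of the record through each operation in order."""
    current = dict(record)
    for operation in operations:
        current = operation(current)
    return current


raw = {"id": "r1", "text": "  Hello   World "}
result = run_workflow(raw, [script_clean_text, model_detect_language, human_label])
```

Because every step shares the same record-in, record-out shape, steps of different kinds can be reordered or swapped without changing the runner.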

Wilson Pang

executive

#5

I hope that can give you a better understanding of how those AI data use cases, how complicated they can be, and it's great to be able to support those complicated tasks. Meanwhile, it's also very important to guarantee the quality. Quality is always one of the most important factors for any data, and Appen Data Annotation Platform has a rich set of features to do quality control. The number one form of proactive quality control is test questions, where customers can define ground truth data, and those data can be used to qualify the worker before they start or monitor their quality performance during the job. QA workflow uses a different methodology, where we ask highly qualified workers to review and correct annotations from other workers. Dynamic judgments collects judgments from multiple workers and aggregates the results to get a high-confidence answer. Machine learning validation. So this smart validation Ryan mentioned earlier uses machine learning predictions to validate the annotation from the workers. This is very useful in certain use cases. Normally when I handle projects, I use all those features to -- really to achieve high-quality output. Security is another core consideration. You probably know -- can easily understand how important security is for the AI data. Appen Data Annotation Platform provides very flexible deployment options. Customers can use the platform in our public cloud or deploy the platform in their private cloud, or in a completely air-gapped environment. So for the customers that use our public cloud, they can use a feature called Secure Data Access so that they don't need to move their data into our platform. They will only access their data when the worker is delivering them, and that access will expire after the data is labeled. This does create an additional layer of data protection. Our platform also meets all those major security and privacy compliance standards, like SOC 2, GDPR and HIPAA. 
With all those security features, our customers' data are well protected. Appen Data Annotation Platform, it creates huge value for our customers with full suite of tools to support different customer use cases. Appen Workflows, it enables complex training data preparation, and those quality and security options help our customers to get a high-quality training data and also help them to really protect their data. Future investment for Appen Data Annotation Platform focus on several areas. We will continue to evolve the tools to support the newly surfaced use cases. For example, now our team is working on building a new tool to annotate all the satellite imagery tasks. We're also working now to provide a better API to have a tighter integration with our customers' systems, and then quality and security are never-ending effort. We will just continue to invest more and more on those features and offerings to help our customers to get high-quality training data and also protect their data. Now let's look at Appen Intelligence. We build a product to prepare training data for AI companies. Meanwhile, Appen itself is also an AI company and machine learning is used in all our products. Appen Intelligence provides those machine learning capabilities. We have seen earlier how Appen Connect is using Appen Intelligence to automate sourcing workers to do fraud detection. Now let's see how it helps the automation of the annotation efforts. Appen Intelligence provides proprietary machine learning models across different data categories. It has models to identify speakers, to detect languages, segment audio files, convert audio to text to do voice recognition. It also has models to analyze text data to detect gibberish, to extract entity and to do text classification. Those are the models commonly used in the natural language processing field. 
It also has models to process image and video data, transcribe text from an image, detect objects and generate face landmarks or blur faces to protect privacy. It also has a lot of models to handle 3D data, to do object detection and tracking. So those models are used to pre-label data so that a human only needs to review the preliminary results instead of labeling those data from scratch. These models are also used to check the annotation from humans, to check their quality and to validate their results. This helps data quality and labeling speed and saves a lot of labor costs. To better understand how those models are used, let's see a few examples. Understanding documents. As machine learning becomes very popular now, finance companies, they want to process receipts. A law firm wants to find some legal information from a document. To train those machine learning models, we need to transcribe text from images, from PDFs or other files. Now let's see how the training data are labeled for OCR transcription. Can you help me to play the... [Presentation]

Wilson Pang

executive

#6

Clearly machine learning assistance has helped the OCR data labeling. Let's see another example. Voice recognition is another widely used AI technique. Let's see how to prepare training data for a voice recognition machine learning model. [Presentation]

Wilson Pang

executive

#7

Now let's switch gears to computer vision. Autonomous driving is probably the most exciting AI use case in the computer vision field. To do autonomous driving, a car needs to understand the surrounding environment. It uses sensors like cameras and LiDAR to collect data. LiDAR is a special type of sensor, which collects 3D point data of objects in the surrounding environment, and the machine learning model needs to understand those 3D data, those 3D points, and classify them as cars, pedestrians, bicycles or other object types. To prepare those training data, project workers need to operate in a 3D environment, and it requires some special skills. Normally, it takes a long time to label the 3D data. Now let's see how this 3D point data gets labeled. Please help to play the 4th demo video. [Presentation]

Wilson Pang

executive

#8

Self-driving is pretty challenging and needs to handle different situations. You have seen how this cuboid, how this object is labeled. But besides detecting these objects surrounding the vehicle, the car also needs to detect the lane lines on the road. Now let's see how lane lines can be labeled. Please help to play the 5th video. [Presentation]

Wilson Pang

executive

#9

As we have seen from those demos, machine learning assistance is very powerful. Here's a quick summary of all the productivity differences we have observed. For audio and speech, machine learning assisted annotation can be up to 1.6x faster. For OCR, the results are even better. The 2D image bounding box labeling with machine learning assistance can be 30% faster, while OCR with machine learning assistance can be 6x faster. It also works really well with 3D data. It can be 4 to 6x faster. Labeling in the 3D environment is really complicated for a human, so machine learning assistance is very powerful. Meanwhile, it doesn't really work well for content relevance tasks. Content relevance tasks are very subjective -- they often require people with a certain cultural background, and they're super hard to automate. So overall, a lot of those data annotation efforts from the crowd workers are now being automated by Appen Intelligence. Data automation improves the data quality, labeling speed and also saves huge project costs for us. Clearly, Appen Intelligence creates huge value for our customers. And also adding all those machine learning capabilities to other Appen products, it automates the project efforts and lowers the unit costs. It also helps to improve the delivery speed as well as the data quality. It also automates project management efforts so that our business can easily scale. You may recall earlier how Appen Intelligence is used in Appen Connect side. In the future, we will just continue to add more AI capabilities to automate more use cases for both the worker side and also for internal teams. Let's now move to Appen In-Platform Audit, which is a new product we released last month. It's still in early stage, but I'm super excited about Appen In-Platform Audit. It already brings a lot of value for our customers. As you have seen from the slides Ryan shared earlier, training a good AI model can be expensive. It needs a lot of training data. 
It needs computed -- it also needs hardware. It needs computation power. It needs efforts from a data scientist team. If there are problems with a model, it's better to find those earlier instead of later so that you don't need to redo all this work. And redoing all this work, that really increases a lot of cost. An AI model's performance is driven by the training data. So debugging and detecting problems in the training data early on is key to the model's success. In-Platform Audit is designed to help data scientists to analyze the training data. So this will analyze the raw data before labeling or analyze data after labeling. And [indiscernible] can also use them to evaluate model performance. The In-Platform Audit will help the data scientists to detect all those data problems like class imbalance, accuracy or quality or label imbalance. So these data problems might not be that straightforward to understand. Let's use an example to explain. Let's say I want to train a machine learning model to classify if a tweet is a positive tweet or negative tweet. To train that machine learning model, I first need to collect the training data. I've scraped 10 million tweets from the Internet. If 9 million of them are from male and only 1 million are from female, then the model might not work well for female tweets using the data set. So this is a class imbalance problem. I detected the class imbalance problem and I fixed it. Now I have 5 million tweets from the male and 5 million tweets from female. Then I'm asking people to help me to label these tweets. And then when I review those label results, I found a lot of positive tweets got labeled as negative. I've got a data quality problem. The accuracy is not high. Now I detected the accuracy problem and I fixed it. However, for those 5 million tweets from male, 4 million of them are positive while 1 million are negative. 
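The three checks in this tweet example (class imbalance, label accuracy, label imbalance) can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how In-Platform Audit works internally; the helper names are hypothetical, and the data is scaled down so that each list element stands for 1 million tweets.

```python
from collections import Counter
from typing import Dict, List


def distribution(values: List[str]) -> Dict[str, float]:
    """Share of each category in a column of the data set."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def imbalance_ratio(values: List[str]) -> float:
    """Largest category count divided by smallest; 1.0 means perfectly balanced."""
    counts = Counter(values)
    return max(counts.values()) / min(counts.values())


def label_accuracy(labels: List[str], ground_truth: List[str]) -> float:
    """Fraction of worker labels that match a trusted reference label."""
    matches = sum(1 for label, truth in zip(labels, ground_truth) if label == truth)
    return matches / len(ground_truth)


# Class imbalance: 9 million tweets from male authors, 1 million from female authors.
authors = ["male"] * 9 + ["female"] * 1
author_ratio = imbalance_ratio(authors)

# Label imbalance: of the 5 million male tweets, 4 million positive and 1 million negative.
male_labels = ["positive"] * 4 + ["negative"] * 1
label_ratio = imbalance_ratio(male_labels)

# Accuracy problem: positive tweets that were labeled negative (toy reference labels).
worker_labels = ["negative", "negative", "positive", "negative"]
reference = ["positive", "negative", "positive", "positive"]
accuracy = label_accuracy(worker_labels, reference)
```

A ratio well above 1.0 on either the author column or the label column is the signal to rebalance or collect more data before training, and a low accuracy against reference labels is the signal to relabel.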
Although the data labels are accurate, I get a label imbalance problem, which will cause a lot of problems for my model later on. I also detected the label imbalance problem and I fixed it. The data set is now well balanced and has high quality. The model, you can imagine -- the model trained using this data set is likely to have a good performance. I hope this gives you a good sense of how training data insights can help to detect and fix data problems. I think In-Platform Audit provides additional value to our customers. It essentially enables customers to understand their training data, find problems and fix them, which in turn helps them to improve their AI performance. We just released the In-Platform Audit last month, and there's a lot more to do. Currently, In-Platform Audit focuses on training data analytics, and we are expanding to support model performance evaluation in the future. Ryan also mentioned there's a trend where people need a lot of tools to manage all this training data. We are also adding more training data management features into In-Platform Audit. So this is a super exciting product. And it can evolve to be a powerful training data analytics tool loved by every data scientist. Now let's move to Appen Mobile. We released Appen Mobile early this year. This new mobile app provides an upgraded experience to the crowd workers. They can engage with Appen at any time from any place now. The new mobile app provides a very intuitive user experience. A crowd worker can register to become a user quickly, find projects easily and then work on all kinds of data collection tasks. The app has made data collection easier than ever. Location-based apps have also become very popular, especially during the pandemic, and those apps need location-based data to train their AI, and our new mobile app supports those needs. So the new mobile app is great. It provides better experience and also supports more data collection use cases. 
But that's not the only benefit it brings. The new app also increases our reach to the mobile-only crowd workers. The population actually is pretty big. You know that there's a lot of people, they're only using mobile. This is a pretty big population in Asia and also other developing countries. Enough talking about this app. Let's see how it works. Please help to play the 6th video -- 6th demo. [Presentation]
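As an illustration of the tweet example from this segment, the three checks Wilson describes (class imbalance, label accuracy, label imbalance) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the In-Platform Audit implementation: the function names, the toy data (each row standing in for one million tweets) and the idea of comparing crowd labels against a small audited sample are all hypothetical.

```python
from collections import Counter


def imbalance_ratio(values):
    """Ratio of the most common class to the least common class (1.0 = balanced)."""
    counts = Counter(values)
    return max(counts.values()) / min(counts.values())


def label_accuracy(labels, reference):
    """Share of labels that agree with a trusted reference label."""
    agree = sum(1 for got, want in zip(labels, reference) if got == want)
    return agree / len(reference)


# Step 1: class imbalance in the raw data (9 million male vs 1 million female tweets).
raw_genders = ["male"] * 9 + ["female"] * 1
gender_ratio = imbalance_ratio(raw_genders)

# Step 2: label accuracy, checked on a small audited sample (hypothetical data).
crowd_labels = ["negative", "positive", "negative", "negative"]
audited_labels = ["positive", "positive", "negative", "negative"]
accuracy = label_accuracy(crowd_labels, audited_labels)

# Step 3: label imbalance within one group (4 million positive vs 1 million negative).
male_labels = ["positive"] * 4 + ["negative"] * 1
male_label_ratio = imbalance_ratio(male_labels)

print(f"class imbalance ratio: {gender_ratio}")
print(f"label accuracy on audited sample: {accuracy}")
print(f"label imbalance ratio (male tweets): {male_label_ratio}")
```

A ratio well above 1.0 in step 1 or step 3, or a low accuracy in step 2, is the signal to rebalance or relabel before any model training starts.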

Wilson Pang

executive

#10

Appen Mobile creates huge value for both our customers and also the crowd workers. It gives crowd members a much more intuitive user experience. They can engage with Appen at any time from any places. It enables a lot of different data collecting use cases. It also helps us to reach to a much bigger crowd population. In the future, we are going to invest more on this mobile app. We're going to support new data collecting use cases and also supporting other data annotating tasks. Whatever task it can fit into a mobile screen, we want to try that in the mobile, too. And also with the mobile app, we are just going to actively expand the crowd to support the diversity and also impact sourcing. So this will end my presentation. I will now hand it back to Ryan.

Ryan Kolln

executive

#11

Thanks, Wilson. I'll spend a little bit of time on a recap and a close before we head into some Q&A. So Wilson took us through the product suite from Appen Connect, which is used for our crowd management; the Appen Data Annotation Platform used for our crowd workers to do the labeling and also our customers to set up and customize jobs; and then some of the great new features that we've rolled out more recently, Appen Intelligence, In-Platform Audit and Appen Mobile. We spoke about how these capabilities unlock huge value for our customers, from AI augmentation in data collection and labeling, delivering speed, quality, scale and unit economics; the crowd management and some of the AI that we're using in that, including fraud detection that really increases our internal productivity but also the crowd experience. We've embedded a lot of expertise in these tools, and that's really helping us deliver high-quality annotation work for our customers. We've got in-built crowd management features and working -- and doing the crowd management on behalf of the customers is really important for them. And that native integration with our crowd, it creates the competitive differentiation, where we've got a complete set of tools and capabilities, both from a technology standpoint and from a crowd. And that's kind of the real differentiator at Appen and how we unlock a lot of value for our customers. It's not about having the right tools, and it's not about having the large crowd or the expertise. It's bringing all 3 of those together. And that's what our customers really value from us, and that's what we will continue to focus on in the future. The product is going to be a very large part of what we do. So I'll now hand it to Mark, who will moderate our Q&A session. I think we've got about 30 minutes. We're a little bit more allocated for some questions.

Mark Brayan

executive

#12

Thanks, Ryan, and thanks, Wilson. I hope you all enjoyed the presentations from Ryan, Wilson and the demonstrations as well. We have some questions. I'll read through them, throw them first to Ryan, first of all, and he can loop in Wilson as required. So the first question is, how does Appen's own data labeling platform and AI investments compare with the competitors such as Scale AI? How are they different? Are you investing enough in R&D to keep up with new entrants? Ryan?

Ryan Kolln

executive

#13

Thanks, Mark. And this -- it's a good question. So we monitor all of our competitors, as you can imagine, from information that's externally available and often do feature comparisons to understand where we are from the market and also speak to our customers to -- a lot of our customers, obviously, will look at different products in the market and see how we compare. To the best of our knowledge, we have a very comparable set of products and some areas where we have a lot of deep expertise that is built into our products and creates a lot of differentiation for us. So I think from a technology standpoint and our product suite that Wilson just took us through, it's comparable and, in some areas, leading the market. I think like what we were just talking about before, there's a huge value about the combination of the product suite with the crowd. So we've got the products, and that's comparable to the market. It's that combination with the crowd and our internal expertise that really makes a huge differentiation for us.

Mark Brayan

executive

#14

Yes. Thanks, Ryan. I'd also add that we do know that our breadth of functionality is superior to many of our competitors who tend to focus on one area. Recall the evolution of our business from a language data provider through to a multimodal data provider. So earlier on in our evolution, we were focused on language and speech data. Similarly, the new entrants are focused on a particular area, mostly image data. The question also asks about the rate of investment. Clearly, there's visibility into our investment through our publicly available accounts. We don't have that same visibility into our competitors. We see the private competitors. We see the money they raise, but we don't know how much they're putting into R&D. Ultimately, what we try to do is work with our customers to make sure we've got the range of products that they want. And per Ryan's feedback, our view is we are comparable, if not superior, across a broad range of use cases. The next question is, are any of your technologies being implemented in the electric vehicle industry? And if so, what and how are they being implemented? Who are your business partners in this market? And what about autonomous vehicles? A few parts to that question, Ryan.

Ryan Kolln

executive

#15

Yes, a few parts there. I'll focus on the autonomous vehicle part because I think the electric vehicle is more about the drivetrain. Autonomous is more about the perception, where there's a lot more need for training data. So I think through the demos that Wilson just showed, we've got an advanced set of capabilities in the annotation market for autonomous vehicles. Wilson showed LiDAR, but there's also -- computer vision is used a lot also in this space. We have a range of customers that we work with to support their autonomous driving models. So we definitely have the capabilities and the depth in that market, and that's an important focus area for us.

Mark Brayan

executive

#16

Yes. Thanks, Ryan. I'd add that it's a very big problem for autonomous vehicles. As you can see from the demos, just the challenge to annotate road lines and then think about road furniture and other elements that you have to deal with as a driver, it is a big challenge and requires a lot of data. Okay. The next question also touches on autonomous vehicles. Is recency of training data for applications like autonomous vehicles important? Or is it an issue -- sorry. Ryan?

Ryan Kolln

executive

#17

Recency is super important across pretty much every AI model there is. Autonomous vehicles are a good example where it's really important. So I'll give an example around recency and why it's important. Last mile commuting is becoming really important. So 2 years ago, there weren't too many electric scooters on the road. There's not that many today. If you go to San Francisco, it's a bit of a different story. So if you think about the annotation of an electric scooter, particularly for someone that's upright, if you're using data from 2 years ago, you might treat them as a pedestrian. But now all of a sudden, you've got these pedestrian-looking objects that are traveling 20 kilometers an hour down the road. So just a basic example of how the real-world environment is changing. And in the context of autonomous vehicles, that has a massive change, right, a massive set of implications. I think the other thing, Mark alluded to, autonomous vehicle is really difficult. The specific environment changes geography by geography and country by country; different sets of road rules, different sets of buildings in the background, right-hand side, left-hand side. So a lot of the work that's being done to build the models today is quite U.S.-centric, but there's going to be a huge long tail of market-specific training required to make autonomous driving a truly global approach.

Mark Brayan

executive

#18

Yes. Thanks, Ryan. And I might hand it over to Wilson to chime in on this one as well. As many of you know, the majority of the work we do is around search. And recency is really important in search. So Wilson, given your background, perhaps, you could add something on the importance of recency in search data.

Wilson Pang

executive

#19

Yes. [indiscernible] Recency is core to search. If I remember my old days when we trained the search algorithm, we would basically release a new model almost every week just to catch up with all those recencies. Search touches on a lot of different areas. There's culture change, there's society movement. There's a new keyword, popular music, there's a lot of new stuff that keeps coming out, and the search has to be able to support all that new stuff. So really, recency is super important for search. That's also the reason we retrained our model almost every week.

Mark Brayan

executive

#20

Yes, thanks, Wilson. And if everybody on the call thinks about their own experience in driving just to go back to autonomous vehicles. Certainly, if I go back to the area in Sydney where I grew up, the roads have changed, and sometimes they change very quickly. And so even humans need recent data, but of course, we can join dots very well, whereas the AI needs the training data to learn. So it's much harder for an autonomous vehicle to learn than it is for a human. But the recency is super important. Great question. Thank you. Okay. The next question. We talked a lot about the impact of Appen Intelligence on speech, text, image, video, et cetera, but spoke very little about the impact of it on content relevance. Can we go into this a little deeper? So Ryan?

Ryan Kolln

executive

#21

Good question. Content relevance is highly subjective, and it's highly specific to a demographic, so we need that subjectivity of a human and the context of that individual's awareness around the specific environment, which is typically driven by demographics, where they live, et cetera. A lot of the work that we do in the automation space is around improving the speed of the crowd workers. So we spoke about pre-labeling and validation and support during the labeling process to improve the speed of the work. With content relevance, there's typically less of a need to do some of the time-consuming tasks like draw a polygon around a shape, for example. So a lot of the AI that's being used in speech and image-related training data support is to really speed up the process, which helps with throughput and quality. One of the things we are focusing, though, on the content relevance part is more of the user experience changes. So how do we, not necessarily using AI in the process but improving the environment that the workers are operating with to try and get those incremental step changes in the time that it takes to complete the task. And that's an important focus for customers we support in our platform.

Mark Brayan

executive

#22

Yes. Thanks, Ryan. And again, I'll hand it over to Wilson because we've talked a lot about this, Wilson, trying to automate content relevance. Perhaps you can share some of your thoughts on that. And maybe some examples that bring it to life.

Wilson Pang

executive

#23

Yes, sure. I think this is really a great question. And also, trust me, there's no lack of trying, right? This is a big part of our business. We try very hard to see how we can save more labor effort there. But it is hard. It is hard. I think if you [indiscernible] normally require human to have certain [ background ], certain knowledge to do those. Sometimes, you can use machine learning to try some of those. Let's say I have a search keyword, and I want to see whether some search results or some product results are relevant to the keyword or not. We can use machine learning to try those. Sometimes you can see some success, but there's a big problem with that, why it's really -- if you do that [indiscernible] hypothetically you produce a lot of bias. So you don't really want to use machine learning trained to have a lot of background for this. But let's say you have a kind of machine learning [indiscernible] result, then the worker -- because for them, the task is easy, right, relevant or not relevant, they will just pick whatever you show there, and that introduces some bias, which is not good. The other part, I can't think of a good example, but it's really -- the machine doesn't really have that human understanding of the cultural element, of a lot of the subjective components, which only few of them possess. So it's really hard. We tried a lot, but we haven't seen a lot of success here. I want to go back to the point that I mentioned, we did have pretty good success when we tried to design a new workflow or maybe a different UI for the workers when they do relevance tasks. It's much easier for them to deliver the results. We see some success there.

Mark Brayan

executive

#24

Yes. Thanks, Wilson. And perhaps you may recall earlier in Ryan's presentation, he had the 3 pictures of breakfast, the black pudding, the chocolate sprinkles and the vegemite. And depending upon what country you come from, you may think that's breakfast or not. So that's an example of a cultural-type question. The search -- the relevance task could be very simple. Is this breakfast? Or would you eat this for breakfast? And the majority of people may look at the black pudding and say no. Whereas depending upon culture and where you come from, you'd have a different answer. So automating that is super tricky. So no lack of trying. Keep in mind also, the companies who ask us to do this for them, the largest search and social media companies in the world, they've got some pretty smart data scientists. And I think if they could have automated it, they would have, but there's still a need for that -- the human element there. Okay. The next question -- sorry -- this is quite dynamic. It has a life of its own. Here's the next question. It's maybe one for Wilson, but I'll throw it to Ryan. First of all, feature engineering is one of the major time-consuming tasks undertaken by data scientists. Does Appen plan to invest in this area? Any investment in medical/biological data annotation technologies? Fairly specific. So Ryan, do you have a response there?

Ryan Kolln

executive

#25

So I'll start with the medical, and maybe I'll throw to Wilson for the feature engineering part of the question. So our tools today are capable of supporting medical imagery. So a lot of the imagery work is a computer vision-related task and now the flexibility of our tools can support that -- those types of applications. So that is an area of support today. Feature engineering, yes, I'll pass on to Wilson for that one.

Wilson Pang

executive

#26

Sure. It's a great question. Data scientists really spend a lot of time on feature engineering. And also when you say feature engineering, most of those tasks are related to the training data. All the products we shared today, all the examples we gave today, in broad terms, are also feature engineering. They are really preparing all those features, which are also the training data, to help train the model. So we are working a lot on those. Besides all those traditional products that we have, where people collect data and enter the data, which will become features later on used to train the machine learning model -- that's already a big part of it -- there's also the new product, In-Platform Audit, that actually helps you to understand the training data, that helps you to understand the features: what is the distribution of a feature, is there any bias in a feature. So basically, all our work, all our products are around helping people to do better feature engineering.

Mark Brayan

executive

#27

Yes. Thank you, Wilson, and thanks for that question as well. The next question is, can Appen Mobile technology be used for data capture as well as being a crowd-focused tool, Ryan?

Ryan Kolln

executive

#28

Yes, absolutely. That's one of the real core components of Appen Mobile. So Wilson spoke through the 2 main features: one being the interface for our crowd, where they can sign on, view their tasks and manage the relationship with Appen. The second is as a really powerful data capture tool. So I'll give you an example around some of the features which were enabled in a mobile device that aren't in a desktop. So things like GPS. If the task was to go out and take a photo of a real-world environment, the GPS data is automatically tagged within the metadata of the image. So that becomes a really important part of the metadata of the training data. So there are a whole raft of different native features of handheld mobile devices that open up a different set of data collection capabilities.
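To make the point about GPS metadata concrete, here is a minimal Python sketch of how a mobile capture task might bundle location with an image reference so that the location travels with the training data. It assumes a hypothetical record layout (`CaptureRecord`, `build_capture_record`, and the field names are illustrative only) and does not describe how Appen Mobile actually stores or tags metadata.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class CaptureRecord:
    """Hypothetical metadata attached to one photo collected in the field."""
    task_id: str
    image_file: str
    latitude: float
    longitude: float
    captured_at: str


def build_capture_record(task_id, image_file, latitude, longitude, captured_at=None):
    """Validate the device-reported coordinates and bundle them with the image reference."""
    if not -90.0 <= latitude <= 90.0:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError(f"longitude out of range: {longitude}")
    timestamp = captured_at or datetime.now(timezone.utc)
    return CaptureRecord(task_id, image_file, latitude, longitude, timestamp.isoformat())


# Example: a street photo taken in Sydney (coordinates are illustrative).
record = build_capture_record(
    "task-001",
    "street_scene.jpg",
    -33.8688,
    151.2093,
    captured_at=datetime(2021, 5, 20, tzinfo=timezone.utc),
)
record_json = json.dumps(asdict(record), sort_keys=True)
print(record_json)
```

Keeping the location in the same record as the image means a downstream data set can later be filtered or balanced by geography without going back to the crowd worker.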

Mark Brayan

executive

#29

Thanks, Ryan, and thanks for that question. The next question. Slide 46 in the pack refers to different AI technologies that are used in mature practices and highlights "techniques that typically require some level of human annotation and/or data preparation." Where it says some level, is that level of human involvement the same, higher or lower than 2 years ago? And are lower levels of human annotation positive, pardon me, negative or neutral to Appen?

Ryan Kolln

executive

#30

A good question. A lot in AI varies, and there are different use cases and things evolve differently to others. I think that in those specific areas around supervised learning, there are some supervised learning techniques where data can be taken directly from CRM systems or other which are automatically annotated, which may not require as much human annotation to complete the feature engineering, whereas there are other supervised learning techniques that are heavily reliant on human annotations to complete the labeling process that's used to train the systems. So I think that's part 1 of the -- my response. The second part in how that has changed, our market is growing, and we see an increasing need for human-annotated data. There's also a lot of need for training of data that's already kind of prepared because it comes from CRM or other structured data systems. So AI is growing. I think it's growing everywhere. So to answer, is it more or less, I think it's definitely more.

Mark Brayan

executive

#31

Thanks, Ryan. The next question, will the rising concerns on privacy issues with Big Tech -- with the rising -- sorry, with the rising concerns on privacy issues with Big Tech, Big Tech are working on reducing data accessibility. Do you think this would hurt Appen AI -- Appen's AI business model? And if yes, what would be a solution?

Ryan Kolln

executive

#32

We're still seeing how this is playing out. A lot of these changes are quite new. But you're right in saying that there seems to be a general view that there's a restriction on the data sharing and how that's used in -- outside the broader ecosystem. One of the views that we have is that with this restriction of sharing, there will be less available information to train the models that are used for things like advertising targeting, search results, et cetera. So again, we're yet to see it play out. It's very live, but there is potential that this could be a net positive for us as more data is required to fill those gaps that have been created by the restriction driven by privacy.

Mark Brayan

executive

#33

And if you also think about it -- what's changing is, for want of a better word, the unsolicited harvesting of data. And we'll move to an environment where more permission is needed, where more protection is provided around personal data, more rights are provided around personal data. What's constant is that AI is the center of many product developments in technology. And it's also constant that AI needs training data. I hope we've sort of illustrated that today. What's changing is the way that firms acquire data. So there's no reduction in the need for data, but it's how companies acquire data. And as Ryan says, that could play to our advantage because we could see companies coming to us saying, "We can't harvest this data anymore. How do we get this data in a manner that protects the owner of that data?" So yet to play out, but it is an important part of the AI industry going forward. Okay. The next question, what proportion of models would you think are done by self-supervised learning?

Ryan Kolln

executive

#34

It's a tricky one to answer. Self-supervised learning is an interesting technique, and it has applicability, and it has its benefits because there's a lot less of the feature engineering and the data labeling required. It does have its drawbacks though. It really is restrained on the assigning of meaning to the data. So for instance, it will group a whole bunch of different shapes together, and that can be inferred to have a certain meaning. But that doesn't quite have the same benefit as what we see through human and the data used to support supervised learning. The other thing is that with a lot of the self-supervised learning techniques, particularly in the large-scale applications, there's an opportunity for bias to be introduced, and it's quite hard to control. Unsupervised requires huge amounts of data, so it's very difficult to filter out the inputs that create bias. I'll ask Wilson whether he has a view on that specific question around the proportion of models that are being -- that rely on self-supervised learning.

Wilson Pang

executive

#35

Yes. I do. Actually, this -- [ AI ] put a lot of effort into monitoring the progress and seeing how this technology evolves, right? Self-supervised learning is not a new thing. It's been there for a long time. I used it probably 10 years ago together with other techniques. So one super important thing about self-supervised learning is that it's also called self-supervised representation learning. What does that mean? It really means, let's say, how you learn the representation of a word, right? For example, back to my Twitter example, the tweet classification example. I want to classify if a tweet is positive or negative. I'm using a lot of labeled data, like this tweet is positive, this tweet is negative. I use them to train the model. But meanwhile, I also use a lot of self-supervised learning to prepare the features before that. What does it mean? I can use a lot of technologies to work out, let's say, for this particular word, what does this word mean? I can use self-supervised learning to really -- how word -- what word into that. That's kind of a feature engineering step. I will convert that word into a vector and then use that vector as the input for my supervised learning to train the model. I know it's a little bit complicated, but it's just one step of the overall machine learning training process. And also, we use those techniques together. It's not that I use self-supervised learning to replace supervised learning. That's not the case. I use them both within one model.
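The two-step pipeline Wilson outlines (learn word vectors from unlabeled text, then feed those vectors into a supervised classifier trained on labeled tweets) can be sketched as follows. This is a toy illustration under stated assumptions: simple co-occurrence counts stand in for a learned self-supervised embedding such as word2vec, a nearest-centroid rule stands in for the supervised model, and the corpus, labels and function names are all made up for the example.

```python
from collections import defaultdict


def cooccurrence_vectors(unlabeled_texts, window=3):
    """Step 1 (no labels): represent each word by counts of the words seen near it."""
    vocab = sorted({w for text in unlabeled_texts for w in text.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = {w: [0.0] * len(vocab) for w in vocab}
    for text in unlabeled_texts:
        words = text.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][index[words[j]]] += 1.0
    return vectors


def embed(text, vectors):
    """Turn a tweet into one feature vector by averaging its known word vectors."""
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


def train_centroids(labeled, vectors):
    """Step 2 (with labels): average the feature vectors of each class."""
    grouped = defaultdict(list)
    for text, label in labeled:
        grouped[label].append(embed(text, vectors))
    return {label: [sum(col) / len(vs) for col in zip(*vs)] for label, vs in grouped.items()}


def predict(text, centroids, vectors):
    """Assign the class whose centroid is closest to the tweet's feature vector."""
    v = embed(text, vectors)
    return min(centroids, key=lambda lab: sum((a - b) ** 2 for a, b in zip(v, centroids[lab])))


unlabeled = ["love this so happy", "adore this so happy", "hate this so sad", "loathe this so sad"]
labeled = [("love this", "positive"), ("hate this", "negative")]

word_vectors = cooccurrence_vectors(unlabeled)
centroids = train_centroids(labeled, word_vectors)
print(predict("adore this", centroids, word_vectors))
print(predict("loathe this", centroids, word_vectors))
```

The words "adore" and "loathe" never appear in the labeled data; the classifier can still place them because the unlabeled step gave them the same vectors as "love" and "hate". That is the sense in which the two techniques are used together in one model rather than one replacing the other.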

Mark Brayan

executive

#36

Yes. Thanks, Wilson. And I think, overall, that -- as Ryan's -- you recall the slide in the deck for the speech application, it was a very complicated slide, it's a blue-shaded slide. There are many, many different models in one application or one product, and there are many techniques that go into those models. Training data is essential for AI, but it's expensive. If the only training data available was human-annotated data, it could be potentially prohibitively expensive. So the developers of AI are looking for any technique they can to accelerate and improve the cost of the development of their AI products. So typically, there's a mix of techniques that goes into building one product. And that's reflected in the way that we're going about our business rather than rely on the simple technique of human-annotated data. We're looking to use AI ourselves to accelerate and improve the unit economics of the production of that data. Okay. The next question, why do customers use your platform and tools rather than build their own?

Ryan Kolln

executive

#37

A good question. Customers rely on us for a variety of areas. So firstly, and I think we kind of covered some of this today, data labeling is difficult. Data labeling and annotation at really high-quality levels can be very difficult. So that's point one. I think the next is that managing a crowd is very difficult also. So there's -- it's one thing to assemble 1 million-plus people. It's another to allocate work, manage the quality, do the payments, et cetera. So customers could -- and some of our customers or people in the industry, I should say, do build out their own annotation platforms. And it's largely to support or comes -- it's initiated by supporting a very specific use case. So they'll build a data pipeline and a workflow within the business and some annotation tools to support a specific use case. Then comes the step of, okay, we want to do some different things. It's not just this narrow use case. We want to expand beyond that. And that's when it starts to become very apparent to our customers that this is a big investment, and it's difficult to bring the expertise across that wide variety of use cases. During our sales processes, we have many customers who have been down this journey where they start with a narrow use case and build something internally and quickly realize that it is very difficult to manage quality, in particular, and it's difficult to support a breadth of AI use cases. So customers come to us when AI is getting serious and they really want to move into production and support high-quality training data for high-performing applications.

Mark Brayan

executive

#38

Yes. Thanks, Ryan. I think it's like any developing industry. And AI is still relatively early. There are many techniques that people try on their own, could even be developing their own platform to do this work. And that gets to a point where it's too complicated. The scale of the operation is too large. And then at the same time as people are learning how to do this, there are companies like Appen emerging to bring specialist expertise to the industries. I'm sure there was a time when every company made their own payroll system, for example, whereas now you would never do that. So I think it's a bit of that evolution as well. And I'd also add that every one of our customers benefits from all of our experience and knowledge that's embedded in the platform, as opposed to just covering their particular use case. So yes, we see a departure from people building their own platforms to wanting to work with a specialist provider. Okay. The next question, how is the role of crowdsourced work changing for Appen as the model changes to a product-led committed revenue model? What does crowdsource efficiency mean for Appen?

Ryan Kolln

executive

#39

Good question. There's a large amount of work for our crowd. One evolution that we're seeing is that the demographics are getting more specific. So the ask from our customers. So one of the things that we're doing is ensuring that we're able to serve the customers' needs by putting the right demographics. So that's very important for us. The other thing that we're working very hard on is what we spoke about with Appen Mobile. So making that crowd experience a lot more seamless. So there's greater visibility into the tasks available and greater matching of a person's skills to the task. So when they do come and work with us, it's a task that they are able to deliver high-quality work and do more on those tasks and support in a really strong approach. So our crowdsourcing approach, we continue to build our crowd. We continue to find ways to better match the people in the crowd with the right task and that's a good experience for our crowd. It's a good experience for our customers and ultimately leads to a stronger growth in the business.

Mark Brayan

executive

#40

Yes. The crowd is a big expense. It's the cost of goods expense that goes through the business. So if we can get more data per crowd worker to improve the unit economics, then that goes to our gross margin and ultimately to the bottom line. This is a multipart question. Firstly, data ownership and licensing. If a client owns their data library, does that mean they no longer need that type of service again? Or is ongoing data maintenance required? If so, is this typically provided by Appen or the client? Can data libraries from existing clients be sold to other clients wanting the same type of data? Or is it private and mostly customized to client needs? And that was just part 1.

Ryan Kolln

executive

#41

Okay. Let me handle part 1. So we spoke a lot about data recency and there's, again, the general view that all AI models degrade in terms of performance. So it's not a matter of if, it's a matter of when. So the view is that a store of data needs to be refreshed. And we support a lot of our customers in updating and supporting additional collection or, if they have the data themselves, the labeling to refresh those data assets and those features so that the models can be retrained and the performance continues over time. The second part is around ownership of the data. It varies customer by customer. With some customers, where we are doing the data collection on their behalf and depending on the arrangement with the customer, we have the ability to access that data either for internal use or to on-sell. With other customers, we don't. So it is very specific on a case-by-case basis.

Mark Brayan

executive

#42

Yes. Thank you. Thank you, Ryan. The second part of this question is about semi-supervised learning, and I think we've covered that. So in the interest of time, we'll move to the third part, which is, in relation to your expense item, services purchased data collection, do you expect long-term cost improvements here? And I think that relates to Appen Mobile, for example.

Ryan Kolln

executive

#43

Yes, definitely. So we continue to invest in, like what Mark said, getting more data from -- per crowd worker. Appen Mobile is one of the big areas that we're focusing on. One, to improve the unit economics of data collection. But I think more importantly is to create a more feature-rich set of data that we're collecting from the field. So there's a lot of exciting projects that we're working on in this space.

Mark Brayan

executive

#44

And I wonder, Wilson, do you have anything to add on ways that we lower the cost of data collection?

Wilson Pang

executive

#45

Yes. There's a few areas that we look into. One is really just to make the data collection work much easier for the crowd worker, right? So they just pick up their phone, open the app and then pass it down, super easy. So cost -- by improving the experience, we can drive down the cost. And also, we are trying to use machine learning in some of those spaces. I'll give you an example. Some data collection tasks need a worker to record some voice data and then transcribe that voice data. The technique we use there is, when they record the voice data, we use our machine learning capability in the back end to transcribe that data automatically, but we don't show it to the worker because we want to make sure that the worker also provides their input. But what we do there, using all the pre-transcribed data in the back end, is provide an auto-complete feature for the workers. When they transcribe data, we will show them, is this what you are going to say? Yes. That saves some time to finish that data collection task. So that's the second area. Basically, besides the better experience, easy to use, we are also applying machine learning to help data collection tasks.
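The auto-complete idea described above (a hidden machine transcript that is only surfaced as a suggestion once the worker has started typing) can be sketched roughly as below. This is an assumption-laden illustration, not Appen's implementation: the function name `suggest_completion`, the word-level matching rule and the `max_words` limit are all hypothetical choices.

```python
def suggest_completion(typed, machine_transcript, max_words=3):
    """Suggest the next few words from a hidden machine transcript.

    Nothing is suggested until the worker has typed something, and nothing is
    suggested once the worker's text diverges from the machine transcript, so
    the worker's own input always comes first.
    """
    machine_words = machine_transcript.lower().split()
    typed_lower = typed.lower()
    typed_words = typed_lower.split()
    if not typed_words:
        return ""  # do not reveal the machine transcript up front

    ends_mid_word = not typed_lower[-1].isspace()
    finished = typed_words[:-1] if ends_mid_word else typed_words
    if finished != machine_words[:len(finished)]:
        return ""  # worker disagrees with the machine; stay silent

    start = len(finished)
    if ends_mid_word:
        partial = typed_words[-1]
        if start >= len(machine_words) or not machine_words[start].startswith(partial):
            return ""
    return " ".join(machine_words[start:start + max_words])


hidden = "please turn on the kitchen lights"
print(suggest_completion("", hidden))
print(suggest_completion("please tu", hidden))
print(suggest_completion("please turn ", hidden))
print(suggest_completion("kindly ", hidden))
```

Returning an empty suggestion on any mismatch is one way to limit the bias risk raised earlier in the call, where workers simply accept whatever the machine shows them.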

Mark Brayan

executive

#46

Yes. Thanks, Wilson. And data collection is becoming more important because recall Ryan's example about the chatbot. The U.S. English language chatbot, there could be a lot of off-the-shelf data. But when you get down to a much more specific use case, a different language, a more specialized area, you've got to collect a lot of data for that. So the easier it is and the more cost-efficient it is to collect data, the more value it is for the customer. The final part of this question is, what metrics -- sorry, do you -- in relation to Figure Eight, the Figure Eight acquisition, what metrics are you using to measure its success? The key one is one we took everybody through yesterday, which is the growth in that new market figure. So the new market figure is data that we -- sorry, revenue that we derive from the enterprise sector, from the government sector, from China and also revenue that flows through our platform from our major customers. And you can see from yesterday's presentation that that's growing nicely. None of that revenue would be available without that acquisition. Okay. The next question further down the page is, is the machine learning doing the OCR and audio transcription our proprietary software or off the shelf? I think I'll throw this one straight to Wilson. You built it. You should know it.

Wilson Pang

executive

#47

They are proprietary models. We started with off-the-shelf models. It didn't work for a few reasons. One is, a lot of the machine learning use cases are very specialized, and we need to find the right training data to support that use case. So those off-the-shelf models don't really work well for those use cases. So we have to train our own model. So that's one reason. Second reason is also our model is a little bit different from the end model, right? You can see like, for example, audio transcription, our model not only needs to transcribe the audio to text. We also need to flag this is the background noise. This is a different gender. This is some [indiscernible]. We need to label all those different activities. Off-the-shelf models don't handle those. So we have to train our own proprietary model. So it will be more difficult than an off-the-shelf model, but it just gives us an advantage. Only we can do this type of job.

Mark Brayan

executive

#48

Yes. Thanks, Wilson. Thanks for that question. I hope that provides a clear answer. The next question, do the Appen products integrate with client-side data pipelines and applications? Again, I'll throw this one straight to Wilson.

Wilson Pang

executive

#49

Yes, that's a great question. And the answer is absolutely, yes. We provide a very rich set of APIs to the client. Using our APIs, they can set up jobs, upload data, download data and just make our system part of their overall pipeline. That's just used a lot. And that's also a big focus for our product engineering team.

Mark Brayan

executive

#50

Thanks, Wilson. The next question is, will this webinar be uploaded for replay? And the answer is yes. A recording of today's event will be available on our Investor Center on the Events and Presentations page early next week. The next question, one for Ryan. As use cases become more niche, will you have to develop more tools? Can AI models be transferred between use cases?

Ryan Kolln

executive

#51

Yes. Good question on the tooling. So back to what we said throughout this presentation, we believe we have a complete set of tools, but new AI use cases with specific data techniques continue to emerge. So we will continue to invest in the breadth of our tooling. Some examples there, which are live at the moment, is different light spectrum. So non-visible light spectrum is a good example where there's a lot of interest in AI applications. Our tools support it today, but there's more that we could be doing in that space. So an example of an emerging area that we'll be focusing on. On the transferability of models, there is a very common technique in AI model development, which is called transfer learning. And transfer learning is used pretty much across the board for every model. It only gets you a small part of the way, though. So there is still a lot of fine-tuning required, and that's really where the supervised learning comes in and the requirement of high-quality training data.
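
The transfer-learning point above can be illustrated with a toy Python sketch of one common variant, feature extraction: a base model is kept frozen and only a small new "head" is trained on labeled data for the new use case. Everything here is an assumption for illustration. `PRETRAINED_WEIGHTS` is a hand-written stand-in for a real pretrained network, and the four labeled examples are made up; nothing reflects Appen's or any customer's models.

```python
import math

# Stand-in for a pretrained base layer (assumed values). It is never updated.
PRETRAINED_WEIGHTS = [[1.0, -1.0], [0.5, 0.5]]


def frozen_features(x):
    """Apply the frozen base layer: a linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in PRETRAINED_WEIGHTS]


def train_head(examples, epochs=200, lr=0.5):
    """Fit a logistic-regression head on top of the frozen features.

    Only the head's weights and bias change; this is the supervised step
    that depends on labeled training data for the new task.
    """
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, label in examples:
            f = frozen_features(x)
            z = sum(w * fi for w, fi in zip(weights, f)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - label  # gradient of the log loss with respect to z
            weights = [w - lr * err * fi for w, fi in zip(weights, f)]
            bias -= lr * err
    return weights, bias


def predict(head, x):
    """Return 1 if the head scores the frozen features above zero, else 0."""
    weights, bias = head
    f = frozen_features(x)
    return 1 if sum(w * fi for w, fi in zip(weights, f)) + bias > 0 else 0


# Hypothetical labeled data for the new task: label 1 when x[0] > x[1].
LABELED = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([0.0, 2.0], 0), ([1.0, 3.0], 0)]

if __name__ == "__main__":
    head = train_head(LABELED)
    print([predict(head, x) for x, _ in LABELED])
```

The base supplies reusable features, but the head learns nothing useful without the labeled examples, which is the sense in which transfer learning only goes part of the way.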

Mark Brayan

executive

#52

Yes. Thanks. You may also recall during the presentation, Wilson mentioned that we're doing some work on satellite image data, which is not just another data type, but of course, there's a sort of a tiling nature of that data that requires certain tooling, et cetera. Okay. The next question. Regarding moving from model-centric to data-centric, how far into this move to data-centric are customers and, in particular, the big global customers? And how much of a difference can this shift make to Appen's financial performance? Ryan?

Ryan Kolln

executive

#53

I think that we're well into this shift. It does vary industry by industry and customer by customer. I think that in the more forward AI companies, including our largest customers, there is a very -- they're probably further along that shift, whereas there might be a set of customers who are more used to using internal data for the AI development, and they focus more on the models partially because of the realization that if they're able to use human-annotated data or different data sources, that will have a big unlock in terms of value for their AI models. But I might also get Wilson to chime in on this question.

Wilson Pang

executive

#54

Yes. I think the whole industry is moving more to data-centric AI. That basically applies to almost every company who's working on AI. It has just become a common understanding or common sense in the machine learning community. Data just plays a critical role in the AI model performance. So no matter if you're a professor or you are from a big company or a small company, I think all those data scientists, they just know the importance of data.

Mark Brayan

executive

#55

Thanks, Wilson. Thanks, Ryan. The next question, does everyone need the quality of the data that Appen provides? Ryan?

Ryan Kolln

executive

#56

Well, it depends on the quality of the model that they're looking to produce. If a customer wants to build a low-quality model and that's sufficient for the needs of the application, then they may not need high-quality training data. But if you want to build a model that has high quality across not just a small subset, but a broad subset of inputs, then you will need high-quality training data. And maybe another way to put it, if you want to build a high-quality model and you've got low-quality training data, it's kind of not possible. You need high-quality training data to build a high-quality model.

Mark Brayan

executive

#57

Wilson, maybe you have some examples of when you might choose to use low-quality data?

Wilson Pang

executive

#58

I do. I do. Actually, when I work on my chatbot product with my daughter during the weekend, I don't need Appen to provide the service to me. So it's good enough to have a toy. But if you are really using AI to do any serious business, high-quality data is a must.

Mark Brayan

executive

#59

I think that sums it up folks, building a chatbot with my daughter. And I can tell you, Wilson's daughter is quite young. So that's the extent of the knowledge there. Okay. This is, I believe, the last question we have. Does Appen market off-the-shelf data libraries for chatbots? And what is the extent of the service that Appen contributes to chatbot setups?

Ryan Kolln

executive

#60

We absolutely do off-the-shelf data. We have a rich catalog of data that we've collected. And one of the big differences for us is that we have very large volumes of data collected. So it's used by a lot of customers to kickstart their development of chatbot. So yes, it's an important part of our product offering.

Mark Brayan

executive

#61

That's all the questions we have. So I'd like to take this opportunity to thank all of you for attending our webinar today. I hope it was useful. If I can leave you with 3 thoughts from today's presentation, the first is that the future of AI is very robust, and it absolutely relies on large volumes of high-quality training data. I think the examples we provided make that very clear. I think also that we've provided a lot of information on the need for technology to provide those large volumes of high-quality data, the complexity of use cases, the volumes required in dealing with millions of crowd workers. It's not possible without a good, strong product foundation. And then the third thing is, I hope you see that we're investing into this. We've made a lot of progress in this area. There's lots to do. But over time, we are building much more of a product-first business and over time building more competitive advantage and resilience into our business as well. So thank you once again. Thank you to my co-presenters, Ryan and Wilson, for all of their input to this. I'm looking forward to the next time that we all meet. Thank you, and good day.
