Datadog, Inc. (DDOG) Earnings Call Transcript & Summary
August 3, 2023
Earnings Call Speaker Segments
Olivier Pomel
executive: Good morning, everyone. I'm Olivier Pomel, Co-Founder and CEO at Datadog. It's the first time we're having DASH here in San Francisco, and it feels amazing to be here today. And whether you're here with us or joining online from anywhere around the world, I want to welcome you all, and thank you for joining us today. I also wanted to take a minute to thank our sponsors and partners. They contribute so much to the Datadog platform and help make DASH a success. And they're all here today, so you can go and meet them on the expo floor if you want. It's hard to believe it's only been 9 months since we had our last DASH, and so much has happened since then. Today, we're very excited to show you what we've been up to. We'll have more than a dozen different speakers on stage from our product and engineering teams, and they will show what we've been working on. I often get asked how we maintain the pace of innovation and how we build product at Datadog. And the answer to that, the secret, is you, the Datadog community. We don't dream up products in isolation. Instead, we partner with you from day 1 to work on your biggest problems and try and solve them the right way. So I want to thank all of you for your trust, for your business and for making us better every single day. And now to get us started, I'd like to welcome on stage my Co-Founder and CTO, Alexis.
Alexis Le-Quoc
executive: Thank you, Olivier. Thank you for being with us today. We have a lot of stuff to share with you. So let's get started. If we took a poll and asked what's the biggest change in our industry since the last edition of DASH in October 2022, I bet we would find the top answer to be something like the wide availability and the remarkable results of large language models. It is a sea change. For instance, we've been tracking GPU consumption over the past few years, and we've seen it rise at a rapid clip. So to the ecosystem of technologies that are used at scale and monitored with Datadog, we need to add the whole generative AI stack. To tell you all about the ways you can get end-to-end visibility into your AI and LLM stack with Datadog, I'd like to invite Shri to the stage.
Shriram Subramanian
executive: Hi, everyone. I'm Shri, a product manager here at Datadog, and I'm very happy to be here. So generative AI and large language models are a powerful new way of building software. And it's been an exciting time for us here at Datadog to hear from you, our customers, about how you're using large language models to unlock new use cases. From code generation to conversational assistants, from search to summarization, large language models are universally applicable. And managing AI applications today requires more collaboration than ever between application developers and your in-house machine learning and data science teams. As a former data scientist, I can relate. I often had to collaborate closely with my infrastructure engineers across multiple different teams to ensure the reliability and accuracy of my machine learning models in production. This only adds to the ever-growing responsibilities of an application developer. If you're managing an AI application today, your tech stack is going to look quite different. You're going to have new technologies such as GPU instances, model serving frameworks and the AI models themselves. You're going to be managing new types of data that are usually large volumes of unstructured text, images, maybe even audio, and you're going to have entirely new workflows, such as prompt templates as well as LLM chain agents that you now need to integrate into your code base. While operating an AI stack today can seem complex, challenging and sometimes new, monitoring it doesn't need to be. Today, I'm happy to announce a wide range of capabilities that will give you end-to-end visibility into your AI application stack. We are launching solutions to monitor the key technologies in every layer of the stack, and we have partnered with the leading vendors in each category to encapsulate the monitoring best practices and remove the heavy lifting for you to monitor your AI application. Starting with the infrastructure and compute layer.
Apart from the 3 cloud providers, we also now have solutions for monitoring GPU instances from NVIDIA and CoreWeave that will be useful for your self-hosted AI applications. Building a generative AI application means managing new data types such as embedding vectors, and we have partnered with the emerging players in the space to have solutions for monitoring Weaviate, Pinecone and others. You may also be using a model serving tool to host your own AI model and expose its functions via APIs. We now have solutions for monitoring Amazon SageMaker, Vertex AI, [indiscernible] and more. Within the model layer, we have expanded the solutions for monitoring OpenAI and Azure OpenAI, and continue to add support for other model providers such as Amazon Bedrock, Hugging Face and more. Finally, in the orchestration framework, a new layer specific to managing LLM agents, we now have a solution for monitoring LangChain. With these solutions put together, along with the out-of-the-box dashboards, recommended monitors, log pipelines and more, Datadog offers comprehensive visibility in a single place into the health of your AI application. This will enable you and your teams to quickly troubleshoot any performance issues as well as identify bottlenecks in your application stack. So who's ready for the first demo of the day? Here, I have a dashboard I built to monitor my application that uses OpenAI, LangChain and Pinecone. This dashboard is powered by Datadog solutions for AI stack observability. And it summarizes my key metrics in the overview, such as my total requests made to LangChain, specifically to OpenAI, my total model cost incurred and my index fullness ratio from our Pinecone database. I can also understand how my application interacts with different components of my AI stack. Datadog's LangChain integration automatically captures my total request volume based on different LLM providers as well as the specific models that I'm using.
Cost is also an important factor for me to track as that can ramp up very quickly. Here, I've broken down my total token usage based on my different OpenAI models. If I'm using an external API, I need to monitor its API performance. I am [indiscernible] in [indiscernible] by type and have also broken down my API call latencies. I noticed that the latencies of calls made to my Pinecone database have started to increase along with my error rates. This may be something I need to investigate further later on. For more granular visibility, Datadog also collects the logs that capture the prompts and completions that go in and out of my different models. Here, I have a sample of my prompt and completion pairs, showing which LLM model they come from as well as which part of the chain they belong to. I also have a sample of the vector similarity searches that my application makes to my Pinecone database, so I can understand how well my retrieval-augmented generation is working. Visibility into my prompts and completion responses helps me understand how my users are interacting with my application, so I can better optimize for their experience. So as you can see, Datadog solutions for AI stack observability make it easier and faster for you to deploy, run and monitor your AI applications with confidence. This allows you to focus on taking advantage of LLMs at scale and on delivering value to your customers. Datadog solutions for AI stack observability are available beginning right now, and you can check them out from the link on the screen. You can also attend our platform talk later today to learn more, or see a demo of this at the new launches demo booth. So far, we've talked about how we can monitor the technologies around our AI model that make up the stack. But what about monitoring the model itself? To tell you more, I would now like to invite on stage Junaid. Thank you.
Junaid Ahmed
executive: Thank you, Shri. Hi. I'm Junaid Ahmed, and I'm really excited to be here in front of all of you. You just heard how Datadog is providing end-to-end visibility into your entire AI stack. This will help break more silos at the intersection of DevOps and machine learning. And we're not stopping there. In my previous life, my teams and I used large language models to improve relevance and answer questions on web search through rich document understanding. I can tell you that even with a large team of experts, managing models in production was a really big challenge. This is true across the industry for a variety of reasons. First of all, the costs of infrastructure operations are often hidden away from the ML engineers, making it harder to run a predictable service. This is amplified by the lack of visibility into model performance and its effect on the application. And lastly and most importantly, model performance degrades over time as real-world data doesn't mimic the training data. Due to the complexity of neural networks, debugging can often turn into a game of [indiscernible]. Today, we are excited to announce a range of capabilities that will accelerate your journey into shipping LLM-powered apps to production. To illustrate this new offering, let me walk you through a real-world application. [ Couch Cache ] is an online furniture store. And like all e-commerce applications, they want to drive more sales. So they built a chatbot powered by LLMs on their proprietary product data in the hopes that they will sell more inventory. So how does it all work? Customers of [ Couch Cache ] talk to the chatbot in natural language about the products on sale, which is then used to retrieve first-party keywords from vector databases and full-text search indices. The response is then used to set context for the large language models to get near-human responses back to our customers.
Each customer is also able to provide feedback on their shopping session and their interactions. Dozens of ML practitioners and app developers need to work together to maximize customer satisfaction and sales through the chatbot. So how does a large team manage such a complicated combination of apps and models? Introducing Datadog's brand new model catalog, a centralized way to view all my models. On this page, I can see models across a range of providers such as OpenAI, Anthropic, Cohere, Google PaLM and custom in-house models. These are all detected by Datadog out of the box when using our familiar libraries and the service integrations discussed earlier. The model catalog also consolidates key usage, operational and performance metrics as well as any active alerts onto one single page. This is great as it provides a single source of truth across my SREs, application developers and ML engineers, keeping them all on the same page at all times. Now let's investigate some of the most important issues that the chatbot could be experiencing. I notice the GPT-3.5 Turbo model has a few ongoing issues. Let's zoom in to understand what could be happening. I'm now looking at the model overview, which gives me detailed cost, performance and service dependency information, which is always a great starting point. As I scroll down, I can clearly see there's a big spike on one of our services calling the OpenAI model, which is mirrored by another spike on the token usage side for the same model. The token usage aspect is very important, because this is directly the cost of what my teams would be paying for inferencing these models. Through Datadog's dynamic aggregator and service map, I can see the services that are now calling this model at an accelerated pace, which, in this case, appears to be the support service.
The support service team can now be engaged, and this quick alert-to-fix session can save my company a lot of money as rogue application issues are addressed promptly. Let's continue to investigate further. Now I also see that recently, the performance of the same model has been dropping over time, which means that the customers of the chatbot have not been finding the responses of the model particularly useful. Let's zoom in to understand what could be causing a regression like this. Datadog's LLM observability gives my team and me out-of-the-box metrics, which can help debug models in production and optimize model performance. You're seeing a slice of the worst performing metrics that have an impact on model regression. These metrics can also be customized by the teams owning the models, as domain-specific scenarios might require their own metrics. In this particular instance, I can see that when the token count is within a certain range, the model performance drops by 8%. Through Datadog's universal metrics-to-prompt tagging, I can see precisely the prompts that are causing the degradation in performance with respect to the cohort that I just clicked. My teams can now start looking at the relevant samples and better fine-tune the models, getting a head start in the investigation right away. Now, fine-tuning models is a continuous, mundane process, and Datadog does one more thing to expedite progress. I'm excited to show you all Datadog's new automated drift detection and clustering capabilities. For context, drift allows us to track feature or data values over time so that model regression can be correlated to the precise change experienced by the application. Now, also with this capability, I get to see the distribution of prompts and responses to find interaction patterns that hurt customer satisfaction.
What we are seeing here are clusters of prompt and response embeddings, which are also color-coded for the customer feedback being received: green for positive, gray for neutral, red for negative. These clusters cohesively categorize the furniture selection available at [ Couch Cache ]. I can immediately see a large red cluster indicating a high concentration of prompts that receive negative feedback from our users. This cluster primarily talks about lamps and lighting. With one easy click on the cluster, I can see a sample of prompts and responses. And going through these, we can quickly validate that these are indeed talking about this category of lamps and lighting. Now that we know the problematic category, the team has one suspicion. It turns out [ Couch Cache ] introduced a new product SKU called Lights and Lamps a few days ago. So it's possible that the vector databases and the search indices have not been updated with this new product category that the customers are asking for. To confirm this hypothesis, I can also detect if there has indeed been a drift over time. I can compare the embeddings with baseline periods in the past. As you can see in this view, the A and the B refer to 2 snapshots of the embedding versions. And you can quickly see that the Lamps and Lighting category is present in our current cluster and not in the past, proving the hypothesis that the search index has indeed not been updated with this new product category. This one powerful view enables SREs, ML engineers and application developers to quickly focus on fixing the problem, which is updating the search index, rather than going through all the prompts and all the responses and hopelessly trying to debug for a long period of time.
So what you all saw today is that with Datadog's LLM Observability suite, you can now bring your models together into one unified view with the model catalog, get real-time insights into your model performance and cost, and use model drift to investigate model regressions over time. All these capabilities will radically decrease your team's time and cost for shipping delightful AI-powered experiences to production. Sign up today to start a partnership with Datadog, learn more about our solution at the platform talk after this keynote and see this demo in action at the AI demo booth. And now I want to invite my colleague, Kai, onto the stage. Thank you all.
Kai Xin Tai
executive: Thanks, Junaid. When monitoring distributed systems, we all know that it's always better to have the data and not need it than to not have the data when you need it. That is why you send trillions of data points to Datadog to monitor the health of your hosts, your applications and your business. Today, we're making it easier than ever to understand and act on all of that data. I'm incredibly excited to announce Bits AI, your new DevOps copilot. That's right. We're bringing generative AI into your day-to-day workflow. Bits AI is trained on the Datadog products you know and love, and on your unique data taxonomy and topology. It knows everything in your environment: your services, your teams and your infrastructure, how they're named and how they're all connected. What this means is that Bits AI can guide you through investigations by proactively surfacing signals that point you to the root cause and by answering your follow-up questions with full context of your systems. Once you're ready to take action, Bits AI can suggest Datadog workflows, fetch information from your runbooks and use all our third-party integrations to help you collaborate with your coworkers. Bits AI is everywhere you're working: on our web app, in your favorite ChatOps tool and even on your mobile device. Let me show you how it works. I'm on call at 4 a.m. and I just received an alert about an [indiscernible]. It's for our [indiscernible] processor, a critical service that indexes events customers submit to Datadog. I know it's sensitive to issues with dependencies. So I'm going to ask Bits AI if there are any related issues. It points out a number of issues with event intake upstream. And that's because Bits AI has access to my service map. And for situations like this, it is trained to check both up and downstream dependencies to see what is impacted. It also answers extensively with information from a variety of sources.
Bits AI doesn't just tell me about the alerts I have set up, it works with Watchdog to proactively surface anomalies, outliers and forecasted outages. What catches my eye is this incident, which I'm going to follow. As soon as I join the incident Slack channel, Bits AI generates a summary to bring me up to speed. It looks like an intake service upstream of mine was hit by a flood of requests, and the team is looking at a potential DDoS attack. While the security team takes care of that, I'm going to ask Bits AI for a dashboard to help me mitigate the impact of the increased load on my service. Bits AI has offered me a few that my team relies on. I'm going to follow the Kubernetes one. What you might have noticed is that my chat session was carried over. Context is incredibly important for large language models. This is what helps Bits AI come up with even more relevant responses as I progress through my investigation. Here on my dashboard, I can see lots of pending and failed pods. So I'm going to dig into my logs to learn more. Without leaving, I can ask Bits AI to show me error logs. It seems like the messages are the same all the way down, and they all point to memory constraints. So I'll first check if it's a problem with my workloads by grouping the errors by image tag. It's pretty even between the 2 versions. So that wasn't too helpful. This time, I'm thinking that maybe the problem is at the cluster level. Let's group by Kubernetes cluster name and make that [indiscernible]. Okay, we see an outsized impact in U.S. So let's now check what instance sizes are being used. I see the problem here. AUS is heavily under-provisioned. It's only running medium instances whereas every other cluster is running much larger ones. What I just showed you was natural language querying. Remember, Bits AI speaks your language. It understands your unique environment.
That is why in my case, when I say the processor and event intake, Bits AI knows I'm referring to services. It knows that US 1 and US 5 are data centers. Bits AI can help you investigate complex problems by bringing in data from across the Datadog platform: metrics, traces, logs, resources, real user transactions, security signals, cloud costs and more. But let's go one step further and see how Bits AI can help you take action. What I'm looking to do now is scale the EUS cluster. At Datadog, we store our runbooks in Confluence. So we've integrated Bits AI with Confluence to bring our operational knowledge into Datadog. I'm going to start following these instructions. And although I'm taking manual actions here, Bits AI can also offer automated actions. Just like my peer on the security team, I could ask Bits AI to suggest a Datadog workflow to block the IPs involved in the DDoS attack. It will interact with me in Slack to collect all the parameters it needs. And over in Datadog, we see the workflow being executed, and the IPs are now blocked. But regardless of how you remediate, there are usually processes you have to follow after the incident, like writing a postmortem. Instead of spending days on this, Bits AI helps me get a head start with the first draft. This way, I can focus on what's actually important, the critical analysis. Bits AI is your new DevOps copilot. Investigate issues: get synthesized insights from across your entire stack, see anomalies, outliers and forecasted outages and run complex queries against any data source, all in natural language. Then take action: tap into institutional knowledge stored in runbooks, get Kubernetes commands, Terraform scripts and syntax help for third-party software and trigger Datadog workflows to remediate issues, anywhere you are.
Last but not least, collaborate: get the right people involved at the right time, on-call teams, customer success managers, support engineers, members of leadership, and give them the context they need. Generative AI is not magic. It's technology. We've already seen Bits AI improve the day-to-day workflows of our early testers and I cannot wait to get it into your hands. You can request access through this link and see a live demo for yourself later today. But that's not it. Over the next hour, we'll show you how Bits AI can create unit tests, code fixes and synthetic tests to boost your productivity and help you ship higher-quality software. And with that, I have the pleasure of introducing our customer, [ Dan Sperling ] from Teradata, to speak more about generative AI. Thank you.
Unknown Attendee
attendee: That is a lot of announcements. There's a lot of stuff that I hadn't gotten a chance to see yet. So I'm sitting backstage, watching this and being like, hey, it's hit, hit, hit of new releases, it's incredible to see. And the stuff that we got from Shri and Kai and Junaid is pretty incredible. For my talk, I'm going to zoom out a lot. It's 1985, you're sitting and listening to Prince's new song When Doves Cry on your Koss Porta Pro headphones. Down the street, there's a huge gathering of about 250 nerds from universities, government agencies, public sector companies, et cetera, who are discussing TCP/IPv4, which basically underpins the Internet as we know it. In 1985, was your life changed significantly by the Internet? Not really. Fast forward 20 years, 2007, Apple releases the smartphone and places massive connectivity and simplicity into the palm of your hand. How many of you have a smartphone? Smartphones up. Oh, come on. I know a lot more than that have one, right? Almost everybody has a smartphone. All of your lives were changed significantly. Move forward to 2015. Apple is now shipping 230 million units of iPhones a year. That's just Apple. We see massive adoption of the smartphone across the world, and it brought unparalleled consumerization and simplicity and made the Internet accessible to everybody in the palm of your hand. And it made the Internet almost become a basic human right. That simplicity and flexibility of the Internet was brought about by the smartphone and the consumerization of it for the masses. JPMorgan recently released a report that says that generative AI is the most important technological development in the last several decades. That includes the smartphone and the Internet. Now much like the iPhone -- or I'm sorry, the Internet when the iPhone was released, machine learning is not new.
In fact, when you think about generative adversarial networks, we've had GANs in production since about 2015, about the same year that Apple crossed that 200 million mark. So why is it a big deal now? Why are we talking about generative AI so much now? Because ChatGPT, Midjourney, Claude and the like have all made generative AI consumable by the masses. It took 35 years for the Internet to explode. But once it did, and once we had it in the palm of our hand and it was with us all the time, it became just like eating and breathing and drinking. We don't even talk about the Internet anymore. When people ask for WiFi access, they really don't care about the tech. They don't really think about the Internet. They don't think about what they're actually getting access to. They just want Snap or IG or their apps to work, they want to be connected. I feel that generative AI is going to be the same. It won't just become part of our lives. It will become our lives. It will become as real to you as everything that you do right now today with a smartphone. AI will become invisible. You will actually have AI solutions that underpin the things that you use. It won't be called generative AI blah or generative solution blah. Instead, it will just be the thing that underpins that app that you use to dynamically buy all your groceries. It knows your consumption patterns, it knows the inventory you have in your house, and it just buys stuff that it thinks you probably will want. The thing that you don't totally trust today to turn on. It will be that thing that underpins the transportation solution that you have, that dynamically always knows where you want to go and when you want to go, and has the right mode of transportation available to you at that point. Enough that you're willing to give up your car.
It will be that patch that you put on your arm that dynamically understands what your body needs and gives you medicine, vitamins, nutrients, et cetera, that you don't trust today to shove on your arm and just have it go to town. But you will. The world's eyes are open now to the potential of generative AI. We have seen what it can do. And the world is waiting for you, technology and business leaders, to bring generative AI solutions to the masses and make it consumable. I can see some of you nodding around like, yes, we've got to do that, that's our job. But why haven't we? What's slowing this down? AI, machine learning and especially generative AI have been around for over 8 years, so why haven't we? What's the problem? Why are we stuck? There's a report that talks about the solutions that exist and it says there are a lot of -- there are some solutions, like fraud prevention, for example, that do exist in the market and are using advanced machine learning like Gen AI. But it's estimated that 80% or more of the prototypes and POCs that are built in AI never see production. And with generative AI proper, that number is even higher. That percentage of misses is even higher. Why are we stuck? This is what companies like Datadog and Teradata are working to solve. You've seen a lot of the announcements. We are working to remove the barriers, the tech barriers, that prevent you from moving those great ideas out of ideation and into production. Now Teradata is built to be the world's largest at-scale analytic solution. We have over 150 machinery -- proper machine learning functions that are built into our analytical engine, that are tied to our massively scalable engine for massive performance. We are open and connected, and connect to most every major machine learning workbench, leading tools and all the cloud service offerings, so that you can either use models that you bring in from them or you can build models.
We have proven efficiency and scale to thousands of nodes, but do so with the lowest total cost of ownership so that you can achieve your AI goals, but also achieve your environmental sustainability goals. I could go on and on about Teradata, but Teradata is trusted by the largest companies in the world to bring their AI solutions to market. We've been talking about generative AI a little bit here, and I'm going to talk a little bit about Teradata tied to our customer base. But generative AI is not just about large language models. Generative AI can kind of prescribe or define or invent, whether it's words or whether it's imagery or videos, and it can also prescribe future events, so that you can take ideas that it generates that would affect your business or business verticals, your industry, even people's lives. Teradata has helped customers move beyond prototype. For example, we've got a customer that is actively using generative AI and image -- sorry, cameras in the shopping cart to look at what people are putting in their shopping cart. Now you might say, yes, that's normal. But we're not using statistical analysis to derive the next logical item you should have. Instead, it's actually using statistical models or basic machine learning models to understand what the thing is and to kind of predict what kind of item you're trying to create, or what kind of dessert or what kind of event you're trying to have. But then it flips over to generative AI and uses generative AI to prescribe the best ingredients that you might need, or might not, and to tell you what would make the best dish, or the best cocktail, or have the best party. This is truly innovative, and it's resulting in significantly higher uptake than previous models. We're all strapped to this rocket called -- actually, if you could go back one slide. We're good. We're all strapped to this rocket that is Gen AI.
We are all a part of this solution that is moving forward very quickly. If you think about generative AI, my intent also is to make sure that we talk about generative AI. It's not just magic. There's a lot of solutions up here on the slide, but the reality is a lot of these solutions also have other machine learning or just statistical analysis capabilities that may be better, more elegant, more efficient for what you're trying to accomplish within your business. Think of it kind of like modes of transportation. Generally, you would not walk to the moon. But also generally, you would not take SpaceX's Falcon Heavy to your neighbor's house, right? Those are like 2 modes of transportation. In the same kind of way, we've got the ability to leverage generative AI models for amazing things where they're the right thing to do, but we should be leveraging other solutions, or other statistical models, or machine learning functions, et cetera, to be able to accomplish our goals and solve the problems we're trying to solve today and in the future. But Falcon Heavy and generative AI have all the hype today. So I'd encourage you to use that hype, use the energy to push your machine learning function ideas, to push your AI ideas and your generative AI ideas forward within your business. You've heard from many of the team members here, and you're going to hear more about what Datadog is doing to bring ease of generative AI adoption to your company. It's so exciting for Teradata to be a part of that, because the stuff that Datadog is doing is allowing us to improve the products that we bring to our customers. So I'm so excited to see this and what we're going to keep seeing today. I'm going to close here with a couple of key takeaways. Hopefully, you took away 3 things, at least 3, maybe one. But for today, I hope that you looked and said, firstly, the world is locked on generative AI. The hype is yours.
But I hope that you are using that hype and that excitement and urgency to create products that make generative AI consumable to the masses. Secondly, generative AI is not magic. But again, like I said, you can use that hype and build on the hype within your companies. And then finally, Teradata, Datadog and others, we are building the platforms that allow you to have a more simplistic way of bringing these ideas, these generative AI ideas, the machine learning ideas to market. And I hope that you are using platforms and your partners, that you're co-building together to build the solutions of tomorrow. The world is waiting for all of us. Thank you so much. Have a great time at DASH, and now I'm going to welcome Alexis back to the stage.
Alexis Le-Quoc
executiveThanks, Dan. That was inspiring. Okay. So let's do a quick recap. So far, we've covered visibility into all the components that make up your generative AI stack. LLM observability that helps you deeply understand how well your models perform in production. Last but not least, Bits AI, I love this AI. It joins the power of large language models with the wealth of data you already have in Datadog. And it does so in a lot of cool new ways, as we'll see later. Now you may feel that large language models and generative AI have stolen everyone's thunder over the past couple of months. But rest assured, we have not forgotten about the rest of our platform. Starting with observability, which is what you all know us for. We have a number of new things to show you. And for this next segment, please welcome Sid to the stage.
Sid Dhingra
executiveHi, everyone. I'm Sid Dhingra, a product manager here at Datadog, and I'm thrilled to be here with you today in San Francisco. Over the last few years, we've noticed that each of you is generating nearly 3x as much data as before. You're generating logs from thousands of services, third-party integrations, IoT devices, data pipelines to every corner of your digital infrastructure. And with each use case, you have different needs in terms of how long you need to retain that data and how often you query it. Indexing has been the perfect solution for all your short-term retention needs such as a few days and when you need to query the data in a hot manner for all your real-time investigations and alerting, such as for application logs. Whereas archiving has been the perfect solution for all your long-term retention needs such as a few years and when you rarely need to query the data. This is great for all your audit, compliance and configuration use cases. But as the volume of data grows, so too have the use cases. What do you do when your needs are somewhere in between indexing and archiving? What do you do when your retention needs are not a few days or a few years, but a few months, your log volumes are really high and you need to query that data on an ad hoc basis regularly? Well, we've rebuilt from the ground up our log management platform to solve exactly this problem. And I'm thrilled to be able to announce today Flex Logs. Flex Logs decouples storage from compute, allowing you to bring all of your high-volume logs for ad hoc querying inside of Datadog at an affordable price. Storage starts at just $0.05 per million logs per month. You heard that right. That is comparable to all of your cloud storage providers, but fully managed by Datadog. That means you don't have to manage any files, you don't need to manage partitions or create schemas, that's all done by us. 
And you have full flexibility to determine exactly how long you need to retain that data, whether it be 3 months, 6 months, a year or longer. And on the compute side, your teams can choose exactly the level of compute they need, whether it be small, medium or large for each of their use cases at a fixed monthly price. And that means even as your log volumes grow, you can keep your budget relatively flat. Flex Logs is the perfect solution for all the use cases that sit in between indexing and archiving for all your high-volume logs that you need to query on an ad hoc basis, such as transaction logs, security logs and network logs. Imagine what you could do by bringing all your high-volume logs into Datadog correlated with your traces and your metrics. Let's illustrate this with an example. Imagine that I am a network engineer at a large streaming service that had an incident last month. And my team was bogged down, trying to find all this data, at least 20 terabytes of network logs per day, get them rehydrated into Datadog or use some third-party tool, and we lost valuable time in this investigation. We want to prevent that in the future. My team consists of 50 users that are querying that data on a daily basis, and they don't want to wait for rehydrating or using some other third-party tool. However, they also don't need to set up monitors or alerts. So let's see how they can use Flex Logs to solve this problem. Flex Logs works right alongside log index configuration. So the first thing I'm going to do is create a new index and inside of that new index, I'm going to filter in all my network logs and I can choose right there between standard indexing and Flex Logs. In this case, I'm going to select Flex Logs, because I don't need alerts or monitoring, and I can choose the level of retention I need, and I'll choose 90 days here, because that's how long I want to keep that data available for me. The next thing I'll do is go ahead and save this index. 
And right there, you'll see that the network logs index is saved with 90-day retention under Flex. Additionally, my team can choose exactly the level of compute they need for their use cases. And if their needs change in the future, they can change the level of compute without moving the underlying logs. I'll go ahead and go into log search to see what these logs look like in my logs explorer. Here, I'll first filter down to my network logs index. And in one click, I can search across all my standard index logs and my Flex Logs. I can even go back historically to see all my older data. To recap, we have solutions now for all of your use cases. Indexing is perfect for all your short-term retention needs when you need hot, fast querying for all your real-time use cases and alerts. Flex Logs is the new solution for all your high-volume logs that you need to query on an ad hoc basis regularly. And archiving has been the great solution for all your long-term retention needs when you rarely query the data and you just need it for audit and compliance purposes. These are all of your use cases for logs consolidated in Datadog's single pane of glass. You can start exploring all of your logs today, follow this link to learn more and sign up for more information or stop by our platform talk at 3:00 p.m. to see exactly how you can use Flex Logs to solve your use cases and problems. Now I'd like to introduce you to Barry, who will talk to you about all of your use cases for logs inside of your own environment with Observability Pipelines. Thank you.
Barry Eom
executiveThanks, Sid. Hi, everyone. My name is Barry Eom. Just like chemical element 56, barium. And today, I'm excited to share a new product update to our Observability Pipelines product. When it comes to your logging volumes, we know that what you bring to Datadog is only part of the puzzle. You have terabytes and petabytes of logs going from a variety of sources that need to get to various destinations, introducing quite a bit of complexity in your observability architecture. And to make matters worse, because of this complexity, it's a nightmare to make any sort of change. To do so, you need to use a piecemeal combination of various CI/CD and configuration management tools, which can be a tedious, time-consuming and error-prone process. And this is why we developed Observability Pipelines to centralize telemetry in your own infrastructure to collect, process, reduce, enrich and redact all your logs in one place, regardless of what downstream destinations or vendors you send that data to. And today, I'm thrilled to announce a new control plane for Observability Pipelines for building and deploying pipeline configurations, all from our UI. So let's see this in action. Let's say I'm on the network team, and I need to get the network logs into Datadog as soon as possible to investigate network connectivity issues. This is a map view of my pipeline running in all my different data centers in different environments. I can get a topology view of my pipeline as well as an insight into the end-to-end health of the pipeline and its individual components. This top part of the pipeline is receiving my organization's application logs from Splunk, Datadog and [indiscernible], processing and structuring them and then routing just the interesting logs to Datadog and Splunk, while the noisy logs is [indiscernible] S3. 
And on the other hand, this bottom part of my pipeline is receiving my application logs or my network logs, deduplicating them, remapping them and routing them to S3. So to route the network logs into Datadog, I simply have to go into edit mode and build a draft. And once I'm in edit mode, I simply click the Datadog destination, adjust its input so that it receives the network logs. And just like that, I've updated this pipeline draft to dual ship the network logs to Datadog and S3. But now because this pipeline is handling all my company's telemetry data, I don't want to roll out a change all at once in production. Instead, I want to roll out just to my dev environment, so that I can check for any errors or regressions and not page everyone else at 3 AM in the morning. Previously, I would have needed to manually test my changes, get my changes approved and then SSH into all my different dev environments or pull in a separate DevOps team to run an Ansible playbook and then roll out my changes one by one by one. But I don't need to do any of that with Observability Pipelines. I simply click deploy to review my changes, which are actually tested in the background, so that I don't deploy a broken pipeline. And once all tests have passed, I can roll out to all my different environments selectively and target specific pipeline instances, whether it be different environments or different regions. And the best part is that you can adjust this workflow to fit with whatever your company's workflow is, whether it be different cloud providers, cluster, namespace, availability zone. It's super flexible. Again, in our case, we'll roll out just to our dev environment and don't look away, this is going to go really quick. I click deploy and there it goes. Within 2 seconds, I've updated all my pipelines running in my dev environment to dual ship the network logs to Datadog and S3. 
Once I've monitored my changes, I can roll out this to all my other environments and upscale over time. So a process that would have normally taken me a couple of hours, days or sometimes even weeks, took just a couple of minutes with Observability Pipelines. As you've seen just now, observability pipelines helps you take complete control over all your observability data within your own infrastructure. And with our new control plane, you can build and edit pipeline configurations and deploy them, selectively roll back and manage all your pipelines running in your own infrastructure all through the Datadog UI. To get early access to this feature, you can sign up using this link up here. And to learn more about other use cases around Observability Pipelines, you can find us at the demo booth or join our platform-level talks at 3:30 this afternoon. And with that, I'll pass this over to Irene. It's been a pleasure.
Irene Kors
executiveThank you, everyone. Hello, everyone. My name is Irene Kors and I'm a product manager for Datadog APM. We've come a long way since we first launched our application performance monitoring in 2017. In fact, many of you tell us that it is now one of your favorite products here at Datadog. But if you're a large organization with multiple development teams, instrumenting APM can be time-consuming. Here are some of the steps you would have to go through. Installing the Datadog agent, installing APM client libraries for each programming language, then checking if you have permissions to instrument each service, and if not, finding the right team at Dog and asking them to instrument. And finally, repeating those last 2 steps for each one of your services. And if you forget to instrument at least 1 service, you will have an incomplete view of your traces. But guess what? Today, we are changing that, introducing single-step instrumentation. That's right. Now a single developer or SRE can instrument APM across your entire organization in just minutes. All of the services on your host will be automatically instrumented with the installation of a Datadog agent. Let me show you how it works in the demo. Let's say I want to enable APM on my new infrastructure. So I go to the agent installation screen. I copy the agent installation command where the flag to instrument APM is already enabled. Now I paste and run my command, restart my services and watch them appear in the service catalog. From here, I simply click on each service, go to traces and watch the traces flowing. If you're an existing customer, the experience is exactly the same. The same command will enable your applications across all of your hosts with the installation of the Datadog agent. And that's it, shortest demo ever, right? Well, it's definitely the shortest one you will see here today. And that's exactly the point. 
Single-step APM instrumentation saves you valuable time by letting you enable APM across your entire org in just minutes. But that's not all. You can now save even more time by modifying APM configurations directly from the Datadog UI. I will use APM sampling rate as an example. Let's say I have a service that produces a large volume of trace data, I'm sure many of you can relate to that. So I use APM sampling rate to ingest only 10% of my traces. Well, that's maybe fine when everything is working as expected. But now I have an outage, and I need all the data I can get as quickly as I can. So I want to change my sampling rate to 100% temporarily, just for a few hours. Normally, I would have to log in to the host, find the right configuration file, modify that file, and then restart my service for the changes to take effect. Well, now I can do it directly from the UI. And the best part, I don't need to restart my services. My changes will be applied instantly at run time. So you just saw how our latest APM innovations help you save valuable time. But now that you've actually enabled and configured APM, you want to use the full power of distributed tracing. Let's see how you can do that. Datadog APM helps you monitor and troubleshoot your services. It gives you key insights such as throughput and direction of -- duration of your request, latency and the number of failures. But what about the upstream impact of those failures? How do you know exactly which endpoints have been affected or what end users have been impacted? You need to get this information quickly, maybe to send it to your support engineers or to contact your business users as a part of your SLAs. But this research can take hours. Well, today, I'm super excited to introduce Trace Queries, advanced APM querying and aggregation capabilities. Trace Queries help you quickly understand the impact of a back-end error on your business. Without Trace Queries, it can take hours. 
With it, just a few simple clicks. Let me show you how it works. Let's say, I operate an e-commerce platform where merchants sell various goods. Every time my payment workflow fails, it triggers a SEV-1 and I'm required to quickly contact my engineers and affected merchants. So I see that I'm getting multiple errors here in my authentication service. I want to understand if any of these errors have an effect on my business. To do that, I want to isolate the slice of website traffic that goes through the checkout flow and finishes with the payment. So first, I'm just going to isolate all of the checkout operations by adding a checkout resource string to my query. Okay. Now I'm going to drill down even further to only the ones that finished with the payment. This is my payment service and that is calling this third-party API. So I'm going to filter by that API. And here you go. I just filtered all of the traces for the checkout operations affected by my authentication error that resulted in a payment failure. Well, now I actually want to contact the affected merchants. So I can aggregate the results to find exactly who has been affected and see the number of times they've gotten the error. And that's it. All that's left for me to do is reach out to them. As you can see, Trace Queries help you quickly isolate the slice of the affected infrastructure and find the impacted end users in just a few clicks. Datadog APM simplifies the life of developers and enhances their productivity. And today, we have taken it one step further by reducing the time it takes to instrument APM using single-step APM instrumentation, introducing the ability to change APM tracing configurations directly from the UI without the need to restart your services and finally, adding powerful querying and aggregating capabilities for distributed tracing, APM Trace Queries. You can learn more about APM and our latest innovations by going to this link. 
You can also check out our demo booth and visit our theater session later today at 5:00 p.m. And now I want to turn it over to Wissal, who's going to tell you all about error tracking.
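Editor's illustration: the sampling-rate change described in the demo above (ingesting 10% of traces, then raising the rate to 100% during an outage without a restart) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not Datadog's implementation: the `TraceSampler` class, its `set_rate` method and the hash-based bucketing are all hypothetical and chosen only to show the idea of a rate that can be changed while the process keeps running.

```python
import hashlib


class TraceSampler:
    """Hypothetical head-based sampler whose rate can be changed at run time."""

    def __init__(self, rate: float) -> None:
        self.set_rate(rate)

    def set_rate(self, rate: float) -> None:
        # The rate is plain mutable state, so no restart is needed to change it.
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate

    def keep(self, trace_id: str) -> bool:
        # Hash the trace ID into [0, 1) so the same trace always gets the
        # same decision for a given rate.
        digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        return bucket < self.rate


sampler = TraceSampler(rate=0.10)   # normal operation: keep roughly 10%
sampler.set_rate(1.0)               # during an outage: keep everything
```

Hashing the trace ID rather than drawing a random number is one possible design choice: every service that sees the same trace ID makes the same keep-or-drop decision.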
Wissal Lahjouji
executiveHello, everyone. Thank you, Irene. Feels amazing to be here. I'm going to start with that. My name is Wissal Lahjouji, and I lead product and op themes at Datadog. All right. So Irene's demo just covered how Trace Queries can help you assess the impact of errors on your business all the way to your end users. Now let's talk about those errors themselves and what it takes to resolve them. So as you all know, and I won't belabor the point, error resolution can be pretty daunting. And sure, it can also be exhilarating and sometimes even gratifying. You know that feeling when you solve a really complex problem or put out a large fire? But it doesn't feel good all the time, right? It doesn't feel good when it's all you do. No one wants to be firefighting 24/7. Not only that, but while your support team is on hold and your PM might be hovering over you, you still have to do a bunch of things, right? So first, you have to understand this error. And this is easier said than done. You probably go on Google, Stack Overflow. You probably ask some of your colleagues, you name it. Second, you have to actually fix the error and this takes some time as well, right? And third, last but not least, you have to prevent it. You have to make sure this error does not happen again, and this involves writing some test cases. Now wouldn't it be great if there was something out there that could just make this process a little bit easier for me? Something that could assist me in my error resolution journey? Well, today, I am thrilled to announce that we at Datadog built just that. Introducing Datadog's Error Tracking Assistant. An AI-powered engine built to demystify errors and redefine how you handle them. Using generative AI, Error Tracking Assistant provides you with a detailed explanation of the error, a swift code fix ready to deploy and a test case so that we can make sure that error never happens again. Does that sound too good to be true? 
Well, don't worry, I got a demo for you. All right. Let's jump right in. So here I am in Datadog's Error Tracking Solution. This landing page automatically groups all my errors, whether back end or front end, takes them from RUM, APM and logs and groups them right here into a digestible set of issues. Now for the purposes of this demo, let's say that I own the e-mail API service and that I want to focus on my trace errors over the past hour. What's great here is that this list is updated and populated automatically. I don't have to do anything. And it's already sorted by the number of errors. This is great. This makes my prioritization process so much easier. At a glance, I already know which problems to focus on without having to sift through massive amounts of data. And the best thing, I didn't have to do anything to get this. Now catch this. Because I've instrumented my code with single-step APM instrumentation that Irene just presented us with, I get this out of the box. I get error tracking out of the box, I don't have to do a single thing. All right. Let's jump in. Let me go into one of these errors. I'm going to go into this index out of range error. So in this side panel, I can see when the error was first and last seen, and I can also see a distribution over the past hour. I can also select different time periods, but for the purposes of this, I'm selecting past hour. Scrolling down, I can see the stack trace pinpointing the exact part of the code where the error originated. I can also see the source code. And what's even more interesting here, the executional context. But let me not get too ahead of myself. Let me scroll back up and see how error tracking can help me understand and resolve this issue. All right. So when I click on generate test and fix, the Error Tracking Assistant consumes the error and generates a concise explanation of the issue. 
Here, it's telling me that I'm attempting to delete an element from the list while iterating over it, a very common Python bug. So this explanation is great. But what about that actual fix and test? Well, remember that executional context from earlier that I got super excited about? Yes, this one right here. The Error Tracking Assistant takes all the values, methods and parameters that the code had during its execution and together with the source code, it feeds it into the AI model so that it can generate a test case and a fix that I can deploy. By using this runtime data, Datadog is able to provide much more accurate fix and test recommendations that would not be possible otherwise. Okay. So let's go test this test case, shall we? For that, I'm going to open it in my IDE using the VS Code extension. And I'm going to try to reproduce this error. Here, I can see the diff view of the added test case. I'm going to go ahead and accept it and rerun my test. All right. Test fails, which is expected, I have not fixed the error yet. For that, I'm going to head back into the Error Tracking Assistant and take a look at that code fix that it showed me. Okay. I can see the diff. Now, fix looks good. So I'm just going to go ahead and go back to the IDE. Great. Okay. So I'm going to accept the fix, I'm going to rerun my test and bam, it's done. The tests have passed. All I have to do, all that's left to do for me here is commit and deploy. The errors that we saw earlier will stop populating in my issue list in Datadog, and I'll be able to mark this issue as resolved. I didn't have to ask a single question, I didn't have to spend hours scouring the Internet for answers, and I didn't have to bother a single co-worker. Pretty great, no? So how does this all work under the hood? Let's demystify the demystifier, shall we? So as Kai mentioned earlier, this is not magic. In fact, it took us some trial and error to get here, no pun intended. 
What we realized when we were experimenting with LLM prompt engineering was that feeding the errors alone to our AI model is simply not enough. Errors are too ambiguous. We needed more. So what made this all work? And I hinted at that earlier, the runtime state. That's why I was so excited about that executional context. This is what makes it all work. This is the secret sauce. The runtime data from dynamic instrumentation, including the variables, is what makes it all work. And we, at Datadog, can easily do that using our dynamic instrumentation platform that Evgeni announced last year at DASH. Pretty great, right? All right. Let's recap. So to recap, today, we saw how the Error Tracking Assistant can save you hours, which can easily rack up to days or even weeks, by explaining the errors that you're trying to resolve, actually suggesting a code fix and a test case to make sure it never happens again. Now perfect code does not exist. We all know this. Unless you freeze it in time, perfect code does not exist. And that's okay. That is okay. With error tracking, you can help your teams get out of firefighting mode and get back to what matters, building products for your customers that they love faster and with less interruptions. Error Tracking Assistant is available today in private beta as part of our Error Tracking product suite. You can sign up today using this link right here. All right. Thank you so much. And now back to Alexis.
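Editor's illustration: the bug class named in the demo above, deleting an element from a list while iterating over it and hitting an index out of range error, can be reproduced with a minimal Python sketch. The function names and sample addresses are hypothetical and are not taken from the code shown on stage; the sketch only shows the general shape of such a bug, one possible fix, and a regression test of the kind the talk describes.

```python
def remove_bounced_buggy(addresses, bounced):
    """Buggy: deletes from the list while looping over its original length."""
    for i in range(len(addresses)):
        # Once an element has been deleted the list is shorter than the
        # range, so a later index raises IndexError.
        if addresses[i] in bounced:
            del addresses[i]
    return addresses


def remove_bounced_fixed(addresses, bounced):
    """Fixed: build a new list instead of mutating during iteration."""
    return [a for a in addresses if a not in bounced]


def test_remove_bounced_keeps_other_addresses():
    # Regression test: fails against the buggy version, passes against the fix.
    result = remove_bounced_fixed(["a@x.test", "b@x.test", "c@x.test"], {"a@x.test"})
    assert result == ["b@x.test", "c@x.test"]
```

Running the same input through `remove_bounced_buggy` raises `IndexError`, which matches the test-fails-then-passes sequence walked through in the demo.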
Alexis Le-Quoc
executiveThanks, Wissal. That was great. So as you can see, we have continued to invest steadily in our core observability platform. We've just shown you a brand-new tier of logs, called Flex. It's a sweet spot between index and archive; Observability Pipelines, which make it really easy, safe and quick to manage petabytes of observability data; single-step APM instrumentation, it's now very simple to get started with APM; APM Trace Queries. It's a powerful new way to connect tracing with what your users experience and the impact on your business; and lastly, Error Tracking Assistant, which you've just seen. And it goes all the way to suggesting a fix, informed by the entirety of data you send us. That includes runtime data. Now we can't wait to get these new products into your hands and see what you do with them. Now in the past few years, alongside observability, we have been also heavily investing into security because we see observability and security as 2 sides of the same coin. And to tell you more about all that's new in security, I'm going to hand it over to Rishi.
Rishi Ilangomaran
executiveHi, everyone. My name is Rishi, I'm a Product Manager on Cloud Security. Let's talk about the latest in security for DevOps. Now, think about the last audit or security breach that your company had. How many different teams were involved to resolve it? Too many, right? And on top of that, how many tools did you have to look at to even get a sense of what's going on? Well, today, Datadog Cloud Security Management already surfaces threats and misconfigurations in your infrastructure. We're excited to announce that now we're also monitoring for vulnerabilities and identity risks. And that means developers have more information than ever to investigate what's important, all within Datadog. But when you're the one on the line to fix these issues, you're going to have to sift through hundreds of different insights: identities, vulnerabilities, misconfigurations and threats and everything is marked as critical. How do you know where to start? And that's where the Security Inbox comes in. Now Datadog is correlating all of these insights into just one simple security issue. These issues are prioritized for you in the inbox, so you always know what to tackle first. Let's see this in action. So when I open up the Datadog Cloud Security Management overview, I know the most important issue to fix will be right at the top of my inbox. We can see here, there's a critical vulnerability found on a host that also has public Internet access and administrative privileges, affecting 21 resources, that means I need to investigate this right away. This was brought to the top of my inbox because of the combination of these 3 security risks. But without the inbox, I would have to investigate the vulnerability, the public Internet access and the privileged role for 21 resources. But instead, 63 insights have been consolidated to just 1 issue. I can see here, this host has an outdated Redis package that an attacker could use to remotely execute code in my infrastructure. 
This comes from the new capability we just announced, vulnerability scanning for hosts and containers. With this risk visibility throughout my infrastructure, my hosts are more secure than ever. But the Internet access also bumped this to the top of my inbox, right? That means an external attacker could easily exploit the vulnerability we just saw. And on top of that, there's an associated IAM role with privileged access. An attacker with access to this EC2 could assume this role and do a considerable amount of damage. And this comes from another new capability, identity risk assessment. So now that I know how important it is to investigate this, let's jump in. I can see here, we've had this exposure for over 3 months, and this comes from version 5 of the Redis Tools library. And all I need to do is upgrade. I can do this really easily by creating a Jira ticket, assigning it to my team without losing any context. And now my hosts are more secure than ever. So we've just seen the power of a unified security inbox, not only bringing together siloed tools to correlate risks, threats and vulnerabilities, but also to bring together security and DevOps folks to fix the highest priority issues in your infrastructure. But we also know infrastructure and applications are inherently connected in the cloud, and so are the risks and attacks. Last year, with application security, we announced that we can block IPs as well as scan for vulnerabilities in open source library dependencies. But now Datadog can detect vulnerabilities in your own custom application code. That means you're alerted to vulnerabilities in your code in production, in real time. So I've just got a notification for a SQL injection in one of my applications. Let's go ahead and take a look. Here, I can see the SQL injection is happening on the specific service and I have pointers to the file, method and line. 
With the GitHub integration, I can see the commit where we first saw this and the line in the source code, line 38. Now we're doing this by looking at the data flow through your application. I can see here there's a user input that can reach a sensitive function without being sanitized. This approach doesn't need security testing or simulated attacks. Instead, we're just looking at the web traffic through the routes of your application, just like APM. I can go ahead and remediate this with 1 click, with details tailored to my stack. I need to see who's on call, which I can do in the service catalog. It's Sam. I'm going to send Sam a quick slack. But in the meantime, I'm going to make sure my service is protected. And I'm going to do this with another new capability, proactive blocking with the Datadog In-App WAF. That's right. We're not only detecting attacks, but we're also proactively blocking them for you with a built-in web application firewall. This recommended policy contains hundreds of rules, not just one-off IP blocking, including for SQL injection. And with 1 click, I can easily turn on blocking mode for the affected service. And that's it. Now my application is secure while my engineers work on a more permanent fix. So we just saw a number of new announcements. In cloud security, we're now monitoring for infrastructure vulnerabilities and identity risks, both of which are available in beta today, brought together by the unified security inbox. At the application layer, we're now detecting vulnerabilities in your own custom code and proactively blocking attackers with a built-in WAF. To learn more about these features or sign up, visit this link above or come by the Expo Hall. And now to tell us a bit more about the deeper investigative capabilities of our security products, welcome to the stage, Partha.
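Editor's illustration: the vulnerability pattern described above, user input reaching a sensitive function without being sanitized, can be shown with a minimal Python sketch using the standard-library `sqlite3` module. The table, function names and payload are hypothetical and are not taken from the demo; the sketch only contrasts a query built by string interpolation with a parameterized query.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two sample users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "alice@example.com"), ("bob", "bob@example.com")],
    )
    return conn


def find_user_vulnerable(conn: sqlite3.Connection, name: str):
    # Unsanitized user input is interpolated straight into the SQL text.
    query = f"SELECT email FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, name: str):
    # The driver binds the value as data, so it cannot change the query.
    return conn.execute("SELECT email FROM users WHERE name = ?", (name,)).fetchall()


conn = make_db()
payload = "nobody' OR '1'='1"
leaked = find_user_vulnerable(conn, payload)   # every row comes back
blocked = find_user_safe(conn, payload)        # no user has that literal name
```

With the interpolated query the payload rewrites the `WHERE` clause and returns every user; with the parameterized query the same payload is treated as an ordinary name and matches nothing.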
Partha Naidu
executiveThanks, Rishi. Hello, everyone. My name is Partha, and I'm a Product Manager here at Datadog. It is great to be here with you today. As you all know, the cloud is accelerating faster than ever. And with that comes an expanding new surface for attackers to target. Nearly half of all data breaches now happen in the cloud. And this is because the cloud is still nascent in terms of security in many cases. And if you didn't know, research has shown that attackers can exist in your cloud environments for 277 days on average before being detected. They can do this by leveraging zero-day or emerging vulnerabilities that didn't previously have threat detections, similar to how attackers exploited Log4j. So even if we were able to detect attacker behavior today, a deep investigation going back several months is required to fully understand the scope of the attack and impact to your business. But investigations can be challenging for a few reasons. Retention of critical security logs for long periods of time can be cost prohibitive and analyzing large volumes of log data can be mentally taxing and lead to analyst burnout. All of this results in slow investigations and slow response times, increasing the exposure and cost to your business. And I personally dealt with these challenges in my previous life as a cyber incident responder, where I spent countless hours and sleepless nights combing through terabytes of log data. But what if there was a better way to investigate security threats? One that didn't require us to manually run queries or map out attacker behavior on whiteboards. Introducing historical investigations with the Datadog Cloud SIEM. With the Cloud SIEM, you are now able to visually explore and search your logs, accelerating your investigation and accelerating your response. Let me show you how. Let's imagine that I'm a security analyst, and I just received an alert in slack. 
And as a security analyst, it's my responsibility to investigate and respond to all important alerts. This alert here is indicating that a user was potentially trying to exfiltrate data. And from the description, I can see that I have to conduct a user behavior investigation. And I know I can do that with Cloud SIEM investigator, so let me start there. For those of you that don't know, the Cloud SIEM investigator is a powerful analytics and investigation tool that allows you to visualize your log data. Here, I can see the user David Parker's activity during the time of the signal. And hovering over the user node, I can see that this user tried to modify this EC2 instance to potentially exfiltrate data. What's more, I can also see that this user failed to do so, as is evident in this graph, which means we have some good protections in place. But I still need to do a little bit more to understand if this is a pattern of behavior or not. So I'm going to go ahead and search for failed events and set the search period to 1 year, and boom, I can now see that this user has a pattern of trying and failing to access databases like customer information, payments, employee data and more. And see, this is why it's amazing. In just a few clicks, I was able to conduct a deep historical investigation in an easily understandable manner. And what's more, I can easily communicate this to other members of my team involved in the response. Furthermore, I can also take a look at related signal activity for this user in this time period. And from this, I can see there's a few signals. One that sticks out to me, though, is that this user was trying to elevate their privileges by changing their AWS log-in profile to get around the protections we have in place. Again, this is a little bit suspicious. I think I now have enough to determine that I need to take action on this user, and I can do that right here with Datadog workflows.
You've seen workflows in action already and their value extends to security as well. So let's go back and open up the original signal that we had. I can go ahead and click run workflow from the signal side panel and select a workflow. I want to suspend this user's credentials, and I can do that with Okta. The workflow is intelligent enough to pull the important signal context immediately. And taking a closer look, I can see that it's running and before I know it, it executes successfully, and that's it. I do not have to go to another product, another page. All within the same UI, I was able to quickly investigate and respond to this critical alert. To recap, the Datadog Cloud SIEM allows you to visually explore your logs, accelerating your investigation, and by integrating with workflows, it accelerates your response. Sign up for a free trial today and come check out our demo booth to learn more. Thank you. And now I am so excited to welcome Amit to the stage.
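The historical search from the demo (one user, failed events only, a one-year window) can be sketched in Python as below. This only illustrates the query logic and is not the Cloud SIEM API: the event fields `user`, `outcome`, `resource` and `timestamp`, the function name `failed_access_pattern`, and the sample records are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta


def failed_access_pattern(events, user, now, lookback_days=365):
    """Count failed events per resource for one user inside the lookback window.

    `events` is a list of dicts with the hypothetical keys
    user, outcome, resource and timestamp (a datetime).
    """
    cutoff = now - timedelta(days=lookback_days)
    counts = Counter()
    for event in events:
        if (
            event["user"] == user
            and event["outcome"] == "failure"
            and cutoff <= event["timestamp"] <= now
        ):
            counts[event["resource"]] += 1
    return counts


# Illustrative records only; these are not real log data.
NOW = datetime(2023, 8, 3)
EVENTS = [
    {"user": "david.parker", "outcome": "failure",
     "resource": "customer-information", "timestamp": NOW - timedelta(days=300)},
    {"user": "david.parker", "outcome": "failure",
     "resource": "payments", "timestamp": NOW - timedelta(days=120)},
    {"user": "david.parker", "outcome": "failure",
     "resource": "payments", "timestamp": NOW - timedelta(days=10)},
    {"user": "david.parker", "outcome": "success",
     "resource": "employee-data", "timestamp": NOW - timedelta(days=5)},
    # Older than one year, so it falls outside the window.
    {"user": "david.parker", "outcome": "failure",
     "resource": "employee-data", "timestamp": NOW - timedelta(days=400)},
    {"user": "other.user", "outcome": "failure",
     "resource": "payments", "timestamp": NOW - timedelta(days=1)},
]

PATTERN = failed_access_pattern(EVENTS, "david.parker", NOW)
```

Widening `lookback_days` is what turns a single alert into a pattern: the same filter over a longer window shows whether the failed attempts repeat across resources.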
Amit Agarwal
executiveThank you, Partha. Security has quickly become a key offering at Datadog. And our mission is to turn a problem domain that's full of noisy unactionable signals into fewer consolidated drill downs that help engineers get to the root cause of problems faster and fix them from right inside Datadog. And you saw examples of this in security inbox, vulnerability detection and Cloud SIEM investigator. Now shifting left a little bit. Let's show you what we've been up to, to make developers' jobs even easier. To kick it off, here is Bryan Lee who's going to get into how we help you get quality code to production faster. Bryan?
Bryan Lee
executiveLast year at DASH, we talked about enhancing the developer experience, not only through shift left observability, but by beginning to take action on those signals, reducing testing times for customers by up to 95% with Intelligent Test Runner. I'm happy to say Intelligent Test Runner is now generally available and works across 5 languages. At Datadog, our JavaScript monorepo with over 330 developers is already saving over 6,500 hours per month by running only the tests that are relevant and skipping the rest. That's 19-plus hours saved per developer per month. However, we know testing is just one part of the developer workflow. Too often, insecure, unreliable and slow code makes it all the way to production where it runs, waiting for ops teams, customers or even bad actors to come across. Organizations make efforts towards detection and enforcement but lack the tools to easily turn these signals into quality checks. Today, I want to talk about Datadog's next steps in shift left observability and providing the tools to make it actionable. Introducing Datadog Static Analysis. Static analysis surfaces code quality and security issues to your developers earlier in the developer workflow. We help teams detect bad code before it reaches customers. But surfacing issues is only half the story. What if you could easily and programmatically prevent bad code from being deployed in the first place? That is why I'm excited to also introduce Quality Gates. Quality Gates allows you to gate your workflows using static analysis results and ultimately using any signal in Datadog. We give your teams fine-grained control over what code makes it to production and what doesn't. Let's jump in and see what it all looks like together. Here, I'm in Datadog Static Analysis, where I can see all of the violations surfaced across the organization for Python. I see violations around security, best practices and code style.
But right now, my highest priority is security since I know this code made it to production. I want to prevent the insecure code from being deployed in the future by using quality gates. I can see existing rules created by others, but I'll create a new rule. To start, you can create rules on test and static analysis data; I'll choose static analysis. And I want this rule to be evaluated over every workflow in the organization. So I'm going to leave it set to always evaluate. I could always reduce the scope to a specific repository, branch or even a custom scope like team name. This rule will fail anytime an error status violation is detected. Now I want to focus specifically on security. But I can always come back and create additional rules that target other categories like performance and code quality as well. I want this rule to block my workflow if it fails, and I'll name it block any security violations. I'll click save, and I get confirmation it's been created. Now let's see how static analysis and my newly created quality gate rule will work in tandem to keep my code secure. I have a simple service here that retrieves products from a database. I have a commit that adds a function and endpoint to retrieve a single product by ID. I'll go ahead and open a pull request. And while the workflow is running, I'm going to talk about our GitHub Actions YAML file and describe what's happening behind the scenes. Static analysis is running in CI, analyzing my code, reporting any violations it detects back to Datadog. Our quality gate CLI command looks across all the rules defined in Datadog to determine which rules should be evaluated for what's currently being built and tested. Datadog returns all rule statuses and if any blocking rules have failed, the workflow will fail as well. Speaking of which, looks like my workflow failed due to the block any security violations rule I created earlier. I can see the reason behind the failure.
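The gate behavior described here (pick the rules whose scope matches the current build, fail a rule when an error-status violation in its category is present, and fail the workflow if any blocking rule failed) can be sketched as follows. This is a toy model under stated assumptions, not Datadog's implementation or CLI: the `Rule` and `Violation` classes, their fields, and the `evaluate` function are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    category: str                               # e.g. "security"
    blocking: bool = True                       # a failed blocking rule fails the workflow
    scope: dict = field(default_factory=dict)   # empty scope means "always evaluate"


@dataclass
class Violation:
    category: str
    status: str                                 # "error" or "warning"


def rule_applies(rule, context):
    """A rule applies when every key in its scope matches the build context."""
    return all(context.get(key) == value for key, value in rule.scope.items())


def evaluate(rules, violations, context):
    """Return ({rule name: "passed" | "failed"}, workflow_blocked)."""
    statuses = {}
    blocked = False
    for rule in rules:
        if not rule_applies(rule, context):
            continue  # out-of-scope rules are not evaluated at all
        failed = any(
            v.category == rule.category and v.status == "error"
            for v in violations
        )
        statuses[rule.name] = "failed" if failed else "passed"
        if failed and rule.blocking:
            blocked = True
    return statuses, blocked


# Hypothetical build context and rules for illustration.
CONTEXT = {"repository": "shopist", "branch": "main", "team": "checkout"}
RULES = [
    Rule(name="block any security violations", category="security"),
    Rule(name="warn on code style", category="code style",
         blocking=False, scope={"repository": "another-repo"}),
]

BEFORE_FIX = evaluate(
    RULES,
    [Violation("security", "error"), Violation("code style", "warning")],
    CONTEXT,
)
AFTER_FIX = evaluate(RULES, [Violation("code style", "warning")], CONTEXT)
```

Keeping `blocking` separate from the pass or fail status means a rule can report a failure without stopping the workflow, which matches the distinction the talk draws between rule statuses and blocking rules.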
And if I want more details, I can click on matching events in Datadog. This takes me back to Static Analysis in Datadog, filtered down to the exact criteria that caused my rule to fail. It looks like I introduced a security violation by using an f-string in a SQL query. I see the exact line of code responsible for the violation as well as a full description of the issue. I also have a tab with fixes. We're using AI to generate not only a fix in plain English, but a fix in code that I can just copy and paste. So I'll copy the fix and make the code change. I'll name this commit fix SQL injection. And with this commit, my workflow begins rerunning. And perfect, everything passed. In just a few minutes, we were able to show just how easy it is to begin detecting static analysis violations, enforce organization-wide rules with quality gates and even fix the violation with the help of AI, and this is just the beginning. Static analysis will fully integrate into the developer workflow. We'll surface violations, not just in CI, but in code reviews and directly in the IDE. We'll run the exact same rules locally and provide developers with real-time feedback as they code. Quality Gates will expand to work with any signal in Datadog, not just from Datadog's own products, but any data from our vast ecosystem of over 600 integrations. Imagine quickly creating a quality gate rule on synthetic testing data, a Datadog monitor or even metrics from an integration like AWS. We're excited to get these products into your hands to help you ship better code, faster. Static analysis for Python and quality gates are both in private beta and are available through this URL. You can learn more by stopping at the CI visibility demo booth and have any questions answered. Thank you. And I'd like to hand it off to Jamie, who can tell you more about what Datadog is doing for the mobile experience.
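The violation in the demo, an f-string used to build a SQL query, and the usual fix, a parameterized query, look roughly like this. The demo's actual service code and the AI-generated fix are not shown in the talk, so the table, the function names and the use of Python's standard `sqlite3` module are assumptions for illustration.

```python
import sqlite3


def make_db():
    """Build a small in-memory products table (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO products (id, name) VALUES (?, ?)",
        [(1, "chair"), (2, "lamp")],
    )
    return conn


def get_product_unsafe(conn, product_id):
    # Vulnerable: user input is interpolated straight into the SQL text,
    # so input such as "1 OR 1=1" changes the meaning of the query.
    query = f"SELECT id, name FROM products WHERE id = {product_id}"
    return conn.execute(query).fetchall()


def get_product_safe(conn, product_id):
    # Safer: the value is passed as a bound parameter, so the driver
    # treats it as data and never parses it as SQL.
    query = "SELECT id, name FROM products WHERE id = ?"
    return conn.execute(query, (product_id,)).fetchall()


CONN = make_db()
MALICIOUS_INPUT = "1 OR 1=1"
LEAKED_ROWS = get_product_unsafe(CONN, MALICIOUS_INPUT)
SAFE_ROWS = get_product_safe(CONN, MALICIOUS_INPUT)
```

With the malicious input, the unsafe version returns every row in the table, while the parameterized version compares the id column against the literal string and returns nothing.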
Jamie Milstein
executiveHow's everybody doing? All right. I mean I would have liked a little more energy after all these exciting announcements, but I'll take what I can get. I'm Jamie, and I'm a product manager here. Now I'm not actually here today to talk about what it's like to use mobile apps. Many of you are probably on them right now, and we're all end users of them. Today though, I'd like to talk about the person on the other end of that app, the mobile developer. Now you mobile developers have a tough job. You deal with critical users who have high expectations, but that's only one hurdle. There's many external factors that make that job even harder. Think about it. How many of us actually subscribe to automatic updates? This means mobile developers have to manage many different versions of their apps, including deprecated ones. And you're not just managing code versions, users can be on any device they want, whether it's an old iPhone 3 or the latest Google Pixel. Say you spot a bug, you're ready to ship a fix. It's not going to happen right away. You have to wait for the app stores to approve your code before you can get it in the hands of your end users. And last but not least, siloed teams. Now it's no secret. For my DevOps friends in the crowd, you guys are prioritized when it comes to modern tooling and observability. Problem is mobile teams often sit far from DevOps teams, and they don't always get the first priority for tools. But it's not just mobile developers that want to understand the customer experience. It's product managers who want to understand what users did, support engineers who want to reproduce bugs. The problem is, there has never been a tool in the market that brings the entire mobile ecosystem into one tool, until now. Today, Datadog is excited to give mobile teams end-to-end observability in Datadog. We began our journey a few years ago with Mobile RUM.
Today, I am beyond thrilled to announce the 2 newest products to Datadog, Mobile App Testing and Mobile Session Replay. Yes? Let's see how they all work. Now we've seen Shopist a few times now. So let's look at a demo and actually, we can go back one slide. So what you see here is RUM's search explorer. I can look up the most recent session for any of my users. So let's imagine that Taylor Dash left us a poor app store review, but didn't give us much detail as to why. I can go to look up her sessions in plain English. And quickly, I can see that her recent sessions all had errors. Digging in a bit further, I want to understand how these errors occurred and why. And I could see here, they're all occurring on the checkout page. It could be problematic for my business. But what I really want to understand is, are these problems unique to Taylor, or are other users experiencing them? For that, I want to head over to my performance dashboard that shows me my entire iOS app at scale. Now some performance metrics I want to focus on first are my mobile vitals. And I can see here that app startup time and memory usage are quite high. But what I really want to look at now are the crashes and errors. Right away, I can actually see that the screen with the highest error rate is my checkout page. This is confirming my hypothesis that Taylor Dash is not the only user affected. There's many users affected by this problem. What I want to do next is understand what they actually did when they encountered this error. And for that, let me look at a session replay. Now with Mobile Session Replay, I can watch the end-to-end user interaction on my native mobile apps, every tap, every swipe, every scroll. And here, I can actually see Taylor Dash going through her end-to-end journey of adding items to the cart and making it to the checkout page.
Now if you look closely, you'll notice in the beginning that Shopist was actually flashing a discount code to Taylor for DASH 2023, which she then goes to apply. She actually picks a delivery date and tries to check out, going through her full workflow. And it's at that moment that Taylor gets an error. Now session replay showed me this error as if I was standing over her shoulder. But I want to understand why this error occurred and how to make sure it never occurs again. When I drill into my performance side panel, I can see all of the errors and understand a bit more about this one session. I can see my mobile vitals, like CPU ticks, are quite high. And I can again see the errors that I watched in the session live from this view. But what I want to focus on is the trace. This gives me end-to-end visibility into the user journey from the client-side request all the way down to the back end. I could see every single service called upon. And it's in this view I have figured out the problem, that the discount code failed due to a failed API call in the back end. This is the first time I'm able to see the problem from the user's perspective in session replay and drill it all the way down to the back end. Now that we watched this problem, I want to make sure it never happens again. So for that, I'm going to set up a synthetic test. With Datadog synthetics, I can test my endpoints and my workflows all around the world. And now with mobile app testing, I can apply this type of proactive testing to my native mobile apps. So I'm going to run this test on my Shopist iOS app and I'm going to give it just a name, we'll call it the checkout flow. So we're going to actually go through the same flow that we watched Taylor go through. I can run this on any device. So again, that old iPhone in your basement or the latest Google Pixel. And these are run on real devices, not emulators, ensuring the most accurate testing of real-world scenarios.
I can run these in my CI pipelines or in my production environment, which really ensures that I'm testing throughout the build process. Now that I've set up the foundation for this test, let's actually build it. So with mobile app testing, I can click through my app with the phone on the side, just like a user, every tap, every swipe, every scroll. In fact, I can even add the same discount code that failed earlier, the DASH 2023. Now if you've used Datadog browser tests, these were actually built with the same multi-element locator technology as our web browser test. So if you update the UI, no need to come in and rerecord the test, we'll update it for you. We've built the test. Let's look at some test results. What you see here would be a successful run or a failed run. Now I'm actually looking at a successful run, and I could see a screenshot of every single step execution. When I click in, I could see the status and the resource duration. Now what you were able to see is with mobile app testing, I'm able to build a true safety net for my native apps. Now once we've actually built this, we started rolling out fixes, I want to ensure that this is helping the business. And how do I do that? By building a funnel. With funnel analysis, I can track my North Star goal, conversion rate. I can build again the checkout flow we've seen time and time again and understand if conversion is healthy, indicating a successful business flow. Here, it looks pretty good, so I'm happy, and I can watch as it changes over time. Now today in this demo, you were able to see how Mobile RUM allows me to monitor performance trends at scale and dig into sessions at a granular level. With Mobile Session Replay, I was able to play back my end user's journey. Again, it's like I was standing over their shoulder. With Mobile App Testing, I have a true safety net for my mobile apps, ensuring I'm proactively testing them. 
And lastly, with Funnel Analysis, we track the conversion rate of our end-to-end workflows. I could not be more excited to get these products into your hands. Visit the link on screen to try our newest mobile products. Thank you. But that's not all. I have one more very exciting announcement for you all. You met Bits AI a few times now. We have the newest addition to Bits, synthetic test generation with Bits AI. Another good one. Let's actually see how it works and go right into a demo. So with the power of Bits AI, I can have Bits build me any test. I can say, Bits, build me a test on this endpoint or build a test on a given flow, but the real magic of Bits is the test I didn't know I needed to create. What if I didn't know that users go through a given flow or I didn't know that this was a popular flow in my application? In plain English I ask Bits, what synthetic test should I create? What Bits can then do is scan real user monitoring traffic, figure out what are the top flows that users are going through and curate recommendations for me. I can click into one of Bits' recommendations here. And as you can see, the test is actually built for me. I have all the recorded steps and I can see the test details. It is just that simple to get started with Bits. Now in this demo of Bits AI and synthetics, you saw a few things. You saw how anyone can contribute to building tests. Test creation is no longer left to the few, but instead the many. It also ensures that we're creating the right tests on the right flows. You don't need to hypothesize about what users might go through, understand what they actually do by letting Bits AI tell you. Visit the link on the screen to get started with Bits. And now back to Amit.
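The idea of deriving test candidates from real traffic, finding the flows users go through most often, can be illustrated with a simple frequency count over session view sequences. The talk does not describe how Bits AI does this internally, so this is a deliberately naive stand-in: the session format (an ordered list of view names), the `top_flows` function and the sample sessions are all assumptions.

```python
from collections import Counter


def top_flows(sessions, flow_length=3, limit=2):
    """Return the most frequent runs of `flow_length` consecutive views.

    `sessions` is a list of sessions, each an ordered list of view names
    (a hypothetical simplification of real user monitoring data).
    """
    counts = Counter()
    for views in sessions:
        # Slide a fixed-size window over the ordered views of one session.
        for start in range(len(views) - flow_length + 1):
            counts[tuple(views[start:start + flow_length])] += 1
    return counts.most_common(limit)


# Illustrative sessions only.
SESSIONS = [
    ["home", "product", "cart", "checkout"],
    ["home", "product", "cart", "checkout"],
    ["home", "search", "product", "cart"],
    ["product", "cart", "checkout"],
]

RECOMMENDED = top_flows(SESSIONS)
```

Each returned flow is a candidate for a recorded synthetic test, ranked by how many times real sessions actually followed it.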
Amit Agarwal
executiveThank you, Jamie. After that demo, even I might sign up for a trial. Mobile apps are now a key component of every business. And as you saw in Jamie's section, we at Datadog are working hard to make sure that observability of mobile apps is just as important a part of our platform. Also, you saw how we are infusing AI into every part of a developer's workflow to make us all more productive effortlessly. And lastly, static analysis and quality gates make it easier than ever to ship high-quality code. Now on to a different topic. When I was an engineer, my job was building new features and products. But with the cloud, engineers have become responsible for many, many new things. And one of these is sizing and provisioning just the right amount of resources so your applications are snappy for users while spending the least amount of money possible on cloud. We've been working hard on this problem at Datadog. And to kick off this section, here's Haïssam to talk about what we built for optimizing your containers. Haïssam?
Haissam Kaj
executiveThanks, Amit, and hi, everyone. My name is Haïssam, I lead the containers and Kubernetes engineering teams at Datadog, and I'm very happy to be here with you today. We've all had this experience with the nuance of capacity planning and the effort to get it right. For those of you using Kubernetes, I want you to think about that last time you were trying to size your pods correctly. Did you get it right the first try? Like us, many of you adopted Kubernetes for the promise of being able to more effectively utilize your resources and reduce waste. And yet, the #1 challenge that we hear from you and that we wrestle with at Datadog as well is bin packing. The most important step to achieve bin packing in Kubernetes is to rightsize the resource requests and limits of your pods, which is difficult. It requires estimating the resource usage of your services before actually running them in production, which is essentially a guessing game. Then once you deploy, you need to check how accurate your guess was and possibly adjust the values, either to eliminate waste, if you overprovisioned, or to give your deployments more capacity, if you underprovisioned, to avoid your pods getting throttled or killed. And lastly, you need to monitor and adjust these values continuously as applications and workloads change over time. To help you solve this problem, I'm excited to announce our new Kubernetes resource utilization feature. Let's have a look. This is the new resource utilization page. At a glance, I can identify the pods or any other Kubernetes resources that are over or underprovisioned for both CPU and memory. Now let's say, I'm an application developer, and I want to verify if my team's deployments are sized correctly. To do that, I can filter by my team's tag and sort by CPU idle. This is great because it helps me identify where I can make the biggest impact. Here, I have a deployment that is using a lot less CPU than it requested.
I see in the usage over request column that it sits below 40% utilization. And the CPU idle column tells me that the sum of unused CPU for this deployment is around 16 cores. So here we are in the overprovisioned use case. I've reserved too much CPU for my deployment, which prevents Kubernetes from packing pods more tightly, leading to low resource utilization in my cluster and higher cost. Now that I've seen this, I will try to fix it, but I also don't want to lower CPU request too much and risk underprovisioning, as if that happens, some of my pods could get evicted. So let's see how Datadog can help there. I can click on this deployment to see the combination of its CPU and memory data as well as the top list of pods. In this particular case, though, no single pod or group of pods stands out. So there is no problem of uneven load. So to continue my investigation, let's see if within these pods, I have any container that is overprovisioned. To do that, I compare the container's usage with its requests and limits and there it is. The service discovery sidecar was given a lot more resources than it actually uses. Here, I see that it's using about 100 millicores on aggregate, but we reserved 13 cores for it. So from there, I can quickly go and update my manifest to adjust the request, apply the change and instantly free up a lot of resources in the cluster for other applications to use or to be reclaimed by scaling down the cluster. Once this is done, I can go back to Datadog, where I filter on my Dev environment where I just deployed. Here, I see that the deployment went from yellow to green. CPU idle decreased from 16 cores to 2 cores and usage over request doubled, achieving an 80% utilization rate. What you just saw is how Datadog's Kubernetes resource utilization can help you rightsize your applications in Kubernetes and optimize bin packing.
It simplifies the guess and check game of setting resource requests and limits by telling you how close they are to actual usage. It helps you maintain optimal sizing over time by showing an always up-to-date state of how well your resource allocations track your pods' current usage. And lastly, it makes it easy to quantify waste, helping you make impactful cost-saving changes. Resource utilization is in public beta today, and it's included for all infrastructure monitoring customers. Please reach out or come see us at our booth if you have any questions. And now since Kubernetes is only one part of your infrastructure, here to talk about evaluating waste across your entire organization is Natasha. Thank you.
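The two columns used in the demo can be written as simple arithmetic. The sketch below assumes that CPU idle is the requested CPU minus the CPU actually used (the talk describes it as the sum of unused CPU) and that the utilization column is usage divided by request; the function names are hypothetical. The sidecar figures come from the demo. The 10-core request in the second example is an illustrative value chosen only because it is consistent with the 2 idle cores and 80% quoted after the fix; the talk does not state the requested total.

```python
def cpu_idle_millicores(requested_m, used_m):
    """Requested CPU that is not being used, in millicores (never negative)."""
    return max(requested_m - used_m, 0)


def usage_over_request_pct(requested_m, used_m):
    """CPU usage as a percentage of the CPU request."""
    if requested_m <= 0:
        raise ValueError("requested_m must be positive")
    return 100 * used_m / requested_m


# Sidecar from the demo: about 100 millicores used, 13 cores reserved.
SIDECAR_IDLE = cpu_idle_millicores(13_000, 100)
SIDECAR_PCT = usage_over_request_pct(13_000, 100)

# Illustrative post-fix deployment: 10 cores requested, 8 cores used.
AFTER_IDLE = cpu_idle_millicores(10_000, 8_000)
AFTER_PCT = usage_over_request_pct(10_000, 8_000)
```

Under these assumptions the sidecar alone accounts for roughly 12.9 idle cores at under 1% utilization, which is why lowering its request frees most of the deployment's unused CPU.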
Natasha Goel
executiveThanks, Haïssam. As all of you know, reducing waste and costs is top of mind for us all. In addition to performance, engineering teams are now considering costs as they build applications. But as Haïssam mentioned, it can be hard to properly provision resources and reduce waste, and optimizing your cloud environments is difficult. With hundreds of services, it's challenging to prioritize optimization opportunities with the most cost savings. And it's hard to predict if your optimization efforts will end up hurting performance. And lastly, optimizations aren't always actionable for your engineering teams. That's why today, we're excited to announce cloud cost recommendations. Recommendations take the wealth of Datadog's observability data and all of the resources in your cloud bill and combine them to help you stay efficient and reduce waste. So let's take a look at some recommendations in the product. As an SRE, I was tasked with finding cost savings opportunities for my organization. This recommendations page helps me identify where to start. At the top, I see potential daily and monthly savings. And here, I can find recommendations already sorted by potential cost savings. For example, I see this recommendation for unused EBS volumes. Let me actually filter into our Dev environment since those resources are safer to delete. This recommendation is considering historical usage data. So I know these resources haven't been used but are still incurring costs. And I might not typically have done anything about them. But Datadog workflows make it easy for me to automatically delete the volumes directly in Datadog. And I can even have an engineer approve the deletion. In addition to highlighting unused resources, this page automatically surfaces overprovisioned Kubernetes workloads. For example, I see this recommendation for unused CPU and memory resources in my order processing service. 
This recommendation takes the container utilization data that Haïssam just showed you, and turns it into a concrete cost recommendation. But since I save more money by focusing on the router service, let's actually take a look at that instead. This router service has Datadog continuous profiler installed. With continuous profiler's granular data about my CPU usage, I can feel confident that I understand the performance impact of downsizing my service. Right now, the service is requesting 16 cores per container, but we could actually reduce this to 5 cores and save almost $60,000 with just a 2% latency impact. To save even more money, we could downsize to 1 core but with more latency impact. So I'll actually create a case for a developer on our team to take a look. And now let's say, I'm the developer, and I just picked up the case. To start, I'll go ahead and create an experiment. Because continuous profiler has such deep visibility into my service, it knows which endpoints are being queried at any given moment. So it can show me how each specific endpoint in my service would be impacted by downsizing. This summary stats endpoint would be impacted the most, but it's actually not critical for our service. Since the messages endpoint, which is our most critical endpoint, won't be impacted as much, I can go ahead and downsize my containers and create the experiment. And after a couple of days of running the experiment, we achieved lower cost without sacrificing performance. So I'll close out the case and let my SRE know. Today, you saw how cloud cost recommendations help you optimize with confidence. Now you can prioritize savings opportunities with the most impact. And with Kubernetes resource utilization and continuous profiler data, you can optimize your costs while understanding the impact on performance. And lastly, because all of this is in Datadog, you can make it easy for your engineers to take action. Everything you saw today is available in private beta.
You can sign up using the link on your screen and stop by our demo booth and platform session later today to learn more. And now I'll pass it back to Amit.
Amit Agarwal
executiveThank you, Natasha. Resource and cloud cost recommendations are both available today. And we look forward to seeing how you use them in your optimization journey. Now we've covered a lot of ground today from AI, observability, security, developer experience, all the way to optimization across your services and resources. But that's not all. As usual, we have a lot more to show you. Join us in the Expo Hall for live demos, theater sessions and more throughout the day. And thank you very much for coming. Thank you.