Datadog, Inc. (DDOG) Earnings Call Transcript & Summary

June 26, 2024

NASDAQ US Information Technology Software conference_presentation 100 min

Earnings Call Speaker Segments

Operator

operator

#1

Please welcome to the stage Datadog Chief Executive Officer, Olivier Pomel.

Olivier Pomel

executive

#2

Good morning, everyone and welcome to DASH. It's amazing to be here again today. We have a lot to show you and I will tell you that it feels amazing to meet with so many of you here in New York, this week. First things first, I'd like to thank our sponsors and partners. They do so much for us, for DASH and for Datadog and you can meet them all in the Expo hall today. Now I won't be very long. As you know, at Datadog, we prefer to let the product do most of the talking. So this morning, members of our product and engineering teams are going to come on stage and they will show you some of the things we've been working on. But before we do that, I'd like to take a minute to thank all of you. I want to thank you for your trust, for your business but also for all of the feedback you give us every single day. In fact, since last year, we have been in 187,000 customer meetings. And these meetings ultimately resulted in about 0.5 million [indiscernible] production covering more than 400 new products and new features. We build our products for you and with you. And together, we work really hard to allow you to fully observe your applications, to operate securely in the cloud and most importantly, to take action, so you can run leaner, faster and better businesses. And to get us started on that road, I'd like to invite on stage, Alexis.

Alexis Le-Quoc

executive

#3

Thanks, Olivier. It's really great to be here. And we have a lot of things to show you today. They illustrate the kind of work we've been doing across the entire platform over the past year. And that work reflects what you've been telling us, new use cases, new software stacks but also larger scale and faster pace on existing ones. Let's get started with AI. More specifically, we hear from a number of you that your LLM powered applications are moving to production. And once in production, it is crucial that they're monitored like any other load-bearing application. But what's different in their case is the kind of data that's essential to understand health, performance and safety. To tell you more, please welcome Mohamed to the stage.

Mohamed Alimi

executive

#4

Thanks, Alexis. Hi, everyone. My name is Mohamed Alimi and I'm an engineering lead at Datadog. Over the past year, we've seen the impressive potential of large language models, as many of you experimented with them. And this led to incredible innovation across many industries, where many of these experiments have evolved from simple applications into much more sophisticated systems running in production, using multiple LLMs, orchestration frameworks, retrieval systems and knowledge graphs. But this also led to new challenges. First, as these applications involve LLMs in more complex patterns, they become much harder to troubleshoot. Second, due to the inherent unpredictability of LLMs and AI components in general, these applications need continuous monitoring for hallucinations. And finally, these applications can face significant risk from prompt hacking and data sharing. To help you address all these challenges, I am happy to announce that Datadog LLM Observability is now available. Let's see how it works. What you see in front of us is a live stream of interactions from an LLM-powered e-commerce chatbot. Datadog LLM Observability highlights issues that require immediate attention at the top. For instance, it has flagged errors, potential hallucinations, slow responses, token count and security threats. It also highlights faithfulness, which is a measure of correctness and accuracy relative to a given context. And we use faithfulness here as a proxy for hallucinations. I'm interested in the reported hallucination. So I select the first item and I land on a comprehensive page with valuable information about the interaction. So here, you can see the duration of the interaction, the token count it consumed, the number of LLM calls it made and also the models it invoked. Right below, you see the input and the output. In this case, a chatbot user has asked a compound question about a recent order. At first glance, the response seems fine.
But when I check the Trace View below, I can see an issue in one of the spans. So I click on it to dive deeper and I see it's a retrieval span that is flagged for hallucination. Datadog LLM Observability reports all the context chunks that were used to compile this response. I also notice that the chunk that received the highest relevancy score contains outdated information. However, the correct chunk has received a lower score. So the response that we saw earlier, which felt fine, is actually incorrect and this is great. With a few clicks, I managed to find this issue. But I don't want to stop here. I want to know more. And in particular, I want to know the behavior of the chatbot with respect to similar questions. So with a single click, I discover that this question belongs to a wider cluster about return policy. Datadog LLM Observability groups semantically similar prompts and responses into clusters and also auto labels them for easy analysis. So here in this cluster map, you can see that the return policy cluster is impacted by a high rate of hallucination, which is not the case for other clusters. So it is very likely that the category of hallucination we've seen earlier is very specific to this cluster. And this is great. So I have all the information that I need to report this issue to the relevant teams. So Datadog LLM Observability allowed me to isolate the issue, understand the root cause and also assess how widespread it is. Now I go back to where I started to address other issues. And this time, we'll focus on security threats. Datadog LLM Observability allows us to monitor application inputs and outputs for malicious input and sensitive data leaks. So I select the first reported item. And here, as you can see, we have a user who asked a question about return policy that contains a malicious segment that forced the chatbot to generate a legally binding million-dollar offer.
And Datadog LLM Observability has correctly identified this issue as prompt injection. Here again, I have all the information that I need to report this issue. As we saw, Datadog LLM Observability helped me quickly troubleshoot my application. It also reported a quality issue, in this case, a hallucination that could cause customer dissatisfaction and disengagement. It also reported a security threat, in this case, a prompt injection that can cause financial damage and erosion of customer trust. We're so excited to put this product into your hands. So to start, you can use the link above and be sure to visit our booth. It's so great to see so many of our customers innovating with LLMs and also using Datadog LLM Observability. And as a great example, here is Jaime from WHOOP to tell you more. Thank you. [Presentation]

Alexis Le-Quoc

executive

#5

Thank you, Jaime. I too wear WHOOP and this morning, I've been watching my stress level before the keynote. And I asked AI WHOOP Coach to help me relax before getting on stage. And it guided me to do some helpful breathing exercises. And now I feel much better. So you just heard from one of our customers about how we support the safe use of LLMs. We also work with a number of AI partners. They are the folks who provide the foundational models that power all these new applications. And there, our work is about getting tightly integrated with their stacks so that getting deep observability on LLMs is just a 1-click operation for you. To tell you more, I'd like to hand it over to Daniela Amodei, President at Anthropic, one of our AI partners.

Daniela Amodei

attendee

#6

Thank you, Alexis. Hi, everyone. I'm Daniela Amodei, President and Co-Founder of Anthropic. Anthropic was founded 3 years ago with the goal of building trustworthy and reliable AI with safety at its center. And today, our mission is to ensure the world safely makes the transition through transformative AI. Earlier this year, we launched our Claude 3 model family, which was designed for many of the use cases that businesses like yours care most about. Claude 3 Haiku enables many AI use cases like customer support, chat and sales. Claude 3 Sonnet enables the use of AI to perform tasks like search and retrieval over your own data or to generate content and code to save your employees' time. And Claude 3 Opus is for use cases that require state-of-the-art intelligence for complex domains like R&D, drug discovery, market forecasting and more. Our models are designed to meet your diverse needs and use cases but they all share a common goal: to help people feel more at ease with AI and the trajectory of this evolving technology. At Anthropic, we recognize there's still a long journey ahead for the AI industry. We are here as a trusted partner in every step of your generative AI journey. We aim to provide the deep technical expertise and reliability that come with models like Claude, as well as the observability and security features that our partners like Datadog provide. And this is why today, I'm excited to announce our integration with Datadog LLM Observability. This new native integration offers you, our joint customers, robust monitoring capabilities and a suite of evaluations that assess the quality and safety of your LLM applications. This provides you with real-time insight into performance and usage with full visibility into the end-to-end LLM trace, enabling you to troubleshoot any issues, reduce downtime and get your Claude-powered applications to market faster. At Anthropic, we believe that what we do now will set the trajectory for how AI unfolds.
Looking ahead, as our models keep getting more powerful, we are committed to ensuring that they are best in class on the factors that you care about: intelligence, speed, price and reliability. And last but not least, we believe that understanding these models fully is critical, which is why we've invested so heavily in interpretability, research that helps us see inside the black box of the model and understand how it works. We're excited to partner with Datadog and bring trusted, powerful AI to all of you. Thank you.

Alexis Le-Quoc

executive

#7

Thank you, Daniela. We, too, are really excited to get AI to be smarter, faster, cheaper and better. So to recap, if you're building an LLM-based application, please give our LLM Observability a try. We'd love to hear what you think. Now the meteoric rise of GenAI over the past couple of years doesn't mean that we get to ignore the rest of the stack. Containers, for instance, continue to be adopted and deployed at a rapid pace. So we continue to invest a lot of time and effort in making them easier to manage. To hear about the latest and greatest in container monitoring, I'd like to welcome Danny to the stage.

Danny Driscoll

executive

#8

Thank you, Alexis. I'm Danny Driscoll, Product Manager for container and Kubernetes monitoring here at Datadog. Now most of you here today are already partnering with us to monitor the health and performance of your Kubernetes environments with Datadog container monitoring. Over the years, many of you have told us that you chose to build your platforms on Kubernetes to deliver more efficient resource use, which can lead to lower infrastructure cost and lower energy consumption impact for your businesses. However, in our latest research, we observed that more than 65% of Datadog-monitored Kubernetes containers are still using less than half of the requested memory and CPU resources. There's still more that we can be doing here. That's why today, we're very excited to announce Datadog Kubernetes Autoscaling. With Datadog Kubernetes Autoscaling, we have a new solution that will allow you to prioritize the workloads and clusters with the most savings potential, to take direct action from the Datadog platform to apply and then automate rightsizing recommendations, and to observe and measure the impact of your complete autoscaling program on your key cost and efficiency metrics. Let's take a look at a demo. So starting from my Kubernetes Overview, I can now immediately see the total idle costs for my entire Kubernetes footprint across clouds. In this case, I see I have over $85,000 in idle spend last month and I'm motivated to start optimizing. On the cluster list view, I have a prioritized list of all of my Kubernetes clusters across clouds. These are sorted based on their idle CPU and memory use with visibility into idle costs and a day-by-day breakdown of their total cost over the trailing 30 days. All of these signals are available with my existing Datadog Agent instrumentation without any need to deploy any new tools. Now that I've identified this top dev EKS shop as cluster, as my most overprovisioned, I can continue on to optimize it.
On the cluster detail view, I again have a prioritized list, this time of the workloads within my cluster sorted by idle CPU and memory with visibility into idle costs. In this case, I can clearly see that [ add auction to ] is my most expensive and over-provisioned workload and I can take direct action from here to optimize it. When I open up [ add auction to ], I get a complete multidimensional scaling recommendation for this workload that combines vertical, how to rightsize the pod by setting proper CPU and memory limits and requests, and horizontal, how to select the proper number of replicas to meet demand. For each component of this recommendation, I can drill in and inspect the Datadog metrics backing it at the individual container level to build understanding and trust before I proceed with any changes in my environment. Once I do have that trust and I'm ready to proceed with this change, I have multiple options for how to do so. I could simply reference these values with my existing configuration tool or GitOps workflows but in this case, I'm more excited to take advantage of the Datadog platform to apply it directly. Now I can do this as a one-time change and adjust the workload based on its recent traffic up till now. But in this case, I'm even more excited to enable autoscaling, which will ensure that Datadog can continuously monitor and tune this workload so that its usage closely tracks with its requested level of resources moving forward. Now let's jump ahead to after I've started autoscaling this workload. Here, we're able to observe 2 key signals from the workload that show how the autoscaling is taking effect. On the right, we have an event stream of the Kubernetes events emitted by our new Datadog Pod Autoscaler custom resource, which is responsible for that continuous reevaluation and application of changes.
Those events are overlaid with our CPU and memory metrics for the workload, which we can now see have a tight alignment between usage and requested resources, reflecting a more efficient arrangement for the workload. Back out at the cluster level, we can start seeing how this has an impact in aggregate. As we see the total CPU and memory allocatable for the cluster trend down, our autoscaler is able to kick in and start downsizing the nodes for this cluster, leading to that direct cost savings. So to recap, with Datadog Kubernetes Autoscaling, we have a solution that will allow you to easily and quickly identify and prioritize the clusters and workloads with the most opportunity for savings, to take direct action to apply and then automate rightsizing recommendations from the Datadog platform, and to observe and measure the impacts of your entire autoscaling program on your key cost and efficiency metrics. We're very excited to start our private beta of Datadog Kubernetes Autoscaling with you all. Feel free to sign up at the link here and swing by our booth for more information. And now I'm very excited to introduce Jason to talk about logs.
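
The prioritization and vertical rightsizing steps described in this demo can be sketched in a few lines of Python. This is only an illustration of the idea, not Datadog's algorithm: the workload names, the sample numbers, the 20% headroom and the helper functions `idle_fraction` and `recommend_request` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str             # hypothetical workload name
    cpu_requested: float  # CPU cores requested by the workload
    cpu_used_p95: float   # observed 95th-percentile CPU usage, in cores


def idle_fraction(w: Workload) -> float:
    """Share of the requested CPU that sits unused."""
    if w.cpu_requested <= 0:
        return 0.0
    return max(0.0, 1.0 - w.cpu_used_p95 / w.cpu_requested)


def recommend_request(w: Workload, headroom: float = 0.2) -> float:
    """Suggest a new request: observed usage plus an assumed safety margin."""
    return round(w.cpu_used_p95 * (1.0 + headroom), 3)


workloads = [
    Workload("checkout", cpu_requested=4.0, cpu_used_p95=1.0),
    Workload("search", cpu_requested=2.0, cpu_used_p95=1.6),
]

# Prioritize the most over-provisioned workloads first.
ranked = sorted(workloads, key=idle_fraction, reverse=True)
for w in ranked:
    print(w.name, f"idle={idle_fraction(w):.0%}", "suggested request:", recommend_request(w))
```

A continuous autoscaler would rerun this kind of evaluation as usage changes, rather than applying it once.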

Jason Manson-Hing

executive

#9

Thanks, Danny. Hi. I'm Jason, Product Manager here at Datadog. As you all know, logs are a rich data source containing a wealth of information that can be used in incident response, security investigations and even business reporting. In many ways, logs are the foundation of observability. But logs come with trade-offs. There is no set schema or standard for logging, which makes performing this [ rich ] investigation difficult. And with logs coming from more sources than ever, it's hard to know what attributes and content need to be extracted ahead of time. And let's not forget, I'm often forced to lean on a small group of expert users writing long queries in a proprietary syntax just to derive insights from my logs. But what if I could craft these powerful analytical queries myself just by chaining together a series of simple operations? Well, now I can. Introducing Log Workspaces. Log Workspaces allow me to freely join and transform data across multiple sources and then chain together simple queries to perform complex analysis in a single collaborative space. Let me illustrate the power of Workspaces with an example. I'm an engineer at a trading company and my team notices an increase in the number of failed transactions between 2 of our services that receive and execute trades. While the rest of my team works on root cause, I've been tasked with understanding the potential revenue impact on our business and more importantly, who our affected customers are. Given that my logs are coming from 2 separate services with different attributes and content and that my customer data resides in Salesforce, I realize there is no single source for me to query. I'm going to have to join fields from across these 3 sources, calculate new values and even reference data that lives outside of Datadog. And with Log Workspaces, I can do just that. Let me show you how.
I'll start with a simple search in the Log Explorer, as it's the fastest way for me to quickly filter across all of my queries and logs. I want the logs from my trading platform environment, specifically the service that receives these trades. I'll take this search and open it as my first data source in the Workspace. Data sources help me transform my logs into structured tables that I can query with SQL, so they're very important. The definition of these tables comes from the columns that I've selected in my search. And I'm going to name this data set "my trade received logs." Okay. Now that I have a record of all of these trades starting, I want to figure out which of them didn't complete successfully. These logs come from another service that finalizes these trades and I'm thinking that I can join them using a transaction ID. So let me import this data source as a second source, the trade execution logs, now looking at those logs from that trade finalizing service. The attributes are quite different on these logs. I definitely need that customer ID and status but it looks like the transaction ID that I was hoping to use to join my logs together isn't available as an attribute. But with Log Workspaces, I can extract fields directly at query time. Let me show you how. I'm going to use a transformation cell, which I'll name "parse the execution logs," to help me extract that transaction ID from my log message. I'll use the word transaction as an anchor and then capture the next word, which is my transaction ID. And just like that, transaction ID is now a column that I can query. Okay. We have all the pieces here. Let's stitch it together. I'll use an analysis cell to generate this failed transaction record and I'll ask Bits AI to query for me. I know that I want my time stamp, my customer ID, my transaction ID and the dollar value from my received logs as well as the transaction status from my execution logs.
I can use the transaction ID that we just extracted to make this join. And while I'm at it, I'll tier the trades, so the high-value transactions are marked as being over $250. Of course, I'm looking for the transactions that failed to complete, so I'll be sure to filter for just the records with errors. By describing my goal, Bits AI is able to write and execute this query for me. We're almost there but remember, I actually wanted to know who these customers are. For that I get a monthly customer report from Salesforce, so let me import this data source into the Workspace to finish my investigation. These are my Salesforce users. This time, I'll join the customer data in place using the customer ID from my Salesforce users and that failed transaction record to produce a final shareable report, the transaction record with names. And there we have it. I have my customer name and country data from Salesforce alongside the transaction value, the transaction tier that we derived and the transaction status from my execution logs. Now I know I was only tasked with figuring out who these customers are but let me take this a step further and see if they have anything in common. This might be of help to my team working on root causing the issue. For that, I'll visualize this data and see if anything stands out. I'll start with an easy one, filtering for that high-value tier that we just derived and then grouping by the country data that we imported from Salesforce. Immediately, I'm noticing that a large number of these high-value trades that are failing are coming from Italy, so I'll be sure to let the support team for that region know. I'll also share my full report with the rest of my engineering team as this might be of help for doing root cause. Maybe we need to be taking a closer look at the EU data center. They'll be able to see exactly how I came to this conclusion and we can continue the investigation together in my workspace.
This is just one example of how the workspace was able to help me identify the impact of failing transactions on my business and identify our impacted users. But let's imagine together the possibility of this workspace for security investigations, creating evidence time lines, doing post-mortem analysis or even as part of an audit and a compliance report. Using Log Workspaces, I'm able to get the most out of my logs by joining, transforming and chaining together data from logs and other sources expressively within Datadog to construct nuanced datasets without having to learn a complex syntax. I'm empowered to perform my own analysis being limited only by my own imagination. To get started with Log Workspaces and to learn more about how to go further with your logs, please visit the link behind me and be sure to visit our booth on the Expo floor. And now to tell you more about how to get the most out of the rest of your observability stack, I'll pass it off to Gordon.
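
The chain of steps in this walkthrough (extract a field at query time, join two log sources, tier and filter, enrich with reference data, group by country) can be sketched in plain Python. This does not use Log Workspaces or any Datadog API; the sample records, field names and the regular expression are hypothetical stand-ins chosen to mirror the demo.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for the three sources in the demo.
received_logs = [
    {"transaction_id": "t1", "customer_id": "c1", "value": 900.0},
    {"transaction_id": "t2", "customer_id": "c2", "value": 120.0},
    {"transaction_id": "t3", "customer_id": "c3", "value": 400.0},
]
execution_logs = [
    {"customer_id": "c1", "status": "error", "message": "failed to settle transaction t1 after retry"},
    {"customer_id": "c2", "status": "ok", "message": "settled transaction t2"},
    {"customer_id": "c3", "status": "error", "message": "timeout on transaction t3"},
]
customers = {
    "c1": {"name": "Ada", "country": "Italy"},
    "c2": {"name": "Ben", "country": "France"},
    "c3": {"name": "Cleo", "country": "Italy"},
}

# Step 1: extract the transaction ID at query time, using the word
# "transaction" as an anchor and capturing the next word.
TXN = re.compile(r"\btransaction\s+(\w+)")


def parse_execution(log):
    match = TXN.search(log["message"])
    return {**log, "transaction_id": match.group(1) if match else None}


parsed = [parse_execution(log) for log in execution_logs]

# Step 2: join received and execution logs on transaction ID, keep only the
# failures, and tier each trade (high value means over $250, as in the talk).
status_by_txn = {p["transaction_id"]: p["status"] for p in parsed if p["transaction_id"]}
failed = [
    {**r, "status": status_by_txn[r["transaction_id"]],
     "tier": "high" if r["value"] > 250 else "standard"}
    for r in received_logs
    if status_by_txn.get(r["transaction_id"]) == "error"
]

# Step 3: enrich with the customer reference data and group by country.
report = [{**f, **customers[f["customer_id"]]} for f in failed if f["customer_id"] in customers]
high_value_by_country = Counter(row["country"] for row in report if row["tier"] == "high")
print(high_value_by_country)
```

Each intermediate list plays the role of one cell in the workspace, so a later step can be changed without redoing the earlier ones.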

Gordon Radlein

executive

#10

Hi, everyone. I'm Gordon. I'm a Director of Engineering, working on APM and OpenTelemetry. Now as many of you may know, OpenTelemetry or OTel, as a standard for instrumentation and telemetry, offers a ton of great benefits, like portability, interoperability and vendor neutrality. At Datadog, we believe OpenTelemetry is revolutionizing observability by providing a standards-based foundation for us to build on, unlocking innovation across the industry. It's a tide that lifts all boats. And that's why I'm thrilled to invite Michele Titolo from GitHub to the stage to tell you a bit more about how they navigated their tide on their way to OpenTelemetry and Datadog.

Michele Titolo

attendee

#11

Thank you, Gordon. My name is Michele Titolo. I'm a Principal Software Engineer at GitHub, working in our platform engineering organization. And today, I'm here to share with you our OpenTelemetry journey. GitHub is the home of open source software and wherever we can, we try to use open source. We're also running at a really massive scale. So any piece of software we choose requires a lot of time and planning to roll out. So before I get into our journey, I just wanted to share some facts. GitHub serves 5 billion API requests per day. We have over 100 million people collaborating across 420 million repositories. We have hundreds of services, including our very large Ruby on Rails monolith. We run everything from containers to bare metal to VMs and they all send traces. Our thousands of machines are what make GitHub possible and they're all hooked up with OpenTelemetry. But first, let's start with the beginning. Way back in 2016, we added tracing to github.com. At the time, there was no OpenTelemetry, no open standards. So we used something from our vendor. When OpenTracing came out, we migrated to it. And we stayed there until OpenTelemetry came out. And then there were a lot of services at that point that had tracing. So it took us a bit longer to actually roll out. So a few months later, we GA-ed our internal OpenTelemetry, which included the monolith. And then we had the long tail. Remember, hundreds of services. It takes time for all of those different teams to do the updates to get them on to OpenTelemetry. But then we were starting to see that long tail really shrinking, getting to that point where there were only a handful of services left. And we thought, we've been on the same vendor for 7 years, maybe it's time to think about something else. And so we did. Last summer, we began our evaluation into alternative tracing vendors because we're on OTel. That same month, we began a proof of concept with Datadog APM.
We were already a metrics customer, so being able to consolidate was a huge win for us. In September, we onboarded to Datadog and turned on the Datadog APM migration. And then in October, we turned off our old tracing vendor. If you're doing math in your head, that's 4 months from our initial "let's just think about migrating" to actually performing the migration and getting onto the new platform. That's the power of OpenTelemetry and using vendor-agnostic tooling. You hear a lot of people say it's possible and I'm here telling you that we did it. We also had a really easy time getting set up with APM. We had 38 lines of YAML, which is not a programming language, to initially set up APM. We also run what's called a gateway model. So we don't have sidecars, which most people assume. Again, bare metal, VMs, sidecars are hard. So instead, we run a fleet of collectors that all of our applications connect to. And that collector, that one central place, is where we configure where we send our traces. But of course, nothing ever goes quite as planned in software. So we did run into a couple of hiccups, which I'm just going to briefly share with you. Firstly, we had 1 month of time between when we added APM and got rid of our old vendor. So for 1 month, we were sending twice the amount of trace data. Any software engineer will be able to tell you that you cannot do twice the amount of work with the same amount of capacity. So our OpenTelemetry collector fleet got a little unhappy. Thankfully, because it was a single fleet, we made a pull request to add more capacity. We increased the size of that fleet and we were able to run for that month with our increased capacity. And then once we turned off the old vendor, we were able to scale back down. We also have this really big Ruby on Rails monolith that I recently learned is 16 years old. So adding libraries there or making some foundational changes is challenging.
We were running into some issues with the upstream Ruby OpenTelemetry libraries. But again, OpenTelemetry is open source. So we were able to make pull requests upstream, get those merged, get those [indiscernible] released and then pull them into our application in order to see the results and the rest of the community gets to benefit. I'm going to wrap up with just a few success stories on how tracing and OpenTelemetry have really helped us improve our engineering visibility. The first is with performance savings. One team was investigating why updates to pull requests were taking longer than any other call for pull requests. They looked at their traces, their flame graphs and saw we were updating the model twice. Easy fix, they batched that together and were able to significantly reduce latency for that one API call. We also love traces for end-to-end visibility, especially when it comes to our GraphQL services. The GitHub graph is huge, which means engineers can query tens to hundreds to potentially thousands of objects. Our authorization service is responsible for making sure those results are returned to someone who has access to them. And at the beginning of this, every single object was being called individually. So that's tens to hundreds to potentially thousands of queries to authorization in 1 GraphQL call. That wasn't great. So we built a new way of batching authorization calls to the authorization service so that every GraphQL call results in 1 call to authorization and not this huge N+1 thing that we were seeing. And the authorization service was really able to benefit from this and just be more performant in general. Lastly, every company has those bugs that you're like, what is going on here. And what's been plaguing some of our engineers for years has been timeouts. Until very recently, we hadn't been able to see what was going on, what was still pending when things were timing out. So we had 2 different areas we investigated.
First, we enabled pending spans in our OpenTelemetry collectors so that if a span hadn't finished but the overall request timed out, we were able to see what was happening. But then things still weren't working. So we looked at the upstream OpenTelemetry SDKs and realized we needed to change how we instrumented Rack, which is the Ruby on Rails web server. We did that. We rolled that out. And for the first time, engineers were actually able to see what was happening when a request timed out. We were also able to see those timeouts show up in our RED metrics because we were tagging the spans with errors appropriately. This has been a huge win for engineers at GitHub and I'm really excited to share it with you today. That's it for me with a really quick overview of GitHub's OpenTelemetry journey. Back to you, Gordon.
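
The N+1 pattern described in this segment, and the batched alternative, can be sketched as follows. This is a generic illustration rather than GitHub's implementation: the `AuthorizationService` class, its methods and the sample grants are hypothetical, and a call counter stands in for network round trips.

```python
from typing import Dict, Iterable, List, Set


class AuthorizationService:
    """Stand-in for a remote authorization service; counts calls made to it."""

    def __init__(self, grants: Dict[str, Set[str]]):
        self._grants = grants
        self.calls = 0

    def can_read(self, user: str, object_id: str) -> bool:
        self.calls += 1  # one round trip per object
        return object_id in self._grants.get(user, set())

    def can_read_many(self, user: str, object_ids: Iterable[str]) -> Set[str]:
        self.calls += 1  # one round trip for the whole batch
        return set(object_ids) & self._grants.get(user, set())


def resolve_n_plus_one(auth: AuthorizationService, user: str, object_ids: List[str]) -> List[str]:
    # One authorization call for every object in the query result.
    return [o for o in object_ids if auth.can_read(user, o)]


def resolve_batched(auth: AuthorizationService, user: str, object_ids: List[str]) -> List[str]:
    # A single authorization call, then a local filter that preserves order.
    allowed = auth.can_read_many(user, object_ids)
    return [o for o in object_ids if o in allowed]


objects = [f"repo-{i}" for i in range(100)]
grants = {"alice": {f"repo-{i}" for i in range(0, 100, 2)}}

slow = AuthorizationService(grants)
fast = AuthorizationService(grants)
assert resolve_n_plus_one(slow, "alice", objects) == resolve_batched(fast, "alice", objects)
print("per-object calls:", slow.calls, "batched calls:", fast.calls)
```

In a trace, the first version shows up as a long run of near-identical child spans under one request, which is what makes the pattern easy to spot in a flame graph.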

Gordon Radlein

executive

#12

Wow, it's amazing to see the progress that OpenTelemetry has made over the past couple of years. The growth of the community and adoption from customers such as GitHub is a real testament to the need for this project. And that's why we're enthusiastic supporters of OpenTelemetry here at Datadog. We're a top 10 contributor. And over the past year, our engineers have worked to make profiling an industry standard through the new profiling signal and they have helped the collector along its journey toward a stable 1.0. We're maintainers across multiple repos and we expect to continue expanding our support. A stable standard is good for all of us and we're happy to do our part. In fact, the benefits of that standard are why we've been working hard to make ourselves more compatible with OpenTelemetry. Because we've been building in this space for so long, OTel doesn't yet support all the products that we do. And with our pace of innovation, I expect that to continue to be the case even as we close the gap. And that leaves you with a dilemma: go all in on Datadog and forego some of the great benefits that OpenTelemetry brings to the table or be limited to the products that OTel supports. Naturally, you're probably wondering, why can't I just have both? Well, we've been working hard on that problem because we believe that Datadog is better with OpenTelemetry and OpenTelemetry is better with Datadog. Last fall, we announced support for W3C Trace Context and the OTel API in our APM libraries, bringing vendor-neutral instrumentation and interoperability into our native ecosystem. This was a big step. But we know that instrumentation is only one side of modern observability and many of you also want the flexibility and control offered by the collector. Well, today, I'm happy to announce we're taking our next big step by unifying the Datadog Agent and OpenTelemetry Collector.
Now you can benefit from the agent and the collector working together to form a whole greater than the sum of its parts, enriching your OTLP data and enabling our product suite. Now you can just have both. With our new agent, collector users will immediately get access to our full product suite and platform. You'll enjoy app-based management of your collector fleet and you'll get the peace of mind that comes with being backed by our dedicated product support. New agent users aren't being left out of the fun either. You'll get access to the large and growing number of community contributed integrations, including out-of-the-box support for the growing number of commercial and open source tools being instrumented natively with OpenTelemetry. You'll get better interoperability across the tools in your observability fleet, whether vendor based or open source. And of course, you'll get control over your OTLP data with full access to the collector's powerful routing and processing capabilities. Let me show you how it works. So to get started, simply install the new agent or update to the new agent. And if you're wondering about your current collector config, don't, you can just keep it. All you need to do is point our new agent at your existing configuration. That's right. Your existing OpenTelemetry Collector configuration and the pipelines defined by it will continue to just work with our new agent by leveraging the integrated Collector. And once deployed, collector users will immediately feel the difference in the depth of the product experience now that you have access to the full Datadog platform. For example, we know managing large fleets of collectors can be tricky. With access to our fleet automation tools, you'll be able to view and manage your collectors from within the app, getting visibility into configuration and dependencies. 
You will get access to our full container observability suite, including autoscaling, as Danny talked about earlier and live containers, giving you real-time insights into your containers and the processes running on them. You'll get access to unique features like single-step APM providing zero touch automatic instrumentation of all your running services, getting you app level insight in minutes with minimal effort, all with the click of a single button or the setting of a single flag. Oh, did I mention that single step works out of the box with your OTel API instrumented code? Because it does. And of course, when things go wrong, you'll be able to fall back on our full product support experience. And finally, you'll get access to the more than 750 integrations that come standard with the Datadog Agent and platform. So as you can see, whether your observability strategy is Datadog first or OpenTelemetry first, your experience only gets better with our new agent. Get the best of Datadog while benefiting from the standards-based instrumentation. There's no more dilemma. Datadog and OpenTelemetry work better together. And the best is yet to come. If you're as excited about this as I am, head on over to dashcon.io/together to sign up for the private beta. And don't forget to check out the demo booth to see it in action. Thanks.

Alexis Le-Quoc

executive

#13

Well, thank you, Gordon. And thank you, Michele, for sharing your story on OpenTelemetry and GitHub. Let us do a quick recap of what we just covered. With Kubernetes Autoscaling, you can optimize your spend by rightsizing pods to workloads and save real money. With Log Workspaces, you can slice and dice all your log data directly within Datadog. And last but not least, our investment in OpenTelemetry. We put our money where our mouth is, and we give you the best of both worlds with the newly integrated OpenTelemetry Collector within the Datadog Agent. And now let's switch gears. Let's talk about security. There is not a single day that goes by without hearing about some kind of cybersecurity incident. And when that happens, we know it's all hands on deck. What we're seeing is that even in day-to-day security, it's never just a security team that gets involved. Everybody has to chip in. Code needs to be fixed and reviewed, new versions need to get deployed, cloud configurations need to get patched and hardened and so on. In other words, whether we, as Dev and SREs like it or not, we are an integral part of the effort to secure our infrastructure and our application. And that is precisely why we've been building security products for the past few years in a way that takes advantage of the rest of the platform and in a way that we hope speaks to you. To hear about the latest in security, please welcome Daniela.

Daniela da Cruz

executive

#14

Good morning, everyone. My name is Daniela. I'm an Engineering Director here at Datadog and I'm excited to share what we have been up to with our security products. Today, more than 6,000 of your companies and organizations use Datadog security every day to detect vulnerabilities and protect your cloud environments from attacks. And as all of you know, when under a security attack, every second counts and getting the right context at the right time is crucial. Last year, we announced the Security Inbox, which allows you to sift through the noise and zero in on the most critical issues in your environment. And today, I'm excited to share that we have made it easier than ever to get started with Datadog Security and take immediate action right there from your Security Inbox with all the context you need to determine the next steps. In fact, it can take as little as a few minutes, thanks to our newly launched Agentless Scanning, now generally available. Let me show you how. All I need is to go to the cloud security management setup page, configure an integration with my cloud provider under the cloud accounts section and activate Agentless. Here, I'm going to activate Agentless for hosts, containers, [indiscernible] functions and data security as well. When I'm done with the activation, the Agentless scanner will immediately start analyzing whatever is deployed on my cloud accounts for the resources I enabled it for. Let's go take a look at what it finds. For that, I will navigate to my Security Inbox, which lists out all of my security blind spots. Here, I see immediately 2 different critical issues: 1 attack path and 1 application vulnerability. My Security Inbox automatically correlated and prioritized issues across my cloud misconfigurations, identity risks, infrastructure and application vulnerabilities. I didn't have to do anything besides enabling Agentless Scanning, not to mention that I didn't have to add any tasks to my team's backlog to get this visibility.
All right. Prioritized inboxes are great. But do you know what's even better? An empty one. Let's see how we can make it happen. I will start with the first one in my list. It's a public EC2 instance, potentially exposing sensitive data. Let me investigate further. Here on the side panel, I am able to get additional details such as resource name, tags and account ID. Looking at the new security context map, I'm able to see my vulnerable EC2 instance in full context. On the left side, it shows me how this instance is exposed to the Internet and, on the right side, the potential blast radius: which resources and services are impacted. In fact, I can go one step further and view these in data security. With data security launched today in beta, I see exactly what type of sensitive data is being exposed. Now that I have all the information, let's go back and fix this. Going back to the side panel with all the details, I see that I'm able to start fixing this vulnerability directly from the context map. By clicking on remediate, I have several options to fix it. This time, I'm going with the first one, open a pull request. Datadog will automatically generate a pull request for me with all the details that my team needs to merge it. It includes a brief description of the vulnerability, a link to the original finding and a pointer to the vulnerable resource. And of course, I can check the proposed changes to the relevant files. In a few clicks, I'm able to fix a potential data breach that could be a real headache for my organization and I wrapped up the whole thing in just a few minutes. Amazing, right? But wait, I'm not yet done. I still have that second critical issue in my inbox. This time, it's a remote code execution. It's a vulnerability in production, it's under attack and there is an exploit available. Let's investigate it further. Here, I see it's a vulnerability in a third-party library, Spring Framework. It's used by the product recommendation service.
It's affecting my production environment and there is a high probability of malicious exploitation. Datadog gave this vulnerability a solid score of 9.8 out of 10. Looking at the severity breakdown, I understand exactly how each risk factor impacted the score. The Datadog severity score gives me the full context: whether it's in a sensitive or Internet-exposed environment, the evidence of attacks and the exploitability risk. Okay. I think it's pretty clear that I should fix this. For that, I will navigate to the remediation tab. Here, I find comprehensive step-by-step instructions, all the way from how to find the vulnerable library in my dependencies to how to upgrade it to a new version. But you know what, I already took care of that first issue in my inbox. This time, I want my development team to handle it. Since my team is on Datadog, I can send them a Slack message directly from the side panel and create a Jira ticket with all the details that they need to start fixing this vulnerability. All right. Let's take a look at the fruits of my labor. Amazing, my Security Inbox is at zero, I can finally take a break. We just covered a lot of ground. Let's recap. With our Agentless scanner, you can now get started with Datadog Security in just a few minutes without deploying any additional agents or software. With our actionable Security Inbox, you can get immediate context and spring into action using the new security context map and data security as well. Last, but not least, with our infrastructure as code auto remediation, you can now automatically generate pull requests for your infrastructure as code. To learn more about everything I just covered, please use the link on the screen. And now to tell you how Datadog can help you to catch security issues even before they reach production, I will hand it over to Julien.

Julien Delange

executive

#15

Hi, everyone. My name is Julien, and I'm a software engineer here at Datadog. Daniela showed us how quickly we can get started with Datadog Security. Datadog Security shows the vulnerabilities in your production context. But over the past few months, we learned from conversations with the more than 6,000 organizations that use Datadog Security that you want to find vulnerabilities earlier in the software development life cycle. That's why today, I'm pleased to announce that Datadog now secures the entire software development life cycle, from the first line of code you write all the way to deploying and monitoring the application in production. Let me show you how. I will start with my production environment and I want to make sure that I remediate the most important critical vulnerabilities. Datadog Application Security Management determines the security posture of my application. The prioritization funnel helps me cut through the noise. Here, I go from 158 vulnerabilities in production to the [ 6 1 ] that cause an immediate threat. And this is exactly what I want to focus on. On the right side, I can see the breakdown by team, service or library. It helps me to prioritize my remediation efforts. Datadog also shows me how quickly my team remediates vulnerabilities. I want to make sure that my team remediates existing vulnerabilities faster than new ones are being discovered. I also see the breakdown of all my vulnerabilities according to [indiscernible] framework. Now my production environment is safe and secure. But what I want to do is to continue to detect vulnerabilities, remediate existing vulnerabilities and not introduce new vulnerabilities. For this, I use Datadog Code Analysis. In just a few minutes, I connect my code repositories and get started and analyze my code in my IDE or my code repository. I write code in my IDE and Datadog analyzes my code as I'm writing it. Here, I add a value in a database.
And if I have a vulnerability such as a SQL injection, Datadog detects it but also suggests a fix. In a matter of seconds, I find this vulnerability and I fix it. And if I don't use the IDE integration, then Datadog can also analyze my code on [indiscernible] and annotate my [indiscernible] request. Now, my code is safe and secure. But what I want is to make sure that the code of all my services and the code of all my teams and applications is safe. And for this, I will use Datadog Quality Gates. With Datadog Quality Gates, I define rules that are checked at every commit. With Datadog Quality Gates, I can define a rule that will block any commit that introduces a vulnerability, either in my code or in a third-party library. With Datadog Quality Gates and Datadog Code Analysis, I can focus on what matters, writing code for my product, writing new features for my customers and not having to worry about adding new vulnerabilities. Let's recap. Datadog now secures the entire software development life cycle. With application security management, I discover vulnerabilities in production. I can cut through the noise and work on what truly matters. With Datadog Code Analysis, I detect issues in my IDE or in my data [indiscernible] requests. I prevent vulnerabilities from reaching production. And finally, with Datadog Quality Gates, I check that no vulnerability is introduced at each commit. Any commit that may introduce a vulnerability will be blocked. To learn more, please visit the link behind me or please see us at the booth. Thank you. And now let me welcome Kassen Qian on stage.
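The SQL injection case mentioned in the demo can be illustrated with a small, self-contained Python sketch. The demo's actual code and the fix Datadog suggested are not shown in the transcript, so the table, the function names and the payload below are assumptions; the example uses the standard-library `sqlite3` module with an in-memory database.

```python
import sqlite3

# In-memory database with two hypothetical users.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "user")],
)


def find_user_unsafe(name):
    # Vulnerable: user input is concatenated straight into the SQL text.
    query = "SELECT name, role FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name):
    # Fixed: the value is passed as a bound parameter, never parsed as SQL.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


payload = "x' OR '1'='1"
print(len(find_user_unsafe(payload)))  # 2: the injected OR clause matches every row
print(len(find_user_safe(payload)))    # 0: the payload is treated as a literal name
```

A static analyzer typically flags the string concatenation in `find_user_unsafe` and proposes the parameterized form used in `find_user_safe`.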

Kassen Qian

executive

#16

Debugging, we all have to do it. Sometimes it feels awesome. After diving in and having that aha moment, I feel like I'm on top of the world. I may not know what day it is but at least I figured it out. Other times, I'm not so lucky. I have to comb through code I've never seen before, flip through documentation someone wrote years ago, think of ways to reproduce this thing and ask for help hoping I don't look stupid. I'm on a wild goose chase and I haven't even touched my code yet. All I want is to fix the bug. And that's why today, I'm excited to introduce Datadog's Live Debugger. For the first time ever, I can debug my application with live production data at every step of the process. Let's see how it works. Let's say I'm a back-end developer working on features for an e-commerce website and I own a service that's responsible for handling the checkout process. In my editor, I've installed Datadog's IDE extension for detailed code insights as I work. I notice that Datadog Error Tracking is telling me that there's an error in my file. It's on the method that checks for valid items in the shopper's cart. So I should probably take a look. Above the stack trace for this error, I can now click on Exception Replay to step through the execution flow of my code as well as see the local variable values that were captured live when the exception was thrown. No need to run my code. Exception Replay captured the run time variables for me, so I can follow the execution path and realize immediately that one of the items passed through my service had a negative price value. My code properly checks that this isn't a valid price. So it was right to throw this error. Now I can rule out that the problem is in my .NET service. It came from wherever my service is getting the price data from. Okay. So what are the services that my service is talking to? Looking at that code is probably a good idea. But where do I start? Do I check APM?
Is there an architecture diagram somewhere that tells me how everything is connected? How am I going to find the pieces of code related to this issue? Actually, I can just click a button and Live Debugger tells me everything I need to know. Live Debugger preserves the troubleshooting context from my IDE, so I can seamlessly continue my investigation. With the help of tracing, it helps me visually understand the state of the application at the time of the error as well as contextualize its location. It also provides me with an AI-powered summary of the executional context for each span, a description of the error itself, as well as a suggested code fix. Most importantly, I can now see the flow of production data between services and exactly where this interaction occurred in the code. My .NET service made a call to a downstream Python service to apply a coupon to the user's cart. The request went through okay, but when I hover over the variables relevant to this request, I can examine their values to see that the final item price the Python service returned was negative, which is why the checkout failed. With Live Debugger, I know exactly which service talked to my service, the exact values that were passed between them and what code was executed when that happened. Amazing, right? But I'm not done yet. In order to fix this bug, I have to reproduce it locally. But what cart items should I mock for this? What attributes do I need to have for each item? What was our discount value again? Luckily, I can fast forward through all of that with Live Debugger's integration test generation. It uses production context collected at each service entry and service exit to generate a test for me that mocks all of the relevant values for calls made between upstream and downstream dependencies for my service. I don't have to worry about setting up my environment or a proper database. I get a working reproduction of this bug with 1 click.
Now I can run this test directly in my IDE and focus on what really matters, the actual debugging of the code, trying a fix and checking if it works. I'll add my test and then I'll set a breakpoint at the end to see if I can actually reproduce the fact that the Python service returned a negative price value. Running this locally with everything reproduced for me, I'm now off to the races. Before Live Debugger, I would have had to inspect my application performance, logs, errors, documentation and lines of code across many different files just to understand the nature of the bug. With Live Debugger, Datadog gathered the real variable values and application context all in 1 place, guided me through the execution paths in code and saved me hours by reproducing the issue for me. Now I can fix the bug and move on with my day with minimal interruptions and fast time to resolution when interruptions do occur. To learn more about Live Debugger, our IDE integrations and Datadog for software delivery, please visit dashcon.io/debug and come say hi at the APM and software delivery booths. Next, I'll pass it to Sara.
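The reproduction described above, in which the downstream coupon service is mocked so that it returns the negative price seen in production, can be sketched as follows. This is an illustrative Python sketch, not the test Live Debugger generates and not the .NET code from the demo; `InvalidPriceError`, `validate_cart`, `checkout`, the `apply_coupon` method and all item values are hypothetical.

```python
from unittest.mock import Mock


class InvalidPriceError(ValueError):
    """Raised when a cart item carries a negative price."""


def validate_cart(items):
    # Mirrors the check from the demo: a negative price is never valid.
    for item in items:
        if item["price"] < 0:
            raise InvalidPriceError(f"negative price for item {item['sku']}")
    return True


def checkout(cart, coupon_service):
    # The coupon service is the downstream dependency that adjusts prices.
    priced = [coupon_service.apply_coupon(item) for item in cart]
    validate_cart(priced)
    return sum(item["price"] for item in priced)


def reproduce_negative_price():
    # Mock the downstream call with the value assumed to be seen in production.
    coupon_service = Mock()
    coupon_service.apply_coupon.return_value = {"sku": "mug", "price": -5.0}
    try:
        checkout([{"sku": "mug", "price": 15.0}], coupon_service)
    except InvalidPriceError as exc:
        return str(exc)
    raise AssertionError("expected InvalidPriceError")


message = reproduce_negative_price()
print(message)  # negative price for item mug
```

Because the dependency is mocked, the failure reproduces without a database or a running coupon service, which is the point the speaker makes about not having to set up an environment.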

Sara Varni

executive

#17

Thank you, Kassen. Hi, everyone. My name is Sara Varni. I'm the CMO here at Datadog. I'm super excited to be here at my first DASH here in Javits and hello to everyone on the live stream. Let's recap what we just saw with Datadog Security. Now you can go from setup to fixing issues in a matter of minutes with our Agentless scanner and our infrastructure as code auto remediation. We also now allow you to secure your entire software development life cycle, helping you build in security from that first line of code and put time back on the clock for your developers. And also, when it comes to developer productivity, we're super excited today to announce Live Debugger, which allows you to fix code with production level context and to fix bugs faster than ever. We're super excited about all of these new features that point towards developer productivity but we don't want you to just hear about it from us. We'd love for you to hear about it from a customer. And now I'd like you to hear from someone who's been revolutionizing the global payment space for over 25 years. And now they're partnering with us, with Datadog, to take it one step further. So let's hear from PayPal. [Presentation]

Sara Varni

executive

#18

Thank you, Ryan. So you saw how with Project Quantum Leap, they were able to accelerate the pace of delivery and improve developer productivity across thousands of developers worldwide at scale. But you also heard from Ryan that it wasn't just about driving internal efficiency, it was also about creating an amazing end user experience. And we're working with customers like PayPal every day to figure out all of the new ways that people can take data on the Datadog platform and put it into action. And now I'd like to welcome on to the stage one of our incredible product managers to talk about a brand-new product that will help you go from not only managing the health of your systems but the health of your business. So please help me welcome to the stage, Jamie Milstein.

Jamie Milstein

executive

#19

Thank you, Sara. It's so good to be back here. I see so many familiar faces. I'm Jamie and I'm a Product Manager on our digital experience team. Now in 2020, we launched Real User Monitoring. And since then, it's become the critical product for you to understand the performance of your browser and mobile apps. Now we've added some critical capabilities with features such as Error Tracking and Core Web Vitals. And over time, you've seen us add a few more things around the user behavior space, Session Replay and Funnel Analysis. But you pushed us to go further. You wanted to understand how changes to your app, ultimately impacts your end users and how that all actually impacts the bottom line. You wanted to understand questions like, as I release a new feature, how does that actually affect my conversion rate? And how does this all again impact the bottom line? So what do we do? We built you something specifically for this to understand your user behavior data. And that's why today, I'm excited to unveil to you all, introducing Product Analytics. All right. I'll take it. Now you don't want to hear me talk about it. You want to see it. So let's take a look. You'll notice right away. This is a brand-new product. It brings your business teams and technical teams into 1 UI, leading to better collaboration. But this is actually connected to the rest of Datadog. So I can go back and forth between my business data, observability data, all in this interface. You'll notice right here what I'm looking at is my analytics summary. It has all the KPIs that I as a product manager would care about. I can see things like who are my top users and I can even see demographics data. So I know exactly where everyone is coming from. Now if I want to actually understand the flow of traffic, let's take a look at one of our new user journey diagrams. In this diagram, all I need to do is, put in a beginning point or an endpoint and the diagram will do the rest of the work. 
In this case, I'm saying for all users who went to my home page, what do they do next? And here, I can actually analyze the critical flow, so I can figure out where is the drop-off happening. Now what I can do next is actually convert it to a funnel where I can see the drop off. And as a product manager, I don't just have to see the drop off, I know why there's drop off. I can see it could have been due to performance data, user behavior data, all in this single view. Now this is really the first time that I don't necessarily have to query to understand what went wrong. I could just look here and I could see that it's due to high error rate on the add to cart button, simple as that. Now don't go thinking that user journey diagrams are the only thing we have. Product Analytics truly has it all. With Session Replay, I could watch what 1 user did, watch all their cursor movement, see where they hover and see what they might have missed. When combined with heat maps, I can actually extrapolate out what I saw in 1 single session replay and really understand the macro trends. So for example, what you can see here is a heatmap, where I can see what are the hotspots on a page. I can see where people are focusing their attention. I can see what are their top actions. And we can actually just go back one. On the heatmap again, we can see what are the top clicks. And with user retention analysis, I can actually measure user stickiness. I can see where people drop off. So for example, if I noticed a drop-off is happening in a given week and say, week 3, if no one's coming back, I'm going to want to launch a marketing campaign to ensure that we can keep these users retained. But lastly, with our analytics summary, I can query at very granular metrics for specific business KPIs. I can filter here to users in my loyalty segment, for example and actually see how much we're spending in a given time frame. I've added user data to enrich this. 
And here, I can look and say, users who spent more than $20 and just very quickly see who my top spenders are for my e-commerce app. Now with this, keep in mind that Product Analytics data, it's actually retained for 15 months. So you can understand long-term trends. You can understand year-over-year or quarter-over-quarter. You can share this with your teammates, your executive stakeholders, your collaborators and bring them into 1 interface. Now to recap what we talked about. We just looked at Product Analytics with Datadog. We saw that it's actually very powerful. It has extremely low overhead because you're already sending this performance data to us. You don't have to pay that data and performance [indiscernible] twice with one single data source. I want to stress that this is collaborative. It brings your business teams, UX teams, technical teams, all into this UI to ensure that there's no context lost. And have I said it enough, it's powerful. It truly has everything you need for a product analytics tool with no context lost. Come take a look. We're going to be demoing this all day at the Expo. Come find me, come join our theater session and I'd love to tell you more about it. With that, I'd like to pass it back over to Sara.

Sara Varni

executive

#20

Thank you, Jamie. All right. We just saw how you can use a product like Product Analytics to take the data that you're collecting on the Datadog platform and put it into action proactively. But let's be real, we're not always in that mode. Sometimes we're in reactive mode and we need to keep on top of the issues and incidents that happen on our platforms. At Datadog, we are committed to building the most integrated and efficient products when it comes to incident response. And to talk about one of our newest features here, I'd like to welcome to the stage, Galen.

Galen Pickard

executive

#21

Thanks, Sara. Hey, folks, my name is Galen. I'm a staff engineer. I'm one of Datadog's core incident commanders. I'm here today to talk about incident response and I want to start with a really simple idea. Most incidents are triggered by changes. Now when I say changes, I'm, of course, talking about changes that you make to your own systems: deployment of new code, a feature flag or config change, a database schema update or a manual operation like running [indiscernible]. And I'm also talking about changes outside your control: a spike in traffic from one of your customers, a downstream outage in a service you depend on, an infra issue or a network problem. And when I'm debugging an incident, one of the first questions I ask is usually, has anything changed recently? This was the inspiration behind change tracking, which I'm delighted to share today. Let's take a look. Here, we have the status page for a monitor that recently alerted. This is a monitor on the error rate of one of the endpoints of the checkout service. The evaluation graph tells me about when things started going wrong. Just above that graph, we have something new. This is a time line of recent changes that might be related to the monitor alerts. And if one of these changes caused the issue I'm seeing, I can debug and remediate from right here on the monitor page. Here, I see 3 changes. First, we had a deployment of the checkout service about half an hour ago. A bit later, a feature flag flipped and it changed the behavior of the cart API service. Then, not long after that, I'm seeing new error types occurring on checkout. Now checkout is the service being monitored. So any change in that service will be included here. But sometimes, a change on one service causes a problem somewhere else. In this case, Watchdog found another service with a highly correlated error rate. When errors started increasing on checkout, the same thing happened on cart API.
Because of this, the time line shows all changes to both these services. The timing of that feature flag change on cart API looks very suspicious. Let's pull up more information. So here, I have some basic information about the feature flag: the title and a description and the history of who changed what and when. Now this flag is managed with LaunchDarkly. So all this information is available out of the box. But I know many of you out there use your own home-grown flag management tools. Don't worry, we can provide all of this via API as well. Now the [indiscernible] on this most recent change looks very strange. I think one of my colleagues might have made a mistake. Looking at the command they ran to make the change, I can see the problem. They applied the config for the new color scheme flag to the data source strategy flag. That will certainly do it. Between this and the timing of the change, I've seen enough. Let's take action. I can use these buttons to view the flag config or to run a workflow, like this rollback workflow. Okay. Great. The workflow is running in the background but let's take a peek. This is a feature flag rollback workflow that someone in my company set up previously. It's configured to check for permission from the flag owner in Slack before proceeding. So it's currently paused at that step. The owner can approve or reject. And once they approve, the workflow will push the change to LaunchDarkly. Now I would hate to make you all sit and wait for that to happen. So let's jump ahead. Okay. 10 minutes later. The change was approved in Slack and back on the monitor page, I see a new feature flag rollback. Sure enough, the new error types stopped occurring. And the overall error rate is back to normal. Change tracking made it easy to resolve this issue.
I was able to see recent changes to my services, gain more context on the ones that look suspicious, then remediate by rolling a change back, all this without ever leaving the monitor page. Change tracking is a new platform feature available on monitors, dashboards, service pages and, well, I won't spoil it but you'll see in a minute. Visit dashcon.io/changes to learn more and sign up for a preview. Change tracking gave me everything I needed to resolve this issue. But when things are more complex, I might need a bit more help. And for that, here's Sajid.
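The time line idea from the demo, showing recent changes on the monitored service and on any service with a correlated error rate, can be modeled with a simple filter. This Python sketch is not Datadog's implementation; the `recent_changes` function, the one-hour window, the event fields and the sample data are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def recent_changes(changes, alert_time, services, window=timedelta(hours=1)):
    """Return changes on the given services within `window` before the alert."""
    relevant = [
        c for c in changes
        if c["service"] in services
        and alert_time - window <= c["time"] <= alert_time
    ]
    return sorted(relevant, key=lambda c: c["time"])


alert = datetime(2024, 6, 26, 12, 0)
changes = [
    {"service": "checkout", "kind": "deployment", "time": datetime(2024, 6, 26, 11, 30)},
    {"service": "cart-api", "kind": "feature_flag", "time": datetime(2024, 6, 26, 11, 45)},
    {"service": "search", "kind": "deployment", "time": datetime(2024, 6, 26, 11, 50)},
    {"service": "checkout", "kind": "deployment", "time": datetime(2024, 6, 26, 9, 0)},
]

# checkout is the monitored service; cart-api is the correlated one.
timeline = recent_changes(changes, alert, {"checkout", "cart-api"})
print([(c["service"], c["kind"]) for c in timeline])
# [('checkout', 'deployment'), ('cart-api', 'feature_flag')]
```

The unrelated `search` deployment and the older `checkout` deployment drop out, leaving only the changes that plausibly relate to the alert.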

Sajid Mehmood

executive

#22

All right. Thank you, Galen and hello, everyone. My name is Sajid, VP, Engineering here at Datadog, working on many of our AI/ML products. And I'm so excited to finally be able to share with you all what we've been working on with Bits AI, our DevOps copilot built on generative AI. Now we recently announced the general availability of Bits for incident management, which helps you stay on top of the most important issues in your infrastructure with summaries as soon as you join an incident, natural language commands to help you manage your incidents and a straightforward way to find related issues in your infrastructure. Since launching Bits, we've heard from many of you that chatting directly with Bits is a great way to get information that you are looking for during an incident. But we also know that as incidents get more complex with many people, dozens of new services and multiple teams, figuring out what question to ask is often the hardest part. And let's be honest. We've all spent a lot of time chatting with AI assistants all over the Internet this past year. I'm sure, while sometimes they astound us with their capabilities, other times, you end up asking a simple question and just facepalming at the response. Because it turns out, you've asked something it just doesn't know how to handle yet. What you really need is an incident copilot that knows what it's good at and would just tell you, so you don't have to guess, which is why I'm thrilled to unveil the latest evolution of Bits AI, a fully autonomous AI agent that does exactly this. We trained Bits to observe, plan and act continuously so that it can help you run the incident response end-to-end. Let me show you an example of Bits in action. I'm on call for a food delivery service and I have just been paged for our most critical service, the restaurant's API, which is responsible for processing all of our user orders.
By the time I scramble over to my laptop, Bits has already slacked me to let me know that it's begun its investigation. It's going to follow the instructions in the monitor as well as follow links to runbooks and external tools like Confluence and begin planning out its investigation in a notebook. Let's check it out. So Bits will actually design and execute its plan in real time, following multiple threads of investigation simultaneously. Bits will also adapt this plan as new data arrives, choosing next steps based on its earlier observations. For example, here, you can see that Bits found logs indicating that our restaurant service is actually timing out when connecting to an upstream takeouts RPC service and so decided to investigate that service, looking at error traces, latency metrics and more for takeouts RPC. Now Bits will continue its investigation on its own but we don't need to sit here and watch because it will use Datadog Case Management to keep us up to date on all of its key findings and, using integrations with tools like ServiceNow, Jira and even Slack, this will keep all of my teammates in the loop. Once we're in Slack, Bits will begin to suggest next steps to me based on its investigation so far. Here Bits sees, using real user monitoring, that thousands of users are impacted and so it suggests we declare an incident. Everything is prefilled for me. So it's just one click to get that started. Once we're in the incident, Bits has a summary of the investigation so far. So it's really easy for my teammates to get up to speed. And it has 2 more suggested actions for me. Again, because of the thousands of users impacted, it's prepared a status page update for me. And because it knows that the takeouts RPC service that was erroring before is owned by a different team in the service catalog, Bits suggests we page them to get their help. All right.
So as we do that and my teammates join and get up to speed and start their own investigations, Bits is running in the background, following the conversation and looking for opportunities to surface relevant [indiscernible]. [indiscernible], my colleague has identified that takeouts DB is the problematic database that's slowing everything down and so Bits decides to investigate that database in the background. If Bits doesn't find anything, it just won't say anything. But here, Bits has found the root cause: a database migration from 2 weeks ago that lines up exactly with when the changes began. Using change tracking, we can see exactly what changed. We can see that we altered the data type of a key column we were querying. And this gives the team all the information they need to realize that this change likely broke our indexing. And so we're going to have to roll out a new index to fix the issue. As the team works through that, Bits has another thread running in the background, looking for other potentially related incidents in my infrastructure. And if it finds one, it will bring those threads of investigation together. For example, here, Bits sees that there are actually several other teams that are downstream of this problematic database. And it's let them know that the issues that they've been firefighting in isolation are likely caused by this one. So as they join and begin to ask about sort of an ETA for resolution and we all hear that actually it's going to take a while to fix the issue and the downstream services are still struggling under the load, Bits offers to help scale up these 2 services using Datadog workflow automation. There are buttons right here for my teammates to trigger these workflows and the rest of us can easily follow along right from within Slack. As the scale ups complete, it's easy to just ask Bits a follow-up question to see how our services are doing now.
And here, we can see that the services have recovered and things are looking much better. And finally, as we roll out the new index and our queries are looking better and we're able to fully resolve the incident, Bits has prepared a first draft of the postmortem for us to review. It's configured to follow my team's template starting with a summary of what happened, an overview of the key systems and a timeline of all the major events that happened in the incident response. So how is all of this possible? How did we build this? To transform Bits into an independent investigator, we invested heavily in AI agent design and planning capabilities optimized specifically for the multiuser, multi-threaded environment of incident response. As with all AI research, data and rigorous evaluation have been critical to our early success. We actually built a dedicated simulation environment for Bits that allowed us to replay real incident scenarios continuously and benchmark how Bits does across a variety of dimensions. Over the last few months, as we've been working on Bits, we've seen Bits improve enormously. For example, after we introduced the change tracking tool that Galen shared with you earlier, we saw Bits' data gathering benchmarks improve substantially. Unsurprisingly, making it easier for our users to find relevant changes in their infrastructure helps Bits do the same. Now all of this has allowed us to add truly autonomous investigation capabilities to Bits that run automatically as soon as your monitors trigger, that use existing runbooks, so you don't have to spend a lot of time feeding Bits detailed instructions, and that will suggest next steps, so you don't have to guess at what Bits is capable of. If you'd like to try this out, you can sign up for the beta of our new autonomous investigation capabilities here at dashcon.io/auto-bits or check us out in the Expo hall below.
And yes, since we started calling this release, Auto bits, we've been having a lot of fun with Transformers puns. With that, it's my turn to welcome Daljeet to the stage. Thanks, everyone.

Daljeet Sandu

executive

#23

Hey, everybody. My name is Daljeet and I'm a product manager here at Datadog. Sajid just covered how the latest evolution of Bits can now work alongside incident responders to help resolve issues faster and how Bits can now start investigations before you have reached for your laptop. But what if you don't even have to reach for your laptop anymore? That's right, folks. It is my great pleasure and honor to announce the latest addition to our platform, Datadog On-Call. Built by on-call engineers, for on-call engineers, Datadog On-Call supports everything you need from a paging solution and combines it with everything you already love about the Datadog platform. Let me dive right into On-Call's core capabilities. Starting with scheduling, whether it's business hours, follow-the-sun or 24x7 rotations, On-Call covers all your scheduling needs. Escalation policies ensure that all your alerts are routed to the right team members in your organization at the right time, but that's not all. Since Datadog already captures the state of your services and your teams, you no longer have to worry about duplicating your service catalog into your paging solution. Viewing upstream and downstream issues and paging the relevant teams for them is now possible in a single unified view. Now let's talk about ways of getting paged. As a responder, you can set up notification preferences to specify exactly how you want to be paged: e-mails, push notifications, SMS, or phone calls. And yes, Datadog On-Call will circumvent your Do Not Disturb mode if you tell it to. No matter where you are, even if it's on stage at the Javits Center, Datadog On-Call will reach you. Now you might be wondering, what can I be paged on using On-Call? The short answer: everything from everywhere, whether that's telemetry you already have in Datadog, all the way to third-party tools that you use to keep your critical systems up and running.
Datadog On-Call will page you anytime, everywhere, about everything that you need to be paged on. Now since getting paged is a critical aspect of most -- while getting paged is a critical aspect of most engineering organizations, it's not exactly anyone's favorite part of the day, or let's face it, the night. This is why Datadog On-Call comes with out-of-the-box on-call insights and overviews. In a single view, you can see key top-level metrics such as MTTA and MTTR, as well as answer questions such as which team members experienced the most interruptions last sprint and which services are experiencing the most issues and causing most of our operational load. Equipped with these insights, you no longer have to worry about -- equipped with these insights, you can truly focus on getting your teams out of firefighting mode. All right. Now that we covered On-Call's core capabilities, let's quickly see it in action and talk about why this is truly a game changer and why I'm personally so excited to get it into your hands. First, let's go back to the page I just got a minute ago, along with the phone call. I will start by tapping the push notification and go directly into the Datadog mobile app. Here, I can see that my checkout service is experiencing an elevated error rate. And before I go any deeper, I'm going to press acknowledge to make sure that Datadog doesn't call me again while I'm on stage. Now, since Datadog already has all my observability data, I can see everything I need to determine the severity of the situation right here in the palm of my hands without switching devices or losing any context. I can see where this page event came from and Datadog automatically shows me my impacted SLOs. Speaking of, it seems we have already breached one. Now, if I want to start investigating, all I need to do is 1 tap and it takes me to the triggering monitor. Here, I can see the evaluation history of my monitor and how my service has trended over time.
But what I can also see here is the associated remediation playbook. Now Datadog's mobile app allows me to pivot into related dashboards, logs, traces and services, all from the palm of my hand. All of this without any urge to reach for my laptop. Pretty cool, right? Now I've already seen that my SLO has been breached. I'm also seeing that my error rate is spiking rapidly. So something must have happened recently. I will go ahead within the Datadog mobile app and declare an incident right here in the same context. Now once I do this, once I press create, all of my organization's automations will kick in, meaning relevant people will be paged, communication channels will be opened. And as you have seen in Sajid's talk, Bits AI will be there to guide me through the entire incident. This is amazing. I just went from getting paged, investigating the issue and declaring an incident in record time, all while on stage in front of thousands of you and without a single laptop in sight. Let's recap. Datadog On-Call comes with all the scheduling and escalation capabilities you need to enable your teams to go on-call. But what I just showed you isn't just a paging solution. What I showed you is a single platform for monitoring, securing, paging and investigating issues on the fly. And as you've all just seen, thanks to mobile investigations, we have a platform that's fully connected end-to-end, that helps you observe, secure and now more than ever, act. We can't wait to get it into your hands. Visit the link on the screen or stop by our On-Call booth at the Expo floor. Thank you very much. Now, please welcome back Sara on stage. Thank you.

Sara Varni

executive

#24

Thanks so much, Daljeet. Now I personally wanted to have "Who Let the Dogs Out" be the ringtone for Datadog On-Call. But unfortunately, we have no ties to Baha Men and so if anyone has an in, please let me know. I'll be on the Expo floor for the rest of the day. Let's recap what we just saw for incident response. First, with change tracking, you can now see changes to your environment in real time. With Bits AI, today we are excited to announce autonomous investigator, which helps you remediate issues even faster. And, of course, now with On-Call, you can optimize your incident response from that very first page. That wraps up all the features and products that we want to talk to you about today in this keynote but we are just scratching the surface. There are many more features and products that we're announcing this week at DASH. And I highly encourage you to visit our Datadog hub to talk with one of our resident product experts, watch a session in either the solution stage or the observability theater on the Expo floor and please attend breakouts, many of which are cohosted by you, our customers, to hear about all of the great new exciting ways that customers are using the Datadog platform. And with that, you've now seen how, all in one platform, you can observe, secure and act on your data with Datadog. Thank you so much for attending DASH. I hope you have a great rest of your day.
