Confluent, Inc. (CFLT) Earnings Call Transcript & Summary
May 20, 2025
Earnings Call Speaker Segments
Unknown Attendee
attendee
Please welcome Chief Product Officer of Confluent, Shaun Clowes.
Shaun Clowes
executive
Good morning, everybody. Welcome to London. Thank you all for being here with us. I am so excited to be at another Current. It is always so amazing to be back among the broader Kafka and Flink communities. You all have achieved something amazing. You've taken data streaming from a niche technology to a foundational capability that's driving modern businesses forward. It's an incredible achievement. And we have a fantastic event planned for all of you, where you will have the opportunity to hear the latest in Kafka and stream processing. You can check out one of the really interesting breakouts we've programmed for you, and you'll be able to network with your colleagues in the broader data streaming community. Now to kick it off today, I want to talk a little bit about the future of data streaming. It's becoming the critical foundation for modern data architectures and the key to untangling the mass of pipelines and custom code that's slowing you down. From AI to apps to microservices, it's the key to unlocking it all. Why is that? Well, you've heard me talk a little bit about the operational and the analytical estates before, so I'll keep this brief. But the operational estate is where your online applications, think your ERP, your mainframe, your SaaS apps, work together to run the business. And then the analytical estate is where your data warehouse, your data lake and your BI tools are used to understand the business. Now both of these estates are held together with a fragile mass of pipelines. You have point-to-point application integration pipelines on the operational side and ETL pipelines on the analytical side. Now we've tolerated this really messy, kind of painful approach for a long time, but now it's reaching breaking point. The world is more connected, it's more real time and it's more complicated than it has ever been before.
Over the past decade, the systems that you are building have grown to be incredibly data-intensive, and increasingly, many of them require real-time data just to work at all. In the past, systems you're building might have only needed data from a couple of systems of record, and it was okay if the data was a little bit out of date. But think about the use cases you're building now. In retail, pricing and promotion react in real time to inventory and demand. In manufacturing, production lines adapt instantly in response to machine status and sensor readings. And with AI, we're seeing the emergence of a whole new class of incredibly data-intensive and real-time applications. In health care, AI monitoring a patient's vitals and conducting interventions. In logistics, intelligent agents continuously adapting delivery routes, staffing and customer delivery time windows in response to environmental conditions. It just goes on and on and on. We're expanding the number of decisions we expect software to make. We're deepening the depth of those decisions, and we're shrinking the window in which we expect the software to make those decisions. We need a new foundation. We need to be able to move, process and act on our data seamlessly and instantly. The time has come where we have to unify the operational and analytical estates. We need data to flow seamlessly from our operational applications to our data-intensive and AI use cases and then back into the operational applications to update information or to take action. And that continuous flow of information has to be seamless and work continuously. If you have hiccups, for example, latency spikes or the context is incomplete because of some processing failure, you are going to significantly impact the quality of the outcome. And with AI, the consequences of those data failures are really extreme. You can very quickly turn a super smart AI agent into a chaos machine that makes terrible decisions and frustrates your customers.
Put really simply, modern software problems are almost always data problems. Whether you're building a customer experience, an internal application, a report or an AI workflow, the hardest part is getting the right data in the right shape to the right place at the right time. But streaming is what gives us a foundation for seamlessly moving data everywhere that we need it. Streaming in conjunction with the data streaming platform lets us bring forward the right data right now. It's the key to untangling your data mess, turning it into rich, real-time data and then feeding it to power all of those different data-intensive and AI use cases throughout your organization. But today, instead of just telling you how, we'd like to show you how through a series of different demos. We're going to be talking about the data streaming platform at River Bank. They're a global financial institution and mortgage provider, and they're looking to streamline the mortgage application process. They'd like to make it faster, more accurate and require less human intervention. Now to power these types of use cases at River, we're going to need a few different things. Firstly, we're going to need data to be streaming because real-time experiences are obviously powered by real-time data. Second, we're going to need our data to be governed and enriched through stream processing because the quality of a customer experience is ultimately driven by the quality and depth of the data that it is being fed with. And then we need to take this high-quality, reliable real-time data and feed it to all of our different use cases, AI, applications, reports and beyond, to power the needs at River. Let's start with the first part of that: connecting to the data and getting it streaming. Now in this case, River is getting credit score data from a third party and they're storing it in an Oracle database.
Now their payments and mortgage application systems are custom written in Java and those applications were built to natively stream into Kafka. Now to check out more about how exactly this works, I'd like to welcome my colleague in today's demos and our Head of Technical Marketing, Ahmed Zamzam.
Ahmed Saef Zamzam
executive
Thanks, Shaun. It's great to be here in London, and I'm super excited to show you how easy it is for River to start streaming with Confluent. Let's get started. Cue the demo, please. Confluent Cloud makes data streaming super simple by offering multiple different deployment options that are tailored to workload needs and cost requirements. Now for our use case, [indiscernible] data sensitivity means that this data needs to stay off the public Internet. And for this reason, we chose enterprise clusters. These clusters are fully serverless, instantly provisionable and they offer private networking. As you can see here, with just a few clicks, we have a cluster up and running. So now we want to ingest data from the few different data sources that Shaun mentioned earlier. And let's start with the credit score data coming from the Oracle database. Again, as you can see here, Confluent makes this super simple with fully managed connectors. Here, we're using the new Oracle CDC XStream connector. All you need to do is enter a few database connection details like username, password and the database to connect to, and then tell the connector which topic to write data into by providing this prefix. And then with literally 3 clicks, the connector is up and running. So now the data is flowing from the Oracle database to the credit score topic in Kafka. And here's what it looks like. Now this is one data source. We still have two more to go. We have historical payments and mortgage applications. Both of these are standard custom Java producer apps using the good old Java producer. Let's take a deeper look at the historical payments. This is a topic that holds the entire history of all payments for all applicants. And shortly, you'll see what the event looks like, the payment event and whether or not it was successful. And it's also important to note that all of our topics are backed by a schema in the schema registry, and here's what the historical payments look like.
On to our last and final data source, mortgage applications. And I see here a very nice background image of London. Mortgage applications are coming in either from the mobile banking app or from the website. Here, the applicant enters a few details, like a full name, property value and loan amount. They hit the submit button, and this event ends up in a Kafka topic called mortgage applications in Confluent Cloud. So this applicant is John Doe, and this is the application we'll follow throughout the process today. So stay tuned. But for now, what do you think, Shaun? It literally took 2 minutes, 2.5 minutes, and it's really that simple to get started with Confluent. Back to you, Shaun.
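Editor's sketch: the mortgage application event described above can be pictured as a small keyed record. This is only an illustration, not the demo's code. The fields full name, property value and loan amount come from the demo; the applicant_id key and the JSON encoding are assumptions (the demo says the topics are backed by schemas in the schema registry and does not show the serialization format), and no Kafka client is used.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MortgageApplication:
    # full_name, property_value and loan_amount are named in the demo;
    # applicant_id is a hypothetical key added for illustration.
    applicant_id: str
    full_name: str
    property_value: float
    loan_amount: float


def to_record(app: MortgageApplication) -> tuple[bytes, bytes]:
    """Return the (key, value) byte pair a producer could send to a topic."""
    key = app.applicant_id.encode("utf-8")
    # JSON is an assumed encoding, chosen only to keep the sketch stdlib-only.
    value = json.dumps(asdict(app), sort_keys=True).encode("utf-8")
    return key, value


application = MortgageApplication("A-1001", "John Doe", 400_000.0, 320_000.0)
key, value = to_record(application)
print(key, value)
```

Keying by applicant means every event for the same applicant shares a key, which is what later joins against credit scores and payment history rely on.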
Shaun Clowes
executive
I love it, Zamzam. It really is incredibly simple, and we're always working to make it easier for you to be able to stream data to and from all of your systems, applications and databases. That's why in Confluent Cloud we offer over 85 different fully managed connectors to all of the most popular source and sink systems that give you that 2-minute experience that Ahmed just showed you. And with our Connect with Confluent Program, we're actually going even further than that. We're embedding a Kafka connectivity experience directly within the user interfaces of tools that you know and love. And there are now over 50 of those integrations into major applications like Elastic, AWS Lambda and MongoDB. And for those cases where you do need to code your own connector using the Kafka Connect SDK, we can take away the management and operational overhead by hosting it for you in our custom connector offering in Confluent Cloud. Now in that demo, you just saw our brand-new Oracle CDC connector powered by XStream technology. It is the most reliable, scalable, performant Oracle connector we've ever created, and it delivers the power of streaming to even the most complicated enterprise Oracle deployments. But obviously, it's not all about connectors. We recognize that in many cases, developers are building applications that natively stream using the Kafka clients. And we want to make it easier for developers to build, test and deploy streaming applications in one of their favorite IDEs, Visual Studio Code. We're excited to share that our Kafka VS Code plug-in is now generally available and it's available on the VS Code marketplace. We already have over 1,000 users, and we'd invite you to check it out at the QR code on the screen. All right. So we've successfully connected to our data sources, and our most important data is streaming in Kafka. Now we're here at Current, so I don't need to tell all of you that Kafka is the living, beating heart of the data streaming platform.
Kafka gave us something game changing, the foundation of streams as the center of an open data system. With just a few lines of code, any developer can push data into Kafka and know that it will get wherever it is needed instantly without needing to know how. And in the consumer, I can pull updates in real time without needing to know where or how that data was produced. It just works instantly, reliably and at scale. Honestly, it's kind of magic. But sometimes people hesitate to use Kafka because they're worried that, hey, it might be too expensive for some big high-scale data workload, or they worry about the effort of managing and upgrading a complex Kafka deployment. But we want to bring the magic of Kafka and the Kafka API to all of your data, no ifs, no buts. And to share more about how we're going to do that, I'd like to welcome our Head of Kafka and Streaming, Addison Huddy.
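Editor's sketch: the decoupling described above (producers append without knowing who reads, consumers pull at their own pace without knowing who wrote) can be illustrated with a toy in-memory log. This is not Kafka and uses no Kafka API; TopicLog and Consumer are hypothetical names, and a single Python list stands in for one topic partition.

```python
class TopicLog:
    """Append-only list standing in for a single-partition topic."""

    def __init__(self) -> None:
        self._events: list[str] = []

    def append(self, event: str) -> int:
        """Store an event and return its offset (its position in the log)."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int) -> list[str]:
        """Return every event at or after the given offset."""
        return self._events[offset:]


class Consumer:
    """Tracks its own read position, independent of producers and other consumers."""

    def __init__(self, log: TopicLog) -> None:
        self._log = log
        self._next_offset = 0

    def poll(self) -> list[str]:
        events = self._log.read_from(self._next_offset)
        self._next_offset += len(events)
        return events


log = TopicLog()
fast, slow = Consumer(log), Consumer(log)

log.append("payment-1")
first_batch = fast.poll()        # fast consumer reads immediately
log.append("payment-2")
second_batch = fast.poll()       # and sees only the new event next time
catch_up = slow.poll()           # slow consumer catches up on everything later
print(first_batch, second_batch, catch_up)
```

Because each consumer keeps its own offset, adding a new reader never requires a change on the producing side.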
Addison Huddy
executive
Thanks, Shaun. It is great to be back with the entire data streaming ecosystem here in London. Let's start by talking about Apache Kafka. There's so much happening in the Apache ecosystem right now. There are over 150,000 organizations using Kafka in production. There were 1,800 meet-ups happening all around the globe. The energy in the Kafka community has never been higher. You can feel it, and you all here are part of the streaming journey. And the best part is Apache Kafka just keeps getting better and better and better. Earlier this year, the Kafka community hit a major milestone, the GA of Apache Kafka 4.0. It's the biggest Apache Kafka release ever. AK's transition to KRaft is complete and AK no longer ships with ZooKeeper. Consumer rebalances are faster and simpler, leading to more reliable applications, and there's even early access for queues for Kafka. AK 4.0 is an incredible example of developers coming together to do something amazing. And the community is drumming up some more interesting ideas to make Apache Kafka more cloud-native. This is exciting because I've been saying this for years, that putting Kafka in the cloud isn't just about putting Kafka in the cloud. These ideas are really exciting because I can tell you firsthand that streaming and the cloud are a match made in heaven, when done correctly. Over the last 7 years, Confluent has been on its own cloud-native journey with Kora, our cloud-native Kafka engine, that deals with all of the pros and cons of the cloud. With Kora, we took the Kafka protocol, and we made a global, elastic, serverless, modern data streaming platform that allows us to handle all the different workloads that our customers send our way. Now I don't have much time this morning, so if you want to learn about all the architectural advancements in Kora, please check out this QR code.
Instead, I want to use this as an opportunity to share with you all a few lessons that we've learned along the way and share with you what it has allowed us to offer to our customers. Kora's journey to make Kafka cloud-native has not been easy. I want to touch on three areas: security and reliability; price performance; and hybrid and multi-cloud deployments. Let's start by talking about security and reliability. Bringing Kafka to the cloud brings with it a mountain of security challenges and opportunities. Confluent's #1 priority is and will remain security. Kora makes it easier to deliver on this promise. But for the most part, it's just a lot of hard work. Year-by-year, this focus has culminated in some incredible progress. Confluent Cloud now meets all major security and compliance standards around the globe from PCI to ISO, GDPR and over 1,800 more. Now for reliability, handling all the idiosyncrasies of the cloud is a real challenge. But Kora's disaggregated architecture, along with some fine-tuned operations, allows us to offer a best-in-class SLA. Now here's the big one: price performance. One of the big promises of the cloud is cost savings. And one of the primary ways that we are saving you all money is through optionality and recognizing that a one-size-fits-all solution doesn't work for Kafka, especially in the cloud. Some use cases demand low latency, others high throughput and others just need to move data from A to B as cheaply as possible. Every workload deserves the right cluster at the right price. Kora has made it possible to give you optionality for these different workloads, a whole fleet of auto scaling clusters that deliver exactly the performance and capabilities that your workload needs. You can start with basic and standard and move to enterprise when you need higher throughput and private networking. We have been hard at work on making enterprise clusters even better.
They're now available on AWS, Azure and GCP, all with private networking, and they can scale up to 7.5 gigabytes per second of throughput, meaning they can handle almost any workload that you send their way. I want to take this moment to highlight our newest cluster type, freight, which went GA earlier this year, and it's already saving customers up to 90% on their Kafka costs. These diskless clusters replace a lot of the expensive inter-AZ replication costs with direct-to-object-store writes. This means you can trade off low latency when you don't particularly need it for significantly reduced costs. Freight supports a new private networking type that we call private networking interfaces that implements the native AWS ENI interface. This makes freight perfect for relaxed latency workloads like logging, metrics and telemetry. Scan the QR code to learn more. Every one of the clusters I've mentioned instantly auto scales, meaning they're always rightsized for your workload. Elasticity is the cornerstone of saving money in the cloud. I want to say it one more time. Elasticity is the cornerstone of saving money in the cloud. Most companies dramatically over-provision their Kafka infrastructure to ensure stability. Let's take a look at this graph. It's a typical week of throughput, peaks and valleys throughout the day and lower throughput on the weekends. If you're going to self-manage this cluster, how much throughput do we provision for? Peak? That's not going to work because we might get a spike. So even at peak usage, we found that clusters on average are underutilized by over 50%. All this red area here, that's wasted money. On average, you're wasting over 50% on your infrastructure costs at best. The real number is much higher. Instead, elastically scaling clusters scale up and down with your workload, ensuring that you always have capacity and you're never overpaying. And this is just Confluent Cloud. The optionality goes further.
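Editor's sketch: the over-provisioning argument above is simple arithmetic, shown here on made-up numbers. The throughput samples, the 25% headroom and the function name are all hypothetical; they are not the data behind the graph shown on stage. The sketch compares a cluster sized once for the weekly peak against capacity that tracks the load.

```python
def utilization(samples: list[float], provisioned: float) -> float:
    """Fraction of a fixed provisioned capacity actually used across all samples."""
    return sum(samples) / (provisioned * len(samples))


# Hypothetical throughput samples in MB/s: six per day, five weekdays, two weekend days.
weekday = [20.0, 40.0, 80.0, 100.0, 80.0, 40.0]
weekend = [10.0, 20.0, 30.0, 30.0, 20.0, 10.0]
week = weekday * 5 + weekend * 2

peak = max(week)
headroom = 1.25  # assumed safety margin kept above the load

# Fixed cluster sized exactly at peak, and one sized at peak plus headroom.
fixed_at_peak = utilization(week, peak)
fixed_with_headroom = utilization(week, peak * headroom)

# Elastic capacity follows each sample, keeping the same headroom above it.
elastic_util = sum(week) / sum(sample * headroom for sample in week)

print(f"unused, fixed at peak:          {1 - fixed_at_peak:.0%}")
print(f"unused, fixed at peak+headroom: {1 - fixed_with_headroom:.0%}")
print(f"unused, elastic with headroom:  {1 - elastic_util:.0%}")
```

With these invented samples, more than half of a peak-sized cluster sits idle, and adding headroom for spikes makes it worse, while capacity that follows the load only ever carries the headroom itself.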
WarpStream recently joined the Confluent family, and we could not be more excited. WarpStream also uses direct-to-object-store writes to avoid all those inter-AZ replication charges, and it also does this with a BYOC architecture with a purpose-built BYOC control plane. It's great for those that have chosen a BYOC deployment strategy. And if you need to run on-premise or in a private environment, that's where Confluent Platform comes in. Okay. Quick aside, big shout out to my graphics team for putting up with me with this slide. They said, you're taking the graphics too far, Addison. No one's going to understand what the helicopter is. I was like, no, no, no, the helicopter can land on your data center, it's Confluent Platform, right? Makes sense? A little bit. All right. Very good. Thank you. Yes, I thought it was a good slide. Anyway, these offerings aren't alternatives to one another. They're the right fit in deployment for the different streaming workloads across your enterprise. So you don't just pick one cluster, you can mix and match to meet your different workloads. And the best part is, all these clusters work together. Thanks to Cluster Linking and WarpStream Orbit, you get byte-for-byte, offset-preserving replication across all of these clusters, across your entire environment, any cloud, any data center, all in real time. And that bridges us to our last pillar, and why hybrid and multi-cloud is so important. You have applications everywhere. You have data everywhere, and that means you need streams everywhere that are cost effective and easy to manage. Confluent is designed so you can put your data in motion globally. Confluent Cloud is available in over 100 regions, and Confluent Platform and WarpStream bring that streaming to your data center. But not only can you move data between regions in Confluent Cloud on the same cloud provider.
Recently, we announced an LA for cross-cloud Cluster Linking between AWS and Azure, and we'll be adding GCP to the mix next. No networking to set up, no IP filtering, no tunneling. It just works seamlessly and securely. But how do you manage this global deployment? Later this year, we are debuting our unified stream manager, which will give you control of your Confluent clusters regardless of where they are. The unified stream manager ships with Confluent Platform and connects to Confluent Cloud, only sharing metadata with the cloud. All your clusters, whether on-prem or in the cloud, are presented in a single unified view. All your schemas, all your metrics and lineage across all of your streams. Imagine being able to set an encryption policy in a single place and, with a click of a button, have it applied to your entire streaming estate. That's how you bring consistent streaming across your organization. We'll be sharing more details about the upcoming unified stream manager throughout the year. And that's not all we're doing to make on-prem better. We're as committed as ever to Confluent Platform, and we recently released Confluent Platform 7.9, which extended RBAC to OAuth and mTLS and added IPv6 support, and there's an EA release of client-side field-level encryption, which helps implement the gold standard of moving data securely. We're not done. We're excited to announce our next-gen Control Center for Confluent Platform. It's ultra-scalable, packed with new features and has an LA release of our new Flink UI. The new Control Center is available for Confluent Platform 7.5 or above. You can scan this QR code to read more. And we're not finished there. Stay tuned. Confluent Platform 8.0 is around the corner, and it's going to be our biggest on-prem release ever. So with that, I'll hand it back over to Shaun and Zamzam for some more demos. Thank you.
Shaun Clowes
executive
All right. Thank you, Addison. And it is time to get back to some of River's demos. So we're connected to the data that we need and our streams are up and running. But obviously, if we're going to power our data-intensive use cases, we don't just need real-time data. We need data that is reliable, trustworthy and contextualized to solve the decisions and use cases we're setting out to achieve. And that's where governance and stream processing come in. We really need to shift left and move governance and processing closer to the streaming source. And that's how we can contextualize and enrich our data just once, and then use it over and over again to power our different use cases. Now Apache Flink is the key to making shift-left practical because it brings the rich processing capabilities of the analytical estate into the operational estate. That's why Flink has seen such rapid adoption. Really innovative companies like LinkedIn, Netflix, Uber, Disney+, Stripe, Shopify, and many, many more have all adopted Flink as a way to seamlessly process data at scale across their entire organizations. Now really important use cases, things like ad targeting, personalized recommendations, dynamic pricing, real-time logistics, all of those types of use cases are data-intensive and they require you to be able to seamlessly process data at scale in real time. Now Flink has made those types of workloads not just achievable, but approachable and affordable too. Now Flink is so powerful because it represents a single unified runtime for your developers to work with and shape their data at the source as they're producing the data, in streaming or in batch. So they can produce really high-quality data that then reduces the need for further downstream duplicative processing and drives up the opportunity for reuse of that data in other use cases.
Now put simply, with Flink, things that were previously only possible in the analytical estate are now possible in the operational estate. They are now -- what was only possible in batch is now possible in streaming and the results are not just more valuable, they're also less expensive too. So let's see some Flink in action. Take it away, Zamzam.
Ahmed Saef Zamzam
executive
Thanks, Shaun. To be able to make a decision on a mortgage application, we usually need data from a few different places at once. We need credit score data. We need historical payments for each applicant and we need the actual applicant details coming in from the mortgage applications. So what we'll do, we will take in this live stream of mortgage applications that's coming in and will use Flink to transform this into a real-time contextualized data set that we will use to feed our AI agents with. Cue the demo, please. Let's start here from a data portal. This is a one-stop shop where we make all of our real-time data sets easily accessible and discoverable throughout the organization. If we click here on the mortgage applications tile, you find some useful business information like the owner, description of the topic and the associated tags. You can search the topics by these tags as well. And you can also drill deeper into the schema. And you can define data quality rules directly on the schema. So for example, here, we're saying any mortgage application that is coming in that does not have a valid payslip ID will get routed to the dead letter queue. Now what's good about these data quality rules is that you define them centrally in Confluent Cloud. They get pushed automatically and enforced at the client level with 0 code changes on the client side. It just works. And here's one of these examples. So now we want to take the mortgage applications and enrich them with the credit score data coming in from the Oracle database. So here, we're using Flink Java Table API code to create this new enriched data set for us that joins these two data sources together. And in this new data set, we're also creating a new custom field: monthly PMI cost, or monthly mortgage insurance cost. This is calculated by a user-defined function that takes in loan amount, property value and credit score to determine this value, also written in Java.
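Editor's sketch: the two steps described above, routing applications without a valid payslip ID to a dead letter queue and enriching the rest with a credit score and a computed monthly PMI, can be pictured in plain Python. The demo used schema-level data quality rules and the Flink Java Table API; none of that is reproduced here. The field names, the definition of "valid" (a non-empty string), and especially the PMI formula are assumptions for illustration only, since the transcript does not give them.

```python
def has_valid_payslip(application: dict) -> bool:
    """Assumed rule: a payslip ID is valid when it is a non-empty string."""
    payslip_id = application.get("payslip_id")
    return isinstance(payslip_id, str) and payslip_id.strip() != ""


def route(applications: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split applications into (accepted, dead_letter) using the rule above."""
    accepted, dead_letter = [], []
    for app in applications:
        (accepted if has_valid_payslip(app) else dead_letter).append(app)
    return accepted, dead_letter


def monthly_pmi(loan_amount: float, property_value: float, credit_score: int) -> float:
    """Hypothetical PMI formula; the demo's real function is not shown."""
    loan_to_value = loan_amount / property_value
    if loan_to_value <= 0.80:
        return 0.0  # assumed: no insurance needed at or below 80% loan-to-value
    annual_rate = 0.005 if credit_score >= 740 else 0.010  # assumed rates
    return round(loan_amount * annual_rate / 12, 2)


def enrich(applications: list[dict], credit_scores: dict[str, int]) -> list[dict]:
    """Inner join on applicant_id, adding credit_score and monthly_pmi."""
    enriched = []
    for app in applications:
        score = credit_scores.get(app["applicant_id"])
        if score is None:
            continue  # no credit score yet, so the row is not emitted
        pmi = monthly_pmi(app["loan_amount"], app["property_value"], score)
        enriched.append({**app, "credit_score": score, "monthly_pmi": pmi})
    return enriched


incoming = [
    {"applicant_id": "A-1001", "payslip_id": "P-77", "loan_amount": 360_000.0, "property_value": 400_000.0},
    {"applicant_id": "A-1002", "payslip_id": "", "loan_amount": 150_000.0, "property_value": 300_000.0},
]
accepted, dead_letter = route(incoming)
enriched_applications = enrich(accepted, {"A-1001": 760})
print(enriched_applications, dead_letter)
```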
So what we did here, we wrote some code in Java, packaged it up, submitted it as a job to Confluent and not once did we think about scaling or managing infrastructure. Flink automatically did this for us so that we stay laser-focused on writing code and not managing infrastructure. Now looking at John Doe's application. I think he has a pretty healthy credit score. I think he has really good chances of securing this mortgage. But let's see. So let's further enrich this with historical payments. But first, let's process these historical payments. We'll aggregate all historical payments for each applicant into a single row and create a new enriched dataset, applicant payment summary. This time, we're using Flink SQL. In Confluent Cloud, you have the option to either write your code in Java or use Flink SQL directly from the UI. It's all your choice. So here's what applicant payment summary looks like: pretty simple, applicants and all their historical payments in a single row. So now we have two enriched data sets, applicant payment summary and enriched mortgage applications. Next thing we'll do, you guessed it right, join both together to create one final enriched data set that we will use to power our AI agents with. That's what we're doing here. We're creating enriched mortgage payments with -- enriched mortgage applications and payments. Things got a little bit longer here, but I promise, this will be the last long name. Then now if you look at John Doe's application, you'll find all the new fields that we've just added in this demo. So monthly PMI, credit score and most importantly, historical payments that we just added. I hope he does not have any missed payments, though I see one or two, but let's see. So now we're ready to power our AI agents. But more on this a little bit later on, so I ask you just to hold your thoughts for a few moments. For now, back to you, Shaun.
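Editor's sketch: the aggregation and final join described above (one summary row of historical payments per applicant, then a join onto the enriched applications) can be written as a batch snapshot in plain Python. The demo does this continuously with Flink SQL, which is not shown in the transcript; the field names total_payments, missed_payments and successful are hypothetical.

```python
from collections import defaultdict


def summarize_payments(payments: list[dict]) -> dict[str, dict]:
    """Aggregate all payment events for each applicant into a single row."""
    summary: dict[str, dict] = defaultdict(
        lambda: {"total_payments": 0, "missed_payments": 0}
    )
    for payment in payments:
        row = summary[payment["applicant_id"]]
        row["total_payments"] += 1
        if not payment["successful"]:
            row["missed_payments"] += 1
    return dict(summary)


def join_with_payments(applications: list[dict], summary: dict[str, dict]) -> list[dict]:
    """Left join: applicants without payment history get zeroed counters."""
    no_history = {"total_payments": 0, "missed_payments": 0}
    return [
        {**app, **summary.get(app["applicant_id"], no_history)}
        for app in applications
    ]


historical_payments = [
    {"applicant_id": "A-1001", "successful": True},
    {"applicant_id": "A-1001", "successful": False},
    {"applicant_id": "A-1001", "successful": True},
]
applicant_payment_summary = summarize_payments(historical_payments)
applications_with_payments = join_with_payments(
    [{"applicant_id": "A-1001", "credit_score": 760}, {"applicant_id": "A-2002", "credit_score": 700}],
    applicant_payment_summary,
)
print(applications_with_payments)
```

A streaming engine would keep the summary up to date as each new payment event arrives and re-emit the joined row; the snapshot here only shows the shape of the result.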
Shaun Clowes
executive
All right. Thank you, Ahmed. So you just saw two examples where we're able to take raw data streams and turn them into highly contextualized, enriched data sets that can then be reused to power our different use cases. Now Flink was the backbone of that capability. In the demo, you saw our cloud-native serverless Flink offering, Confluent Cloud for Apache Flink, that takes away all the management and operational burden of running Flink, and it can seamlessly auto scale resources to meet the size requirements of all of your data flows, large and small. But for our on-prem customers, we also offer Confluent Platform for Apache Flink, where we've taken our many years of experience running large Flink estates and put them into Confluent Platform to make it easier to manage and operate complex Flink deployments across significant organizations. Now we're the only vendor who's offering a Flink solution that bridges on-prem, the cloud and multi-cloud. Now we've moved fast over the past 12 months to bring Flink to all of the different potential users in your organizations. As Ahmed showed you, we GA'd Flink SQL, which allows anybody to use the power of SQL to manipulate data streams. They don't even need to know anything about Kafka, and it's available in the UI, the API and the CLI. But we recognize that not every developer wants to work with their streams using SQL. So we've made it easier to access the power of the Flink engine using the Table API and UDFs from Java and Python, for example. And we've made it possible for Flink to work with all of your data streams in Kafka, whether or not the data was originally serialized with a schema, using flexible schema management. And that means that Flink can work with all of your data, whether or not it's in Avro, Protobuf, JSON or more. And just over the past few months, we've been working to make it really easy to build AI-powered applications using Flink.
We've recently made models first-class citizens in Flink, so that you can trivially invoke models in Anthropic, OpenAI, AWS Bedrock, Google Vertex and more. And we introduced federated search to pull in additional context from major applications in your ecosystem like Databricks, MongoDB, Snowflake, BigQuery and more, so that you can fill up the LLM context window and get the best possible results with the fewest possible hallucinations or errors. Now all that means you can build really reliable and scalable AI applications dramatically faster than you probably thought was possible. So let's see some of those AI features in action. Zamzam, show me some AI agents.
Ahmed Saef Zamzam
executive
Of course, Shaun. But first, I'd like to take a step back and set some context here. While there's a lot of excitement around these agentic frameworks, at the end of the day, at their core, they're just stateful microservices with some brains. They produce and consume events from event streams, and they often interact with each other to produce a continuous set of autonomous actions. Now as it turns out, microservices are our thing. So let's take a look at this perfect example of one of these agentic microservices. You have some sort of a data source that is generating continuous events, like a storage system or a database. You ingest this data into Kafka and then use Flink and a pretrained LLM model to transform these events and produce a continuous stream of actions. This closed loop of observing, reasoning and acting is perfect for stream processing. And with Kafka, you have the real-time view of what's happening in the business and all the business processes. And with Flink, you have the ability to build and develop logic that you run, rerun and iterate in production, the same as you would do in development. Let's see how this applies to our use case. In our use case, we have a set of three agents that run sequentially. The first one runs on AWS Lambda and this does an overall fraud and credit risk assessment. The second one, this is the agent that makes the actual mortgage decision and runs entirely on Flink, and it also recommends an interest rate. And then based on this decision, agent 3, which also runs entirely on Flink, either generates a mortgage offer or a rejection letter. At the end, you'll have something like this, either an offer or a rejection letter. So let's see the agents in action. Cue the demo, please. Agent 1 runs on AWS Lambda.
So we'll use the fully managed Lambda sink connector to stream data directly from the topic that we created in the previous video, enriched mortgage applications with payments, to the Lambda function in real time. In the Lambda function, we're using LangChain, a Claude model running on Bedrock and a fraud risk assessment tool that is bound to the model to evaluate each mortgage in real time. The output is a continuous stream of validated mortgage apps that is also sent to Confluent as a topic. And this topic includes a fraud score, a general credit risk score and, most importantly, agent reasoning. Now looking at John Doe's application, the agent has full confidence in John's ability to repay the mortgage. And it also assigned -- shortly, you'll see that it assigned a moderate risk to the whole application. I think it's moderate because he has a few missed payments and high credit utilization. I think he'll get this mortgage. I have a good feeling that he'll get it, but let's see. Agent 2 runs on Flink SQL. So we first need to register the model, with the system prompt, in the Confluent catalog and then use the pre-built ML_PREDICT function to pass the output of agent 1 as an input to the model of agent 2. The output is a continuous stream of mortgage decisions and a recommended interest rate. Now this is a perfect example of one of these Flink AI microservices that I was talking about a little bit earlier on. Again, looking at John Doe's application, you'll see that, thanks to the strong credit profile, he got approved. Hallelujah, happy for you, John. And he got a pretty decent interest rate, 3.65%. I think many of you here will agree with me. This is pretty good, at least better than mine. Agent 3, same thing, runs on Flink SQL. So we need to register the model with a system prompt in the Flink catalog. And then use the ML_PREDICT function to pass the output of agent 2 as an input to agent 3, and then the output is either a mortgage offer or a rejection letter.
Now while we fully automated everything, we still want to make sure that humans are kept in the loop. Just in case anything goes wrong, we want to save the day. So looking here at John's application. Yes, John got approved. But we will not send the offer just yet. We will first send it to a human for review. And if they agree with the agent's decision, they will approve and send. So what we did: we fully automated the process, made it faster and more accurate, and kept humans in the loop for critical decisions. Back to the slides, please. So think about what you just saw here. This is not AI for the sake of AI. This is real-time AI communicating with itself and self-reflecting to drive a better customer experience. And all of this was made possible because we are using the same top-quality real-time data to power these AI agents for us and help us automate this process. Back to you, Shaun.
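The three-agent flow described in this demo (risk assessment, then decision and interest rate, then letter generation, with a human review gate before anything is sent) can be sketched in plain Python. This is only an illustration of the control flow under stated assumptions: the rule-based functions stand in for the LLM calls, and every name, threshold and rate below is hypothetical rather than taken from the demo's actual Lambda or Flink SQL code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Application:
    applicant: str
    missed_payments: int
    credit_utilization: float  # fraction of available credit in use, 0.0 to 1.0


@dataclass
class Assessment:
    app: Application
    risk: str        # "low", "moderate" or "high"
    reasoning: str   # agent 1 also emits its reasoning


@dataclass
class Decision:
    assessment: Assessment
    approved: bool
    rate: Optional[float]  # recommended interest rate in percent, None if rejected


def agent1_assess(app: Application) -> Assessment:
    """Stand-in for the risk-assessment agent (hypothetical thresholds)."""
    if app.missed_payments > 3 or app.credit_utilization > 0.9:
        risk = "high"
    elif app.missed_payments > 0 or app.credit_utilization > 0.5:
        risk = "moderate"
    else:
        risk = "low"
    reasoning = (f"{app.missed_payments} missed payments, "
                 f"{app.credit_utilization:.0%} utilization")
    return Assessment(app, risk, reasoning)


def agent2_decide(assessment: Assessment) -> Decision:
    """Stand-in for the decision agent: approve or reject, and pick a rate."""
    if assessment.risk == "high":
        return Decision(assessment, approved=False, rate=None)
    rate = 3.25 if assessment.risk == "low" else 3.65  # hypothetical rates
    return Decision(assessment, approved=True, rate=rate)


def agent3_letter(decision: Decision) -> str:
    """Stand-in for the letter-writing agent."""
    name = decision.assessment.app.applicant
    if decision.approved:
        return f"Dear {name}, your mortgage is approved at {decision.rate}%."
    return f"Dear {name}, we are unable to approve your mortgage."


def run_pipeline(apps: Iterable[Application],
                 reviewer: Callable[[Decision], bool]) -> Iterator[dict]:
    """Process applications one event at a time; a human gates every letter."""
    for app in apps:
        decision = agent2_decide(agent1_assess(app))
        letter = agent3_letter(decision)
        status = "sent" if reviewer(decision) else "held"
        yield {"applicant": app.applicant, "approved": decision.approved,
               "rate": decision.rate, "status": status, "letter": letter}


if __name__ == "__main__":
    stream = [Application("John Doe", missed_payments=2, credit_utilization=0.7)]
    for result in run_pipeline(stream, reviewer=lambda d: True):
        print(result)
```

Each stage consumes the previous stage's output, which mirrors how each agent in the demo reads the topic written by the one before it. The reviewer callback is where a real system would park the draft until a person approves it.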
Shaun Clowes
executiveThank you, Zamzam. So you just saw a couple of great examples of business-facing operational AI use cases. But the world doesn't stop there. What about all of the decisions, reports and AI workloads that you power in your data lake? How can you take all of the great work that you've been doing in streaming and reuse it over in the analytical estate? Well, traditionally, the analytical estate has thought about the world in terms of tables, not streams, and those tables have been tightly coupled to the engine that created them. So for example, a table in Redshift could not also be reused in Trino or Snowflake. Now that mismatch between streams and tables, and the tight coupling between a table and the engine that created it, meant that feeding a data lake from streams took a little bit of work. You had to read the data from Kafka, write it on to disk, potentially reformat the file on disk to a more queryable format like Parquet, apply schemas, validate the data, then do business processing like filters or aggregates. It all added up to a lot of manual work and costly processing just to get the data into a form where it is usable in your analytical estate. But it doesn't have to be that way anymore. Over the past several years, we've seen the emergence of open table formats like Delta and Iceberg. And they enable an analytical table to be separate from the engines that are processing it. And that has unlocked our mission to make data connected and accessible across the operational and analytical estates. Time for the news of the day: everything you just saw in those demos can now be immediately reused in your analytical estate with zero additional effort. And that's thanks to the power of Tableflow. With Tableflow, we have evolved Kafka's storage engine so that a Kafka stream can also be accessed as an Iceberg or Delta table. 
The exact same real-time, reliable, reusable data that is already powering your operational applications can be immediately reused inside your data lake, your data warehouse and all of your BI tools with no additional effort. This is one of the most powerful technologies I've had the opportunity to release in my career. Now we took Tableflow GA a little bit over a month ago on AWS, with support for Iceberg. And today, we're super excited to announce the Open Preview of Delta Lake support as well. You can check out more at the QR code on the screen. Now Tableflow demonstrates the value of what we were talking about when it comes to shifting left and moving governance and processing closer to the stream. By doing so, we've got reliable, reusable data we can use everywhere, including in the analytical estate. We don't need ETL. We don't need ELT. We don't need endless data prep. We just get seamless data flow between our operational applications, our data-intensive apps, our AI and all the way into our data warehouse, our data lake and the analytics ecosystem. You save time, you cut costs and you spend time building new stuff rather than constantly wrangling data. So let's check out Tableflow in action. We're going to take a closer look at the mortgage approval data. This time we're going to do so in Databricks, and to talk us through how that works, we're excited to have Robin Sutara, Field Chief Data Officer from Databricks, joining us today.
Robin Sutara
attendeeThank you so much, Shaun. I'm so excited to be here. I know many of you are already using Kafka and Confluent to get your real-time data, and hopefully many of you are also using the Databricks platform as well. And that's why we're super excited about this partnership. As Shaun mentioned, being able to actually use Tableflow to get immediate access to your real-time data in Delta Lake format inside the Databricks platform opens up a whole area of opportunity for us. Just think of all the new data sources that have previously been unavailable to you because of the problems and issues that you have with ingestion. I get to meet with hundreds of customers every month, and they all struggle with this very thing. And so we're super excited that Tableflow now gives us the ability to fuel analytics and AI with the full power of Databricks at a fraction of the cost and the effort. But you don't know me, yes. And so I think seeing is believing, and so I would love for Zamzam to show us a little bit of a demonstration of how this works and what it looks like inside Databricks.
Ahmed Saef Zamzam
executivePerfect. Thanks, Robin. So let's see this in action. What we want to do is expose our mortgage decisions topic, which we created in the previous video, to our users in the data lake. And in this case, our friends here at Databricks will help us with some deeper analysis on these mortgage applications. Cue the demo, please. So as Robin and Shaun mentioned, Tableflow is the easiest way for you to expose your Kafka topics as either Apache Iceberg or Delta tables. What this means is that any tool that understands these open data formats can now easily access this data. Now in our case, if we want to enable Tableflow, all we have to do is simply pick a storage format, in this case Delta, and then choose a storage mechanism, here our own S3 bucket. And then, literally with two more clicks, maybe three, the data and the associated schema are available in Delta format. So no more custom code required, no more manual schema mapping, it just works. And here is a catchphrase for everyone to remember: with Tableflow, your streams are tables. I hope everyone remembers it. I'll ask you after the keynote. So let's see what we could do with this data on the Databricks side. On Databricks, we're using Genie AI, which is a tool that allows us to interact with the data in natural language. So here, we're saying, show me a pie chart with all mortgages that were approved this month broken down by state. What Genie does is understand the intent, translate this intent into a SQL query and use the query to retrieve and display back the results. Let's take a look at another example. We can say, show me a bar chart with all mortgages that were rejected this month, broken down by state. So bar, not pie, and rejected, not approved. Again, the same thing happens. Genie understands the intent, translates it into a query and uses the query to retrieve back the results. 
These are powerful analytical use cases that anyone in the business could do regardless of their technical expertise. And you can stop here, or you can use this data to power so many other use cases on the Databricks side. But what we'll do is take this a step further on the Confluent side. We recently received a request from the audit team. They want access to the entire historical data in Tableflow for mortgage decisions to do some sort of ad hoc analysis. And the question here is, how can we access the entire data in Tableflow while making sure that the real-time data in Kafka is fully reflected in the results? The answer is Flink snapshot queries. This is a new feature that allows you to take a snapshot of the existing data set at the time the query was initiated. The best thing about these queries is that they support hybrid reads, so both reads from Tableflow and from Kafka, without any code change required. And here's a perfect example: the audit team is looking for all mortgages that were approved in Texas with an interest rate below 3.7%. Note here, they did not specify where to read the data from. They simply enabled snapshot mode and Flink did the rest. Back to you, Robin.
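The hybrid read described here (history from the table, the most recent events from the stream, both cut off at the moment the query starts) can be illustrated with a small Python sketch. It is a conceptual model only, not how Flink or Tableflow implement snapshot queries. The function name, the row layout, the idea of a table watermark and the assumption that a later event for the same application id supersedes an earlier one are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


def _upsert(latest: Dict[str, Row], row: Row) -> None:
    """Keep only the newest version of each row, keyed by application id."""
    current = latest.get(row["id"])
    if current is None or row["ts"] >= current["ts"]:
        latest[row["id"]] = row


def snapshot_query(table_rows: Iterable[Row],
                   stream_events: Iterable[Row],
                   table_watermark: int,
                   snapshot_ts: int,
                   predicate: Callable[[Row], bool]) -> List[Row]:
    """Answer a query as of snapshot_ts from two sources.

    table_rows:      materialized history, complete up to table_watermark
    stream_events:   recent events from the log, which may overlap the table
    table_watermark: newest timestamp already materialized in the table
    snapshot_ts:     the moment the query was initiated
    """
    latest: Dict[str, Row] = {}
    # Historical part: everything the table already holds, up to the snapshot.
    for row in table_rows:
        if row["ts"] <= snapshot_ts:
            _upsert(latest, row)
    # Real-time part: only events the table has not materialized yet,
    # and nothing that arrived after the query started.
    for event in stream_events:
        if table_watermark < event["ts"] <= snapshot_ts:
            _upsert(latest, event)
    return [row for row in latest.values() if predicate(row)]


if __name__ == "__main__":
    table = [
        {"id": "a1", "state": "TX", "approved": True, "rate": 3.65, "ts": 1},
        {"id": "a2", "state": "TX", "approved": True, "rate": 3.90, "ts": 2},
        {"id": "a3", "state": "CA", "approved": True, "rate": 3.50, "ts": 3},
    ]
    stream = [
        {"id": "a3", "state": "CA", "approved": True, "rate": 3.50, "ts": 3},
        {"id": "a4", "state": "TX", "approved": True, "rate": 3.60, "ts": 4},
        {"id": "a5", "state": "TX", "approved": True, "rate": 3.40, "ts": 6},
    ]
    rows = snapshot_query(
        table, stream, table_watermark=3, snapshot_ts=5,
        predicate=lambda r: r["state"] == "TX" and r["approved"] and r["rate"] < 3.7,
    )
    print(sorted(r["id"] for r in rows))
```

The caller states only the filter, as the audit team does in the demo. Which source each row comes from is decided inside the function by the two timestamps.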
Robin Sutara
attendeeThat deserves a clap. Yes. Who else is excited about real-time data inside the Databricks platform? So while Zamzam shared with us a few simple use cases, the real power is that it's now available as a table inside of the Databricks platform. You have the full platform to be able to do all of your AI and BI. So just think of the possibility of those use cases that you haven't been able to solve for yet: hyper-personalized real-time experiences for your consumers, being able to do dynamic risk scoring as you're taking in applications for credit, real-time fraud detection. I just think there's so much power here, that you are now going to be able to deliver these use cases for the business that you haven't been able to. At the end of the day, if you think about it, real-time data is what powers AI, powers your real-time AI. And so together, Confluent and Databricks are making it seamless for hundreds of organizations worldwide. And we're super excited for what's to come next. We're looking at even more product integration, so think Tableflow into Unity Catalog, bidirectional integration between the two products. So this is only the beginning of the partnership, and we're super excited to be here together and see what you can build with us. Thank you.
Shaun Clowes
executiveSo ultimately, we want to feed reliable, reusable, enriched data to all of your different systems, operational and analytical. So we'd like to take a moment to acknowledge our partners on that journey. We are building deep catalog integrations with our commercial partners, Databricks, AWS and Snowflake. And we're also excited to launch Tableflow with a variety of ecosystem partners who are already invested in open tables and for whom Tableflow represents the ideal way to get data into their tools. I also want to call out a really useful capability that Zamzam showed you at the end of those demos just there, Flink snapshot queries. Let's say that, in development, I'm working to create a new derivative data set on the mortgage data. I might want to experiment a little bit before I push the streaming query into production. I'm going to try a few things. Now with Flink snapshot queries, you can accelerate your interactive queries using the Tableflow data, and those queries will complete 50 to 100x faster than if they were running directly on top of the streaming data. So you can do really fast iteration, get the exact shape that you want and then just push it into production. Same data, same code, super easy. And it goes further than that. We're shortly going to extend Flink so that when you run a new Flink job, it can seamlessly reprocess all of history, catch up to current time using the Tableflow data, and then switch into streaming mode, producing continuous updates on top of the real-time data as well. Now that makes reprocessing historical data using Flink and streaming really practical, and it opens up a whole new class of problems to the streaming data platform. Whether you're working with tables or streams, interactive or production, it's all one pipeline, one job, one unified system for working with data. This is a powerful unification that Jay is going to describe a little bit more in just a moment. And that's it, folks. 
We solved the problems that we set out at the start for River Bank. And what we've really done is we've taken our data and put it to work to unlock new opportunities for our business. Better customer engagement, faster loan approvals, more accuracy, more dollars on the table. But the power of the data streaming platform extends far beyond just banking, it extends to all different verticals all over the world. And so let's switch it up a little bit and talk about the data streaming platform in energy. And for that, we'd like to welcome Dora Simroth, Head of Data and AI Engineering at E.ON Digital Technology.
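The planned behavior described above, where a single job first reprocesses history and then switches to streaming with the same logic, can be sketched as a Python generator. This is an assumption-laden illustration, not Flink's mechanism: the function name, the event layout and the handoff timestamp that separates historical from live events are all hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple

Event = Dict[str, Any]


def _apply(counts: Dict[str, int], event: Event) -> None:
    """The one piece of business logic, shared by both phases."""
    if event["approved"]:
        counts[event["state"]] = counts.get(event["state"], 0) + 1


def approvals_by_state(history: Iterable[Event],
                       live: Iterable[Event],
                       handoff_ts: int) -> Iterator[Tuple[str, Dict[str, int]]]:
    """Count approved mortgages per state: catch up on history, then stream.

    history:    bounded, table-like input covering events up to handoff_ts
    live:       unbounded, stream-like input that may replay older events
    handoff_ts: last timestamp covered by the historical phase
    """
    counts: Dict[str, int] = {}
    # Phase 1: bulk catch-up over the bounded history, emitting one result.
    for event in history:
        if event["ts"] <= handoff_ts:
            _apply(counts, event)
    yield ("caught_up", dict(counts))
    # Phase 2: continue on live events, emitting an update for each one.
    for event in live:
        if event["ts"] > handoff_ts:  # skip anything history already covered
            _apply(counts, event)
            yield ("update", dict(counts))


if __name__ == "__main__":
    history = [
        {"ts": 1, "state": "TX", "approved": True},
        {"ts": 2, "state": "CA", "approved": True},
        {"ts": 3, "state": "TX", "approved": False},
    ]
    live = [
        {"ts": 3, "state": "TX", "approved": False},  # replayed, already counted
        {"ts": 4, "state": "TX", "approved": True},
    ]
    for phase, snapshot in approvals_by_state(history, live, handoff_ts=3):
        print(phase, snapshot)
```

Because both phases call the same `_apply` function, the logic iterated on against historical data is exactly what runs against live data, which is the "same data, same code" point made in the talk.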
Dora Simroth
attendeeThank you, Shaun. Hello, everyone. It is amazing to be here today with you at Current in London. I am Dora Simroth, the Head of Data and AI Engineering at E.ON Digital Technology. In my role, I lead and build engineering teams that work on digital products with data and AI at their core. A big part of my job revolves around creating the technology strategy for data and AI at E.ON. And in this role, I am so happy to be here today with you and to tell you how we are engineering the green energy future at E.ON. And spoiler, it has something to do with data and AI. But before we get into all of that, I want to introduce E.ON to you. We are the playmakers of the green energy transformation in Europe. As one of the biggest energy companies in Europe, we are active within three lines of business. Within energy networks, we operate the biggest distribution grid. It is the backbone of that transformation. It includes critical infrastructure for our society. Within energy infrastructure solutions, we provide support to cities and industries on their path to decarbonization. And within energy retail, we are there for individual households and enterprises to find their path to a net-zero future. That was the description with words, but we're all data people here. So let me try again, this time with numbers. 1 million renewable energy plants connected. 1.7 million kilometers of grid across Europe. And we serve 47 million customers. So now that you know us, we can really get into the story. The biggest challenge, and at the same time the biggest opportunity, that we are facing is the transformation of the energy world. In the past, energy used to be created in a centralized type of way. It used to flow in one direction, and it used to be relatively predictable. In the past, imagine this, there were a handful of big plants that were running on a relatively predictable schedule. They produced that energy. It flowed through our distribution grids. 
It reached end consumers, and they also had relatively predictable ways of consuming our energy. That world is very different today. The energy system has changed. It has changed because of multiple things, but just take the flow of energy: it is now not just running one way, but bidirectional. And why is that? It's because we have so many distributed energy resources. Look around, and you'll see photovoltaic systems, you'll see electric vehicles, you will see heat pumps, you'll see batteries, and they make the energy system decentralized. So for example, a household, when it needs energy, just pulls it out of the grid. However, if that household happens to have a photovoltaic system on its roof and the sun is shining, it's also pushing energy into the grid. Consumers have also evolved their behavior. As the name says, consumers consume, but nowadays they can do much more. They can produce energy, they can store energy and they can sell energy. So we need a new name: flexumers, we call them. While all of this is changing, the task of the grid operators has remained the same. They still need to balance energy production and energy consumption, and they need to do that at the millisecond level. So you can imagine, streaming is a wonderful way to address this. And accurate real-time forecasting has never been more important. So I brought you three initiatives. The energy world is changing, so we have also transformed our data and IT landscape. These are examples of three initiatives where streaming is an enabling technology and Kafka is a core building block. On the grid digitalization side, we look at assets. We have rolled out smart meters, and we are modernizing all of the components of our grid infrastructure. The grid is a highly dynamic environment. The grid operator needs to assess it in real time. They need to monitor what has happened, and they need to react promptly. 
So streaming and real-time integration patterns are a lovely way to master that challenge. If we turn towards the energy demand side and look at energy markets, keep in mind, we don't actually own energy generation. So whenever somebody needs energy from us, we need to go to the energy markets and procure it. And that's where we're using Confluent to be able to create optimal procurement plans: when to buy and how to buy in a way that reflects the needs of our customers and also their contracts. And at the center, you can see data products at scale. This initiative is at the core of it all. To be able to scale, but to do so in a way that simplifies our data landscape, we have put data products at the heart of what we do. This is also where I'm most excited about the innovation coming from Confluent, because now, at the level of the file, we can achieve the duality of the stream and the table, and in that way data products can truly be reusable. So to sum up, for us to manage the challenge that is coming from the transformation of the energy system, we need to be flexible. But flexibility requires a strong foundation, and that is why, within our data and AI initiatives, we have built on top of strong data platforms. This is where we partner with Confluent. In this way, we can take all of the data sources that we require, connect, process, stream and govern them, build those data products that I was telling you about and power our use cases on the grid side, on the energy market side and many more. So what does it mean to have a strong data platform? For developers, it means having access to connected data, trusted, governed data and AI-ready data. For our business, it means maximizing grid efficiency, being able to dynamically respond to energy markets and having accurate forecasts at high temporal resolution. We don't just stream our data, we also act on it in real time and with intelligence. For example, on the grid side, we have invested in predictive monitoring. 
We have installed digital sensors within our low-voltage grids that allow the grid operators to get visibility in real time and to control low-voltage grids. And the outcome of all of that has been faster incident detection. Now if we turn towards energy demand, we've also invested heavily in increasing the capability of forecasting renewable energy production. For those of you who live in London, I think you have a very good feeling for what it means that the weather forecast is not really to be trusted. It might be a sunny day according to the weather forecast, but exactly where you are, a cloud has come out. And that cloud covers your photovoltaic system, and suddenly it's not producing as much energy as you would have initially thought. So we need to have very granular forecasts. And if we are able to forecast the renewable energy production, we can take that knowledge and use it to incentivize our customers to consume the energy at the right location and at the right time, in a way that helps us balance the grid. However, if we can't do that with our customers' contracts, we can also take that flexibility and the knowledge that we have and trade it on the energy markets. So there's a lot we can do by building AI on top of the stream. There are many initiatives we have at E.ON where we are using AI on top of the stream. And I'm so happy to be part of this, and there are so many more stories to tell, but all of these are stories for another time, and I'm really happy that we've partnered with Confluent to achieve this. I'll hand it over now to Jay to tell you more. Thank you, everyone.
Edward Kreps
executiveAll right. Well, that was really cool. I think if there were two topics in technology I really like, it would be data streaming and AI. But if there are three topics in technology I really like, it would be data streaming, AI and what's happening in the world of energy, which is really fascinating. So I really enjoyed that. I'm going to be talking here at the end a little bit about this unification of streaming and batch, a little bit of what it means in the age of AI and agents, a little bit about the kind of data-intensive applications that Shaun talked about, and how these all fit together with some of the functionality we talked about, the snapshot queries, what's happening in the world of Flink. But to frame all of this, let me step back a little bit and start with the big picture. So what does all this mean? What's the point of all these data-intensive applications? I think there's a larger theme here, which is, in some sense, companies are becoming much more software. It used to be a purely human activity, but now we've started to build software systems that don't just support the business. They actually run it. They're actually operating core activities, the interaction with customers, the production of goods and services, really things right at the foundation. I think what you heard from E.ON is very similar to that, where right in the flow, there's a feedback loop that's optimizing that business. And I think it's not unique to energy. It's happening in every industry. So how did we get here? What does it mean? What are the implications for data systems and infrastructure? Well, to talk about that, let me rewind a little bit and start with kind of the quick history. If you think about how software was adopted in organizations, maybe going back a few decades, it was bit by bit. Applications, which kind of mostly stood alone. They were kind of a silo with their own data, their own database. The user types in the data and sees it back in some way. 
They just basically are kind of passive repository of data. And this is kind of the paradigm. These are UI-centric tools that are meant to show things back to humans. And the humans are ultimately driving the intelligence, the decision-making, a lot of the interesting stuff. The software is just there to kind of hold it for the human. And if you think about how this evolves? Well, we started to need to plug these together mostly for business intelligence. And so the rise of data warehousing was all about extracting all this data from these little silos and putting it together in one place where you could run analysis on it, where you could look across where you could do reporting across these things. And now if we fast forward over the last few decades, what's happened to this architecture? Well, it's mostly accelerated. Mostly, we have a lot more of all these things. We have more applications. We have more databases. We have more analytics systems, we have more environments that all this is running in, and it's all more interconnected. So part of this is just, well, there's a lot more of it. There's a lot more data. But I think there's been a change. It's not just that there's more software or more applications. The nature of these applications is shifting, right? This is -- there's really a transition from that kind of UI-centric tool which is there to support user input to something that's much more significant backbone of the business, the rise of these data-intensive applications that we were talking about. And I think if you think about what this means fundamentally. It's actually a pretty significant change in paradigm. And it's a change that goes all the way to how the applications are built. After all, if you're making a decision with people and the software is just supporting that then ultimately, the role of the software is to hold on to the data until the user wants to see it, and then retrieve it and put it on the screen. 
And so your concept of data infrastructure is mostly about storage and retrieval. That's really what it does. And the action is going to be very periodic. It's when the user shows up at the desk and happens to look at the thing. As you move into a world where software is doing more of the work, where it's making more of the decisions on its own, suddenly things change, because it's not like the software comes into work at 9 a.m. and finally decides to open up that browser window. The software is working all the time. And so things that used to happen periodically start to happen continuously. They become a much more real-time activity. And the problem with data isn't just about storage. You still need to store data, but that isn't the primary thing. It is about the flow of real-time activity and working continuously off of it. And as this happens, the loop inside businesses speeds up. This is no longer something that happens periodically. It's something where one thing happens, which triggers another thing, which triggers another thing. And with this kind of closed-loop decision-making, and this translation of business into software systems, what do you think is happening now in this new world of AI? Well, this is a trend that's massively accelerating. This is something that I think didn't start with AI. We were doing this probably for the last 10 years in some sense, but the capabilities are massively accelerating. And for really obvious reasons, right? If you're trying to put decision-making and autonomy into the software, the limiting factor is the things that you can express in a bunch of hard-coded rules, right? It's really hard to capture certain business processes in a formal algorithm. And it's a lot easier if you can apply intelligence directly, right? 
The intelligence is no longer just in people's heads, which requires getting all the data through a UI and then back again. The intelligence can now be out there with the software itself. And this really redefines the scope of what the software systems can do. I think it really is going to accelerate the adoption of software in companies, the automation. I think if we think about what's going to happen in the next 10 years of software development, it's going to focus on this problem: how can we internalize this, how can we take it on, how can we apply it in the organizations where we work. That's what's happening. And so ultimately, the question is, how can we build this type of data-driven, data-intensive application? What are these new agents and AI-driven applications going to look like? What's the architecture for them? What are the problems we have to solve? How hard is this ultimately to do? Now after all, we've all used AI in some form. You've probably used ChatGPT or Claude or something. That's not that hard. So is it the same if we're trying to integrate this into a business? And the answer is, well, it's actually a lot harder. If you're actually trying to take something a business does and translate it into an AI-driven software system, there's a bunch of challenges that you have to solve. And I'm going to talk about three of the things that are really different, I think, from traditional software engineering that we have to adapt to if we want to be able to do this. I think these are kind of three principles that we have to get right for building this kind of system. The first difference is these applications are ultimately built with data, in a way that's really different from a traditional enterprise app. You can absolutely sit down and build some piece of software and write a bunch of unit tests and performance tests and integration tests, all with fake data, and you can deploy that to production and have it work correctly. 
And say, yes, this software works, it's logically correct. I've validated all of its rules, and it totally works. With AI-driven systems, if you haven't seen it work with the real data, you have no ability to say if it works at all, right? Think about a practical example of this. Let's say I'm trying to build an agent that's going to answer support requests, right? I have no ability, with just wiring that up to some other things, to say if it works or not until I've seen it actually run with real support requests from real customers, acting on real data from the actual parts of the business applicable to that customer. Until I've seen that, I can't say that that software works or doesn't work. I can't say anything about it, right? So the data in this new world is much more closely connected to the application than ever. And you have to be able to directly build on that, and you have to be able to harness a much wider ecosystem of data to actually build this type of application. It's not just about the stuff that's in its local database. It's about the stuff that's all over the organization that can help it. The second thing that's different is that the ability to actually iterate on these data-driven applications is quite different, right? I need to be able to actually take this, run it on real support requests in my support example, be able to see output, benchmark that and say, yes, these are good. These are better. And then I have to be able to do that as I change the model, change the prompt, or add more data that's going to act as context. I have to be able to do that in development and then out into production as the system runs. Is it actually producing good results? Is it still working as I expected? If I need to make future changes to that, is that still working as I expected? That evolution and iteration is now a very different life cycle that's much more metrics-driven. 
And finally, I need to be able to integrate this back into the actual operation of the business. I need to somehow take something that's very data-intensive and apply it in the flow of a company, actually run it like a production system, have it actually act in real time. So I think these are the three characteristics. So when we think about, well, what's the right way of building this kind of thing? We have different paradigms for building programs. And one that we might lean to right away, if we say, hey, where is the data? We might say, well, maybe we can build this kind of thing as a kind of batch process? Why not? And you might expect me to kind of beat up on that, kind of make fun of the batch processing agent. But actually, batch processing isn't that bad, in some ways. In fact, I would say we should have two cheers for batch processing. It's not quite three cheers. There's not going to be quite three cheers, but there's at least two solid cheers, because there are some things that it gets really right. If you've ever built a simple process that just takes input from a text file, some Python script that munges a text file, this could be simple rules. It could be a more sophisticated machine learning process. It could be something with Gen AI. It's actually really easy. You take your inputs, you produce your outputs, you run it over and over again until it works, right? Similar thing if you work in a data warehouse: you have all your data there, you write your SQL script. You can actually build that with the data, you run the process, it produces an output. It's pretty easy. You can actually really do a good job of working with that. And the reasons, I think, are kind of simple. First of all, the batch systems actually have the data. You go to a data warehouse, it's full of data. It has a bunch of tables. It has the full history of everything. It comes across different domains. If you're writing a data-intensive thing, this is actually not too bad. 
And secondly, the batch systems have this really nice iteration loop. I can actually take my inputs, transform them into outputs, tweak my logic, do it again and again until I get the right thing. And I can do this in a way that's productive and that I can test on data at scale, if I need to. So these things are pretty good. So maybe the world should just run on batch agents that kind of kick off at midnight and run until morning. And I think that's kind of where the problem is, right? In those three criteria, you kind of get the first two, but you don't really get the third thing. And anyone who's ever tried to take batch processing and integrate it back into the operation of a real-time business starts to find this very hard. There's a lot of things that show up that just aren't right, right? Customers expect all the things that they see to be up to date and in sync. The business is ultimately happening continuously. And you end up having to do all these kind of weird hacks to integrate this back into production. You get all kinds of delays. This is ultimately kind of a hopeless process. Reality is out there happening in the world all the time. The applications we build have to do that as well. The more of the business the software is taking over, the more it's going to need to do that. And so we're probably not going to fix batch processing to be this. But there was something good about it. So what about what we usually do? What about request-response applications? Well, these are great. We build lots of them: REST services, microservices, web apps. We're very familiar with this. We know how to run it in production. We know how to scale it. Maybe this is the right way to do it. Is this going to be the path to building AI agents? And the answer is, well, kind of. Like, we definitely know how to scale it. We know how to run it reliably. It is real time, but it's really hard to iterate on data this way. 
We've lost something of what was good in a data warehouse. The ability to actually draw on many data sources, the ability to actually test and iterate, this is harder to do against a bunch of REST services. And it's harder for a number of reasons. First of all, they're not really set up for this kind of test, retest, benchmark methodology. I'm probably going to cause some chaos in production if I try and do that. Secondly, that data is going to be changing all the time. So if I run it once and I run it again and I get better or worse results, I can't really say if that's because the input changed or it's because of some change I made, right? And finally, there's a whole set of security concerns in trying to do that in production. There's nothing like the data playground that the data warehouse had. And so there's a bunch of limitations if I'm trying to think about how to build this kind of data-intensive application purely on request response. And this will particularly become apparent as I think that, hey, I need to see the outcomes it's going to invoke. What are the actions this agent is going to take? I need to see those actions before it does it. Ultimately, I need to be able to run that benchmark. And so the question that arises is, well, what about streaming? I think this is where streaming can come in, done right. Streaming has the strength of both of these paradigms. It has the ability to do something in real time. It has the ability to work with data. But not necessarily. Like, stream processing, in particular, has often had a gap between theory and practicality. It's been something that kind of makes sense intellectually, but it's kind of hard to do, right? And it hasn't always fulfilled this vision of making it really easy to work with data in real time. And I think there's a couple of reasons for this. One of the reasons is that in a lot of stream processing systems, you just don't have all the historical data. 
So a lot of Kafka clusters have maybe got 7 days of data, but you don't have everything going back all the way. If you're trying to run something that's built on data and you don't have all the stuff you need, that's really hard. And if that's missing, then you often have to graft on some other system to make it work. Likewise, a lot of the stream processing systems haven't been good at kind of high throughput processing of historical data, which makes it much harder to work with that full set of things. And the result is you're trying to build across maybe many different systems to get something that will do the historical data but can come into real time to do the real-time data and somehow bridge it all. This is not an easy thing to do. What can we do about this? Well, there's been an idea that's out there that you can really imagine streaming as a kind of generalization of batch, like something that's batch plus-plus, where you can run something at a point in time and get a result. But you can run it and have it keep running as that data evolves. So this is not a new idea, but in practice to get this right, there's a number of details you have to put together to actually make something that's practicable. You have to actually have all the parts and have them really integrate well. So what are those parts? And how do you actually make this work? Well, there's two key concepts that a system has to deal with to do this well. One is this idea of a stream: what's the flow of events happening in the world? Everybody here would probably know this from the world of Kafka. You also need the tables, the kind of state of the world, what's the current state of the world. An intelligent system kind of inherently needs both of these. I think we have some intuition that you need these. The stream is, in some sense, giving you awareness. It's like the sensory system of what's happening out in the business. 
And the tables are kind of like the memory banks, like what's all the stuff that we know right now. And inherently, you're going to want to connect these. But if this seems a little bit theoretical, let me give a practical example. So this is a process that is common out in the world, the process of taking a census. So the U.K. has a census, the U.S. has a census, a number of countries have a census. And the goal is to really compute a bunch of stats about the population. Where do people live, how many people are there, nationality, et cetera, origin, all the things you might want to know. And so in a really funny way, the U.S. census is actually hardcoded into the constitution as a batch process that runs every 10 years. So every 10 years, it runs and you literally just count every single person, you go find every person, and you write them down. So it is the full table scan of census operations. And it's not that different in a lot of other places. It's actually not that different in the U.K. And this probably made sense in its time. Certainly, when the U.S. Constitution was written, the data was all collected on paper and transported by horseback. So I think it was inherently not going to be a stream processing system. But nonetheless, it just doesn't match with the current world. You want something where you know the state of things more than every 10 years, and you don't want something that's out of date by the time it finishes. So how could you do this with these two concepts? Well, inherently, you have these two things, right? You have the state of the population, what are the people? Where do they live? That enumeration of people, that's kind of the table. And then you have the stream of what's happening, the stream of births, deaths, people moving between cities. These two things together give you everything you need to know, right? As these events happen, the underlying table is evolving to reflect it. And these two things are intricately related. 
And in the stream processing world, this is often referred to as the table-stream duality, right? So if I had the stream of all the births and deaths and movements, I can actually recreate the table of where everybody lives as long as that stream went all the way back to the beginning of time. But also if I watched that table and I watched it evolve and I wrote down all the changes that happened in the table, I would actually end up with that stream. The two concepts are kind of interchangeable. That's what I mean. And if I had this, I can actually do real-time processing on my population. I can have a view of where everybody was and all the stats I wanted that continually updated. And this is not science fiction. It turns out there's actually a number of countries that do this. Not the U.S., and not the U.K. either, I don't think. But there are some countries in Europe, and I think to some degree India, that have continuous population registries that update all the time. So it's not entirely impossible. And it's not just a thing that happens in the census system. These concepts also show up at the core of databases, where you have the idea of a commit log of changes, which is effectively the stream of updates to your data, and you have the tables of data that are sitting in the database. And these two things are actually directly related. The stream of changes is what actually populates the tables. And people say, in a sense, the tables are kind of just like an optimization. If you had that log of all the changes, you would actually have all the data and be able to recreate it, not just in its current state, but at any point in time. And this is the same relationship, if you think about it, that Kafka and Tableflow actually have to a company at large. So if you think about what is Kafka doing, it's a hub of real-time data that is ultimately acting as a kind of commit log across the company. 
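The table-stream duality in the census example can be sketched in a few lines of Python. This is a minimal illustration, not anything from Kafka or Tableflow: the event shape `(kind, person, city)` and the helper names `replay` and `changelog` are hypothetical, chosen only for this sketch.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape for this sketch: (kind, person, city),
# where kind is "birth", "move", or "death".
Event = Tuple[str, str, str]


def replay(events: Iterable[Event]) -> Dict[str, str]:
    """Stream -> table: fold the events into 'who lives where'."""
    table: Dict[str, str] = {}
    for kind, person, city in events:
        if kind == "death":
            table.pop(person, None)
        else:  # a birth or a move both set the current city
            table[person] = city
    return table


def changelog(before: Dict[str, str], after: Dict[str, str]) -> List[Event]:
    """Table -> stream: write down the changes between two table states."""
    changes: List[Event] = []
    for person in sorted(after):
        if person not in before:
            changes.append(("birth", person, after[person]))
        elif before[person] != after[person]:
            changes.append(("move", person, after[person]))
    for person in sorted(before):
        if person not in after:
            changes.append(("death", person, before[person]))
    return changes


events: List[Event] = [
    ("birth", "ada", "London"),
    ("birth", "bob", "Leeds"),
    ("move", "ada", "Bristol"),
    ("death", "bob", "Leeds"),
]

table = replay(events)                      # full history -> current table
snapshot = replay(events[:2])               # table as of an earlier point in time
delta = changelog(snapshot, table)          # changes observed since that snapshot
caught_up = replay(events[:2] + delta)      # snapshot history + changes -> same table
```

Replaying a prefix of the stream gives the table as of that point in time, which is the "recreate it at any point in time" property the speaker attributes to a commit log.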
It's taking all the updates, all the things that are happening all across the organization, and it's applying them across all the systems you've got. And that's primarily used in this operational estate for the kind of real-time applications running the business. If you go over to the analytical estate, there's a similar basis, there's a similar hub of data, which is these tables, right? These open tables, Delta or Iceberg, which are serving a set of query engines. And this is again an open service, not just the internals of a single database. So both of these are taking these concepts and doing it at kind of a company-wide scale. And so the very natural thing, as we talked about, is to connect these, to actually have the stream feed and continuously update the tables, have the stream of changes actually populate the tables. And when you do this, you get a kind of unified data set across stream and table that's actually the basis for stream processing. And one of the points I want to make is that this is not just a skin-deep integration in Confluent. It's not just a connector that sends the data off to some Iceberg thing. This is actually a really fundamental representation. In Kafka, we've long represented this idea of a table as a kind of compacted stream. And in Delta and Iceberg, they actually represent tables in a stream-oriented fashion. They're built on a kind of LSM-oriented design, which is effectively a table that is written out in streaming chunks and compacted together to deduplicate it. And you can plug both of these together in a very literal way to create a unified storage layer across streaming. And so this is what we've done with Tableflow. We've actually literally unified this stream-table duality into something that actually represents the full life cycle of data. And for data systems, this is actually a basis for starting to think about how to make stream processing really practical. 
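The idea of a table as a compacted stream can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the `compact` function is hypothetical, a value of `None` stands in for a delete marker, and delete markers are dropped immediately, whereas a real log would typically retain them for some period before removing them.

```python
from typing import List, Optional, Tuple

# Hypothetical record shape for this sketch: (key, value).
# A value of None marks the key as deleted.
Record = Tuple[str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, in original log order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later offsets overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    # Simplification: drop delete markers right away.
    return [(key, value) for key, (_, value) in survivors if value is not None]


log: List[Record] = [
    ("ada", "London"),
    ("bob", "Leeds"),
    ("ada", "Bristol"),   # supersedes ada's first record
    ("bob", None),        # deletes bob
    ("cy", "York"),
]

compacted = compact(log)
as_table = dict(compacted)  # the compacted stream read as a key -> value table
```

After compaction the stream holds one record per live key, so reading it end to end yields the current table rather than the full history.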
But data systems don't just have a storage layer, they also have a processing layer. And this is where Flink comes in, where it actually unifies these two things, and it gives you the ability to look across. It gives you the ability to treat batch as just a kind of bounded stream, a stream that's stopping now. And if you have these two things together, you can really start to realize this vision of making streaming a generalization of batch. And where this comes into play is when we think about, okay, how are we going to build these data-intensive applications, what are the requirements we have? Well, one of the really important ones was the ability to reprocess data in a really easy way. And this reprocessing with streaming data, there's been a model for how to do this for a long time, where you just do it on the stream, right? You process the stream and if you need to go back and do it again, you kind of rewind, start over, and do it again. This is sometimes called the Kappa architecture, although it's such a simple thing, I don't know if it needs a name. And it's actually a really simple approach, but it has some practical limitations. And two of those limitations are, one, it's often not that easy to rewind all the way to the beginning of the stream because you may not have all the data in one place. You may not have it all in Kafka. You may not want to store it all there. That might be too expensive, that might not be desirable. It might be duplicative. And secondly, it can just be slow, like actually reprocessing all that historical data could just take a lot of time. If we're trying to iterate on top of this data, we don't want some process which is slow to repopulate everything. And I'm going to show you how a combination of Tableflow and Flink solves this. And so it starts with the unified processing in Flink. So here is an example of two programs that do the same thing, one in SQL, one in Java. 
And when we look at this, we can say, well, okay, is this a batch program? Or is it a stream processing program? And the answer, of course, is yes. It is both of those things, right? It's a batch program if you run it on just the data at a point in time, the snapshot. It's a stream processing program if you let it continue to run. And so you can have a single definition of what you want to do and treat it as both. And then as Shaun described, we've done a ton of work now to optimize the combination of these two. So using data that is in Iceberg or Delta, you can significantly speed up the processing of this data. When you're running these kind of snapshot queries, they'll actually be 50 to 100x faster. And this is very meaningful. If you're working on a good chunk of data, this will take something that might have taken 20, 25 minutes and it'll take it down to seconds. And so if you're doing kind of iteration on data, this is a very meaningful change. And we're not stopping there. As Shaun said, we're actually generalizing this further, so the streaming queries will always take advantage of these underlying optimizations, whether they're running in snapshot mode or not. They'll actually fall back on this and use it whenever they need to catch up. And so if you think about what you get when you put this together, you get a unified data system where the storage layer is built on Kafka, acting as a kind of commit log, Iceberg and Delta acting as a kind of table representation, Flink acting as a query layer. And we put these together into a unified system that shares schema, that shares a set of capabilities so that you have unification across all of this. And what this enables is solving some of these problems I talked about. This unification makes reprocessing of data really easy. I can now work on these data sets iteratively. I can work on my program and process them and look at the output and process them and look at the output. 
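The point that one definition can be both a batch program and a stream processing program can be sketched in plain Python. This is not the Flink API and not the SQL or Java shown on stage; `count_by_city`, `run_snapshot`, and the sample feeds are hypothetical names used only to illustrate treating batch as a bounded stream.

```python
from collections import Counter
from itertools import cycle, islice
from typing import Dict, Iterable, Iterator


def count_by_city(events: Iterable[str]) -> Iterator[Dict[str, int]]:
    """Single definition of the query: emit updated counts after every event."""
    counts: Counter = Counter()
    for city in events:
        counts[city] += 1
        yield dict(counts)


def run_snapshot(events: Iterable[str]) -> Dict[str, int]:
    """Batch mode: the input is bounded, so run to the end and keep the last result."""
    result: Dict[str, int] = {}
    for result in count_by_city(events):
        pass
    return result


# Batch: a bounded input, the data as of a point in time.
snapshot_result = run_snapshot(["London", "Leeds", "London"])

# Streaming: the same query over an unbounded input; here we only look at
# the first four updates, but the query itself would keep running.
live_feed = cycle(["London", "Leeds", "London"])
first_updates = list(islice(count_by_city(live_feed), 4))
```

The query logic lives in one place; whether it behaves as batch or streaming depends only on whether the input ends.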
I can integrate AI into that process and do the same thing. And this makes it really easy to build with data. And that's really important. But not only can I do that and not only can I iterate on it, I can actually translate it seamlessly into production. I can take that same thing and run it as a continuous stream processing. So that actually really gives me all three of these characteristics, the ability to build with data, the ability to iterate on that and the ability to act in real time. And I think this is a really powerful combination. And I think this takes stream processing from something that was niche, that was kind of on the side, and makes it into something that's an incredibly powerful tool in the toolbox. And with Tableflow acting as the basis of this, you really don't have to pick and choose. You can kind of have your cake and eat it too. Because Tableflow is going to also populate all the data that you need in the rest of the analytics ecosystem. It's going to help you fill up all the tables that you need in your lakehouse or your warehouse. And that data will actually get better. It will actually arrive faster, and it will be transformed on the fly. So you can land data in a way that's immediately usable for your analytics users. And you can do this in a way where you don't have all the painful mappings from the operational estate to try and guess the schema and put it into the analytical estate. That will be actually maintained end-to-end with the unified schemas. So instead of having some ETL process break the next day, if data changes in an incompatible way, that will be caught in development. People will know that, oh, I can't publish to that topic with that schema, that actually will break. So your lakehouse, your warehouse, the analytical side of the business gets better. But so does the streaming world. 
Suddenly, your data streaming platform has infinite retention and gets this essentially for free because you were going to keep that data for the warehouse anyway. And the data is shared. It's the same Iceberg table. You get the ability to reprocess historical data on demand, and you get a compacted table version of that data to use for reference. So both sides of this get more powerful by sharing. And I think that these two platforms are both really key to the future that we're moving toward. The world of data streaming is powering these kind of data-intensive applications, the things that are running in real time, that are applying AI to run the business, that are driving some of the customer-facing operational applications. The lakehouse and warehouse are the basis for analytics, the basis for a lot of the intelligence, the insight, the ad hoc analysis, the data science. And both of these need to somehow suck in all the data across the business and harness it, and make it available. And that's ultimately what we're trying to do at Confluent. And to do that, we've really brought together these core open standards that we think people want to build around. Flink, Kafka, Delta, Iceberg. We brought them together into a unified system. And we've added to that the connectors that actually hook this up to the rest of the organization, that pull in the streams, and the set of unified schemas and governance that tie it all together. So that Flink knows about the same data with the same structure that's in Kafka, the same data with the same structure that's in Iceberg or Delta. All of those are operable in the same way. And we think that this can act as a basis that really powers the next generation of data-intensive applications, that actually connects all the different systems across an organization. And we think that this is really the fundamental platform when we think about what's going to happen with data, how we're going to harness that data with AI, how our company is going to translate more of what it does into software. 
And we're incredibly excited to build that future with all of you guys. And so that's kind of ultimately what this conference is about. We've got a fantastic lineup, I think, with really engaging content across all of these topics. Kafka stuff, stream processing, AI, all kinds of aspects of the governance of data, the use of data, all of it, you can find in some of the different sessions here, and you can not only do that, but talk to some of the people who are doing it for real, who are building these systems, who are using these systems, who are applying it in different parts of the economy. So go to some talks, meet some people, and let's all explore this new world together. I couldn't be more excited. Thank you very much.