• Apache Kafka 3.5 - Kafka Core, Connect, Streams, & Client Updates
    Jun 15 2023

    Apache Kafka® 3.5 is here with a preview of migrating ZooKeeper-based clusters to KRaft mode. Follow along as Danica Fine highlights key release updates.

    Kafka Core:

    • KIP-833 provides an updated timeline for KRaft.
    • KIP-866 is now in preview and allows migration from an existing ZooKeeper cluster to KRaft mode.
    • KIP-900 introduces a way to bootstrap the KRaft controllers with SCRAM credentials.
    • KIP-903 guards against a data loss scenario by keeping replicas with stale broker epochs from joining the ISR list.
    • KIP-915 streamlines the process of downgrading Kafka's transaction and group coordinators by introducing tagged fields.
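
    The stale-epoch check behind KIP-903 can be pictured with a small toy model. The sketch below is not Kafka code: `ToyController`, its methods, and the per-broker epoch counter are hypothetical names invented for illustration, and the real protocol carries more state. It only shows the idea that an ISR change is rejected when it was built from a broker epoch that predates a restart.

```python
from dataclasses import dataclass, field


@dataclass
class ToyController:
    """Toy stand-in for the controller's view of broker registrations."""

    # broker id -> epoch handed out at the broker's most recent registration
    broker_epochs: dict = field(default_factory=dict)
    isr: set = field(default_factory=set)

    def register_broker(self, broker_id: int) -> int:
        """A (re)started broker registers and receives a new, higher epoch."""
        epoch = self.broker_epochs.get(broker_id, 0) + 1
        self.broker_epochs[broker_id] = epoch
        # A restarted broker has to catch up again before rejoining the ISR.
        self.isr.discard(broker_id)
        return epoch

    def alter_isr(self, proposed: dict) -> bool:
        """Accept the proposed ISR only if every replica epoch is current.

        `proposed` maps broker id -> the broker epoch the leader last saw.
        """
        for broker_id, seen_epoch in proposed.items():
            if self.broker_epochs.get(broker_id) != seen_epoch:
                return False  # stale epoch: reject and leave the ISR unchanged
        self.isr = set(proposed)
        return True


controller = ToyController()
broker1_epoch = controller.register_broker(1)
broker2_epoch = controller.register_broker(2)

# Broker 2 restarts (possibly losing unflushed data) and gets a new epoch.
broker2_new_epoch = controller.register_broker(2)

# A request built from the pre-restart epoch is rejected...
stale_ok = controller.alter_isr({1: broker1_epoch, 2: broker2_epoch})
# ...while one carrying the current epoch is accepted.
fresh_ok = controller.alter_isr({1: broker1_epoch, 2: broker2_new_epoch})
```

    In this toy model the rejected request leaves the ISR untouched, which is the property that keeps a replica with missing data from being counted as in sync.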


    Kafka Connect:

    • KIP-710 provides the option to use a REST API for internal server communication that can be enabled by setting `dedicated.mode.enable.internal.rest` equal to true.
    • KIP-875 offers support for native offset management in Kafka Connect. Connect cluster administrators can now read offsets for both source and sink connectors. This KIP adds a new STOPPED state for connectors, enabling users to shut down connectors and maintain connector configurations without utilizing resources.
    • KIP-894 makes the `IncrementalAlterConfigs` API available for use in MirrorMaker 2 (MM2), adding a new `use.incremental.alter.configs` configuration that takes the values `requested`, `never`, and `required`.
    • KIP-911 adds a new source tag for metrics generated by the `MirrorSourceConnector` to help monitor mirroring deployments.
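
    As a rough illustration of the KIP-875 workflow, the sketch below builds the two Connect REST requests involved: one to move a connector into the STOPPED state and one to read its offsets. The endpoint paths follow the KIP as I understand it; the worker URL, the connector name, and the helper function names are assumptions made for this example.

```python
import json
import urllib.request

# Assumption: a Connect worker listening on the default local REST port.
CONNECT_URL = "http://localhost:8083"


def stop_request(connector: str) -> urllib.request.Request:
    """Build the request that puts a connector into the STOPPED state."""
    url = f"{CONNECT_URL}/connectors/{connector}/stop"
    return urllib.request.Request(url, method="PUT")


def offsets_request(connector: str) -> urllib.request.Request:
    """Build the request that reads a connector's current offsets."""
    url = f"{CONNECT_URL}/connectors/{connector}/offsets"
    return urllib.request.Request(url, method="GET")


def send(request: urllib.request.Request):
    """Send a request to the worker and decode a JSON body if one is returned."""
    with urllib.request.urlopen(request) as response:
        body = response.read()
        return json.loads(body) if body else None
```

    Against a running worker, `send(stop_request("my-sink"))` followed by `send(offsets_request("my-sink"))` would stop the hypothetical `my-sink` connector and then fetch its offsets while its configuration stays in place.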


    Kafka Streams:

    • KIP-399 improves Kafka Streams' error-handling capabilities by addressing serialization errors that occur before message production and extending the interface for custom error handling.
    • KIP-889 introduces versioned state stores in Kafka Streams for temporal join semantics in stream-to-table joins.
    • KIP-904 simplifies table aggregation in Kafka Streams by proposing a change in serialization format to enable one-step aggregation and reduce noise from events with old and new keys/values.
    • KIP-914 modifies how versioned state stores are used in Kafka Streams. Versioned state stores may impact different DSL processors in varying ways, see the documentation for details.
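
    To make the idea of a versioned state store more concrete, here is a toy in-memory model of a timestamped lookup. It is not the Kafka Streams API: `ToyVersionedStore` and its methods are hypothetical, and retention, tombstones, and persistence are left out. It only illustrates why keeping multiple versions per key lets a stream-to-table join pick the table value that was valid at the stream record's timestamp.

```python
import bisect
from collections import defaultdict


class ToyVersionedStore:
    """Keeps every version of a key, ordered by timestamp."""

    def __init__(self):
        # key -> parallel lists of timestamps (sorted) and values
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def put(self, key, value, timestamp):
        """Insert a version at its timestamp, even if it arrives out of order."""
        i = bisect.bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(i, timestamp)
        self._values[key].insert(i, value)

    def get(self, key, as_of):
        """Return the latest value with timestamp <= as_of, or None."""
        i = bisect.bisect_right(self._timestamps.get(key, []), as_of)
        return self._values[key][i - 1] if i > 0 else None


prices = ToyVersionedStore()
prices.put("sku-1", 10, timestamp=100)
prices.put("sku-1", 12, timestamp=200)

# An order stamped 150 joins against the price valid at that time,
# not the latest price.
price_at_150 = prices.get("sku-1", as_of=150)
price_at_250 = prices.get("sku-1", as_of=250)
price_at_50 = prices.get("sku-1", as_of=50)
```

    With a store that only kept the latest value, the order at timestamp 150 would have joined against the later price; the versioned lookup avoids that.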


    Kafka Client:

    • KIP-881 is now complete and introduces new client-side assignor logic for rack-aware partition assignment in Kafka consumers.
    • KIP-887 adds the `EnvVarConfigProvider` implementation to Kafka so custom configurations stored in environment variables can be injected into the system by providing the map returned by `System.getenv()`.
    • KIP-641 introduces the `RecordReader` interface to Kafka's clients module, replacing the deprecated `MessageReader` Scala trait.
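
    The effect of an environment-variable config provider such as the one in KIP-887 can be sketched in a few lines. The snippet below is a Python imitation, not the Java `EnvVarConfigProvider` itself: the function name is hypothetical, and the `${env:NAME}` placeholder form is my understanding of how Kafka config provider references are written, so treat it as an assumption.

```python
import os
import re

# Matches placeholders of the assumed form ${env:VARIABLE_NAME}.
_PLACEHOLDER = re.compile(r"\$\{env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_placeholders(config, env=None):
    """Return a copy of `config` with ${env:NAME} placeholders substituted.

    `env` defaults to the process environment; pass a dict to test in isolation.
    """
    env = os.environ if env is None else env

    def substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return {key: _PLACEHOLDER.sub(substitute, value) for key, value in config.items()}


raw_config = {
    "bootstrap.servers": "broker:9092",
    "ssl.keystore.password": "${env:KEYSTORE_PASSWORD}",
}
resolved = resolve_env_placeholders(raw_config, env={"KEYSTORE_PASSWORD": "s3cret"})
```

    Keeping the secret out of the properties file and resolving it at load time is the design point: the file can be committed or shared while the value lives only in the environment.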


    EPISODE LINKS

    • See release notes for Apache Kafka 3.5
    • Read the blog to learn more
    • Download and get started with Apache Kafka 3.5
    • Watch the video version of this podcast
    11 mins
  • A Special Announcement from Streaming Audio
    Apr 13 2023

    After recording 64 episodes and featuring 58 amazing guests, the Streaming Audio podcast series has amassed over 130,000 plays on YouTube in the last year. We're extremely proud of these achievements and feel that it's time to take a well-deserved break. Streaming Audio will be taking a vacation! We want to express our gratitude to you, our valued listeners, for spending 10,000 hours with us on this incredible journey.

    Rest assured, we will be back with more episodes! In the meantime, feel free to revisit some of our previous episodes. For instance, you can listen to Anna McDonald share her stories about the worst Apache Kafka® bugs she’s ever seen, or listen to Jun Rao offer his expert advice on running Kafka in production. And who could forget the charming backstory behind Mitch Seymour's Kafka storybook, Gently Down the Stream?

    These memorable episodes brought us joy, and we're thrilled to have shared them with you. As we reflect on our accomplishments with pride, we also look forward to an exciting future. Until we meet again, happy listening!

    EPISODE LINKS

    • Top 6 Worst Apache Kafka JIRA Bugs
    • Running Apache Kafka in Production
    • Learn How Stream-Processing Works The Simplest Way Possible
    • Watch the video version of this podcast
    • Streaming Audio Playlist
    • Join the Confluent Community
    • Learn more with Kafka tutorials, resources, and guides at Confluent Developer
    • Live demo: Intro to Event-Driven Microservices with Confluent
    • Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
    1 min
  • How to use Data Contracts for Long-Term Schema Management
    Mar 21 2023

    Have you ever struggled with managing data long term, especially as the schema changes over time? In order to manage and leverage data across an organization, it’s essential to have well-defined guidelines and standards in place around data quality, enforcement, and data transfer. To get started, Abraham Leal (Customer Success Technical Architect, Confluent) suggests that organizations associate their Apache Kafka® data with a data contract (schema). A data contract is an agreement between a service provider and data consumers. It defines the management and intended usage of data within an organization. In this episode, Abraham talks to Kris about how to use data contracts and schema enforcement to ensure long-term data management.

    When an organization sends and stores critical and valuable data in Kafka, more often than not it would like to leverage that data in various valuable ways for multiple business units. Kafka is particularly suited for this use case, but it can be problematic later on if the governance rules aren’t established up front.

    With schema registry, evolution is easy due to its robust security guarantees. When managing data pipelines, you can also use GitOps automation features for an extra control layer. It allows you to be creative with topic versioning, upcasting/downcasting the data collected, and adding quality assurance steps at the end of each run to ensure your project remains reliable.

    Abraham explains that Protobuf and Avro are the best formats to use rather than XML or JSON because they are built to handle schema evolution. In addition, they have a much lower overhead per-record, so you can save bandwidth and data storage costs by adopting them.
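
    A minimal sketch of what "built to handle schema evolution" means in practice: a newer reader schema adds a field with a default, so records written under the older schema still resolve. This is a plain-Python imitation of the idea, not the Avro or Protobuf libraries, and the schema, field names, and function are invented for illustration.

```python
# Sentinel marking a field that has no default and must be present.
_REQUIRED = object()

# Hypothetical version 2 of an order schema: `currency` is new, and its
# default keeps records written with version 1 readable.
READER_SCHEMA_V2 = {
    "order_id": _REQUIRED,
    "amount": _REQUIRED,
    "currency": "USD",
}


def read_with_schema(record: dict, reader_schema: dict) -> dict:
    """Resolve a decoded record against the reader's schema.

    Missing fields take the reader's default; fields the reader does not
    know about are dropped; a missing required field is an error.
    """
    resolved = {}
    for name, default in reader_schema.items():
        if name in record:
            resolved[name] = record[name]
        elif default is _REQUIRED:
            raise ValueError(f"missing required field: {name}")
        else:
            resolved[name] = default
    return resolved


old_record = {"order_id": 1, "amount": 9.5}  # written before `currency` existed
upgraded = read_with_schema(old_record, READER_SCHEMA_V2)
```

    Adding a field without a default would break this read path, which is the kind of change a compatibility check in a schema registry is meant to catch before it ships.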

    There’s so much more to consider, but if you are thinking about implementing or integrating with your data quality team, Abraham suggests that you use schema registry heavily from the beginning.

    If you have more questions, Kris invites you to join the conversation. You can also watch the KOR Financial Current talk Abraham mentions or take Danica Fine’s free course on how to use schema registry on Confluent Developer.

    EPISODE LINKS

    • OS project
    • KOR Financial Current Talk
    • The Key Concepts of Schema Registry
    • Schema Evolution and Compatibility
    • Schema Registry Made Simple by Confluent Cloud ft. Magesh Nandakumar
    • Kris Jenkins’ Twitter
    • Watch the video version of this podcast
    • Streaming Audio Playlist
    • Join the Confluent Community
    • Learn more with Kafka tutorials, resources, and guides at Confluent Developer
    • Live demo: Intro to Event-Driven Microservices with Confluent
    • Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
    57 mins
  • How to use Python with Apache Kafka
    Mar 14 2023

    Can you use Apache Kafka® and Python together? What’s the current state of Python support? And what are the best options to get started? In this episode, Dave Klein joins Kris to talk about all things Kafka and Python: the libraries, the tools, and the pros & cons. He also talks about the new course he just launched to support Python programmers entering the event-streaming world.

    Dave has been an active member of the Kafka community for many years and noticed that there were a lot of Kafka resources for Java but few for Python. So he decided to create a course to help people get started using Python and Kafka together.

    Historically, Java has had the most documentation, and people have often missed how good the Python support is for Kafka users. Python and Kafka are an ideal fit for machine learning applications and data engineering in general, and there are many use cases for building streaming and machine learning pipelines. In fact, a survey of the most popular languages in the Kafka community placed Python second after Java. That’s how Dave got the idea to create a course for newcomers.

    In this course, Dave combines video lectures with code-heavy exercises that show developers what the code looks like and how its classes and functions are structured, so you can get hands-on practice using the library. He also covers building a producer and a consumer and using the admin client. And, of course, there is a module that covers working with the schemas supported by the Kafka library.
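
    For a sense of the shape of such code, here is a minimal producer sketch. It assumes the `confluent-kafka` Python package, which the episode description does not name explicitly; the broker address, topic, and event payload are placeholders, and this is not taken from the course material.

```python
import json


def build_producer_config(bootstrap_servers: str) -> dict:
    """Minimal client configuration; `acks=all` waits for in-sync replicas."""
    return {"bootstrap.servers": bootstrap_servers, "acks": "all"}


def delivery_report(err, msg):
    """Delivery callback: invoked once per message with success or failure."""
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}")


def produce_events(producer, topic, events):
    """Send each event as JSON, keyed by its `id` field."""
    for event in events:
        producer.produce(
            topic,
            key=str(event["id"]),
            value=json.dumps(event),
            on_delivery=delivery_report,
        )
        producer.poll(0)  # serve delivery callbacks from earlier sends
    producer.flush()  # block until outstanding messages are delivered


def main():
    # Assumption: `pip install confluent-kafka` and a broker on localhost:9092.
    from confluent_kafka import Producer

    producer = Producer(build_producer_config("localhost:9092"))
    produce_events(producer, "orders", [{"id": 1, "item": "book"}])
```

    Passing the producer in as an argument keeps `produce_events` easy to exercise with a stand-in object, so the sending logic can be checked without a running cluster.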

    Dave explains that Python opens up a world of opportunity and is ripe for expansion. So if you are ready to dive in, head over to developer.confluent.io to learn more about Dave’s course.

    EPISODE LINKS

    • Blog: Getting Started with Python for Apache Kafka
    • Course: Introduction to Apache Kafka for Python Developers
    • Step-by-step guide: Building a Python client application for Kafka
    • Coding in Motion
    • Building and Designing Events and Event Streams with Apache Kafka
    • Watch the video version of this podcast
    • Kris Jenkins’ Twitter
    • Streaming Audio Playlist
    • Join the Confluent Community
    • Learn more with Kafka tutorials, resources, and guides at Confluent Developer
    • Live demo: Intro to Event-Driven Microservices with Confluent
    • Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
    32 mins
  • Next-Gen Data Modeling, Integrity, and Governance with YODA
    Mar 7 2023

    In this episode, Kris interviews Doron Porat, Director of Infrastructure at Yotpo, and Liran Yogev, Director of Engineering at ZipRecruiter (formerly at Yotpo), about their experiences and strategies in dealing with data modeling at scale.

    Yotpo has a vast and active data lake, comprising thousands of datasets that are processed by different engines, primarily Apache Spark™. They wanted to provide users with self-service tools for generating and utilizing data with maximum flexibility, but encountered difficulties, including poor standardization, low data reusability, limited data lineage, and unreliable datasets.

    The team realized that Yotpo's modeling layer, which defines the structure and relationships of the data, needed to be separated from the execution layer, which defines and processes operations on the data.

    This separation would give programmers better visibility into data pipelines across all execution engines, storage methods, and formats, as well as more governance control for exploration and automation.

    To address these issues, they developed YODA, an internal tool that combines excellent developer experience, DBT, Databricks, Airflow, Looker and more, with a strong CI/CD and orchestration layer.

    Yotpo is a B2B, SaaS e-commerce marketing platform that provides businesses with the necessary tools for accurate customer analytics, remarketing, support messaging, and more.

    ZipRecruiter is a job site that utilizes AI matching to help businesses find the right candidates for their open roles.

    EPISODE LINKS

    • Current 2022 Talk: Next Gen Data Modeling in the Open Data Platform
    • Data Mesh 101
    • Data Mesh Architecture: A Modern Distributed Data Model
    • Watch the video version of this podcast
    • Kris Jenkins’ Twitter
    • Streaming Audio Playlist
    • Join the Confluent Community
    • Learn more with Kafka tutorials, resources, and guides at Confluent Developer
    • Live demo: Intro to Event-Driven Microservices with Confluent
    • Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
    56 mins
  • Migrate Your Kafka Cluster with Minimal Downtime
    Mar 1 2023

    Migrating Apache Kafka® clusters can be challenging, especially when moving large amounts of data while minimizing downtime. Michael Dunn (Solutions Architect, Confluent) has worked in the data space for many years, designing and managing systems to support high-volume applications. He has helped many organizations strategize, design, and implement successful Kafka cluster migrations between different environments. In this episode, Michael shares some tips about Kafka cluster migration with Kris, including the pros and cons of the different tools he recommends.

    Michael explains that there are many reasons why companies migrate their Kafka clusters. For example, they may want to modernize their platforms, move to a self-hosted cloud server, or consolidate clusters. He tells Kris that creating a plan and selecting the right tool before getting started is critical for reducing downtime and minimizing migration risks.

    The good news is that a few tools can facilitate moving large amounts of data, topics, schemas, applications, connectors, and everything else from one Apache Kafka cluster to another.

    Kafka MirrorMaker/MirrorMaker2 (MM2) is a stand-alone tool for copying data between two Kafka clusters. It uses source and sink connectors to replicate topics from a source cluster into the destination cluster.

    Confluent Replicator allows you to replicate data from one Kafka cluster to another. Replicator is similar to MM2, but the difference is that it’s been battle-tested.

    Cluster Linking is a powerful tool offered by Confluent that allows you to mirror topics from an Apache Kafka 2.4/Confluent Platform 5.4 source cluster to a Confluent Platform 7+ cluster in a read-only state, and is available as a fully-managed service in Confluent Cloud.

    At the end of the day, Michael stresses that coupled with a well-thought-out strategy and the right tool, Kafka cluster migration can be relatively painless. Following his advice, you should be able to keep your system healthy and stable before and after the migration is complete.

    EPISODE LINKS

    • MirrorMaker 2
    • Replicator
    • Cluster Linking
    • Schema Migration
    • Multi-Cluster Apache Kafka with Cluster Linking
    • Watch the video version of this podcast
    • Kris Jenkins’ Twitter
    • Streaming Audio Playlist
    • Join the Confluent Community
    • Learn more with Kafka tutorials, resources, and guides at Confluent Developer
    • Live demo: Intro to Event-Driven Microservices with Confluent
    • Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
    1 hr and 2 mins
  • Real-Time Data Transformation and Analytics with dbt Labs
    Feb 22 2023

    dbt is known as being part of the Modern Data Stack for ELT processes. Being in the MDS, dbt Labs believes in having the best of breed for every part of the stack. Oftentimes folks are using an EL tool like Fivetran to pull data from the database into the warehouse, then using dbt to manage the transformations in the warehouse. Analysts can then build dashboards on top of that data, or execute tests.

    It’s possible for an analyst to adapt this process for use with a microservice application using Apache Kafka® and the same method to pull batch data out of each and every database; however, in this episode, Amy Chen (Partner Engineering Manager, dbt Labs) tells Kris about a better way forward for analysts willing to adopt the streaming mindset: Reusable pipelines using dbt models that immediately pull events into the warehouse and materialize as materialized views by default.

    dbt Labs is the company that makes and maintains dbt. dbt Core is the open-source data transformation framework that allows data teams to operate with software engineering’s best practices. dbt Cloud is the fastest and most reliable way to deploy dbt.

    Inside the world of event streaming, there is a push to expand data access beyond the programmers writing the code, and towards everyone involved in the business. Over at dbt Labs they’re attempting something of the reverse: getting data analysts to adopt the best practices of software engineers, and more recently, of streaming programmers. They’re improving the process of building data pipelines while empowering businesses to bring more contributors into the analytics process, with an easy-to-deploy, easy-to-maintain platform. It offers version control to analysts who traditionally don’t have access to git, along with the ability to easily automate testing, all in the same place.

    In this episode, Kris and Amy explore:

    • How to revolutionize testing for analysts with two of dbt’s core functionalities
    • What streaming in a batch-based analytics world should look like
    • What can be done to improve workflows
    • How to democratize access to data for everyone in the business

    EPISODE LINKS

    • Learn more about dbt labs
    • An Analytics Engineer’s Guide to Streaming
    • Panel discussion: If Streaming Is the Answer, Why Are We Still Doing Batch?
    • All Current 2022 sessions and slides
    • Watch the video version of this podcast
    • Kris Jenkins’ Twitter
    • Streaming Audio Playlist
    • Join the Confluent Community
    • Learn more with Kafka tutorials, resources, and guides at Confluent Developer
    • Live demo: Intro to Event-Driven Microservices with Confluent
    • Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
    44 mins
  • What is the Future of Streaming Data?
    Feb 15 2023

    What’s the next big thing in the future of streaming data? In this episode, Greg DeMichillie (VP of Product and Solutions Marketing, Confluent) talks to Kris about the future of stream processing in environments where the value of data lies in the ability to intercept and interpret it.

    Greg explains that organizations typically focus on the infrastructure containers themselves, and not on the thousands of data connections that form within. When they finally realize that they don't have a way to manage the complexity of these connections, a new problem arises: how do they approach managing such complexity? That’s where Confluent and Apache Kafka® come into play - they offer a consistent way to organize this seemingly endless web of data so they don't have to face the daunting task of figuring out how to connect their shopping portals or jump through hoops trying different ETL tools on various systems.

    As more companies seek ways to manage this data, they are asking some basic questions:

    • How to do it?
    • Do best practices exist?
    • How can we get help?

    The next question for companies who have already adopted Kafka is a bit more complex: “What about my partners?” For example, companies with inventory management systems use supply chain systems to track product creation and shipping. As a result, they need to decide which emails to update, if they need to write custom REST APIs to sit in front of Kafka topics, etc. Advanced use cases like this raise additional questions about data governance, security, data policy, and PII, forcing companies to think differently about data.

    Greg predicts this is the next big frontier as more companies adopt Kafka internally. And because they will have to think less about where the data is stored and more about how data moves, they will have to solve problems to make managing all that data easier. If you're an enthusiast of real-time data streaming, Greg invites you to attend the Kafka Summit (London) in May and Current (Austin, TX) for a deeper dive into the world of Apache Kafka-related topics now and beyond.

    EPISODE LINKS

    • What’s Ahead of the Future of Data Streaming?
    • If Streaming Is the Answer, Why Are We Still Doing Batch?
    • All Current 2022 sessions and slides
    • Kafka Summit London 2023
    • Current 2023
    • Watch the video version of this podcast
    • Kris Jenkins’ Twitter
    • Streaming Audio Playlist
    • Join the Confluent Community
    • Learn more with Kafka tutorials, resources, and guides at Confluent Developer
    • Live demo: Intro to Event-Driven Microservices with Confluent
    • Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
    41 mins