Operating Postgres as a data source for your analytics pipelines
April 21–22
The times when analytics systems were just OLAP replicas of Postgres fed by long-running ETL queries are gone. Modern data analysts, with their language models and Jupyter notebooks, are no longer just a nuisance to database administrators: they deliver real-time analytics that businesses use to make mission-critical decisions. Analytics systems demand their data from OLTP systems here and now. This understandable need, combined with the excellent capabilities of Postgres logical replication, takes the DBA into the brave new world of DataOps. Unfortunately, this new world is less about the hydraulic engineering of shiny data lakes and more about the day-to-day plumbing of clogged data pipelines.

In this talk, I will give an overview of established and emerging approaches to change data capture (CDC) in modern DataOps, and explain how a mission-critical OLTP Postgres database can survive and deliver under that load. We will compare approaches such as xmin-based polling and logical replication, and open-source tools such as Debezium, Kafka, Apache Flink, and PeerDB/ClickPipes. Finally, we will discuss the benefits, problems, hazards, and best practices of running Postgres as a data source for solutions built on different combinations of these tools. If you work with applications that rely on large volumes of analytical data to guide business decisions, this talk is for you.
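To make the two capture styles concrete before the talk, here is a minimal Python sketch of both; the table name orders, the slot name analytics_slot, and the connection string are illustrative assumptions, not tools discussed in the talk.

    # Sketch of the two CDC approaches compared in the talk, assuming a
    # hypothetical source table "orders" and psycopg2 as the driver.
    import psycopg2

    conn = psycopg2.connect("dbname=shop")  # illustrative connection string
    conn.autocommit = True
    cur = conn.cursor()

    # Approach 1: xmin-based polling. Each poll re-reads rows whose xmin
    # (the id of the inserting/updating transaction) is newer than the last
    # value we exported. Cheap to set up, but it cannot observe DELETEs and
    # is complicated by transaction-id wraparound.
    last_xmin = 0  # a real pipeline would persist this between runs
    cur.execute(
        "SELECT xmin::text::bigint AS tx, * FROM orders "
        "WHERE xmin::text::bigint > %s ORDER BY 1",
        (last_xmin,),
    )
    for row in cur.fetchall():
        last_xmin = max(last_xmin, row[0])
        # ship the row to the analytics sink here

    # Approach 2: a logical replication slot. The server retains and decodes
    # WAL for us, yielding a true change stream (including DELETEs) at the
    # cost of WAL retention on the primary if the consumer falls behind.
    # Requires wal_level = logical; the slot is created once per pipeline.
    cur.execute(
        "SELECT pg_create_logical_replication_slot"
        "('analytics_slot', 'test_decoding')"
    )
    cur.execute(
        "SELECT lsn, xid, data FROM "
        "pg_logical_slot_get_changes('analytics_slot', NULL, NULL)"
    )
    for lsn, xid, data in cur.fetchall():
        pass  # ship the decoded change to the analytics sink here

Production tools such as Debezium and PeerDB build on the second mechanism, consuming the slot over the streaming replication protocol rather than by one-off SQL polling.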