Pacific Data Integrators' Technology Insights Blog

Changed-Data Capture: The Critical Tool in Your Integration Architecture

Written by Posted by PDI Marketing Team | Jul 27, 2021 6:54:20 AM

When I attack my around-the-house honey-do's, I depend on critical tools like the re-chargeable power drill and power-screwdriver that are in my tool chest. And when I architect data warehouses, there are also tools that I can't imagine working without. 

But I do remember home projects I did years ago when power-drills and data tools such as Changed-Data Capture (CDC) were not yet available. In the same way that I find it difficult to imagine building houses without nails centuries ago, I can't imagine building data lakes, warehouses, and next-gen analytics today without CDC. But my reasoning might be enlightening. Read on.

Changed-Data Capture is a technology installed on the source system to detect changes to data and allow a mechanism to either stream only those changes out of that system or collect those changes for later extraction. 

CDC solves the common problem where source applications don't keep Creation-Date and Changed-Date fields in critical tables that allow us to use straightforward SQL to collect the changes. Source applications are frequently expensive or impossible to alter in order to add this logic and these fields. So instead, CDC gets into the guts of the database system to monitor its change logs and highlight the relevant data.  

Because CDC is monitoring these changes in real-time, extraction of the changed data in real-time and feeding the data stream is a typical use case. Therefore, CDC is usually associated with streaming analytics. Because of this association of CDC with analytics, IT groups frequently ask if the business has streaming analytic requirements to determine if they need Changed-Data Capture. Often, the business doesn't know what they want to use streaming analytics for, so CDC is not purchased. I believe this is a mistake. CDC is not JUST for streaming analytics.  CDC also makes the data integration architecture Better, Faster, AND Cheaper. (Normally, we are forced to pick two of these. With CDC, we can achieve all three!)

What is the alternative to CDC?  When we must extract data without knowing what data has changed (the usual case when we DON'T have CDC), the typical suggested approach is to do Changed-Data Detection (CDD). CDD extracts all the data and adds an md5 checksum to a new field in each record. We then compare this md5 value to the previously stored target system record value, if it exists. If the md5 checksums are the same, we move on to the next record.  If the md5 checksums are different, we either UPDATE the row or add a new row (slowly changing dimensions or some other approach).  CDD is faster than an md5-less brute-force compare of all the record's fields, and the brute-force compare is 10x faster than an avoidable database UPDATE.

But think of how much data has to move through an organization's network and be stored on disk and be processed by cpus for CDD. CDD requires an enormous amount of data, more moving parts, and therefore more places where problems can occur. 

Our goal is to simplify, simplify, simplify, and to capture only changed data AT THE SOURCE. I've worked with over a dozen companies where they re-architected portions of their enormous data warehouse integration logic to incorporate CDC to cut down by 100x the amount of data flowing through its internal corporate pipes.

Think of how much critical CDC is in reducing the amount of data flowing in a hybrid, cloud and on-premise environment.

So, like a re-chargeable power drill or power-screwdriver, CDC is a critical tool to make our integration architecture Faster (streaming analytics), Better (fewer moving parts, less data), and Cheaper (less network, disk, memory, and CPU required). While we usually have to choose two out of three, with CDC, we can get all three. 

In a time of geometrically increasing data volumes in an increasingly 24x7 business environment, CDC is simply the only way to keep up. (Drop the mic. Walk off.)