Real-time Data Processing: Change Data Capture (CDC) with AWS Glue

In an era where data drives business decisions, building capabilities to streamline data management and integration is a critical first step toward generating deep insights.

In complex, fast-moving sectors such as insurance, banking, manufacturing, and healthcare, the data generated often carries many attributes that add layers of complexity. Companies in these sectors therefore need sophisticated data management capabilities to handle and make use of such diverse, intricate data.

One transformative approach to data management is Change Data Capture (CDC), a method that has changed how companies handle ever-growing, dynamic data streams. CDC identifies and captures changes made to data at the source and propagates them to target systems automatically, typically in real time or near real time. It tracks changes such as inserts, updates, and deletes in databases, allowing applications to react promptly. The technique is commonly used in data integration, replication, and synchronization scenarios where it’s essential to keep data consistent across different systems or to enable real-time analytics.
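To make the three change types concrete, here is a minimal sketch that classifies inserts, updates, and deletes by comparing two snapshots of a table keyed by primary key, then replays them on a target. It is an illustration only: production CDC tools usually read the database log rather than diffing snapshots, and the names `diff_snapshots` and `apply_changes` are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Change = Tuple[str, int, Row]


def diff_snapshots(before: Dict[int, Row], after: Dict[int, Row]) -> List[Change]:
    """Classify the changes between two snapshots keyed by primary key."""
    changes: List[Change] = []
    for key, row in after.items():
        if key not in before:
            changes.append(("insert", key, row))
        elif before[key] != row:
            changes.append(("update", key, row))
    for key, row in before.items():
        if key not in after:
            changes.append(("delete", key, row))
    return changes


def apply_changes(target: Dict[int, Row], changes: List[Change]) -> Dict[int, Row]:
    """Replay captured changes on a target so it matches the source."""
    for op, key, row in changes:
        if op == "delete":
            target.pop(key, None)
        else:
            # Inserts and updates both end with the latest row in place.
            target[key] = dict(row)
    return target
```

Replaying the captured changes on a copy of the old snapshot should reproduce the new one, which is the consistency guarantee CDC is meant to provide.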

Because insurance is a dynamically evolving industry, it makes a good lens for examining CDC in practice, using the example of our client, a product insurer. This blog explores how Hexaware overhauled data management for a prominent appliance insurance provider using a CDC framework powered by AWS Glue and Python. We will also look at how Python scripts were used to automate DDL (Data Definition Language) SQL script generation, a move that fully harnessed the capabilities of CDC in AWS Glue.

The initiative for our client not only achieved considerable cost savings but also eliminated the reliance on manual processes.

CDC’s Significance in the Insurance Industry

CDC represents a critical technological advancement for the insurance industry, offering numerous benefits. By embracing the automated process, insurers can improve their data operations and position themselves for success in the increasingly digital and data-driven future.

Making informed decisions quickly is a competitive advantage for insurers. CDC supports this with real-time data integration, ensuring the latest information is available for analysis.

Customer expectations are also higher than ever: customers prefer quick resolutions and personalized services. CDC gives insurers real-time access to customer data across all touchpoints, increasing personalization and enabling highly responsive services.

CDC also automates capturing and integrating changes to data, eliminating manual data entry and reducing reliance on batch processing. This lowers the operational cost of manual labor and minimizes the risk of errors.

More broadly, CDC supports risk management, a competency that lies at the heart of insurance. It provides insurers with real-time data on claims, policy changes, and customer interactions that can be analyzed to identify trends, predict future losses, and adjust premiums accordingly. Ultimately, a comprehensive view of risk helps insurers maintain stability and competitiveness.

AWS Glue for Effective CDC and DDL Operations

CDC (Change Data Capture) plays a crucial role in identifying and capturing database changes so that updated information becomes available quickly in target data warehouses. Traditionally, script-based CDC required detailed manual scripting and constant monitoring, a time-consuming and error-prone process.

DDL (Data Definition Language), for its part, is the subset of SQL (Structured Query Language) used to define, modify, and manage the structure of database objects such as tables, indexes, and views. It encompasses the SQL statements for creating, altering, and maintaining database schemas.

By automating DDL operations through Python, organizations can ensure swift and accurate execution of schema changes.

AWS Glue: Simplifying Data Integration

CDC, DDL, and AWS Glue are intricately interdependent components within modern data management systems. CDC serves as the foundation for identifying and capturing database changes, ensuring that updated information seamlessly flows into target data warehouses.

Automated DDL generation complements CDC by keeping the definitions of database objects in step with the source, enabling swift execution of schema changes.

Meanwhile, AWS Glue acts as the glue that binds these components together, simplifying data integration processes and enhancing overall efficiency.

The seamless integration of CDC, DDL, and AWS Glue streamlines data operations, ensuring that organizations can effectively manage their data with accuracy and agility.

AWS Glue Integration in CDC Processes

The integration of AWS Glue into CDC processes marks a significant advancement in data handling. With its automated data change capture and replication capabilities, AWS Glue eliminates the necessity for complex manual scripting. This not only conserves resources but also diminishes the risk of errors, ensuring that data in target systems remains accurate and up to date.

The Role of AWS Glue in Enhancing Operations

AWS Glue assumes a central role in enhancing database management operations. As a fully managed, serverless data integration service, it simplifies data discovery, transformation, and loading processes. Its compatibility with a wide array of data sources and cloud storage solutions renders it an invaluable tool for modern data management strategies.

Through the strategic integration of AWS Glue and Python, Hexaware has established a new standard in data management, showcasing the profound impact of automation and innovation within the insurance industry.

Understanding CDC in a Use Case Scenario for a Leading Product Insurer

Problem Statement

Our client, a large appliance product insurer, collects tremendous amounts of data from various sources into its on-premises databases (IBM DB2). At these volumes, the on-premises setup created challenges with scalability and performance, along with high upfront and ongoing costs.

Historical data migration was successfully accomplished using AWS Database Migration Service (DMS). However, the CDC process remained a significant challenge. The client needed a solution to track and capture the changes to the data at the source database (IBM DB2) and deliver the change to the target database on the cloud (PostgreSQL) as per the desired schedule.

The data from numerous sources had complex structures spread across more than 120 tables. Writing the DDL SQL scripts by hand to support capturing changes in that data was extremely laborious.

To overcome these challenges, we used Python scripts to automate SQL script generation for DDL and implemented a CDC framework in AWS Glue to efficiently track and capture changed data at the source database and replicate it in the target database.

The Hexaware Solution

We created a CDC framework in AWS Glue that runs daily and migrates changed data from the source (IBM DB2) to the target (PostgreSQL). Next, we wrote Python code to read the system tables at the source and create temporary tables in PostgreSQL. The metadata read from the source also drove the Python scripts that automate DDL SQL script generation. We then rigorously tested the framework on more than 120 tables to support daily migrations from on-premises to the cloud (PostgreSQL). Here’s how we prepared and executed the testing process:

For running CDC on 120+ tables:

  • Prepared DDL for temporary tables, some of which contained 80+ columns.
  • Created merge statements for each table and input them into the configuration table.
  • Generated a list of primary keys for each table.
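As an illustration of the per-table merge statements, here is a minimal sketch that builds a PostgreSQL upsert from a temporary (staging) table using the column list and primary keys. It assumes the merge is expressed as `INSERT ... ON CONFLICT`, which requires a unique constraint on the key columns; the framework's actual statements may differ. The function name `build_upsert_sql` and the table names are hypothetical, and identifiers are assumed to come from trusted catalog metadata.

```python
from typing import Sequence


def build_upsert_sql(
    target: str,
    staging: str,
    columns: Sequence[str],
    key_columns: Sequence[str],
) -> str:
    """Build an upsert that merges staged change rows into the target table."""
    non_keys = [c for c in columns if c not in key_columns]
    col_list = ", ".join(columns)
    conflict = ", ".join(key_columns)
    if non_keys:
        # Overwrite every non-key column with the staged (EXCLUDED) value.
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in non_keys)
        action = f"DO UPDATE SET {updates}"
    else:
        # A table made only of key columns has nothing to update.
        action = "DO NOTHING"
    return (
        f"INSERT INTO {target} ({col_list}) "
        f"SELECT {col_list} FROM {staging} "
        f"ON CONFLICT ({conflict}) {action}"
    )
```

Generating the statement from metadata rather than typing it keeps wide tables, such as those with 80+ columns, free of copy-and-paste mistakes. Deletes are not covered by an upsert and would need a separate statement.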

To validate the CDC framework:

  • Crafted insert statements for every table.
  • Ran validation queries at the source and, after the framework run, at the target to confirm the inserted records matched.
  • Executed update statements at the source, verified the updated records there, and then verified them at the target after the framework run.
  • Executed delete statements at the source, verified the deletions there, and then verified them at the target after the framework run.
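The validation steps above can be sketched as a keyed comparison between rows fetched from the source and rows fetched from the target after a framework run. This is a minimal illustration that assumes both result sets fit in memory as lists of dictionaries; `compare_tables` and its report keys are hypothetical names, not part of the framework.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def compare_tables(
    source_rows: List[Row],
    target_rows: List[Row],
    key_columns: Sequence[str],
) -> Dict[str, List[Tuple]]:
    """Report key-level differences between source and target rows."""

    def index(rows: List[Row]) -> Dict[Tuple, Row]:
        return {tuple(r[k] for k in key_columns): r for r in rows}

    src, tgt = index(source_rows), index(target_rows)
    return {
        # Inserts that did not arrive at the target.
        "missing_in_target": sorted(k for k in src if k not in tgt),
        # Deletes that were not propagated to the target.
        "unexpected_in_target": sorted(k for k in tgt if k not in src),
        # Updates that were not applied, or were applied incorrectly.
        "mismatched": sorted(k for k in src if k in tgt and src[k] != tgt[k]),
    }
```

An empty report for a table after the insert, update, and delete tests suggests the framework propagated all three change types correctly.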

Working Architecture

[Figure: Working Architecture]

Business Benefits

Implementing a Change Data Capture (CDC) framework using AWS Glue and Python for an appliance insurer that had already moved its historical data from IBM DB2 to PostgreSQL delivered significant business and technical benefits, further enhanced by the flexibility and cost-efficiency of AWS’ cloud-based solutions.

  • Near real-time data streaming and analysis accelerated claims processing and policy updates, enhancing customer satisfaction and loyalty.
  • Reduced latency in data processing and analysis ensured swift organizational responses to market changes and customer needs.
  • The client achieved a 30% reduction in annual expenses, highlighting the solution’s remarkable cost efficiency.
  • Automation and CDC not only improved cost efficiency but also significantly lightened workloads, enabling the allocation of resources to strategic initiatives.
  • The realization of business value through streamlined operations, improved risk management, and faster innovation laid a robust foundation for future growth and adaptation.

This efficient method of transferring data from on-premises to the cloud opens the way to building robust cloud-based data warehouses that can serve as data marts for our client’s analytics requirements.

Pseudo Code

Get Table Definition
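Here is a minimal sketch of this step, assuming column metadata is read from the DB2 `SYSCAT.COLUMNS` catalog view and turned into PostgreSQL DDL for a temporary table. The type mapping is a small illustrative subset, the `TEXT` fallback is our assumption, and the names `pg_type` and `build_create_table` are hypothetical. The `?` parameter markers assume a driver that supports them.

```python
from typing import Iterable, Tuple

# Catalog query for one table's columns, in declaration order.
COLUMN_QUERY = """
SELECT COLNAME, TYPENAME, LENGTH, SCALE, NULLS
FROM SYSCAT.COLUMNS
WHERE TABSCHEMA = ? AND TABNAME = ?
ORDER BY COLNO
"""

# Illustrative subset of DB2 types that map directly to PostgreSQL types.
SIMPLE_TYPES = {
    "SMALLINT": "SMALLINT",
    "INTEGER": "INTEGER",
    "BIGINT": "BIGINT",
    "DOUBLE": "DOUBLE PRECISION",
    "DATE": "DATE",
    "TIMESTAMP": "TIMESTAMP",
}


def pg_type(typename: str, length: int, scale: int) -> str:
    """Map a DB2 column type to a PostgreSQL type (assumed mapping)."""
    typename = typename.strip().upper()
    if typename == "DECIMAL":
        return f"NUMERIC({length},{scale})"
    if typename == "CHARACTER":
        return f"CHAR({length})"
    if typename == "VARCHAR":
        return f"VARCHAR({length})"
    # Unknown types fall back to TEXT in this sketch.
    return SIMPLE_TYPES.get(typename, "TEXT")


def build_create_table(table: str, columns: Iterable[Tuple]) -> str:
    """Build CREATE TABLE DDL from rows shaped like COLUMN_QUERY's output."""
    lines = []
    for colname, typename, length, scale, nulls in columns:
        not_null = "" if nulls == "Y" else " NOT NULL"
        lines.append(
            f"    {colname.strip().lower()} {pg_type(typename, length, scale)}{not_null}"
        )
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n)"
```

Because the DDL is derived from catalog rows, the same function can be looped over every table in the schema instead of writing 120+ scripts by hand.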


Get Table Key Info
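Here is a minimal sketch of this step, assuming primary key columns are read by joining the DB2 `SYSCAT.TABCONST` and `SYSCAT.KEYCOLUSE` catalog views. The helper name `primary_key_columns` is hypothetical, and the lower-casing is our assumption about how names are used on the PostgreSQL side.

```python
from typing import Iterable, List, Tuple

# Catalog query for the primary key columns of one table.
PRIMARY_KEY_QUERY = """
SELECT k.COLNAME, k.COLSEQ
FROM SYSCAT.TABCONST c
JOIN SYSCAT.KEYCOLUSE k
  ON k.CONSTNAME = c.CONSTNAME
 AND k.TABSCHEMA = c.TABSCHEMA
 AND k.TABNAME = c.TABNAME
WHERE c.TYPE = 'P' AND c.TABSCHEMA = ? AND c.TABNAME = ?
ORDER BY k.COLSEQ
"""


def primary_key_columns(rows: Iterable[Tuple[str, int]]) -> List[str]:
    """Return key column names in key order from (COLNAME, COLSEQ) rows."""
    ordered = sorted(rows, key=lambda r: r[1])
    return [name.strip().lower() for name, _seq in ordered]
```

The resulting list is what a generated merge statement would need as its match columns, so composite keys must keep their order.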

Get Foreign Key Information
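Here is a minimal sketch of this step, assuming foreign key relationships are read from the DB2 `SYSCAT.REFERENCES` catalog view. One possible use, shown here as our own assumption rather than the framework's documented behavior, is to order tables so that parent tables load before the child tables that reference them. The function name `load_order` is hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set, Tuple

# Catalog query for every foreign key relationship in a schema.
FOREIGN_KEY_QUERY = """
SELECT CONSTNAME, TABNAME, REFTABNAME, FK_COLNAMES, PK_COLNAMES
FROM SYSCAT.REFERENCES
WHERE TABSCHEMA = ?
"""


def load_order(tables: Iterable[str], fk_rows: Iterable[Tuple]) -> List[str]:
    """Order tables so each parent appears before the tables referencing it."""
    graph: Dict[str, Set[str]] = {t: set() for t in tables}
    for _constname, child, parent, _fk_cols, _pk_cols in fk_rows:
        child, parent = child.strip(), parent.strip()
        if child != parent:  # ignore self-referencing keys
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set())
    # static_order() yields each table only after all of its parents.
    return list(TopologicalSorter(graph).static_order())
```

If the foreign keys form a cycle, `TopologicalSorter` raises `graphlib.CycleError`, which would signal that those tables need constraints deferred or a manual ordering.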

Explore New Capabilities: Transform Business Agility and Intelligence

In an era where data is both a critical asset and a challenge, the right tools and expertise can transform potential into success. Comprehensive data management supports these goals, and the agility and accuracy with which an organization responds to change are critical for maintaining a competitive edge.

Effective Change Data Capture (CDC) practices, when implemented with precision and foresight, are transformative. They empower organizations to act swiftly on data insights, ensuring that their strategies are as dynamic as the market they operate in.

Organizations that recognize and harness the power of effective CDC practices position themselves not just to navigate the evolving data landscape but to shape it. Our client’s successful cloud transformation journey is a testament to the power of strategic CDC implementation—an endeavor that redefines not only data management but the very core of business agility and intelligence.

Ready to embark on your journey towards a more agile and data-driven organization? Our expertise in Data and AI leverages new technologies and blueprints roadmaps tailored to your organization’s unique challenges and opportunities. Contact us today!

About the Author

Ashish Anand

Sr. Technical Architect – Cloud Data Practice

Ashish Anand is a seasoned Senior Cloud Architect at Hexaware's Cloud Data Practice. Specializing in seamlessly integrating diverse data sources and establishing robust data lakes for organizations, Ashish is a leading expert in crafting optimized, pluggable, and highly secured cloud solutions utilizing the latest advancements in cloud technology. He has been responsible for developing cloud strategies, evaluating cloud applications and organizing cloud systems to meet the operational needs of the organization.
