Rising data complexity—driven by advances in AI, increasing regulatory demands, and the shift toward treating data as a product—calls for smarter data cloud architectures and deeper integration across analytics ecosystems.
Highlighting this challenge for high-compliance industries, Gartner’s 2025 predictions report that the rise in GenAI and unstructured data is overwhelming traditional governance models, creating an urgent need for a strategic reset in data and analytics approaches.
A Pivot to Metadata-driven Strategy
This blog post shares how we built a metadata-driven, serverless data platform on AWS for a global investment firm. The platform helped the firm automate data tasks, strengthen compliance through clear data lineage, and scale its operations efficiently.
Why Metadata Matters in High-Compliance Areas
Metadata, or “data about data,” helps enterprises manage their information better, especially when rules are strict. It shows where data comes from, how it moves, and how it’s used. This makes data easier to find and trust, speeds up compliance checks, and enables automated reporting and analysis.
AWS for Simple, Secure Data Solutions in Regulated Industries
In industries where data is complex and strict rules add to the challenges, cloud solutions like AWS are very helpful.
Take our client, an investment management firm that operates under strict data guardrails while managing vast and diverse datasets, ranging from market information to client transactions.
Here, metadata-driven serverless data processing on AWS helps them excel at advanced analytics. Let’s explore how.
The Business Need: A Platform to Manage Data Complexity
The client, a leading global investment firm, needed a robust platform to:
- Centralize data ingestion and transformation workflows
- Support metadata-based data pipeline design for analytics
- Drive team collaboration without infrastructure dependency
- Ensure scalability without compromising on cost or complexity
What We Built: A Serverless Data Platform
The objective was to create a unified, serverless data platform that could support the ingestion, processing, and transformation of data across multiple teams and departments.
We aimed to enable seamless data aggregation and analytics while ensuring scalability and operational simplicity. This approach helped the firm streamline operations, accelerate insights, and minimize manual interventions across departments.
Serverless, Metadata-driven Architecture using AWS
Our solution adopted a phased implementation model. The initial release focused on processing CSV-based file ingestions. Future phases are designed to incorporate APIs, SQL integrations, and event-driven streaming pipelines.
We designed a serverless, metadata-driven architecture using AWS-native components to ensure cost-effectiveness, high availability, and minimal operational overhead. Every new serverless data pipeline is driven by metadata configurations, enabling flexible modifications without rewriting code.
Step-by-Step Architecture Flow
The following describes the step-by-step data ingestion and processing architecture on AWS, integrating multiple native services for automated, scalable, and flexible data management:
1. Data Source Ingestion
Data is ingested from multiple sources including:
- Structured datasets like files, databases, and APIs
- Streaming platforms via Amazon MSK
- SaaS applications, unstructured files, and MDM systems
Each ingestion source is tagged with metadata that defines the expected structure, transformation logic, and destination layer.
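To make this concrete, here is a minimal sketch of what such a metadata entry could look like when registered in DynamoDB. The table name, key names, and schema attributes below are illustrative assumptions, not the client’s actual configuration.

```python
import boto3

# Hypothetical metadata entry describing one CSV ingestion source.
# Table name and attribute names are illustrative, not the actual schema.
dynamodb = boto3.resource("dynamodb")
metadata_table = dynamodb.Table("pipeline_metadata")

metadata_table.put_item(
    Item={
        "pipeline_id": "trades_daily_csv",            # unique pipeline key
        "source_type": "csv",                         # csv | sql | api | stream | saas
        "source_location": "s3://landing-bucket/trades/",
        "expected_schema": [
            {"name": "trade_id", "type": "string"},
            {"name": "trade_date", "type": "date"},
            {"name": "amount", "type": "decimal(18,2)"},
        ],
        "transformation_rules": ["trim_strings", "enforce_types"],
        "destination_layer": "bronze",                # bronze | silver | gold
        "destination_path": "s3://data-lake/bronze/trades/",
    }
)
```

A single entry like this carries everything the downstream jobs need: where the data lives, what it should look like, and where it should land.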
2. Ingestion Orchestration
The ingestion process begins with AWS Step Functions, which orchestrate the flow based on the data pipeline configuration:
- For batch workloads, it invokes AWS Glue jobs
- For streaming data, it routes through MSK and AWS Lambda
- For SaaS sources, Amazon AppFlow will be used in a future phase
- For unstructured inputs, a processing framework will convert the data into a structured format
All orchestrations are metadata-driven, pulling job parameters, transformation rules, and destination details from Amazon DynamoDB, which serves as the centralized metadata store.
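As a hedged illustration of this routing logic, the sketch below shows a Lambda-style handler that reads a pipeline’s configuration from DynamoDB and tells the Step Functions workflow which branch to take. The table name, attribute names, and branch labels are assumptions for illustration only.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
metadata_table = dynamodb.Table("pipeline_metadata")  # hypothetical metadata store

def route_pipeline(event, context):
    """Return the ingestion branch for Step Functions to follow,
    based on the pipeline's metadata configuration."""
    config = metadata_table.get_item(
        Key={"pipeline_id": event["pipeline_id"]}
    )["Item"]

    source_type = config["source_type"]
    if source_type in ("csv", "sql"):
        branch = "BATCH_GLUE_JOB"        # invoke an AWS Glue job
    elif source_type == "stream":
        branch = "MSK_LAMBDA"            # route through MSK + Lambda consumers
    elif source_type == "saas":
        branch = "APPFLOW"               # pull via Amazon AppFlow (future phase)
    else:
        branch = "UNSTRUCTURED_PARSER"   # convert unstructured data to structured form

    # A Step Functions Choice state can switch on the returned branch value.
    return {"pipeline_id": config["pipeline_id"], "branch": branch, "config": config}
```

Because the routing decision comes entirely from the metadata record, onboarding a new source type is a configuration change rather than a code change.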
3. Bronze Layer – Raw Storage and Ingestion
Once data is received:
- It is landed into an S3-based Bronze Layer.
- AWS Glue executes ingestion jobs to clean, format, and log raw data based on metadata configurations.
Glue job execution is dynamic, controlled via workflow entries and job info stored in DynamoDB.
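Below is a minimal sketch of what a metadata-driven Bronze ingestion job might look like as a Glue (PySpark) script. The job parameter, table name, and S3 paths are illustrative assumptions.

```python
import sys
import boto3
from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# The pipeline identifier is passed in by Step Functions as a job argument.
args = getResolvedOptions(sys.argv, ["pipeline_id"])

# Pull the ingestion configuration from the DynamoDB metadata store (hypothetical table name).
dynamodb = boto3.resource("dynamodb")
config = dynamodb.Table("pipeline_metadata").get_item(
    Key={"pipeline_id": args["pipeline_id"]}
)["Item"]

spark = SparkSession.builder.appName("bronze_ingestion").getOrCreate()

# Read the raw CSV exactly as delivered; the Bronze layer keeps data unmodified.
raw_df = spark.read.option("header", "true").csv(config["source_location"])

# Land the raw data in the S3-based Bronze layer path defined by the metadata.
raw_df.write.mode("append").parquet(config["destination_path"])
```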
4. Document Management and Enrichment
In parallel, the ingested data is processed by a Document Management System for structural parsing, tagging, or enrichment tasks (e.g., PDF/text parsing). All intermediate files are stored on S3 for auditability.
5. Data Standardization and Silver Layer Transformation
Next, standardized schemas and validations are applied:
- A second set of AWS Glue jobs pulls raw data and applies transformation logic (e.g., field mappings, data type enforcement, lookups).
- Processed data is written into the Silver Layer as Apache Iceberg tables registered in the Glue Data Catalog.
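A hedged sketch of such a Silver-layer transformation is shown below. It assumes the Glue job’s Spark session is configured with an Iceberg catalog (here called glue_catalog) and that field mappings come from the metadata store; the column names and table identifiers are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("silver_transform").getOrCreate()

# Hypothetical field mappings pulled from the DynamoDB metadata entry.
field_mappings = {"trade_id": "trade_id", "trade_date": "trade_date", "amount": "trade_amount"}

bronze_df = spark.read.parquet("s3://data-lake/bronze/trades/")

# Apply metadata-driven renames, then enforce data types.
silver_df = bronze_df.select(
    [F.col(src).alias(dst) for src, dst in field_mappings.items()]
).withColumn("trade_amount", F.col("trade_amount").cast("decimal(18,2)"))

# Append to an Iceberg table registered in the Glue Data Catalog.
silver_df.writeTo("glue_catalog.silver.trades").append()
```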
6. Data Aggregation and Gold Layer Publishing
After standardization:
- Data undergoes aggregation or business rule calculations using additional Glue jobs.
- Aggregated outputs are published to the Gold Layer, again using Iceberg-backed tables in the Glue Data Catalog.
This Gold Layer is the trusted, analytics-ready dataset layer.
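For example, a Gold-layer publishing step could be a short Spark SQL aggregation over the Silver Iceberg table; the business rule and table names below are illustrative assumptions rather than the client’s actual logic.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gold_aggregation").getOrCreate()

# Illustrative business-rule aggregation: total traded amount per day.
gold_df = spark.sql("""
    SELECT trade_date,
           SUM(trade_amount) AS total_traded_amount
    FROM glue_catalog.silver.trades
    GROUP BY trade_date
""")

# Publish the analytics-ready result to the Gold layer as an Iceberg-backed table.
gold_df.writeTo("glue_catalog.gold.daily_trade_summary").createOrReplace()
```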
7. Reporting and Analytics
The final output is consumed by business reporting tools:
- Data from the Gold Layer is exposed to Amazon Athena, or integrated into BI dashboards.
- Reports and dashboards are served to users in a self-service or scheduled manner.
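As a hedged sketch, the Gold layer can also be queried programmatically through Athena in addition to BI dashboards; the database, table, and results-bucket names below are assumptions.

```python
import boto3

athena = boto3.client("athena")

# Run an ad-hoc query against the Gold-layer table exposed through the Glue Data Catalog.
response = athena.start_query_execution(
    QueryString="""
        SELECT trade_date, total_traded_amount
        FROM gold.daily_trade_summary
        ORDER BY trade_date DESC
        LIMIT 30
    """,
    QueryExecutionContext={"Database": "gold"},
    ResultConfiguration={"OutputLocation": "s3://query-results-bucket/athena/"},
)
print("Query started:", response["QueryExecutionId"])
```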
8. Metadata Control & Search
- All data pipeline activities—job status, bookmarks, catalog entries—are stored in DynamoDB.
- A Control Panel (Web UI) allows users to configure new data pipelines, monitor executions, and review audit logs.
- Amazon Kendra enables search across metadata entities, such as the schema properties of the underlying datasets.
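To illustrate the kind of lookup the Control Panel performs, here is a hedged sketch that reads recent run status for a pipeline from a DynamoDB status table; the table design, key names, and attributes are assumptions.

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
runs_table = dynamodb.Table("pipeline_runs")  # hypothetical run-status/audit table

def recent_runs(pipeline_id: str, limit: int = 10):
    """Return the most recent executions for a pipeline, newest first."""
    response = runs_table.query(
        KeyConditionExpression=Key("pipeline_id").eq(pipeline_id),
        ScanIndexForward=False,   # newest first, assuming a sort key such as run_timestamp
        Limit=limit,
    )
    return response["Items"]

for run in recent_runs("trades_daily_csv"):
    print(run["run_timestamp"], run["status"])
```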
9. Data Quality & Audit Logging
- Audit tables are implemented using the Apache Iceberg table format
- Data quality checks and helper functions are triggered during data pipeline execution to ensure correctness and consistency.
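Below is a minimal sketch of the kind of data quality helper that might run during pipeline execution, recording its outcome in an Iceberg-backed audit table. The specific checks, column names, and table identifiers are illustrative.

```python
from datetime import datetime, timezone
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data_quality_checks").getOrCreate()

df = spark.read.table("glue_catalog.silver.trades")

# Illustrative checks: dataset is non-empty and business keys are never null.
row_count = df.count()
null_keys = df.filter(F.col("trade_id").isNull()).count()
passed = row_count > 0 and null_keys == 0

# Record the outcome in an Iceberg audit table for traceability.
audit_record = spark.createDataFrame(
    [(datetime.now(timezone.utc).isoformat(), "silver.trades", row_count, null_keys, passed)],
    ["checked_at", "dataset", "row_count", "null_key_count", "passed"],
)
audit_record.writeTo("glue_catalog.audit.data_quality_results").append()
```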
10. Archival & Retention
- Cold/older data is automatically archived into Amazon S3 Glacier for long-term storage.
- Archival logic is governed by metadata rules around data age or usage frequency.
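As a hedged sketch, metadata-defined retention rules can be translated into an S3 lifecycle configuration that transitions older objects to Glacier; the bucket name, prefix, and day threshold below are assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule derived from metadata: archive Bronze objects older than 180 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-bronze-to-glacier",
                "Filter": {"Prefix": "bronze/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 180, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```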
11. Security, Access, and Monitoring
- IAM roles govern access to data and services
- Secrets used for accessing external systems are managed securely via AWS Secrets Manager
- CloudWatch monitors the serverless data pipeline’s health, and SNS sends out alerts and notifications
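To illustrate the monitoring piece, here is a hedged sketch that creates a CloudWatch alarm on errors from an ingestion Lambda and routes notifications to an SNS topic; the function name, thresholds, and topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical alarm: notify an SNS topic whenever the ingestion Lambda reports errors.
cloudwatch.put_metric_alarm(
    AlarmName="ingestion-lambda-errors",
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "pipeline-router"}],  # illustrative function name
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-platform-alerts"],  # placeholder topic ARN
)
```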
Key Benefits of the New Data Platform
Modern data processing needs to be fast, flexible, and easy to manage. Using metadata to control pipelines lets teams quickly make changes without redeploying.
Combined with AWS’s serverless setup, our approach scales automatically and removes the hassle of managing servers.
Here’s a full view of the benefits:
- Metadata-Driven Flexibility: The serverless data pipeline behaviour is controlled via metadata, allowing quick onboarding and changes without redeployment
- Serverless & Scalable: Zero server management with native AWS scalability built in.
- Team Collaboration: Centralized UI and metadata store enable teams to build, modify, and track data pipelines independently
- Phase-wise Deployment: Allows for incremental delivery, with current support for CSV file ingestion and future support for APIs and SQL
Build Fast, Stay Flexible: AWS for NextGen Enterprise Data Platforms
Our metadata-driven serverless data processing setup shows how AWS-native tools can be combined into a modular, configuration-first data platform. By taking a phased approach, it’s designed to grow and adapt easily, making it a strong fit for businesses that want speed and flexibility in their data modernization journey.
Curious about the right AWS data and analytics services for you? Learn more here.