Our client asked us to modernize their data lake: an open data lake built on Apache Iceberg for their analytical datasets, with the Snowflake platform as the enterprise analytics layer. They wanted a solution that would let them run fast, cost-effective analytics directly on Amazon S3, integrated with the Snowflake platform.
To make this work, they wanted to explore Apache Iceberg with the two most compatible data catalogs: Snowflake Horizon Catalog and AWS Glue Catalog.
But how do you make the choice? Our guide distills our hands-on experience, helping you navigate choices for data catalogs and build a data lake that’s both open and optimized.
But First, Why Apache Iceberg for an Open Data Lake?
Apache Iceberg is an open table format purpose-built for large analytical datasets. It delivers ACID transactions, schema evolution, and time travel—features that make it ideal for high-scale analytics and AI workloads in cloud data warehouse environments.
Paired with the Snowflake platform, Iceberg lets you:
- Store data in Amazon S3 while managing metadata in your chosen catalog.
- Interact with multiple analytics tools and engines (see the sketch after this list).
- Avoid lock-in while retaining enterprise-grade governance.
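As a concrete illustration of that openness, here is a minimal sketch of an engine other than Snowflake reading an Iceberg table whose data files live in S3. It assumes the PyIceberg library with its Glue catalog support and a hypothetical `analytics_db.orders` table; the names and properties are illustrative, not taken from our client's setup.

```python
# Minimal sketch: read an S3-backed Iceberg table through a catalog with PyIceberg.
# Assumes: pyiceberg with Glue support installed (pip install "pyiceberg[glue]")
# and AWS credentials/region available in the environment.
from pyiceberg.catalog import load_catalog

# Point PyIceberg at the AWS Glue Data Catalog (one of several supported catalog types).
catalog = load_catalog("glue", **{"type": "glue"})

# Metadata comes from the catalog; the data files themselves stay in Amazon S3.
table = catalog.load_table("analytics_db.orders")  # hypothetical table

# Scan the table into an Arrow table for local analysis.
arrow_table = table.scan().to_arrow()
print(arrow_table.num_rows)
```

Because the table format and its metadata are open, the same table can also be queried from Snowflake, Athena, Spark, or any other Iceberg-aware engine.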
Our Approach: Matching Catalog Types to Business Needs
In our experience, both Snowflake Horizon Catalog and AWS Glue Catalog work well with Apache Iceberg tables, but they fit different needs.
Choosing Different Data Catalogs for Different Teams
As our client’s teams were all using the Snowflake platform, we set things up so that each team could use the data catalog that suited its working style and requirements.
Here are our practical tips to help data teams choose the right catalog for their needs.
You can evaluate them on three criteria:
- Performance: How fast analytics run on the Snowflake platform.
- Ecosystem Support: How well each catalog works with Apache Iceberg, open-source engines, and Amazon S3.
- Catalog Cost: What each catalog option costs to set up and use.
When Do You Choose Snowflake Horizon Data Catalog?
The Snowflake Horizon Catalog, built into the Snowflake platform, is easy to use and manage, making it a great choice for teams that want Snowflake’s native analytics features.
It supports real-time analytics, strong metadata management, and simple access control, but it can be more expensive and is less flexible with open-source tools.
Here are some decision points:
- Access to Snowflake analytics: Real-time analytics and simple data access control.
- Considerations: Higher cost, less flexibility with open-source tools.
- Best for: Teams seeking minimal setup and managed services.
When Do You Choose AWS Glue Data Catalog?
On the other hand, AWS Glue Catalog is more flexible and cost-effective, especially for teams using AWS or multiple platforms. It’s ideal for building open data lakes that work with tools like Athena, EMR, and Redshift.
Here are some decision points:
- Access to AWS tools: Centralized metadata for tools like Athena and Redshift.
- Considerations: More setup for integration, but better open-source compatibility.
- Best for: Teams needing cost-effective, extensible solutions.
While it takes more setup—especially to connect with Snowflake—it gives you more control and can be extended to fit different needs.
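To give a feel for that centralized metadata layer, here is a minimal sketch of inspecting an Iceberg table registered in the AWS Glue Data Catalog with boto3. The database and table names are hypothetical, and the sketch assumes AWS credentials with Glue read permissions.

```python
# Minimal sketch: inspect Iceberg table metadata in the AWS Glue Data Catalog.
# Assumes boto3 is installed and AWS credentials/region are configured.
import boto3

glue = boto3.client("glue")

# List tables registered in a hypothetical database.
tables = glue.get_tables(DatabaseName="analytics_db")["TableList"]
for t in tables:
    # Iceberg tables are typically registered with a table_type of "ICEBERG".
    print(t["Name"], t.get("Parameters", {}).get("table_type"))

# Fetch one table's definition, including its S3 location.
orders = glue.get_table(DatabaseName="analytics_db", Name="orders")["Table"]
print(orders["StorageDescriptor"]["Location"])
```

Because Athena, EMR, and Spark all read the same catalog entries, this metadata only has to be maintained once.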
Everything to Consider for Data Catalogs with the Snowflake Platform
With the Snowflake platform, both the AWS Glue Catalog and the Snowflake Horizon Catalog can be used to manage metadata and improve data access.
To truly zero in on a choice for each team, here are the strengths of each catalog and how they work best with the Snowflake platform.
| How It Works with Snowflake’s Platform | Snowflake Horizon Catalog | AWS Glue Catalog |
| --- | --- | --- |
| Platform Integration | Seamless integration with Snowflake’s platform and partner ecosystem • Enables rapid deployment and managed services • Simplifies client onboarding and support | Enables partners to offer hybrid and multi-cloud solutions • Supports integration with AWS analytics tools |
| Advantages | Fully integrated with Snowflake’s analytics, security, and governance features • High performance for BI and real-time analytics • Minimal setup and maintenance for clients | AWS-native metadata management • Multi-engine compatibility (Athena, EMR, etc.) • Cost-effective for AWS-centric clients • Supports diverse file formats |
| Disadvantages | Higher cost for compute and storage • Less flexibility for clients needing open-source tool integration outside Snowflake | Requires manual refresh for metadata • Additional setup for secure Snowflake integration • Potentially more complex support model |
| Best Fit Use Cases | Teams seeking minimal setup and managed services • Real-time or low-latency analytics on the Snowflake platform | Teams needing cost-effective, extensible solutions • AWS-centric or multi-engine environments (Athena, EMR, Redshift) |
| When Not to Use | Teams with minimal analytics needs or not using Snowflake • Small-scale data projects where Snowflake’s capabilities are not required | Real-time or low-latency analytics needs • Small, simple data environments • Teams not invested in the AWS ecosystem |
| Cost Model | Consumption-based pricing: pay only for compute and storage used within Snowflake • Predictable billing for managed services | Pay-as-you-go pricing for AWS Glue resources • Potential cost savings for AWS-heavy workloads • Additional costs for cross-platform integration |
Everything to Consider for Data Catalogs with Apache Iceberg
Apache Iceberg is an open table format designed to simplify data processing on large datasets stored in data lakes. It is particularly useful for managing large analytical tables.
Let’s see the benefits of the two catalogs chosen:
Snowflake Horizon Catalog
- Snowflake Horizon lets you manage Iceberg tables that store data in Amazon S3.
- The table information (metadata) is kept inside Snowflake, while the actual data files remain in S3.
- You can easily control access and run queries directly from the Snowflake platform, making management simple and secure (see the sketch below).
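Here is a minimal sketch of that setup using the Snowflake Python connector. The account, external volume, warehouse, and table names are hypothetical placeholders, and the SQL follows Snowflake’s documented syntax for Snowflake-managed Iceberg tables; treat it as a starting point rather than a drop-in script.

```python
# Minimal sketch: create and query a Snowflake-managed Iceberg table stored in S3.
# Assumes: snowflake-connector-python installed, an existing external volume
# pointing at the S3 bucket, and a role with the required privileges.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # hypothetical
    user="my_user",              # hypothetical
    password="***",
    warehouse="ANALYTICS_WH",
    database="LAKE_DB",
    schema="PUBLIC",
)
cur = conn.cursor()

# Metadata is managed by Snowflake Horizon; data files land in the S3 external volume.
cur.execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS orders (
        order_id NUMBER,
        order_ts TIMESTAMP_NTZ,
        amount   NUMBER(10, 2)
    )
    CATALOG = 'SNOWFLAKE'
    EXTERNAL_VOLUME = 'iceberg_s3_vol'
    BASE_LOCATION = 'orders/'
""")

# Query it like any other Snowflake table.
cur.execute("SELECT COUNT(*) FROM orders")
print(cur.fetchone())
conn.close()
```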
AWS Glue Catalog
- AWS Glue Catalog is a central service for storing and managing table information, used by tools like Athena and EMR.
- You can manage Apache Iceberg tables using AWS Glue Studio or other AWS tools.
- If you want to use these tables with Snowflake, you need to set up a catalog integration and configure the right permissions using AWS IAM roles (a sketch of this follows).
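A minimal sketch of that Snowflake-side wiring is shown below, again via the Python connector. It assumes an IAM role that Snowflake is allowed to assume for Glue access and an existing external volume covering the table’s S3 location; the role ARN, account ID, and object names are placeholders, and the exact parameters should be checked against Snowflake’s catalog integration documentation for your account.

```python
# Minimal sketch: expose a Glue-cataloged Iceberg table to Snowflake.
# Assumes: snowflake-connector-python, an IAM role Snowflake may assume for Glue
# access, and an external volume for the table's S3 location (placeholders below).
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="ANALYTICS_WH", database="LAKE_DB", schema="PUBLIC",
)
cur = conn.cursor()

# 1) Catalog integration: tells Snowflake how to read metadata from AWS Glue.
cur.execute("""
    CREATE CATALOG INTEGRATION IF NOT EXISTS glue_cat_int
      CATALOG_SOURCE = GLUE
      CATALOG_NAMESPACE = 'analytics_db'
      TABLE_FORMAT = ICEBERG
      GLUE_AWS_ROLE_ARN = 'arn:aws:iam::111122223333:role/snowflake-glue-access'
      GLUE_CATALOG_ID = '111122223333'
      GLUE_REGION = 'us-east-1'
      ENABLED = TRUE
""")

# 2) Externally managed Iceberg table: Glue owns the metadata, Snowflake queries it.
cur.execute("""
    CREATE ICEBERG TABLE IF NOT EXISTS orders_glue
      EXTERNAL_VOLUME = 'iceberg_s3_vol'
      CATALOG = 'glue_cat_int'
      CATALOG_TABLE_NAME = 'orders'
""")

# Glue-managed metadata changes require a manual refresh on the Snowflake side.
cur.execute("ALTER ICEBERG TABLE orders_glue REFRESH")

cur.execute("SELECT COUNT(*) FROM orders_glue")
print(cur.fetchone())
conn.close()
```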
Everything to Consider for Data Catalogs with Amazon S3
When deciding how to catalog your Apache Iceberg tables on Amazon S3, start by asking: Will engines other than Snowflake (like Athena, EMR, Spark, Databricks, or Redshift) need to read or write these tables?
If the answer is yes, it’s best to use the AWS Glue Data Catalog. This option offers open access and compatibility across multiple analytics platforms, allowing seamless collaboration and flexibility.
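For instance, once the table is registered in Glue, an engine such as Athena can query it without any Snowflake involvement. The sketch below uses boto3 with a hypothetical database, table, and results bucket.

```python
# Minimal sketch: query a Glue-cataloged Iceberg table from Amazon Athena.
# Assumes boto3, AWS credentials, and an S3 location Athena may write results to.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM orders",                         # hypothetical table
    QueryExecutionContext={"Database": "analytics_db"},                # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"}, # hypothetical bucket
)
# Poll get_query_execution / get_query_results with this ID to retrieve the output.
print(response["QueryExecutionId"])
```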
If only Snowflake Horizon will access the Iceberg tables, or you want to take full advantage of Snowflake’s advanced governance and security features, choose Snowflake as the catalog. This keeps everything inside the Snowflake ecosystem and lets you leverage powerful platform-native controls.
Further, there is also a hybrid option: If external engines only need to read (not write) the data, you can primarily use Snowflake for cataloging and governance, while still enabling read-only access for other engines through supported integrations.
This balances Snowflake Horizon’s powerful data management with the openness of AWS Glue for analytics and reporting.
Understanding Data Catalog Pricing Factors
When planning your modern data lake architecture, understanding the cost structure of different catalog services is crucial for budgeting and long-term planning.
Both AWS Glue Catalog and Snowflake Horizon Catalog offer powerful ways to manage metadata and connect analytics tools to data stored on Amazon S3, but they have different pricing models and cost drivers.
This section breaks down the key cost components for each service, including charges for data processing, S3 API requests, and data transfers, helping you make an informed choice based on your workload and usage patterns.
Pricing Factors to Consider for Snowflake Horizon Catalog
- Compute and Storage Consumption: Snowflake’s consumption-based model means you pay for the virtual warehouse compute used to query Iceberg tables and for any storage consumed within Snowflake.
- S3 API Requests: Charges for API requests made to the S3 bucket backing your external volume, such as GET, PUT, LIST, and HEAD requests. These costs vary depending on the number and type of requests.
Pricing Factors to Consider for AWS Glue Catalog
- Crawler Execution: AWS Glue charges for crawler runs based on the number of Data Processing Units (DPUs) used, at approximately $0.44 per DPU-hour.
- ETL Jobs: AWS Glue also charges for ETL jobs by the DPU-hour, at approximately $0.44 per DPU-hour. For example, a job that uses 6 DPUs and runs for 15 minutes costs 6 DPUs × 0.25 hours × $0.44 per DPU-hour = $0.66 (see the worked sketch below).
- Data Transfer: If data is transferred out of the S3 bucket to another region or to the internet, additional data transfer charges apply.
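As a quick sanity check on that arithmetic, here is a tiny sketch of the estimate; the $0.44 per DPU-hour rate is the approximate figure quoted above and varies by region.

```python
# Rough AWS Glue job cost estimate (approximate rate; check current regional pricing).
dpu_count = 6
runtime_hours = 15 / 60          # a 15-minute run
rate_per_dpu_hour = 0.44         # USD per DPU-hour (approximate)

job_cost = dpu_count * runtime_hours * rate_per_dpu_hour
print(f"Estimated Glue job cost: ${job_cost:.2f}")  # -> $0.66
```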
Build Your Open Data Lake to Suit Business Needs
Our solution delivered the best of both worlds: real-time business intelligence through Snowflake Horizon, and the scalability and flexibility of AWS powering Apache Iceberg and the open data lake it supports.
This guide provides a useful reference as you plan to use Apache Iceberg in your enterprise data modernization journey.
Whether performance, price point, or multi-tool support is your top priority, the right data catalog lets you build a modern, open data lake that suits every team’s needs.
Let’s realize the true potential of your business with our data and analytics strategy and Snowflake partnership benefits.