Data Integration Checklist - Environment Setup & Process Design

Posted by Muneeswara C Pandian
March 19th, 2009

The following are the key points that need to be considered when setting up a Data Integration environment.

Data Integration Environment Setup

  • Repository setup and folder structures to hold the development objects (code) such as transformations, mappings, and jobs.
  • Coding standards and development process.
  • Document templates for low-level design specifications and for capturing test cases and results.
  • Version management process for the objects.
  • Backup and restore process for the repository.
  • Code migration process to move objects from one environment to another, for example from development to production.
  • Recommended configuration variables such as commit interval, buffer size, and log file path.
  • User group and security definitions.
  • Integration of the database metadata with the DI metadata, and of the DI metadata with the reporting environment.
  • Process for impact analysis of change requests.
  • Data security needs for accessing production data, and the process of data sampling for testing.
  • Roles and responsibilities of the environment users, such as Administrator and Designer.
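To make the configuration-variable and code-migration points concrete: per-environment settings (commit interval, buffer size, log path) can be kept in one versioned place, so that promoting code between environments changes only configuration, never mappings. The following Python sketch is purely illustrative; the class, field names, and default values are assumptions, not settings of any particular DI tool:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class DIEnvironmentConfig:
    """Per-environment runtime settings; defaults here are illustrative."""
    environment: str                   # e.g. "dev", "test", "prod"
    commit_interval: int = 10_000      # rows processed per commit
    buffer_block_size_kb: int = 64     # I/O buffer block size
    log_path: Path = Path("/var/log/di")

# All environment overrides live in one place, so migrating a job from
# development to production means switching configs, not editing code.
CONFIGS = {
    "dev":  DIEnvironmentConfig("dev",  commit_interval=1_000),
    "prod": DIEnvironmentConfig("prod", commit_interval=50_000),
}

def get_config(env: str) -> DIEnvironmentConfig:
    return CONFIGS[env]
```

A smaller commit interval in development gives faster feedback when debugging; the larger production value trades that for throughput.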

Data Integration Process Design

  • What are the different data sources, and how are they to be accessed?
  • How is the data provided by the source systems? Is it an incremental or a full feed, and how are the incremental records determined?
  • What are the different target systems, and how will the data be loaded?
  • Validation and reconciliation process for the incoming source data.
  • Handling late-arriving dimension records.
  • Handling late-arriving fact records.
  • Building flexibility into the validation and transformation process.
  • Error handling process definition.
  • Table structures for holding the error data and the error messages.
  • Process control (audit) information gathering process definition.
  • Table structures for holding the process control data.
  • Determining reusable objects and their usage.
  • Template creation for commonly used logic such as error handling, SCD handling, etc.
  • Data correction and re-entry process.
  • Metadata capture during the development process.
  • Means of scheduling.
  • Initial data load plan.
  • Job failure handling and restartability methods.
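Several of the points above (determining incremental records, process control tables, restartability) come together in a watermark pattern: the load remembers the highest timestamp it has processed, extracts only rows newer than that, and writes an audit row per run. The sketch below uses SQLite purely for illustration; all table and column names are assumptions, not a prescribed schema:

```python
import sqlite3

# Illustrative source, watermark, and audit tables (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE etl_watermark (source TEXT PRIMARY KEY, last_loaded TEXT);
    CREATE TABLE etl_audit (source TEXT, rows_read INTEGER, high_water TEXT);
    INSERT INTO src_orders VALUES (1, 10.0, '2009-03-01'),
                                  (2, 20.0, '2009-03-10'),
                                  (3, 30.0, '2009-03-18');
    INSERT INTO etl_watermark VALUES ('src_orders', '2009-03-05');
""")

def incremental_extract(conn, source):
    # Read the last high-water mark for this source.
    wm = conn.execute("SELECT last_loaded FROM etl_watermark WHERE source=?",
                      (source,)).fetchone()[0]
    # Pull only rows changed since the last successful run.
    rows = conn.execute(
        f"SELECT * FROM {source} WHERE updated_at > ? ORDER BY updated_at",
        (wm,)).fetchall()
    if rows:
        high = rows[-1][2]
        # Advance the watermark and record an audit row; committing both
        # together keeps a failed run restartable from the old watermark.
        conn.execute("UPDATE etl_watermark SET last_loaded=? WHERE source=?",
                     (high, source))
        conn.execute("INSERT INTO etl_audit VALUES (?, ?, ?)",
                     (source, len(rows), high))
        conn.commit()
    return rows

changed = incremental_extract(conn, "src_orders")
```

Because the watermark only advances inside the same commit as the audit row, a run that fails mid-way simply re-reads from the previous watermark on restart.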

Comments (6)

Lakshminarasimhan - November 10th, 2009

Very good chain. There are two key points that customers expect which should be part of this. Data Integration: tested best practices or reusables for integrating predictable source systems like Oracle ERP to a DW; as GL is always GL everywhere, the design should enforce the use of these reusables. Data Quality: a detailed focus on data cleansing and the DQ life cycle is important from a data purity standpoint.

Kumaran Karthikeyan - October 24th, 2009

Hi Munees, the following points can be added to the list.

1. Process for parallel development of the same application for different releases (e.g., leverage a branching mechanism if the tool provides one, a custom automated process, etc.) - DI Env Setup
2. Process for parallel execution of the same application in a given stage for two different testing needs. For example, the ability to execute the same app and code base in the TEST environment concurrently for two different testing needs. There should be a process to create DB schemas, Unix (app data) file systems, scheduling instances, etc., and the application should be easily configurable to point to the DB/file system/scheduling instances. - DI Env Setup
3. Identify and capture the technical metadata, especially for ABC (Audit, Balance & Control) purposes. - DI Design
4. Utility APIs for any on-demand and special needs. - DI Design

Thanks, Kumaran

Muneeswara C Pandian - September 11th, 2009

Hi Sandeep, the number of CPUs and the amount of RAM required are to be defined for sizing. These can be primarily derived by determining the following:

1. The maximum number of sessions/jobs that would run in parallel.
2. The longest session that would run, basically the session that processes the maximum number of records both while reading and writing.
3. The maximum amount of data to be cached in memory, which can be decided by identifying which data integration processes use functions that require a large amount of memory, such as Sort, Join, Aggregate, and Lookup.
4. The number of other applications with which the ETL engine has to share the CPU and the memory.

If we have utilization details on an existing Data Integration hardware environment, in terms of its CPU and memory usage over a month or so, then that information can also help in arriving at the sizing for a new environment. We will need to map the characteristics between the two environments. Thanks, Pandian
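[Editor's note: the four inputs in the comment above can be turned into a rough back-of-envelope estimate. Every figure in this Python sketch is an illustrative assumption, to be replaced with values measured for the actual workload:]

```python
# Rough sizing sketch from the four inputs above; all numbers are
# illustrative assumptions, not measurements or vendor recommendations.
parallel_jobs    = 8     # (1) max sessions/jobs running concurrently
cores_per_job    = 0.5   # avg CPU cores a session keeps busy, from (2)
cache_heavy_jobs = 3     # (3) jobs using sort/join/aggregate/lookup caches
cache_mb_per_job = 512   # memory each cache-heavy job may pin
base_mb_per_job  = 128   # buffers for an ordinary session
headroom         = 1.3   # (4) 30% margin for co-hosted applications

cpu_cores_needed = parallel_jobs * cores_per_job * headroom
ram_mb_needed = (cache_heavy_jobs * cache_mb_per_job
                 + (parallel_jobs - cache_heavy_jobs) * base_mb_per_job) * headroom
```

With these particular assumptions the estimate works out to about 5 CPU cores and roughly 2.8 GB of RAM; the point is the shape of the calculation, not the numbers.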

sandeep Jain - September 8th, 2009

What should be the possible considerations for sizing an ETL engine (DI job server machine)? For example: platform (UNIX, Linux, Windows), volume of data, type of source (relational, application, files), type of transformation/transforms (type of operation, such as SCD), time window, sampling rate, and commit size. How do these affect the sizing, directly or indirectly? Or does it depend upon the tool you are using, like SAP BO Data Services, Informatica, or DataStage? Any help!

Muneeswara C Pandian - March 20th, 2009

Thanks, Gregory. I have not used Jitterbit; among open source tools I have worked with Kettle.

data integration - March 19th, 2009

Thanks for the post, Muneeswara - many of these are on my list as we look into our own data integration setup. I ask you: do you have experience with Jitterbit? They have an open source data integration solution that we are looking to use. Any info you have will be appreciated. Thank you, Gregory.

Comments are closed.