Posted by Muneeswara C Pandian
June 19th, 2008

Ids in Source System: Sometimes we would have been in a dilemma to decide whether to carry the identity (id) fields from source system as identity fields into the data warehouse as well. There is couple of situations which would push us to this state1.The business users are familiar with the product ids like 1211 , 1212 than by the product name it self and they need them in the target system

2.Why should we create an additional id field in the target table when I can have a unique identity for each record from the source system

What are source Id fields, they are usually the surrogate keys or unique record keys like product id, customer id etc which the business might be more familiar than with their descriptions, descriptions of these ids are more found on report printouts. In general most of the source id fields would get mapped to the dimension tables.

Here are the reasons why we should carry the source id fields as it is

  1. The business is comfortable talking and analyzing in terms of ids than descriptions
  2. Having source ids fields which are usually numeric or if not of smaller length is very much lookup friendly, using ids for lookup or filter or join conditions when pulling data from source systems is much better than descriptions
  3. Source id fields enables linking of the data from the data warehouse to the source system
  4. Just consider the id as another attribute to the dimension and not as a unique identifier

Here are the reasons why we should create additional id (key) field in addition to the source id field

  1. Avoiding duplicate keys if the data to be sourced from multiple systems
  2. The source ids can merge, split, anything can happen, we would want to avoid the dependency on the source system
  3. The id field created in the data warehouse would be index friendly
  4. Having a unique id for each record in the data warehouse would help in determining the number of unique records in a much easier way

Dates in Source System: One other field that we usually confuse is the date field in the source systems. The date field present in the source record might provide the information when the record arrived in the source system and a date field generated in the data warehouse system would provide the information when a source record arrived into the data warehouse.

The data warehouse record date and the source record date can be same if the source record gets moved into the data warehouse the same day, certainly both date fields might represent different date values if there is a delay in the source data arrival into the data warehouse.

Why we need to store Source Date in the data warehouse, this need is very clear, we always perform date based analysis based on the arrival of source record in the source systems.

Why we need to generate a Data Warehouse Date, capturing the arrival of the record into the data warehouse answers queries related to audit, data growth and as well to determine what new records arrived into the warehouse which is especially useful for providing incremental extracts for downstream marts or other systems.

Comments (0)