Data Integration Challenge – Initial Load - I
In a data warehouse all tables usually go through two phases of data load process they are the initial load and the incremental load. ‘History Load’ or ‘Initial Seeding/Load’ involves a one time load of the source transaction system data of the past years into the data warehouse system. The process of adding only the new records (updations or insertions) to the data warehouse tables either daily or on a predefined frequency is called ‘Incremental Load‘. Also certain tables that are of small in size and largely independent set of tables which receives full data (current data + history data) as input would be loaded by means of a ‘Full Refresh‘; this involves complete delete and reload of data.
Especially code tables would usually under go a one time initial load and may not be required for a regular incremental load, incremental load is common for fact tables. Initial Load of a data warehouse system is quite a challenge in terms of getting it completed successfully within a planned timeframe. Some of the surprises or challenges faced in completing the history load are
- Handling invalid records
- Data Reconciliation
- System performance
- Catching up
Handling Invalid Records:
The occurrence of an invalid record becomes much more prominent as we process the history data which was collected into the source system much long before and the history data might not fit into the current business rules. The records from a source system can become invalid in the data warehouse due to multiple reasons like invalid domain value for a column or null value for a non null-able field or aggregate data not matching to the detail data. The ways of handling this problem effectively are
- Determine the years of data to be loaded into the data warehouse very initially and ensure that the data profiling is performed on the sample data for all the years that has to be loaded. This ensures that most of the rules of data validation are identified up front and built as part of ETL process. In certain situations we may have to build separate data validation and transformation logic based on the year and data
- Especially in situations like re-platforming or migrating the existing data warehouse to a new platform, even before running the data through regular ETL process we might need to load the old data into a data validation (staging) area through which the data analysis is done, cleaned and then data loaded into the data warehouse through regular ETL process
- Design the ETL process to divert all the key values of the invalid records to a separate set of tables. In some sites we see that the customer just needs to be aware for the error records and fine if these records doesn’t get aligned into the current warehouse, but at times the invalid records are corrected and reloaded
- For certain scenarios like aggregate data not matching to detail data, though we might always derive aggregate from detail data at times we might also generate detail data to match aggregate data
Once the initial load from the source system into the data warehouse has been completed we have to validate to ensure that the data has been moved in correctly.
- Having a means of loading records in groups separated by years or any logical grouping like by customer or product would give a better control in terms of data validation. In general data validations performed are like count and sum should be tied to certain business specific validation rules like all customers from region ‘A’ belonging to division ‘1’ in the source should be classified under division ‘3’ in the current warehouse.
- All the validations that needs to be performed after the initial load for each data group has to prepared and verified with the business team, many a times the data is validated by the business as a adhoc query process though the same can be verified by an automated ETL process by the data warehouse team
We shall discuss further on the other challenges in Part II.