We had discussed three options by which an instantaneous data refresh of a table can be achieved, as follows.
‘Partition Based’ covers the ‘Partition Exchange’ option as well as the ‘Partition Key’ option suggested by Karthik Srinivasan in the comments.
In the ‘Partition Key’ scenario, two partitions exist for a table, but they are not exchanged. After the ETL load, one of the partitions holds the latest data. Which partition is to be accessed is determined from a work table: the work table has a flag that is set as part of the ETL load process, and this flag indicates which partition key is to be used in the reporting.
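The flag-driven switch above can be sketched as follows. This is a minimal simulation, not the Oracle implementation: SQLite stands in for the database, two plain tables stand in for the two partitions, and all table and column names (`sales_p1`, `work_flag`, etc.) are hypothetical.

```python
import sqlite3

# Sketch of the 'Partition Key' idea: two staging tables emulate the
# two partitions; a work-table flag tells reporting which one to read.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE sales_p1 (id INTEGER, amount REAL)")
cur.execute("CREATE TABLE sales_p2 (id INTEGER, amount REAL)")
# Work table: the flag names the 'partition' reporting should use.
cur.execute("CREATE TABLE work_flag (active_partition INTEGER)")
cur.execute("INSERT INTO work_flag VALUES (1)")
cur.execute("INSERT INTO sales_p1 VALUES (1, 100.0)")  # current data
conn.commit()

def etl_load(rows):
    """Load fresh data into the inactive partition, then flip the flag."""
    active = cur.execute("SELECT active_partition FROM work_flag").fetchone()[0]
    target = 2 if active == 1 else 1
    cur.execute(f"DELETE FROM sales_p{target}")
    cur.executemany(f"INSERT INTO sales_p{target} VALUES (?, ?)", rows)
    # Flipping the flag is the 'instantaneous' switch-over step.
    cur.execute("UPDATE work_flag SET active_partition = ?", (target,))
    conn.commit()

def report_total():
    """Reporting reads only the partition the flag points at."""
    active = cur.execute("SELECT active_partition FROM work_flag").fetchone()[0]
    return cur.execute(f"SELECT SUM(amount) FROM sales_p{active}").fetchone()[0]

etl_load([(1, 150.0), (2, 50.0)])   # refresh lands in the inactive partition
print(report_total())               # reporting now sees the fresh data: 200.0
```

Note that readers are never blocked: the load happens entirely in the inactive partition, and only the one-row flag update makes it visible.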
Comparison of the three options
| Options \ Criteria | Full Load Possible (Truncate & Load) | Additional Disk Space | Occurrence of Data Inconsistency | Additional Memory Required |
|---|---|---|---|---|
| Partition Based | Yes | Twice the size of the actual table | Yes | No |
| Synonyms | Yes | Twice the size of the actual table | Yes | No |
| Bulk Commit | No | Actual table size + space to hold the incremental data | No | Yes, to hold the uncommitted data |
All the criteria used for the comparison are self-explanatory, except for ‘Occurrence of Data Inconsistency’.
What is ‘Occurrence of Data Inconsistency’? In any of the three options, if there can be a state where the data across a set of tables is not refreshed synchronously, then we have an ‘occurrence of data inconsistency’.
Let us take the scenario of refreshing a set of 5 tables. In the ‘Partition Based’ or ‘Synonym’ option, if the switch-over script (the script that performs the partition exchange, the flag update, or the synonym replace) fails after the 2nd table, then at that point of failure 3 tables will be referring to the old data and 2 tables will be referring to the new data. This is ‘data inconsistency’, and it can occur in both the ‘Partition Based’ and ‘Synonym’ options.
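The failure mode above can be reproduced in a small sketch. SQLite has no synonyms, so views stand in for them here, and all names (`t1`…`t5`, `_old`/`_new` suffixes) are made up for illustration; the point is only that each switch is a separate statement, so a mid-script failure leaves a mixed state.

```python
import sqlite3

# Simulate the 'Synonym' switch-over for 5 tables, failing midway.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

tables = [f"t{i}" for i in range(1, 6)]          # the set of 5 tables
for t in tables:
    cur.execute(f"CREATE TABLE {t}_old (v TEXT)")
    cur.execute(f"INSERT INTO {t}_old VALUES ('old')")
    cur.execute(f"CREATE TABLE {t}_new (v TEXT)")
    cur.execute(f"INSERT INTO {t}_new VALUES ('new')")
    # Reporting reads through the 'synonym' (a view in this sketch).
    cur.execute(f"CREATE VIEW {t} AS SELECT * FROM {t}_old")

def swap_synonyms(fail_after=None):
    """Repoint each view at the new table; optionally die partway through."""
    for i, t in enumerate(tables, start=1):
        cur.execute(f"DROP VIEW {t}")
        cur.execute(f"CREATE VIEW {t} AS SELECT * FROM {t}_new")
        if fail_after is not None and i == fail_after:
            raise RuntimeError(f"script died after table {i}")

try:
    swap_synonyms(fail_after=2)    # the swap script fails after the 2nd table
except RuntimeError:
    pass

# 2 tables now show the new data while 3 still show the old data.
state = [cur.execute(f"SELECT v FROM {t}").fetchone()[0] for t in tables]
print(state)   # ['new', 'new', 'old', 'old', 'old']
```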
The ‘Bulk COMMIT’ option is a plain SQL operation; we need to issue a COMMIT statement to complete the transaction. So in the case of a set of 5 tables, the COMMIT statement is placed at the end of the fifth table load. Having COMMIT as the last statement ensures that any script failure while refreshing the data across the set of tables rolls back the data loaded into the other tables as well; hence we can never have a ‘data inconsistency’ scenario in the ‘Bulk COMMIT’ option.
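A minimal sketch of that single-COMMIT refresh, again with SQLite standing in for the real database and hypothetical table names: all 5 loads run inside one transaction, so a failure before the final COMMIT reverts everything. Note the loads use DELETE rather than TRUNCATE, since TRUNCATE is DDL and would commit implicitly, which is exactly why the comparison table marks ‘Full Load Possible (Truncate & Load)’ as No for this option.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.isolation_level = None          # manage the transaction by hand
cur = conn.cursor()

tables = [f"t{i}" for i in range(1, 6)]
for t in tables:
    cur.execute(f"CREATE TABLE {t} (v TEXT)")
    cur.execute(f"INSERT INTO {t} VALUES ('old')")

def bulk_refresh(fail_after=None):
    """Refresh all 5 tables in ONE transaction with ONE trailing COMMIT."""
    cur.execute("BEGIN")
    try:
        for i, t in enumerate(tables, start=1):
            cur.execute(f"DELETE FROM {t}")            # DML, so it can roll back
            cur.execute(f"INSERT INTO {t} VALUES ('new')")
            if fail_after is not None and i == fail_after:
                raise RuntimeError(f"load failed on table {i}")
        cur.execute("COMMIT")        # the single COMMIT at the very end
    except Exception:
        cur.execute("ROLLBACK")      # all 5 tables revert together

bulk_refresh(fail_after=3)           # failure midway: no table changes at all
print([cur.execute(f"SELECT v FROM {t}").fetchone()[0] for t in tables])
```

Calling `bulk_refresh()` again without a failure flips all 5 tables to the new data in one atomic step.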
Which is the best option? I have seen implementations of ‘Bulk Commit’ as well as ‘Partition Based’, but not of ‘Synonym’. Usually ‘Partition Based’ is seen as an easier and quicker way of getting things done, with the downside of having to hold twice the storage space required for the actual table.
Using ‘Bulk Commit’ requires good logic in terms of table grouping, as well as controlled commits for memory management. If we need an optimized environment with complete control over what data gets refreshed, with minimal disk space and with no risk of data inconsistency, then I would recommend the ‘Bulk Commit’ option.
Let me know your thoughts…