Data Integration: What to Document and Where
‘Documentation’ provides details about the process and logic present in a software engineering system. It is a must when a system is not easy to understand through a code walkthrough.
Does a Data Integration system developed with Informatica, DataStage or other GUI-based tools require documentation? If so, how do we document the details?
The two documents that we prepare while building a Data Integration environment are:
- High Level Design (HLD) document
- Low Level Design (LLD) document or the Program Specifications
HLD documents are a must. The HLD captures details such as development standards, how to access the data, and other information related to the design of the data integration process. It is prepared before development starts and is then updated whenever new design principles are introduced into the data integration environment.
LLD documents describe what an individual mapping or job performs. In a GUI-based tool, each job is built as a data flow diagram in which the logic is depicted pictorially; with Informatica the details are richer still. LLDs are therefore not really required in a GUI environment: could a written document convey more than what is already represented pictorially?
When do we need LLD?
- LLDs are required for processes that are built in COBOL, ‘C’ or any other programming language.
- We may also need one if the ETL tool is used only as a scheduling tool, with all the transformation logic built into external database procedures or programs.
- We may also need an LLD during the initial design of a job, to capture the design steps in a document before coding starts. Once the design captured in the document has been coded into the ETL tool, maintaining such an LLD outside the ETL tool is no longer required.
How can we avoid LLD?
Will a pictorial representation alone be enough? No. We need to follow the steps below to ensure that the metadata captured in an ETL tool is rich enough to support documentation generation.
- All data processing logic has to be built into the ETL tool using the transformation components provided.
- Do not embed logic in free-hand SQL. Tools like Informatica provide features to capture the ‘WHERE’ and ‘FROM’ clauses separately, so that the whole SQL statement (for a source or a lookup) need not be typed as free text.
- All comments about the process should be provided at the mapping, transformation, field and formula expression levels. Add descriptive labels wherever the tool allows.
- Whenever an ETL job is enhanced or changed, add comments describing the change at every point it affects.
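One way to enforce the commenting steps above is to scan the job metadata for objects that have no description. The sketch below is only an illustration: the `mapping` dictionary and its keys are a hypothetical, simplified structure, not the export format of Informatica, DataStage or any other tool.

```python
# Hypothetical, simplified mapping metadata. A real ETL repository export
# has a different, tool-specific structure.
mapping = {
    "name": "m_load_customer",
    "description": "Loads cleansed customer records into the warehouse.",
    "transformations": [
        {
            "name": "exp_clean_name",
            "description": "Trims and upper-cases customer names.",
            "fields": [
                {"name": "cust_name", "description": "Cleansed name"},
                {"name": "load_date", "description": ""},
            ],
        },
        {"name": "fil_active", "description": "", "fields": []},
    ],
}


def has_description(obj):
    """True if the object carries a non-blank description."""
    return bool((obj.get("description") or "").strip())


def find_missing_descriptions(mapping):
    """Return dotted paths of all objects that lack a description."""
    missing = []
    if not has_description(mapping):
        missing.append(mapping["name"])
    for transformation in mapping.get("transformations", []):
        t_path = f'{mapping["name"]}.{transformation["name"]}'
        if not has_description(transformation):
            missing.append(t_path)
        for field in transformation.get("fields", []):
            if not has_description(field):
                missing.append(f'{t_path}.{field["name"]}')
    return missing


if __name__ == "__main__":
    for path in find_missing_descriptions(mapping):
        print("Missing description:", path)
```

A check of this kind could run whenever a job is changed, so that gaps in the comments are caught before they reach the generated documentation.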
If we follow the above steps, the program specification or LLD can simply be generated from the metadata of the tool. The idea is that every job is developed with proper comments, so that the LLD document can be generated from the captured metadata.
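As a minimal sketch of what such generation could look like, the code below turns commented job metadata into a plain-text LLD. The `sample_mapping` structure and the `generate_lld` function are assumptions made for illustration; a real implementation would read the metadata from the ETL tool's own repository or export files.

```python
# Hypothetical, simplified job metadata; the keys are illustrative only.
sample_mapping = {
    "name": "m_load_orders",
    "description": "Loads validated orders into the warehouse.",
    "transformations": [
        {
            "name": "exp_order_total",
            "description": "Derives the order total.",
            "fields": [
                {
                    "name": "order_total",
                    "description": "Quantity multiplied by unit price",
                    "expression": "quantity * unit_price",
                },
            ],
        },
    ],
}


def generate_lld(mapping):
    """Build a plain-text LLD from the comments stored in the metadata."""
    lines = [
        f'Mapping: {mapping["name"]}',
        f'Purpose: {mapping["description"]}',
        "",
    ]
    for transformation in mapping["transformations"]:
        lines.append(f'Transformation: {transformation["name"]}')
        lines.append(f'  Description: {transformation["description"]}')
        for field in transformation["fields"]:
            lines.append(
                f'  Field {field["name"]}: {field["description"]} '
                f'[{field["expression"]}]'
            )
        lines.append("")
    # Drop the trailing blank line and end with a single newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(generate_lld(sample_mapping), end="")
```

The quality of the generated document depends entirely on the comments entered in the tool: an empty description in the metadata produces an empty line in the LLD.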
Currently, data integration tools do not provide features that support the data integration design phase, such as defining the data sources, the access methods and other process design needs within the tool. There have been attempts to integrate Visio with ETL tools, but there is still a long way to go before we can decide whether maintaining external HLD documents can be avoided.