This article applies to: Enterprise Data Services
Data warehousing broadly includes building a data mart (DM); building new tools or using existing tools within the DM; tuning the DM; and extracting, transforming, and loading (ETL) data into the data warehouse.
Extract, Transform, Load (ETL)
ETL tools are designed to aid in the transformation of data required by the data model or by the reporting tool. Examples include derived values such as flags, calculated fields, concatenated fields, surrogate keys, aggregated data, and lookup values. ETL tools can also identify changed source records so they may be flagged in the target data mart to maintain record history in the mart. All new ETL developed by the Enterprise Data Services team is built with WhereScape RED, and all code written in the prior ETL tool, IBM Data Manager, will eventually be migrated to WhereScape RED.
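The kinds of transformations listed above can be illustrated with a minimal sketch. This is generic Python, not WhereScape code, and the table, column, and function names are hypothetical:

```python
import hashlib

# Hypothetical source rows; the column names are illustrative only.
source_rows = [
    {"emp_id": 101, "first": "Pat", "last": "Lee", "salary": 52000},
    {"emp_id": 102, "first": "Sam", "last": "Roy", "salary": 87000},
]

def row_hash(row):
    """Hash the source columns so a changed record can be detected on a later load."""
    return hashlib.sha256("|".join(str(row[k]) for k in sorted(row)).encode()).hexdigest()

def transform(rows):
    out = []
    for surrogate_key, row in enumerate(rows, start=1):
        out.append({
            "emp_key": surrogate_key,                       # surrogate key for the mart
            "emp_id": row["emp_id"],                        # natural/business key
            "full_name": f"{row['last']}, {row['first']}",  # concatenated field
            "high_earner_flag": row["salary"] > 75000,      # derived flag
            "src_hash": row_hash(row),                      # change-detection hash
        })
    return out

mart_rows = transform(source_rows)
```

On a subsequent load, comparing a freshly computed `src_hash` against the stored one is one common way to flag changed records and preserve history in the mart.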
Data Replication/Data Staging
For large data marts, running the ETL that populates the marts from the transactional system databases can take many hours. If records are updated in the transactional system during ETL processing, the result can be data inconsistencies in the target data mart. A second challenge is the load that long-running ETL jobs place on the source transactional database. Data staging areas address both challenges: a snapshot of the transactional database guarantees a consistent, frozen copy of the needed source tables and shields the source production database from the lengthy ETL process.
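The idea of a frozen staging copy can be sketched with SQLite as a stand-in for the real databases; the `orders` table and `staging` schema names are illustrative:

```python
import sqlite3

# Stand-in "transactional" database with one source table.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])
src.commit()

# Snapshot the needed source table into a staging database, so the ETL
# reads a frozen, consistent copy instead of the live source.
src.execute("ATTACH DATABASE ':memory:' AS staging")
src.execute("CREATE TABLE staging.orders AS SELECT * FROM main.orders")
src.commit()

# Later updates to the live table do not affect the staged snapshot.
src.execute("UPDATE main.orders SET amount = 99.0 WHERE id = 1")
staged = src.execute("SELECT amount FROM staging.orders WHERE id = 1").fetchone()[0]
```

The ETL then runs entirely against the staging copy, leaving the production database untouched for however long the load takes.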
The Enterprise Data Services team uses Attunity Replicate to incrementally synchronize the PeopleSoft and Kuali data staging areas for the nightly ETL jobs to the data marts. With a data replication tool like Attunity, the entire transactional database does not need to be copied each night; instead, the data staging database is simply synchronized with the source transactional system. The ability to synchronize the data staging database with minimal data movement is becoming more important as Cornell moves its transactional applications to the cloud.
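Incremental synchronization boils down to applying a change set rather than recopying everything. The sketch below is a generic illustration of that idea, not Attunity Replicate's actual mechanism; the record and department names are made up:

```python
# Staging copy keyed by business key; a change-data-capture tool supplies
# only the rows that were inserted, updated, or deleted since the last sync.
staging = {101: {"name": "Pat", "dept": "HR"},
           102: {"name": "Sam", "dept": "IT"}}

changes = [
    ("update", 101, {"name": "Pat", "dept": "Finance"}),
    ("insert", 103, {"name": "Kim", "dept": "IT"}),
    ("delete", 102, None),
]

def apply_changes(staging, changes):
    """Apply a CDC-style change set to the staging copy in order."""
    for op, key, row in changes:
        if op == "delete":
            staging.pop(key, None)
        else:  # insert or update both upsert the row
            staging[key] = row
    return staging

apply_changes(staging, changes)
```

Because only the changed rows move, the nightly data movement stays small even when the source system lives in the cloud.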
High Data Availability
Another challenge in data warehousing is minimizing the outage of the data marts during the nightly ETL processing. In traditional data warehousing, the data marts are unavailable while the nightly ETL runs. More significant, a failure during ETL processing can lead to an extended warehouse outage while the ETL problems are solved. High data availability is the name given to approaches that avoid outages related to ETL processing. All of these approaches involve maintaining and updating multiple copies of the tables or databases being refreshed or updated.
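One common multiple-copy approach is a double-buffered (sometimes called "blue/green") swap, sketched below under assumed names; this is an illustration of the general technique, not how DMTools is implemented:

```python
# Two copies of the same mart table: readers query the active copy while the
# lengthy ETL loads the inactive one; the switch itself is nearly instant.
tables = {"sales_a": [("2024-01-01", 100)], "sales_b": []}
active = "sales_a"  # the copy readers currently see

def inactive():
    return "sales_b" if active == "sales_a" else "sales_a"

def nightly_load(new_rows):
    global active
    tables[inactive()] = new_rows  # long ETL runs against the offline copy
    active = inactive()            # quick switch; readers never see a partial load

nightly_load([("2024-01-01", 100), ("2024-01-02", 140)])
```

If the ETL fails midway, the switch never happens, so readers simply keep using yesterday's copy instead of suffering an extended outage.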
The Enterprise Data Services team uses DMTools for all of our data mart load processes to provide high data availability, central status and detail logging of data mart load processes, and load status notification.