CEP as sauce for alphabet soup (Part 9): ETL

vincent
Sat, 24 Nov 2007 21:54:36 +0000
Extract, Transform and Load - sounds like a command from a Dalek, but is really a whole sub-industry of supporting acts for the data warehouse community (with some uses as well in integrating various operational database-oriented systems, usually via batch 'catch-up' processes).
The rationale behind ETL is that one needs to get operational data from various operational databases (with schemas optimized for operational use) into a data warehouse for analysis / reporting / analytics (with a schema optimized for a 'higher level' viewpoint). So you need to extract the data, run transformations on it (e.g. filters, aggregation queries, correlations) to produce the data format required for the data warehouse, and then do some (usually batch) load operation, in a way that minimizes the impact on operational system performance, so that eventually the data warehouse users can start slicing and dicing the data…
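To make that concrete, here is a minimal batch ETL sketch (plain Python, purely illustrative - the orders and daily_customer_totals tables are hypothetical, and sqlite3 simply stands in for the operational and warehouse databases):

    # Hypothetical batch ETL: extract today's rows from an operational store,
    # transform them (data-quality filter + per-customer aggregation), and
    # bulk-load the result into a warehouse table in one go.
    import sqlite3

    def run_batch_etl(operational_db, warehouse_db):
        src = sqlite3.connect(operational_db)
        dst = sqlite3.connect(warehouse_db)

        # Extract: pull the day's orders from the operational schema.
        rows = src.execute(
            "SELECT customer_id, amount FROM orders "
            "WHERE order_date = date('now')").fetchall()

        # Transform: drop bad records, aggregate per customer.
        totals = {}
        for customer_id, amount in rows:
            if amount is None or amount <= 0:
                continue
            totals[customer_id] = totals.get(customer_id, 0) + amount

        # Load: one batch insert into the warehouse fact table.
        dst.executemany(
            "INSERT INTO daily_customer_totals (customer_id, total) VALUES (?, ?)",
            list(totals.items()))
        dst.commit()
        src.close()
        dst.close()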
So what relevance does CEP have to this batch world of uber-databases? Well, it's the usual issue of real-time (responsiveness) versus batch (it's ready when it's ready) architectures and benefits. It's even hinted at in the Wikipedia article on ETL (as of the time of writing, anyway - Wikipedia content changes constantly) - excerpt follows…
Drawbacks to ETL
As the number of highly connected computers in any data exchange grows, ETL suffers from exponentially increasing costs. See Metcalfe's Law. A solution to ETL cost growth is to use XML standards on an Enterprise Service Bus.
So, the CEP industry will say: instead of expensive and expansive batch operations on your data, why not treat the data in real-time as events?
Ah-ha! But surely CEP systems cannot handle the volumes or potential insights we are talking about? Well, probably they can.
To partially prove the point, there are ETL users who are augmenting their toolkits with rule engines for complex transformations (indeed, the use of rule engines for complex systems integration goes back a long way, to at least the early 90s) - and then quickly realizing that the rule engine does all the transformations [*1], with all the integrations [*2] and performance [*3] they need. A rule-driven CEP engine lets them do this in 'real-time', too [*4]. And you can easily see where other CEP techniques can be used here (e.g. event stream processing).
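To make the contrast concrete, here is a hypothetical sketch (plain Python, not any particular CEP or rule-engine product) of the same filter-and-aggregate work done event by event, so the warehouse-style view stays current instead of waiting for the next batch window:

    from collections import defaultdict

    class OrderEventProcessor:
        """Keeps a running per-customer total as each order event arrives."""
        def __init__(self):
            self.totals = defaultdict(float)

        def on_event(self, event):
            # Filter on content, as the batch transform would, but per event.
            amount = event.get("amount")
            if amount is None or amount <= 0:
                return
            # Aggregate incrementally: the "warehouse view" is always current.
            self.totals[event["customer_id"]] += amount

    processor = OrderEventProcessor()
    processor.on_event({"customer_id": "c-42", "amount": 19.99})
    processor.on_event({"customer_id": "c-42", "amount": -1.0})   # filtered out
    print(processor.totals["c-42"])   # 19.99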
Notes:
[*1] Transformations can include filtering for data quality based on content or metadata, aggregation / comparison across multiple sources, and so forth. In a rule engine these are carried out in memory: the data is loaded into the rule engine first and then transformed. For simpler cases a stateless rule engine will suffice (see the sketch after these notes).
[*2] Most of the first-generation rule engines are designed to be good citizens and integrate with many standard data sources. New-generation engines like TIBCO BusinessEvents can exploit EAI tools like TIBCO Adapters and literally feed off any source in the enterprise…
[*3] Rule engines are optimized for rule execution performance, which has a beneficial effect on rule-based transformations in ETL tasks…
[*4] A slightly different ETL fish is TIBCO DataExchange, which provides event-driven transformations on top of TIBCO BusinessWorks. It could be used to preprocess data before feeding it to a CEP engine and/or a rules engine, too…
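As promised in [*1], here is a generic, hypothetical illustration of the rule pattern (deliberately not the TIBCO BusinessEvents API): facts are held in memory and a set of condition/action rules performs data-quality filtering and simple defaulting before the result is handed on to the load step:

    # Stateless rule-style transformation: each fact is evaluated against a
    # list of (condition, action) rules, entirely in memory.
    RULES = [
        (lambda f: f.get("amount") is None,
         lambda f: f.update(valid=False)),
        (lambda f: f.get("amount") is not None and f["amount"] < 0,
         lambda f: f.update(valid=False)),
        (lambda f: f.get("currency") is None,
         lambda f: f.update(currency="USD")),
    ]

    def apply_rules(fact):
        fact.setdefault("valid", True)
        for condition, action in RULES:
            if condition(fact):
                action(fact)
        return fact

    facts = [{"customer_id": "c-1", "amount": 10.0},
             {"customer_id": "c-2", "amount": -5.0}]
    transformed = [apply_rules(f) for f in facts]
    ready_to_load = [f for f in transformed if f["valid"]]   # only good records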
