Thursday, July 21, 2005

A tactical view of Analytical Reporting in osC

The long-term success of any business is dependent on the decisions it makes. And as a small business grows, so does the need to analyze the data they have collected to make these important business decisions. Forecasting, trend analysis and other aggregations of raw data are imperative to understanding your business to assist you in making these decisions. Yearly, quarterly, monthly, weekly, and daily aggregation of sales data, for example, give you great insight into how well the volume of your business is doing. And the more ways you look at your data, the better you can understand your business.

Unfortunately, these types of data aggregations can be incredibly taxing on server resources that your application relies on. So much so in fact, that eventually, your application will grind to a halt. Performance and usability of your application will become so poor that you’ll end up losing customers.

So, what does a medium to large storeowner who wishes to leverage all of that captured data without compromising performance do?

Ideally, data would be extracted, transformed (if necessary) and transferred over to an Operational Data Store (ODS) or Enterprise Data Warehouse (EDW) in a daily, incremental fashion, and then purged from the OLTP (On Line Analytical Processing) system after 30, 60 or 90 days. That would keep your application running on a ‘lean’ set of data for optimal performance. Data that would be occasionally needed by the application, such as order history, would be transferred over to the ODS, while data who’s only purpose would be to aggregate and analyze, would be moved over to the EDW.

However, this introduces a new level of complexity into the data model. Two things specifically jump out at me that would require consideration in the OLTP database in order to insure data integrity between the reporting structures, and the OLTP database. In order to support the incremental extraction, I think you’d need to have a couple of very dependable timestamps. You’d need to have one column for the timestamp of the row was inserted, and one for the timestamp that the row was last updated. In conjunction, you’d also need a solid set of triggers, that you’d use to ensure that any time a row in the OLTP database was updated, to make sure that new, updated row was included in the extractions.

In addition to the timestamps, I think you’d need to have a reliable deletion audit trail. Obviously, if data is deleted in the OLTP database, you need a way to also delete it from the ODS and/or EDW. Typically, this is handled with a delete trigger, and a ‘delete_log’ table. After the ETL (Electronic Load and Transfer) process extracts and loads data from the OLTP system into the reporting structures, it will also grab all of the deleted rows from the ‘delete_log’ table, iterate through them, and update the tables in the ODS/EDW accordingly, either deleting them, or marking them as deleted.

These are all fairly common industry standard methods of keeping the OLTP database from becoming bogged down with too much data, and to keep contention down by moving reporting away from the application.

Unfortunately, there seem to be some inherent problems with MySQL that hinder us from being able to accomplish this easily. First of all, MySQL doesn’t appear to support triggers. This requires that the application itself be responsible for ensuring that all deletions are recorded in the delete log table, and that all updates also include an update to the last updated timestamp. I’m don’t know about you, but I’m not a big proponent of trusting the application to do anything when it come to data integrity.

With all of this being said, while I’d certainly be up to the challenge of creating an ODS/EDW for an osC OLTP system (through, I’m of the opinion that once a small business grows beyond the point of being ‘small’, it’s time to start looking for a new shopping cart application.