Tuesday, January 4, 2011

Internal vs. External Events in Databases

A database generally comprises data that records the (instantaneous) state of entities in the real world. If the historical states of the data are of interest, then a history of values may be recorded for these entities.  This history of values will usually be timestamped, and thus this historical list of data values can be considered as recording the occurrence of an event in the real world.  In other words, the real-world event changed the state of a real-world entity, whose updated state must then be recorded in the database.  Let's call these real-world events external events.

But now we must also consider that the process of updating the database to record the new real-world value is an event unto itself. Let's call this an internal event, since it is an event that is intimately tied to the database itself.  This event may be a human operator who manually types in a new value or hardware/software that captures the new real-world value and updates the database.  Consider that the act of recording the new value may or may not take place at the same time as the real-world value itself changed.  In other words the external event is distinct from the internal event.

I believe it is important to differentiate between external and internal events when developing a data model that captures the history of data states.  For one, recording internal events can help to audit data entry errors.  For example, erroneous data can sometimes be detected and even corrected if it is examined in the context of other temporally-proximal data entry events (e.g., a data entry operator that repeated a value from a previous data entry record).

Recording internal events explicitly can also help to troubleshoot querying anomalies caused by failure to take into account the difference in time between the occurrence of an external and its corresponding internal event.  For example, why did a query that was run at time t not reflect the updated state of the real world event that occurred at time t-1? It will not if the data entry--the internal event--occurred at time t+1.


Recording internal events can aid in the determining the time periods during which a database is "out of sync" with real-world entity values, due to data input errors.  Ideally, a database will always correctly reflect the state of the real-world, but in practice this is rarely the case. Inevitably, bad data will enter a system, and, at best, is corrected at some point in the future.  When corrections are made and recorded as internal events, the original and the corrected internal event timestamps can be used to determine when and for how long the database maintained inaccurate state.  If internal events can be marked as having been invalid, it is then also possible to generate reports that either ignore or include erroneous data states.  The advantage is that the database is not attempting to forget or otherwise hide the fact that erroneous data existed.  Much as database designers contend that data should never be deleted, one can argue that erroneous data should also not be deleted, but simply flagged.  Knowing that a database was temporarily maintaining bad data can be just as important as storing the correct values and their history.

Note that both external and internal events can also be used to record persons and comments associated with the event, in addition to just timestamps.

Internal events are most commonly recorded in log files, rather than as data in the database itself.  It can be very useful though to record internal events directly in the database, as this avoids the need to join log file output (of internal events) with database records of external events when performing troubleshooting or auditing tasks.




No comments: