Last week I blogged about the accountability problem: how to track who made what change, and how to provide the ability to revert it.
Thanks to some great comments and offline discussions, I'm seeing a few options, and I've changed my mind on my desired approach. There's also some great technical reading on the subject: Slowly changing dimension.

Option 1: Store history in the table

This was the approach I considered initially: store each revision of a record in the same table, with an active bit indicating which version is current. More details in the original post. This approach is simple: it doesn't require additional tables, queries remain straightforward, and combined with timestamps it's easy to track down associated records that were changed at the same time.

The downsides are pretty clear, and somewhat damaging. The most commonly hit tables will accumulate rows much faster than they otherwise would, and the retired revisions complicate things like reporting. In highly normalized designs this approach will likely break down, since you'll need to match revisions across multiple tables on every query.

classes

id | version | active | session | title       | start_date        | end_date           | timestamp  | login_id
---|---------|--------|---------|-------------|-------------------|--------------------|------------|---------
1  | 1       | 0      | Spring  | Dancing 101 | May 1, 2009       | May 28, 2009       | 1274467708 | 23
1  | 2       | 1      | Spring  | Dancing 101 | May 1, 2010       | May 28, 2010       | 1274467709 | 23
2  | 1       | 1      | Summer  | Dancing 102 | June 1, 2010      | July 15, 2010      | 1274467710 | 27
3  | 1       | 1      | Fall    | Dancing 101 | September 1, 2010 | September 28, 2010 | 1274467808 | 35
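A minimal sketch of how reads and edits might work under this scheme. This is my own illustration, not code from the post: SQLite stands in for MySQL, and the exact statements (retiring the active row, then inserting a bumped version inside one transaction) are my assumptions about the pattern.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE classes (
        id INTEGER, version INTEGER, active INTEGER,
        session TEXT, title TEXT, start_date TEXT, end_date TEXT,
        timestamp INTEGER, login_id INTEGER,
        PRIMARY KEY (id, version)
    )
""")

# Initial revision of class 1.
conn.execute(
    "INSERT INTO classes VALUES (1, 1, 1, 'Spring', 'Dancing 101', "
    "'May 1, 2009', 'May 28, 2009', ?, 23)",
    (int(time.time()),),
)

# An "edit" is two statements: retire the current revision,
# then insert the replacement with a bumped version number.
with conn:  # the context manager wraps both statements in one transaction
    conn.execute("UPDATE classes SET active = 0 WHERE id = 1 AND active = 1")
    conn.execute(
        "INSERT INTO classes VALUES (1, 2, 1, 'Spring', 'Dancing 101', "
        "'May 1, 2010', 'May 28, 2010', ?, 23)",
        (int(time.time()),),
    )

# Normal reads only ever see active rows.
current = conn.execute(
    "SELECT version, start_date FROM classes WHERE id = 1 AND active = 1"
).fetchone()
print(current)  # (2, 'May 1, 2010')
```

Note that every ordinary query now needs the `active = 1` predicate, which is exactly why the row counts and query complexity grow together.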

I would have used this approach in a smaller project, but the system I'm replacing has been running for ten years, and I have to plan for this one lasting that long with only minimal maintenance. I'd rather not watch those row counts grow to 3-5x what they should be.

Option 2: Store histories in an external table

Rather than storing the histories in the regular table (say, classes), this approach creates a duplicate table (classes_history) and stores the revisions there. This helps trim the size of your most heavily used tables, and keeps things relatively simple.

While it trims the row count, it does double the table count, and it still fails to handle complicated objects, or normalized table structures where a single datum requires multiple tables.

classes

id | session | title       | start_date        | end_date
---|---------|-------------|-------------------|--------------------
1  | Spring  | Dancing 101 | May 1, 2010       | May 28, 2010
2  | Summer  | Dancing 102 | June 1, 2010      | July 15, 2010
3  | Fall    | Dancing 101 | September 1, 2010 | September 28, 2010

classes_history

id | version | active | session | title       | start_date  | end_date     | timestamp  | login_id
---|---------|--------|---------|-------------|-------------|--------------|------------|---------
1  | 1       | 0      | Spring  | Dancing 101 | May 1, 2009 | May 28, 2009 | 1274467708 | 23
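One way the write path could look, sketched in Python with SQLite standing in for MySQL. The `update_class` helper is my own invention for illustration; in practice this copy-then-update step is often done with triggers instead (see the first comment below), so treat this as one possible shape, not the post's implementation.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE classes (
        id INTEGER PRIMARY KEY, session TEXT, title TEXT,
        start_date TEXT, end_date TEXT
    );
    CREATE TABLE classes_history (
        id INTEGER, version INTEGER, active INTEGER, session TEXT,
        title TEXT, start_date TEXT, end_date TEXT,
        timestamp INTEGER, login_id INTEGER
    );
""")

conn.execute("INSERT INTO classes VALUES (1, 'Spring', 'Dancing 101', "
             "'May 1, 2009', 'May 28, 2009')")

def update_class(conn, class_id, login_id, **changes):
    """Copy the outgoing row into classes_history, then apply the update."""
    row = conn.execute("SELECT * FROM classes WHERE id = ?", (class_id,)).fetchone()
    version = 1 + conn.execute(
        "SELECT COUNT(*) FROM classes_history WHERE id = ?", (class_id,)
    ).fetchone()[0]
    # Archived revisions carry active = 0; the live row stays in classes.
    conn.execute(
        "INSERT INTO classes_history VALUES (?, ?, 0, ?, ?, ?, ?, ?, ?)",
        (row[0], version, row[1], row[2], row[3], row[4],
         int(time.time()), login_id),
    )
    # Column names come from trusted application code, never user input.
    sets = ", ".join(f"{col} = ?" for col in changes)
    conn.execute(f"UPDATE classes SET {sets} WHERE id = ?",
                 (*changes.values(), class_id))

update_class(conn, 1, login_id=23,
             start_date="May 1, 2010", end_date="May 28, 2010")
live = conn.execute("SELECT start_date FROM classes WHERE id = 1").fetchone()[0]
print(live)  # May 1, 2010
```

The main table stays lean, and history writes are append-only, which also simplifies the backup strategies discussed further down.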

I’ll be using this method for my upcoming project.

Option 3: Store serialized information

Rather than storing rows when information is modified, the entire encapsulating object is serialized and stored. This can be done either with one table per object type or in one large history table.

The advantage here is that each revision stands alone and complete. Composite records are stored whole, and in a single location. This makes obtaining an accurate history relatively easy. On the downside, since the information is serialized it's hard to locate specific pieces of information via SQL. This approach seems perfect for complex objects where a single composite record spans many tables, but a little overkill for this specific case.

classes

id | version | active | session | title       | start_date        | end_date           | timestamp  | login_id
---|---------|--------|---------|-------------|-------------------|--------------------|------------|---------
1  | 2       | 1      | Spring  | Dancing 101 | May 1, 2010       | May 28, 2010       | 1274467709 | 23
2  | 1       | 1      | Summer  | Dancing 102 | June 1, 2010      | July 15, 2010      | 1274467710 | 27
3  | 1       | 1      | Fall    | Dancing 101 | September 1, 2010 | September 28, 2010 | 1274467808 | 35

history

id | version | table   | data
---|---------|---------|-----
1  | 1       | classes | a:8:{s:2:"id";i:1;s:7:"version";i:1;s:7:"session";s:6:"spring";s:5:"title";s:11:"Dancing 101";s:10:"start_date";s:11:"May 1, 2009";s:8:"end_date";s:12:"May 28, 2009";s:9:"timestamp";s:10:"1274467708";s:8:"login_id";i:23;}
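The round trip is easy to sketch. The example above uses PHP's serialize() format; here I've substituted Python and JSON to show the same idea, and renamed the `table` column to `tbl` because `table` is a reserved word in SQLite. Both substitutions are mine, not the post's.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE history (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        version INTEGER, tbl TEXT, data TEXT
    )
""")

record = {
    "id": 1, "version": 1, "session": "spring", "title": "Dancing 101",
    "start_date": "May 1, 2009", "end_date": "May 28, 2009",
    "timestamp": "1274467708", "login_id": 23,
}

# The whole object is flattened to a string and stored in one row;
# SQL can no longer address individual fields inside it.
conn.execute("INSERT INTO history (version, tbl, data) VALUES (?, ?, ?)",
             (record["version"], "classes", json.dumps(record)))

restored = json.loads(conn.execute(
    "SELECT data FROM history WHERE tbl = 'classes' AND version = 1"
).fetchone()[0])
print(restored["title"])  # Dancing 101
```

Note the trade-off: the blob round-trips perfectly, but a query like "all classes that started in May" now means deserializing every row in application code.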

I'll consider this for larger projects with complex objects.


Other points to ponder:

  • How long will you store histories?
    • Forever (∞)
    • Some fixed period of time (30 days)
    • Application specific (e.g. storing change to a class for three months after it ends)
  • Storage engine for history
    • MyISAM
    • Archive
  • Backup strategies
    • History tables are all appends



Thanks to everyone who commented on the last post.

Comments »

I found this article http://www.jasny.net/articles/versioning-mysql-data/ a while ago that has triggers for creating a history table and a script for converting existing tables.
#1 Joelle Tegwen on 2010-05-25 19:50 (Reply)

How about option 3, but you store the information in a NoSQL database (like MongoDB), where it can be queried efficiently?
#2 Avi Block (Homepage) on 2010-05-25 20:13 (Reply)

Hi Paul,

You asked me what I thought earlier, so I'll reply here -

I'm not completely opposed to the serialization method, provided that:

(a) You don't need to search.
(b) You always read/write serialized objects one at a time.

I wrote about this in January:
http://www.mysqlperformanceblog.com/2010/01/21/when-should-you-store-serialized-objects-in-the-database/

The only other advice I might have about the serialization technique, is make sure your code handles that there might be multiple versions of the serialization.

Re: Backup strategies, if it becomes a lot of data, it may become worthwhile to segment your backup. Every week or month copy inactive data from your 'hot' tables to a series of archival tables, then do a complete backup. But every day, just backup the hot tables.

This has the advantage of very fast recovery times post-disaster, since users probably won't care if it takes you a few more hours (or days) to recover the less vital info.
#3 Morgan Tocker (Homepage) on 2010-05-25 20:18 (Reply)

Yeah, we use #2 as well for stuff where we want to keep the main table lean and mean.
#4 Brian Moon (Homepage) on 2010-05-25 20:42 (Reply)

Your option #2 should be modified. Basically, every UPDATE you do to your normal table should still happen, plus an INSERT into your history table (with the PK being the PK from the original table plus a timestamp). There is no need for an active column. This is how the Doctrine Versionable behavior works.
#5 Lukas (Homepage) on 2010-05-26 06:21 (Reply)

My thoughts are you have two solutions, because this sounds like a small-scale application. By small scale, I mean the number of reads/writes is low, the amount of data being dealt with isn't huge, and the growth of the DB will probably be small.

MySQL:
Use InnoDB, with solution #2 and FK's and create a "JOIN" table.

MongoDB:
Store the history as an "Embedded doc"

Of course proper indexes are always the must for both.

Storing schema-less data I can see working, but why add another layer?
#6 EllisGL on 2010-06-19 05:52 (Reply)



Hi, I’m Paul Reinheimer, a developer working on the web.

I co-founded WonderProxy which provides access to over 200 proxies around the world to enable testing of geoip sensitive applications. We've since expanded to offer more granular tooling through Where's it Up

My hobbies are cycling, photography, travel, and engaging Allison Moore in intelligent discourse. I frequently write about PHP and other related technologies.
