Last week I blogged about the accountability problem: how to track who made what change, and how to provide the ability to revert it.
Thanks to some great comments and offline discussions, I'm seeing a few options and have changed my mind about my preferred approach. There's also some good technical reading on the subject: Slowly changing dimension.
Option 1: Store history in the table
This is the approach I considered initially: store each revision of a record in the same table, with an active bit indicating which version is current. More details in the original post. This approach is simple: it doesn't require extra tables, queries stay straightforward, and by matching timestamps it's easy to track down associated records that changed together.
The downsides are pretty clear, and somewhat damaging. The row count of your most heavily hit tables grows much faster than the number of live records, which hurts exactly where performance matters most. It also complicates reporting, since every query has to filter out inactive revisions. In a highly normalized design the approach will likely break down entirely, as you'll need to match revisions across multiple tables on every query.
classes

| id | version | active | session | title | start_date | end_date | timestamp | login_id |
|----|---------|--------|---------|-------|------------|----------|-----------|----------|
| 1 | 1 | 0 | Spring | Dancing 101 | May 1, 2009 | May 28, 2009 | 1274467708 | 23 |
| 1 | 2 | 1 | Spring | Dancing 101 | May 1, 2010 | May 28, 2010 | 1274467709 | 23 |
| 2 | 1 | 1 | Summer | Dancing 102 | June 1, 2010 | July 15, 2010 | 1274467710 | 27 |
| 3 | 1 | 1 | Fall | Dancing 101 | September 1, 2010 | September 28, 2010 | 1274467808 | 35 |
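A minimal sketch of what this could look like in MySQL, using the column names from the example above (the types and keys are my assumptions):

```sql
-- In-table history: every revision lives in classes itself.
CREATE TABLE classes (
    id         INT UNSIGNED NOT NULL,            -- logical record id, shared by all revisions
    version    INT UNSIGNED NOT NULL,            -- bumped on every change
    active     TINYINT(1)   NOT NULL DEFAULT 1,  -- 1 = current revision
    session    VARCHAR(20)  NOT NULL,
    title      VARCHAR(100) NOT NULL,
    start_date DATE         NOT NULL,
    end_date   DATE         NOT NULL,
    timestamp  INT UNSIGNED NOT NULL,            -- unix time of the change
    login_id   INT UNSIGNED NOT NULL,            -- who made the change
    PRIMARY KEY (id, version)
);

-- Every ordinary query now has to filter on the active bit:
SELECT id, session, title, start_date, end_date
FROM classes
WHERE active = 1;

-- An "update" retires the old revision and inserts the new one:
UPDATE classes SET active = 0 WHERE id = 1 AND active = 1;
INSERT INTO classes (id, version, active, session, title, start_date, end_date, timestamp, login_id)
VALUES (1, 2, 1, 'Spring', 'Dancing 101', '2010-05-01', '2010-05-28', UNIX_TIMESTAMP(), 23);
```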
I would have used this approach on a smaller project, but the system I'm replacing has been running for ten years, and I have to plan for this one lasting just as long with only minimal maintenance. I'd rather not watch those row counts grow to 3-5x what they should be.
Option 2: Store histories in an external table
Rather than storing the histories in the regular table (say, classes), create a duplicate table (classes_history) and store the revisions there. This helps trim the size of your most heavily used tables, and keeps things relatively simple.
While it trims the row count, it doubles the table count, and it still fails to handle complex objects or normalized structures where a single logical record spans multiple tables.
classes

| id | session | title | start_date | end_date |
|----|---------|-------|------------|----------|
| 1 | Spring | Dancing 101 | May 1, 2010 | May 28, 2010 |
| 2 | Summer | Dancing 102 | June 1, 2010 | July 15, 2010 |
| 3 | Fall | Dancing 101 | September 1, 2010 | September 28, 2010 |
classes_history

| id | version | active | session | title | start_date | end_date | timestamp | login_id |
|----|---------|--------|---------|-------|------------|----------|-----------|----------|
| 1 | 1 | 0 | Spring | Dancing 101 | May 1, 2009 | May 28, 2009 | 1274467708 | 23 |
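A sketch of the two tables and the bookkeeping an update requires; again, the types are my assumptions, and real code would compute the version number rather than hard-coding it:

```sql
-- The live table stays lean:
CREATE TABLE classes (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
    session    VARCHAR(20)  NOT NULL,
    title      VARCHAR(100) NOT NULL,
    start_date DATE         NOT NULL,
    end_date   DATE         NOT NULL,
    PRIMARY KEY (id)
);

-- The history table carries the same columns plus the revision fields:
CREATE TABLE classes_history (
    id         INT UNSIGNED NOT NULL,            -- points back to classes.id
    version    INT UNSIGNED NOT NULL,
    active     TINYINT(1)   NOT NULL DEFAULT 0,  -- historical rows are never active
    session    VARCHAR(20)  NOT NULL,
    title      VARCHAR(100) NOT NULL,
    start_date DATE         NOT NULL,
    end_date   DATE         NOT NULL,
    timestamp  INT UNSIGNED NOT NULL,
    login_id   INT UNSIGNED NOT NULL,
    PRIMARY KEY (id, version)
);

-- Before updating a class, snapshot the outgoing row into the history table:
INSERT INTO classes_history (id, version, active, session, title, start_date, end_date, timestamp, login_id)
SELECT id, 1, 0, session, title, start_date, end_date, UNIX_TIMESTAMP(), 23
FROM classes WHERE id = 1;

UPDATE classes
SET start_date = '2010-05-01', end_date = '2010-05-28'
WHERE id = 1;
```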
I’ll be using this method for my upcoming project.
Option 3: Store serialized information
Rather than storing extra rows when information is modified, the entire encapsulating object is serialized and stored. This can be done either with one history table per object type, or with one large history table for everything.
The advantage here is that each revision stands alone and complete: composite records are stored whole, in a single location, which makes obtaining an accurate history relatively easy. On the downside, because the information is serialized, it's hard to reach specific pieces of it via SQL. This approach seems ideal for complex objects where a single composite record spans many tables, but it's overkill for this particular case.
classes

| id | version | active | session | title | start_date | end_date | timestamp | login_id |
|----|---------|--------|---------|-------|------------|----------|-----------|----------|
| 1 | 2 | 1 | Spring | Dancing 101 | May 1, 2010 | May 28, 2010 | 1274467709 | 23 |
| 2 | 1 | 1 | Summer | Dancing 102 | June 1, 2010 | July 15, 2010 | 1274467710 | 27 |
| 3 | 1 | 1 | Fall | Dancing 101 | September 1, 2010 | September 28, 2010 | 1274467808 | 35 |
history

| id | version | table | data |
|----|---------|-------|------|
| 1 | 1 | classes | a:8:{s:2:"id";i:1;s:7:"version";i:1;s:7:"session";s:6:"spring";s:5:"title";s:11:"Dancing 101";s:10:"start_date";s:11:"May 1, 2009";s:8:"end_date";s:12:"May 28, 2009";s:9:"timestamp";s:10:"1274467708";s:8:"login_id";i:23;} |
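A sketch of the one-big-table variant; the names and types are my assumptions. Note that table is a reserved word in MySQL and needs backticks, and the version column is what lets application code cope if the serialization format changes over time:

```sql
-- Generic serialized-history table.
CREATE TABLE history (
    id      INT UNSIGNED NOT NULL,   -- id of the row in the source table
    version INT UNSIGNED NOT NULL,
    `table` VARCHAR(64)  NOT NULL,   -- source table name (reserved word, hence backticks)
    data    TEXT         NOT NULL,   -- serialized snapshot of the complete object
    PRIMARY KEY (`table`, id, version)
);

-- Each change appends one complete snapshot; the blob is the
-- PHP serialize() output from the example row above (truncated here):
INSERT INTO history (id, version, `table`, data)
VALUES (1, 1, 'classes', 'a:8:{s:2:"id";i:1;s:7:"version";i:1;...}');
```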
I'll consider this for larger projects with complex objects.
Other points to ponder:
- How long will you store histories?
  - Forever (∞)
  - Some fixed period of time (30 days)
  - Application-specific (e.g. keeping changes to a class for three months after it ends)
- Storage engine for the history tables (see the sketch after this list)
  - MyISAM
  - Archive
- Backup strategies
  - History tables are append-only
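On the storage engine question: ARCHIVE fits append-only history tables because it only allows INSERT and SELECT and compresses rows on disk; the trade-off is that it supports no indexes beyond a single AUTO_INCREMENT column, so reads are full scans. A sketch (the table name is hypothetical):

```sql
-- ARCHIVE refuses UPDATE and DELETE, which is exactly what an
-- append-only history table wants; it also compresses rows.
-- No indexes, though, so history lookups are table scans.
CREATE TABLE classes_history_archive (
    id         INT UNSIGNED NOT NULL,
    version    INT UNSIGNED NOT NULL,
    active     TINYINT(1)   NOT NULL,
    session    VARCHAR(20)  NOT NULL,
    title      VARCHAR(100) NOT NULL,
    start_date DATE         NOT NULL,
    end_date   DATE         NOT NULL,
    timestamp  INT UNSIGNED NOT NULL,
    login_id   INT UNSIGNED NOT NULL
) ENGINE=ARCHIVE;
```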
Thanks to everyone who commented on the last post.
Comments
You asked me what I thought earlier, so I'll reply here -
I'm not completely opposed to the serialization method, provided that:
(a) You don't need to search.
(b) You always read/write serialized objects one at a time.
I wrote about this in January:
http://www.mysqlperformanceblog.com/2010/01/21/when-should-you-store-serialized-objects-in-the-database/
The only other advice I might have about the serialization technique is to make sure your code can handle multiple versions of the serialization format.
Re: backup strategies, if it becomes a lot of data, it may become worthwhile to segment your backup. Every week or month, copy inactive data from your 'hot' tables to a series of archival tables, then do a complete backup. But every day, just back up the hot tables.
This has the advantage of very fast recovery times post-disaster, since users probably won't care if it takes you a few more hours (or days) to recover the less vital info.
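A sketch of that segmentation, assuming the hot history table is on an engine that allows DELETE (InnoDB or MyISAM) and reusing the hypothetical classes_history_archive table from above:

```sql
-- Weekly/monthly: move rows older than 30 days into the archival table,
-- then take a complete backup.
INSERT INTO classes_history_archive
SELECT * FROM classes_history
WHERE timestamp < UNIX_TIMESTAMP() - 30 * 86400;

DELETE FROM classes_history
WHERE timestamp < UNIX_TIMESTAMP() - 30 * 86400;

-- Daily (shell, not SQL): dump only the hot tables, e.g.
--   mysqldump mydb classes classes_history > hot_tables.sql
```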
MySQL:
Use InnoDB with solution #2, add FKs, and create a "JOIN" table.
MongoDB:
Store the history as an "Embedded doc".
Of course, proper indexes are always a must for both.
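One reading of the InnoDB suggestion (my interpretation; the constraint name is made up): give the history table a foreign key back to the live row, which only InnoDB will enforce:

```sql
-- Assumes the classes and classes_history tables from option 2.
-- The FK guarantees history rows always point at a real class; the
-- flip side is that a class can't be deleted while history remains.
ALTER TABLE classes ENGINE=InnoDB;
ALTER TABLE classes_history ENGINE=InnoDB;
ALTER TABLE classes_history
    ADD CONSTRAINT fk_classes_history_class
    FOREIGN KEY (id) REFERENCES classes (id);
```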
Storing schema-less data I can see working, but why add another layer?