Last week I blogged about the accountability problem: how to handle tracking who made what change, and providing the ability to revert it.
Thanks to some great comments, and offline discussions, I’m seeing a few options, and have changed my mind on my desired approach. There's also some great technical reading on the subject: Slowly changing dimension
Option 1: Store history in the table
This was the approach I considered initially. Storing each revision of a record in the table, with an active bit set to indicate which version is active. More details in the original post. This approach is simple: it doesn’t require more tables, queries are still pretty simple, and with a combination of timestamps it’s easy to track down associated records that were changed.
The downsides are pretty clear, and somewhat damaging. The number of rows in the most commonly hit tables will grow much faster than the table overall, which is sub-optimal. It can also complicate issues like reporting and the like. In the case of highly normalized designs this design will likely break down, as you’ll need to match revisions across multiple tables on every query.
classesid | version | active | session | title | start_date | end_date | timestamp | login_id |
1 | 1 | 0 | Spring | Dancing 101 | May 1, 2009 | May 28, 2009 | 1274467708 | 23 |
1 | 2 | 1 | Spring | Dancing 101 | May 1, 2010 | May 28, 2010 | 1274467709 | 23 |
2 | 1 | 1 | Summer | Dancing 102 | June 1, 2010 | July 15, 2010 | 1274467710 | 27 |
3 | 1 | 1 | Fall | Dancing 101 | September 1, 2010 | September 28, 2010 | 1274467808 | 35 |
I would have used this approach in a smaller project. The system I’m replacing has been running for ten years, I have to plan for this lasting that long with only minimal maintenance. I’d rather not watch those row counts become 3-5x what they should be.
Option 2: Store histories in an external table
Rather than storing the histories in the regular table (say classes) creating a duplicate table (classes_history) and storing the revisions there. This helps trim the size of your most heavily used tables, and keeps things relatively simple.
While trimming the row count, it does double the table count, and still fails to handle complicated objects, or normalized table structures where a single datum requires multiple tables.
classesid | session | title | start_date | end_date |
1 | Spring | Dancing 101 | May 1, 2010 | May 28, 2010 |
2 | Summer | Dancing 102 | June 1, 2010 | July 15, 2010 |
3 | Fall | Dancing 101 | September 1, 2010 | September 28, 2010 |
id | version | active | session | title | start_date | end_date | timestamp | login_id |
1 | 1 | 0 | Spring | Dancing 101 | May 1, 2009 | May 28, 2009 | 1274467708 | 23 |
I’ll be using this method for my upcoming project.
Option 3: Store serialized information
Rather than storing rows when information is modified, the entire encapsulating object is serialized and stored. This can be done either with one table per object type, or in one large history table.
The advantage here is that each revision stands alone and complete. Composite records are stored complete, and in a single location. This makes obtaining an accurate history relatively easy. On the downside, since information is serialized it’s hard to locate specific pieces of information via sql. This approach seems perfect for complex objects where a single composite record spans many tables, but a little overkill for this specific case.
classesid | version | active | session | title | start_date | end_date | timestamp | login_id |
1 | 2 | 1 | Spring | Dancing 101 | May 1, 2010 | May 28, 2010 | 1274467709 | 23 |
2 | 1 | 1 | Summer | Dancing 102 | June 1, 2010 | July 15, 2010 | 1274467710 | 27 |
3 | 1 | 1 | Fall | Dancing 101 | September 1, 2010 | September 28, 2010 | 1274467808 | 35 |
id | version | table | data |
1 | 1 | classes | a:8:{s:2:"id";i:1;s:7:"version";i:1;s:7:"session";s:6:"spring";s:5:"title";s:11:"Dancing 101";s:10:"start_date";s:11:"May 1, 2009";s:8:"end_date";s:12:"May 28, 2009";s:9:"timestamp";s:10:"1274467708";s:8:"login_id";i:23;} |
I'll consider this for larger projects with complex objects.
Other points to ponder:
- How long will you store histories?
- Forever (∞)
- Some fixed period of time (30 days)
- Application specific (e.g. storing change to a class for three months after it ends)
- Storage engine for history
- MyISAM
- Archive
- Backup strategies
- History tables are all appends
Thanks to everyone who commented on the last post.
I'm working on a volunteer side project for a local dance studio, the website will handle the standard sort of things: creating classes, registering students, matching leads to follows, etc.
We would like to have some level of accountability when various records (user, class, etc) are modified. That way if problems arise later, we can see who did what, and when. If needs be, even roll the change back.
This must be a common problem, how have you solved it?
My (proposed) solution is something like this: Take an ordinary table like class:
id | session | title | start_date | end_date |
1 | Spring | Dancing 101 | May 1, 2010 | May 28, 2010 |
2 | Summer | Dancing 102 | June 1, 2010 | July 15, 2010 |
3 | Fall | Dancing 101 | September 1, 2010 | September 28, 2010 |
And add a few columns to for change management
id | version | active | session | title | start_date | end_date | timestamp | login_id |
1 | 1 | 0 | Spring | Dancing 101 | May 1, 2009 | May 28, 2009 | 1274467708 | 23 |
1 | 2 | 1 | Spring | Dancing 101 | May 1, 2010 | May 28, 2010 | 1274467709 | 23 |
2 | 1 | 1 | Summer | Dancing 102 | June 1, 2010 | July 15, 2010 | 1274467710 | 27 |
3 | 1 | 1 | Fall | Dancing 101 | September 1, 2010 | September 28, 2010 | 1274467808 | 35 |
The list of classes is obtained by selecting where the active bit is set. Administrators will be able to look at the list of changes and see who made them, and when. I’m tracking login id rather than simply user id so information like IP, and timestamp can be associated with the login, rather than simply who. In the case of reverting a change, the desired version will be duplicated into a new row, and have the active bit set. In reality active will always be the highest revision number (and latest date) but the duplication makes queries easier.
I think there’s a lot of power in this approach, but it is a bit complicated. Thoughts?