<?paul

The accountability problem

Friday, May 21. 2010

I'm working on a volunteer side project for a local dance studio, the website will handle the standard sort of things: creating classes, registering students, matching leads to follows, etc.

We would like to have some level of accountability when various records (user, class, etc) are modified. That way if problems arise later, we can see who did what, and when. If needs be, even roll the change back.

This must be a common problem, how have you solved it?

My (proposed) solution is something like this: Take an ordinary table like class:

id	session	title	start_date	end_date
1	Spring	Dancing 101	May 1, 2010	May 28, 2010
2	Summer	Dancing 102	June 1, 2010	July 15, 2010
3	Fall	Dancing 101	September 1, 2010	September 28, 2010

And add a few columns to for change management

id	version	active	session	title	start_date	end_date	timestamp	login_id
1	1	0	Spring	Dancing 101	May 1, 2009	May 28, 2009	1274467708	23
1	2	1	Spring	Dancing 101	May 1, 2010	May 28, 2010	1274467709	23
2	1	1	Summer	Dancing 102	June 1, 2010	July 15, 2010	1274467710	27
3	1	1	Fall	Dancing 101	September 1, 2010	September 28, 2010	1274467808	35

The list of classes is obtained by selecting where the active bit is set. Administrators will be able to look at the list of changes and see who made them, and when. I’m tracking login id rather than simply user id so information like IP, and timestamp can be associated with the login, rather than simply who. In the case of reverting a change, the desired version will be duplicated into a new row, and have the active bit set. In reality active will always be the highest revision number (and latest date) but the duplication makes queries easier.

I think there’s a lot of power in this approach, but it is a bit complicated. Thoughts?

Comments »

The accountability problem - Concluded
Last week I blogged about the accountability problem: how to handle tracking who made what change, and providing the ability to revert it. Thanks to some great comments, and offline discussions, I’m seeing a few options, and have changed my mind

Weblog: <?paul
Tracked: May 25, 18:44

We use our ORM objects to store serialized versions of the data before a change is saved. We store the time and user info along with it. We tried using stored procedures and history tables, but there was not enough control.

#1 Brian Moon (Homepage) on 2010-05-21 17:57 (Reply)

At my last job we did something very similar but stored the inactive rows in a history table.

The reasoning for the additional table:
- wanted the primary keys to remain serial and unique
- we were doing large amounts of reporting on the tables and didn't want to have to filter on yet another damn column
- the ORM we were using [Doctrine] told us to do it this way.
- we already had a column named active on many tables and didn't want to cause more confusion than there already was

Point three was a key in our decision making process.

I think your solution is complete. The only thing I can see happening that you will have to ward against in your code base is you may inadvertently end up with two rows of id X with the active bit set.

#2 Shannon Patterson (Homepage) on 2010-05-21 18:16 (Reply)

This is basically a slowly-changing-dimension problem right out of dimensional modeling. Check out:
http://en.wikipedia.org/wiki/Slowly_changing_dimension

or any good reference on Ralph Kimball's SCD system.

#3 kabel on 2010-05-21 18:29 (Reply)

er.. Ralph Kimball's SCD management methodology. Sorry.

#4 kabel on 2010-05-21 18:30 (Reply)

This isn't bad, and it has advantages.

For example, because the revision history is stored within the actual table, you can run reports/queries to find specific revisions in a very granular way.

Correct on the `Active` column also - you could do without it in theory, but it would force you to start using GROUP BY whenever you wanted to join on that table, and it would get very messy (and slow) very quickly. It does mean that any logic that makes alterations to that data MUST include that 'set old revsisions to active = 0' or else you break your site, but that should be tested for in integration testing.

I don't quite do it this way however. Rather, I keep a separate table - `security_sensitive_event` - in which I record anything that I want accountability for. Columns are something like:

- security_sensitive_event_id
- security_sensitive_event_type_id
- event_timestamp
- visitor_ip_address

I have a couple of other tables that join with this such as `security_sensitive_event_has_user` for when the visitor is logged in whilst taking the action and also `security_sensitive_event_data` for storing key-value-pairs of information relating to the event.

In this accountability scenario that extra data might include 'what was changed'. In the case of a failed login it might be why the login failed, such as 'account_disabled' or 'unknown_username'.

Writes to my security_sensitive_event table are event-driven (I use .NET so I have events natively, not sure if PHP can do this easily) so I just have one OnSecuritySensitiveEvent event handler. It observes any relevant events elsewhere in the business logic that are security-sensitive and pops 'em in the database when they occur.

#5 Craig Fowler (Homepage) on 2010-05-21 18:38 (Reply)

We do something like that at my company for customer information (and have since long before I got here), and it is a nightmare when it gets to be a large dataset. We have 2.5 million unique customers, each of which has an average of 20-50 rows.

I would strongly recommend putting the history in a separate table. This could be done as Shannon suggested--by creating an almost-replica table. Or, an alternative way would be to have a single change history table for the entire database. It could look something like:

id | foreign_id | table_name | column_name | orig_value | new_value | timestamp | login_id

Every column that gets changed in the original table would require its own row in this table. This wouldn't be very efficient if all of your tables have a small number of columns, but it does add up to some big savings if you have very wide tables, and most changes are only to one or two columns.

This requires more complexity when making the changes and when trying to display the change history of an item. However, that complexity can be easily hidden from the developer with clever use of OOP, and this approach gives more flexibility and less data duplication than having to create a replica of every table whose changes you want record.

Whichever way you do it, I would definitely recommend putting the history in a separate table.

#6 Matthew on 2010-05-21 18:41 (Reply)

Is the source of the site OSS? I share an apartment with a couple who have a dance studio, and they keep asking me if I cant set them up with a nice registration and course management web apps.

#7 Lukas (Homepage) on 2010-05-21 19:24 (Reply)

I'm hoping I can get MySQL to lend a hand and enforce id + active to be unique.

#8 Paul Reinheimer (Homepage) on 2010-05-21 19:44 (Reply)

This reference looks great, thanks!

#9 Paul Reinheimer (Homepage) on 2010-05-21 19:45 (Reply)

At present it's a couple of sketched diagrams in my notebook, I plan on sticking it on github once I've got some real code written.

#10 Paul Reinheimer (Homepage) on 2010-05-21 19:47 (Reply)

Haven't had to solve this problem in a while.

But having been reading about Event Sourcing [1] and CQRS (Command and Query Responsibility Segregation) I might explore it when I do.

[1] http://martinfowler.com/eaaDev/EventSourcing.htm

#11 Jared on 2010-05-21 20:08 (Reply)

By happy coincidence we've been looking into this at work recently for a CMS project so your post is well timed! I believe this database design is also known as versioning, or history tables.

Your version is nice and has the benefit of simplicity. However, I agree with others the many records issue may get problematic. We've inherited a website at work (built by a different agency) which has all its versioned data in one table and it is very, very big and unwieldy. I know total row size can be an issue with database optimisation.

A more reliable approach is the duplicate table system where you basically have a versions/history table for each table you wish to track changes for. In your example that's similar to the second table example, just without the active flag. It's useful to record a change type field to track the type of change: Create, Update, or Delete. For example:

history_id | history_type | history_timestamp | history_author_id | duplicate of table data

You don't really need a version number in this instance since its inferred from the timestamp. You may need to add a 'history_status' column if you need to record draft data (i.e. unpublished, autosave, etc).

Prefixing the specific history columns with 'history_' also means you are less likely to get clashes with the normal table data.

The second table design avoids any query speed issues on the live data and allows you to keep as many versions as you wish. This is basically how Doctrine does it.

The downside with this is you need to keep the history table structure up to date with the core table. But if this is automated via your ORM / PHP then that's not such an issue.

You may also find this Stack Overflow question useful - http://stackoverflow.com/questions/39281/database-design-for-revisions

#12 simon r jones (Homepage) on 2010-05-21 22:30 (Reply)

I did a project using the revisioning in the same table like you showed it in your post and i would strongly advise against it!

If you change a lot of data, you bloat the table, make primary keys unneccessarily large and sometimes you build your queries and forget to filter by active, everything seems to work until the first historic record enters the table, not good.

I also worked on a project that used separate history tables, that is a very nice thing that does not polute your database caches that much and the versioning can also be done by triggers (if you put the userid or sessionid into a @-variable).

But here is one thing to be aware of if you use mysql with innodb:
on a server restart, innodb re-calculates the next auto-increment value of a table as the maximum value+1. If you put a record in the table, change it and then delete it, a record with the highest (but now unused) primary key value remains in the history table. If the mysql server is now restarted and a new entry arrives in the basetable, it gets the same primary key as the previously deleted, so your trigger or php function needs to check for that or you will get duplicate key errors.

#13 Dennis Birkholz (Homepage) on 2010-05-22 09:32 (Reply)

On one project, we stored all history in one table. When updating a record, we would create an array containing the old values of any field that changed, then store that in serialized format.

The history table structure is as follows:
(table_name, main_id, record_id1, record_id2, serialized_prior_data, creation_date, creation_user)

"main_id" is the record id from the main table in the database that nearly all other tables related to. Storing main_id simplifies pulling history for any record in the main table.

"record_id1" is the primary key number from "table_name".

"record_id2" is for cases where "table_name"'s primary key has two fields.

Below is the function we used to find changes between the old and new data because PHP's array_diff() didn't behave the way we wanted it to (I can't remember why right now). (Unfortunately, it seems markup plugins aren't enabled, so it looks like garbage...)

/**
Determines which keys are in both $old and $new but have different
values in $new

@param array $new the new data set
@param array $old the old data set

@return array an array of $old values that have been changed in $new
/
public static function array_diff($new, $old) {
$diff = array();
foreach ($new as $key => $value) {
if (array_key_exists($key, $old)) {
if (empty($new[$key]) && empty($old[$key])) {
// Handle PHP's loose typing of "empty" values.
switch (gettype($new[$key]) . ':' . gettype($old[$key])) {
case 'NULL:NULL':
case 'string:string':
case 'integer:integer':
case 'double:double':
// The types are the same, don't put in diff.
break;
default:
$diff[$key] = $old[$key];
}
} else {
if ($new[$key] != $old[$key]) {
$diff[$key] = $old[$key];
}
}
}
}
return $diff;
}

#14 Daniel Convissor (Homepage) on 2010-05-22 15:09 (Reply)

That will not work! If you make id+active unique, there can only be one active and one inactive revision, not more than these two!

#15 Dennis Birkholz (Homepage) on 2010-05-24 12:42 (Reply)

Not completely true. NULL is not considered as part of a unique key. So, you can have 1 or NULL and it will be allowed. Major hackery no?

#16 Brian Moon (Homepage) on 2010-05-25 18:52 (Reply)

Name
Email
Homepage
In reply to
Comment	Enclosing asterisks marks text as bold (word), underscore are made via _word_. Standard emoticons like :-) and ;-) are converted to images.

The accountability problem

Comments »

Hi, I’m Paul Reinheimer, a developer working on the web.

Top Posts

Search