Brad Feld writes about how the web is today a denormalised relationship database, and is becoming even more so.
Quick aside: Put simply, normalisation is the process of designing a database structure so that each piece of data is stored only once, with no redundancy (which saves storage and avoids update anomalies). Denormalisation is the opposite: deliberately adding redundant data, usually to improve performance (speed). I'm a little rusty on this stuff, so if you want further details you should try Wikipedia.
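For the concrete-minded, here's a minimal sketch in Python of the same review data stored both ways. The names and fields are all made up for illustration; real schemas would obviously differ.

```python
# Normalised: each fact lives in exactly one place. Rows refer to each
# other by id, so updating a book's title touches a single record.
users = {1: {"name": "Alice"}}
books = {10: {"title": "Snow Crash", "author": "Neal Stephenson"}}
reviews = [
    {"user_id": 1, "book_id": 10, "stars": 5},
]

# Denormalised: the user and book details are copied into every review
# row. Reads are self-contained (no joins or lookups needed), but the
# same facts are now stored redundantly in many places.
reviews_denormalised = [
    {
        "user_name": "Alice",
        "book_title": "Snow Crash",
        "book_author": "Neal Stephenson",
        "stars": 5,
    },
]

# A "join" against the normalised data, done by hand:
for r in reviews:
    print(users[r["user_id"]]["name"], "rated",
          books[r["book_id"]]["title"], r["stars"], "stars")
```

The denormalised version is roughly what much of the web looks like: the same review pasted into several sites, each copy self-contained and none of them aware of the others.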
Feld then poses the interesting question of whether denormalisation of the web is a good thing or a bad one.
I started noticing something about a year ago – the web was becoming massively denormalized. As a result of the proliferation of user-generated content (and the ease with which it was created), services were appearing all over the place to capture that same data (reviews: books, movies, restaurants), people, jobs, stuff for sale. “Smart” people were putting the data in multiple places (systems) – really smart people were writing software to automate this process.
Voila – the web is now a massively denormalized database. I’m not sure if that’s good or bad (in the case of the web, denormalization does not necessarily equal bad). However, I think it’s a construct that is worth pondering as the amount of denormalized data appears to me to be increasing geometrically right now.
Let me hypothesise about why the web is getting more denormalised in the first place:
1. Denormalisation is a product of any mass medium, almost by definition. A mass medium is one that is not meant solely for a niche/targeted audience. Content providers and advertisers will therefore try to blanket the medium with their message as far as possible. Whoever has the loudest voice/biggest billboard/most TV coverage/most coverage on the 'net wins.
2. The web shifts (some of) the costs of denormalisation to others. Think spammers. Whether it's spam on email, newsgroups, blogs or elsewhere, I would hazard that a large proportion of the costs are borne by ISPs, hosting providers and other third parties. So spammers don't care whether or not the database is normalised.
3. The web is not controlled centrally. No authority exists to "clean up" the database and prevent different people from posting the same content multiple times.
4. Keyword search engines like Google don't need a properly structured database to function, so neither consumers nor producers of content need to care much about redundant data. (In fact, Google in particular actually takes advantage of redundancy: PageRank treats the many links pointing at the same content as a signal of its importance. There's a toy sketch of this below.)
Any other reasons?
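To make that fourth reason a bit more concrete, here's a toy, purely illustrative PageRank-style calculation in Python. The page names and link graph are invented, and the real algorithm obviously runs at a vastly different scale with many refinements.

```python
# A toy PageRank-style power iteration over a made-up link graph, just to
# illustrate how redundant links to the same page can be read as a signal
# of importance rather than as noise.
links = {
    "copy1": ["original"],
    "copy2": ["original"],
    "copy3": ["original"],
    "original": ["copy1"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # iterate until the scores (roughly) settle
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outgoing in links.items():
        share = damping * rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

# "original" ends up ranked highest because three redundant pages link to it.
for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

The point is only that redundancy (many pages pointing at, or repeating, the same thing) becomes a useful signal rather than something that has to be cleaned up first.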
Now if all those reasons are true, especially #4, then the question of whether denormalisation is a good thing or a bad one becomes less relevant. It just is. Fact of life. And we're dealing with it just fine.
But then, this is an analysis of today. If the amount of redundant data is truly increasing geometrically as Feld claims, then could tomorrow be different? Would a SuperGoogle (not necessarily a search engine as we understand it today; could also be some kind of filter/community-based engine/whatever) come along that would help cut through the crap?
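Purely as speculation about what such a filter might do under the hood, here's a minimal, hypothetical sketch that collapses duplicate copies of the same content by fingerprinting normalised text. The data and the approach are both invented for illustration.

```python
import hashlib

# Hypothetical postings of the same review copied to several sites.
postings = [
    ("site-a.example", "Great pizza, terrible service."),
    ("site-b.example", "Great pizza,  terrible service!"),
    ("site-c.example", "A quiet spot with decent coffee."),
]

def fingerprint(text: str) -> str:
    """Fingerprint text after crude normalisation (case, punctuation, spacing)."""
    cleaned = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return hashlib.sha1(" ".join(cleaned.split()).encode()).hexdigest()

# Keep only the first copy of each distinct fingerprint.
seen, unique = set(), []
for site, text in postings:
    fp = fingerprint(text)
    if fp not in seen:
        seen.add(fp)
        unique.append((site, text))

print(unique)  # the near-identical copy from site-b has been filtered out
```

Whether something like that ends up as a search engine, a filter, or something community-driven is anyone's guess.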
Link: Feld Thoughts