website stat

Is denormalizing a database that evil?

A straight and fast answer: it depends, but if done cautiously, no.

I’ve already given the example of keeping a field for the COUNT(*) query, which is way faster that applying a COUNT(*) to hundreds of rows.

Now imagine you have a product that belongs to a category. Simple as that. For the sake of the example’s clearness, no parent categories, no subcategories.

The “normalized” way of doing this would be keeping a table products and a table categories. The table categories would simply contain an id as a primary key and a field name to store the name of the category.

Now, every single time you want to know the product’s category, according to this database schema, you’ll have to 1) do another query using the category_id; 2) be a little bit smarter and eager load the association with a JOIN.

Either way, it sucks. It truly sucks if you have to do a lot of these queries. JOINs are expensive.

So, why not keep the category name in the products table instead of the id? In the past, we were warned against lack of space. True, having “Shoe” repeated several times instead of “10″ takes a lot more space. But is space we’re worried with? Here’s a bold NO. Nowadays we’re worried with speed and scalability, not with space. Space is cheap, performance is not.

Let’s keep these denormalization demons away and turn out database into a faster database :-)


22 Responses to “Is denormalizing a database that evil?”

  1. JP Antunes
    Published at January 24th, 2008 at 3:45 pm

    … unless of course you, by some reason, had to change ’shoe’ to ‘footwear’ or something.
    And of course, in your example you only gain performance on “simple” read operations. How would you go about aggregate functions if you’re not sure integrity is kept in your data set? Update (potentially) several thousands products on each category update? And if you had a situation where you had to support multi-language environment, how would you go then?

  2. Manuel Padilha
    Published at January 24th, 2008 at 3:53 pm

    I always thought of normalization as a way of measuring how “relational” a database is.

    It’s not a good thing or a bad thing. It’s just a characteristic. For some applications and workloads being relational is the way to go (flexibility and manageability are important), and for others being relational is just being slow.

    For the specific case that you mention, you’d probably be much better off using a document oriented database (like couchdb or something) than using a relational database, since “relations” is not something that seems to be needed.

  3. mlopes
    Published at January 24th, 2008 at 3:56 pm

    JP Antunes,

    My example is given provided that you do more SELECTs than INSERTs or UPDATEs. It’s not a common operation to change the category but it’s quite common to see the product’s category.

    Now, as for the i18n, I’d keep that out of the table itself. I’d just mark that field as subject to localization and everytime it had to be displayed I’d go check for the translation on the table containing the translations. Of course one could also denormalize the translations but that would be a PITA. Actually, even keeping the translations on a separate “categories” table would be horrible.

    Anyway, this doesn’t always apply. This was a simple example where if you’re looking for performance it clearly applies.

  4. mlopes
    Published at January 24th, 2008 at 3:58 pm

    By the way, as a rule of thumb denormalization might be viable if you do more READs that WRITEs on a database.

  5. mlopes
    Published at January 24th, 2008 at 4:01 pm

    Manuel Padilha,

    You have a point there, but usually this is a case in the midst of a huge database that has relations. My example is useful if you want to achieve performance on a RDBM like MySQL or PostgreSQL.

  6. Sérgio Carvalho
    Published at January 24th, 2008 at 4:19 pm

    I’ve already said this, but I’ll repeat: Most times, denormalization is either:

    a) A hack around a limitation in the database engine; or
    b) A hack around developer ignorance

    Some, very rare times, denormalization is needed. Usually, because the feature needed to properly execute our query is so rarely used that no database engine implements it.

    In the count(*) case, you need either materialized views or readable expression indexes.

    In the category case, you need a decent database engine. Joins are not expensive if the database engine can keep the junction table in-memory (it’s so small, that it should fit). If the relation is many-to-one (and not many-to-many) you can also fallback again on materialized views or expression indexes to cache the joins if need be.

  7. JP Antunes
    Published at January 24th, 2008 at 4:32 pm

    @mlopes: ah grasshopper, that rule of thumb is a fallacy. It’s not the amount of Updates, it’s _who_ can Update, and _where_.

    @sergio: I could try, but I wouldn’t say it better.

  8. mlopes
    Published at January 24th, 2008 at 4:35 pm

    Sérgio,

    I find you comment very assumptuous, but moving forward…

    If you have a materialized view and you want immediate updates you still need to have a trigger, plus the effort of generating the materialized view (which might be sums, counts, whatever). If you have it denormalized you just need to increment or decrement a field upon being set by the trigger. Way easier, way faster.

    As for creating a readable expression index, you’ll have the overhead of parsing the index. Alas, it’s way slower than using an int.

    Finally, regarding using another database engine, that’s doable if you can have both the products and categories using a different engine. If you need to use a particular RDBMS you can’t afford to do that. And that’s the case for 99% of the people.

    P.S. - A lot of DBA experts would disagree with you comments regarding the usage of denormalization. But anyway, that’s your style anyway, right? Assumptuous.

  9. Sérgio Carvalho
    Published at January 24th, 2008 at 5:16 pm

    Mário,

    You probably mean presumptuous, not assumptuous. Yup, that’s me, when I’m talking about stuff that I know the inside out.

    You’re wrong on materialized views. Go read on materialized view fast refreshing.

    Also, go read on indexes. Postgresql has one very nice manual. You’re wrong on indexes. I’ll offer a nugget: Indexes are integers (or reals, or booleans, or whatever). They are not parsed on read. They are usually not stored in-row, and that is the only difference vs a database column, performance-wise.

    See, I managed to beat my own presumptuousness. And that’s one heck of a long word.

    Oh, and if you promise not to be offended, here’s a bit of presumptuous humour:
    http://www.youtube.com/watch?v=WrjwaqZfjIY

    (I’m not only presumptuous, I’m a mean SOB in written form)

  10. mlopes
    Published at January 24th, 2008 at 6:39 pm

    Sérgio,

    You wrote your comment in a assumptuous way, therefore you were presumptuous. Your correction is correct, although assumptuous does exist in informal English.

    Anyway, point being that you assume you always have the right answer. For this matter, there are multiple right answers and approaches. You lack a little bit of humility by stating that you’re right because you know what you’re talking about. You assume that your way of doing things is the right way of doing things. Anyway, back to trail.

    MySQL 4.x and 5.x, for instance, does not support fast refreshing of materialized views. Take a look here. So, it’s not a cross-database solution.

    As for the indexes, of course they’re not maintained in row, they’re the key on a hashtable that points to the row. But that doesn’t take out my point. If you use an expression on the index that also accounts the number of comments you still have to parse it to get the number of comments.

    I’m not stating that your proposals don’t work. They do work. As normalization or simply doing a COUNT(*). You can’t just assume that *your proposals* are simply better because everyone else is dumb and you’re right. A lot of DBA experts recommend my solution for being simple. It might not be better, ymmv depending on your specific case.

  11. JP Antunes
    Published at January 24th, 2008 at 8:02 pm

    @mario: a lot of DBA are really bad devs gone worst. If you can show me a DBA expert who advises de-normalization “because it is simple” I’ll show you a DBA expert who probably isn’t very good at is job.

    I hope you don’t find me presumptuous, but over the years I’ve had some fights with DBAs because, like you, I tend to think as a developer and not as one of them.

  12. mlopes
    Published at January 24th, 2008 at 8:18 pm

    JP Antunes,

    The advice I give obviously doesn’t apply to all cases and I’ve never claimed so. It’s a good, fast and simple way of achieving performance and avoid nailing the database with COUNTs.

    Not that using a materialized view or an index expression aren’t also viable choices — they are. But this denormalization is still faster and a good choice for a panacea of databases as it doesn’t depend on the DBMS functionality.

    As I’ve previously mentioned, if you’re doing more category updates than listing products along with categories, this solution doesn’t suit you.

  13. Sérgio Carvalho
    Published at January 24th, 2008 at 11:02 pm

    Mário,

    There is always more than one way to do it. In this case, I’m absolutely certain my way is better.

    Performance-wise, index lookups — assuming materialized views are impossible — are virtually the same as field lookups. Contrary to what you write, index expressions need not be parsed — not for every row at least, and query parsing time is not relevant. Indexes are value series, much as columns are, stored out of row.

    Some databases don’t store the pointer from the table row to the index row. For these, reading the index value incurs in the cost of searching the index structure (usually a b-tree). Whenever the pointer is there, reading the index has cost zero, for all intents and purposes. Decent RDBMSs contain the read pointer from rows to indexes.

    Logic-wise, indexes introduce no breaches in database consistency.

    Trigger-updated fields introduce a significant performance drag when updating a field.

    Worse, they introduce normalization breaches. Data is duplicated in the database, and is therefore prone to inconsistency. What happens when someone wrongly updates the count(*) field? If the field is in-/decremented, what happens when two transactions concur in a field update? If the field is recalculated, how prone to row-level locks is the recalculation on high-concurrency databases (think webapps)? How do you handle write-permissions to different users when they aren’t allowed to update the table, but can produce updates that affect the count(*) field?

    Denormalization is all about reinventing the wheel.

  14. JP Antunes
    Published at January 24th, 2008 at 11:47 pm

    @sergio: you said “Denormalization is all about reinventing the wheel.”

    Didn’t you notice the url for this blog? Mario has his own wheel going.

    @mario: just kidding!

  15. mlopes
    Published at January 25th, 2008 at 12:13 am

    Sérgio, again, it’s one way of doing things. MySQL 5.x does not support expression indexes (PostgreSQL or Firebird - to name some FOSS RDBMS - do!). Did it stop YouTube, Digg and so many high-load heavy traffic websites from using it? No. Now, go ask them how they they managed to “index” the count, considering they don’t have either materialized views or expression indexes.

    So, qed, there’s no “best” way. Mine is cross-database as it works in *any* database. Yours in cleaner. Using expression indexes definitely saves you from a lot of trouble.

    Finally, as for keeping the integrity the database, the solution I’ve proposed should only be applied to helper fields, fields that can be recalculated (the COUNT(*)) for instance. So, it’s not a big issue if you run into sync problems. Just run a COUNT again and update the field accordingly. No need for locks but you can still do it, if you think you’re unfortunate enough to have a query being done between the INSERT and the trigger UPDATE.

    JP Antunes,

    Well, that’s exactly the meaning of the name. It’s not reinventing the wheel though, it’s my own wheel ;-)

  16. Luís Miguel Silva
    Published at January 25th, 2008 at 12:57 pm

    What about data inconsistency? And why don’t you simply do a count(id) instead of a count on all the fields? :o)

  17. mlopes
    Published at January 25th, 2008 at 1:51 pm

    LMS,

    That’s not relevant. Unless the data is clustered, which probably isn’t, the count still has to be done by running through all rows that have a certain id. It’s still an expensive operation.

  18. cpinto
    Published at January 26th, 2008 at 10:32 pm

    I’m all for document storage when it makes sense. if you don’t trust me (which is by all means perfectly understandable) go ask amazon or google. here’s a quick example of when you could use key-value storage for better performance: shopping cart/order system. when you place an order in some site, your order reflects the state of the system at that exact point in time. instead of having a series of tables which reference other tables, it’s more than enough to create an order document that includes _all_ it’s items (and each item’s details) and store it somewhere under an unique id. when the customer checks her’s past orders, you won’t need to join N tables. when you need to check your customer’s order, you won’t need to join N tables. if you need to change the title of the book, your customer won’t give a damn because she knows the book with the “Document storage” title and not the “RDBMS storage” title. so stop thinking about relational data when there’s little to no need in some use-cases.

    another quick example: blog software (and I know this one will send Sergio through the roof =) ). You don’t need an RDBMS for your blog. if you’re smart (which I’m trusting you all to be), you can pretty much get away with using a key-value system. just like the, grasp, filesystem.

    (edited stuff out)

    it’s up to you to have the judgement of using the most appropriate tool for the job. in other words, it’s perfectly OK to mix them. and this is something sh*tloads of people in IT hate to do, everyone likes to put all their eggs in the same basket.

  19. cpinto
    Published at January 26th, 2008 at 10:34 pm

    oh yeah, one more thing: I’d say Mario’s example is a terrible one. As soon as you need one product to exist under two distinct categories you’re f*cked.

  20. mlopes
    Published at January 26th, 2008 at 10:50 pm

    Celson,

    Your “document storage” model is a good one, although it doesn’t always apply. Still, when it does, it’s an excellent salvation to doing multiple JOINs and, hence, hinder performance.

    As for the example, it had to be a very simple example. Take a look at what I’ve wrote:

    For the sake of the example’s clearness, no parent categories, no subcategories.

    Same applies to products with several categories. Of course it only works if one product has one and only one category.

  21. cpinto
    Published at January 26th, 2008 at 11:00 pm

    Mario, a couple of notes:
    - when I write document storage, you should read key/value system, just like sleepycat’s db
    - hinder is creating obstacles :-) which contradicts the key/value system to being a good way of avoiding joins
    - i’m sorry for not reading your post more carefully. i have to admit that I was more keen to read the comment thread :-)

  22. mlopes
    Published at January 26th, 2008 at 11:10 pm

    Celso,

    What hinders database performance is doing multiple JOINs. The semantics of that sentence are not very perceptible but I guess one grasps the meaning :-)

    As for not reading the whole post, shame of you, humpf :-P You rather take a look at the gossip and personal fights. eheh

Leave a Comment

Comments for this post will be closed on 25 January 2009.