Difference between revisions of "The GROUPing pitfall"

From ISoft Wiki
Jump to navigationJump to search
(Version 0.11)
(What is this, version 0.21? I forget)
Line 54: Line 54:
| 4.0
| 4.0
|-  
|-  
| 3
| 4
| 2
| 2
| 2010-01-02 15:05:33
| 2010-01-02 15:05:33
Line 63: Line 63:


I'm sure you understand what happens when they are joined together:
I'm sure you understand what happens when they are joined together:
{{MySQL|query=SELECT monkey.value AS monkey_name, poop.size AS poop_size
<source lang="mysql">
SELECT monkey.value AS monkey_name, poop.size AS poop_size
FROM monkey
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id}}
JOIN poop ON poop.monkeyid = monkey.id
ORDER BY monkey.id
</source>


You get this!
You get this!
Line 89: Line 92:
But you already knew that.  I mean, that's kid stuff!  I didn't even write it out, I actually got one of my younger siblings to type it out because I was so bored by it.
But you already knew that.  I mean, that's kid stuff!  I didn't even write it out, I actually got one of my younger siblings to type it out because I was so bored by it.


You see that Bimbo shows up twice - the one row in the monkeys table got multiplied by all the matching rows in the poop table that it got joined to.  You've seen it before, and if you were going to use this data in a report that displayed one record per monkey, you would know what to do - use a GROUP BY!
You see that Bimbo shows up twice - the one row in the monkeys table got multiplied by all the matching rows in the poop table that it got joined to.  You've seen it before, and if you wanted to use this data in a report that displayed one record per monkey, you would know what to do - use a GROUP BY!


{{MySQL|query=SELECT monkey.value AS monkey_name, AVG(poop.size) AS average_poop_size, SUM(poop.size) total_poop, COUNT(poop.id) AS number_of_poops  
=== A completely valid query ===
 
<source lang="mysql">
SELECT monkey.value AS monkey_name, AVG(poop.size) AS average_poop_size, SUM(poop.size) total_poop, COUNT(poop.id) AS number_of_poops  
FROM monkey
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id  
JOIN poop ON poop.monkeyid = monkey.id  
GROUP BY monkey.id}}
GROUP BY monkey.id
</source>
 
{| border = 1
|+ Query results
! monkey_name
! average_poop_size
! total_poop
! number_of_poops
|-
| Bobo
| 4.00
| 4.0
| 1
|-
| Bimbo
| 3.45
| 6.9
| 2
|-
| The Hooker
| 3.80
| 3.8
| 1
|}
 
Boom, bitches.  Fire up Crystal and turn that sucker into a report, amirite?
 
=== BUT WAIT! ===
Thanks to my narrative-writing powers, you remember another table that contains data relevant to this report!
 
{| border = 1
|+ Meal
! id
! description
! weight
! date
|}

Revision as of 14:59, 20 January 2010

The GROUPing pitfall is a very common logical error in SQL queries, especially among people getting familiar with a DBMS.

This is what will happen - you'll be writing a query with a couple JOINs in it. You add a GROUP BY for the correct columns, and everything looks peachy. Then, at some point in the future (could be 5 minutes, could be a month), BOOM! Your query is returning random shit, and you have no clue why.

Considering that we here at ISoft pay the bills by writing queries that DON'T return random shit, this is something you should probably care about.

The short version

If your query joins on more than one table with a one-to-many relationship with the row you're grouping by, you CAN NOT use any aggregate functions in your query.

If you do, then any time that more than one of those JOINs matches more than one row, all of your aggregate functions will break.

Wait, why?

You know about JOINs, right? Let's say you have two tables:

Monkey
id (primary key) value weight
1 Bobo 220
2 Bimbo 200
3 The Hooker 375
Poop
id (primary key) monkeyid date size
1 2 2010-01-01 13:43:04 4.3
2 3 2010-01-02 06:12:44 3.8
3 1 2010-01-02 09:14:56 4.0
4 2 2010-01-02 15:05:33 2.6

As you can infer, there is a one-to-many relationship between "monkey" and "poop". That is to say, one monkey can have any number of poops.

I'm sure you understand what happens when they are joined together:

SELECT monkey.value AS monkey_name, poop.size AS poop_size
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id
ORDER BY monkey.id

You get this!

Query results
monkey_name poop_size
Bobo 4.0
Bimbo 4.3
Bimbo 2.6
The Hooker 3.8

But you already knew that. I mean, that's kid stuff! I didn't even write it out, I actually got one of my younger siblings to type it out because I was so bored by it.

You see that Bimbo shows up twice - the one row in the monkeys table got multiplied by all the matching rows in the poop table that it got joined to. You've seen it before, and if you wanted to use this data in a report that displayed one record per monkey, you would know what to do - use a GROUP BY!

A completely valid query

SELECT monkey.value AS monkey_name, AVG(poop.size) AS average_poop_size, SUM(poop.size) total_poop, COUNT(poop.id) AS number_of_poops 
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id 
GROUP BY monkey.id
Query results
monkey_name average_poop_size total_poop number_of_poops
Bobo 4.00 4.0 1
Bimbo 3.45 6.9 2
The Hooker 3.80 3.8 1

Boom, bitches. Fire up Crystal and turn that sucker into a report, amirite?

BUT WAIT!

Thanks to my narrative-writing powers, you remember another table that contains data relevant to this report!

Meal
id description weight date