The GROUPing pitfall

From ISoft Wiki
Jump to navigationJump to search

The GROUPing pitfall is a very common logical error in SQL queries, especially among people getting familiar with a relational database.

This is what will happen - you'll be writing a query with a couple JOINs in it. You add a GROUP BY for the correct columns, and everything looks peachy. Then, at some point in the future (could be 5 minutes, could be a month), BOOM! Your query is returning random shit, and you have no clue why.

Considering that we here at ISoft pay the bills by writing queries that DON'T return random shit, this is something you should probably care about.

The short version

If your query joins on more than one table with a one-to-many relationship with the row you're grouping by, you CAN NOT use any aggregate functions in your query.

If you do, then any time that more than one of those JOINs matches more than one row, all of your aggregate functions will break.

The long version

You know how JOINs work, right? Let's say you have two tables:

Monkey
id (primary key) name weight
1 Bobo 220
2 Bimbo 200
3 The Hooker 375
Poop
id (primary key) monkeyid date size
1 2 2010-01-01 13:43:04 4.3
2 3 2010-01-02 06:12:44 3.8
3 1 2010-01-02 09:14:56 4.0
4 2 2010-01-02 15:05:33 2.6

As you can infer, there is a one-to-many relationship between "monkey" and "poop". That is to say, one monkey can have any number of poops.

I'm sure you understand what happens when they are joined together:

SELECT monkey.name AS monkey_name, poop.size AS poop_size
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id
ORDER BY monkey.id

You get this!

Query results
monkey_name poop_size
Bobo 4.0
Bimbo 4.3
Bimbo 2.6
The Hooker 3.8

But you already knew that. I mean, that's kid stuff! I didn't even write it out, I actually got one of my younger siblings to type it out because I was so bored by it.

You see that Bimbo shows up twice - the one row in the monkeys table got multiplied by all the matching rows in the poop table that it got joined to. You've seen it before, and if you wanted to use this data in a report that displayed one record per monkey, you would know what to do - use a GROUP BY!

A completely valid query

SELECT monkey.name AS monkey_name, AVG(poop.size) AS average_poop_size, SUM(poop.size) AS total_poop, 
 COUNT(DISTINCT poop.id) AS number_of_poops 
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id 
GROUP BY monkey.id
Query results
monkey_name average_poop_size total_poop number_of_poops
Bobo 4.00 4.0 1
Bimbo 3.45 6.9 2
The Hooker 3.80 3.8 1

Boom, bitches. Fire up Crystal and turn that sucker into a report, amirite?

BUT WAIT!

Thanks to my narrative-writing powers, you remember another table that contains data relevant to this report!

Meal
id monkeyid description weight date
1 2 A delicious spaghetti dinner with a side of garlic bread. 1.5 2010-01-01 19:24:00
2 3 Shoots 4.7 2010-01-01 08:45:55
3 1 Leaves 3.4 2010-01-01 13:17:00
4 3 Monkey food 5.1 2010-01-01 17:43:19

Dude, the eating habits of these monkeys are totally the sort of thing you would want to use in your report on pooping frequency. Let's join that shit into your query!

SELECT monkey.name AS monkey_name, AVG(poop.size) AS average_poop_size, SUM(poop.size) total_poop, 
 COUNT(DISTINCT poop.id) AS number_of_poops, COUNT(DISTINCT meal.id) AS number_of_meals
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id 
JOIN meal ON meal.monkeyid = monkey.id # This other table has useful data, too!
GROUP BY monkey.id
Query results
monkey_name average_poop_size total_poop number_of_poops number_of_meals
Bobo 4.00 4.0 1 1
Bimbo 3.45 6.9 2 1
The Hooker 3.80 7.6 1 2

Well aren't you a query-writing beast! Time to put this query into production code, eh?

WRONG!

If you were observant, you would have noticed something is amiss with those results - the total_poop of The Hooker is different than it was in the last query! What the junk?

Debugging time - let's run the query without GROUP BY (or the aggregate functions) and see what data we're working with:

SELECT monkey.name AS monkey_name, poop.size AS poop_size, poop.id AS poop_id, meal.id AS meal_id
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id 
JOIN meal ON meal.monkeyid = monkey.id
Query results
monkey_name monkey_id poop_id meal_id
Bimbo 2 1 1
Bimbo 2 4 1
The Hooker 3 2 2
The Hooker 3 2 4
Bobo 1 3 3

Or, to try to show how the JOINs work a bit more clearly:

Query results
monkey_name monkey_id poop_id meal_id
Bimbo 2 1 1
4 1
The Hooker 3 2 2
4
Bobo 1 3 3

Duh!

Of course! As you can see by the poop_id column, poop #2 is getting counted twice! And not only that, but meal #1 is also being counted twice - if we had used any aggregate functions on columns from the meal table, they would have returned fux0red data as well!

By drawing on your basic understanding of how JOINs work, you should quickly see what causes this. Every row that is related to a monkey id is going to be multiplied by the number of matching rows from tables are joined to that monkey id.

Of course, all those multiplied rows disappear from sight as soon as you GROUP BY some column - but the first aggregate function you use will expose the dirty truth: you didn't put any thought into how these tables related!

How to NOT make the mistake

For EVERY SINGLE JOIN you make, you must go through this thought process:

  1. What is this query grouping by - what is the "primary key" of my result set?
  2. This table/subquery I'm joining on - what is its relationship to that "primary key" (one-to-one, one-to-many)?
  3. Does my (pre-GROUP) dataset already have a one-to-many relationship with my final result set?

If it does, then stop what you're doing - you're about to fuck up your results, and probably confuse yourself horribly in the process.

As someone who (I'm assuming) is going to be writing a lot of queries in the future (I don't see us moving to using flat-files as a storage engine any time soon), this is something that will need to become automatic for you.

Brain farts happen, but if you put effort into keeping track of which columns are unique on both sides of your joins, you should be able to make sense of what's going on in your results.

Protip: write that query correctly

How do you make sure that your aggregate functions return the data you want? Well, until MySQL starts reading minds (7.0?), you make sure that only one of the data sources in your query has a one-to-many relationship with your result.

Like most of life's difficulties, this was made easier with the release of MySQL 4.1!

Want to join some phat data into your query, but don't want to risk multiplying your data set by all the matching rows in another table? Join a subquery!

Subqueries can be joined (and even selected from) just like you would a table:

SELECT monkey.name AS monkey_name, poop_grouped.average_poop_size, poop_grouped.total_poop, poop_grouped.poops,
 meal_grouped.average_meal_weight, meal_grouped.total_meal_weight, meal_grouped.number_of_meals
FROM monkey
JOIN
(
	SELECT poop.monkeyid, AVG(poop.size) AS average_poop_size, SUM(poop.size) AS total_poop, COUNT(*) AS poops
	FROM poop
	GROUP BY poop.monkeyid
) AS poop_grouped ON poop_grouped.monkeyid = monkey.id 
JOIN 
(
	SELECT meal.monkeyid, AVG(meal.weight) AS average_meal_weight, SUM(meal.weight) AS total_meal_weight, 
	 COUNT(*) AS number_of_meals
	FROM meal
	GROUP BY meal.monkeyid
) AS meal_grouped ON meal_grouped.monkeyid = monkey.id
Query results
monkey_name average_poop_size total_poop poops average_meal_weight total_meal_weight number_of_meals
Bobo 4.00000 4.0 1 3.40000 3.4 1
Bimbo 3.45000 6.9 2 1.50000 1.5 1
The Hooker 3.80000 3.8 1 4.90000 9.8 2

That's right - every data set (the original table, as well as those subqueries) has a 1-to-1 relationship with our final result set. Because of this, we don't even need a GROUP BY in the main query!

This query is also valid:

SELECT monkey.name AS monkey_name, AVG(poop.size) AS average_poop_size, SUM(poop.size) AS total_poop, 
 COUNT(poop.id) AS poops, meal_grouped.average_meal_weight, meal_grouped.total_meal_weight, 
 meal_grouped.number_of_meals
FROM monkey
JOIN poop ON poop.monkeyid = monkey.id 
JOIN 
(
	SELECT meal.monkeyid, AVG(meal.weight) AS average_meal_weight, SUM(meal.weight) AS total_meal_weight, 
	 COUNT(*) AS number_of_meals
	FROM meal
	GROUP BY meal.monkeyid
) AS meal_grouped ON meal_grouped.monkeyid = monkey.id 
GROUP BY monkey.id

You just have to bear in mind that in your main query, the rows in `monkey` are being multiplied by the number of matches in `poop`, and the GROUP BY combines them safely.

Adding another JOIN with a one-to-many relationship to any of the data already in the query would result in a massive fail.

To summarize

Logical errors in your queries are easy to miss. If something looks wrong in your results, check all the places that data is coming from - look at the relationships between the data sets that contribute to your final results.

If you don't pay close attention, and don't happen to catch problems during testing, your bad query will sit in production until some customer calls up wondering why some of the numbers on their screen "look weird". And that's just embarrassing.