Difference between revisions of "Writing MySQL queries"

Latest revision as of 12:29, 25 January 2016

Formatting

When writing queries for production (for use in code, or in reports) there are some steps you can take to make your queries easier to read.

Case

Query text that references MySQL elements (basic query syntax, functions, etc) should be in all uppercase letters, while references to elements of a database structure should be in lowercase. For example, instead of

select PANTS from tableofstuff Where date(teh_date) = Now()

you would write

SELECT pants FROM tableofstuff WHERE DATE(teh_date) = NOW()

Backticks

When writing production queries, all references to database names, table names, column names, index names, etc should be enclosed in backticks.

For example, instead of

SELECT shit FROM creek

you would write

SELECT `shit` FROM `creek`

Newlines

Add a newline before every major part of the query (FROM, JOINs, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT). i.e., instead of

SELECT `butt_size`, `ugliness` FROM `mothers` WHERE `fat` = 'True' ORDER BY `ugliness`

you would use

SELECT `butt_size`, `ugliness` 
FROM `mothers` 
WHERE `fat` = 'True' 
ORDER BY `ugliness`

Referencing table names

MySQL allows you to reference column names without specifying the table name, as long as the column name is not ambiguous in your query. In other words, this:

SELECT `hat` FROM `thing1` JOIN `thing2` ON `something` = `somethingelse`

is a valid query as long as `hat` is a column in either `thing1` or `thing2`, and not both. The same goes for the columns `something` and `somethingelse` - they can only be in one of the tables involved in the query.

However, the way you SHOULD write the query is like so:

SELECT `thing1`.`hat` FROM `thing1` JOIN `thing2` ON `thing2`.`something` = `thing1`.`somethingelse`

In large queries, it is MUCH easier to debug/comprehend queries that tell you the table names of all columns. Also, if a `hat` column is added to the `thing2` table at some point in the future, the former query will produce an error, while the latter will continue working.

Aliases

It is possible to give your tables aliases using the AS command, so that you can refer to them with an alternate name in a query.

If you are joining a table to itself, it is necessary to give the table an alias at least once.

However, in any other case, it is generally not a good idea in production code - when debugging large queries written by someone else, it's generally a lot easier to figure out where the 20 columns in a SELECT are coming from if the correct table name is used, so you don't have to figure out what "pl.name" is, or where "is.amount" might be in the database.

Things you just shouldn't do

These are not suggestions. If I catch you doing these things in reports or in code, I *will* find you, and inflict serious physical pain upon you.

I am vigilant, and quick to anger when exposed to shoddy query-writing. It's not worth the risk.

SELECT *

Sure, it's fine for quick queries when you want to see what's sitting in the database. But when you're writing queries for a product, just don't do it.

You should read this explanation of why it's a bad idea - or you could just take my word for it. Whatever it takes to keep you from committing a query that will return an undefined number of rows.

Reference a comma-separated list of tables

Much to my consternation, most introductions to query-writing seem to start off by showing you an example query like this:

SELECT `table1`.`something`, `table2`.`something_else`
FROM `table1`, `table2`
WHERE `table1`.`key` = `table2`.`key`

Which is perfectly valid SQL, to be sure - but it makes for poor queries, for 2 reasons:

It's often inefficient
It takes more work to understand what's happening in the query

This is the sort of query you should be writing:

SELECT `table1`.`something`, `table2`.`something_else`
FROM `table1`
JOIN `table2` ON `table1`.`key` = `table2`.`key`

It's more efficient

As rows from table2 are joined to table1, rows that do not match the requirements of the ON clause are dropped - in a query that references several tables, this is important.

If all the relationships are defined in the WHERE clause, what happens is that MySQL builds a data set that contains the cartesian product of all the tables and then narrows them down afterward using the WHERE clause.

MySQL tries its best to do this intelligently, but why expect it to read your mind? Write your JOINs so that after every single JOIN, the dataset will be as small as possible.

It's easier to read

Generally speaking, the tables that you join together will have an established relationship. When you are reading a query, it should be easy to see how the different parts of it relate to each other.

As long as every JOIN has an ON clause that defines its relationship to the previous tables, this should be the case.

Common traps

Over the history of ITrack, there have been hundreds of bugs fixed in reports, as well as the software itself, that were the result of a logical error in a query.

By using what you know about the database structure, you should be able to write queries that you can guarantee will always return the correct number of results. Here are some tips:

Stupid table joins

Many of the queries you write will involve 2 tables that have a 1-to-1 or 1-to-many relationship with each other.

This means that you have one table that holds a bunch of records, where each entry links to exactly 1 row in another table - for example, every record in an "inventory" table will have a relationship to a single "inventory type" record in another table.

You may know that inventory records have an inventory type, and you will probably have to write a query that joins them together at least a few times.

When that time comes, you may think to yourself "but how do I know which fields to use to join them together? What will I be joining on?"

Your first step should always be LOOK AT THE PRIMARY KEY. If you're joining the inventorytype table onto the inventory table, look at the primary key of the inventorytype table. That way, you will know what fields are the bare minimum that must be referenced in your query.

Confusing MySQL by treating numbers as strings

The short version: if you are dealing with numbers in MySQL, never enclose them in quotes. The value '25' is different from the value 25.

The longer version (which isn't really that long, and you should totally read it right now) is well-explained by the smart Mr. Noach.

Getting bad data when grouping and summarizing information

This is such an important topic that I have written an entire doc on it.

Logical errors won't stop your query from running, but they will result in bad data. If you're writing queries, you need to familiarize yourself with The GROUPing pitfall.

Seriously. Logical errors in your queries will fuck you over. Pay attention.

@@ Line 3: / Line 3: @@
 === Case ===
-Query text that references MySQL elements (basic query syntax, functions, etc) should be in all uppercase letters, while references to elements of a database structure should be in lowercase.  For example, instead of {{MySQL|query=select PANTS from tableofstuff Where date(teh_date) {{=}} Now()}} you would write {{MySQL|query=SELECT pants FROM tableofstuff WHERE DATE(teh_date) {{=}} NOW()}}
+Query text that references MySQL elements (basic query syntax, functions, etc) should be in all uppercase letters, while references to elements of a database structure should be in lowercase.  For example, instead of
+<source lang="MySQL">
+select PANTS from tableofstuff Where date(teh_date) = Now()
+</source>
+you would write
+<source lang="MySQL">
+SELECT pants FROM tableofstuff WHERE DATE(teh_date) = NOW()
+</source>
 === Backticks ===
 When writing production queries, all references to database names, table names, column names, index names, etc should be enclosed in backticks.
-For example, instead of {{MySQL|query=SELECT shit FROM creek}} you would write {{MySQL|query=SELECT `shit` FROM `creek`}}
+For example, instead of
+<source lang="MySQL">
+SELECT shit FROM creek
+</source>
+you would write
+<source lang="MySQL">
+SELECT `shit` FROM `creek`
+</source>
 === Newlines ===
-Add a newline before every major part of the query (FROM, JOINs, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT).  i.e., instead of {{MySQL|query=SELECT `butt_size`, `ugliness` FROM `mothers` WHERE `fat` {{=}} 'True' ORDER BY `ugliness`}} you would use {{MySQL|query=SELECT `butt_size`, `ugliness` <br/>
+Add a newline before every major part of the query (FROM, JOINs, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT).  i.e., instead of
-FROM `mothers` <br/>
+<source lang="MySQL">
-WHERE `fat` {{=}} 'True' <br/>
+SELECT `butt_size`, `ugliness` FROM `mothers` WHERE `fat` = 'True' ORDER BY `ugliness`
-ORDER BY `ugliness`}}
+</source>
+you would use
+<source lang="MySQL">
+SELECT `butt_size`, `ugliness`
+FROM `mothers`
+WHERE `fat` = 'True'
+ORDER BY `ugliness`
+</source>
 === Referencing table names ===
-MySQL allows you to reference column names without specifying the table name, as long as the column name is not ambiguous in your query.  In other words, this: {{MySQL|query=SELECT `hat` <br/>FROM `thing1` <br/>JOIN `thing2` ON `something` = `somethingelse`}} is a valid query as long as `hat` is a column in either `thing1` or `thing2`, and not both.  The same goes for the columns `something` and `somethingelse` - they can only be in one of the tables involved in the query.
+MySQL allows you to reference column names without specifying the table name, as long as the column name is not ambiguous in your query.  In other words, this:
+<source lang="MySQL">
+SELECT `hat` FROM `thing1` JOIN `thing2` ON `something` = `somethingelse`
+</source>
+is a valid query as long as `hat` is a column in either `thing1` or `thing2`, and not both.  The same goes for the columns `something` and `somethingelse` - they can only be in one of the tables involved in the query.
-However, the way you SHOULD write the query is like so: {{MySQL|query=SELECT `thing1`.`hat` <br/>FROM `thing1` <br/>JOIN `thing2` ON `thing2.`something` = `thing1`.`somethingelse`}}
+However, the way you SHOULD write the query is like so:
+<source lang="MySQL">
+SELECT `thing1`.`hat` FROM `thing1` JOIN `thing2` ON `thing2`.`something` = `thing1`.`somethingelse`
+</source>
 In large queries, it is MUCH easier to debug/comprehend queries that tell you the table names of all columns.  Also, if a `hat` column is added to the `thing2` table at some point in the future, the former query will produce an error, while the latter will continue working.
@@ Line 30: / Line 69: @@
 However, in any other case, it is generally not a good idea in production code - when debugging large queries written by someone else, it's generally a lot easier to figure out where the 20 columns in a SELECT are coming from if the correct table name is used, so you don't have to figure out what "pl.name" is, or where "is.amount" might be in the database.
+== Things you just shouldn't do ==
+These are not suggestions.  If I catch you doing these things in reports or in code, I *will* find you, and inflict serious physical pain upon you.
+I am vigilant, and quick to anger when exposed to shoddy query-writing.  It's not worth the risk.
+=== SELECT * ===
+Sure, it's fine for quick queries when you want to see what's sitting in the database.  But when you're writing queries for a product, just don't do it.
+You should read this [http://parseerror.com/sql/select*isevil.html explanation] of why it's a bad idea - or you could just take my word for it.  Whatever it takes to keep you from committing a query that will return an undefined number of rows.
+=== Reference a comma-separated list of tables ===
+Much to my consternation, most introductions to query-writing seem to start off by showing you an example query like this:
+<source lang="MySQL">
+SELECT `table1`.`something`, `table2`.`something_else`
+FROM `table1`, `table2`
+WHERE `table1`.`key` = `table2`.`key`
+</source>
+Which is perfectly valid SQL, to be sure - but it makes for poor queries, for 2 reasons:
+* It's often inefficient
+* It takes more work to understand what's happening in the query
+This is the sort of query you should be writing:
+<source lang="MySQL">
+SELECT `table1`.`something`, `table2`.`something_else`
+FROM `table1`
+JOIN `table2` ON `table1`.`key` = `table2`.`key`
+</source>
+==== It's more efficient ====
+As rows from table2 are joined to table1, rows that do not match the requirements of the ON clause are dropped - in a query that references several tables, this is important.
+If all the relationships are defined in the WHERE clause, what happens is that MySQL builds a data set that contains the [http://en.wikipedia.org/wiki/Cartesian_product cartesian product] of all the tables and then narrows them down afterward using the WHERE clause.
+MySQL tries its best to do this intelligently, but why expect it to read your mind?  Write your JOINs so that after every single JOIN, the dataset will be as small as possible.
+==== It's easier to read ====
+Generally speaking, the tables that you join together will have an established relationship.  When you are reading a query, it should be easy to see how the different parts of it relate to each other.
+As long as every JOIN has an ON clause that defines its relationship to the previous tables, this should be the case.
 == Common traps ==
@@ Line 36: / Line 115: @@
 By using what you know about the database structure, you should be able to write queries that you can guarantee will always return the correct number of results.  Here are some tips:
-=== Table joins ===
+=== Stupid table joins ===
 Many of the queries you write will involve 2 tables that have a 1-to-1 or 1-to-many relationship with each other.
@@ Line 47: / Line 126: @@
 Your first step should always be '''LOOK AT THE PRIMARY KEY'''.  If you're joining the inventorytype table onto the inventory table, ''look at the primary key'' of the inventorytype table.  That way, you will know what fields are the bare minimum that must be referenced in your query.
-=== Grouping and summarizing information ===
+=== Confusing MySQL by treating numbers as strings ===
-Many queries you write will use GROUP BYs and aggregate functions (like SUM(), AVG(), and COUNT()) to gather information.
+The short version: if you are dealing with numbers in MySQL, never enclose them in quotes.  The value '25' is different from the value 25.
-If you disobey the following rule, your results WILL BE WRONG.  Maybe not on the first dataset you test it with, but it will be inevitable.
-It's a very important rule.  Don't fucking forget it.
-'''Never use more than 1 many-to-many JOINs in the same query that uses aggregate functions'''
+The longer version (which isn't really that long, and you should totally read it right now) is [http://code.openark.org/blog/mysql/beware-of-implicit-casting well-explained by the smart Mr. Noach].
-This query has a logical bug:
+=== Getting bad data when grouping and summarizing information ===
+This is such an important topic that I have written an entire doc on it.
-{{MySQL|query=SELECT `inventory`.`tagnumber`, SUM(`inventoryserial`.`status` = 'Available') AS `number_of_available_serialized_parts`, SUM(`salesorderline`.`quantity`) AS `number_of_parts_on_sales_orders` <br/>FROM `inventory` <br/>JOIN `inventoryserial` ON `inventory`.`inventoryid` = `inventoryserial`.`inventoryid` # one inventory record can have multiple inventoryserial records <br/>JOIN `salesorderline` ON `salesorderline`.`inventoryid` = `inventory`.`inventoryid` AND `salesorderline`.`inventorystoreid` = inventory`.`storeid` # one inventory record can be on many sales orders <br/>GROUP BY `inventory`.`storeid`, `inventory`.`inventoryid` # the primary key of the inventory table}}
+Logical errors won't stop your query from running, but they will result in bad data.  If you're writing queries, you need to familiarize yourself with [[The GROUPing pitfall]].
-If any inventory record matches with more than one serialized record, or with more than one sales order lineitem, the SUMs will return unexpected values.
+Seriously.  Logical errors in your queries will fuck you over.  Pay attention.