Advanced MySQL user variable techniques
MySQL’s user variables have interesting properties that enable the useful techniques. One property is that you can read from and assign to a user variable simultaneously, because an assignment can be an r-value (the result of the assignment is the final value of the variable). Another property, which sometimes causes confusing behavior, is un-intuitive evaluation time. This post show you how to make sure your variables get updated at the time they’re used, instead of potentially reading and updating them at different stages of query execution. This technique enables a whole new range of applications for user variables. As a bonus, it also avoids extra columns of output created by variable manipulations.
Setup
CREATE TABLE fruits (
type varchar(10) NOT NULL,
variety varchar(20) NOT NULL,
price decimal(5,2) NOT NULL default 0,
PRIMARY KEY (type,variety)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
insert into fruits(type, variety, price) values
('apple', 'gala', 2.79),
('apple', 'fuji', 0.24),
('apple', 'limbertwig', 2.87),
('orange', 'valencia', 3.59),
('orange', 'navel', 9.36),
('pear', 'bradford', 6.05),
('pear', 'bartlett', 2.14),
('cherry', 'bing', 2.55),
('cherry', 'chelan', 6.33);
Simultaneous assignment and reading
MySQL lets you assign and read a variable at the same time. This is familiar in many programming languages where an assignment can be an r-value. For example,
set @test1 := 0;
set @test2 := @test1 := 5;
select @test1, @test2;
+--------+--------+
| @test1 | @test2 |
+--------+--------+
| 5 | 5 |
+--------+--------+
The second statement sets @test1
to 5, and then sets @test2
to the result of that assignment, which is the current value of @test1
.
My previous articles have shown you how to exploit this to number rows
in a result set, among other things. For example, you can keep a
running count as MySQL processes rows, updating and returning the count
at the same time.
Side effects
Unfortunately, it got a bit messy sometimes. For example, the following batch, which restarts the numbering every time type
changes, spews an extra dummy
column into the output, because that column is where the calculations are taking place:
set @type := '', @num := 1;
select type, variety,
@num := if(@type = type, @num + 1, 1) as row_number,
@type := type as dummy
from fruits
order by type, variety;
+--------+------------+------------+--------+
| type | variety | row_number | dummy |
+--------+------------+------------+--------+
| apple | fuji | 1 | apple |
| apple | gala | 2 | apple |
| apple | limbertwig | 3 | apple |
| cherry | bing | 1 | cherry |
| cherry | chelan | 2 | cherry |
| orange | navel | 1 | orange |
| orange | valencia | 2 | orange |
| pear | bartlett | 1 | pear |
| pear | bradford | 2 | pear |
+--------+------------+------------+--------+
In previous articles I suggested wrapping that query in a subquery so you can pick which columns you want in the output. That is a bit inefficient (it creates a temporary table internally) and feels kind of kludgey.
Lazy evaluation
MySQL doesn’t evaluate expressions containing user variables until they are sent to the client, so some expressions don’t work as expected. Setting a variable in one place (such as the SELECT
list) and reading it another (such as the HAVING
clause) might give weird results, like as those I demonstrated in my
last article where every row was numbered 1 instead of getting
incremented as expected.
Here’s further clarification from the manual:
In a SELECT statement, each expression is evaluated only when sent to the client. This means that in a HAVING, GROUP BY, or ORDER BY clause, you cannot refer to an expression that involves variables that are set in the SELECT list. For example, the following statement does not work as expected:
mysql> SELECT (@aa:=id) AS a, (@aa+3) AS b FROM tbl_name HAVING b=5;
The reference to b in the HAVING clause refers to an alias for an expression in the SELECT list that uses @aa. This does not work as expected: @aa contains the value of id from the previous selected row, not from the current row.
In other words, the “alias” in the HAVING
clause is
probably a pointer to a memory location, whose content is not
determined for the current row until the current row is output to the
client — at which point it’s too late to apply any HAVING
criteria to the row.
Side effects of lazy evaluation
In my last article I showed you how to select the top N rows from
each group with user variables. To make that work right, I had to group
the query, use a HAVING
clause, and force a certain index
order for that query — because of lazy evaluation. Otherwise, I might
have been able to just use the variable in a WHERE
clause, right? Lazy evaluation is why this doesn’t work:
set @type := '', @num := 1;
select type, variety, price,
@num := if(@type = type, @num + 1, 1) as row_number,
@type := type as dummy
from fruits
where @num <= 2;
+-------+------------+-------+------------+-------+
| type | variety | price | row_number | dummy |
+-------+------------+-------+------------+-------+
| apple | gala | 2.79 | 1 | apple |
| apple | fuji | 0.24 | 2 | apple |
| apple | limbertwig | 2.87 | 3 | apple |
+-------+------------+-------+------------+-------+
That last row gets output even though it seems @num
should have the value 3, eliminating it from the results. However, you can infer from this behavior that @num
really had the value 2 at the time the WHERE
clause was evaluated, and was only incremented to 3 after the row was sent to the client.
This aspect of user variable behavior makes user variables significantly harder to understand. Sometimes the results are non-deterministic and/or hard to predict. It would be great if there were a way to update those variables in the context in which they’re declared, so they get assigned and read at the same time, instead of having to wait for rows to be sent to the client — a different step in the query execution plan.
Forcing variable evaluation with multi-staged queries
If you understand the order of the steps MySQL uses to execute a
query, you can see there are opportunities to make MySQL “finish up”
variable assignments before sending the query to the next step. In
fact, perhaps it’s a bit misleading to say assignments in the SELECT
are done when rows are sent. I think it’s more accurate to say they’re done when rows are generated for each stage in query execution.
You can see this in a subquery in the FROM
clause,
which is internally stored as an intermediate temporary table. Variable
assignments are done before or as the rows are stored in the temporary
table, so when results are read from the temporary table, there are no
funny side effects.
Let me show you the previous query slightly rewritten, and you’ll see what I mean:
set @type := '', @num := 1;
select type, variety, price, row_number
from (
select type, variety, price,
@num := if(@type = type, @num + 1, 1) as row_number,
@type := type as dummy
from fruits
) as x
where row_number <= 2;
+--------+----------+-------+------------+
| type | variety | price | row_number |
+--------+----------+-------+------------+
| apple | gala | 2.79 | 1 |
| apple | fuji | 0.24 | 2 |
| orange | valencia | 3.59 | 1 |
| orange | navel | 9.36 | 2 |
| pear | bradford | 6.05 | 1 |
| pear | bartlett | 2.14 | 2 |
| cherry | bing | 2.55 | 1 |
| cherry | chelan | 6.33 | 2 |
+--------+----------+-------+------------+
Just by introducing an intermediate step in the query, I forced the
variables to be evaluated so the results, when they get to the outer WHERE
clause, are deterministic. But as I mentioned before, this is kind of
kludgey, and depending on the data, it might not be very efficient to
create an intermediate temporary table for the results.
Are there better ways? You bet!
Try 1: Use functions to force immediate evaluation
Here’s an idea: what if certain functions evaluate their arguments
immediately? You could exploit that to create a context that has to be
evaluated first, sort of like parenthesizing an expression in an
equation. You know, a = (a + b) * (b + c)
means “do the
additions first,” which wouldn’t be the case without the parentheses —
normally multiplication comes before addition.
For this to work, you’d need a function that guarantees the expression is evaluated. For example, COALESCE()
might be a good choice as long as you put the expression first in the argument list, since COALESCE()
shortcuts and doesn’t evaluate any more arguments as soon as it find a non-NULL argument.
Theoretically, then you could write something like the following and get the desired results:
set @type := '', @num := 1;
select type, variety, price,
coalesce(@num := if(@type = type, @num + 1, 1)) as row_number
...
It doesn’t work. Why not? Because the COALESCE
itself isn’t evaluated until the rows are generated. So much for that idea.
What about a scalar subquery, then?
set @num := 0, @type := '';
select type, variety, price,
(select(@num := if(@type = type, @num + 1, 1))) as row_number,
...
Sorry, no dice. This gives exactly the same results.
This idea will not work, period. Each and every expression in the SELECT
list is evaluated as the rows are generated.
Functions are expressions, scalar subqueries are expressions… the only
things that will work are operations that result in rows being
evaluated for a final value.
Try 2: Force my will on the query
One thing I do know: subqueries in the FROM
clause are
materialized to a temp table, so this will definitely result in rows
being generated. This might do what I want, at the expense of
generating temporary tables willy-nilly:
set @type := '', @num := 1;
select type, variety, price,
(select n from (select @num := if(@type = type, @num + 1, 1) as n) as x) as row_number,
(select t from (select @type := type as t) as x) as dummy
from fruits
where @num <= 2;
That won’t work either, as it turns out. The subqueries are
correlated — they refer to columns from the outer table. That isn’t
allowed because of the intermediate step, which insulates the inner
queries from the outer. This is a limitation of correlated subqueries:
you can’t nest a subquery in the FROM
clause inside them.
This is really getting silly. It’s time to stop trying to force this to work.
Try 3: Work with me, son
What if I stop trying to get the SELECT
clause to be evaluated at the same time as the WHERE
clause? What if I work with the server’s order of operations, and do all the evaluating and updating in the WHERE
clause instead of in two places? Maybe it looks like this:
set @num := 0, @type := '';
select type, variety, price, @num
from fruits
where
2 >= @num := if(@type = type, @num + 1, 1)
and @type := type;
+--------+------------+-------+------+
| type | variety | price | @num |
+--------+------------+-------+------+
| apple | gala | 2.79 | 0 |
| apple | fuji | 0.24 | 0 |
| apple | limbertwig | 2.87 | 0 |
| orange | valencia | 3.59 | 0 |
| orange | navel | 9.36 | 0 |
| pear | bradford | 6.05 | 0 |
| pear | bartlett | 2.14 | 0 |
| cherry | bing | 2.55 | 0 |
| cherry | chelan | 6.33 | 0 |
+--------+------------+-------+------+
Hmm, that was not really what I wanted. It looks like the variable
is never getting updated at all! I’m not sure why not. Maybe if I
‘parenthesize’ the variable expression like I tried before? I’ll use
the GREATEST()
function, which I know will evaluate all its arguments instead of short-cutting like COALESCE()
:
set @num := 0, @type := '';
select type, variety, price, @num
from fruits
where
2 >= @num := greatest(0, if(@type = type, @num + 1, 1))
and @type := type;
No, that gives the same result. I feel like I’m getting close, though. What if I separate out the assignment and comparison?
set @num := 0, @type := '';
select * from fruits
where @num := if(type = @type, @num + 1, 1)
and @type := type
and @num <= 2;
Empty set (0.00 sec)
select @num, @type;
+------+-------+
| @num | @type |
+------+-------+
| 0 | 0 |
+------+-------+
That didn’t work either. How did @type
get assigned an integer? It should be a string. It turns out the :=
operator has the lowest possible operator precedence, so that WHERE
clause is actually equivalent to
where @num := (
if(type = @type, @num + 1, 1)
and (@type := (
type and @num <= 2)));
If I use parentheses right, maybe I can get it to do what I want:
select * from fruits
where (@num := if(type = @type, @num + 1, 1))
and (@type := type)
and (@num <= 2);
Empty set (0.00 sec)
select @num, @type;
+------+--------+
| @num | @type |
+------+--------+
| 9 | cherry |
+------+--------+
Now I’ve gotten the variables to be assigned, but the WHERE
clause is still eliminating all the rows. This feels so close to being right. What’s missing?
Pay dirt: do the assignment inside the function
In fact, I was very close. All I need to do is move the entire
assignment and the evaluation inside the function. It seems the
variable expressions need to be sealed away from the comparison
operator. In the example below, I’ve put everything inside the GREATEST()
function, but the expression that updates @type
has an incompatible type (string), so I convert it to a number with LENGTH()
and mask its value with LEAST()
.
set @num := 0, @type := '';
select type, variety, price, @num
from fruits
where 2 >= greatest(
@num := if(@type = type, @num + 1, 1),
least(0, length(@type := type)));
The entire GREATEST()
expression evaluates to the resulting value of @num
, which is what I want on the right-hand side of the comparison. And guess what? This works:
+--------+----------+-------+------+
| type | variety | price | @num |
+--------+----------+-------+------+
| apple | gala | 2.79 | 1 |
| apple | fuji | 0.24 | 2 |
| orange | valencia | 3.59 | 1 |
| orange | navel | 9.36 | 2 |
| pear | bradford | 6.05 | 1 |
| pear | bartlett | 2.14 | 2 |
| cherry | bing | 2.55 | 1 |
| cherry | chelan | 6.33 | 2 |
+--------+----------+-------+------+
After playing with more and more combinations, I found another way that works too:
select *, @num
from fruits
where
(@num := if(type = @type, @num + 1, 1)) is not null
and (@type := type) is not null
and (@num <= 2);
I confess, I don’t fully understand this. I figured it out through trial and error. If the user manual explains it well enough for me to have gotten there by reason, I don’t know where. Can someone make it make sense please? I don’t want to have to read the source…
What’s so great about this?
Two words: one pass. One pass through the table — no quadratic-time
algorithms, no grouping or sorting. This is highly efficient. I showed
you another technique with UNION
in my last article,
which might be more efficient in some cases. But if you have lots of
types of fruit, each of which has just a few varieties, you will be
hard-pressed to find a more efficient algorithm to output the first two
rows from each group. In fact, I doubt it can be done.
Spurious columns are gone
Putting the variable assignments inside functions not only let me put everything into the WHERE
clause, it also got rid of the extra columns in the output — without
kludges like subqueries. You can use this technique to clean up your
output whenever you’re doing row-by-row calculations.
Notice the order of rows!
As in previous articles, rows are processed and numbered in order. I
never really stated what I was trying to accomplish in the example
above. The query I showed you will just output a maximum of two
consecutive rows of the same type, in the order they’re read from the
table (actually, I guess that’s the order they pass through the WHERE
filter,
which might not be the same). If I want to do something specific, such
as get the two cheapest varieties from each type of fruit, I need to
add an explicit ORDER BY
to get the rows in order of price:
set @num := 0, @type := '';
select type, variety, price, @num
from fruits
where 2 >= greatest(
@num := if(@type = type, @num + 1, 1),
least(0, length(@type := type)))
order by type, price;