Performance

First of all, let’s compare the performance of row enumeration queries using a self join and variables:

1) The classic method with a self join:

SELECT COUNT(*)N, T1.*
FROM TestTable T1
    JOIN TestTable T2 ON T1.order_id >= T2.order_id
GROUP BY T1.order_id;

For 10000 records in the TestTable, this yields the following results:

Duration / Fetch

16.084 sec / 0.016 sec

2) Using variables:

SELECT @N:=@N+1 N, T1.*
FROM TestTable T1, (SELECT @N := 0)M
ORDER BY T1.order_id;

yields:

Duration / Fetch

0.016 sec / 0.015 sec

The result speaks for itself. However, it should be understood that value calculation using variables is suboptimal for filtering. Sorting and calculation will be done for ALL rows, even if we need just a small portion of them in the end.

Let’s examine this more closely using the following problem as an example:

Get the first 2 rows from TestTable for each group_id, with records in a group being sorted by order_id.

In a DBMS supporting analytic functions, the solution would look like this:

SELECT group_id, order_id, value
FROM(
    SELECT *, ROW_NUMBER()OVER(PARTITION BY group_id ORDER BY order_id) RowNum
    FROM TestTable
    )T
WHERE RowNum <= 2;

Since the DBMS “knows” about ROW_NUMBER and how it works, the query optimizer doesn’t need to enumerate ALL rows to get the first two. Thus, everything works fast (if indexes exist for group_id, order_id, of course).

For MySQL, the solution using a similar algorithm will look as follows:

SELECT group_id, order_id, value
FROM(
    SELECT T.*,
        IF(@last_group_id = group_id, @I:=@I+1, @I:=1) RowNum,
        @last_group_id := group_id
    FROM TestTable T,
        (SELECT @last_group_id:=NULL, @I:=0)I
    ORDER BY group_id, order_id
    )T
WHERE RowNum <= 2;

However, the MySQL optimizer doesn’t know anything about the rules we use to calculate the RowNum field. Thus, it will have to enumerate ALL rows, and only then pick out the ones needed.

Now suppose we have one million records and 20 unique values for group_id. That is, to get 40 rows, MySQL will calculate the value of RowNum a million times!There is no neat single-query solution for this problem in MySQL. Yet it’s possible to get the list of unique group_id values – e. g., as shown below:

SELECT DISTINCT group_id FROM TestTable;

and then, using any other programming language, generate a query like this:

SELECT * FROM TestTable WHERE group_id=1 ORDER BY order_id LIMIT 2
UNION ALL
SELECT * FROM TestTable WHERE group_id=2 ORDER BY order_id LIMIT 2
UNION ALL
…
SELECT * FROM TestTable WHERE group_id=20 ORDER BY order_id LIMIT 2;

20 lightweight queries will be executed much faster than the calculation of RowNum for a million rows.

#MySQL #Performance

Equivalents for analytic functions