First of all, let’s compare the performance of row enumeration queries using a self join and variables:
1) The classic method with a self join:
Duration / Fetch
16.084 sec / 0.016 sec
2) Using variables:
Duration / Fetch
0.016 sec / 0.015 sec
The result speaks for itself. However, it should be understood that value calculation using variables is suboptimal for filtering. Sorting and calculation will be done for ALL rows, even if we need just a small portion of them in the end.
Let’s examine this more closely using the following problem as an example:
Get the first 2 rows from TestTable for each group_id, with records in a group being sorted by order_id.
In a DBMS supporting analytic functions, the solution would look like this:
Since the DBMS “knows” about ROW_NUMBER and how it works, the query optimizer doesn’t need to enumerate ALL rows to get the first two. Thus, everything works fast (if indexes exist for group_id, order_id, of course).
For MySQL, the solution using a similar algorithm will look as follows:
However, the MySQL optimizer doesn’t know anything about the rules we use to calculate the RowNum field. Thus, it will have to enumerate ALL rows, and only then pick out the ones needed.
Now suppose we have one million records and 20 unique values for group_id. That is, to get 40 rows, MySQL will calculate the value of RowNum a million times!There is no neat single-query solution for this problem in MySQL. Yet it’s possible to get the list of unique group_id values – e. g., as shown below:
and then, using any other programming language, generate a query like this:
20 lightweight queries will be executed much faster than the calculation of RowNum for a million rows.