Random sampling of rows from a table in SQL Server

In due time we used random sampling of entries to create a list of questions for a test. We did it on the client with the help of basic language functions which generate pseudorandom numbers within a given range (for example, the RND function in Visual Basic).

It turned out, however, quite simple to arrange it on the server side. Both by similar tools (with the help of the RAND function in T-SQL) and on the basis of uniqueidentifier datatype which is called global unique identifier and looks like this:

xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx,

where each x is a hexadecimal digit between 0-9 or a-f. For example,
EEA8BF3A-7633-477A-B8C1-8C60DC9AD20C.

For the solution of our problem it is essential that this unique identifier can be automatically generated by employing the NEWID function. Just write down in SSMS (Management Studio)

SELECT NEWID();

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

SELECT TOP 1 NEWID() FROM PC

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

on the SQL-EX.RU website, and you will see everything. Every time the query being executed, you’ll be getting different values. Therein lies the uniqueness of the identifier.

So, the idea is to add a computed column NEWID, by which the sorting should be performed, to the query yielding the required number of entries. Since generated values are arbitrary, we’ll get a random fetch.

Assume that we need to select two random models from the Product table:

Solution 8.8.1

SELECT model
FROM (SELECT TOP 2 model, NEWID() [id]
      FROM Product
      ORDER BY [id]
     ) x;

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

Or simpler

Solution 8.8.2

SELECT TOP 2 model
FROM Product
ORDER BY NEWID();

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

Let us run this query several times:


–1–
1232
1433
–2–
1276
1260
–3–
1750
1298

You can carry on if you like.

Now let us come back to the “traditional” way. The RAND function generates pseudorandom floating point numbers between 0 and 1.

The term ‘pseudorandom’ denotes that such numbers are computed by means of a certain atithmetic algorythm. That is, with the initial conditions (input) being equal, the derived numbers will be the same. The initial conditions may be set explicitly, by supplying the function argument which can be any number of tinyint, int or smallint type, or implicitly. In the last case the argument is omitted and the initial value is selected by SQL Server.

Let us execute the following query:

SELECT TOP 1 RAND(), RAND(350)
FROM Product;

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

We will run it in MS. I got 0.0485421339242268 and 0.72009490018203537. It may be safely stated that the first number you get will be different, but the second may be the same, since in this case the initial value was specified (350).

Now, let us employ the RAND function for our problem, namely, for producing two randomly selected models:

SELECT TOP 2 model
FROM Product
ORDER BY RAND(model);

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

We obtain two models in ascending order by their numbers. That’s where pseudorandomness shows up. Let’s have a look at what kind of random numbers we’ve got:

SELECT model, RAND(model) rnd
FROM Product
ORDER BY rnd;

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

For brevity’s sake, I drop part of them:

Model	rnd
1121	0.734461
1232	0.736529
1233	0.736548
1260	0.737051
1276	0.737349
1288	0.737573
1298	0.737759
…

I don’t know which algorithm is used for random number computing, nevertheless I can state that the RAND function behaves monotonously within this range of model numbers. That is why we failed.

BOL gives an example of generating a sequence of random numbers using the system time in order to dynamically change initial value. Here it is:

SELECT RAND( (DATEPART(mm, GETDATE()) * 100000 )
            + (DATEPART(ss, GETDATE()) * 1000 )
            + DATEPART(ms, GETDATE())
           );

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

However, in this form, the query can be run only in slow packages, so that its repeated execution be less frequent than the time unit used in the query (milliseconds). The use of this function will obviously not meet the condition. But multiplying the argument by some unique identifier will score a success (this approach was suggested by V.I. Gershovich):

SELECT model, RAND(model*(DATEPART(mm, GETDATE()) * 100000 )
                        + (DATEPART(ss, GETDATE()) * 1000 )
                        + DATEPART(ms, GETDATE())
                  )
FROM Product
ORDER BY model;

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

But here is another problem, related to the argument of the RAND function being integer. If we exceed the maximum allowable value for the integer (for SQL Server it makes 2^31-1 or 2 147 483 647), we’ll get the following error message:

Arithmetic overflow error converting expression to data type int.

You can check it by running the above query on the website. The error appears for model numbers exceeding 2000. In such cases a normalizing factor is needed as well, for instance,

CASE
    WHEN model < 2000
    THEN model
    ELSE model/10+model % 10
END

Here, model % 10 (remainder from division by 10) is incorporated, lest significant digits should be lost; otherwise, for models which numbers differ by less then ten equal values might be produced.

The net solution looks like this (naturally, sorting should be performed by rnd rather than by model which we left for clearness of result).

Solution 8.8.3

SELECT model,
       RAND(CASE
                WHEN model < 2000
                THEN model
                ELSE model/10+model % 10
            END *
             (DATEPART(mm, GETDATE()) * 100000 )
             + (DATEPART(ss, GETDATE()) * 1000 )
             + DATEPART(ms, GETDATE())
            ) rnd
FROM Product
ORDER BY model;

🚫

[[ error ]]

[[ column ]]
NULL [[ value ]]

And now compare results:

model	Rnd
1121	0.401381
1232	0.487199
1233	0.983468
1260	0.382721
1276	0.323019
1288	0.278243
1298	0.240929

Conclusion. The simpler and more reliable way of solving the problem in question is to use the NEWID() function which guarantees the uniqueness of values. However, these values are not numeric. Therefore, when it is a number that is needed, one should pay attention to the RAND() function.

#RND #Random Sampling of Rows #RAND #NEWID

About implicit conversion of SQL Server data types Grouping of time intervals