• nine_k 7 days ago

    I prefer to prepare the datasets first, and not depend on the order of evaluation (which you may fail to predict correctly):

      select * from some_table
      join (select rand() as random_id) on
        some_table.the_id = random_id
    This way it is obvious that rand() will be evaluated only once.
    • mjevans 7 days ago

      I like your query as a portable version of the written function, but I also think that it could fail to yield the results probably intended before that mess was written.

      If the intent is actually to select one, and only one, random row from the table than this Stack Overflow entry has a result that is in most cases in the direction I was thinking (LIMIT 1, and have some way of selecting a random entry, remembering that the ID field might be monotonic, but isn't assured to be continuous (have no gaps))


      Unfortunately it seems that it depends on language extensions that are not portable.


      It unfortunately also appears that the easiest query scales poorly to large tables as it works by assigning EACH row a random ID and then limiting it to one of those by some order (probably lowest assigned row).

      A more effective approach might be a function or procedure based around a pre-recorded 'maximum' ID for the column (assuming it's vaguely monotonic and not a GUID), picking a random result between base and that ID, Verifying it exists, and if it doesn't trying again up to N times before using a fallback procedure. (E.G. after 100 random stabs that fail selecting ~500 records in the range of lastRND +/- 250 and using the ORDER BY rand() or equivalent on that sub-set). If even /that/ fails pick the next or previous record by which is closest.

      • nine_k 7 days ago

        I think that something like

          select * from the_table 
          offset rand(select count(*) from the_table)
          limit 1;
        would do the trick with least bias, though not necessarily very fast.
        • mjevans 7 days ago

          The offered example query is could return no results in the presence of record gaps / deleted records; also the OFFSET value should probably be computed within a sub-query, not exposed.

          A corrected version based on being >= or <= the random number (picking the nearest) COULD work (in fact, that's the way my suggest final fallback works), but would still be biased.

          That was the reason I suggested taking a limited number of stabs at random numbers within the allowed range; my example still has some slight bias (a little in the 'best case' fallback, and a lot in the absolute fallback), but it tries to deliver a truly random result first, many times.

    • wolf550e 7 days ago

      Why do people use "i.e." instead of "e.g."?

      EDIT:" For people who think I don't know what they mean, I do. I meant that the author of TFA obviously meant to write "Nondeterministic Functions in MySQL (e.g. Rand)", because the entire article is about all non-deterministic functions and just uses rand as an example.

      • lotyrin 7 days ago

        Why do people use the wrong there/their/they're?

        Why do people match salmon colored items with rose ones?

        Why do people sing off key?

        They aren't themselves sensitive to the difference and don't think about it.

        • thisacctforreal 7 days ago

          "i.e." – for "id est" in latin – means "that is", or "in other words", "in essence".

          "e.g." is "exempli gratia" in latin; just meaning "for example".

          • qubex 7 days ago

            Because they mean absolutely different things: ”e.g.” stands for “exempli gratia” (“for example”) while “i.e.” stands for “id est” (”in other words”).

            • optimuspaul 7 days ago

              well in this case they only really talk about the one function so i.e. is not incorrect.

              • paulddraper 7 days ago

                They say "Functions" though.

                • abundant 7 days ago

                  Here one would read “that is” as something like “specifically”.

                  • Zalastax 7 days ago

                    But that's not how one should read "that is". The author should have used "e.g." here, no question. Using "i.e." would imply that there exists only one nondeterministic function in MySQL, which is not the case.

                    • paulddraper 6 days ago

                      Yeah, or I'd say "namely"

              • cryptonector 7 days ago

                As to SQLite3, the author should have indicated what version they tested, and they should have tested more versions. SQLite3 3.8.3 introduced a flag for marking functions as deterministic (vs. not), and in the version I just tried (3.11.0) it works as expected.

                The rule I expect for where clause expression evaluation is: standard short-circuit semantics. That calls to deterministic functions get memoized, or that constant sub-expressions get hoisted out, is a plus -- I do expect GCSE and optimization of deterministic functions from modern RDBMSes. I also expect that functions meant to be used in queries have no side effects -- that they insert/update/delete no rows on any tables.

                • alikrubin 3 days ago

                  My name is Alex, I'm the author of the original article on Percona Blog. I wanted to thank everyone for the great comments and discussion! Additional SQL examples are great as well.

                  • jcriddle4 7 days ago

                    For SQL Server it looks like an index seek on the following so it is evaluated once.

                    SELECT * FROM SomeTable WHERE ID = CAST(Rand() * 1000 as int)

                    In addition this query: SELECT TOP(100) rand() AS RandData from SomeTable

                    Returns: 0.940284064056996 0.940284064056996 0.940284064056996 . . . .

                    A old trick to actually get random data on SQL Server that I think has been replaced with newer functionality is:

                    SELECT ABS(CHECKSUM(NEWID())) FROM SomeTable

                    However if we move the NEWID.... stuff into the WHERE clause we end up switching to an index seek so back to being evaluated once.

                    • WhyNotHugo 6 days ago

                      FWIW, the proper what to get a random row is actually:

                          SELECT column FROM table
                          ORDER BY RANDOM()
                          LIMIT 1