Written by

Tech Lead at BPlus
MOD
Question Henry Pereira · May 20, 2020

How to select a random row from a table?

Hi community,

I need to write an SQL query  to fetch a random record from a table, that table has millions of data.

In postgresql, for example, there is a RANDOM() function to do something like that:

SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1

Is it possible to do something like that in Caché?

Thanks in advance

Comments

Jon Willeke · May 20, 2020

One thing I've done to split machine learning datasets is to use an auxiliary table that maps IDs to a random number. I write a stored procedure that returns a random number. I create a table with two columns, one for the ID of the source table, and one to hold a random number. I populate the column for the source IDs:

insert into random_table (source_id)
select id from source_table

I then populate the column for the random number:

update random_table
set random_number = MySchema.MySP_Random(1E9)

Then I can select with an ORDER BY clause on the random number:

select top 10 source_id
from random_table
order by random_number, source_id

It depends on your use case whether this will be appropriate for a source table with millions of rows. It's an expensive way to select just one row.

0
Eduard Lebedyuk  May 20, 2020 to Jon Willeke

Jonathan, where do you perform the training/fit part of the AI/ML pipeline?

0
Jon Willeke  May 20, 2020 to Eduard Lebedyuk

This is for a simple train-test split as part of testing for IntegratedML, currently in beta.

0
Robert Cemper · May 20, 2020

As there is no default RANDOM() in Caché/IRIS SQL you have to write it yourself.
Create this class:

Class User.Henry
{ClassMethod Random(maxval As %Integer = 2) As %Integer [ SqlName = RANDOM, SqlProc ]
{
 quit $random(maxval)
}}

Assuming your table has defautl ID (%Integer, MINVAL=1)

SELECT TOP 1  column FROM table
HAVING ID >= RANDOM(MAX(ID))

>= ensures that you just get an existing ID and don't drop on a deleted one

Of course, if you know in advance the highest existing ID   (e.g. maxid)  then RANDOM(maxid)  will do it.

0
Michael Davidovich · May 21, 2020

I'm curious to know what you're doing that you are looking for a random record in such a large dataset?  Sincerely interested in what you're doing so I can learn and not challenging the validity of your task!

0