Question
Andre Larsen Barbosa · May 3, 2021

Caché native random command

Hello. I have a query in the database that returns several rows, I need to get an "X" number of results randomly. I tried traditional commands from other languages but was unsuccessful. Does Caché have something like that?  

Product version: Ensemble 2018.1
10
2 0 8 127
Log in or sign up to continue

Replies

https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY...

You can just do "SELECT TOP X FROM ...." where X is the number of rows you're looking for. As mentioned in the above link

The TOP clause limits the number of rows returned to the number specified in int. If no ORDER BY clause is specified in the query, which records are returned as the “top” rows is unpredictable.

Kevin, the idea is good, I imagined a command like that, however, for some reason the return is always the same.  

While I don't know the specifics in this case, the issue is that unpredictable isn't quite the same as random and being unpredictable is only one desired property for a random function (for starters, you might also want some sort of even distribution over the range). For example, a function that "randomly" returns either 1, 24 or 73 is unpredictable but isn't the same as a function that returns random numbers over the entire range 1-100.

I'm posting this as a potential solution. I imagine there are smarter ways of doing this. Hoping someone will jump in and show me the way 😁

I've written a classmethod that accepts 3 arguments: the lower and upper limits of the IDs to select, along with the number of records you'll want to select:

Class Sample.Rand
{
ClassMethod RandList(pMin As %Integer, pMax As %Integer, pLen As %Integer = 10) As %String [ SqlName = List, SqlProc ]
{
        Set tList = ""
        Set tCnt = 1
        For i=1:1:pMax
        {
            Set tNum = $R(pMax + 1)
            If tNum >= pMin
            {
                Set $LIST(tList,tCnt) = tNum
                Set tCnt = tCnt + 1
            }
            Quit:(tCnt > pLen)
        }
    Return tList
}
}

The method can be called as a custom SQL function in a subquery:

SELECT * FROM MyTable WHERE %Id %INLIST (SELECT Sample.List(MIN(%Id),MAX(%Id),10) FROM MyTable)

The query above will select up to 10 rows randomly from table MyTable.

Caveats: This could take a long time to run if your range from minimum to maximum %Id is large. And it's not guaranteed to return the number of rows specified as the 3rd argument (there may be deleted records, or insufficient random values generated before the upper limit for %Ids is reached). Finally, it assumes that %Id is numeric.

EDIT: And as pointed out by @Julius Kavay, @Robert Cemper has a better idea (he usually does 😉)

Some time ago I was working on the issue of a fast data generation for an arbitrary data model.

Data model was connected (object properties) and so to generation was going in stages from low-cardinality independent tables (i.e. product) to high-cardinality dependent tables (i.e. orders). Furthermore tables did not have continuous IDs, there were gaps (due to sharding in my case but even if there's no hard delete we might want to exclude soft deleted rows).

Here's the approach I used.

1. For a given class build a local with all ranges of IDs:

/// d ##class().Ranges()
ClassMethod Ranges(class As %Dictionary.Classname, Output ranges)
{
    kill ranges    

    set table = ##class(%CSP.UI.Portal.SQL.Home).Quoter2(##class(%DeepSee.Utils).%GetSQLTableName(class))
    
    set rs = ##class(%SQL.Statement).%ExecDirect(,"SELECT ID FROM " _ table _ " ORDER BY ID ASC")

    do rs.%Next()
    set start = rs.ID
    set end = rs.ID
    while rs.%Next() {
        if rs.ID - 1 = end {
            d $i(end)
        } else {
            set ranges($i(ranges)) = $lb(start, end-start)
            set start = rs.ID
            set end = rs.ID
        }
    }
    
    set ranges($i(ranges)) = $lb(start, end-start)
}

This method accepts class name and returns this structure:

ranges(num) = $lb(startID, length)

2. Call GetRandomInRanges method to get random ID from ranges local:

/// d ##class().GetRandomInRanges()
ClassMethod GetRandomInRanges(ByRef ranges)
{
    set sum = 0
    
    for i=1:1:ranges {
        set sum = sum + $lg(ranges(i),2)
    }
    
    set threshold = $random(sum) + 1
    
    set sum = 0
    for chunk=1:1:ranges {
        set sum = sum + $lg(ranges(chunk),2)
        quit:sum>=threshold
    }    
    
    set val = ranges(chunk)
    if $listvalid(val) {
        if $ll(val)=2 {
            set val = $lg(val, 1) + $random($lg(val, 2))
        } else {
            set:$listvalid(val) val = $lg(val, ($random($ll(val))+1))
        }
    }
    
    quit val
}

This method guarantees that returned IDs would conform to the normal distribution.

Ranges should be called once and GetRandomInRanges  can be called as often as needed.

The simplest solution was already answered by Robert Cemper in https://community.intersystems.com/post/how-select-random-row-table. I just want to show a more "universal variant" of that solution.

First, create an SQL stored procedure

class SP.Utilis Extends %RegisteredObject
{
ClassMethod Random(number As %Integer, dummy As %String) As %Integer [SqlProc]
{
   quit $random(number) // we do not use dummy but we need it!!
}
}

then make your query as follows:

select top 10 * from whatever.table where SP.Utils_Random(100,ID)<50

This has following advantages:

1) works with all classes, i.e. the ID column has not to be an integer (greater 0), can be a compound column too (like part1||part2, etc)

2) by adjusting the comparison:

Random(1000,ID) < 50   // gives you more "greater" distances then

Random(1000,ID) <500  // between the returned rows

For testing of edge conditions you can use

Random(1000,ID)<0    // no rows will be returned or

Random(1000,ID)<1000 // all rows will be returnd

With the right side of the comparison you can fine tune the distances between the returned rows.

For the dummy argument in the above function you can use an arbitrary column name, the simplest is to use ID because the ID column always exists, it's purpose is to force the SQL-Compiler to call this function for each row (thinking, the result of the Random() function is row-dependet). A comparsion like Random(100)<50 is executed just once. Roberts solution works too because he uses Random(100)<ID but this works only for tables where ID is a Integer (>0). You can verify this by just issuing a simple query

select top 10 * fom your.table where SP.Utils_Random(100)<50

You will see (by repeatedly executing the above query) either 10 (subsequente) rows or nothing