Creating classes/tables with more than 999 properties in InterSystems IRIS

Primary tabs

InterSystems IRIS currently limits classes to 999 properties.

But what to do if you need to store more data per object?

This article would answer this question (with the additional cameo of Community Python Gateway and how you can transfer wide datasets into Python).

The answer is very simple actually - InterSystems IRIS currently limits classes to 999 properties, but not to 999 primitives. The property in InterSystems IRIS can be an object with 999 properties and so on - the limit can be easily disregarded.

Approach 1.

Store 100 properties per serial property. First create a stored class that stores a hundred properties.

Class Test.Serial Extends %SerialObject
{

Property col0;
...
Property col99;
}

And in your main class add as much properties as you need:

Class Test.Record Extends %Persistent
{
Property col00 As Test.Serial;

Property col01 As Test.Serial;
...

Property col63 As Test.Serial;
}

This immediately raises your limit to 99900 properties.

This approach offers uniform access for all properties via SQL and object layers (we always know property reference by it's number).

Approach 2.

One $lb property.

Class Test.Record Extends %Persistent
{
Property col As %List;
}

This approach is simpler but does not provide explicit column names.

Use SQL $LIST* Functions to access list elements.

Approach 3.

Use Collection (List Of/Array Of) property.

Class Test.Record Extends %Persistent
{
Property col As List Of %Integer;
}

This approach also does not provide explicit column names for individual values (but do you really need it?). Use property parameters to project the property as SQL column/table.

Docs for collection properties.

Approach 4.

Do not create properties at all and expose them via SQL Stored procedure/%DispatchGetProperty.

Class Test.Record Extends %Persistent
{

Parameter GLVN = {..GLVN("Test.Record")};

/// SELECT Test_Record.col(ID, 123) 
/// FROM Test.Record
///
/// w ##class(Test.Record).col(1, )
ClassMethod col(id, num) As %Decimal [ SqlProc ]
{
    #define GLVN(%class) ##Expression(##class(Test.Record).GLVN(%class))
    quit $lg($$$GLVN("Test.Record")(id), num + 1)
}

/// Refer to properties as: obj.col123 
Method %DispatchGetProperty(Property As %String) [ CodeMode = expression ]
{
..col(..%Id(), $e(Property, 4, *))
}


/// Get data global
/// w ##class(Test.Record).GLVN("Test.Record")
ClassMethod GLVN(class As %Dictionary.CacheClassname = {$classname()}) As %String
{
    return:'$$$comClassDefined(class) ""
    set strategy = $$$comClassKeyGet(class, $$$cCLASSstoragestrategy)
    return $$$defMemberKeyGet(class, $$$cCLASSstorage, strategy, $$$cSDEFdatalocation)
}

The trick here is to store everything in the main $lb and use unallocated schema storage spaces to store your data. Here's an article on global storage.

With this approach, you can also easily transfer the data into Python environment with Community Python Gateway via the ExecuteGlobal method.

This is also the fastest way to import CSV files due to the similarity of the structures.

 

Conclusion

999 property limit can be easily extended in InterSystems IRIS.

 

Do you know other approaches to storing wide datasets? If so, please share them!

 

Replies

While I always advertise CSV2CLASS methods for generic solutions, wide datasets often possess an (un)fortunate characteristic of also being long.

In that case custom object-less parser works better.

Here's how it can be implemented.

1. Align storage schema with CSV structure

2. Modify this snippet for your class/CSV file:

Parameter GLVN = {..GLVN("Test.Record")};

Parameter SEPARATOR = ";";

ClassMethod Import(file = "source.csv", killExtent As %Boolean = {$$$YES})
{
    set stream = ##class(%Stream.FileCharacter).%New()
    do stream.LinkToFile(file)
    
    kill:killExtent @..#GLVN
    
    set i=0
    set start = $zh
    while 'stream.AtEnd {
        set i = i + 1
        set line = stream.ReadLine($$$MaxStringLength)
        
        set @..#GLVN($i(@..#GLVN)) = ..ProcessLine(line)
        
        write:'(i#100000) "Processed:", i, !
    }
    set end = $zh
    
    write "Done",!
    write "Time: ", end - start, !
}

ClassMethod ProcessLine(line As %String) As %List
{
    set list = $lfs(line, ..#SEPARATOR)
    set list2 = ""
    set ptr=0
    
    // NULLs and numbers handling.
    // Add generic handlers here.
    // For example translate "N/A" value into $lb() if that's how source data rolls
    while $listnext(list, ptr, value) {
        set list2 = list2 _ $select($g(value)="":$lb(), $ISVALIDNUM(value):$lb(+value), 1:$lb(value))
    }

    // Add specific handlers here
    // For example convert date into horolog in column4

    // Add %%CLASSNAME
    set list2 = $lb() _ list2
    
    quit list2
}

I have no concrete ideas on how to automate this.

This is a more case-by-case basis.

After more than 42 years of M-programming and in total of 48 years of programming experience I would say, if you need a class with about 1000 or more properties than something is wrong with your (database) design. There is nothing more to say. Period.

Wide datasets are fairly typical for:

  • Industrial data
    • IoT
    • Sensors data
    • Mining and processing data
    • Spectrometry data
  • Analytical data
    • Most datasets after one-hot-encoding applied
    • NLP datasets
    • Any dataset where we need to raise dimensionality
    • Media featuresets
  • Social Network/modelling schemas

I'm fairly sure there's more areas but I have not encountered them myself.

Recently I have delivered a PoC with classes more than 6400 columns wide and that's where I got my inspiration for this article (I chose approach 4).

@Renato Banzai also wrote an excellent article on his project with more than 999 properties.

Overall I'd like to say that a class with more than 999 properties is a correct design in many cases.

You probably right for a majority of tasks. But how do you manage with AI tasks which NEED to manage thousands of features of entities? And features are properties/fields from data storage perspective.

Anyway, I'm really curious how do you deal with AI/ML tasks in IRIS or Caché.

That's good and well for sparse datasets (where say you have a record with 10 000 possible attributes but on average only 50 are filled).

EAV does not help in dense cases where every record actually has 10 000 attributes.

My EAV implementation is the same as your Approach 3, so it will work fine even with fully filled 4.000.000 attributes.
Since the string has a limit of 3,641,144, approaches with serial and %List are dropped.

All other things being equal, everything depends on the specific technical task: speed, support for Objects/SQL, the ability to name each attribute, the number of attributes, and so on.