Creating classes/tables with more than 999 properties in InterSystems IRIS

Article

Eduard Lebedyuk · Aug 3, 2020 3m read

#Globals #Object Data Model #Python #Relational Tables #SQL #Tips & Tricks #InterSystems IRIS

InterSystems IRIS currently limits classes to 999 properties.

But what to do if you need to store more data per object?

This article would answer this question (with the additional cameo of Community Python Gateway and how you can transfer wide datasets into Python).

The answer is very simple actually - InterSystems IRIS currently limits classes to 999 properties, but not to 999 primitives. The property in InterSystems IRIS can be an object with 999 properties and so on - the limit can be easily disregarded.

Approach 1.

Store 100 properties per serial property. First create a stored class that stores a hundred properties.

Class Test.Serial Extends %SerialObject
{

Property col0;
...
Property col99;
}

And in your main class add as much properties as you need:

Class Test.Record Extends %Persistent
{
Property col00 As Test.Serial;

Property col01 As Test.Serial;
...

Property col63 As Test.Serial;
}

This immediately raises your limit to 99900 properties.

This approach offers uniform access for all properties via SQL and object layers (we always know property reference by it's number).

Approach 2.

One $lb property.

Class Test.Record Extends %Persistent
{
Property col As %List;
}

This approach is simpler but does not provide explicit column names.

Use SQL $LIST* Functions to access list elements.

Approach 3.

Use Collection (List Of/Array Of) property.

Class Test.Record Extends %Persistent
{
Property col As List Of %Integer;
}

This approach also does not provide explicit column names for individual values (but do you really need it?). Use property parameters to project the property as SQL column/table.

Docs for collection properties.

Approach 4.

Do not create properties at all and expose them via SQL Stored procedure/%DispatchGetProperty.

Class Test.Record Extends %Persistent
{

Parameter GLVN = {..GLVN("Test.Record")};

/// SELECT Test_Record.col(ID, 123) 
/// FROM Test.Record
///
/// w ##class(Test.Record).col(1, )
ClassMethod col(id, num) As %Decimal [ SqlProc ]
{
    #define GLVN(%class) ##Expression(##class(Test.Record).GLVN(%class))
    quit $lg($$$GLVN("Test.Record")(id), num + 1)
}

/// Refer to properties as: obj.col123 
Method %DispatchGetProperty(Property As %String) [ CodeMode = expression ]
{
..col(..%Id(), $e(Property, 4, *))
}


/// Get data global
/// w ##class(Test.Record).GLVN("Test.Record")
ClassMethod GLVN(class As %Dictionary.CacheClassname = {$classname()}) As %String
{
    return:'$$$comClassDefined(class) ""
    set strategy = $$$comClassKeyGet(class, $$$cCLASSstoragestrategy)
    return $$$defMemberKeyGet(class, $$$cCLASSstorage, strategy, $$$cSDEFdatalocation)
}

The trick here is to store everything in the main $lb and use unallocated schema storage spaces to store your data. Here's an article on global storage.

With this approach, you can also easily transfer the data into Python environment with Community Python Gateway via the ExecuteGlobal method.

This is also the fastest way to import CSV files due to the similarity of the structures.

Conclusion

999 property limit can be easily extended in InterSystems IRIS.

Do you know other approaches to storing wide datasets? If so, please share them!

Parameter GLVN = {..GLVN("Test.Record")}; Parameter SEPARATOR = ";"; ClassMethod Import(file = "source.csv", killExtent As %Boolean = {$$$YES}) { set stream = ##class(%Stream.FileCharacter).%New() do stream.LinkToFile(file) kill:killExtent @..#GLVN set i=0 set start = $zh while 'stream.AtEnd { set i = i + 1 set line = stream.ReadLine($$$MaxStringLength) set @..#GLVN($i(@..#GLVN)) = ..ProcessLine(line) write:'(i#100000) "Processed:", i, ! } set end = $zh write "Done",! write "Time: ", end - start, ! } ClassMethod ProcessLine(line As %String) As %List { set list = $lfs(line, ..#SEPARATOR) set list2 = "" set ptr=0 // NULLs and numbers handling. // Add generic handlers here. // For example translate "N/A" value into $lb() if that's how source data rolls while $listnext(list, ptr, value) { set list2 = list2 _ $select($g(value)="":$lb(), $ISVALIDNUM(value):$lb(+value), 1:$lb(value)) } // Add specific handlers here // For example convert date into horolog in column4 // Add %%CLASSNAME set list2 = $lb() _ list2 quit list2 }

Sylvain Guilbaud · Jul 31, 2023

The approach 1 doesn't "raises your limit to 99900 properties" but rather to 6600 properties.

You can test it through large.utils.cls

w ##class(large.utils).init(66,100)
deleting large.serial100
creating large.serial100
compiling large.serial100

Compilation started on 07/31/2023 14:48:16
Compiling class large.serial100
Compiling routine large.serial100.1
Compilation finished successfully in 0.218s.

creating large.c66
compiling large.c66

Compilation started on 07/31/2023 14:48:16
Compiling class large.c66
Compiling table large.c66
Compiling routine large.c66.1
Compilation finished successfully in 8.356s.

1
w ##class(large.utils).init(67,100)
deleting large.serial100
creating large.serial100
compiling large.serial100

Compilation started on 07/31/2023 14:48:27
Compiling class large.serial100
Compiling routine large.serial100.1
Compilation finished successfully in 0.213s.

creating large.c67
compiling large.c67

Compilation started on 07/31/2023 14:48:27
Compiling class large.c67
Compiling table large.c67
Compiling routine large.c67.1
ERROR #5002: ObjectScript error: <MAXSTRING>CompileRtns+286^%occRoutine
ERROR #5002: ObjectScript error: <NOROUTINE>DescribePhase2+9^%occSysDescriptor *large.c67.1
Detected 2 errors during compilation in 6.896s.

0 a(<MAXSTRING>CompileRtns+286^%occRoutineÛCompileRtns+286^%occRoutine IRISAPP³e^ReturnError+2^%occSystem^1!e^CompileList+229^%occCompile^1e^CompileList+23^%apiOBJ^1e^Compile+1^%apiOBJ^1e^Compile+1^%SYSTEM.OBJ.1^1^init+50^large.utils.1^1e^^^0K0 G¦
large.c67/ IRISAPP#!e^CompileRtns+388^%occRoutine^1
IRISAPP>

0 0

Evgeny Shvarov · Aug 3, 2020

The question is how csvgen could be upgraded to consume csv files with 1000+ cols.

Eduard Lebedyuk · Aug 4, 2020

While I always advertise CSV2CLASS methods for generic solutions, wide datasets often possess an (un)fortunate characteristic of also being long.

In that case custom object-less parser works better.

Here's how it can be implemented.

1. Align storage schema with CSV structure

2. Modify this snippet for your class/CSV file:

2 0

Evgeny Shvarov · Aug 4, 2020

Thanks, Ed!
Could you make a PR?

I have no concrete ideas on how to automate this.

This is a more case-by-case basis.

Julius Kavay · Aug 4, 2020

After more than 42 years of M-programming and in total of 48 years of programming experience I would say, if you need a class with about 1000 or more properties than something is wrong with your (database) design. There is nothing more to say. Period.

5 0

Eduard Lebedyuk · Aug 5, 2020

Wide datasets are fairly typical for:

Industrial data
- IoT
- Sensors data
- Mining and processing data
- Spectrometry data
Analytical data
- Most datasets after one-hot-encoding applied
- NLP datasets
- Any dataset where we need to raise dimensionality
- Media featuresets
Social Network/modelling schemas

I'm fairly sure there's more areas but I have not encountered them myself.

Recently I have delivered a PoC with classes more than 6400 columns wide and that's where I got my inspiration for this article (I chose approach 4).

@Renato Banzai also wrote an excellent article on his project with more than 999 properties.

Overall I'd like to say that a class with more than 999 properties is a correct design in many cases.

3 0

Evgeny Shvarov · Aug 5, 2020

You probably right for a majority of tasks. But how do you manage with AI tasks which NEED to manage thousands of features of entities? And features are properties/fields from data storage perspective.

Anyway, I'm really curious how do you deal with AI/ML tasks in IRIS or Caché.

Herman Slagman · Jul 31, 2023

I fully agree with Julius.
Any class that exceeds 100 properties is dead wrong.
We're not talking tables (relational) (and even then it's wrong) but OO design, which doesn't seem to be much in fashion these days.
Redesign your solution

Vitaliy Serdtsev · Aug 5, 2020

Entity–attribute–value model is usually used for this purpose.

I have already written about this at the time: SQL index for array property elements.

That's good and well for sparse datasets (where say you have a record with 10 000 possible attributes but on average only 50 are filled).

EAV does not help in dense cases where every record actually has 10 000 attributes.

My EAV implementation is the same as your Approach 3, so it will work fine even with fully filled 4.000.000 attributes.
Since the string has a limit of 3,641,144, approaches with serial and %List are dropped.

All other things being equal, everything depends on the specific technical task: speed, support for Objects/SQL, the ability to name each attribute, the number of attributes, and so on.

Eduard Lebedyuk · Jul 31, 2023

I guess we are hitting some other limit.

I went with approach 4 myself.