
Article
· Jun 4, 2020 10m read

A Hidden Object Design Consideration “Journal Killer” [... And an even more secret Business Process Parameter]

 

In this article I'd like to share with you a phenomenon that is best avoided – something you should be aware of when designing your data model (or building your Business Processes) in Caché or in Ensemble (or older HealthShare Health Connect Ensemble-based versions).

Background

As a reminder, every (well… almost every) SET and KILL of a global you perform against the database (directly, or via some interface such as an object %Save() or an SQL statement) gets written to the database Journal.

[See this article for more background on journaling.]

In this article I’d like to highlight a consideration that could cause you (and might even already be causing you) serious journal consumption.

I’ll give you a little background on how I got onto this topic (and this should also illustrate the results of this phenomenon).

A customer was complaining of extreme journal growth.

We examined the journal usage; here’s a visualization of “regular day” journal consumption at this customer –

Each “dot” signifies a 1GB-sized journal file switch.

And this is what the “bad days” looked like –

You can see the “lines” of “connected dots” that signify consecutive heavy journal consumption.

Looking closer (see this article for tips on how to analyze the Journal), we learned the reason for this was the phenomenon described below.
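(As a quick pointer, the interactive journal display utility – which produces listings similar to the ones shown further below – can be run from the %SYS namespace:)

; from a terminal: switch to the %SYS namespace and start the journal browser
ZNSPACE "%SYS"
DO ^JRNDUMP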

Example

The scenario in which this happens is when you have an object with an Array Collection (or some other “Collection” that “behaves” like an Array storage-wise).

Each time you save the object, regardless of how many items in the collection you changed (or even whether you changed any at all), the code behind the scenes KILLs the whole Collection global subscript node and re-SETs it.

For example, assume you have the following class –

Class Demo.Data.Test Extends %Persistent
{
/// test type
Property TypeCode As %String;
/// related runs
Property RelatedRuns As array Of %String;
Storage Default
{
<Data name="RelatedRuns">
<Attribute>RelatedRuns</Attribute>
<Structure>subnode</Structure>
<Subscript>"RelatedRuns"</Subscript>
</Data>
<Data name="TestDefaultData">
<Value name="1">
<Value>%%CLASSNAME</Value>
</Value>
<Value name="2">
<Value>TypeCode</Value>
</Value>
</Data>
<DataLocation>^Demo.Data.TestD</DataLocation>
<DefaultData>TestDefaultData</DefaultData>
<IdLocation>^Demo.Data.TestD</IdLocation>
<IndexLocation>^Demo.Data.TestI</IndexLocation>
<StreamLocation>^Demo.Data.TestS</StreamLocation>
<Type>%Library.CacheStorage</Type>
}
}

Then, in the %SaveData label of the generated routine, you will find the following code for updating the Collection when saving an existing object instance:

kill ^Demo.Data.TestD(id,"RelatedRuns") 
set zzc40v1=$Order(i%RelatedRuns(""),1,val) 
While zzc40v1'="" { 
    Set ^Demo.Data.TestD(id,"RelatedRuns",zzc40v1)=val,zzc40v1 = $Order(i%RelatedRuns(zzc40v1),1,val) 
}

You can see we KILL the whole “previous” node, and then loop over the “current” items and SET them.

Journal-wise this will lead to an entry for each item in the Collection – twice, in fact – once for the KILL and once for the SET (assuming the item is still there).

So assume a scenario where we have a loop adding items to a collection, one item per iteration, with a save in each iteration.

The first iteration would have 1 Journal entry (for the Collection, we’ll focus just on this aspect).

The 2nd iteration will have 3 entries – 1 for killing the 1st item, then another 2 for setting the 1st (again) and the 2nd.

The 3rd iteration will already have 5 entries – 2 for killing the 2 existing items, and 3 for (re)setting the previous 2 plus adding the 3rd new one.

The 4th iteration will now have 7 entries – killing 3, setting 4.

You can see that, in general, iteration n has 2n-1 journal entries.

And in total after 4 iterations we had: 1 + 3 + 5 + 7 = 16 entries.

And in general, after n iterations we’d have n² Journal entries.
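(For completeness, that total is just the sum of the first n odd numbers:)

1 + 3 + 5 + … + (2n − 1) = n²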

Here’s what the Journal looks like after running such a loop as per above –

USER>set test=##class(Demo.Data.Test).%New() for i=1:1:4 { do test.RelatedRuns.SetAt(i,i) do test.%Save() }

 

  Address   Proc ID Op Directory        Global & Value
===============================================================================
    364040     22092 BT
    364056     22092 S  c:\intersystems+ Demo.Data.TestD = 1
    364112     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",1) = 1
    364184     22092 ST c:\intersystems+ Demo.Data.TestD(1) = $lb("","")
    364240     22092 CT
    364256     22092 BT                                                             
    364272     22092 kT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",1)
    364344     22092 KT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns")           
    364408     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",1) = 1
    364480     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",2) = 2
    364552     22092 ST c:\intersystems+ Demo.Data.TestD(1) = $lb("","")
    364612     22092 CT
    364628     22092 BT
    364644     22092 kT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",1)
    364716     22092 kT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",2) 
    364788     22092 KT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns")           
    364852     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",1) = 1     
    364924     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",2) = 2
    364996     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",3) = 3
    365068     22092 ST c:\intersystems+ Demo.Data.TestD(1) = $lb("","")
    365128     22092 CT
    365144     22092 BT
    365160     22092 kT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",1)         
    365232     22092 kT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",2)
    365304     22092 kT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",3)         
    365376     22092 KT c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns")
    365440     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",1) = 1
    365512     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",2) = 2
    365584     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",3) = 3
    365656     22092 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",4) = 4
    365728     22092 ST c:\intersystems+ Demo.Data.TestD(1) = $lb("","")
    365788     22092 CT                                                       

 

Now imagine an array with 1000s or 10,000s of items… and say you have 100s or more of such objects… your Journal would get very big very fast.
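To put a rough number on it (using the ~70–80 bytes per entry suggested by the address deltas in the listing above, purely as a ballpark): a single save of a 10,000-item collection re-writes about 20,000 entries, i.e. roughly 1.5 MB of journal; and if that collection was built up one item per save, the n² behavior means on the order of 10⁸ entries – several GB of journal – for just one such object.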

Note I’m focusing here on the Journal space aspect, but of course the write performance impact (to the Journal file and possibly to the CACHE.DAT file) is also an issue.

The Consideration

So, the general rule of thumb for using Collections (and this has been mentioned in different contexts over the years, for various considerations, not just this Journal one) is to use them only for small-sized Collections. That way, even if this Journal-related overhead exists, it should not be significant.

Alternatives could include (but are not limited to) using references with Foreign Keys to connection/relation tables – see the sketch below.
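As an illustration only (the class and property names below are hypothetical, not part of the example above), the array could be replaced by a child class in which each related run is its own persistent object pointing back at its parent:

Class Demo.Data.RelatedRun Extends %Persistent
{
/// reference back to the owning test
Property Test As Demo.Data.Test;
/// the value that was previously stored as an array item
Property RunValue As %String;
/// index on the reference keeps "all runs of a given test" a cheap lookup
Index TestIdx On Test;
}

With this shape, saving a new run SETs only its own nodes in its own data global, so the journal cost of an insert stays constant instead of growing with the size of the collection.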

Note for InterSystems IRIS

It is important to note that in InterSystems IRIS (as of version 2019.1; including of course IRIS for Health and the related HealthShare Health Connect based on these versions) this behavior has changed (for internal reference, the related change is identified as MAK4939): updating an object with a Collection will no longer cause all of the sub-node Collection items to be deleted and set all over again.

Referring to the example above, this is the generated routine code pertaining to the same area:

kill nodes merge nodes=i%RelatedRuns
set zzc40v1="" 
for { 
    set zzc40v1 = $Order(^Demo.Data.TestD(id,"RelatedRuns",zzc40v1),1,data) 
    Quit:zzc40v1=""  
    if $data(nodes(zzc40v1),val) { 
        if data=val kill nodes(zzc40v1) 
    } else { 
        kill ^Demo.Data.TestD(id,"RelatedRuns",zzc40v1) 
    } 
}
merge ^Demo.Data.TestD(id,"RelatedRuns")=nodes

You can see that here we use a temporary local array and merge it in, KILLing only those Collection items that actually need to be removed.

The same loop above would generate in InterSystems IRIS the following Journal entries –

 

  Address   Proc ID Op Directory        Global & Value
===============================================================================
   1471332     28064 BT                                                            
   1471348     28064 S  c:\intersystems+ Demo.Data.TestD = 1                       
   1471400     28064 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",1) = 1    
   1471468     28064 ST c:\intersystems+ Demo.Data.TestD(1) = $lb("","")           
   1471524     28064 CT                                                            
   1471540     28064 BT                                                            
   1471556     28064 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns") = ""     
   1471620     28064 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",2) = 2    
   1471692     28064 ST c:\intersystems+ Demo.Data.TestD(1) = $lb("","")           
   1471752     28064 CT                                                            
   1471768     28064 BT                                                            
   1471784     28064 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns") = ""     
   1471848     28064 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",3) = 3    
   1471920     28064 ST c:\intersystems+ Demo.Data.TestD(1) = $lb("","")           
   1471980     28064 CT                                                            
   1471996     28064 BT                                                            
   1472012     28064 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns") = ""
   1472076     28064 ST c:\intersystems+ Demo.Data.TestD(1,"RelatedRuns",4) = 4    
   1472148     28064 ST c:\intersystems+ Demo.Data.TestD(1) = $lb("","")           
   1472208     28064 CT

 

Here you can see only the minimal number of entries – essentially one per item actually updated – instead of re-writing the entire Collection on every save.

Real-life Scenarios

At this stage you might be asking yourself how likely you are to run into this scenario – when would you have such a loop (or something similar)?

 

I’ll give you two examples, both real-life ones.

The first relates to the customer use-case I mentioned above (with the “dots” visualization of the Journal switches).

In their solution they had an entity that was created and updated according to incoming messages (specifically HL7 and DICOM messages). In order to keep track of (and display in a UI) the messages pertaining to each entity, each entity held an array of the Message IDs related to it.

Typically, this number of “Related Messages” would be less than 10, and no more than 20 or 30. But occasionally one of the sending systems would get into a “sending spree” and continuously send many messages for several entities, causing the Collection to grow into the 10,000s.

Here's a small snippet of what the Journal looked like:

Business Process Use-case

The second real-life use-case is a much more common one, one that probably anyone who uses Ensemble Business Processes runs into (knowingly or unknowingly), and it brings us to the subtitle of this article – “An even more secret Business Process Parameter”.

It might be a lesser-known fact that a Business Process is a Persistent entity. I assume many are aware of this (it has an ID, for example, which is sometimes referenced, and there is a page in the Management Portal – View -> “Business Process Log/Instances” – that shows the records).

But I assume far fewer are aware that this entity holds two Collections (with sub-node storage):

%MessagesReceived & %MessagesSent

As their names hint, these Collections hold all the messages a Business Process has sent or received during its life span. They are intended to make it easier for developers to search, in a performant manner, what has already been sent or received for a given BP, without having to traverse all BPs.

But, as per the above, they are prone to the issue described earlier.

Imagine a BP that makes a call to a BO to select rows from a database; then inside the BP there is a loop over the rows returned, and it sends each row’s data as a request to another BO to perform some action on it. Quite a common scenario for various use-cases. Each iteration will add message IDs to the Collections mentioned above, and these will grow with the number of rows processed. In some cases, of course, this could be hundreds or more.

Here's a snippet from the Journal of a customer system where this happened:

Now, unlike the general case mentioned above, where you control your own data model and can address this issue by keeping this consideration in mind – what can you do about the Business Process Collections?

 

Here comes the Class Parameter, from the Ens.BusinessProcess class reference:

parameter SKIPMESSAGEHISTORY = 0;

If this parameter is TRUE, then arrays %MessagesSent and %MessagesReceived will not be populated.

 

So if you want to prevent these Collections from being populated (and from accumulating and bloating your Journals), set this parameter to 1.

Note that if you use the graphical BPL Designer for your Business Process, you will need to open the class definition (in your favorite IDE) and add this parameter there, as sketched below.
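For example (a sketch only – the class name is hypothetical, and the XData BPL block generated by the Designer is omitted here):

Class Demo.BP.RowFanout Extends Ens.BusinessProcessBPL
{
/// don't populate %MessagesSent / %MessagesReceived for this Business Process
Parameter SKIPMESSAGEHISTORY = 1;
}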

And a reminder – in InterSystems IRIS (and IRIS for Health, and IRIS-based HealthShare Health Connect) this is less of a concern, as per the above; still, if you don’t need these Collections, you can use this parameter there as well.

Now you can't say you were not aware... 

 

 

 

Article
· Jun 1, 2020 4m read

IRIS Python Suite - A Hitchhiker's guide to the ^Global - 1

Making a Chart using InterSystems IRIS + Python

How to use the IRIS Native API in Python to access globals and plot some charts.

Why Python?

Widely adopted and used around the world, Python has a great community and a lot of accelerators and libraries for deploying any kind of application.
If you are curious, see https://www.python.org/about/apps/.

Article
· May 28, 2020 7m read

Entity-attribute-value model in relational databases. Should globals be emulated on tables? Part 2.

A More Industrial-Looking Global Storage Scheme

In the first article in this series, we looked at the entity–attribute–value (EAV) model in relational databases, and took a look at the pros and cons of storing those entities, attributes and values in tables. We learned that, despite the benefits of this approach in terms of flexibility, there are some real disadvantages, in particular a basic mismatch between the logical structure of the data and its physical storage, which causes various difficulties.

To solve these issues, we decided to see whether using globals — which are optimized for storing hierarchical information — for tasks the EAV approach typically handles would work.

In Part 1, we created a catalog for an online store, first using tables, then using just one global. Now, let's try to implement the same structure for several globals.

In the first global, ^catalog, we’ll store the directory structure. In the second global, ^good, we’ll store goods. And in the global ^index, we’ll store indexes. Since our properties are tied to a hierarchical catalog, we won’t create a separate global for them.

With this approach, for each entity (except for properties), we have a separate global, which is good from the point of view of logic. Here’s the global catalog structure:

 


Set ^сatalog(root_id, "Properties", "capacity", "name") = "Capacity, GB"
Set ^сatalog(root_id, "Properties", "capacity", "sort") = 1

Set ^сatalog(root_id, sub1_id, "Properties", "endurance", "name") = "Endurance, TBW"
Set ^сatalog(root_id, sub1_id, "Properties", "endurance", "sort") = 2

Set ^сatalog(root_id, sub1_id, "goods", id_good1) = 1
Set ^сatalog(root_id, sub1_id, "goods", id_good2) = 1

Set ^сatalog(root_id, sub2_id, "Properties", "avg_seek_time", "name") = "Rotate speed, ms"
Set ^сatalog(root_id, sub2_id, "Properties", "avg_seek_time", "sort") = 3

Set ^сatalog(root_id, sub2_id, "goods", id_good3) = 1
Set ^сatalog(root_id, sub2_id, "goods", id_good4) = 1

 

A global with goods will look something like this:

Set ^good(id_good, property1) = value1
Set ^good(id_good, property2) = value2
Set ^good(id_good, property3) = value3
Set ^good(id_good, "catalog") = catalog_id

 

Of course, we need indexes so that for any section of the catalog with goods, we can sort by the properties we need. The index global will have a structure something like this:

Set ^index(id_catalog, property1, id_good) = 1
; To quickly get the full path to concrete sub-catalog
Set ^index("path", id_catalog) = "^catalog(root_id, sub1_id)"

 

Thus, in any section of the catalog, we can get a sorted list. An index global is optional. It’s useful only if the number of products in this section of the catalog is large.

ObjectScript Code for Working with Demo Data

Now we’ll use ObjectScript to work with our data. To start, let’s get the properties of a specific good. We have the ID of a particular good and we need to display its properties in the order given by the sort value. Here’s the code for that:

get_sorted_properties(path, boolTable)
{
  ; remember the properties of this catalog node and of its parent nodes
  ; in the temporary global ^tmp, keyed by their "sort" value
  Kill ^tmp
  While $QLENGTH(path) > 0 {
    if $DATA(@path@("Properties")) {
      set ln = ""
      for {
        Set ln = $Order(@path@("Properties", ln))
        Quit:ln=""
        if boolTable {
          ; for table output, keep only the properties flagged for table view
          if $GET(@path@("Properties", ln, "table_view")) = 1 {
            Set ^tmp(@path@("Properties", ln, "sort"), ln) = @path@("Properties", ln, "name")
          }
        } else {
          Set ^tmp(@path@("Properties", ln, "sort"), ln) = @path@("Properties", ln, "name")
        }
      }
    }
    ; move one level up the catalog hierarchy
    Set path = $NAME(@path, $QLENGTH(path) - 1)
  }
}
 
print_sorted_properties_of_good(id_good)
{
  Set id_catalog = ^good(id_good, "catalog")
  Set path = ^index("path", id_catalog)

  Do get_sorted_properties(path, 0)

  set ln = ""
  for {
    Set ln = $Order(^tmp(ln))
    Quit:ln=""
    Set fn = ""
    for {
      Set fn = $Order(^tmp(ln, fn))
      Quit:fn=""
      Write ^tmp(ln, fn), " ", ^good(id_good, fn), !
    }
  }
}

 

Next, we want to get products from the catalog section in the form of a table, based on id_catalog:

 


print_goods_table_of_catalog(id_catalog)
{
  Set path = ^index("path", id_catalog)
  Do get_sorted_properties(path, 1)

  set id = ""
  for {
    Set id = $Order(@path@("goods", id))
    Quit:id=""

    Write id, " ", ^good(id, "price"), " "

    set ln = ""
    for {
      Set ln = $Order(^tmp(ln))
      Quit:ln=""
      Set fn = ""
      for {
        Set fn = $Order(^tmp(ln, fn))
        Quit:fn=""
        Write ^tmp(ln, fn), " ", ^good(id, fn), " "
      }
    }
    ; one table row per good
    Write !
  }
}

 

Readability: EAV SQL Versus Globals

Now let’s compare the use of EAV and SQL against using globals. With regard to code clarity, it’s clear that this is a subjective parameter. But let's look, for example, at creating a new product.

We’ll start with the EAV approach, using SQL. First, we need to get a list of object properties. This is a separate task and quite time-consuming. Assume we already know the IDs of these three properties: capacity, weight, and endurance.


START TRANSACTION
INSERT INTO good (name, price, item_count, catalog_id) VALUES ('F320 3.2TB AIC SSD', 700, 10, 15);

SET @last_id = LAST_INSERT_ID();

INSERT INTO NumberValues VALUES (@last_id, @id_capacity, 3200);
INSERT INTO NumberValues VALUES (@last_id, @id_weight, 0.4);
INSERT INTO NumberValues VALUES (@last_id, @id_endurance, 29000);
COMMIT

 

In this example, we have only three properties, and therefore the example doesn’t look so scary. In the general case, we’d still have a few inserts into the text table inside the transaction:

 

INSERT INTO TextValues VALUES (@last_id, @id_text_prop1, 'Text value of property 1');
INSERT INTO TextValues VALUES (@last_id, @id_text_prop2, 'Text value of property 2');
...
INSERT INTO TextValues VALUES (@last_id, @id_text_propN, 'Text value of property N');

 

Of course, we could simplify the SQL version a little if we used text notation instead of ID properties, such as “capacity” instead of a number. But in the SQL world, this isn’t acceptable. It’s customary instead to use a numeric ID to enumerate entity instances. This results in faster indexes (you need to index fewer bytes), it’s easier to track uniqueness, and it’s easier to automatically create a new ID. In this case, the insert fragment would look like this:

 

INSERT INTO NumberValues VALUES (@last_id, 'capacity', 3200);
INSERT INTO NumberValues VALUES (@last_id, 'weight', 0.4);
INSERT INTO NumberValues VALUES (@last_id, 'endurance', 29000);

 

Here’s the same example using globals:

 

TSTART
Set ^good(id, "name") = "F320 3.2TB AIC SSD"
Set ^("price") = 700, ^("item_count") = 10, ^("reserved_count") = 0, ^("catalog") = id_catalog
Set ^("capacity") = 3200, ^("weight") = 0.4, ^("endurance") = 29000
TCOMMIT

 

Now let’s delete a good using the EAV approach:

 

START TRANSACTION
DELETE FROM good WHERE id = @good_id;
DELETE FROM NumberValues WHERE good_id = @good_id;
DELETE FROM TextValues WHERE good_id = @good_id;
COMMIT

 

And then do the same with globals:

 

Kill ^good(id_good)

We can also compare the two approaches in terms of code length. As you can see from the previous examples, when you use globals, the code is shorter. This is good. The shorter the code, the fewer the errors and the easier it is to understand and maintain.

Generally, shorter code is also faster. And, in this case, that’s certainly true, since globals are a lower-level data structure than relational tables.

Scaling Data with EAV and Globals

Next, let’s look at horizontal scaling. With the EAV approach we have to at least distribute the three largest tables to the servers: Good, NumberValues, and TextValues. Tables with entities and attributes can simply be completely copied to all servers, since they have little information.

On each server, with horizontal scaling, different products would be stored in the Good, NumberValues, and TextValues tables. We’d have to allocate certain ID blocks for products on each server so that there’s no duplication of IDs for different products.

For horizontal scaling with globals, we’d have to configure ID ranges in the global and assign a global range to each server.

The complexity is approximately the same for EAV and for globals, except that for the EAV approach we’d have to configure ID ranges for three tables. With globals, we’d configure IDs for just one global. That is, it’s easier to organize horizontal scaling for globals.
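As a rough sketch of that last point (the label and the fixed block size are assumptions for illustration, not something the approach prescribes), mapping a good’s ID to its owning server could be as simple as:

get_server_for_good(id_good, range_size)
{
  ; ids 1..range_size live on server 1, range_size+1..2*range_size on server 2, and so on
  Quit ((id_good - 1) \ range_size) + 1
}

With range_size = 1000000, for example, id 1500000 maps to server 2.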

Data Loss with EAV and Globals

Finally, let’s consider the risk of data loss due to corrupted database files. Where is it easier to save all the data: in five tables or in three globals (including an index global)?

I think it's easier in three globals. With the EAV approach, the data for different goods is mixed in tables, while for globals the information is stored more holistically. The underlying branches are stored and sorted sequentially. Therefore, corruption of part of the global is less likely to lead to damage than corruption of any of the tables in the EAV approach, where data is stored like intertwined pasta.

Another headache in data recovery is the display of information. With the EAV approach, information is divided among several tables and special scripts are required to assemble it into a single whole. In the case of globals, you can simply use the ZWRITE command to display all the values and the underlying branches of the node.
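For example (the id is just a placeholder):

; one command prints the node's value and all the branches underneath it
ZWRITE ^good(42)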

InterSystems IRIS Globals: A Better Approach?

The EAV approach has emerged as a trick for storing hierarchical data. Tables weren’t originally designed to store nested data. The de facto EAV approach is the emulation of globals in tables. Given that tables are a higher-level and slower data storage structure than globals, the EAV approach fails in comparison with globals.

In my opinion, for hierarchical data structures, globals are more convenient and more comprehensible in terms of programming, and they’re faster.

If you’ve been planning an EAV approach for your project, I suggest you consider using InterSystems IRIS globals to store hierarchical data.

Question
· May 25, 2020

The specified cast from a materialized 'System.Int64' type to the 'System.Int32' type is not valid – database table Id field is created with [xDBC Type = BIGINT]

Hi !

I am getting the error below in my .NET MVC project. I am using IRIS Entity Framework; the database table field and the model have the same datatype, int.

The specified cast from a materialized 'System.Int64' type to the 'System.Int32' type is not valid db Table creation Id field is created with [xDBC Type = BIGINT]

 

Please kindly advise me.

 

Thank you
