Question
· Aug 5

Cache \ Database 'size' \ RESIZING - UNIVERSE / UNIDATA

Ok, I am attempting to clarify the required use of the RESIZE command when you have a UNIVERSE or a UNIDATA DB attached to Cache. Traditional UNIVERSE / UNIDATA databases 'require' that hashed files be re-sized according to historic use, to prevent file overflow (i.e., performance issues). It is not clear whether that requirement is fully eliminated by attaching one of these databases to Cache. It is clear that failing to allocate 'enough' space on Cache for the database 'as a whole' is a problem. But Cache space allocation for the database 'as a whole' seems to be separate and distinct from the need to properly size hashed files on an attached UNIVERSE or UNIDATA database.

 

Can anyone clarify this requirement, and, specify the 'why' to that answer?

Product version: Caché 2018.1

Robert,

I'd be happy to respond.  First I want to confirm/clarify some things.  When you say that you have a Universe database "attached" to Cache, I am assuming you mean that you have migrated said database to Cache and the data is now stored natively in Cache.

So the short answer is that you never need to do resizing in Cache as was required in Multivalued databases like Universe and Unidata.  The reason for this is that the fundamental storage methodology is quite different.  

In Multivalue (MV) each table/file/Global (to use a Cache term) is a hashed entity.  When created, it starts with a known number of groups that comprise the storage for that table.  Every record stored is "hashed" to determine which group it belongs in.  Another part of defining a hashed table is what MV calls the separation.  This is an indication of the database block size.  As data is inserted into the table, groups will tend to fill up.  When a group fills, a new block is added to it so that new records can be stored.  In a hashed table a given record key will only ever hash to the same existing group.  When looking up a record, hashing to the group is very fast.  However, once in the group the lookup is a linear function that examines every record in every block of the group to find the record being retrieved.  Inserts are always appended to the end of the group.  This is what tends to slow down an MV system.  Pick a poor initial group count or separation size and performance will suffer.  This is the reason for the need to RESIZE.
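To make the group behavior described above concrete, here is a minimal Python sketch of a hashed file. It is a toy model only, not UniVerse/UniData code: the `HashedFile` class, the use of `zlib.crc32` as the hash, and the `average_comparisons` helper are all assumptions made for illustration, and separation (block size) is not modeled.

```python
import zlib


class HashedFile:
    """Toy hashed file: a fixed number of groups, each scanned linearly."""

    def __init__(self, modulo):
        self.modulo = modulo                      # number of groups, fixed at creation
        self.groups = [[] for _ in range(modulo)]

    def _group(self, key):
        # A given key always hashes to the same group.
        return self.groups[zlib.crc32(key.encode()) % self.modulo]

    def write(self, key, record):
        # Assumes new keys only: the record is appended to the end of its group.
        self._group(key).append((key, record))

    def read(self, key):
        # Returns (record, comparisons). The scan inside the group is linear.
        group = self._group(key)
        for comparisons, (k, record) in enumerate(group, start=1):
            if k == key:
                return record, comparisons
        return None, len(group)


def average_comparisons(modulo, n_records):
    """Average number of key comparisons needed to read back every record."""
    f = HashedFile(modulo)
    keys = [str(i) for i in range(n_records)]
    for k in keys:
        f.write(k, "data")
    return sum(f.read(k)[1] for k in keys) / n_records
```

With a single group, `average_comparisons(1, 100)` is 50.5, because every lookup scans the same growing group. A larger modulo spreads the same records over more groups and shortens each scan, which is the effect a RESIZE aims for.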

Cache, at its base table storage level, behaves similarly to a key-value store which is internally implemented as a high-performance binary structure.  No hashing is performed, and key look-ups and inserts are consistently extremely fast.  I would also like to correct one thing you stated.  You don't have to be concerned with the initial size of the Cache database, as you indicated, though getting the size close to what is needed is always recommended.  The database can grow dynamically as records are added.  It will grow up to the maximum size of a Cache database within any OS level limitations.  The only performance concern would be disk level fragmentation.
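As a rough contrast with hashed groups, the sketch below keeps keys in sorted order and finds them by binary search. It is only a stand-in for an ordered key-value store and makes no claim about how Cache implements globals internally; the `OrderedStore` class and its probe counter are hypothetical names introduced for this illustration.

```python
import bisect


class OrderedStore:
    """Toy ordered key-value store: sorted keys, binary-search lookups."""

    def __init__(self):
        self._keys = []    # kept sorted at all times
        self._values = []  # parallel to _keys

    def set(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._values[i] = value          # overwrite an existing key
        else:
            self._keys.insert(i, key)        # insert at the sorted position
            self._values.insert(i, value)

    def get(self, key):
        # Returns (value, probes); probes counts the keys examined.
        lo, hi, probes = 0, len(self._keys), 0
        while lo < hi:
            mid = (lo + hi) // 2
            probes += 1
            if self._keys[mid] == key:
                return self._values[mid], probes
            if self._keys[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        return None, probes

    def keys(self):
        # Keys always come back in key order, whatever the insertion order.
        return list(self._keys)
```

In this model a lookup among 1,000 keys examines at most 10 of them, and the count grows with the logarithm of the number of keys rather than with the length of a group, so there is no group size to tune.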

I hope this helps.  If you have any further questions please reach out. 

Regards,

Rich Taylor

Thank you Rich. To clarify, you are saying that a select on Universe/ Cache is non-linear, correct? I am curious about the key/value/non-hashed/binary details. My experience has been that selects can be very time consuming on Universe/ Cache, as files grow very large. Can you expand on this part? Also, do you know what happens, at a cellular level, when you do a CREATE-FILE from TCL?

Robert, 

Sorry for the slow reply.  I was tied up last week and didn't get a chance to look at the community. 

Not really sure what you are looking for when you say "non-linear".  As far as how records are stored in the database, a multivalued record is a delimited string.  In Cache, when stored in the database, this string is placed in a global node.  The key for the node is the item id of the record stored in the MV file.  So if I create a file called TEST and add a record to it, when viewed with a LIST-ITEM command you might see this.

     1
0001 ATTR1
0002 ATTR2
0003 ATT3
0004 ATTR4.1ýATTR4.2
 

If you were to view the ^TEST global where this data is stored via the System Management Portal you would see:

^TEST    =    $lb(0)
^TEST(1)    =    "ATTR1þATTR2þATT3þATTR4.1ýATTR4.2"
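The mapping from the LIST-ITEM view to the stored string can be sketched in Python as follows. This is an illustration only: it assumes the attribute mark is character 254 (shown as þ) and the value mark is character 253 (shown as ý), as in the listing above, and the `pack_record`, `unpack_record`, and `store` names are hypothetical, with a plain dictionary standing in for the database.

```python
AM = chr(254)  # attribute mark, displayed as þ
VM = chr(253)  # value mark, displayed as ý


def pack_record(attributes):
    """Join attributes into one delimited string.

    Each attribute is either a string or a list of values (a multivalue).
    """
    return AM.join(VM.join(a) if isinstance(a, list) else a for a in attributes)


def unpack_record(packed):
    """Split a delimited string back into attributes and multivalues."""
    return [a.split(VM) if VM in a else a for a in packed.split(AM)]


# Hypothetical stand-in for the database: global name -> {item id: string}
store = {"^TEST": {}}
store["^TEST"]["1"] = pack_record(
    ["ATTR1", "ATTR2", "ATT3", ["ATTR4.1", "ATTR4.2"]]
)
```

Here `store["^TEST"]["1"]` holds the same delimited string shown for `^TEST(1)`, and `unpack_record` recovers the four attributes, with the fourth coming back as a two-value list.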

See the CREATE-FILE example below.

By default, when no sort is indicated a select will select the records in key order.   Note that ALL records within the database are stored the same way.  It is the content and key structure that changes.   In this case storing a MV record.

The speed of selects or any query in Cache is going to depend on several factors.  What is the nature of the SELECT command?  Are you doing any BY-EXP options here?  If this is a serious reduction in performance, I would suggest calling into the WRC or consulting with your Sales Engineer.  You can use the (Y and/or (Z options to get more information on how the command will be executed, which may help.
https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=RVCL_commands#RVCL_commands_select

As to what CREATE-FILE does there is also some nuance to this depending on what type of file you are creating.  A basic mv file created like this:  CREATE-FILE TEST  
will result in a VOC (MD) entry that looks like this:

    TEST
0001 F
0002 ^TEST
0003 ^DICT.TEST
0004
0005
0006
0007
0008
0009
0010

Lines 2 and 3 identify the globals in the Cache database that contain the data and dictionary respectively.  There is a lot more information that may be recorded here too, depending on the type of file, indexes, and so on.  I would recommend reviewing the documentation at

https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls

Under Technology on the left side pick 'Multivalue'

Hope all this helps

Thank you Rich. I will take a little time to dig into your reply.
 

Regarding non-linear, in your first reply, you stated this, when describing how Cache does lookups:

‘However once in the group the lookup is a linear function that examines every record in every block of the group to find the record being retrieved.’

You followed up with concerns about sizing issues and poor performance of Universe/Unidata, when inserting new data. In a linear system, processing time increases proportionally with the size of the data set being retrieved. Conversely, a non-linear system behaves differently: as data size grows, the processing time does not increase in a straightforward manner. Non-linear systems can become complex, and their characteristics differ from linear systems. I’m looking to better understand what you are inferring about Universe (and, its non-linear behavior) in contrast to Cache.

I think there is a little confusion here.  That quoted description of lookups is how Multivalue works with its hashed storage structure.  Cache does NOT use a hashed storage model.

The point I really wanted to make is that Cache data files, which we refer to as Globals, do not require any kind of resizing the way Universe/Unidata files did, which reduces your operational maintenance.  In addition, the database as a whole, which contains all the 'globals', will grow dynamically as more data is loaded.

If you want to PM me I could set aside some time to do a short teams call with you.

Regards,

Rich Taylor