Question
· Apr 10

Performant SQL For Paging Results (DataTables, Select2, etc.)

Hello,

Our software commonly returns a full result set to the client and we use the DataTables plugin to display table data.  This has worked well, but as datasets grow larger, we are trying to move some of these requests server-side so the server handles the bulk of the work rather than the client.  This has had me scratching my head in so many ways.

I'm hoping I can get a mix of general best-practice advice but also maybe some IRIS-specific ideas.

Some background

- I would admit our tables aren't best optimized for SQL, in the sense that we don't use parent-child relationships; instead we point to other tables as property object references and often index on those pointers. I don't think this is uncommon, yeah? I think it becomes an issue when we access a property through a pointer that points to another pointer, and so on, until we reach the object reference we want.

- I assume this next concept is common: depending on the table, some of the data reported to the client is calculated rather than read directly from the table or its pointers. Or the data itself may be a link, so there's an <a>Data Name</a> wrapped around it. This means that if we need to filter on the calculated data, we can't rely on the SQL alone. We call a base query, load the resulting object IDs into an array, and then loop through the array to process, transform, and filter the data in ObjectScript.

- Maybe that previous point isn't so common? Do I need to buckle down and learn how to get a lot fancier in SQL? We have methods that transform data in ObjectScript, but I don't think I can call those within ISC SQL. But really, at the end of the day, the SQL translates to a bunch of $ORDER-ing through indexes and such to get the data we need, right? So by calling a base query and then further processing in ObjectScript, I'm just doing what the SQL would have been doing in its guts. I know I'm oversimplifying, but does that make sense?

- A use case for example: we have a table of, let's say, 1 million patients. A base SQL query can performantly get that down to, say, 100,000 patients that belong to a specific facility. I only need the first 10 to show in my client-side DataTables view, but since I'm processing it all server-side and don't want to return 100,000 results to the client, just 10, I still have to return the count of 100,000 to the client. That's easily done with SQL COUNT, but things get trickier when I add a filter or search text. Now I have to tell DataTables on the client side how many filtered records I have. So if someone searches for "Sally", the query has to count that there are 100,000 records total, find the 50 of them that match "Sally" (another point on this later), and return only the first 10. Or 11-20 if it's page two. And so on. OK, that's still possibly all doable in SQL. But what if "Sally" is in the patient name? Or in the responsible party name? Or the street name ("Sally St.")? Or the city name (OK, no US city is named Sally, though there's a Sally's Cove in Newfoundland)? Now my SQL query becomes more complex (WHERE FirstName LIKE '%SALLY%' OR Street1 LIKE '%SALLY%' OR Street2 LIKE '%SALLY%' ...). Start adding other optional filters on top of the search text. Once those filters are filtering on calculated data, the SQL query can no longer determine the TOTAL count and the FILTERED count, and my pagination breaks. We also have to consider the ordering of the results, which can be done in the query as long as you can point to or calculate the data there; but as soon as one thing can't be determined from the query, you either have to disable ordering on that column or write the simplest query and handle all the ordering, sorting, and filtering in ObjectScript.
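
To make that concrete, here is roughly the shape of the three queries a server-side DataTables endpoint needs. Table and column names are made up for illustration, and since we're on 2024.1 the page window uses the TOP + %VID idiom rather than the newer LIMIT/OFFSET:

    -- recordsTotal: everything in scope for this view
    SELECT COUNT(*) FROM Demo.Patient WHERE Facility = ?

    -- recordsFiltered: same scope plus the search text
    SELECT COUNT(*) FROM Demo.Patient
     WHERE Facility = ?
       AND (FirstName LIKE ? OR LastName LIKE ? OR Street1 LIKE ?)

    -- the visible page (rows 11-20 for page two)
    SELECT * FROM (SELECT TOP 20 ID, FirstName, LastName
                     FROM Demo.Patient
                    WHERE Facility = ?
                      AND (FirstName LIKE ? OR LastName LIKE ? OR Street1 LIKE ?)
                    ORDER BY LastName)
     WHERE %VID BETWEEN 11 AND 20

This pattern falls apart exactly where described above: when the LIKE targets are calculated values that don't exist as real columns.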

I hope this helps paint a picture, and I hope folks have some ideas or experiences they can share.  Some other ideas swimming in my head:

- Maybe we need to create temp tables or derived tables so the data is there waiting, shaped the way the request wants it - this seems like a big project, though. How and when do the tables refresh or update? Do they live in the same namespace? What if a table is being updated while a user queries it - are there locking issues?

- As I said before, maybe I just need to get better at SQL - I know people write way more complex and performant queries than I can even imagine.

- Create better indexes and table relationships - if we know we are pointing to a pointer four or five tables deep, could we connect the source table directly to that table with a new object property and index it appropriately?

- Get better at reexamining client versus server responsibilities - in the case of DataTables, I'm trying to do all the table init config in the payload from the server, so that I'm just passing a block of JSON into the init and not having to fiddle with client-side config. Same with formatting. Maybe I need to get better at letting the client handle the things the client is better at handling (e.g. forming data into URLs) and let the server focus on raw data.

I would LOVE to hear your thoughts, folks.  Thank you for any resources, best practices, real-world experience, or whims you may have!

Product version: IRIS 2024.1
$ZV: IRIS for UNIX (Ubuntu Server LTS for ARM64 Containers) 2024.1.2 (Build 398U) Thu Oct 3 2024 14:29:04 EDT

Hi Michael,

I'm neither an expert nor a programmer, so take the below with your most skeptical grains of salt:

When it comes to SQL pagination of results, do you know about the new SQL server-side pagination in 2025.1? This can help you limit what you are sending to the client. If you are on an older version, you can use the TOP clause to control the number of returned results.
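
For example (I haven't used 2025.1 myself, so double-check the exact syntax in the docs):

    -- 2025.1 and later: true server-side pagination
    SELECT ID, Name FROM Demo.Patient ORDER BY Name LIMIT 10 OFFSET 20

    -- older versions: TOP at least caps how many rows come back
    SELECT TOP 10 ID, Name FROM Demo.Patient ORDER BY Name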

"I would admit our tables aren't best optimized for SQL in the sense that we don't utilize parent-child relationship in tables rather we point to other tables as property object references and often index on those pointers. I don't think this is uncommon, yeah?  I think it becomes an issue when we are accessing a property by the pointer pointing to another pointer and so on until we make the object reference we want."

I can't speak to how common this is, but what you are describing is impacted by swizzling. It can become an issue by taking up a lot of memory as you load deeply nested objects. How big an issue depends on the size of your data, how long objects are kept in memory, how capable your system is, etc.
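
One SQL-side trick that sidesteps swizzling entirely: implicit-join arrow syntax follows reference properties as joins inside the engine instead of you opening each object in ObjectScript. A hypothetical example, where Facility and Region are invented reference properties:

    SELECT p.Name, p.Facility->Name, p.Facility->Region->Name
      FROM Demo.Patient p
     WHERE p.Facility->Region->Name = 'North'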

Two more issues you need to watch out for are concurrency and privilege checking, but these aren't specific to accessing nested objects.

"- I assume this next concept is common: depending on the table, some of the data reported to the client is calculated rather than specifically derived from the table or pointers. Or the data itself may be a link so there's an <a>Data Name</a> around it. This means if we need to filter data based on the calculated data we can't rely on just the SQL.  We call a base query and then load the resulting object IDs into an array and then loop through the array and process, transform and filter the data in ObjectScript. "

You can define computed properties, so that instead of querying the field that stores "<a>Data</a>", you run your SQL query against a computed field that has already done the processing you need on "<a>Data</a>".
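
A rough sketch of what I mean, with invented property names (the relevant keywords are SqlComputed, SqlComputeCode, and SqlComputeOnChange):

    /// Recomputed whenever a source field changes; because it is not marked
    /// Calculated, the value is stored on disk and can be indexed.
    Property SearchText As %String(MAXLEN = 1000) [ SqlComputed,
        SqlComputeCode = { set {*} = $zconvert({FirstName}_" "_{LastName}_" "_{Street1}, "U") },
        SqlComputeOnChange = (FirstName, LastName, Street1) ];

    Index SearchTextIdx On SearchText;

The query side then becomes a single WHERE SearchText LIKE ? against one stored field instead of a pile of ORed LIKEs.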

You can also do that processing before your data is saved, so that your INSERTs and UPDATEs only store the processed data and you save yourself some compute time when querying. If you want something more elaborate, you can set up triggers to execute that processing step on UPDATEs and INSERTs.
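
The trigger flavor would look something like this - again just a sketch, and my understanding is that a row-level BEFORE trigger can set the incoming {FieldName*N} values before they are stored (here SearchText is assumed to be a plain stored property):

    Trigger KeepSearchTextFresh [ Event = INSERT/UPDATE, Foreach = row/object, Time = BEFORE ]
    {
        // {field*N} is the incoming (new) value for the row being saved
        set {SearchText*N} = $zconvert({FirstName*N}_" "_{LastName*N}, "U")
    }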

"- Maybe that previous point isn't so common?  Do I need to buckle down and learn how to get a lot fancier in SQL? We have methods that transform data in ObjectScript, but I don't think I can call those within ISC SQL.  But really, at the end of the day, the SQL translates to a bunch of $ORDER-ing through indexes and such to get the data we need, right?  So by calling a base query and then further processing in ObjectScript, I'm just doing what the SQL would have been doing in the guts.  I know I'm oversimplifying, but does that make sense?"

You can call ObjectScript methods from SQL (and vice versa). If the methods you are using are class methods, they can be exposed as stored procedures by adding the SqlProc keyword to the method signature. You can also create a standalone stored procedure (with either SQL or ObjectScript in its code body).
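
A minimal sketch, assuming a class named Demo.Patient - by default the procedure's SQL name becomes <Class>_<Method> in the package's schema:

    ClassMethod DisplayName(first As %String, last As %String) As %String [ SqlProc ]
    {
        quit $zconvert(last, "U")_", "_first
    }

    -- which is then callable from SQL as:
    SELECT Demo.Patient_DisplayName(FirstName, LastName) FROM Demo.Patient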

What you are saying makes sense, but I don't think it's true. If you call a base query and then use ObjectScript to further process the results, aren't you looping over the data twice? The first time is when you fetch it and the second is when you process it in ObjectScript. I think that by using computed fields and SQL procs, your SQL query will be more efficient than a base query plus ObjectScript processing.

"A use case for example..."

I'm a little confused by this. Does the user have to specify which field they want to look for "Sally" in, or do they have the freedom to just say "return all rows, with all of their fields, if any field contains the word Sally"? If we are talking about the former, you can use Dynamic SQL to construct queries on the fly that only target the columns that need to be filtered. If it is the latter, I think we're getting into the territory of full-text search or even vectors/embeddings, and that's out of my league.
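
For the former case, here is a sketch of building the statement on the fly with %SQL.Statement (table, columns, and variables are all hypothetical):

    // append predicates only for the filters the user actually set
    set sql = "SELECT ID, FirstName, LastName FROM Demo.Patient WHERE Facility = ?"
    set args = 1, args(1) = facilityId
    if $get(searchText) '= "" {
        set sql = sql _ " AND (FirstName LIKE ? OR LastName LIKE ?)"
        set pattern = "%" _ searchText _ "%"
        set args = args + 2, args(args - 1) = pattern, args(args) = pattern
    }
    set stmt = ##class(%SQL.Statement).%New()
    set sc = stmt.%Prepare(sql)
    if $system.Status.IsError(sc) { do $system.Status.DisplayError(sc) quit }
    set rs = stmt.%Execute(args...)
    while rs.%Next() { write rs.%Get("LastName"), ! }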

@Ali Nasser thank you so much for this thoughtful response!

- I did see the new LIMIT and OFFSET SQL functionality, but we aren't on 2025.1 yet, though we soon will be.

- Thanks for the reference to swizzling - I think this is indeed what tends to slow things down.

- I did not know you could compute properties like that! I will have to experiment with this; I can see it working for many properties. It does become a bit cumbersome in the class definition as the classes get larger and larger. I think we could use a mix of triggers and query-time calculations, because some things are static while others might change based on other data.

- OK, I admit I thought twice when I said you can't call methods from SQL queries, and I do recall seeing the SqlProc keyword documentation. I think I need to dig into this more, along with the calculated data.

- My anecdotal thought on the base query plus ObjectScript was that the base query would run REALLY quickly and just return a base set of object IDs to key all the ObjectScript processing. I just tried this approach and you're right, it does take a lot more processing.

- Since we are replicating the DataTables client-side behavior, yes, we are trying to search the whole row for "Sally" in that example. Right now I'm hacking this with a concatenated string and the search text in my ObjectScript processing, but it sounds like full-text search is a specialty, a more delicate kind of processing that I shouldn't try to hack like I am.
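
(Noting for later: from a quick skim of the docs, the built-in route for this looks to be an iFind index - something like the below, which I haven't tried yet:)

    Index FullTextIdx On (SearchText) As %iFind.Index.Basic;

    -- then:
    SELECT ID, FirstName, LastName FROM Demo.Patient
     WHERE %ID %FIND search_index(FullTextIdx, 'Sally')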

We use a LOT of embedded SQL in our application and I have had to move to dynamic to make this work, however since dynamic queries aren't cached, I guess I thought they were less performant. But compared to a base embedded query with additional ObjectScript processing, I think you have given me some ideas to go back and make this work with Dynamic SQL without any additional COS processing.

Thank you for your help here! How wonderful!

You stated:

"We use a LOT of embedded SQL in our application and I have had to move to dynamic to make this work, however since dynamic queries aren't cached,  "

Dynamic queries have always been cached - i.e., the query code, but not the results.

Embedded SQL, prior to the Universal Query Cache, produced code inline inside your class/MAC routines. With the advent of the Universal Query Cache, all SQL statements, whether embedded SQL or otherwise, are cached (the query code, again, not the results). The Universal Query Cache addressed several issues with embedded SQL. For embedded SQL, the query plan/code was produced at compilation time of your class/MAC routine. If the table statistics changed, indices were added or removed, or other events occurred to the table definition, your embedded query would continue to use the old code unless recompiled (for better or worse).

You might find that running a query more than once is faster, but that is because the data is already in the global buffer cache, and the result is a logical read vs. a physical read from disk.

Analytics/MDX queries are different from SQL in that there is a caching mechanism for query results.

@Ali Nasser digging into this more this morning, I am trying to determine the best use cases for SqlProcs versus calculated data.

There are some obvious differences, like whether you want something calculated when it is saved or updated.

But in general, since a lot of my data already exists, it's going to be calculated on the fly (I suppose I could write a script to save/update all the existing records to force the calculated fields to populate). I start to wonder, though: are calculated fields and SqlProcs saving me query performance, or am I just offloading my post-processing into the query itself? Does that make sense?

The reality is that our swizzled references will knock us down a few pegs no matter how I configure it, but I do hope wrapping it all in the query will be more efficient. I have a feeling we'll see some gains on the filtering and sorting end, since I'll be able to simply SELECT this, that, this, and that, and the filters and sorts will then be working with table data that's already calculated.

I think the benefit of SqlProcs is that your stored procedure executes as your data is retrieved. This means data retrieval and processing happen in one fell swoop rather than in two stages, where you loop over your table to retrieve the data and then loop again over the results to process them. This, I hope, can save you a lot of compute time. Of course, this is all conjecture; the best way to know is to benchmark both approaches.
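
Timing both paths can be as simple as bracketing each with $zhorolog - completely schematic, with invented method names:

    set t = $zh
    // path 1: base query, then post-process the IDs in ObjectScript
    do ##class(Demo.Report).BaseQueryPlusCOS(facilityId)
    write "base query + COS: ", $zh - t, "s", !

    set t = $zh
    // path 2: computed fields + SqlProc so one query does everything
    do ##class(Demo.Report).SingleQuery(facilityId)
    write "single SQL pass:  ", $zh - t, "s", !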

Wrapping your application logic in a query is a great idea. You are basically delegating the task of retrieving your data in the most optimal way to the SQL engine, so that the engine has to figure out whether to use indexes, parallelize data retrieval, create temp files to store intermediary results, clean up, etc., rather than you having to do it.