Uncovering Clues by Querying the Interoperability Message tables |

Article

Ben Schlanger · Mar 15, 2024 4m read

#Best Practices #Ensemble #HealthShare #InterSystems IRIS for Health

When using InterSystems IRIS as an interoperability engine, we all know and love how easy it is to use the Message Viewer to review message traces and see exactly what's going on in your production. When a system is handling millions of messages per day, you may not know exactly where to begin your investigation though.

Over my years supporting IRIS productions, I often find myself investigating things like...

What sort of throughput does this workflow have?
Where is the bottleneck?
What are my most common errors?

One of my favorite places to look for clues is the Message Header table, which stores metadata about every message running through the system. These are the same messages that appear in the Message Viewer and the Visual Traces.

I've built up a collection of handy SQL queries, and I'd love to share them with you. My examples are mostly from HealthShare or IRIS for Health use cases, but they can be easily adapted for whatever workflow you have...

-- SQL query to find the # of messages through a component per day
select {fn SUBSTRING(timeprocessed,1,10)} AS day, count(*) MessagesThisDay 
FROM Ens.MessageHeader
where TargetConfigName = 'HS.Hub.Push.Evaluator' 
GROUP BY {fn SUBSTRING(timeprocessed,1,10)}
ORDER BY day ASC

-- SQL query to find long-running messages through particular components
SELECT PReq.SessionID as SessionId, 
  PReq.TimeCreated as pReqTimeCreated, 
  PRes.TimeCreated as pResTimeCreated, 
  {fn TIMESTAMPDIFF(SQL_TSI_SECOND, PReq.TimeCreated,PRes.TimeCreated)} as TimeDelay
FROM (
  SELECT ID, SessionId, TimeCreated
  FROM Ens.MessageHeader
  WHERE MessageBodyClassName = 'HS.Message.PatientSearchRequest'
  AND SourceConfigName = 'HS.Hub.MPI.Manager'
  AND TargetConfigName = 'HUB'
) as PReq
INNER JOIN (
  SELECT ID, SessionId, TimeCreated
  FROM Ens.MessageHeader
  WHERE MessageBodyClassName = 'HS.Message.PatientSearchResponse'
  AND SourceConfigName = 'HS.Hub.MPI.Manager'
  AND TargetConfigName = 'HS.IHE.PIX.Manager.Process'
) as PRes on pReq.SessionId = PRes.SessionId
WHERE {fn TIMESTAMPDIFF(SQL_TSI_SECOND, PReq.TimeCreated,PRes.TimeCreated)} > 1
ORDER BY SessionId desc ----------------------------------------------------------

/*-- Query to find the bottleneck message through a particular component
  -- set your threshold for "how long is too long (e.g. 20 seconds)
  -- look for clusters of messages that are longer than that (e.g. the first cluster started at 3:22:00, then there was a second cluster at 5:15:30)
  -- in each cluster, look at the first message in that cluster (chronologically). That is likely to be the bottleneck message, and all messages after it are victims of its bottleneck 
*/
SELECT %NOLOCK req.TargetConfigName, req.MessageBodyClassName, req.SessionId, req.TimeCreated, req.TimeProcessed, {fn TIMESTAMPDIFF(SQL_TSI_SECOND, req.TimeCreated, req.TimeProcessed)} as TimeToProcess
FROM Ens.MessageHeader AS req
WHERE req.TargetConfigName = 'HS.Hub.Management.Operations'
  AND req.TimeCreated BETWEEN '2021-04-21 00:00:00' AND '2021-04-21 11:00:00'
  AND {fn TIMESTAMPDIFF(SQL_TSI_SECOND, req.TimeCreated, req.TimeProcessed)} > 20

/* If you have a particular error that you're investigating, try this one. It scans through the Ensemble Error Log for "Object to Load not found" entries, then returns some key fields from the relevant PatientSearchRequest message */
SELECT l.SessionId, mh.MessageBodyID, mh.TimeCreated, psr.SearchMode, psr.RequestingUser, psr.FirstName, psr.MiddleName, psr.LastName, psr.SSN, psr.Sex, psr.DOB
FROM Ens_Util.Log as l
INNER JOIN Ens.MessageHeader as mh on l.SessionId = mh.SessionId
INNER JOIN HS_Message.PatientSearchRequest as psr on mh.MessageBodyID = psr.ID
WHERE l.Type = 'Error'
AND l.ConfigName = 'HSPI.Server.APIOperation'
AND l.Text like 'ERROR #5809: Object to Load not found%'
AND mh.MessageBodyClassName = 'HS.Message.PatientSearchRequest'
AND mh.SourceConfigName = 'HSPI.Server.APIWebService'
AND mh.TargetConfigName = 'HSPI.Server.APIOperation'

-- Scan the Ensemble Error Log for a particular timeframe. Count up the different types of errors
SELECT substring(text,1,80) as AbbreviatedError, count(*) as NumTheseErrors
FROM Ens_Util.Log
WHERE Type = 'Error'
AND TimeLogged > '2022-03-03 00:00:00' -- when the last batch started
AND TimeLogged < '2022-03-03 16:00:00' -- when we estimate this batch might end
GROUP BY substring(text,1,80)
ORDER BY NumTheseErrors desc

-- Find the Gateway Processing Time for each StreameltRequest / ECRFetchResponse pair
SELECT sr.Gateway,request.sessionid, response.sessionid, request.timecreated AS starttime, response.timecreated AS stoptime, 
  datediff(ms,request.timecreated,response.timecreated) AS ProcessingTime, 
  Avg(datediff(ms,request.timecreated,response.timecreated)) AS AverageProcessingTimeAllGateways
FROM Ens.MessageHeader request
INNER JOIN Ens.MessageHeader AS response ON response.correspondingmessageid = request.id
INNER JOIN HS_Message.StreamletRequest AS sr ON sr.ID = request.MessageBodyId
WHERE request.messagebodyclassname = 'HS.Message.StreamletRequest'
AND response.messagebodyclassname = 'HS.Message.ECRFetchResponse'