Question
· Nov 14, 2023

Cleaning text, removing characters which break XPATH

Hi All

I'm having a problem cleaning user-entered text from a healthcare system that my HealthConnect system interfaces with.

The input can be anything pasted into an RTF box in an app. It is stored in Oracle and extracted by HealthConnect via an XML-based API.

When the XML is returned, various values are read out of it using %XML.XPATH.Document, and the presence of certain characters entered into the RTF fields causes XPATH to throw an error. For example:

  • Character 8211 (en dash, U+2013) causes XPATH to throw ERROR #6901: XSLT XML Transformer Error: invalid character 0x19 in at line n offset nnnnn
  • Character 8217 (right single quotation mark, U+2019) causes the same ERROR #6901: XSLT XML Transformer Error: invalid character 0x19 in at line n offset nnnnn

(Note that the character code reported in the error does not match the character that actually causes it.)

Obviously, I can find and swap out the individual characters using $REPLACE, or even work out which characters break XPATH and write a routine to operate on the XML and clean it. These solutions seem inelegant and clumsy to me. Can anyone suggest anything simpler or better?

Cheers

Andy

Product version: IRIS 2022.1
$ZV: IRIS for Windows (x86-64) 2021.2.1 (Build 654U) Fri Mar 18 2022 06:09:35 EDT
Discussion (4)

I suspect you have some inconsistency in the Character Encoding in your XML.

Is the XML Character Encoding declared? If yes, how?

i.e. does the first line contain something like <?xml version="1.0" encoding="utf-8"?>?

How are you creating the %XML.XPATH.Document instance from your XML?

It would be helpful if you could post a small code sample that reproduces the issue.

Enrico

Thanks for your prompt reply. I can't post anything without editing the text and changing the XML structure, as it's confidential patient data from a proprietary system. I'll look into what I can do.

I can answer a few of your questions, though:

The XML character encoding is declared as <?xml version="1.0" encoding="UTF-8"?>

This is the XPATH code I'm using:

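// Salient points of the code
#dim tDocument As %XML.XPATH.Document
// Parse the XML string into an XPATH document
Set tSC=##class(%XML.XPATH.Document).CreateFromString(pXML,.tDocument)
// Evaluate the expression against the given context
Set tSC=tDocument.EvaluateExpression(pContext, pExpression,.tResults)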

The XML response is retrieved as a string from an Operation with an EnsLib.SOAP.OutboundAdapter adapter. Here's the salient code:

// Salient code
set ..Adapter.WebServiceURL  = ..URL
Set ..Adapter.WebServiceClientClass = "rocessMessageSoap"
Set tSC = ..Adapter.InvokeMethod("ProcessMessage",.ProcessMessageResult,tRequestMessage.requestMessageXml)  Quit:$$$ISERR(tSC) tSC
Set tSC = tRequestMessage.NewResponse(.pResponse)  Quit:$$$ISERR(tSC) tSC
Set pResponse.ProcessMessageResult=$get(ProcessMessageResult)

// pResponse.ProcessMessageResult contains the XML response we are analysing

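It seems that character 8211 (en dash) is held in your string as a 16-bit (UTF-16) character rather than as UTF-8 bytes, so the content does not match the declared encoding. Google is your best friend here, and I'm not an expert in Unicode, UTF-8, UTF-16, etc.! 😊 Converting the string to UTF-8 before parsing, as in this example, avoids the error: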

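// Build a test document containing an en dash (character 8211)
Set xml="<?xml version=""1.0"" encoding=""UTF-8""?>"
Set xml=xml_"<Text>This is n-dash "_$wc(8211)_" in xml</Text>"
// Encode the string as UTF-8 so it matches the declared encoding
Set xml=$ZCONVERT(xml,"O","UTF8")
Set sc=##class(%XML.XPATH.Document).CreateFromString(xml, .xmlDoc)
Write sc
// Read the text of the <Text> element back out
Set sc=xmlDoc.EvaluateExpression("/Text","text()",.result)
Write result.GetAt(1).Value,!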

Result:

This is n-dash – in xml
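
Applied to the code posted in the question, the same conversion might look like the sketch below. This is only an illustration under stated assumptions: the class name Demo.XPathHelper and the method name Evaluate are hypothetical, and it assumes the string returned by the SOAP adapter already holds decoded Unicode characters (as in the test above) rather than raw UTF-8 bytes.

```objectscript
/// Hypothetical helper class, not part of the original code
Class Demo.XPathHelper Extends %RegisteredObject
{

/// Evaluate pExpression against pXML after encoding the string as UTF-8.
ClassMethod Evaluate(pXML As %String, pContext As %String, pExpression As %String, Output pResults As %ListOfObjects) As %Status
{
    #dim tDocument As %XML.XPATH.Document
    // Assumption: pXML contains wide characters such as $CHAR(8211)
    // while its XML declaration says encoding="UTF-8".
    // Encode the characters as UTF-8 bytes so content and declaration agree.
    Set tUtf8 = $ZCONVERT(pXML, "O", "UTF8")
    Set tSC = ##class(%XML.XPATH.Document).CreateFromString(tUtf8, .tDocument)
    If $$$ISERR(tSC) Quit tSC
    Quit tDocument.EvaluateExpression(pContext, pExpression, .pResults)
}

}
```

If the incoming string turns out to be UTF-8 bytes already, converting it a second time would double-encode it, so it is worth checking what the adapter actually returns before adopting this.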

Enrico