Filtering non-printable characters

Hi-

I have a REST client that calls a REST service and receives, as the response, a stream containing a JSON structure. The service is placing some weird non-printable characters in places in the JSON document, which throws off parsing of a downstream XML document.

What I would like to do is just remove the non-printable characters from the response stream that comes back from my call to the REST service.

Does anyone have a handy utility or method for removing all non-printable characters from a character stream?


Answers


Not sure if the solution is perfect, but this code should work:

ClassMethod ProcessJsonPayload(pInput As %Library.AbstractStream, Output pOutput As %Stream.Object = {$$$NULLOREF}) As %Status
{
  set tSC=$$$OK
  set oTempStream=##class(%GlobalBinaryStream).%New()
  set tSC=oTempStream.CopyFrom(pInput)
  quit:$$$ISERR(tSC) tSC
  set oCleanStream=##class(%GlobalBinaryStream).%New()
  while 'oTempStream.AtEnd {
    set content=oTempStream.Read()
    ; strip all control characters from this chunk
    set content=$zstrip(content,"*C")
    set tSC=oCleanStream.Write(content)
    quit:$$$ISERR(tSC)
  }
  quit:$$$ISERR(tSC) tSC
  set pOutput=oCleanStream
  set tSC=##class(%ZEN.Auxiliary.jsonProvider).%ConvertJSONToObject(oCleanStream,"%DynamicObject",.oJsonDynamicObject)
  quit tSC
}

Please let me know if this does not solve the issue.

Ken,

Can you be more specific about the non-printing characters? If you are serializing data to XML, you need to be clear about whether you are correctly XML-escaping the characters, or whether you are actually receiving values that are not permitted in XML. Which of these problems are you running into?

Here's what is happening. The REST service is returning data that appears to have been encoded using escape sequences. For example, line feeds are changed to \n, carriage returns to \r, and other non-printable characters to Unicode escape codes: $c(26), for instance, is changed to \u001a.

These sequences are then (apparently) automatically translated back to their regular ASCII characters when I use %DynamicObject's %FromJSON method to convert the JSON stream to an object.
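As a quick illustration of that decoding step (a sketch; the `freeText` property name is made up for the example):

```
// ObjectScript string literals do not process backslash escapes,
// so this string contains a literal \u001a sequence
set json = "{""freeText"":""abc\u001adef""}"
set obj = ##class(%DynamicAbstractObject).%FromJSON(json)
// %FromJSON decodes the escape, so the value now holds the raw $c(26)
write obj.freeText [ $char(26)   ; prints 1
```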

We then take that object and pass it to a DTL transform that converts it to another object (in this case an SDA3 set of containers), and the non-printable characters (specifically $c(26)) throw off the generation of XML. That in itself seems correct, because I don't think $c(26) is allowed in XML.

What I want to do is get rid of these non-printable characters (other than things like CR and LF) so that the JSON can be converted to XML properly.

So, as a first check, if this is the only issue, just run the full string through a $TRANSLATE, $tr(jsonstring,$c(26)), and wipe it out.
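For example, in a terminal session (assuming the whole document fits in a single string), the two-argument form of $TRANSLATE simply deletes the listed characters:

```
set jsonstring = "abc"_$c(26)_"def"
// with no third argument, $tr removes every character in the second argument
write $tr(jsonstring,$c(26))   ; prints abcdef
```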

The problem is this: at the time I have access to do this, the data is contained in a stream. Further complicating things, the $c(26) is actually encoded as \u001a, because the Content-Type of the response from the REST service is application/json, which automatically encodes these characters using JSON notation.

I wanted to implement a more general solution, removing all non-printable characters; however, it seems that I need to do this stripping in the transform, where the data actually contains the non-encoded non-printable characters.

I have a simple method that I have created to remove these characters:

 

ClassMethod stripNonPrintables(string As %String) As %String
{
  f i=0:1:8,11,12,14:1:31 set chars=$g(chars)_$c(i)
  quit $tr(string, chars)
}

So, in places where we are seeing these characters, we can simply call this strip method. Not the solution I wanted to implement, but it will work.

 

You can decode JSON escaped characters:

set string = $zcvt(string, "I", "JSON")

and remove special symbols after that.
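A minimal illustration of that two-step approach, decoding first and stripping after (a sketch; the sample text is made up):

```
set s = "free text \u001a end"            ; a literal backslash-u sequence
set decoded = $zconvert(s, "I", "JSON")   ; \u001a becomes the raw $c(26)
set cleaned = $zstrip(decoded, "*C")      ; then drop all control characters
write cleaned
```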

The only issue I have with this is that, in the case of a stream containing a very large amount of data, as I read through the data there is no guarantee that I'm going to get the entire coded entity in one chunk. For example, given the following block of text:

{"freeText": "This is some free text \u001a"}

As I read through the stream using .Read(), the first read could return "{"freeText": "This is some free text \u" and the second call to .Read() could return "001a"}".

Can $zcvt work on stream data?

In order to implement this, I think I would have to put together some fairly involved code to handle these cases. The documents I am trying to remove characters from are very large and couldn't be stored in a single string for use with $zcvt, I think.
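One way to handle an escape split across .Read() boundaries is a small carry-over buffer: if a chunk ends inside a possible \uXXXX sequence, hold the tail back and prepend it to the next chunk. A rough sketch (the method name and chunk size are my own; it also assumes the payload contains no escaped backslashes, \\, which would need extra handling):

```
ClassMethod DecodeJsonEscapes(pIn As %Stream.Object, pOut As %Stream.Object) As %Status
{
  set tSC=$$$OK, carry=""
  while 'pIn.AtEnd {
    set chunk=carry_pIn.Read(32000), carry=""
    // a complete \uXXXX escape is 6 characters long, so a backslash in
    // the last 5 characters may start a split escape: hold it back
    set start=$length(chunk)-4
    if start<1 set start=1
    set p=$find(chunk,"\",start)
    if p,'pIn.AtEnd {
      set carry=$extract(chunk,p-1,*)
      set chunk=$extract(chunk,1,p-2)
    }
    set tSC=pOut.Write($zconvert(chunk,"I","JSON"))
    quit:$$$ISERR(tSC)
  }
  // flush any held-back tail once the stream is exhausted
  if $$$ISOK(tSC),carry'="" set tSC=pOut.Write($zconvert(carry,"I","JSON"))
  quit tSC
}
```

Holding back a complete escape such as a trailing \n is harmless here: it is simply decoded with the next chunk, or in the final flush.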

At the point that the data is accessible as distinct strings, the decoding has already been done. At that point I can just do a $tr to get rid of the non-printable characters, which is what I have done.

For cases like this, a possible solution could be:

%Stream.Global has a FindAt method that could give you a position of  "\u00"

[Find the first occurrence of target in the stream starting the search at position. ]

http://docs.intersystems.com/latest/csp/documatic/%25CSP.Documatic.cls?P...

But: if you are working on the decoded stream, all non-printables are just single characters. There is no issue with cutting it into pieces:

  • read your source stream in reasonably sized chunks
  • clean out whatever you need
  • append it to a temporary stream
  • loop over the source until you hit the AtEnd condition
  • finally, replace your source either with the CopyFrom method [temp -> source]
    or replace the source stream reference with the temp stream reference

I guess the whole code is shorter than this description.
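The steps above might look something like this (a sketch; the class and chunk size are my choices, and the character list reuses the one from the stripNonPrintables method earlier in the thread):

```
ClassMethod CleanStream(pSource As %Stream.Object) As %Status
{
  set tSC=$$$OK
  set tTemp=##class(%Stream.GlobalCharacter).%New()
  // characters to remove: everything below $c(32) except TAB, LF, CR
  set chars="" for i=0:1:8,11,12,14:1:31 set chars=chars_$char(i)
  while 'pSource.AtEnd {
    // read a chunk, clean it, append it to the temporary stream
    set tSC=tTemp.Write($translate(pSource.Read(32000),chars))
    quit:$$$ISERR(tSC)
  }
  quit:$$$ISERR(tSC) tSC
  // replace the source contents with the cleaned copy [temp -> source]
  do pSource.Clear()
  quit pSource.CopyFrom(tTemp)
}
```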

I'd suggest not touching the global under the source stream directly.