Question
Rubens Silva · Apr 15, 2020

Exporting sources as XML to stream

Hello,

Recently I have been required to work with a method called ExportToStream.

I need to export a UTF-8-encoded JSON file as XML so that it can be imported on older releases. Here's how I attempted to do that:

do $System.OBJ.ExportToStream("path/to/my/json/file.json", .stream,,,"UTF8")

The file is indeed encoded as UTF-8, and the XML header states that it was exported as UTF8:

<?xml version="1.0" encoding="UTF8"?>

However, the body content comes out garbled:

"text": "Condição de pagamento sujeito a análise de crédito: "

To repeat: the original file is encoded as UTF-8 and displays correctly in any editor.
Both the editor and the file utility identify the file as UTF-8 without BOM (and that is correct).

With that said, can anyone figure out what I am doing wrong? Or is this a bug?

I tested on Caché 2017 and IRIS 2019.3; both show the same issue.

Discussion (10)

Hi Rubens.

Works fine for me on IRIS 2020.1 with rusw locale, see below.

Perhaps you can try exporting directly to a file instead of using a stream.

USER>!type ..\..\csp\user\test.json

{
"a":"русский текст"
}
USER>set stream=##class(%Stream.FileCharacter).%New()   

USER>set stream.Filename = "c:\temp\qq.xml"                                      

USER>do $System.OBJ.ExportToStream("/csp/user/test.json", .stream,,,"UTF8")      
Exporting to XML started on 04/16/2020 12:13:33
Exporting CSP/CSR or file: /csp/user/test.json
Export finished successfully.

USER>w stream.%Save()
1

USER>!type c:\temp\qq.xml

<?xml version="1.0" encoding="UTF8"?>
<Export generator="IRIS" version="26" zv="IRIS for Windows (x86-64) 2020.1 (Build 215U)" ts="2020-04-16 12:13:33">
<CSP name="test.json" application="/csp/user/" default="1"><![CDATA[
{
"a":"русский текст"
}]]></CSP>
</Export>

Hello @Alexander Koblov.

I also did the test using Export instead of ExportToStream and got the same result.

First, please make sure that the file you used is indeed encoded as UTF-8.

You can check it by using the following command:

file -bi ..\..\csp\user\test.json

The output should include:

charset=utf-8
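
Where the file utility is not available (for example on Windows), a similar check can be sketched in Python. This is only an illustration: the function name describe_encoding is made up for this example, and a successful strict UTF-8 decode is a strong hint, not proof, that the file was meant to be UTF-8.

```python
import codecs

def describe_encoding(data: bytes) -> str:
    """Roughly classify raw file bytes (hypothetical helper, not a real tool)."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    try:
        data.decode("utf-8")  # strict decode: raises on any invalid sequence
    except UnicodeDecodeError:
        return "not utf-8"
    if data.isascii():
        return "us-ascii"  # pure ASCII is valid in both UTF-8 and ISO-8859-1
    return "utf-8 with BOM" if has_bom else "utf-8 without BOM"

utf8_bytes = "análise de crédito".encode("utf-8")
latin1_bytes = "análise de crédito".encode("latin-1")

print(describe_encoding(utf8_bytes))    # utf-8 without BOM
print(describe_encoding(latin1_bytes))  # not utf-8
```

To apply it to a real file, read the file in binary mode and pass the bytes to the function.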

Regarding the further tests I ran: there seems to be an imposed transcoding step when the file is exported. I tried several combinations:

  • When the original file is written in UTF-8 and I export using UTF8, the encoding breaks.
  • When the original file is written in UTF-8 and I export using RAW (which is ISO-8859-1 in my case), the encoding does NOT break.
  • When the original file is written in ISO-8859-1 and I export using RAW, the encoding does NOT break.
  • When the original file is written in ISO-8859-1 and I export using UTF8, the encoding does NOT break.

This is very strange.
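
One hypothesis would account for all four observations. It is only a hypothesis, not a statement about how Export or ExportToStream is implemented: the exporter reads the file's bytes without decoding them (one character per byte, which is what ISO-8859-1 amounts to) and then encodes that text with the requested charset. The Python sketch below models this. The names simulate_export and looks_intact are hypothetical, and RAW is modeled as ISO-8859-1, as in the tests above.

```python
TEXT = "Condição de pagamento sujeito a análise de crédito"

def simulate_export(source: bytes, charset: str) -> bytes:
    """Hypothetical model: bytes are read one character per byte (no decoding),
    then written with the requested output charset."""
    as_read = source.decode("latin-1")  # assumption: no input decoding happens
    out_codec = "utf-8" if charset == "UTF8" else "latin-1"  # RAW -> ISO-8859-1
    return as_read.encode(out_codec)

def looks_intact(exported: bytes, source_codec: str, charset: str) -> bool:
    """UTF8 output is viewed as UTF-8; RAW output is viewed in the source charset."""
    view_codec = "utf-8" if charset == "UTF8" else source_codec
    return exported.decode(view_codec, errors="replace") == TEXT

for source_codec in ("utf-8", "latin-1"):
    for charset in ("UTF8", "RAW"):
        exported = simulate_export(TEXT.encode(source_codec), charset)
        status = "intact" if looks_intact(exported, source_codec, charset) else "broken"
        print(f"source={source_codec:8} export={charset:5} -> {status}")
```

Under this model only the UTF-8 source exported as UTF8 comes out broken, which matches the list above. That makes the hypothesis plausible but does not confirm it.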

@Rubens Silva 
That sounds to me like double encoding.
I'd suggest using a hex editor (e.g. PSPad) to examine your files.
In UTF-8, some characters take more than 8 bits (more than one byte).
Converting an already converted string can produce exactly these strange effects.
And you have already found the way to avoid this yourself.
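
To make the double-encoding idea concrete, here is a small Python sketch. Python is used only as a neutral illustration; it is not what Caché or IRIS does internally. It takes the UTF-8 bytes of a correct string, misreads them as ISO-8859-1, and encodes the result as UTF-8 a second time. The hex output is what a hex editor would show in each case.

```python
original = "Condição"

correct = original.encode("utf-8")   # what a proper UTF-8 file contains
misread = correct.decode("latin-1")  # each byte taken as one character
double = misread.encode("utf-8")     # encoded to UTF-8 a second time

print(misread)           # CondiÃ§Ã£o
print(correct.hex(" "))  # 43 6f 6e 64 69 c3 a7 c3 a3 6f
print(double.hex(" "))   # 43 6f 6e 64 69 c3 83 c2 a7 c3 83 c2 a3 6f
```

In a hex editor, the bytes c3 83 c2 where a single accented letter is expected are the typical signature of text that was UTF-8 encoded twice.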

 

@Robert.Cemper

It's certainly not a hand-made double encoding: we made sure of that by writing a new file in each charset to reproduce the issue.

See this example, which reproduces the issue and shows that there is an unnecessary conversion along the way,

as you showed in your example:
              "text": "Condição de pagamento sujeito a análise de crédito: "

Yes, what I meant to say is that the original file is correct. It's not us who did the double transcoding. The resulting output that I posted:

"text": "Condição de pagamento sujeito a análise de crédito: "

comes straight from the call to Export and/or ExportToStream, which is why I said that these methods seem to impose a transcoding step.

This is weird: I shouldn't have to convert a file to RAW in order to export it as UTF-8. Instead, I should be able to provide the same charset for both input and output, so that the engine knows which encoding is in use without transcoding.

Unless there is a way to effectively disable the hidden transcoding step that these methods perform, they are really misleading.

So I'd suggest involving WRC to check in the sources where the double translation comes from.
(It has probably always been there.)

Just an idea to help understand this:
what do you see if your .stream is a %Stream.GlobalBinary?

Yes, but if you pass in an existing object, say a %Stream.FileCharacter, it outputs to that as well.
I also tried setting the TranslateTable to UTF-8 and used the OutputToDevice method to see the result, but that gave me the same result.