Charset problem: UTF-8 file downloaded via sFTP displays incorrectly

Question

Mikael Toivonen · Aug 17, 2017

My issue is that I don't know why Ensemble processes incoming data in the wrong encoding.

Update: The issue was solved, I think; see my answer below.

Our customer uploads UTF-8 encoded CSV files to our sFTP server. When I view the files on the server side, the Scandinavian characters Ä and Ö (A and O with two dots on top, respectively) display correctly in the source file. When Ensemble downloads the files using an FTP Service, the characters display incorrectly in Ensemble.

I am unable to pinpoint the reason for this behavior and I was hoping this is easily solvable.

Edit: two additional details that I did not remember to mention but Gertjan brought up:

In the inbound FTP adapter's settings I can't set "UTF-8" as the CHARSET. When connecting to an SFTP server (this is done by setting the "SSL Configuration" value in the service's "Additional Settings" category to "!SFTP"), setting the "CHARSET" field in the same category to "UTF-8" causes an error message ("SFTP does not support ascii"). If you choose "Binary", it works, but the Scandinavian characters display incorrectly.
In the RecordMap's settings I have set "UTF-8" as the encoding.

Here is a sample of the original file:

ITEM_CATEGORY|ITEM_CAT2|
BR06002 VERISUONEN KANNATINNAUHA|SYDÄN/VERISUONI LEIKKAUSTARV.|

Here is the data that the Ensemble operation received:

<?xml version="1.0" ?>
   <!-- type: x.x.x.Record  id: 1930 -->
   <Record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:s="http://www.w3.org/2001/XMLSchema">
      <ITEMCATEGORY>BR06002 VERISUONEN KANNATINNAUHA</ITEMCATEGORY>
      <ITEMCAT2>SYDÃN/VERISUONI LEIKKAUSTARV.</ITEMCAT2>

The character " 'A' with two dots on it " is displayed as " 'A' with a wave on it "
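The symptom above is what you typically get when UTF-8 bytes are decoded with a single-byte character set. The following is a minimal Python sketch of that effect, not of Ensemble's actual internals; the use of Latin-1 as the wrong decoder is an assumption chosen because it reproduces the "A with a wave" character.

```python
# Illustration only: assumes the bytes were read as Latin-1 instead of UTF-8.
original = "SYDÄN"

# UTF-8 stores "Ä" as two bytes: 0xC3 0x84.
raw = original.encode("utf-8")

# Decoding those bytes one byte per character turns 0xC3 into "Ã"
# and 0x84 into an invisible control character.
misread = raw.decode("latin-1")

# Decoding the same bytes as UTF-8 gives the original text back.
restored = raw.decode("utf-8")

print(raw)
print(ascii(misread))
print(restored)
```

Note that the misread string is one character longer than the original, because the two-byte sequence for "Ä" becomes two separate characters.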

Mikael Toivonen · Aug 26, 2017

We solved the encoding issue by setting the CHARSET value in the inbound FTP adapter's settings to "@UTF8".

I do not know if this was the right way to fix this, and I need to do more research into it. Setting the CHARSET value to "cp1252" or "hebrew" or "windows-1252" or "cyrillic" - basically anything other than "binary" or "UTF-8" - seems to work just fine. I really don't understand character sets right now.

Here are some pages I ran into while researching this:

What does InterSystems say about the CHARSET options for the inbound FTP adapter:
https://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=EFTP_Charset
What about Translation Tables?
http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=GORIENT_ch_localization
The "Character Encoding" value in the Recordmap's properties seems to apply only for the sample file used to build the Recordmap, but not for files going through the compiled recordmap later on:
http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=EGDV_recmap

score 0 · Answer 1 · 2017-08-17T04:38:49-04:00

One of the settings for this service (category Additional settings) is Charset. This allows you to specify the file charset. Did you try setting this to UTF-8?

score 0 · Answer 2 · 2017-08-17T07:47:54-04:00

I wrote my comment below as a new answer, because apparently I had an "I don't know how to community.intersystems" moment.

score 0 · Answer 3 · 2017-08-17T05:22:53-04:00

It appears to be enabled when I try here, but I haven't used recordmaps much so I don't have one fully configured. I see another setting for the encoding in the record map properties itself; perhaps this is where things should be configured? (That would also explain why the adapter setting is disabled: you want to configure this only once.)

score 0 · Answer 4 · 2017-08-17T04:51:18-04:00

Yes, normally that does sound like the logical setting, but an sFTP connection only allows binary. UTF-8, ANSI, etc. are not selectable.

score 0 · Answer 5 · 2017-08-17T07:44:54-04:00

A good point. Unfortunately I have that already set as "UTF-8". I added both the points you mentioned into my original post and in hindsight should have done that initially, but forgot.