Question
· Aug 17, 2017

Charset problem: UTF-8 file downloaded via sFTP displays incorrectly

My issue is that I don't know why Ensemble processes incoming data in the wrong encoding.

Update: Issue was solved, I think, see my answer below.

Our customer uploads UTF-8 encoded CSV files to our sFTP server. When I view the files on the server side, the Scandinavian characters Ä and Ö (A and O with two dots on top, respectively) display correctly in the source file. When Ensemble downloads the files using an FTP service, the characters display incorrectly in Ensemble.

I am unable to pinpoint the reason for this behavior and I was hoping this is easily solvable.
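
To rule out the source file itself, this is the kind of byte-level check I ran on the server side (plain Python, not Ensemble code; "items.csv" is a placeholder for the customer's file name):

    # Confirm the file really is UTF-8 before Ensemble touches it
    with open("items.csv", "rb") as f:
        data = f.read()

    # In UTF-8, Ä is the byte pair 0xC3 0x84 and Ö is 0xC3 0x96
    print(b"\xc3\x84" in data, b"\xc3\x96" in data)   # True True -> file looks like UTF-8
    print(data[:80])                                  # inspect the raw bytes directly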

Edit: two additional details that I forgot to mention, but which Gertjan brought up:

  1. In the inbound FTP adapter's settings I can't set "UTF-8" as the CHARSET. When connecting to an SFTP server (done by setting the "SSL Configuration" value in the service's "Additional Settings" category to "!SFTP"), choosing "UTF-8" for the CHARSET field causes an error ("SFTP does not support ascii"). Choosing "Binary" works, but the Scandinavian characters still display incorrectly.  
  2. In the RecordMap's settings I have set "UTF-8" as encoding. 

Here is a sample of the original file:

ITEM_CATEGORY|ITEM_CAT2|
BR06002 VERISUONEN KANNATINNAUHA|SYDÄN/VERISUONI LEIKKAUSTARV.|

Here is the data that the Ensemble operation received:

<?xml version="1.0" ?>
   <!-- type: x.x.x.Record  id: 1930 -->
   <Record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:s="http://www.w3.org/2001/XMLSchema">
      <ITEMCATEGORY>BR06002 VERISUONEN KANNATINNAUHA</ITEMCATEGORY>
      <ITEMCAT2>SYDÃN/VERISUONI LEIKKAUSTARV.</ITEMCAT2>

The character " 'A' with two dots on it " is displayed as " 'A' with a wave on it "

Discussion (8)

We solved the encoding issue by setting the CHARSET value in the inbound FTP adapter's settings to "@UTF8".

I do not know if this was the right way to fix it, and I need to do more research on it. Setting the CHARSET value to "cp1252" or "hebrew" or "windows-1252" or "cyrillic" - basically anything other than "binary" or "UTF-8" - also seems to work just fine. I really don't understand character sets right now.
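
My (unverified) guess at why almost any single-byte charset works: a one-byte-per-character table maps every byte to exactly one character and back again, so the UTF-8 byte sequence survives the detour untouched and can still be decoded as UTF-8 further downstream. A Python sketch of that idea (Latin-1 chosen because it round-trips all 256 byte values; this is an assumption about the adapter's behaviour, not a statement about Ensemble internals):

    original_bytes = "SYDÄN".encode("utf-8")    # b'SYD\xc3\x84N'
    detour = original_bytes.decode("latin-1")   # adapter reads with a one-byte table
    restored = detour.encode("latin-1")         # ...and writes the same bytes back out
    assert restored == original_bytes
    print(restored.decode("utf-8"))             # SYDÄN - correct again once decoded as UTF-8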


Mikael, you mentioned here that your problem was solved. I'm experiencing the same problem, but your solution doesn't seem to work..? Maybe I am doing something wrong here..? I have a CSV file with patient info that is in UTF-8, and I'm trying to load it into my system using an FTP service. So far so good... The charset of my FTP service is 'binary'. When I view the I/O trace I can see that everything is still fine. The service then creates a RecordMap based on the data from the CSV. The RecordMap and the related Batch are also set to UTF-8. But when I check the trace I can see that a few characters are scrambled..

the string  Schatorjé  is scrambled to  SchatorjÃ©

The character é (stored as two bytes in UTF-8: 0xC3 0xA9) is shown as Ã©
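
A quick way to confirm it is the "UTF-8 read as Latin-1" pattern (plain Python sketch, values copied from the example above): reversing the wrong decode should give the original text back.

    scrambled = "SchatorjÃ©"
    repaired = scrambled.encode("latin-1").decode("utf-8")
    print(repaired)    # Schatorjé - the bytes themselves are fine, only the
                       # decoding step picked the wrong charset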

So basically I'm sending a file in UTF-8 to IRIS, and somewhere in the FTP protocol or in the RecordMap the charset changes? Do you have any clues?

Hello @Rob Schoenmakers 

I am experiencing the exact same issue. I currently have an open case with the WRC, but we haven't found a solution yet.

From our testing, the issue seems strictly isolated to SFTP retrieval via EnsLib.RecordMap.Service.FTPService. If I retrieve the exact same .CSV file locally using an EnsLib.RecordMap.Service.FileService, I have no encoding issues whatsoever.

The baffling part is that this behavior is inconsistent across environments: the exact same production works perfectly on other instances (like an IRIS Community container). Even InterSystems Support deployed my production on their local instance, connected to my SFTP server, and could not reproduce the encoding error.

If anyone has found a fix or a workaround, I'm all ears! :D

Best regards,