Question
· Aug 17, 2017

Charset problem: UTF-8 file downloaded via sFTP displays incorrectly

My issue is that I don't know why Ensemble processes incoming data in the wrong encoding.

Update: Issue was solved, I think, see my answer below.

Our customer uploads UTF-8 encoded CSV files to our sFTP server. When I view the files on the server side, the Scandinavian characters Ä and Ö (A and O with two dots above them) display correctly in the source file. When Ensemble downloads the files using an FTP service, the characters display incorrectly in Ensemble.

I am unable to pinpoint the reason for this behavior and I was hoping this is easily solvable.

Edit: two additional details that I did not remember to mention but Gertjan brought up:

  1. In the inbound FTP adapter's settings I can't set "UTF-8" as the CHARSET. When connecting to an SFTP server (done by setting the "SSL Configuration" value in the service's "Additional Settings" category to "!SFTP"), setting the "CHARSET" field in the same category to "UTF-8" causes an error message ("SFTP does not support ascii"). If you choose "Binary", it works, but the Scandinavian characters display incorrectly.
  2. In the RecordMap's settings I have set "UTF-8" as encoding. 

Here is a sample of the original file:

ITEM_CATEGORY|ITEM_CAT2|
BR06002 VERISUONEN KANNATINNAUHA|SYDÄN/VERISUONI LEIKKAUSTARV.|

Here is the data that the Ensemble operation received:

<?xml version="1.0" ?>
   <!-- type: x.x.x.Record  id: 1930 -->
   <Record xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:s="http://www.w3.org/2001/XMLSchema">
      <ITEMCATEGORY>BR06002 VERISUONEN KANNATINNAUHA</ITEMCATEGORY>
      <ITEMCAT2>SYDÃN/VERISUONI LEIKKAUSTARV.</ITEMCAT2>

The character Ä ("A" with two dots above it) is displayed as Ã ("A" with a tilde above it).
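
This symptom is consistent with the file's UTF-8 bytes being read one byte per character. Below is a minimal sketch in plain Python (not Ensemble code) that reproduces it. The choice of Latin-1 as the single-byte charset applied on the receiving side is an assumption for illustration; the post does not say which one Ensemble actually used.

```python
# Illustration only: plain Python, not Ensemble/ObjectScript.
original = "SYDÄN/VERISUONI LEIKKAUSTARV."

# What the customer's file contains on disk: UTF-8 bytes.
raw = original.encode("utf-8")

# Assumption: the receiver treats every byte as one Latin-1 character.
garbled = raw.decode("latin-1")

# Undoing the wrong step recovers the text, which supports the diagnosis.
repaired = garbled.encode("latin-1").decode("utf-8")

print(raw[3:5])              # the two bytes that encode "Ä" in UTF-8
print(repr(garbled))         # "Ä" has become "Ã" plus an invisible control character
print(repaired == original)
```

If the round trip restores the original string, the data was not corrupted in transit; it was only decoded with the wrong character set.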

Discussion (6)

We solved the encoding issue by setting the CHARSET value in the inbound FTP adapter's settings to "@UTF8".

I do not know whether this is the right fix, and I need to research it further. Setting the CHARSET value to "cp1252", "hebrew", "windows-1252" or "cyrillic" (basically anything other than "binary" or "UTF-8") also seems to work just fine. I really don't understand character sets right now.

Here are some pages I ran into while researching this: