Question
Michoel Reach · Apr 8

Converting ISO-8859-1 input document

We have a Unicode installation of Caché. A client wants to send us documents that will be machine-read and loaded automatically. They want to create the documents in ISO-8859-1 ("Latin-1"). We'd need to convert the text to UTF-8 for our system. I read the documentation on the $ZCONVERT function, but I didn't see this option. How should it be done?

Thanks!

Product version: Caché 2018.1
$ZV: Cache for Windows (x86-64) 2018.1.4 (Build 505_1_20258U) Thu Sep 10 2020 10:22:22 EDT

In the %SYS namespace there is a utility, NLS, that shows your installed conversion tables and their short names.

%SYS>d ^NLS
2) Select defaults
2) I/O tables
Items marked with (*) represent the locale's original default
 I/O table              Current default
---------------------  --------------------
1) Process             RAW (*)
2) Cache Terminal      UTF8 (*)
3) Other terminal      UTF8 (*)
4) File                RAW (*)
5) Magtape             RAW (*)
6) TCP/IP              RAW (*)
7) System call         RAW (*)
8) Printer             RAW (*)
 
I/O table: 4
 
 1) RAW (*)                              2) UTF8
 3) UnicodeLittle                        4) UnicodeBig
 5) CP1250                               6) CP1251
 7) CP1252                               8) CP1253
 9) CP1255                              10) CP437
11) CP850                               12) CP852
13) CP866                               14) CP874
15) EBCDIC                              16) Latin2
17) Latin9                              18) LatinC
19) LatinG                              20) LatinH
21) LatinT

So you see the short names: there is no Latin1, but there is CP1252, which is almost identical.
The difference between the two is described here:
 https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

"ISO-8859-1 (also called Latin-1) is identical to Windows-1252 (also called CP1252) except for the code points 128-159 (0x80-0x9F). ISO-8859-1 assigns several control codes in this range. Windows-1252 has several characters, punctuation, arithmetic and business symbols assigned to these code points."

See also: Encoding Problem: ISO-8859-1 vs Windows-1252
So you should check what your customer really sends (some don't mention that they use Windows).

The appropriate table can be used in

  • $ZCONVERT()
  • the TranslateTable property of ##class(%Stream.FileCharacter)
  • the /IOTABLE= parameter of the OPEN command

Thank you! I followed what you said, right up to the last step, "The appropriate table can be used..." How does one use these tables? Just by the name "CP1252", as in str0=$zconvert(str,direction,"CP1252")? Or with /IOTABLE="CP1252"?

Thanks!

The /IOTABLE="CP1252" format applies only to the OPEN command.

$ZCONVERT and %Stream.FileCharacter take the table by its name alone, "CP1252".
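A minimal sketch of both usages, assuming the client's data is really CP1252. The file path is a placeholder and the variable names are made up for illustration. The first part converts a byte string in memory; the sample string contains byte 0x80, which is a C1 control in ISO-8859-1 but the euro sign in CP1252, so the dump shows whether the table was applied. The second part sets the same table on a file stream.

```objectscript
 // In-memory conversion: "I" means input translation (external bytes -> Unicode).
 set raw=$char(65,128,196)                  // sample bytes: "A", 0x80, 0xC4
 set text=$zconvert(raw,"I","CP1252")
 zzdump text

 // Reading a file through the same table.
 // "C:\in\doc.txt" is a hypothetical path, replace it with your own.
 set stream=##class(%Stream.FileCharacter).%New()
 set sc=stream.LinkToFile("C:\in\doc.txt")
 set stream.TranslateTable="CP1252"
 while 'stream.AtEnd {
     set line=stream.ReadLine()
     // process line here; it already holds Unicode characters
 }
```

If the data turns out to be true ISO-8859-1, this conversion step should not be needed, for the reason given in the next reply.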

If you get data as ISO-8859-1 (aka Latin1) and have a Unicode (IRIS/Caché) installation, then usually you have nothing to do (except to process the data), because the first 256 Unicode code points are identical to ISO-8859-1. What do you mean by "convert the text to UTF-8"? In IRIS/Caché you have (and work with) Unicode code points. UTF-8 comes into play only when you export your data, but in your case the export would rather be ISO-8859-1, or am I misunderstanding something?

By the way, if you return your data to your Latin1 source (as Latin1), then you have to take some precautions: because you have a Unicode installation, you could mix your Latin1 data with true Unicode data from other sources during processing!

See: https://unicode.org/charts/

Also, you may download and read:

https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf

Sorry, could you explain? If there are special characters (i.e., non-ASCII) in the input stream from ISO-8859-1, would they load correctly into our database - that is, as the correct corresponding Unicode characters - without a conversion process? Thanks.

Counterquestion, do you have an example of a 'non-ASCII' char?

Code points 0x00-0x7F (0-127) are the "C0 Controls and Basic Latin" block, aka ASCII

Code points 0x80-0xFF (128-255) are the "C1 Controls and Latin-1 Supplement" block, aka Latin1

Take a look at https://www.unicode.org/charts/PDF/U0080.pdf

For example, Ä and ä are the German umlaut-A and umlaut-a, respectively.

$ascii("Ä") --> 196 and $ascii("ä") --> 228. Type in a terminal session on your system: write $char(196) --> Ä

Download the above PDF and compare it with your ISO-8859-1 data; there should be no difference.

Huh. Well, that would simplify matters. You're saying that Latin-1 is actually a subset of Unicode, backwards compatible. If so, never mind, sounds like we're good!

not so terrible... but one more thing

set ascii=$char(65,66,196)

set wide=$char(65,66,352)

write $ziswide(ascii)," ",$ziswide(wide)

zzdump ascii,wide

As I wrote in my first answer, you have to take care always to return non-wide (8-bit) data and not wide data.
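One way to apply this precaution, sketched with a made-up sample string: check the value with $ZISWIDE before it is written back to the Latin1 consumer. What to do with a wide string (reject it, replace the offending characters, log it) is left open here, since that depends on the application.

```objectscript
 // Sample value containing U+0160, which has no ISO-8859-1 byte.
 set str=$char(65,66,352)
 if $ziswide(str) {
     // At least one character is above code point 255.
     write "Wide characters present, cannot be sent as Latin1 unchanged",!
 } else {
     write "Only 8-bit characters, safe to send as Latin1",!
 }
```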