Converting ISO-8859-1 input document

Question

Michoel Reach · Apr 8, 2021

We have a Unicode installation of Caché. A client wants to send us documents that will be machine-read and loaded automatically. They want to create the documents in ISO-8859-1 ("Latin-1"). We'd need to convert the text to UTF-8 for our system. I saw the documentation on the $ZCONVERT function, but I didn't see this option. How should it be done?

Thanks!

Product version: Caché 2018.1

$ZV: Cache for Windows (x86-64) 2018.1.4 (Build 505_1_20258U) Thu Sep 10 2020 10:22:22 EDT

Discussion (10)

Julius Kavay · Apr 8, 2021

If you get data as ISO-8859-1 (aka Latin1) and have a Unicode (IRIS/Caché) installation, then usually you have nothing to do (except to process the data). What do you mean by "convert the text to UTF-8"? In IRIS/Caché you have (and work with) Unicode code points; UTF-8 comes into play only when you export your data, but in your case that will more likely be ISO-8859-1. Or am I misunderstanding something?

By the way, if you return your data to your Latin1 source (as Latin1), then you have to take some precautions: because you have a Unicode installation, you could mix your Latin1 data with true Unicode data from other sources during processing!

See: https://unicode.org/charts/

Also, you may download and read:

https://www.unicode.org/versions/Unicode13.0.0/UnicodeStandard-13.0.pdf

score 0 · Answer 1 · 2021-04-08T13:51:00-04:00

In namespace %SYS there is a utility, NLS, that shows your installed conversion tables and their short names.

    %SYS>d ^NLS

    2) Select defaults
    2) I/O tables

    Items marked with (*) represent the locale's original default

    I/O table              Current default
    ---------------------  --------------------
    1) Process             RAW (*)
    2) Cache Terminal      UTF8 (*)
    3) Other terminal      UTF8 (*)
    4) File                RAW (*)
    5) Magtape             RAW (*)
    6) TCP/IP              RAW (*)
    7) System call         RAW (*)
    8) Printer             RAW (*)

    I/O table: 4

     1) RAW (*)
     2) UTF8
     3) UnicodeLittle
     4) UnicodeBig
     5) CP1250
     6) CP1251
     7) CP1252
     8) CP1253
     9) CP1255
    10) CP437
    11) CP850
    12) CP852
    13) CP866
    14) CP874
    15) EBCDIC
    16) Latin2
    17) Latin9
    18) LatinC
    19) LatinG
    20) LatinH
    21) LatinT

So you see the short names: there is no Latin1, but there is CP1252, which is almost identical.
The related problem is described here:
https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html

"ISO-8859-1 (also called Latin-1) is identical to Windows-1252 (also called CP1252) except for the code points 128-159 (0x80-0x9F). ISO-8859-1 assigns several control codes in this range. Windows-1252 has several characters, punctuation, arithmetic and business symbols assigned to these code points."

See also: "Encoding Problem: ISO-8859-1 vs Windows-1252".
So you should check which encoding your customer really uses (some hide the fact that they use Windows).
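
The difference quoted above can be checked outside Caché. The following is a minimal Python sketch, not Caché code: it lists the byte values that decode differently under the two encodings and scans a sample for bytes in the 0x80-0x9F range. The helper names (`decode_or_none`, `differing_bytes`, `has_c1_range_bytes`) are made up for this illustration, and the scan is only a heuristic: it assumes that real text documents rarely contain C1 control codes, so such bytes hint that the sender is really using Windows-1252.

```python
def decode_or_none(raw, encoding):
    """Decode raw bytes, or return None if the encoding leaves them undefined."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


def differing_bytes():
    """Return the byte values that ISO-8859-1 and Windows-1252 decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        if decode_or_none(raw, "latin-1") != decode_or_none(raw, "cp1252"):
            diffs.append(value)
    return diffs


def has_c1_range_bytes(data):
    """Heuristic: True if any byte falls in 0x80-0x9F, where the encodings disagree."""
    return any(0x80 <= b <= 0x9F for b in data)


if __name__ == "__main__":
    diffs = differing_bytes()
    print("differing range:", hex(diffs[0]), "to", hex(diffs[-1]))
    # A euro sign only exists in Windows-1252, where it is the single byte 0x80.
    print(has_c1_range_bytes("50 €".encode("cp1252")))
```

Running a scan like this over a few sample files from the customer is a cheap way to decide between the two tables before loading anything.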

The appropriate table can be used in:

$ZCONVERT()
##class(%Stream.FileCharacter) property TranslateTable
OPEN command parameter /IOTABLE=

score 0 · Answer 2 · 2021-04-08T14:54:47-04:00

Thank you! I followed what you said, right up to the last step, "The appropriate table can be used..." How does one use these tables, just by the name "CP1252", as in s str0=$zconvert(str,direction,"CP1252") ? Or .../IOTABLE="CP1252"

Thanks!

score 0 · Answer 3 · 2021-04-08T15:11:35-04:00

The format /IOTABLE="CP1252" applies only to the OPEN command.

$ZCONVERT and %Stream.FileCharacter take the table by its name alone: "CP1252".

score 0 · Answer 4 · 2021-04-08T15:38:44-04:00

Michoel Reach · Apr 8, 2021

Thanks again!

score 0 · Answer 5 · 2021-04-08T17:46:14-04:00

Sorry, could you explain? If there are special characters (i.e., non-ASCII) in the input stream from ISO-8859-1, would they load correctly into our database - that is, as the correct corresponding Unicode characters - without a conversion process? Thanks.

score 1 · Answer 6 · 2021-04-08T18:14:34-04:00

Counter-question: do you have an example of a 'non-ASCII' character?

Code points 0x00-0x7F (0-127) are the "C0 Controls and Basic Latin" block (ASCII).

Code points 0x80-0xFF (128-255) are the "C1 Controls and Latin-1 Supplement" block (Latin1).

Take a look at https://www.unicode.org/charts/PDF/U0080.pdf

For example, Ä and ä are the German A-umlaut and a-umlaut, respectively:

$ascii("Ä") --> 196 and $ascii("ä") --> 228

Type in a terminal session on your system: write $char(196) --> Ä

Download the above PDF and compare it with your ISO-8859-1 data; there should be no difference.
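
The one-to-one mapping can also be confirmed mechanically. This is a minimal Python sketch (again not Caché code, and `latin1_to_codepoints` is a name invented for the illustration): it decodes every possible byte as ISO-8859-1 and checks that the resulting code point equals the byte value. It also shows that UTF-8 is a different matter: the same character takes two bytes there, which is why UTF-8 only becomes relevant when data is exported in that encoding.

```python
def latin1_to_codepoints(data):
    """Decode ISO-8859-1 bytes and return the Unicode code point of each character."""
    return [ord(ch) for ch in data.decode("latin-1")]


def latin1_is_identity():
    """True if every byte value 0-255 decodes to the code point with the same number."""
    return all(latin1_to_codepoints(bytes([i])) == [i] for i in range(256))


if __name__ == "__main__":
    sample = bytes([65, 66, 196, 228])      # "ABÄä" as ISO-8859-1 bytes
    print(sample.decode("latin-1"))         # the text itself
    print(latin1_to_codepoints(sample))     # same numbers as the input bytes
    print(latin1_is_identity())
    # In UTF-8 the code point 196 is no longer a single byte.
    print(len("Ä".encode("utf-8")))
```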

score 0 · Answer 7 · 2021-04-08T18:20:31-04:00

Huh. Well, that would simplify matters. You're saying that Latin-1 is actually a subset of Unicode, backwards compatible. If so, never mind, sounds like we're good!

score 1 · Answer 8 · 2021-04-08T18:36:10-04:00

not so terrible... but one more thing

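set ascii=$char(65,66,196)   ; "ABÄ": every code point fits in 8 bits
set wide=$char(65,66,352)   ; "ABŠ": code point 352 does not fit in 8 bits
write $ziswide(ascii)," ",$ziswide(wide)   ; writes 0 1
zzdump ascii,wide   ; hex dump of both strings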

As I wrote in my first answer, you have to take care always to return 8-bit data (like ascii above) and never wide data.
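
The same precaution can be sketched in Python for readers who want to test the idea outside Caché. This is only an analogy to $ziswide, under the assumption that "wide" means "contains a code point above 255"; the function names `is_wide` and `to_latin1` are hypothetical.

```python
def is_wide(text):
    """True if any character has a code point above 255, i.e. cannot be Latin1."""
    return any(ord(ch) > 255 for ch in text)


def to_latin1(text):
    """Encode text as ISO-8859-1, refusing strings that contain wide characters."""
    if is_wide(text):
        raise ValueError("string contains characters outside ISO-8859-1")
    return text.encode("latin-1")


if __name__ == "__main__":
    narrow = "".join(chr(c) for c in (65, 66, 196))   # "ABÄ"
    wide = "".join(chr(c) for c in (65, 66, 352))     # "ABŠ"
    print(is_wide(narrow), is_wide(wide))
    print(to_latin1(narrow))
```

Checking before the export, rather than relying on the output translation, makes it explicit which records picked up non-Latin1 characters from other sources.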

score 0 · Answer 9 · 2021-04-08T23:14:20-04:00

Michoel Reach · Apr 8, 2021

Thank you for your help.
