ÍàØâàÞæØâë and you

Article

Eduard Lebedyuk · Dec 9, 2019 1m read

If you work with anything other than English, you would earlier or later encounter the characters from the title or just plain ??????????.

Encodings are usually known, but sometimes you just get gibberish and need to make sense of it.

In this cases $zcvt is your friend, the three argument form specifically.

But there are a lot of options. So here's an utility script to check how the text would look like in different encodings:

Zn "%SYS"
Set Text = "ÍàØâàÞæØâë"
Set Ref = ##class(Config.NLS.Locales).OpenCurrent(.Sc)
Write "Locale name: ",Ref.Name, !
Do Ref.GetTables(.Tables)
Set Key = ""
For { Set Key = $O(Tables("XLT", Key)) Quit:Key=""  Write Key," ",$zcvt(Text, "I", Key),!}

And here's a sample output:

As you can see the correct encoding in this case is LatinC.

Discussion (4)0

Marat Nutis · Dec 9, 2019

You can see it's the correct incoding only if know LatinC (Russian), otherwise all the gibberish lines look the same :)

3 0

Alexey Maslov · Dec 11, 2019

...otherwise all the gibberish lines look the same

While you are basically right, some euristics can help to find an answer. A line that matches a pattern: line?1(1U.L,.L,.U) is more likely encoded correctly than "camel case" one. After modifying Eduard's sample a bit:

AutoCode
 new $namespace set $namespace="%SYS"
 Set Text = "ÍàØâàÞæØâë"
 Set Ref = ##class(Config.NLS.Locales).OpenCurrent(.Sc)
 Write "Locale name: ",Ref.Name, !!
 Do Ref.GetTables(.Tables)
 Set Key = ""
 For { Set Key = $O(Tables("XLT", Key)) Quit:Key=""  s line=$zcvt(Text, "I", Key) if line?1(1U.L,.L,.U) Write Key," ",line,!}
 q

we are getting (likely) correct answer without knowing a target language:

USER>d AutoCode^ztest
Locale name: yruw

LatinC Эритроциты 1

There are some other problems, e.g. system built-in tables such as UTF8 are not included, but they can be solved. Writing universal cyrillic decoder is not so easy task, but as there are some already exist in web, it's possible to write another one.

0 0

Marat Nutis · Dec 11, 2019

This is an interesting solution, thanks.

In our case it'll need to be something slightly different, as Hebrew (and probably several other languages) doesn't have an upper case for letters.

0 0

Alexey Maslov · Dec 11, 2019

Marat,

Maybe it's possible to build a dictionary of wrong decoding of typical words and use it for encoding guessing. E.g., a word

לרפא
will probably be a typical one in a medicine text. Wrong decodings can be collected using a tool like this. Using pronouns, articles or prepositions as "universal" typical words can even be a better idea.

0 0