Article
· Dec 9, 2019 1m read

ÍàØâàÞæØâë and you

If you work with anything other than English, you would earlier or later encounter the characters from the title or just plain ??????????.

Encodings are usually known, but sometimes you just get gibberish and need to make sense of it.

In this cases $zcvt is your friend, the three argument form specifically.

But there are a lot of options. So here's an utility script to check how the text would look like in different encodings:

Zn "%SYS"
Set Text = "ÍàØâàÞæØâë"
Set Ref = ##class(Config.NLS.Locales).OpenCurrent(.Sc)
Write "Locale name: ",Ref.Name, !
Do Ref.GetTables(.Tables)
Set Key = ""
For { Set Key = $O(Tables("XLT", Key)) Quit:Key=""  Write Key," ",$zcvt(Text, "I", Key),!}

And here's a sample output:

As you can see the correct encoding in this case is LatinC.

Discussion (4)0
Log in or sign up to continue

...otherwise all the gibberish lines look the same
 

While you are basically right, some euristics can help to find an answer. A line that matches a pattern: line?1(1U.L,.L,.U) is more likely encoded correctly than "camel case" one. After modifying Eduard's sample a bit: 

AutoCode
 new $namespace set $namespace="%SYS"
 Set Text = "ÍàØâàÞæØâë"
 Set Ref = ##class(Config.NLS.Locales).OpenCurrent(.Sc)
 Write "Locale name: ",Ref.Name, !!
 Do Ref.GetTables(.Tables)
 Set Key = ""
 For Set Key = $O(Tables("XLT", Key)) Quit:Key=""  line=$zcvt(Text, "I", Key) if line?1(1U.L,.L,.U) Write Key," ",line,!}
 q

we are getting (likely) correct answer without knowing a target language:

USER>d AutoCode^ztest
Locale name: yruw

LatinC Эритроциты 1

There are some other problems, e.g. system built-in tables such as UTF8 are not included, but they can be solved. Writing universal cyrillic decoder is not so easy task, but as there are some already exist in web, it's possible to write another one.

Marat, 

Maybe it's possible to build a dictionary of wrong decoding of typical words and use it for encoding guessing. E.g., a word 

לרפא
will probably be a typical one in a medicine text. Wrong decodings can be collected using a tool like this. Using pronouns, articles or prepositions as "universal" typical words can even be a better idea.