Question Eduard Lebedyuk · Jun 3, 2016

How to detect a character encoding?

Let's say I open a stream/file.

If it's not in UTF-8, I need to call $ZCVT, and if it's already in UTF-8, then I don't need to call $ZCVT.

Is there any way to determine the character encoding of an input stream/file?

Comments

Markus Mechnich · Jun 4, 2016

In the IHE context I once faced a scenario where no encoding was given in the XML declaration.
In that case the rule was to use the character encoding found in the HTTP header.

For example: Content-Type: text/html; charset=utf-8
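
A minimal sketch of that rule, written in Python purely for illustration. The helper name `charset_from_content_type` is hypothetical and not part of any IHE or InterSystems API; it only shows how the charset parameter can be pulled out of a Content-Type header value with the standard library:

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the lowercased charset parameter of a Content-Type value.

    Falls back to `default` when the header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() unquotes and lowercases the parameter value.
    return msg.get_content_charset(default)


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=utf-8"))
```

The `default` argument is where a site-specific fallback would go when neither the XML declaration nor the header names an encoding.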

Rubens Silva · Jun 21, 2017

Is the file using a BOM? If so, you can check the first bytes of the file for the following signature: EF BB BF


In ObjectScript this can be written as: $c(239, 187, 191)
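
For comparison, a small Python sketch of the same BOM check. The function names are illustrative only; the sketch assumes the first bytes of the stream are available as a `bytes` object:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True when the data starts with the UTF-8 BOM bytes EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data
```

A present BOM is good evidence of UTF-8, but a missing BOM says nothing either way, since most UTF-8 files are written without one.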

Keep in mind that most editors have abandoned the BOM and instead use digraph and trigraph detection heuristics as a fallback. And it really is only a fallback: many editors assume you are already working with UTF-8, so they don't handle some other charsets well and won't write a BOM unless you explicitly select the desired charset.
 

You can try checking the bytes against the US-ASCII range (code points 0 to 127); however, that still wouldn't tell you with 100% certainty whether the stream contains UTF-8 characters.
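
A rough Python sketch of that idea. It goes one step beyond the ASCII range check by also attempting a strict UTF-8 decode, which is an addition of mine rather than something proposed in this thread; the function name and the returned labels are hypothetical:

```python
def classify_bytes(data: bytes) -> str:
    """Heuristically label a byte string as 'ascii', 'utf-8' or 'not-utf-8'."""
    # Pure 7-bit data is valid in ASCII, UTF-8 and most 8-bit code pages,
    # so no conversion decision can be derived from it.
    if all(b < 128 for b in data):
        return "ascii"
    try:
        # Strict decoding rejects malformed multi-byte sequences.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"
```

A successful decode is strong evidence rather than proof: a short text in an 8-bit encoding can happen to form valid UTF-8 sequences, and "not-utf-8" does not tell you which encoding was actually used.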

Colin Brough · Nov 20, 2024

We've got the same issue, but with an incoming HL7 feed with embedded, encoded characters. It would be nice to be able to detect what's coming in, but I take it from this discussion that's not (reliably) possible. We don't really want to scan the whole text of every incoming message to heuristically look for possible encodings. Upstream say/think they are sending UTF-8, but we seem to be getting Windows-1252, judging by the characters we've seen in our (limited) testing. Who knows what will come through the feed once it goes live!

Colin Brough · Nov 22, 2024 to Jani Hurskainen

If only MSH-18 were set... 🙄 The upstream system isn't setting it! And until yesterday the supplier of the upstream system was claiming they were sending UTF-8 when we thought the feed looked awfully like Windows-1252. Yesterday they admitted/confirmed they are sending Windows-1252, so at least now we know!!
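
For readers unfamiliar with the field: MSH-18 is the 18th field of the HL7 v2 MSH segment, where a sender can declare the message character set. A hedged Python sketch of reading it from raw message bytes follows. The function name is hypothetical, and the sketch assumes segments are separated by carriage returns and that the MSH segment itself is plain ASCII:

```python
def msh18_charset(message: bytes):
    """Return the raw MSH-18 value of an HL7 v2 message, or None if absent."""
    if len(message) < 4 or not message.startswith(b"MSH"):
        return None
    # MSH-1 is the field separator itself: the byte right after "MSH".
    field_sep = message[3:4]
    # The MSH segment ends at the first carriage return.
    msh = message.split(b"\r", 1)[0]
    fields = msh.split(field_sep)
    # Because MSH-1 is the separator, fields[n - 1] holds MSH-n.
    if len(fields) < 18:
        return None
    value = fields[17].decode("ascii", errors="replace").strip()
    return value or None
```

When the function returns `None`, as it would for the feed described here, the encoding has to be agreed with the sender out of band.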

Jani Hurskainen · Nov 22, 2024 to Colin Brough

I feel your pain. It's very frustrating to work with systems like that 😟 I'd recommend putting as much "pressure" as possible on the upstream system to fix their MSH-18. Of course, that might not be a realistic option in your case.
