Question
· Jun 3, 2016

How to detect a character encoding?

Let's say I open a stream/file.

If it's not in UTF8 i need to call $ZCVT, and if it's already in UTF8, then I don't need to call $ZCVT.

Is there any way to determine character encoding for input stream/file?

Discussion (8)2
Log in or sign up to continue

Is the file using BOM? If so you can check the header for the following signature: EF BB BF


This can be described as: $c(239, 187, 191)

Now keep in mind that most of editors abandoned the use of BOM in favor of digraphs and trigraphs detection heuristics as a fallback, yes, fallback. Because many assume you're already working with UTF-8 and won't work well with some charsets neither output BOM characters unless you tell it to use the desired charset.
 

You can try checking it against the US-ASCII table that goes from 0 to 127 code points, however that still wouldn't be 100% assertive about the stream containing UTF-8 characters.

We've got the same issue, but with an incoming HL7 feed with embedded, encoded characters - would be nice to be able to detect what's coming in, but I take it from this discussion that's not (reliably) possible. Don't really want to scan the whole text of every incoming message to heuristically look for possible encodings. Upstream say/think they are sending UTF-8, but we seem to be getting Window-1252, for the characters we've seen in the (limited) testing. Who knows what will come through the feed once it goes live!