Eduard Lebedyuk · Jun 3, 2016

How to detect a character encoding?

Let's say I open a stream/file.

If it's not in UTF8, I need to call $ZCVT; if it's already in UTF8, I don't.

Is there any way to determine the character encoding of an input stream/file?

In the IHE context I once faced a scenario where the XML declaration gave no encoding.
The rule in that case was to use the character encoding found in the HTTP header.

For example: Content-Type: text/html; charset=utf-8
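As a rough illustration of reading the charset from such a header, here is a minimal sketch. It is written in Python rather than ObjectScript, purely as a neutral example. The function name `charset_from_content_type` is hypothetical, and it relies on the standard library's `email.message.Message` to parse the header value.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration; the stdlib parser returns the
    charset lower-cased when one is present.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


if __name__ == "__main__":
    # The header from the example above.
    print(charset_from_content_type("text/html; charset=utf-8"))
    # A header with no charset parameter yields None.
    print(charset_from_content_type("text/xml"))
```

If the header carries no charset, the caller still has to fall back to some other rule, such as a BOM check or a default encoding.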

Does the file use a BOM? If so, you can check whether it starts with the following signature: EF BB BF

In ObjectScript this can be written as: $c(239, 187, 191)
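The BOM check amounts to comparing the first three bytes of the raw data. A minimal sketch, in Python for illustration only (the helper names `has_utf8_bom` and `strip_utf8_bom` are made up for this example):

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the raw bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


if __name__ == "__main__":
    sample = bytes([239, 187, 191]) + b"hello"  # same bytes as $c(239, 187, 191)
    print(has_utf8_bom(sample))
    print(strip_utf8_bom(sample))
```

The data has to be read as raw bytes for this to work. If the stream is opened with a translation table already applied, the signature may no longer be visible.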

Keep in mind that most editors have abandoned the BOM and instead rely on digraph and trigraph detection heuristics, and even then only as a fallback. Many simply assume you are already working with UTF-8, so they may not handle some other charsets well, and they will not write a BOM unless you explicitly tell them to use the desired charset.

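You can try checking the bytes against the US-ASCII table, which covers code points 0 to 127. However, that still won't tell you with 100% certainty whether the stream contains UTF-8 characters: a stream with only ASCII bytes reads the same in UTF-8 and in most single-byte charsets, and a byte above 127 only tells you the stream is not plain ASCII.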
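The ASCII check can be sketched as below, again in Python for illustration only. The strict UTF-8 decode is an addition that nobody in this thread suggested. It is a common heuristic, not a guarantee: bytes that happen to be valid UTF-8 could still have been written in another charset. The function name `classify_bytes` and its return labels are hypothetical.

```python
def classify_bytes(data: bytes) -> str:
    """Rough classification of raw bytes; a heuristic, not proof.

    Returns "ascii" if every byte is in 0..127, "utf-8" if the bytes
    decode cleanly as UTF-8, and "not-utf-8" otherwise.
    """
    # Every byte within the US-ASCII range: no conversion is needed.
    if all(b < 128 for b in data):
        return "ascii"
    # Added heuristic (assumption): a strict decode rejects byte
    # sequences that are not well-formed UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf-8"
    return "utf-8"


if __name__ == "__main__":
    print(classify_bytes(b"plain text"))
    print(classify_bytes("h\u00e9llo".encode("utf-8")))
    print(classify_bytes("h\u00e9llo".encode("latin-1")))
```

In the terms of the original question, only the "not-utf-8" case would need a conversion call. Which source charset to convert from still has to come from somewhere else, such as the HTTP header or a known default.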