How to detect a character encoding?
Let's say I open a stream/file.
If it's not in UTF-8, I need to call $ZCVT; if it's already in UTF-8, I don't.
Is there any way to determine the character encoding of an input stream/file?
Comments
I think googling "encoding heuristics" should help.
For example, https://gist.github.com/TaoK/945127
In the IHE context I once faced a scenario where no encoding was given in the XML declaration.
The rule in that case was to use the character encoding given in the HTTP header.
For example: Content-Type: text/html; charset=utf-8
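To illustrate the HTTP header rule, here is a minimal sketch in Python (not ObjectScript, purely for illustration). The helper name `charset_from_content_type` and its `default` parameter are made up for this example; it leans on the standard library's header parameter parsing.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: returns `default` when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and strips quotes.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/xml"))
```

If the header carries no charset, the caller still has to decide on a fallback, which is where the rest of this thread comes in.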
Does the file use a BOM? If so, you can check its first three bytes for the UTF-8 signature: EF BB BF
This can be described as: $c(239, 187, 191)
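A minimal sketch of that check, written in Python for illustration only (`has_utf8_bom` is a hypothetical helper name). It compares the first three bytes against the same signature, EF BB BF:

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


# The signature from the comment above: bytes 239, 187, 191.
sample = bytes([239, 187, 191]) + b"hello"
print(has_utf8_bom(sample))    # True
print(has_utf8_bom(b"hello"))  # False
```

A missing BOM proves nothing either way, since plenty of UTF-8 files are written without one.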
Keep in mind that most editors have stopped writing a BOM and instead rely on detection heuristics (digraph and trigraph frequency) as a fallback, and only as a fallback. Many assume you are already working with UTF-8, so they handle some other charsets poorly and do not emit a BOM unless you explicitly select the desired charset.
You can also check whether every byte falls within the US-ASCII range (code points 0 to 127); however, that still would not tell you with 100% certainty whether the stream contains UTF-8 characters.
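One way to combine the ASCII check with a stricter test is to attempt a strict UTF-8 decode. The sketch below is in Python for illustration; `classify_bytes` and its three return labels are made up for this example. It is a heuristic, not proof: bytes that decode cleanly as UTF-8 are very likely UTF-8, but other encodings can occasionally produce valid sequences by accident.

```python
def classify_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'unknown' for a byte string."""
    # Pure ASCII reads the same in UTF-8 and Windows-1252,
    # so no conversion is needed in that case.
    if all(b < 128 for b in data):
        return "ascii"
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8; it could be Windows-1252, Latin-1, etc.
        return "unknown"
    return "utf-8"


print(classify_bytes(b"hello"))                 # ascii
print(classify_bytes("café".encode("utf-8")))   # utf-8
print(classify_bytes("café".encode("cp1252")))  # unknown
```

Only the "unknown" case would need a conversion, and for that you still have to know or guess the real source encoding.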
We've got the same issue, but with an incoming HL7 feed with embedded, encoded characters - would be nice to be able to detect what's coming in, but I take it from this discussion that's not (reliably) possible. Don't really want to scan the whole text of every incoming message to heuristically look for possible encodings. Upstream say/think they are sending UTF-8, but we seem to be getting Windows-1252, for the characters we've seen in the (limited) testing. Who knows what will come through the feed once it goes live!
Can't you check MSH.18 - Character Set? See e.g. https://hl7-definition.caristix.com/v2/HL7v2.5/Fields/MSH.18
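For reference, a rough sketch of reading MSH-18 from a raw message, in Python for illustration. `msh18_charset` is a hypothetical helper; it assumes segments end with a carriage return, takes the field separator from the fourth byte of the MSH segment, and does not handle repetitions within the field. The sample value "UNICODE UTF-8" is only an example.

```python
def msh18_charset(message: bytes):
    """Return the raw MSH-18 value, or None if absent or empty."""
    if len(message) < 4 or not message.startswith(b"MSH"):
        return None
    field_sep = message[3:4]  # MSH-1 is the separator itself
    msh_segment = message.split(b"\r", 1)[0]
    fields = msh_segment.split(field_sep)
    # fields[0] is "MSH" and fields[1] is MSH-2, so MSH-n is fields[n - 1].
    if len(fields) < 18:
        return None
    value = fields[17].decode("ascii", errors="replace").strip()
    return value or None


# Build a sample MSH segment with MSH-3..MSH-17 left empty.
msh = "|".join(["MSH", "^~\\&"] + [""] * 15 + ["UNICODE UTF-8"])
print(msh18_charset(msh.encode("ascii") + b"\rPID|1"))
```

When the field is empty, as it often is in practice, the function returns None and you are back to agreeing on the encoding with the sender.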
Or are we talking about something else ?
If only MSH-18 were set... 🙄 Up-stream system isn't setting it! And until yesterday supplier of upstream system was claiming they were sending UTF-8 when we thought the feed looked awfully like Windows-1252. Yesterday they admitted/confirmed they are sending Windows-1252, so at least now we know!!
I feel your pain. It's very frustrating to work with systems like that 😟 I'd recommend putting as much "pressure" as possible on the upstream system to fix their MSH-18. Of course, that might not be a realistic option in your case.