How to decode the binary data produced in the event log?
I am using the event logger from the CSP gateway with level "v9r" enabled, in order to dump the raw content of some HTTP requests.
I would like to decode the response body data when it is in binary form. AFAIK the event logger converts bytes outside the 32-127 range (printable ASCII) to the "\xff" notation (where ff is a hexadecimal value). Here is an example:
Content | What is written in the log | Remark |
helloworld | helloworld | |
hello £ world | hello \xc2\xa3 world | £ is 0xC2 0xA3 in UTF-8 |
hello \xc2\xa3 world | hello \xc2\xa3 world | same as line above ! |
The way it works seems to be (pseudocode) :
for (int data : byteBuffer) {
    if (data >= 32 && data <= 127) {
        output((char)data);
    }
    else {
        output("\x" + toHex(data));
    }
}
The main issue is that data is left as is if it already contains literal \x sequences (which happens very often with GZIP content). AFAIK this makes reliable decoding impossible. Is there a way to fix this (by specifying that the event logger should use another format, or should decompress GZIP content before dumping it)?
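To make the ambiguity concrete, here is a minimal Python sketch of the behaviour described by the pseudocode above. The function name `log_encode` is hypothetical, and the logic is only my reading of what the logger appears to do, not the gateway's actual implementation.

```python
def log_encode(data: bytes) -> str:
    """Assumed logger behaviour: bytes 32-127 pass through, others become \\xNN."""
    out = []
    for b in data:
        if 32 <= b <= 127:
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)
    return "".join(out)


# Two different payloads: real UTF-8 bytes versus the literal text "\xc2\xa3".
raw_bytes = "hello \u00a3 world".encode("utf-8")
literal_text = b"hello \\xc2\\xa3 world"

print(log_encode(raw_bytes))
print(log_encode(literal_text))
# Both payloads produce the same log line, so the log cannot be decoded reliably.
print(log_encode(raw_bytes) == log_encode(literal_text))
```

Because the backslash itself is never escaped, the mapping is not one-to-one, which is exactly the problem shown in the table.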
EDIT: Apache uses a similar scheme to encode binary data as printable ASCII when dumping HTTP content, but it escapes the backslash correctly:
hello £ world | hello \xc2\xa3 world |
hello \xc2\xa3 world | hello \\xc2\\xa3 world |
Unless I am missing something, the IRIS implementation is incorrect.
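For comparison, here is a small Python sketch of an encoding that doubles the backslash, as in the Apache rows above, together with a decoder that reverses it. The names `escape_encode` and `escape_decode` are hypothetical, and the code is modelled only on the behaviour shown in the table, not on Apache's source.

```python
def escape_encode(data: bytes) -> str:
    """Encode bytes as printable ASCII, doubling any literal backslash."""
    out = []
    for b in data:
        if b == 0x5C:
            out.append("\\\\")          # literal backslash becomes two backslashes
        elif 32 <= b <= 126:
            out.append(chr(b))
        else:
            out.append("\\x%02x" % b)   # non-printable byte becomes \xNN
    return "".join(out)


def escape_decode(text: str) -> bytes:
    """Reverse escape_encode; raises ValueError on a malformed escape."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] != "\\":
            out.append(ord(text[i]))
            i += 1
        elif text.startswith("\\\\", i):
            out.append(0x5C)
            i += 2
        elif text.startswith("\\x", i) and i + 4 <= len(text):
            out.append(int(text[i + 2:i + 4], 16))
            i += 4
        else:
            raise ValueError("malformed escape at offset %d" % i)
    return bytes(out)


print(escape_encode("hello \u00a3 world".encode("utf-8")))
print(escape_encode(b"hello \\xc2\\xa3 world"))
```

With the backslash escaped, the two payloads give different log lines, and every byte sequence can be recovered exactly.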
That's quite a strange task, what exactly do you need to achieve?
I have no idea what the format is there.
I would like to record the HTTPS requests sent to a server in their raw format (which is exactly what "v9r" is doing). There are probably other ways (e.g. using a packet dumper or an Apache module), but this is the only working one I have found so far. The other methods also have their own disadvantages and quirks.
I'm hoping this will be enough to at least "get you started" - so here's a programming example that will handle either a single or dual \x notation code:
ZUNICODE ; DECODE HEX/UNICODE SEQUENCES IN A STRING
 Q
 ;
CVT(TEXT) ; ENTRY POINT FOR EXTRINSIC FUNCTION TO CONVERT HEX CODE(S)
 ; IN A STRING. ONLY CONVERTS FIRST SET - IMPROVEMENT OF
 ; ROUTINE LEFT AS AN EXERCISE TO THE READER.
 N P1,P2,C1,C2,OUT
 I $L(TEXT,"\x")=1 W "NO USABLE HEX MARKERS FOUND - DID NOT CONVERT TEXT.",! Q TEXT
 I $L(TEXT,"\x")=2 W "SINGLE HEX MARKER FOUND. CONVERTING.",! D
 . S P1=$P(TEXT,"\x",1)
 . S P2=$P(TEXT,"\x",2)
 . S C1=$E(P2,1,2)
 . S P2=$E(P2,3,$L(P2))
 . S OUT=P1_$C($ZHEX(C1))_P2
 I $G(OUT)'="" Q OUT
 I $L(TEXT,"\x")=3 W "DUAL UNICODE MARKER FOUND. CONVERTING.",! D
 . S P1=$P(TEXT,"\x",1)
 . S C1=$P(TEXT,"\x",2)
 . S P2=$P(TEXT,"\x",3)
 . S C2=$E(P2,1,2)
 . S P2=$E(P2,3,$L(P2))
 . S OUT=P1_$C($ZHEX(C1_C2))_P2
 I $G(OUT)'="" Q OUT
 W "MORE THAN TWO UNICODE MARKERS FOUND - LOOPS WOULD HELP. DID NOT CONVERT TEXT.",!
 Q TEXT
Here are some examples of how it's called:
USER>S BB="Hello World"
USER>S BB="Hello \xc2\xa3 World"
USER>S OUT=$$CVT^ZUNICODE(BB)
DUAL UNICODE MARKER FOUND. CONVERTING.
USER>W OUT
Hello 슣 World
USER>S BB="Hello \xc2 World"
USER>S OUT=$$CVT^ZUNICODE(BB)
SINGLE HEX MARKER FOUND. CONVERTING.
USER>W OUT
Hello  World
USER>S BB="Hello \xa3 World"
USER>S OUT=$$CVT^ZUNICODE(BB)
SINGLE HEX MARKER FOUND. CONVERTING.
USER>W OUT
Hello £ World
USER>S BB="Hello World"
USER>S OUT=$$CVT^ZUNICODE(BB)
NO USABLE HEX MARKERS FOUND - DID NOT CONVERT TEXT.
USER>W OUT
Hello World
Now, the output above seems to show that the pound symbol is a single hex character \xa3, at least on the Windows system that I am using. Maybe the \xa3 means "special character coming next", as the charmap utility on Windows Server 2016 shows the U+C2A3 character as "Hangul Syllable Sios Yu Hieuh" (font: Gulim), and on my Ubuntu Linux 20.04 system it is named "HANGUL SYLLABLE SYUH" (fonts: Trebuchet MS and Noto Serif CJK SC, pasted into LibreOffice Writer).
There are limitations to this code that could be improved. For one example, it doesn't handle multiple codes (single or unicode)... but hopefully this will give you (if nothing else) a good troubleshooting tool to get you started.
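As an illustration of the "loops would help" improvement, here is a hedged Python sketch that converts every \xNN marker in a string, not just the first one or two. The name `decode_log_text` is hypothetical. The sketch assumes the original payload contained no literal \x text, since the log format gives no way to tell the two apart.

```python
import re

# Matches a backslash, the letter x, and exactly two hex digits.
_HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def decode_log_text(text: str) -> bytes:
    """Replace every \\xNN marker with the byte it names; keep other characters."""
    out = bytearray()
    pos = 0
    for m in _HEX_ESCAPE.finditer(text):
        out += text[pos:m.start()].encode("latin-1")  # plain text before the marker
        out.append(int(m.group(1), 16))               # the escaped byte itself
        pos = m.end()
    out += text[pos:].encode("latin-1")               # trailing plain text
    return bytes(out)


decoded = decode_log_text(r"Hello \xc2\xa3 World")
print(decoded.decode("utf-8"))  # Hello £ World
```

The result is a byte string, so it can be decoded as UTF-8 afterwards or handed to a decompressor when the body is binary.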
Thanks for the code.
The pound symbol is indeed the entry U+00A3 in the Unicode table, but it's always encoded as 0xc2 0xa3 in UTF-8. See this page. In UTF-8, anything above U+007F is encoded with at least 2 bytes.
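The difference between the code point and its UTF-8 bytes can be checked with a few lines of Python (standard library only). This is just an illustration of the point above:

```python
import unicodedata

pound = "\u00a3"                      # code point U+00A3
print(hex(ord(pound)))                # 0xa3
print(pound.encode("utf-8").hex())    # c2a3

# Gluing the two bytes together as one code point gives a different character.
hangul = chr(0xC2A3)
print(unicodedata.name(hangul))       # HANGUL SYLLABLE SYUH
print(hangul.encode("utf-8").hex())   # ec8aa3

# Treating the two bytes as UTF-8 gives the pound sign back.
print(b"\xc2\xa3".decode("utf-8") == pound)
```

So \xc2\xa3 in the log should be read as two separate bytes of a UTF-8 sequence, not as the single code point U+C2A3.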
When you see \xc2 inside CSP gateway logs, you have no clue whether it was originally the byte 0xc2 or already the text \xc2 (0x5c 0x78 0x63 0x32 in hex), because no escaping is done. Apache will instead double the backslash (\\xc2), so you know it was originally \xc2 and not 0xc2.