Question
· Dec 20, 2023

How to decode the binary data written to the event log?

I am using the event logger from CSP gateway with level "v9r" enabled, in order to dump the raw content of some HTTP requests.

I would like to decode the response body data when it is in binary form. AFAIK the event logger converts bytes outside the 32-127 range (essentially printable ASCII) to the "\xff" notation (where ff is a hexadecimal value). Here is an example:

Content              | What is written in the log | Remark
helloworld           | helloworld                 |
hello £ world        | hello \xc2\xa3 world       | £ is C2 A3 in UTF-8
hello \xc2\xa3 world | hello \xc2\xa3 world       | same as line above!

The way it works seems to be (pseudocode) :

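for (byte b : byteBuffer) {
    int data = b & 0xFF;                          // read the byte as unsigned (0..255)
    if (data >= 32 && data <= 127) {
        output((char) data);                      // copied as-is, backslash included
    } else {
        output(String.format("\\x%02x", data));   // anything else becomes \xff
    }
}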

The main issue is that the data is left as-is if it already contains a literal \x sequence (which happens very often with GZIP content). AFAIK this makes reliable decoding impossible. Is there a way to fix this, for example by telling the event logger to use another format, or to uncompress GZIP content before dumping it?

EDIT: Apache uses a similar scheme to escape binary data when dumping HTTP content, but it also escapes the backslash itself:

hello £ world hello \xc2\xa3 world
hello \xc2\xa3 world hello \\xc2\\xa3 world

Unless I am missing something, the IRIS implementation is incorrect.
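To make the difference concrete, here is a minimal Java sketch of both behaviours. It is modeled only on the examples in this post, not on the actual CSP gateway or Apache source; the class name `LogEscapeDemo` and the `escapeBackslash` flag are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class LogEscapeDemo {
    // Escapes bytes outside 32..127 as \xhh. When escapeBackslash is true, a literal
    // backslash is doubled so it cannot be mistaken for the start of an escape.
    static String escape(byte[] data, boolean escapeBackslash) {
        StringBuilder out = new StringBuilder();
        for (byte b : data) {
            int v = b & 0xFF; // Java bytes are signed; work with 0..255
            if (v == '\\' && escapeBackslash) {
                out.append("\\\\");
            } else if (v >= 32 && v <= 127) {
                out.append((char) v);
            } else {
                out.append(String.format("\\x%02x", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A real pound sign (UTF-8 bytes c2 a3) versus the literal text \xc2\xa3.
        byte[] pound = "hello \u00a3 world".getBytes(StandardCharsets.UTF_8);
        byte[] literal = "hello \\xc2\\xa3 world".getBytes(StandardCharsets.UTF_8);

        // Without backslash escaping, both inputs produce the same log line.
        System.out.println(escape(pound, false));
        System.out.println(escape(literal, false));

        // With backslash escaping, the two inputs stay distinguishable.
        System.out.println(escape(pound, true));
        System.out.println(escape(literal, true));
    }
}
```

With the flag off, the two different inputs collapse into one log line, which is why the original bytes cannot be recovered with certainty.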

Product version: IRIS 2021.1
$ZV: IRIS for Windows (x86-64) 2021.1 (Build 215U) Wed Jun 9 2021 09:39:22 EDT

I'm hoping this will be enough to at least "get you started" - so here's a programming example that will handle either a single or dual \x notation code:

ZUNICODE ; DECODE HEX/UNICODE SEQUENCES IN A STRING
 Q
 ;
CVT(TEXT) ; ENTRY POINT FOR EXTRINSIC FUNCTION TO CONVERT HEX CODE(S)
 ; IN A STRING. ONLY CONVERTS FIRST SET - IMPROVEMENT OF
 ; ROUTINE LEFT AS AN EXERCISE TO THE READER.
 N P1,P2,C1,C2,OUT
 I $L(TEXT,"\x")=1 W "NO USABLE HEX MARKERS FOUND - DID NOT CONVERT TEXT.",! Q TEXT
 I $L(TEXT,"\x")=2 W "SINGLE HEX MARKER FOUND. CONVERTING.",! D
 . S P1=$P(TEXT,"\x",1)
 . S P2=$P(TEXT,"\x",2)
 . S C1=$E(P2,1,2)
 . S P2=$E(P2,3,$L(P2))
 . S OUT=P1_$C($ZHEX(C1))_P2
 I $G(OUT)'="" Q OUT
 I $L(TEXT,"\x")=3 W "DUAL UNICODE MARKER FOUND. CONVERTING.",! D
 . S P1=$P(TEXT,"\x",1)
 . S C1=$P(TEXT,"\x",2)
 . S P2=$P(TEXT,"\x",3)
 . S C2=$E(P2,1,2)
 . S P2=$E(P2,3,$L(P2))
 . S OUT=P1_$C($ZHEX(C1_C2))_P2
 I $G(OUT)'="" Q OUT
 W "MORE THAN TWO UNICODE MARKERS FOUND - LOOPS WOULD HELP. DID NOT CONVERT TEXT.",!
 Q TEXT

Here are some examples of how it's called:

USER>S BB="Hello World"
 
USER>S BB="Hello \xc2\xa3 World"
 
USER>S OUT=$$CVT^ZUNICODE(BB)
DUAL UNICODE MARKER FOUND. CONVERTING.
 
USER>W OUT
Hello 슣 World
USER>S BB="Hello \xc2 World"
 
USER>S OUT=$$CVT^ZUNICODE(BB)
SINGLE HEX MARKER FOUND. CONVERTING.
 
USER>W OUT
Hello  World
USER>S BB="Hello \xa3 World"
 
USER>S OUT=$$CVT^ZUNICODE(BB)
SINGLE HEX MARKER FOUND. CONVERTING.
 
USER>W OUT
Hello £ World

USER>S BB="Hello World"
 
USER>S OUT=$$CVT^ZUNICODE(BB)
NO USABLE HEX MARKERS FOUND - DID NOT CONVERT TEXT.
 
USER>W OUT
Hello World

Now, the output above seems to show that the pound symbol is a single hex character, \xa3, at least on the Windows system that I am using. Maybe the \xa3 means "special character coming next", since the charmap utility on Windows Server 2016 shows the U+C2A3 character as "Hangul Syllable Sios Yu Hieuh" (font: Gulim), and my Ubuntu Linux 20.04 system calls it "HANGUL SYLLABLE SYUH" (fonts: Trebuchet MS and Noto Serif CJK SC, pasted into LibreOffice Writer).

There are limitations to this code that could be improved. For one example, it doesn't handle multiple codes (single or unicode)... but hopefully this will give you (if nothing else) a good troubleshooting tool to get you started.
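For comparison, here is a rough Java sketch of the looping version: it walks the whole string, turns every \xhh sequence back into one byte, and only then decodes the bytes as UTF-8. The class and method names are hypothetical, and it assumes the log line contains only characters in the 32-127 range apart from the escapes. It cannot tell an escaped byte from text that already contained a literal \x sequence, so treat it as a best-effort troubleshooting aid.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class HexLogDecoder {
    // Turns every \xhh sequence back into one byte; other characters are copied as-is.
    static byte[] toBytes(String logged) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < logged.length()) {
            if (logged.startsWith("\\x", i) && i + 4 <= logged.length()
                    && isHex(logged.charAt(i + 2)) && isHex(logged.charAt(i + 3))) {
                out.write(Integer.parseInt(logged.substring(i + 2, i + 4), 16));
                i += 4;
            } else {
                // Assumption: unescaped characters are single bytes in the 32..127 range.
                out.write(logged.charAt(i));
                i++;
            }
        }
        return out.toByteArray();
    }

    static boolean isHex(char c) {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }

    // Decodes the recovered bytes as UTF-8; invalid sequences become U+FFFD.
    static String decodeUtf8(String logged) {
        return new String(toBytes(logged), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodeUtf8("Hello \\xc2\\xa3 World"));
    }
}
```

Decoding at the byte level first matters: joining c2 and a3 into the single code point U+C2A3 yields the Hangul syllable seen above, while treating them as two UTF-8 bytes yields the pound sign.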

Thanks for the code.

The pound symbol is indeed the entry U+00A3 in the Unicode table, but it's always encoded as 0xc2 0xa3 in UTF-8. See this page. In UTF-8, anything above U+007F is encoded with at least 2 bytes.
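A quick way to check this is a small Java sketch like the one below (the class name `PoundUtf8Demo` and the `hex` helper are just for illustration). It prints the UTF-8 bytes of U+00A3 and shows that decoding 0xc2 0xa3 as UTF-8 gives back the single code point U+00A3.

```java
import java.nio.charset.StandardCharsets;

final class PoundUtf8Demo {
    // Formats a byte array as lowercase hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String pound = "\u00a3"; // the pound sign, code point U+00A3

        // Same character, different byte sequences depending on the encoding.
        System.out.println(hex(pound.getBytes(StandardCharsets.UTF_8)));      // c2a3
        System.out.println(hex(pound.getBytes(StandardCharsets.ISO_8859_1))); // a3

        // Decoding the two bytes as UTF-8 yields one character, U+00A3.
        byte[] utf8 = {(byte) 0xC2, (byte) 0xA3};
        String decoded = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(decoded.length() + " char, U+"
                + String.format("%04X", (int) decoded.charAt(0)));
    }
}
```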

When you see \xc2 inside CSP gateway logs, you have no clue whether it was originally the byte 0xc2 or already the text \xc2 (0x5c 0x78 0x63 0x32 in hex), because no escaping is done. Apache instead doubles the backslash (\\xc2), so you know it was originally \xc2 and not 0xc2.