Question
· Apr 15

Extract XML from text

Hello!

I wonder if anyone has a smart idea to extract an XML fragment inside a text document (incoming from a stream)?

The XML fragment is surrounded by plain text.

Example:

text...........
text...........
<?xml version="1.0" encoding="UTF-8"?>
<Start>
...etc
</Start>
text...........
text...........

The XML is not represented by any class or object in the namespace.

The XML can look different from time to time.

I would appreciate it if anyone knows how to use ObjectScript to extract the XML content.

Regards Michael

Product version: IRIS 2023.1

As you wrote, %XML.TextReader is used to read arbitrary XML documents. But a text with a small piece of XML structure sitting in the middle of it isn't XML!

Maybe there is a Python library for extracting XML from text. If not, you will probably have to read character by character, counting each "<" (+1) and ">" (-1); when the counter is back at 0, the text between the first "<" and the last ">" is probably a correct XML structure. Oh, and don't forget <![CDATA[...]]> sequences, which make the reading more challenging.
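A rough sketch of that scanning idea in ObjectScript, with one change that should be named plainly: it counts element depth (opening tag +1, closing tag -1) rather than raw angle brackets, because a raw bracket count is already back at 0 after the first complete tag. The method name ExtractByDepth is hypothetical, and the sketch assumes the fragment starts with an XML declaration and that no attribute value contains ">".

```objectscript
/// Hypothetical helper (not from the thread): returns the first XML fragment
/// found in text, or "" if no complete fragment is found.
ClassMethod ExtractByDepth(text As %String) As %String
{
    // $find returns the position just after the match, or 0 if there is none
    set start = $find(text, "<?xml ") - 6
    if start < 1 { return "" }
    set pos = start, depth = 0, seenRoot = 0
    for {
        set lt = $find(text, "<", pos)          // position just after the next "<"
        if lt = 0 { return "" }
        if $extract(text, lt, lt + 7) = "![CDATA[" {
            set pos = $find(text, "]]>", lt)    // skip the whole CDATA section
        } elseif $extract(text, lt, lt + 2) = "!--" {
            set pos = $find(text, "-->", lt)    // skip a comment
        } elseif $extract(text, lt) = "?" {
            set pos = $find(text, "?>", lt)     // skip the declaration or a processing instruction
        } else {
            set pos = $find(text, ">", lt)
            if pos = 0 { return "" }
            set first = $extract(text, lt)
            if first = "/" {
                // closing tag
                set:depth>0 depth = depth - 1
            } elseif first '= "!" {
                // opening tag; a self-closing tag ("/>") leaves the depth unchanged
                set seenRoot = 1
                if $extract(text, pos - 2) '= "/" { set depth = depth + 1 }
            }
        }
        if pos = 0 { return "" }                // unterminated CDATA, comment or instruction
        // the root element has been opened and closed again: fragment complete
        if seenRoot && (depth = 0) { return $extract(text, start, pos - 1) }
    }
}
```

Because it tracks depth instead of searching for a closing tag by name, this variant should also cope with a root element whose name reappears in nested elements.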

Hi Michael,

Something like this?

Search for where "<?xml " starts.

Search for where the declaration ends (the first ">").

Get the first tag after the XML header.

Find where this tag is closed.

Remove the characters in between.

test
	set complex=1
	set crlf=$c(13,10)
	set file="text 1"
	set file=file_crlf_"text 2"
	set file=file_crlf_"<?xml version=""1.0"" encoding='UTF-8'?>"
	
	if complex {
		set file=file_crlf_"<Results xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
		set file=file_crlf_"     xmlns='urn:tcleDoctorReport'"
		set file=file_crlf_"         xsi:schemaLocation='urn:tcleDoctorReport DoctorReport.xsd'>"
	} else {
		set file=file_crlf_"<Results>"
	}
	
	set file=file_crlf_"	<ReportPageFormat/>"
	set file=file_crlf_"	<Department>"
	set file=file_crlf_"		<Section>"
	set file=file_crlf_"			<TestSet>"
	set file=file_crlf_"				<TestSetDesc>Blood Culture (Aerobic+Anaerobic)</TestSetDesc>"
	set file=file_crlf_"			</TestSet>"
	set file=file_crlf_"			<TestSet>"
	set file=file_crlf_"				<TestSetDesc>Blood Culture Positive Result</TestSetDesc>"
	set file=file_crlf_"			</TestSet>"
	set file=file_crlf_"		</Section>"
	set file=file_crlf_"	</Department>"
	set file=file_crlf_"	<EpisodeData>"
	set file=file_crlf_"		<EpisodeNumber>240000100</EpisodeNumber>"
	set file=file_crlf_"		<FirstName>Lily</FirstName>"
	set file=file_crlf_"	</EpisodeData>"
	set file=file_crlf_"</Results>"
	set file=file_crlf_"text 3"
	set file=file_crlf_"text 4"
	
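	; $find returns the position just after the match, so subtract the length of "<?xml " (6)
	set xmlheadstart=$f(file,"<?xml ")-6
	; position of the ">" that closes the XML declaration
	set xmlheadend=$f(file,">",xmlheadstart)-1
	
	;zzdump $e(file,xmlheadstart,xmlheadend)
	; first tag after the declaration, with CR/LF stripped
	set firsttag=$tr($p($e(file,xmlheadend+1,*),">",1)_">",$c(13,10))
	;zzdump firsttag
	; tag name only: drop the leading "<" and everything from the first space or ">" on
	set tag=$p($e($p(firsttag," ",1),2,*),">",1)
	;write !,tag
	
	; position just after the closing root tag
	set xmlend=$f(file,"</"_tag_">")
	
	; dump everything except the XML fragment
	zzdump $e(file,1,xmlheadstart-1)_$e(file,xmlend,*)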

What I get:

USER>d ^test2
 
0000: 74 65 78 74 20 31 0D 0A 74 65 78 74 20 32 0D 0A         text 1..text 2..
0010: 0D 0A 74 65 78 74 20 33 0D 0A 74 65 78 74 20 34         ..text 3..text 4
USER>

Regards

Manel

Thanks Manel

It worked great. Admittedly, it returned the surrounding text when I actually wanted the XML, but based on your example I was able to turn it around and get the XML out.

Working string: XMLstr

set xmlheadstart=$f(XMLstr,"<?xml ")-6
set xmlheadend=$f(XMLstr,">",xmlheadstart)-1
set firsttag=$tr($p($e(XMLstr,xmlheadend+1,*),">",1)_">",$c(13,10))
set tag=$p($e($p(firsttag," ",1),2,*),">",1)
set xmlend=$f(XMLstr,"</"_tag_">")
set NewXMLstr = $EXTRACT(XMLstr,xmlheadstart,xmlend-1)

Quit NewXMLstr

The NewXMLstr variable now contains the entire XML fragment.
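
A slightly more defensive sketch of the same steps, in case the incoming text contains no XML at all: $FIND returns 0 when nothing matches, so each search result is checked before it is used. The method name ExtractXML is hypothetical. The final well-formedness check through %XML.TextReader's ParseString is an optional extra, on the assumption that the extracted fragment fits in a string; whether the declared encoding suits your data is something to verify.

```objectscript
/// Hypothetical wrapper around the steps above; returns "" when no usable fragment is found.
ClassMethod ExtractXML(XMLstr As %String) As %String
{
    // Start of the XML declaration ($find returns 0 if it is absent)
    set pos = $find(XMLstr, "<?xml ")
    if pos = 0 { return "" }
    set xmlheadstart = pos - 6

    // End of the XML declaration
    set pos = $find(XMLstr, "?>", pos)
    if pos = 0 { return "" }

    // Root tag name: the text after the next "<", up to the first
    // space, "/", ">", tab or line break
    set pos = $find(XMLstr, "<", pos)
    if pos = 0 { return "" }
    set rest = $translate($extract(XMLstr, pos, *), " />"_$char(9, 10, 13), ">>>>>>")
    set tag = $piece(rest, ">", 1)
    if tag = "" { return "" }

    // Position just after the closing root tag
    set xmlend = $find(XMLstr, "</"_tag_">", pos)
    if xmlend = 0 { return "" }
    set NewXMLstr = $extract(XMLstr, xmlheadstart, xmlend - 1)

    // Optional: reject the fragment if it does not parse as XML
    set sc = ##class(%XML.TextReader).ParseString(NewXMLstr, .reader)
    if $system.Status.IsError(sc) { return "" }
    return NewXMLstr
}
```

Like the original, this stops at the first closing tag that carries the root's name, so it returns "" for a self-closing root such as <Start/> or when a comment sits between the declaration and the root element, and it would cut the fragment short if the root's name were reused by a nested element.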
 
Many thanks!

Regards Michael