Extract XML from text

Question

Michael Lundberg · Apr 15, 2024

Hello!

I wonder if anyone has a smart idea to extract an XML fragment inside a text document (incoming from a stream)?

The XML fragment is surrounded by plain text.

Example:

text...........
text...........
<?xml version="1.0" encoding="UTF-8"?>
<Start>
...etc
</Start>
text...........
text...........

The XML is not represented by any class or object in the namespace.

The XML can look different from time to time.

I would appreciate it if anyone knows how to use ObjectScript to extract the XML content.

Regards Michael

Product version: IRIS 2023.1

Discussion (9)

Julius Kavay · Apr 15, 2024

As you wrote, %XML.TextReader is used to read arbitrary XML documents. But "a text with a little bit of XML structure sitting in the middle" isn't XML!

Maybe there is a Python library for extracting XML from a text. If not, you probably have to read character by character, count each "<" (+1) and ">" (-1), and when the counter is back at 0, you probably have a correct XML structure between the first "<" and the last ">". Oh, and don't forget <![CDATA[...]]> sections, which make the reading more challenging.
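
The scanning idea above can be sketched in Python (which IRIS can also run as embedded Python). This is only a rough sketch under stated assumptions, not a tested library: the function name `extract_xml` is made up for illustration, the fragment is assumed to start with an `<?xml` declaration, attribute values are assumed not to contain a literal ">", and there is no DOCTYPE with an internal subset. It tracks element depth rather than a raw bracket count, because a raw count would already return to 0 after the first complete tag.

```python
from typing import Optional


def extract_xml(text: str) -> Optional[str]:
    """Return the first XML document embedded in text, or None if none is found."""
    start = text.find("<?xml")
    if start == -1:
        return None
    pos, depth = start, 0  # depth = number of currently open elements
    while True:
        lt = text.find("<", pos)
        if lt == -1:
            return None  # input ended before the root element was closed
        # CDATA sections, comments and processing instructions may contain
        # "<" and ">" that are not markup, so skip them as a whole.
        for opener, closer in (("<![CDATA[", "]]>"), ("<!--", "-->"), ("<?", "?>")):
            if text.startswith(opener, lt):
                end = text.find(closer, lt + len(opener))
                if end == -1:
                    return None
                pos = end + len(closer)
                break
        else:
            end = text.find(">", lt)
            if end == -1:
                return None
            tag = text[lt:end + 1]
            pos = end + 1
            if tag.startswith("</"):
                depth -= 1          # end tag closes one element
            elif tag.startswith("<!"):
                continue            # e.g. a simple DOCTYPE, not an element
            elif not tag.endswith("/>"):
                depth += 1          # start tag opens one element
            if depth < 0:
                return None         # stray end tag, give up
            if depth == 0:
                return text[start:pos]  # root element just closed
```

Because the depth only returns to 0 when the root element closes, a nested element with the same name as the root, or an end tag quoted inside a CDATA section or comment, does not cut the result short.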

Manel Trèmols · Apr 16, 2024

Hi Michael,

Something like this?

Search where "<?xml " starts

Search where it ends (first >)

Get first tag after xml header

Find where this tag ends

Remove characters in the middle.

test
	set complex=1
	set crlf=$c(13,10)
	set file="text 1"
	set file=file_crlf_"text 2"
	set file=file_crlf_"<?xml version=""1.0"" encoding='UTF-8'?>"
	
	if complex {
		set file=file_crlf_"<Results xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'"
		set file=file_crlf_"     xmlns='urn:tcleDoctorReport'"
		set file=file_crlf_"         xsi:schemaLocation='urn:tcleDoctorReport DoctorReport.xsd'>"
	} else {
		set file=file_crlf_"<Results>"
	}
	
	set file=file_crlf_"	<ReportPageFormat/>"
	set file=file_crlf_"	<Department>"
	set file=file_crlf_"		<Section>"
	set file=file_crlf_"			<TestSet>"
	set file=file_crlf_"				<TestSetDesc>Blood Culture (Aerobic+Anaerobic)</TestSetDesc>"
	set file=file_crlf_"			</TestSet>"
	set file=file_crlf_"			<TestSet>"
	set file=file_crlf_"				<TestSetDesc>Blood Culture Positive Result</TestSetDesc>"
	set file=file_crlf_"			</TestSet>"
	set file=file_crlf_"		</Section>"
	set file=file_crlf_"	</Department>"
	set file=file_crlf_"	<EpisodeData>"
	set file=file_crlf_"		<EpisodeNumber>240000100</EpisodeNumber>"
	set file=file_crlf_"		<FirstName>Lily</FirstName>"
	set file=file_crlf_"	</EpisodeData>"
	set file=file_crlf_"</Results>"
	set file=file_crlf_"text 3"
	set file=file_crlf_"text 4"
	
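	; $FIND returns the position just AFTER the match, so subtract the
	; length of "<?xml " (6) to get the first character of the declaration
	set xmlheadstart=$f(file,"<?xml ")-6
	; position of the ">" that closes the XML declaration
	set xmlheadend=$f(file,">",xmlheadstart)-1
	
	;zzdump $e(file,xmlheadstart,xmlheadend)
	; root start tag (up to its first ">"), with CR and LF stripped
	set firsttag=$tr($p($e(file,xmlheadend+1,*),">",1)_">",$c(13,10))
	;zzdump firsttag
	; element name: cut at the first space, drop the leading "<", cut at ">"
	set tag=$p($e($p(firsttag," ",1),2,*),">",1)
	;write !,tag
	
	; position just after the first occurrence of the end tag </tag>
	set xmlend=$f(file,"</"_tag_">")
	
	; dump everything outside the XML block
	zzdump $e(file,1,xmlheadstart-1)_$e(file,xmlend,*)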

What I get:

USER>d ^test2
 
0000: 74 65 78 74 20 31 0D 0A 74 65 78 74 20 32 0D 0A         text 1..text 2..
0010: 0D 0A 74 65 78 74 20 33 0D 0A 74 65 78 74 20 34         ..text 3..text 4
USER>

Regards

Manel

score 0 · Answer 1 · 2024-04-15T09:06:21-04:00

Cristiano Silva · Apr 15, 2024

Hi Michael.

The class %XML.TextReader is used to read arbitrary XML documents.

score 0 · Answer 2 · 2024-04-17T06:07:21-04:00

Cristiano Silva · Apr 17, 2024

Hi @Julius Kavay, you are correct. I missed this part:

The XML fragment is surrounded by plain text.

score 0 · Answer 3 · 2024-04-16T03:07:33-04:00

Hello and thanks for your answers. However, it is not possible to pass the stream to %XML.TextReader as it is without the status reporting an error. This is because the stream is not pure XML; it also contains junk from other content.

I will probably have to sit down and extract the XML content manually, as Julius describes. I thought I could get away without it :0)

score 0 · Answer 4 · 2024-04-16T03:19:30-04:00

If the XML content is well formed,
it might be sufficient to remove all leading text before
<?xml version="1.0" encoding="UTF-8"?>

score 0 · Answer 5 · 2024-04-16T04:04:59-04:00

Hello

Yes, probably, for the text that is before the XML block. The problem is that there is also text after the end tag, and the end tag can have different names.

score 0 · Answer 6 · 2024-04-16T05:55:54-04:00

The start tag would be right after the XML declaration, i.e. <StartTag (the element name ends at the first space or >). The end tag would then be </StartTag; from there, find the closing bracket >.
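
As a rough illustration of that recipe, here is a small Python sketch using regular expressions. The function name `extract_by_root_tag` is hypothetical, and the sketch assumes that the root element follows the XML declaration directly (no comment or DOCTYPE in between), that the root is not self-closing, and that the root's name does not reappear as a nested element or inside CDATA, since the search stops at the first matching end tag.

```python
import re
from typing import Optional


def extract_by_root_tag(text: str) -> Optional[str]:
    """Cut out '<?xml ...?>' through the first '</Root>' end tag, or return None."""
    # 1. Locate the XML declaration.
    decl = re.search(r"<\?xml\b.*?\?>", text, re.DOTALL)
    if decl is None:
        return None
    # 2. Root element name: the first start tag after the declaration.
    #    The name ends at whitespace, "/" or ">".
    root = re.compile(r"<([^\s<>/!?]+)").search(text, decl.end())
    if root is None:
        return None
    # 3. Matching end tag: "</Name", optional whitespace, then ">".
    end_pattern = re.compile(r"</" + re.escape(root.group(1)) + r"\s*>")
    end_tag = end_pattern.search(text, root.end())
    if end_tag is None:
        return None
    return text[decl.start():end_tag.end()]
```

Allowing optional whitespace before the closing bracket covers end tags written as `</Results >`, which a plain search for `</Results>` would miss.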

score 0 · Answer 7 · 2024-04-17T16:38:45-04:00

Thanks Manel

It worked great. Admittedly, I got the surrounding text out when I actually wanted the XML out. But with your example I was able to turn it around and get the XML out.

Working string: XMLstr

set xmlheadstart=$f(XMLstr,"<?xml ")-6
set xmlheadend=$f(XMLstr,">",xmlheadstart)-1
set firsttag=$tr($p($e(XMLstr,xmlheadend+1,*),">",1)_">",$c(13,10))
set tag=$p($e($p(firsttag," ",1),2,*),">",1)
set xmlend=$f(XMLstr,"</"_tag_">")
set NewXMLstr = $EXTRACT(XMLstr,xmlheadstart,xmlend-1)

Quit NewXMLstr

The NewXMLstr variable now contains the entire XML fragment.
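
One optional follow-up, sketched here in Python as an assumption rather than something from the thread: once the fragment has been cut out, it can be handed to a real XML parser to confirm that it is well formed. The helper name `is_well_formed` is made up for illustration; `xml.etree.ElementTree` is part of the Python standard library.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment: str) -> bool:
    """Return True if the extracted string parses as a complete XML document."""
    try:
        ET.fromstring(fragment)
    except ET.ParseError:
        # Raised for unclosed or mismatched tags and for text after the root element.
        return False
    return True
```

If the check fails, the extraction most likely stopped at the wrong end tag or left surrounding text attached.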

Many thanks!

Regards Michael