Efficient way to convert a large xml file to csv

Primary tabs

Hi,

I have a large XML zen report file that I want to convert to a CSV, what is the most efficient way to do it.

XML file:

<Details>
<Col1> aa1 </Col1>
<Col2> bb1</Col2>
<Col3> cc1 </Col3>
</Details>
<Details>
<Col1> aa2 </Col1>
<Col2> bb2</Col2>
<Col3> cc3 </Col3>
</Details>

Expected Output:

Col1,Col2,col3
aa1,bb1,bb1
aa2,bb2,bb2

Thank you.

Answers

It will depend on the structure of your xml file (next time a short sample would be helpfull).

Usually, you can do it with the help of %XLM.TextReader  / %XML.Reader class.

 

 

 

If you "donate" your XML-File a version info and an extra root-node:

<?xml version="1.0" encoding="UTF-8" ?>
<Data>
  <Details>
   ...
  </Details>
</Data>

and use, for example, the %XML.Textreader class (see belov).

Then with few lines of code the job is done:

XML ; XML to CSV
#define ROOT "Data"
#define DEL    ";"        ##; your CSV-delimiter, $c(9) or ";" or ...
#;
   set inpFile="c:\temp\example.xml"
   set outFile="c:\temp\example.csv"
   if ##class(%XML.TextReader).ParseFile(inpFile, .rdr) {
      if rdr.Read(), rdr.NodeType="element", rdr.Name=$$$ROOT {
         open outFile:"nw":1
         if $t {
            use outFile
            while rdr.Read() {
               if rdr.NodeType="element",rdr.Name="Details" {
                  set line=""
               } elseif rdr.NodeType="chars" {
                 set line=line_$lb(rdr.Value)
               } elseif rdr.NodeType="endelement",rdr.Name="Details" {
                 w $lts(line,$$$DEL),!
               } elseif rdr.NodeType="endelement",rdr.Name=$$$ROOT {
                  close outFile
                  quit
               }
            }
         } else { w "file open problem",! }
      } else { w "XML root-element problem",! }
   } else { w "XML structure problem",! }

I'm not sure for how much this is efficient, but you can use XSLT to do the transformation.

Class CMT.XmlToCsv [ Abstract ]
{

ClassMethod transform(
    infile As %String,
    outfile As %String) As %Status
{
    Set tXSL=##class(%Dictionary.CompiledXData).%OpenId(..%ClassName(1)_"||XmlToCsv").Data
    Set tSC=##class(%XML.XSLT.CompiledStyleSheet).CreateFromStream(tXSL,.tCompiledStyleSheet)
    If $$$ISERR(tSC) Quit tSC
    quit ##class(%XML.XSLT.Transformer).TransformFileWithCompiledXSL(infile,tCompiledStyleSheet,outfile)
}

XData XmlToCsv
{
<?xml version="1.0"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" />
    <xsl:variable name="separator" select="','" />
    <xsl:variable name="newline" select="'&#10;'" />

    <xsl:template match="/">
        <xsl:text>Col1,Col2,Col3</xsl:text>
        <xsl:value-of select="$newline" />
        <xsl:for-each select="//Details">
            <xsl:value-of select="Col1" />
            <xsl:value-of select="$separator" />
            <xsl:value-of select="Col2" />
            <xsl:value-of select="$separator" />
            <xsl:value-of select="Col3" />
            <xsl:value-of select="$newline" />
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
}

}

And then call it from terminal:

set p=##class(CMT.XmlToCsv).transform("c:\temp\input.xml","c:\temp\output.txt") 
zw p

I took XSLT from https://stackoverflow.com/a/46176699/82675

I think most XSLT implementations read the whole XML input file into memory before beginning to process it.  You need to test if an XSLT approach works with files of the size with which you deal.  Of course, you also need to test how files of a particular size work in the particular machine environment which is running the application.  There is not a one size fits all solution for parsing XML files using XSLT when memory issues are a concern.  I believe ISC supports a SAX parser. SAX parsers put less burdens on memory than XLST.

SAX is described here: https://docs.oracle.com/javase/tutorial/jaxp/intro/simple.html

thank you for the solution provided, i will definitely try this, the challenge is that in the XML file is generated, will always not follow the same amount of columns, so i need to generate the xdata stylesheet on runtime.

 

Thank you, i tried your solution, one issue that i noticed is;

Suppose if <Col1> </Col1> is empty, with the above solution 

rdr.NodeType="chars"

will not evaluate, and the CSV file will be wrong.

tweaked it as follows;

#define DEL ","  
 #define DC """"

 if ##class(%XML.TextReader).ParseFile(myfile, .rdr) {
      if rdr.Read(), rdr.NodeType="element", rdr.Name=ROOT {
         if $t {
            set (nodeType,nodeName,line,header,childElementValue) = ""
            set cnt=1
            while rdr.Read() {
           set nodeType = rdr.NodeType, nodeName = rdr.Name, length = $length(rdr.Path,"/")
     
               if nodeType="element",nodeName=NAME {
                  set line=""
               }
               elseif cnt=1,nodeType="element",(length>3){
               set header = header_$$$DC_nodeName_$$$DC_$$$DEL
               
               elseif ((nodeType="chars") & (length>3)) {
               set childElementValue=rdr.Value
               
               elseif nodeType="endelement",length=4 {
               set line=line_$$$DC_childElementValue_$$$DC_$$$DEL,childElementValue=""
               }
               elseif nodeType="endelement",nodeName=NAME {
               if cnt=1 {
               do stream.WriteLine(header)
               set cnt=0
               }
                   do stream.WriteLine(line)
               
               elseif nodeType="endelement",nodeName=ROOT {
                  quit
               }
               
            }
         else "file open problem",! }
      else "XML root-element problem",! }
   else "XML structure problem",! }

Although when it comes to a large file ParseFile method takes a significant amount of time to process the file, and i want to improve the code to run faster.

for example calling stream.WriteLine for each line can slow down the process,  i want to try and batch the data and send to the stream to make the process much faster, and there is a concern for the <MAXLENGH> error.

Any Ideas?