Efficient way to convert a large xml file to csv

Question

Question

Rizmaan Marikar · Sep 8, 2019

Hi,

I have a large XML zen report file that I want to convert to a CSV, what is the most efficient way to do it.

XML file:

<Details>
<Col1> aa1 </Col1>
<Col2> bb1</Col2>
<Col3> cc1 </Col3>
</Details>
<Details>
<Col1> aa2 </Col1>
<Col2> bb2</Col2>
<Col3> cc3 </Col3>
</Details>

Expected Output:

Col1,Col2,col3
aa1,bb1,bb1
aa2,bb2,bb2

Thank you.

Discussion (12)2

Log in or sign up to continue

Julius Kavay · Sep 10, 2019

The intention behind my post was, to give you one idea (of many other possibilities), how to convert XML to CSV. A empty (chars) element is just one of some other unhandled cases (missing tags, other tags inside of COLx tag, etc.).

If you need some speed and your XML-File obeys following constraints:

- the file is a correct XML-File
- contains only the 'Data', 'Details' and 'ColX' tags
- no selfclosing tags, like <sometag/>

then you could try the QAD-way (quick-and-dirty) of conversion.

Again, below an example routine (without excessive testing).
All ISC people and ordinary programer, please look the other way ;-))

Convert() Public
{
   set len=32000 // chunk size, a safe value is:
                 // 32767 - longestColumnValue - tagSizes
   set fi="c:\temp\example-t.xml" // inp file (xml)
   set fo="c:\temp\example-t.csv" // output file (csv)

   open fi:"ru":1 open:$t fo:"nw":1
   if '$t close fi quit

   set xml=$$inp(fi,len) // first xml-chunk

   set i=0, p=0
   while 1 {
      set i=$locate(xml,"\<\/?\w+\>",i,j,v) // next (opening or closing) tag

      if i {                                            // let see, what we got 
         if v="<Details>" { set row="", p=-1 // start a new row
         } elseif v="</Details>" { d out(fo,row) s p=0 // complete, write out
         } elseif p,v["<Col" { s p=j, o=$zstrip(v,"*AP") // new column, keep start
         } elseif p,v["</Col" {s $li(row,o)=$$val($e(xml,p,i-1)) // get value
         }                                   // everything else is don't care
         set i=j

      } else {
         set tmp=$$inp(fi,len) // next xml-chunk
         if tmp="" quit                                   // done

         // remove processed data, add new one
         if p>0 { set xml=$e(xml,p,*)_tmp,p=1,i=0 } else { s xml=$e(xml,i,*)_tmp,i=0 }
      }
   }
   close fi
   close fo
}

val(val)
{
   quit $zstrip(val,"<>w") // add handling of charcter entities like &lt; etc.
}

out(fo,row)
{
   use fo
   write $listtostring(row,";",1),! // delimiter!
}

inp(fn,len)
{
   use fn
   try { read xml#len } catch { s xml="" } // in case, $zeof-mode is off
   quit xml
}

The above converter reads a simple test XML-file with two million 'ColX' items in 5 seconds and creates a CSV file with 100000 rows and 20 columns (in each row).

Instead of file you can also use an stream.

1 0

score 0 · Answer 1 · 2019-09-08T14:57:33-04:00

It will depend on the structure of your xml file (next time a short sample would be helpfull).

Usually, you can do it with the help of %XLM.TextReader / %XML.Reader class.

score 0 · Answer 2 · 2019-09-09T07:38:46-04:00

Rizmaan Marikar · Sep 9, 2019

i have created a sample and expected output

0 0

score 0 · Answer 3 · 2020-01-28T15:37:45-05:00

Apache Tika is another option. Without writing any code, it can be run from the command-line and output an XLSX as a tab-separated file.

https://tika.apache.org/

java -jar tika-app-1.23.jar -t sample.xlsx > sample.tsv

score 2 · Answer 4 · 2019-09-09T14:19:47-04:00

I think most XSLT implementations read the whole XML input file into memory before beginning to process it. You need to test if an XSLT approach works with files of the size with which you deal. Of course, you also need to test how files of a particular size work in the particular machine environment which is running the application. There is not a one size fits all solution for parsing XML files using XSLT when memory issues are a concern. I believe ISC supports a SAX parser. SAX parsers put less burdens on memory than XLST.

SAX is described here: https://docs.oracle.com/javase/tutorial/jaxp/intro/simple.html

score 0 · Answer 5 · 2019-09-10T09:06:44-04:00

thank you for the solution provided, i will definitely try this, the challenge is that in the XML file is generated, will always not follow the same amount of columns, so i need to generate the xdata stylesheet on runtime.

score 0 · Answer 6 · 2020-01-28T09:26:25-05:00

You can use the same approaches that were suggested above, though you'll need to spend time figuring out the structure of the .xslx file so you can write your extraction logic.

To save the time/effort, the Apache POI project provides a Java library that can read/write MS Office formats, including XLSX:
https://poi.apache.org/
https://kalliphant.com/poi-convert-xlsx-to-csv-example/

score 0 · Answer 7 · 2020-01-29T03:53:49-05:00

Check

Apache POI would be faster and more specialized, LibreOffice is the best bet on converting arbitrary documents between arbitrary formats.

score 0 · Answer 8 · 2019-09-09T18:31:37-04:00

If you "donate" your XML-File a version info and an extra root-node:

<?xml version="1.0" encoding="UTF-8" ?>
<Data>
  <Details>
   ...
  </Details>
</Data>

and use, for example, the %XML.Textreader class (see belov).

Then with few lines of code the job is done:

XML ; XML to CSV
#define ROOT "Data"
#define DEL    ";"        ##; your CSV-delimiter, $c(9) or ";" or ...
#;
   set inpFile="c:\temp\example.xml"
   set outFile="c:\temp\example.csv"
   if ##class(%XML.TextReader).ParseFile(inpFile, .rdr) {
      if rdr.Read(), rdr.NodeType="element", rdr.Name=$$$ROOT {
         open outFile:"nw":1
         if $t {
            use outFile
            while rdr.Read() {
               if rdr.NodeType="element",rdr.Name="Details" {
                  set line=""
               } elseif rdr.NodeType="chars" {
                 set line=line_$lb(rdr.Value)
               } elseif rdr.NodeType="endelement",rdr.Name="Details" {
                 w $lts(line,$$$DEL),!
               } elseif rdr.NodeType="endelement",rdr.Name=$$$ROOT {
                  close outFile
                  quit
               }
            }
         } else { w "file open problem",! }
      } else { w "XML root-element problem",! }
   } else { w "XML structure problem",! }

score 0 · Answer 9 · 2019-09-09T10:35:08-04:00

I'm not sure for how much this is efficient, but you can use XSLT to do the transformation.

Class CMT.XmlToCsv [ Abstract ]
{

ClassMethod transform(
    infile As %String,
    outfile As %String) As %Status
{
    Set tXSL=##class(%Dictionary.CompiledXData).%OpenId(..%ClassName(1)_"||XmlToCsv").Data
    Set tSC=##class(%XML.XSLT.CompiledStyleSheet).CreateFromStream(tXSL,.tCompiledStyleSheet)
    If $$$ISERR(tSC) Quit tSC
    quit ##class(%XML.XSLT.Transformer).TransformFileWithCompiledXSL(infile,tCompiledStyleSheet,outfile)
}

XData XmlToCsv
{
<?xml version="1.0"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" />
    <xsl:variable name="separator" select="','" />
    <xsl:variable name="newline" select="'&#10;'" />

    <xsl:template match="/">
        <xsl:text>Col1,Col2,Col3</xsl:text>
        <xsl:value-of select="$newline" />
        <xsl:for-each select="//Details">
            <xsl:value-of select="Col1" />
            <xsl:value-of select="$separator" />
            <xsl:value-of select="Col2" />
            <xsl:value-of select="$separator" />
            <xsl:value-of select="Col3" />
            <xsl:value-of select="$newline" />
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
}

}

And then call it from terminal:

set p=##class(CMT.XmlToCsv).transform("c:\temp\input.xml","c:\temp\output.txt") 
zw p

I took XSLT from https://stackoverflow.com/a/46176699/82675

score 0 · Answer 10 · 2019-09-10T07:39:07-04:00

Thank you, i tried your solution, one issue that i noticed is;

Suppose if <Col1> </Col1> is empty, with the above solution

rdr.NodeType="chars"

will not evaluate, and the CSV file will be wrong.

tweaked it as follows;

 #define DEL ","  
 #define DC """"

 if ##class(%XML.TextReader).ParseFile(myfile, .rdr) {
      if rdr.Read(), rdr.NodeType="element", rdr.Name=ROOT {
         if $t {
            set (nodeType,nodeName,line,header,childElementValue) = ""
            set cnt=1
            while rdr.Read() {
           set nodeType = rdr.NodeType, nodeName = rdr.Name, length = $length(rdr.Path,"/")
     
               if nodeType="element",nodeName=NAME {
                  set line=""
               }
               elseif cnt=1,nodeType="element",(length>3){
               set header = header_$$$DC_nodeName_$$$DC_$$$DEL
               } 
               elseif ((nodeType="chars") & (length>3)) {
               set childElementValue=rdr.Value
               } 
               elseif nodeType="endelement",length=4 {
               set line=line_$$$DC_childElementValue_$$$DC_$$$DEL,childElementValue=""
               }
               elseif nodeType="endelement",nodeName=NAME {
               if cnt=1 {
               do stream.WriteLine(header)
               set cnt=0
               }
                   do stream.WriteLine(line)
               } 
               elseif nodeType="endelement",nodeName=ROOT {
                  quit
               }
               
            }
         } else { w "file open problem",! }
      } else { w "XML root-element problem",! }
   } else { w "XML structure problem",! }

Although when it comes to a large file ParseFile method takes a significant amount of time to process the file, and i want to improve the code to run faster.

for example calling stream.WriteLine for each line can slow down the process, i want to try and batch the data and send to the stream to make the process much faster, and there is a concern for the <MAXLENGH> error.

Any Ideas?

score 0 · Answer 11 · 2020-01-28T09:00:13-05:00

What if my XML is an Excel file? Here is an example section. Is it even possible to read this in so I can create a CSV file as output?

<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:z="#RowsetSchema" xmlns:udc="http://schemas.microsoft.com/data/udc" xmlns:udcs="http://schemas.microsoft.com/data/udc/soap" xmlns:udcxf="http://schemas.microsoft.com/data/udc/xmfile" xmlns:c="urn:schemas-microsoft-com:office:component:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
   <OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office">
       <DownloadComponents/>
   </OfficeDocumentSettings>
   <ss:Styles>
       <ss:Style ss:ID="BrdrLftWrp">
           <ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
           <ss:Borders>
               <ss:Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1"/>
           </ss:Borders>
           <ss:NumberFormat ss:Format="@"/>
       </ss:Style>

.....