Efficient way to convert a large xml file to csv
Hi,
I have a large XML zen report file that I want to convert to a CSV, what is the most efficient way to do it.
XML file:
<Details> <Col1> aa1 </Col1> <Col2> bb1</Col2> <Col3> cc1 </Col3> </Details> <Details> <Col1> aa2 </Col1> <Col2> bb2</Col2> <Col3> cc3 </Col3> </Details>
Expected Output:
Col1,Col2,col3 aa1,bb1,bb1 aa2,bb2,bb2
Thank you.
It will depend on the structure of your xml file (next time a short sample would be helpfull).
Usually, you can do it with the help of %XLM.TextReader / %XML.Reader class.
i have created a sample and expected output
Apache Tika is another option. Without writing any code, it can be run from the command-line and output an XLSX as a tab-separated file.
https://tika.apache.org/
I think most XSLT implementations read the whole XML input file into memory before beginning to process it. You need to test if an XSLT approach works with files of the size with which you deal. Of course, you also need to test how files of a particular size work in the particular machine environment which is running the application. There is not a one size fits all solution for parsing XML files using XSLT when memory issues are a concern. I believe ISC supports a SAX parser. SAX parsers put less burdens on memory than XLST.
SAX is described here: https://docs.oracle.com/javase/tutorial/jaxp/intro/simple.html
thank you for the solution provided, i will definitely try this, the challenge is that in the XML file is generated, will always not follow the same amount of columns, so i need to generate the xdata stylesheet on runtime.
The intention behind my post was, to give you one idea (of many other possibilities), how to convert XML to CSV. A empty (chars) element is just one of some other unhandled cases (missing tags, other tags inside of COLx tag, etc.).
If you need some speed and your XML-File obeys following constraints:
- the file is a correct XML-File
- contains only the 'Data', 'Details' and 'ColX' tags
- no selfclosing tags, like <sometag/>
then you could try the QAD-way (quick-and-dirty) of conversion.
Again, below an example routine (without excessive testing).
All ISC people and ordinary programer, please look the other way ;-))
The above converter reads a simple test XML-file with two million 'ColX' items in 5 seconds and creates a CSV file with 100000 rows and 20 columns (in each row).
Instead of file you can also use an stream.
You can use the same approaches that were suggested above, though you'll need to spend time figuring out the structure of the .xslx file so you can write your extraction logic.
To save the time/effort, the Apache POI project provides a Java library that can read/write MS Office formats, including XLSX:
https://poi.apache.org/
https://kalliphant.com/poi-convert-xlsx-to-csv-example/
Check
Apache POI would be faster and more specialized, LibreOffice is the best bet on converting arbitrary documents between arbitrary formats.
If you "donate" your XML-File a version info and an extra root-node:
and use, for example, the %XML.Textreader class (see belov).
Then with few lines of code the job is done:
I'm not sure for how much this is efficient, but you can use XSLT to do the transformation.
And then call it from terminal:
I took XSLT from https://stackoverflow.com/a/46176699/82675
Thank you, i tried your solution, one issue that i noticed is;
Suppose if <Col1> </Col1> is empty, with the above solution
will not evaluate, and the CSV file will be wrong.
tweaked it as follows;
Although when it comes to a large file ParseFile method takes a significant amount of time to process the file, and i want to improve the code to run faster.
for example calling stream.WriteLine for each line can slow down the process, i want to try and batch the data and send to the stream to make the process much faster, and there is a concern for the <MAXLENGH> error.
Any Ideas?
What if my XML is an Excel file? Here is an example section. Is it even possible to read this in so I can create a CSV file as output?
<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet" xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:x2="http://schemas.microsoft.com/office/excel/2003/xml" xmlns:dt="uuid:C2F41010-65B3-11d1-A29F-00AA00C14882" xmlns:rs="urn:schemas-microsoft-com:rowset" xmlns:s="uuid:BDC6E3F0-6DA3-11d1-A2A3-00AA00C14882" xmlns:z="#RowsetSchema" xmlns:udc="http://schemas.microsoft.com/data/udc" xmlns:udcs="http://schemas.microsoft.com/data/udc/soap" xmlns:udcxf="http://schemas.microsoft.com/data/udc/xmfile" xmlns:c="urn:schemas-microsoft-com:office:component:spreadsheet" xmlns:html="http://www.w3.org/TR/REC-html40">
<OfficeDocumentSettings xmlns="urn:schemas-microsoft-com:office:office">
<DownloadComponents/>
</OfficeDocumentSettings>
<ss:Styles>
<ss:Style ss:ID="BrdrLftWrp">
<ss:Alignment ss:Vertical="Top" ss:WrapText="1"/>
<ss:Borders>
<ss:Border ss:Position="Left" ss:LineStyle="Continuous" ss:Weight="1"/>
</ss:Borders>
<ss:NumberFormat ss:Format="@"/>
</ss:Style>
.....
Social networks
InterSystems resources
Log in or sign up
Log in or create a new account to continue
Log in or sign up
Log in or create a new account to continue
Log in or sign up
Log in or create a new account to continue