Excel in the IRIS .xls world
Working with files often starts off simple: open the file, read it, and process it. That approach works perfectly well, until the file happens to be an Excel file.
A Common Assumption
At first, an Excel file (.xlsx) looks like just another data file: rows, columns, and values. Nothing unusual. So it's natural to assume it can be read the same way as a .txt or .csv file. But that's where things start to break.
Why Excel files behave differently
The key difference is how the data is stored:
-> .txt / .csv - plain text, read line by line
-> .xlsx - a compressed, structured package (not plain text)
An Excel file is not actually a simple stream of readable lines. Internally, it is a packaged (ZIP-based) file that contains structured XML data, which the standard file-reading commands are not designed to interpret.
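The packaged nature of .xlsx can be shown with nothing but the Python standard library. The sketch below builds a tiny stand-in for a workbook (a real .xlsx contains these parts and several more) and lists its members, showing that the "file" is really an archive of XML parts:

```python
import io
import zipfile

# A minimal stand-in for an .xlsx package: a ZIP archive of XML "parts".
# (A real workbook contains these members plus relationships, styles, etc.)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# Line-by-line reading would only see compressed bytes, but zipfile
# can navigate the structured parts inside.
with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()
print(members)
```

This is exactly why renaming an .xlsx to .zip lets you browse its contents.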
What happens if you treat it like a text file?
- Errors during execution
- Output that looks unreadable
Something important --> This is not a limitation; it is a mismatch between the tool and the file format.
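The "unreadable output" is easy to reproduce. Because every .xlsx is a ZIP archive underneath, reading its raw bytes as if they were text surfaces the ZIP magic number rather than rows and cells (the stand-in workbook below is built in-memory for illustration):

```python
import io
import zipfile

# Build a stand-in .xlsx; any real .xlsx is a ZIP archive underneath.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")
raw = buf.getvalue()

# "Reading it as text" surfaces the ZIP signature, not cell values:
print(raw[:4])  # b'PK\x03\x04'
```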
Practical ways to handle it
Instead of relying only on the text-based approach, there are better options:
- Convert to CSV
- Embedded Python (libraries like pandas or openpyxl)
- IRIS Interoperability tools
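The first option works because a converted CSV really is plain text, so line-oriented reading becomes safe again. A minimal sketch (the sample data below is invented, standing in for the output of an Excel-to-CSV conversion):

```python
import csv
import io

# Invented sample data representing a CSV produced from a workbook;
# after conversion, each line of the file is one row.
csv_text = "id,name\n1,Alice\n2,Bob\n"

rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)  # [['id', 'name'], ['1', 'Alice'], ['2', 'Bob']]
```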
If anybody here has worked with Excel files in IRIS or has a different approach that works well, feel free to share. :)
A few examples are shown below.
// Example for reading an Excel file and writing its rows to a text file
ClassMethod ReadDataToFile() [ Language = python ]
{
    #; w ##class(test).ReadDataToFile()
    import pandas as pd
    source = r"[input-file-path]\Test.xlsx"
    outfile = r"[output-file-path]\Output.txt"
    df = pd.read_excel(source)
    with open(outfile, "w", encoding="utf-8") as f:
        for index, row in df.iterrows():
            # Strip embedded line breaks so each row stays on one line
            cleaned_values = [str(x).replace("\n", " ").replace("\r", " ") for x in row.tolist()]
            line = "\t".join(cleaned_values)
            f.write(line + "\n")
}
/// Example for storing the rows in a global
ClassMethod ReadData() [ Language = python ]
{
    #; w ##class(test).ReadData()
    import os
    import pandas as pd
    import iris
    source = r"[input-file-path]\Test.xlsx"
    df = pd.read_excel(source)
    g = iris.gref("^zReadFile")
    jobId = os.getpid()  # subscript the global by the current process ID
    for index, row in df.iterrows():
        # Strip embedded line breaks so each node holds a single line
        cleaned_values = [str(x).replace("\n", " ").replace("\r", " ") for x in row.tolist()]
        line = "\t".join(cleaned_values)
        g[jobId, index] = line
}
Comments
If you really want to dig into the structure behind the XLSX files, change the extension on one of them from .xlsx to .zip and you can navigate through the file structure.