Excel in the IRIS .xls world
Working with files often starts off simple: open the file, read it, and process it. That approach works perfectly well, until the file happens to be an Excel file.
A Common Assumption
At first, an Excel file (.xlsx) looks like just another data file: rows, columns, and values. Nothing unusual. So it's natural to assume it can be read the same way as a .txt or .csv file. But that's where things start to break.
Why Excel files behave differently
The key difference is how the data is stored:
-> .txt / .csv - plain text, read line by line
-> .xlsx - a compressed, structured package (not plain text)
An Excel file is not actually a simple stream of readable lines. Internally, it is a packaged (ZIP-based) file that contains structured XML data, which the standard file-reading commands are not designed to interpret.
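The packaged nature of .xlsx can be shown with nothing but the Python standard library. The sketch below builds a tiny stand-in for a workbook (a real .xlsx contains these parts and several more) and lists its members, showing that the "file" is really an archive of XML parts:

```python
import io
import zipfile

# A minimal stand-in for an .xlsx package: a ZIP archive of XML "parts".
# (A real workbook contains these members plus relationships, styles, etc.)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("[Content_Types].xml", "<Types/>")
    zf.writestr("xl/workbook.xml", "<workbook/>")
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")

# Line-by-line reading would only see compressed bytes, but zipfile
# can navigate the structured parts inside.
with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()
print(members)
```

This is exactly why renaming an .xlsx to .zip lets you browse its contents.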
What happens if you treat it like a text file?
- Errors during execution
- Output that looks unreadable
Something important --> This is not a limitation; it is a mismatch between the tool and the file format.
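The "unreadable output" is easy to reproduce. Because every .xlsx is a ZIP archive underneath, reading its raw bytes as if they were text surfaces the ZIP magic number rather than rows and cells (the stand-in workbook below is built in-memory for illustration):

```python
import io
import zipfile

# Build a stand-in .xlsx; any real .xlsx is a ZIP archive underneath.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/workbook.xml", "<workbook/>")
raw = buf.getvalue()

# "Reading it as text" surfaces the ZIP signature, not cell values:
print(raw[:4])  # b'PK\x03\x04'
```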
Practical ways to handle it
Instead of relying only on the text-based approach, there are better options:
- Convert to CSV
- Embedded Python (libraries like pandas or openpyxl)
- IRIS Interoperability tools
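The first option works because a converted CSV really is plain text, so line-oriented reading becomes safe again. A minimal sketch (the sample data below is invented, standing in for the output of an Excel-to-CSV conversion):

```python
import csv
import io

# Invented sample data representing a CSV produced from a workbook;
# after conversion, each line of the file is one row.
csv_text = "id,name\n1,Alice\n2,Bob\n"

rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)  # [['id', 'name'], ['1', 'Alice'], ['2', 'Bob']]
```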
If anybody here has worked with Excel files in IRIS or has a different approach that works well, feel free to share. :)
A few examples are shown below.
// Example for reading an Excel file and writing its rows to a text file
ClassMethod ReadDataToFile() [ Language = python ]
{
    #; w ##class(test).ReadDataToFile()
    import pandas as pd
    source = r"[input-file-path]\Test.xlsx"
    outfile = r"[output-file-path]\Output.txt"
    df = pd.read_excel(source)
    with open(outfile, "w", encoding="utf-8") as f:
        for index, row in df.iterrows():
            # Strip embedded line breaks so each row stays on one line
            cleaned_values = [str(x).replace("\n", " ").replace("\r", " ") for x in row.tolist()]
            line = "\t".join(cleaned_values)
            f.write(line + "\n")
}
/// Example for storing the rows in a global
ClassMethod ReadData() [ Language = python ]
{
    #; w ##class(test).ReadData()
    import os
    import pandas as pd
    import iris
    source = r"[input-file-path]\Test.xlsx"
    df = pd.read_excel(source)
    g = iris.gref("^zReadFile")
    jobId = os.getpid()  # subscript the global by the current process ID
    for index, row in df.iterrows():
        # Strip embedded line breaks so each node holds a single line
        cleaned_values = [str(x).replace("\n", " ").replace("\r", " ") for x in row.tolist()]
        line = "\t".join(cleaned_values)
        g[jobId, index] = line
}
Comments
If you really want to dig into the structure behind the XLSX files, change the extension on one of them from .xlsx to .zip and you can navigate through the file structure.