Validating XML documents with Schematron using Python


Ricardo Paiva · Feb 9, 2023

#Python #XML #Health Connect #HealthShare #InterSystems IRIS #InterSystems IRIS for Health

Schematron is a rule-based validation language for making assertions about the presence or absence of certain patterns in XML documents. A Schematron schema is a collection of one or more rules, each containing tests. Schemas are written in XML, which makes them relatively easy for everyone, even non-programmers, to inspect, understand, and write.

Essentially, a Schematron performs two actions in sequence:

1. It finds the context nodes of interest in the document. A "context node" can be an element of a particular type, a specific element at a particular place in the document, an attribute, or an attribute value. For example, suppose you want to check whether the sum of the <Percent> elements within each <Total> element is 100%. In this case, the context node would be the <Total> element.
2. For each of those nodes of interest, it checks whether a specific statement is true or false. For example, you might have a rule written to answer the question "Is the sum total 100%?"
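To make the two steps concrete, here is a minimal sketch using lxml's isoschematron module. The schema and the two sample documents are illustrative assumptions written for this example; they are not the test files used later in the article. The rule's context attribute selects the <Total> nodes, and the assert's test attribute is the statement checked for each of them.

```python
from lxml import etree, isoschematron

# Illustrative schema (an assumption for this sketch): every <Total>
# context node must contain <Percent> values that add up to 100.
SCHEMA = b"""<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="Total">
      <assert test="sum(Percent) = 100">The Percent values must add up to 100.</assert>
    </rule>
  </pattern>
</schema>"""

# Two hypothetical documents: one satisfies the rule, one does not.
GOOD = b"<Total><Percent>60</Percent><Percent>40</Percent></Total>"
BAD = b"<Total><Percent>60</Percent><Percent>30</Percent></Total>"

schematron = isoschematron.Schematron(etree.fromstring(SCHEMA))

good_ok = schematron.validate(etree.fromstring(GOOD))
bad_ok = schematron.validate(etree.fromstring(BAD))

print(good_ok, bad_ok)  # True False
```

The first document passes because 60 + 40 = 100; the second fails because its values only add up to 90.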

For more detail on the subject, see https://www.schematron.com/. What matters here is that we can validate an XML document against a Schematron definition. Several open source projects implement Schematron on top of XSLT. One of the most interesting is available at https://github.com/schxslt/schxslt.git.

This article aims to leverage the Python capabilities available in InterSystems IRIS (for Health) or HealthShare (Health Connect).

For this we need an instance of InterSystems IRIS or HealthShare Health Connect. For our example we will use a container with the latest community edition of InterSystems IRIS for Health. We have to start the instance by publishing the default ports and mapping the current directory to the durable folder in the container.

$ docker run --name iris4health -d --publish 1972:1972 --publish 52773:52773 --volume $(pwd):/durable containers.intersystems.com/intersystems/irishealth-community:2022.3.0.589.0

Now that we have our instance running we can start a console in the container.

$ docker exec -it iris4health bash

Now we can focus on the Python module. We will use lxml, a Python binding for the C libraries libxml2 and libxslt. It combines the speed and completeness of the XML features of these libraries with the simplicity of a native Python API that is mostly compatible with the familiar ElementTree API. For more information about lxml, see https://lxml.de/index.html.
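As a quick illustration of that API, the following sketch parses a small document and reads it in two ways: with the ElementTree-style findall() and with an XPath expression evaluated by libxml2. The sample document is an assumption made up for this example.

```python
from lxml import etree

# Hypothetical sample document, used only for this illustration.
root = etree.fromstring(
    b"<Total><Percent>60</Percent><Percent>40</Percent></Total>"
)

# ElementTree-style navigation: collect the text of each <Percent> child.
percents = [float(p.text) for p in root.findall("Percent")]

# XPath evaluation: the same sum that a Schematron test could check.
xpath_sum = root.xpath("sum(Percent)")

print(percents, xpath_sum)  # [60.0, 40.0] 100.0
```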

Assuming that the pip3 package manager (and of course Python 3) is already installed on the instance, the appropriate module will need to be installed.

$ pip3 install --target /usr/irissys/mgr/python lxml

The example method is written in Python. It parses the Schematron rules and validates an XML file against them. The code of the class is the following:

Class dc.schematron Extends %RegisteredObject
{

/// Validates /durable/test-file.xml against the Schematron rules in /durable/test-schema.sch
ClassMethod simpleTest() [ Language = python ]
{
        from lxml import isoschematron
        from lxml import etree

        print("Validating File...\n")

        # Parse the Schematron schema; etree.parse() accepts a file path directly
        sct_doc = etree.parse('/durable/test-schema.sch')
        # store_report=True keeps the full validation report of the last run
        schematron = isoschematron.Schematron(sct_doc, store_report=True)

        # Parse the XML file to check
        doc = etree.parse('/durable/test-file.xml')

        # Validate against the schema; the result is True when no assertion fails
        validationResult = schematron.validate(doc)
        # Read the report through the public property instead of the private attribute
        report = schematron.validation_report

        # Check result
        if validationResult:
            print("passed")
        else:
            print("failed")
            print(report)
}

}

The method is fairly simple. It parses two files, the Schematron rules and a sample XML document, and validates the document against the rules. The rule checks whether the sum of the <Percent> elements within each <Total> node is 100%. To execute it, run the following command from an IRIS terminal session (for example, one opened with iris session IRIS inside the container):

d ##class(dc.schematron).simpleTest()

The result will be presented in the console. The same logic can be used in an interoperability production.
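When the validation is called from other code rather than from the console, it is usually more convenient to return the outcome than to print it. The following is a minimal sketch under that assumption: validate_with_schematron is a hypothetical helper name, and the inline schema and document are illustrative stand-ins for the files in /durable. It returns the validation result together with the messages of the failed assertions, read from the SVRL report that lxml produces.

```python
import io
from lxml import etree, isoschematron

# Namespace of the SVRL validation report generated by the Schematron run.
SVRL_NS = {"svrl": "http://purl.oclc.org/dsdl/svrl"}


def validate_with_schematron(rules_source, xml_source):
    """Hypothetical helper: return (is_valid, messages).

    Both arguments can be anything etree.parse() accepts,
    such as a file path or a file-like object.
    """
    schematron = isoschematron.Schematron(
        etree.parse(rules_source), store_report=True
    )
    is_valid = schematron.validate(etree.parse(xml_source))
    failed = schematron.validation_report.xpath(
        "//svrl:failed-assert/svrl:text", namespaces=SVRL_NS
    )
    # Collapse whitespace so each message is a single clean line.
    messages = [" ".join("".join(node.itertext()).split()) for node in failed]
    return is_valid, messages


# Illustrative inputs (assumptions), passed as in-memory files.
rules = io.BytesIO(b"""<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <rule context="Total">
      <assert test="sum(Percent) = 100">The Percent values must add up to 100.</assert>
    </rule>
  </pattern>
</schema>""")
xml = io.BytesIO(b"<Total><Percent>60</Percent><Percent>30</Percent></Total>")

is_valid, messages = validate_with_schematron(rules, xml)
print(is_valid, messages)
```

With the article's files, the call would be validate_with_schematron('/durable/test-schema.sch', '/durable/test-file.xml'). How the returned values are then wired into a production depends on the components involved and is not covered here.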

The source code containing all the elements is available here.
