Article
Jack Huser · Sep 13 6m read

Use $system.external Interface for Python

After seeing many posts about Python on the Developer Community, as well as the very good articles and applications written by @Eduard Lebedyuk, I was wondering: "As an ObjectScript developer, why would I want to use another language in ObjectScript? If I ever need to execute something in ObjectScript, I would do it in ObjectScript!".

I thought the features for using other languages in ObjectScript were made only for developers in those languages who have to write ObjectScript code.

Recently I had to parse a huge CSV file: 1.7 GB and more than 5 million lines.

I did it in ObjectScript:

ClassMethod ReadFile(strINReadFile As %String = "") As %Status
{
     #dim tSC As %Library.Status = $$$OK
     #dim FileReader As %Library.File
     try {
          set FileReader = ##class(%Library.File).%New(strINReadFile)
          set tSC = FileReader.Open("RU")
          if $$$ISERR(tSC) { quit }
          set FileReader.LineTerminator = $$$NL
          set nbLigne = 0
          set time1 = $zh
          while (FileReader.AtEnd = 0) {
               set len = 32000
               set (strBuffer, eol) = ""
               set strBuffer = FileReader.ReadLine(.len, .tSC, .eol)
               if $$$ISERR(tSC) { quit }
               // do something with strBuffer
          }
          quit:$$$ISERR(tSC)
          set time2 = $zh
          set diff = time2 - time1
          write "execution: "_diff, !
     } catch (SysEx) {
          set tSC = SysEx.AsStatus()
     }
     if (($data(FileReader)>0) && (FileReader'="")) {
          do FileReader.Close()
     }
     quit tSC
}

Result was disappointing

USER>W ##class(JHU.Test).ReadFile("C:\Temp\GigaFile.csv")
execution: 892.108104s
1

Almost 15 minutes !!!

Using the code from @Robert Cemper (thank you very much), the method becomes:

/// Read quick
ClassMethod ReadQuick(strINReadFile As %String = "") As %Status
{
     #dim tSC As %Library.Status = $$$OK
     #dim SysEx As %Exception.AbstractException
     try {
          open strINReadFile::1 
          else  set tSC=$$$ERROR($$$GeneralError, "Missing File") quit
          set eof=##class(%SYSTEM.Process).SetZEOF(1)
          use strINReadFile
          set time1=$zh 
          for line=0:1 {
               read strBuffer if $zeof set diff=$zh-time1 quit
               // do something with strBuffer
           }
          close strINReadFile
          do ##class(%SYSTEM.Process).SetZEOF(eof)
          write !,"execution: "_diff,!,"lines: ",line,!
     } catch (SysEx) {
          set tSC = SysEx.AsStatus()
     }
     quit tSC
}

Results are

USER>W ##class(JHU.Test).ReadQuick("C:\Temp\GigaFile.csv")
 
execution: 10.047812
lines: 5000000
1

The same file parsing in Python would be

from datetime import datetime

class Test1:

    def ReadFile(self,strINFileName="") :
        if strINFileName=="":
            print("file name is empty")
            return
        file = open(strINFileName,"r")
        atEnd = False
        time1 = round(datetime.timestamp(datetime.now()) * 1000)
        while not atEnd:
            line=file.readline()
            if not line :
                atEnd = True
        time2 = round(datetime.timestamp(datetime.now()) * 1000)
        file.close()
        print("Execution: ",((time2-time1)/1000),"s")

Result was far beyond expectation

obj = Test1()
obj.ReadFile(r"C:\Temp\GigaFile.csv")
Execution:  5.222 s
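
The same loop can also be written with Python's file iteration and a monotonic timer. This is only a sketch: the function name count_lines, the UTF-8 encoding and the errors="replace" setting are assumptions for illustration and were not part of the measurement above.

```python
import time


def count_lines(path):
    """Return (line_count, elapsed_seconds) for a text file."""
    start = time.perf_counter()
    lines = 0
    # Assumption: the file is UTF-8; undecodable bytes are replaced, not fatal.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        for line in f:  # the file object yields one line per iteration
            lines += 1  # do something with line
    return lines, time.perf_counter() - start


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        n, seconds = count_lines(sys.argv[1])
        print(f"Execution: {seconds:.3f} s, lines: {n}")
```

The with statement closes the file even if an error occurs in the loop, and time.perf_counter() avoids the rounding done by the timestamp arithmetic.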

So I wanted to parse the huge file from ObjectScript, but using Python.

IRIS 2021.1 comes with an interface for external languages, including Python; see "Working with External Languages" in the documentation.

The call for Python Gateway using $system.external Interface is:

/// Read File using Python
ClassMethod ReadFileWithPython(strINFilename As %String = "")
{
  #dim tSC As %Library.Status = $$$OK
  #dim SysEx As %Exception.AbstractException
  try {
      set gateway = $system.external.getPythonGateway()
      do gateway.addToPath("C:\Projet\Python\test1.py")
      set fooProxy = gateway.new("test1.Test1")
      do fooProxy.ReadFile(strINFilename)
   } catch (SysEx) {
      set tSC = SysEx.AsStatus()
   }
   if $$$ISERR(tSC) { write $system.Status.GetErrorText(tSC), ! }
}

Result is as expected

USER>do ##class(JHU.Test).ReadFileWithPython("C:\Temp\GigaFile.csv")
Execution:  4.387 s

In fact, it makes ObjectScript more attractive and makes me want to learn more Python.

I'm also looking forward to Embedded Python within an ObjectScript class or ClassMethod.
As an example, see the excellent article from @Henry Pereira.

Hi, I wrote an article for the tech article competition called "Why I love ObjectScript and why I think I might love Python more". It discusses my 35 years of working with MUMPS, then joining ISC as Cache was introduced, and my journey through all of the major stages in the evolution of Cache through to IRIS, as well as of ISC as a company.

One of the problems for many companies that use IRIS is the shortage of experienced ObjectScript developers. On the one hand that is advantageous for those of us who have that experience, but on the other hand it means that I am part of a very narrow niche of specialists.

Python is an extremely popular language. It ranks #1 on many lists of top programming languages. It is an interpreted language just like ObjectScript and, most importantly, it can handle sparse multidimensional arrays, i.e. globals, so it is a natural fit to add to our catalogue of language support. R and Julia are the other two. Python has a vast library of programs and functions covering everything from the mundane through to complex mathematical modelling and data science. My view is that for IRIS developers this is just another option available to us, along with ObjectScript, SQL and Direct Global Access, that we can deploy in the applications we develop, using the tool that best suits the task we are solving.

ISC have embedded other languages such as Basic and MVBasic into their platforms with varying degrees of success. I was around when we were trying to convert Microsoft developers to Cache, and likewise MVBasic for PICK developers. At the time it made sense to offer scripting languages that made the transition to Cache more attractive, but the ultimate aim was to get those developers to adopt ObjectScript. We did the same with support for TSQL for SQL Server users.

Where Python, R and Julia are different is that there is no intention to convert developers in those languages to ObjectScript. Rather, the aim is to introduce these languages to handle the sorts of logic and processing that ObjectScript is not so good at. Rather than extending ObjectScript to handle the mathematical underpinnings of Data Science, ML and AI, it made far more sense to bring the languages with a proven track record in those areas into our stable of Native Language Support.

Nigel 

Almost 15 minutes !!!

What about a second pass with the same ObjectScript code without restarting IRIS? 1.7 GB is not too much to be held in the file system cache, so it should be about 5 times quicker. Besides, it's well known that Cache (IRIS) is not as good a record reader as it is a block reader. Everybody knows that %GIF is 5-10 times quicker than %GI. I have the same experience with my own developments: processing files at the block level (read block#$$$BLOCKSIZE) and parsing records at the COS level is quicker than reading the file record by record. `5*5=25` is close to the time difference between your ObjectScript and Python code runs.

I've written this just to emphasize that we should love Python mostly for the power of its ecosystem, as Nigel wrote. As to speed, it would be better if you InterSystems guys did something to improve record-level file I/O operations in the IRIS engine...
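
One way to check the cold versus warm cache effect mentioned above is to time consecutive passes over the same file. The following is a rough sketch in Python; the names timed_read and compare_passes and the 1 MiB chunk size are assumptions chosen for illustration, and the first pass is only "cold" if the file has not been read recently.

```python
import time


def timed_read(path, chunk_size=1024 * 1024):
    """Read the whole file in binary chunks; return (bytes_read, seconds)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_passes(path, passes=2):
    """Run several consecutive passes so their timings can be compared."""
    return [timed_read(path) for _ in range(passes)]


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        for i, (size, seconds) in enumerate(compare_passes(sys.argv[1]), 1):
            print(f"pass {i}: {size} bytes in {seconds:.3f} s")
```

If the later passes are much faster than the first, the difference comes from the file system cache rather than from the reading code.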

@Alexey Maslov you see with some tweaking the factor is at 300. 
It was initially even faster but hard to read.

This proves my observation that some %-classes are like American cars:

Not for speed, just for comfort!

Hi @Jack Huser,
I assume you would agree it is fair to compare apples to apples
but not horse coaches to formula-1 cars.


Being proud that I have never lost a speed benchmark, I rewrote your code,
which is nice to read and maintain but not very efficient, into something
speedier to show the limits, and then checked it against your class.
My test file is only 47 MB and contains 181566 lines.

DEMO>write ##class(JUH.Test).ReadFile(file)
execution: 214.931024
1
DEMO>write ##class(JUH.Test).ReadQuick(file)
execution: .753696
lines: 181566
1
DEMO>write .753696/214.931024*100_" %"
.3506687801385062028 %

I think 0.35% is quite an eye catcher.

And this is the method:

ClassMethod ReadQuick(strINReadFile As %String = "") As %Status
 {
  open strINReadFile::1 
  else  write "Missing File",! quit '$$$OK 
  set eof=##class(%SYSTEM.Process).SetZEOF(1)
  use strINReadFile
  set time1=$zh 
  for line=0:1 {
    read strBuffer if $zeof set diff=$zh-time1 quit
    // do something with strBuffer
  }
  close strINReadFile
  do ##class(%SYSTEM.Process).SetZEOF(eof)
  write !,"execution: "_diff,!,"lines: ",line,!
  quit $$$OK
 }

I just couldn't resist my nature.

Thank you very much, I never knew it was possible to read a file that way.
I tried too with MUMPS doing "open for {use read}" but had the same result. Thank you very much.

The use inside the for loop makes it slower.

I never knew about SetZEOF(), I would have done it in a try/catch and in the catch check to see if

catch ex {set tSC= $s($system.Status.GetErrorText(ex.AsStatus())["<ENDOFFILE>":$$$OK,1:ex.AsStatus())}

Quit tSC

And if I were still using $ztrap then I would test $ze["<ENDOFFILE>"

But I don't use $ztrap anymore 

Nigel

As a matter of interest, why do people use #dim? The only time I use #dim is when I am using $classmethod() to reference a class name that I will only know at runtime. For example, I may have several classes that all inherit from some superclass, where the subclasses are to all intents and purposes identical, and I want IntelliSense in whichever IDE I am using to list the possible properties or methods, because I use long property names which I don't always remember when I come to reference those properties after the $classmethod(). Once I have written the code and tested it, I remove or comment out the #dim. Is there any other compelling reason to use it?

I have been reading a PDF called the Zen of Python, which I am going to upload in a separate article post. The article lists all of the language constructs in Python, and as I was working my way through the document it occurred to me to write a corresponding Zen of ObjectScript, which would give the ObjectScript equivalent of each Python example. By and large, much of Python is eerily similar to ObjectScript, but then I came across some things in Python where two Python statements do some really weird thing, and to mimic the behaviour in ObjectScript would take several lines of code and in all likelihood would be slower than the Python code. I'm tempted to whet your appetite here, but you'll have to wait until I've done the entire article. I have to confess that some of the Python functionality is very clever from an ideological point of view, but I struggled to think of a scenario where I would need such functionality in real-life code. But there again, one of the examples of how ObjectScript treats strings that I have shown 10000 times over the years is the statement

Write "15Apples" + "25Pears"

40

I use it to explain how ObjectScript treats the expression as a numeric 'plus' on the two strings, and I giggle quietly as the classroom of developers who know Basic or JS expect the result to be either a concatenation or a data type mismatch. But I often use the expression

If '+variable, where the variable is either a non-zero number or "", especially if the variable is passed as a parameter that defaults to "" in the method's parameter list, to avoid having to use $get().
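
For readers coming from Python, the numeric interpretation described above can be approximated as follows. This is a simplified sketch, not ObjectScript's full rule: the helper name numeric_value is hypothetical, and it only handles an optional sign, digits and a decimal point (no exponents, no repeated signs).

```python
import re

# Leading numeric part: optional sign, then digits with an optional fraction.
_LEADING_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def numeric_value(text):
    """Approximate the numeric value of a string: its leading number, else 0."""
    match = _LEADING_NUMBER.match(text)
    if not match:
        return 0
    value = float(match.group(0))
    return int(value) if value.is_integer() else value


print(numeric_value("15Apples") + numeric_value("25Pears"))  # 40
print(not numeric_value(""))  # True, like the '+variable test on ""
```

An empty or non-numeric string evaluates to 0, which is why the '+variable test catches both "" and 0.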

Oh OK, I give in. Write the following statements in a Python program:

import pprint

message = 'It was a bright cold day in April, and the clocks were striking thirteen.'
count = {}
for character in message:
    count.setdefault(character, 0)
    count[character] = count[character] + 1
pprint.pprint(count)

Note that the message string, including thirteen.', must stay on a single line.

It is a nice piece of functionality, and I suspect I could write it in two lines of ObjectScript. Though I have not tested the execution time, I suspect that ObjectScript would be fractionally faster, which would matter if the message were a far longer string.
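
For comparison, Python's standard library can produce the same character count more compactly with collections.Counter. The sketch below is an alternative to the setdefault loop above, not the version from the PDF, and the function name char_counts is made up for illustration.

```python
import pprint
from collections import Counter


def char_counts(message):
    """Return a plain dict mapping each character to its number of occurrences."""
    return dict(Counter(message))


message = 'It was a bright cold day in April, and the clocks were striking thirteen.'
pprint.pprint(char_counts(message))
```

Counter does the same bookkeeping as the setdefault loop, so the two versions return equal dictionaries.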

I also will insert my five kopecks.

  • "The %Library package also includes stream classes, but those are deprecated. The class library includes additional stream classes, but those are not intended for general use."
    Working with Streams
    In my tests, %Stream.FileCharacter was an order of magnitude faster than %[Library.]File.
  • If you rewrite the line-by-line reading to read blocks and then parse the lines, the speed will increase by another order of magnitude.
     
    Sample
  • On the Internet, you can find a lot of material comparing the speed of reading files (in particular CSV) in different programming languages (Python, C/C++, R, C#, Java, etc.), for example (this is a machine translation). Those who make such comparisons often do not know all these languages equally well, so odd results sometimes occur.
     
    Who do you think was faster in the article above when reading 1e7+ lines: Fortran or C++?
  • If we approach the issue formally, the advantage goes to compiled languages rather than interpreted ones, and to the implementation that uses all the capabilities of the operating system and hardware.
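
The block-reading idea from the second point can be illustrated in Python as follows. This is a minimal sketch, not the linked sample: the name iter_lines_blockwise and the 1 MiB default block size are assumptions, and lines are returned as bytes without decoding.

```python
def iter_lines_blockwise(path, block_size=1 << 20):
    """Yield lines (bytes, without line terminators) by reading fixed-size blocks."""
    remainder = b""
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:  # end of file
                break
            parts = (remainder + block).split(b"\n")
            # The last piece may be an incomplete line; keep it for the next block.
            remainder = parts.pop()
            for part in parts:
                yield part.rstrip(b"\r")  # tolerate Windows CRLF line endings
    if remainder:
        # Final line without a trailing newline.
        yield remainder.rstrip(b"\r")


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        print(sum(1 for _ in iter_lines_blockwise(sys.argv[1])), "lines")
```

Each read call fetches a large block, and the split into lines happens in memory, which is the pattern the comment describes for ObjectScript as well.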

Fortran would be my guess. Fortran is getting a bit of a makeover and there is renewed interest in it. I was reading somewhere the other day that Fortran and some of the earliest languages are making a comeback because of their low-level interaction with devices and hardware. It is an art we have long forgotten, but there is a renewed desire for people to get down to the bits and bytes, and Fortran is one way to do it.

Sorry to say, but your comparison has some sore points:

a) The very first mistake is, you are comparing programing languages!
This is a disputable attempt, because each programming language was created with a specific aim (i.e. use case) in mind, so there are very few languages which can be compared to each other. ObjectScript was created as an Operating-System-and-Database-and-Programming-Language. Nowadays, it's a Database-and-Programming-Language. Python is just a programming language!

b) The second mistake is, comparing methods with the same name (one may think, they do the same thing) but they have different internal behaviour.

Your Python code reads one line from the file (i.e. characters until the next LF):   
  

line = file.readline()

but your ObjectScript code:

set strBuffer = FileReader.ReadLine(.len, .tSC, .eol)

reads a chunk of data from a stream and tries to extract a line from this chunk (take a look at the source code!). This is (probably) where you lose most of the time.
   
The more comparable statement should have been:

read strBuffer

assuming, the file was opened as   

open filename:"R":0

c) Apart from the OS, CPU and disk (HDD, SSD), about which there is no information, did you make both runs under the same buffer state (cold or warm)?

In my opinion you're comparing apples with oranges.

I apologize for the misunderstanding. My goal wasn't to compare two languages, but rather to use the external languages interface in ObjectScript ;)

Hello Everyone,

Thank you for all your really interesting comments. It helped a lot, especially on using ObjectScript classes I wasn't aware of.
But there could be a misunderstanding on the subject of this article.

The goal was not to compare several languages, but rather: "I did it with ObjectScript, and the performance of my approach did not match my expectations, so I would like to use another language I know that does match them".
And to do so, until this language is embedded in ObjectScript, I will use the external language interface.

However, even if I had known about %Stream.FileCharacter, I would still have done it in Python so that I could use the $system.external interface :)