YURI MARX GOMES · Nov 14, 2020 11m read

Using Tesseract OCR and Java Gateway

InterSystems IRIS can be extended with Java or .NET components and their frameworks, called from ObjectScript source code.

I created an application called OCR Service. It is built with Docker and installs Google Tesseract inside the Docker instance, configured with the English and Portuguese languages, but it is possible to install more than 100 other languages. Google Tesseract receives images and returns the text extracted from them using OCR. The results are very good with the trained languages, and you can also train Tesseract to read car plates or any other textual pattern and load that model to extract text. Java has a framework called Tess4J that lets Java call Tesseract instances and functions. See it running:

OCR in Action

 

Here is the Java code for the OCR service:

private String extractTextFromImage(File tempFile) throws TesseractException {
 
        ITesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tessdata/"); //directory to trained models
        tesseract.setLanguage("eng+por"); // choose your language/trained model
 
        return tesseract.doOCR(tempFile); //call tesseract function doOCR() 
                                          //passing the file to be processed with OCR technique
 
    }

 

Tess4J takes care of all the low-level communication with Tesseract through the ITesseract interface; we only need to set the OCR model to be used and call doOCR(), passing the file. Easy!
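The language string passed to setLanguage() joins Tesseract language codes with `+`, and each code needs a matching `<code>.traineddata` file under the data path. As a minimal sketch, assuming a hypothetical helper class TessLanguages (not part of Tess4J or of the OCR Service code), the string could be built from the models that are actually present:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration; not part of Tess4J or the OCR Service code.
class TessLanguages {

    /** Keeps the requested codes that have a "<code>.traineddata" file and joins them with '+'. */
    static String languageSpec(Collection<String> fileNames, List<String> wanted) {
        List<String> found = new ArrayList<>();
        for (String lang : wanted) {
            if (fileNames.contains(lang + ".traineddata")) {
                found.add(lang);
            }
        }
        if (found.isEmpty()) {
            throw new IllegalStateException("No trained model found for " + wanted);
        }
        return String.join("+", found);
    }

    /** Same check against a real tessdata directory. */
    static String languageSpec(File tessdataDir, List<String> wanted) {
        String[] names = tessdataDir.list(); // null if the directory does not exist
        List<String> files = names == null
                ? Collections.<String>emptyList()
                : Arrays.asList(names);
        return languageSpec(files, wanted);
    }

    public static void main(String[] args) {
        // Data path taken from the setDatapath() call in the article.
        File tessdata = new File("/usr/share/tessdata/");
        System.out.println(languageSpec(tessdata, Arrays.asList("eng", "por")));
    }
}
```

The returned value can be handed to tesseract.setLanguage(...), so a missing model is reported before doOCR() is called.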

Luckily, we don't have to create a Tess4ObjectScript framework to interact with Tesseract. It would be possible using C callin/callout, but you would need to understand how to talk to Tesseract at a low level. I prefer to use Java Gateway and save many days of work!

To use Java Gateway, you have to install Java in the OS or Docker instance that runs IRIS, set JAVA_HOME (the variable that tells where Java is installed), set CLASSPATH (the variable that points to your Java classes, the Tess4J archive and other dependencies), and start a Java Gateway instance, a proxy that lets ObjectScript and Java talk to each other. Here are my Dockerfile instructions to install Java and Tesseract in the IRIS instance:

ARG IMAGE=store/intersystems/iris-community:2020.1.0.204.0
ARG IMAGE=intersystemsdc/iris-community:2020.1.0.209.0-zpm
ARG IMAGE=intersystemsdc/iris-community:2020.2.0.204.0-zpm
ARG IMAGE=intersystemsdc/irishealth-community:2020.3.0.200.0-zpm
ARG IMAGE=intersystemsdc/iris-community:2020.3.0.200.0-zpm
ARG IMAGE=intersystemsdc/iris-community:2020.3.0.221.0-zpm
ARG IMAGE=intersystemsdc/iris-community:2020.4.0.521.0-zpm
FROM $IMAGE
 
USER root   
        
WORKDIR /opt/irisapp
RUN chown ${ISC_PACKAGE_MGRUSER}:${ISC_PACKAGE_IRISGROUP} /opt/irisapp
USER ${ISC_PACKAGE_MGRUSER}
 
COPY  Installer.cls .
COPY  src src
 
COPY iris.script /tmp/iris.script
 
RUN iris start IRIS \
    && iris session IRIS < /tmp/iris.script \
    && iris stop IRIS quietly
 
USER root   
 
# Install Java 8 using apt-get from ubuntu repository
RUN apt-get update && \
    apt-get install -y openjdk-8-jdk && \
    apt-get install -y ant && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    rm -rf /var/cache/oracle-jdk8-installer;
    
# Fix certificate issues, found as of 
RUN apt-get install -y ca-certificates-java && \
    apt-get clean && \
    update-ca-certificates -f && \
    rm -rf /var/lib/apt/lists/* && \
    rm -rf /var/cache/oracle-jdk8-installer;
 
# Setup JAVA_HOME, to enable apps to know where the Java was installed
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JAVA_HOME
ENV JRE_HOME /usr/lib/jvm/java-8-openjdk-amd64/
RUN export JRE_HOME
 
# Set up the classpath, so apps know where the Java classes and jar libraries are installed
ENV classpath .:/usr/irissys/dev/java/lib/JDK18/*:/opt/irisapp/*:/usr/irissys/dev/java/lib/gson/*:/usr/irissys/dev/java/lib/jackson/*:/jgw/*
RUN export classpath
ENV CLASSPATH .:/usr/irissys/dev/java/lib/JDK18/*:/opt/irisapp/*:/usr/irissys/dev/java/lib/gson/*:/usr/irissys/dev/java/lib/jackson/*:/jgw/*
RUN export CLASSPATH
 
USER root
 
ARG APP_HOME=/tmp/app
 
COPY src $APP_HOME/src
 
# Tess4J and the other Java libraries used are in the jgw folder, and the jgw folder is on the classpath
COPY jgw /jgw
 
# Copy our Java OCR program, packaged into a jar, to the jgw
COPY target/ocr-pex-1.0.0.jar /jgw/ocr-pex-1.0.0.jar  
 
COPY jgw/* /usr/irissys/dev/java/lib/JDK18/
 
# Install tesseract using ubuntu apt-get
RUN apt-get update && apt-get install tesseract-ocr -y
 
USER root
 
# Copy trained models eng and por to the models folder
COPY tessdata /usr/share/tessdata
 
# Install and configure the default OS locale; Tesseract requires it to work properly
RUN apt-get update && apt-get install -y locales && rm -rf /var/lib/apt/lists/* \
 && locale-gen "en_US.UTF-8"
ENV LANG=en_US.UTF-8 \
    LANGUAGE=en_US:en \
    LC_ALL=en_US.UTF-8
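
Before starting the Java Gateway, it can help to confirm that the classes it will need are really visible on the CLASSPATH defined above. What follows is a minimal sketch, assuming a hypothetical diagnostic class ClasspathCheck that is not part of the OCR Service code. net.sourceforge.tess4j.Tesseract is the Tess4J class used earlier, and community.intersystems.pex.ocr.OcrOperation is the class name that the ObjectScript service loads through the gateway.

```java
// Hypothetical diagnostic for illustration; not part of the OCR Service code.
class ClasspathCheck {

    /** Returns true if the named class can be located on the classpath (without initializing it). */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Not found, or found but one of its dependencies is missing.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("JAVA_HOME       = " + System.getenv("JAVA_HOME"));
        System.out.println("java.class.path = " + System.getProperty("java.class.path"));

        String[] required = {
            "net.sourceforge.tess4j.Tesseract",
            "community.intersystems.pex.ocr.OcrOperation"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isClassAvailable(name) ? "found" : "MISSING"));
        }
    }
}
```

Compiled and run inside the container (javac ClasspathCheck.java, then java ClasspathCheck), it lists which of the two classes the JVM can find. A MISSING line usually points to a jar that was not copied or to a CLASSPATH entry that does not match.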

 

When your ObjectScript class is an interoperability business service or business operation, add a Java Gateway business service to the production (the JavaGateway item in my production).

Class dc.ocr.OcrProduction Extends Ens.Production
{
 
XData ProductionDefinition
{
<Production Name="dc.ocr.OcrProduction" LogGeneralTraceEvents="false">
  <Description></Description>
  <ActorPoolSize>2</ActorPoolSize>
  <Item Name="OcrService" Category="" ClassName="dc.ocr.OcrService" PoolSize="1" Enabled="true" 
Foreground="false" Comment="" LogTraceEvents="false" Schedule="">
  </Item>
  <Item Name="JavaGateway" Category="" ClassName="EnsLib.JavaGateway.Service" PoolSize="1" 
Enabled="true" Foreground="false" Comment="" LogTraceEvents="false" Schedule="">
    <Setting Target="Host" Name="ClassPath">.:/usr/irissys/dev/java/lib/JDK18/*:/opt/irisapp/*:/usr/irissys/dev/java/lib/gson/*:/usr/irissys/dev/java/lib/jackson/*:/jgw/ocr-pex-1.0.0.jar</Setting>
    <Setting Target="Host" Name="JavaHome">/usr/lib/jvm/java-8-openjdk-amd64/</Setting>
  </Item>
</Production>
}
 
}

The production also contains our ObjectScript OcrService. This ObjectScript class receives the file uploaded in an HTTP multipart request, uses the Java Gateway to load the Java class, passes it the file, gets the text from the OCR process, and returns the response. See:

Class dc.ocr.OcrService Extends Ens.BusinessService
{
 
// Extends Ens.BusinessService to create a custom business service using ObjectScript
 
// This class receives a file from a multipart HTTP request and saves it
// to the folder configured in the Folder setting
 
// Choose an adapter to get data from a data source
 
// EnsLib.HTTP.InboundAdapter lets you get data from an HTTP request
 
Parameter ADAPTER = "EnsLib.HTTP.InboundAdapter";
 
// Custom setting that lets the production user choose the destination folder for the uploaded multipart file
 
Property Folder As %String(MAXLEN = 100);
 
// When you add the Folder property to the SETTINGS parameter, the IRIS production interface
// creates a field for the user to fill in,
// so the user can provide the host path for the uploaded file
 
Parameter SETTINGS = "Folder,Basic";
 
// This method is mandatory for a business service. It receives the multipart file in pInput
 
// and returns a result to the caller using pOutput
 
Method OnProcessInput(pInput As %GlobalBinaryStream, pOutput As %RegisteredObject) As %Status
{
    //try to do the actions
    try {
        //create a MIMEReader to extract files from multipart requests
        Set reader = ##class(%Net.MIMEReader).%New()
        Do reader.OpenStream(pInput) //the reader opens the uploaded stream
 
        Set tSC = reader.ReadMIMEMessage(.message) //the reader puts the uploaded file into a MIME message
        //GetHeader obtains headers from the request and the multipart file, such as Content-Type or
        //Content-Disposition. The Content-Disposition header has 3 parts:
        //Content-Disposition: form-data; name="file"; filename="filename.ext"
        //$PIECE splits the header on ";" and takes the third part
        Set filenameHeader = $PIECE(message.GetHeader("CONTENT-DISPOSITION", .header),";",3)
        //get the filename value: skip the leading filename=" prefix and drop the closing quote
        Set filename = $EXTRACT(filenameHeader, 12, $LENGTH(filenameHeader)-1)
        //The headers are no longer needed. Clear them so only the file content remains to be saved
        Do message.ClearHeaders()
 
        //create a file object to save the multipart file
        Set file=##class(%Stream.FileBinary).%New()
        //point the file to the folder given in the Folder setting, plus the uploaded filename from the header
        Set file.Filename=..Folder_filename 
        //save body message (the file content) to file object
        Do file.CopyFromAndSave(message.Body)
 
        // Connect a Gateway instance to server JavaGate on the host machine
        set GW = ##class(%Net.Remote.Gateway).%New()
        set st = GW.%Connect("127.0.0.1", "55555", "IRISAPP",,)
        //instantiate java ocr class
        set proxyOcr = ##class(%Net.Remote.Object).%New(GW,"community.intersystems.pex.ocr.OcrOperation")
        //call ocr method to get text from image
        set pResponse = proxyOcr.doOcr(file.Filename)
        //returns to the service
        Set pOutput = pResponse
        Set tSC=$$$OK
    
    //returns error message to the user
    } catch e {
        Set tSC=e.AsStatus()
        Set pOutput = tSC
    }
 
    Quit tSC
}
 
}
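
The service above takes the third `;`-separated piece of the Content-Disposition header and skips a fixed number of characters, which assumes that name always comes before filename and that the value is quoted. As a hedged illustration of a more tolerant approach, here is a minimal Java sketch with a hypothetical ContentDisposition helper (not part of the OCR Service code) that finds the filename parameter wherever it appears and drops any directory part before the name is appended to a folder:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the OCR Service code.
class ContentDisposition {

    // Matches filename="value" or filename=value at the start of the header or after a ';'.
    private static final Pattern FILENAME = Pattern.compile(
            "(?:^|;)\\s*filename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))",
            Pattern.CASE_INSENSITIVE);

    /** Returns the bare file name from a Content-Disposition value, or null if there is none. */
    static String extractFilename(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = FILENAME.matcher(headerValue);
        if (!m.find()) {
            return null;
        }
        String value = m.group(1) != null ? m.group(1) : m.group(2);
        // Drop any directory part so the name cannot point outside the upload folder.
        int cut = Math.max(value.lastIndexOf('/'), value.lastIndexOf('\\'));
        String name = value.substring(cut + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            return null;
        }
        return name;
    }
}
```

This sketch does not handle the encoded filename*= form, so a header that carries only that form returns null.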

Now IRIS can do OCR on images and PDFs, and with the Java Gateway it can do many other amazing things. See all the details in my app code: https://openexchange.intersystems.com/package/OCR-Service.

 

 


Replies

Hi @YURI MARX GOMES ,

Great!
I must test this app.

I'm working on a document converter tool; perhaps I can integrate OCR with your app.

Thanks @Lorenzo Scalese, Tesseract is the most used OCR tool in the market, it will be very useful to you.