Question
· Sep 15, 2022

How to read a PDF file which has images in it

Hi team,

I have a PDF file which has some report data and images (like X-rays) in it. How can I read the text from a PDF file in Caché?

Thanks

Akshay

$ZV: IRIS for Windows (x86-64) 2020.1 (Build 215U) Mon Mar 30 2020 20:14:33 EDT [HealthConnect:2.1.0]
Discussion (5)

If the PDF has a text layer, use LibreOffice to convert it to txt (there is an InterSystems IRIS wrapper). For OCR you'll need a third-party tool; for example, Tesseract can be easily used with Embedded Python.

UPD: Unfortunately, LibreOffice can't extract text from PDFs. Here's an Embedded Python solution:

Class User.PDF
{

/// zw ##class(User.PDF).GetText("/tmp/example.pdf", .text)
ClassMethod GetText(file, Output text) As %Status
{
  try {
    #dim sc As %Status = $$$OK
    kill text
    // Packages installed under <mgr>/python are importable from Embedded Python
    set dir = $system.Util.ManagerDirectory()_"python"
    do ##class(%File).CreateDirectoryChain(dir)
    // pip3 install --target /data/db/mgr/python --ignore-requires-python typing==3.10.0.0
    try {
      set pypdf2 = $system.Python.Import("PyPDF2")
    } catch {
      set cmd = "pip3"
      set args($i(args)) = "install"
      set args($i(args)) = "--target"
      set args($i(args)) = dir
      set args($i(args)) = "PyPDF2==2.10.0"
      set args($i(args)) = "dataclasses"
      set args($i(args)) = "typing-extensions==3.10.0.1" 
      set args($i(args)) = "--upgrade"
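      // $ZF(-100) returns the process exit code, not a %Status, so keep it out of sc
      set rc = $ZF(-100, "", cmd, .args)
      return:rc'=0 $$$ERROR($$$GeneralError, "pip3 install failed with exit code "_rc)
      set pypdf2 = $system.Python.Import("PyPDF2")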
    }
    return:'$d(pypdf2) $$$ERROR($$$GeneralError, "Unable to load PyPDF2")
    kill pypdf2
    set text = ..GetTextPy(file)
  } catch ex {
    set sc = ex.AsStatus()
  }
  quit sc
}

ClassMethod GetTextPy(file) [ Language = python ]
{
  from PyPDF2 import PdfReader

  # Reads the text layer only; text that exists only inside images is not recognized
  reader = PdfReader(file)
  text = ""
  for page in reader.pages:
    text += page.extract_text() + "\n"

  return text
}

}
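
For pages that are just scanned images, the text-layer approach above returns little or nothing, so the OCR route mentioned earlier is needed. Here is a rough sketch of how that fallback might look in Python. It assumes the `pytesseract` and `pdf2image` packages are installed (together with the Tesseract and Poppler binaries they depend on); `needs_ocr`, `get_text_with_ocr` and the 20-character threshold are made-up names and values for illustration, not part of any library. I have not tried it on real report PDFs, so treat it as a starting point.

```python
def needs_ocr(page_text, min_chars=20):
    """Heuristic (an assumption): a page whose text layer is nearly empty is image-only."""
    return len((page_text or "").strip()) < min_chars


def get_text_with_ocr(file, dpi=300):
    """Return the text of every page, using Tesseract OCR for image-only pages."""
    # Imported lazily so this module still loads where the packages are missing
    from PyPDF2 import PdfReader
    from pdf2image import convert_from_path
    import pytesseract

    reader = PdfReader(file)
    parts = []
    for number, page in enumerate(reader.pages, start=1):
        text = page.extract_text() or ""
        if needs_ocr(text):
            # Render only this page to an image, then run OCR on it
            images = convert_from_path(file, dpi=dpi, first_page=number, last_page=number)
            text = "\n".join(pytesseract.image_to_string(img) for img in images)
        parts.append(text)
    return "\n".join(parts)
```

The body of `get_text_with_ocr` could be dropped into a `[ Language = python ]` method like `GetTextPy` above. Keep in mind that OCR only recovers text that is printed in the scan; an X-ray image itself will not produce anything useful.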