How to read a pdf file which has images in it
Hi team,
I have a PDF file which has some report data and images (like X-rays) in it. How can I read text from a PDF file in Caché?
Thanks
Akshay
$ZV: IRIS for Windows (x86-64) 2020.1 (Build 215U) Mon Mar 30 2020 20:14:33 EDT [HealthConnect:2.1.0]
What does read mean?
@Eduard Lebedyuk
1. I have a PDF file which I need to read from a folder location as text, put the data from the PDF into an HL7 message, and send it to a downstream system.
2. I have a PDF file which I need to read from a folder location, encode in base64, and put in OBX.5 of an MDM message.
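For the second requirement, the base64 step can be sketched roughly as below. This is only an illustration in plain Python, not a HealthConnect production component: the function names, the `PDF^Report` observation identifier, and the `application`/`pdf` type values in the ED field are assumptions made for the example, so check them against what the downstream system expects.

```python
import base64


def encode_pdf_bytes(data: bytes) -> str:
    """Return the base64 text for raw PDF bytes (single line, no line breaks)."""
    return base64.b64encode(data).decode("ascii")


def read_pdf_as_base64(path: str) -> str:
    """Read a PDF from disk in binary mode and base64-encode it."""
    with open(path, "rb") as handle:
        return encode_pdf_bytes(handle.read())


def build_obx_segment(b64_data: str, set_id: int = 1) -> str:
    """Build an OBX segment carrying the document as an ED value in OBX-5.

    ED components: source application ^ type of data ^ data subtype ^ encoding ^ data.
    The identifier and type values below are placeholders (assumptions).
    """
    # The base64 alphabet (A-Z, a-z, 0-9, +, /, =) contains none of the
    # default HL7 delimiters (| ^ ~ \ &), so no escaping is needed here.
    obx5 = "^".join(["", "application", "pdf", "Base64", b64_data])
    fields = ["OBX", str(set_id), "ED", "PDF^Report", "", obx5,
              "", "", "", "", "", "F"]  # OBX-11 = F (final)
    return "|".join(fields)
```

Usage would be something like `build_obx_segment(read_pdf_as_base64("/tmp/example.pdf"))`, with the resulting segment added to the MDM message by whatever builds the rest of it.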
Do you mean OCR/text layer extraction?
Do it like this.
Do you mean OCR/text layer extraction? Yes.
If there's a text layer, use LibreOffice to convert to txt (InterSystems IRIS wrapper); for OCR you'll need some third-party tool, for example Tesseract, which can be easily used with Embedded Python.
UPD: LibreOffice can't extract text from PDFs, unfortunately. Here's an Embedded Python solution:
Class User.PDF
{

/// zw ##class(User.PDF).GetText("/tmp/example.pdf", .text)
ClassMethod GetText(file, Output text) As %Status
{
    try {
        #dim sc As %Status = $$$OK
        kill text
        set dir = $system.Util.ManagerDirectory()_ "python"
        do ##class(%File).CreateDirectoryChain(dir)

        // pip3 install --target /data/db/mgr/python --ignore-requires-python typing==3.10.0.0
        try {
            set pypdf2 = $system.Python.Import("PyPDF2")
        } catch {
            // PyPDF2 is not available yet: install it into <mgr>/python and retry the import
            set cmd = "pip3"
            set args($i(args)) = "install"
            set args($i(args)) = "--target"
            set args($i(args)) = dir
            set args($i(args)) = "PyPDF2==2.10.0"
            set args($i(args)) = "dataclasses"
            set args($i(args)) = "typing-extensions==3.10.0.1"
            set args($i(args)) = "--upgrade"
            // $ZF(-100) returns the process exit code, not a %Status, so it is kept out of sc
            set rc = $ZF(-100,"", cmd, .args)
            set pypdf2 = $system.Python.Import("PyPDF2")
        }
        return:'$d(pypdf2) $$$ERROR($$$GeneralError, "Unable to load PyPDF2")
        kill pypdf2

        set text = ..GetTextPy(file)
    } catch ex {
        set sc = ex.AsStatus()
    }
    quit sc
}

/// Extract the text layer of every page and join the pages with newlines
ClassMethod GetTextPy(file) [ Language = python ]
{
    from PyPDF2 import PdfReader

    reader = PdfReader(file)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text
}

}
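Since the PDF in question also contains images, some pages may have no text layer at all. A possible refinement, sketched here as an assumption rather than as part of the class above, is to collect the page text defensively and report which pages came back empty, so that only those pages need to go to an OCR tool. `collect_page_text` is a hypothetical helper; it only assumes each page object has an `extract_text()` method, as PyPDF2 page objects do.

```python
from typing import Iterable, List, Tuple


def collect_page_text(pages: Iterable) -> Tuple[str, List[int]]:
    """Join the text of all pages and list the 1-based numbers of empty pages.

    `pages` is any iterable of objects exposing extract_text(),
    for example PdfReader(path).pages from PyPDF2.
    """
    parts: List[str] = []
    empty_pages: List[int] = []
    for number, page in enumerate(pages, start=1):
        # Guard against an empty or missing result for image-only pages.
        page_text = page.extract_text() or ""
        if not page_text.strip():
            empty_pages.append(number)  # candidate for OCR
        parts.append(page_text)
    return "\n".join(parts), empty_pages


# Possible use with PyPDF2 (requires the package to be installed):
#   from PyPDF2 import PdfReader
#   text, scanned = collect_page_text(PdfReader("/tmp/example.pdf").pages)
```

Pages listed in the second return value are the ones where a tool such as Tesseract would be needed.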