PDFs and Reading Them

I know there are numerous Java libraries available to scan a PDF's metadata, but is there a way to scan a PDF using native Caché ObjectScript? We are looking to take a PDF from an external vendor, scan it for metadata, create the HL7 message, and embed the PDF within the HL7 message.

Thanks

Scott Roth

The Ohio State University Wexner Medical Center


Answers

PDF is a binary format. It is quite ugly, and you don't really have any control over what you are getting. The text could be stored as actual text, as vector graphics, or even as a raster image. So short of implementing full OCR in Caché, you won't find a way to do this without using external tools (and even those are not 100% reliable).

Comments

Hi Scott,

What metadata are you referring to? Document properties such as Title, Author, Subject, Keywords, etc., or are you looking to actually extract patient data from formatted page elements?

It doesn't look like HealthShare has any built-in PDF parsing features. There may be something developed by the community, but if there is, I'm not aware of it. If it's the properties I mentioned above, they're normally stored near the end of the file, and it's conceivable that you could scrape them out with COS. To do it right, though, I think you'd want to call an external utility to fetch them. The pdfinfo utility in xpdf does that pretty efficiently:

jdrumm@oobuntoo:/mnt/hgfs/DDownloads/Intersystems$ pdfinfo Sample.pdf 
Title:          Sample
Subject:        Just another pdf
Keywords:       test metadata cache fetching
Author:         Jeff Drumm
Creator:        PDFCreator Version 1.7.3
Producer:       GPL Ghostscript 9.10
CreationDate:   Wed Aug  2 07:32:38 2017
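
If you go the external-utility route, output like the above is easy to turn into key/value pairs that could then feed the HL7 message. Here is a minimal sketch in Python (not Caché ObjectScript), assuming pdfinfo is installed and on the PATH. The function names parse_pdfinfo and read_pdf_properties are made up for illustration, and the sketch only handles the simple "Key: value" layout shown in the sample.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key:   value' lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so time values such as 07:32:38 stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


def read_pdf_properties(path):
    """Run pdfinfo on a file and return its properties (assumes pdfinfo is on PATH)."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

Calling read_pdf_properties("Sample.pdf") on a file like the one above should give a dictionary with entries such as "Title" and "Author". Values come back as plain strings, so a field like CreationDate would still need to be parsed into a date before use.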

If you're looking to get data stored as formatted page elements, though, that's an entirely different challenge.