Count Number of Pages?

Hello,

We are creating a metadata file to accompany PDF documents produced by one of our third-party systems for ingestion into our DMS. One of the pieces of data that the metadata file must contain is the number of pages in the PDF document.

Does anybody know whether Caché ObjectScript currently has a way of counting the number of pages within a file (specifically a PDF) without invoking a non-Caché program or function?

If we have to invoke an external program or function from within Caché ObjectScript then we will, but we wanted to make sure there isn't a function within Caché that can do this. We can't seem to find one, so we suspect that there currently isn't.

Thanks

John

Answers

I have worked it out myself: read the PDF file in as a %FileBinaryStream, then read through it and count the occurrences of /Page (but not /Pages). This seems to give me the correct number of pages for all of the files I have tried so far!

Method OnProcessInput(pInput As %FileBinaryStream, pOutput As %RegisteredObject) As %Status
{
    set tsc = $$$OK
    do pInput.Rewind()
    set pagecount = 0
    while 'pInput.AtEnd {
        // Alternative: read fixed-size chunks with pInput.Read(200)
        set line = pInput.ReadLine()
        // Count page objects (/Type/Page) but skip page-tree nodes (/Type/Pages)
        if $FIND(line, "/Type/Page") {
            if '$FIND(line, "/Pages") {
                set pagecount = pagecount + 1
            }
        }
    }
    $$$TRACE(pagecount)
    quit tsc
}

There is no Caché function to do that.

But since a PDF contains readable, PostScript-like parts, you can use a regexp to search for the relevant information (see the linked Stack Overflow answer and article). It's not guaranteed to be precise, though.
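As a rough illustration of the regexp idea, here is a minimal sketch using %Regex.Matcher. The method name countPagesRegex and the 32000-character chunk size are my own choices for illustration, not something from the linked posts. It shares the limitations mentioned above: page objects stored inside compressed object streams (PDF 1.5 and later) are not visible to a text search, and a marker that happens to be split across two chunks is missed.

```objectscript
/// Hypothetical helper: count /Type /Page markers in a PDF file.
/// Returns -1 if the file cannot be opened.
ClassMethod countPagesRegex(filename As %String) As %Integer
{
    set stream = ##class(%Stream.FileBinary).%New()
    set sc = stream.LinkToFile(filename)
    if $$$ISERR(sc) quit -1

    // Match "/Type/Page" or "/Type /Page", but not "/Type/Pages"
    set matcher = ##class(%Regex.Matcher).%New("/Type\s*/Page(?![A-Za-z])")
    set count = 0
    while 'stream.AtEnd {
        // Read the file in chunks and scan each chunk separately
        set matcher.Text = stream.Read(32000)
        while matcher.Locate() {
            set count = count + 1
        }
    }
    quit count
}
```

Unlike a plain $FIND for "/Type/Page", the pattern also accepts whitespace between /Type and /Page, which some PDF writers emit.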

Here's my article about using LibreOffice to work with documents. I've also used Ghostscript and PostScript to work with PDFs from Caché, and it's all fairly straightforward.

Also, here's the code I wrote (execute is defined here) to add a footer to every page of a PDF file using Ghostscript:

/// Use ghostscript (%1) to apply postscript script (%3)
/// to the source pdf (%4) to get the output pdf (%2)
/// Attempts at speed  -dProvideUnicode -dEmbedAllFonts=true  -dPDFSETTINGS=/prepress
Parameter STAMP = "%1  -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=%2 %3 -f %4";

ClassMethod stampPDF(pdf, psFile, pdfOut) As %Status
{
    set cmd = $$$FormatText(..#STAMP, ..getGS(), pdfOut, psFile, pdf)
    return ..execute(cmd)
}

ClassMethod createPS(psFile, text) As %Status
{
    set stream = ##class(%Stream.FileCharacter).%New()
    // For cyrillic text
    set stream.TranslateTable = "CP1251"
    set sc = stream.LinkToFile(psFile)
    quit:$$$ISERR(sc) sc
    
    do stream.WriteLine("<<")
    do stream.WriteLine("   /EndPage")
    do stream.WriteLine("   {")
    do stream.WriteLine("     2 eq { pop false }")
    do stream.WriteLine("     {")
    do stream.WriteLine("         gsave")
    do stream.WriteLine("         /MyFont 12 selectfont")
    do stream.WriteLine("         30 70 moveto (" _ text _ ") show")
    do stream.WriteLine("         grestore")
    do stream.WriteLine("         true")
    do stream.WriteLine("     } ifelse")
    do stream.WriteLine("   } bind")
    do stream.WriteLine(">> setpagedevice")
    
    quit stream.%Save()
}

/// Get gs binary
ClassMethod getGS()
{
    if $$$isWINDOWS {
        set gs = "gswin64c"
    } else {
        set gs = "gs"
    }
    return gs
}

Ghostscript can be used to get the number of pages in a PDF file. Here's how.
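The linked instructions are not reproduced in this thread, so the following is only a sketch of one commonly used invocation: Ghostscript's runpdfbegin and pdfpagecount PostScript procedures, run with -c. It assumes a Caché/IRIS version that provides $ZF(-100), reuses getGS() from the code above, and the method name countPagesGS is hypothetical. The behaviour of these procedures may differ between Ghostscript versions, so check it against the one you have installed.

```objectscript
/// Hypothetical helper: ask Ghostscript for the page count of a PDF.
/// Returns -1 if Ghostscript could not be run successfully.
ClassMethod countPagesGS(pdf As %String) As %Integer
{
    // PostScript strings treat "\" as an escape, so use forward slashes.
    // Paths containing parentheses would need escaping as well.
    set path = $translate(pdf, "\", "/")
    set ps = "(" _ path _ ") (r) file runpdfbegin pdfpagecount = quit"
    set out = ##class(%File).TempFilename("txt")

    // -dNOSAFER allows Ghostscript 9.50+ to open the file named in the -c program
    set rc = $zf(-100, "/STDOUT=""" _ out _ """", ..getGS(), "-q", "-dNODISPLAY", "-dNOSAFER", "-c", ps)

    set pages = -1
    if rc = 0 {
        set stream = ##class(%Stream.FileCharacter).%New()
        if $$$ISOK(stream.LinkToFile(out)) {
            // Ghostscript prints the count as a single line
            set pages = +stream.ReadLine()
        }
        kill stream
    }
    do ##class(%File).Delete(out)
    quit pages
}
```

Because Ghostscript actually parses the document, this should be more reliable than searching the raw file for /Type/Page, at the cost of calling out of Caché.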

I believe Apache PDFBox is shipped with Caché/Ensemble. It doesn't strictly answer your question, since it would require calling out of Caché, but you may find it useful for this purpose as well as others.