Question
· Apr 9, 2018

Extract text from file

In Caché, is there any way to extract text from PDF and DOC files?


Right. Forgot about it.

You can use Ghostscript. In your case, the command would probably look like this:

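/// Ghostscript command template: %1 = gs binary, %2 = output text file, %3 = input PDF
Parameter COMMAND = "%1 -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=%2 %3";

/// Extract the text of the pdf file into the txt file
ClassMethod pdf2txt(pdf, txt) As %Status
{
    set cmd = $$$FormatText(..#COMMAND, ..getGS(), txt, pdf)
    return ..execute(cmd)
}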

/// Get gs binary
ClassMethod getGS()
{
    if $$$isWINDOWS {
        set gs = "gswin64c"
    } else {
        set gs = "gs"
    }
    return gs
}

Execute method code.
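
The execute method called by pdf2txt is not reproduced in this thread. A minimal sketch of what it might look like is below. It assumes $ZF(-1) is used to run the operating system command and that a non-zero exit code means failure. This is an illustration, not the original author's implementation.

```objectscript
/// Hypothetical helper: run an OS command and convert its exit code to a %Status
ClassMethod execute(cmd As %String) As %Status
{
    // $zf(-1) runs the command, waits for it to finish and returns its exit code
    set code = $zf(-1, cmd)
    if code '= 0 {
        return $$$ERROR($$$GeneralError, "Command failed with exit code " _ code _ ": " _ cmd)
    }
    return $$$OK
}
```

With this in place, a call such as do ..pdf2txt("in.pdf", "out.txt") would return an error status when Ghostscript is not on the PATH or the conversion fails. The file names here are placeholders.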

Also note that a PDF can contain only images instead of text. In that case you'd need OCR.