Extract text from file

In Caché, is there any way to extract text from PDF and Doc files?


Answers

Neither of those formats is supported out of the box, so it is better to use an external tool. That said, for docx files at least, it should be possible to implement extraction yourself.

You can convert Doc/PDF into plaintext using LibreOffice and read that from Cache. Here's an article on working with LibreOffice from Cache.
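As a rough sketch, the conversion and the read could look like the two methods below. This is not the code from the linked article: the method names doc2txt and readText are hypothetical, the soffice binary is assumed to be on PATH, and the extracted text is assumed to fit into a single Caché string.

```objectscript
/// Convert a Doc/Docx file to plain text with LibreOffice in headless mode.
/// LibreOffice writes the result into outDir, named after the source file with a .txt extension.
/// Assumption: "soffice" is on PATH.
ClassMethod doc2txt(doc As %String, outDir As %String) As %Status
{
    // --headless runs without a GUI; txt:Text selects the plain-text export filter
    set cmd = "soffice --headless --convert-to txt:Text --outdir """ _ outDir _ """ """ _ doc _ """"
    // $zf(-1) runs the command through the OS shell and returns its exit code
    set code = $zf(-1, cmd)
    if code'=0 {
        return $$$ERROR($$$GeneralError, "soffice exited with status " _ code)
    }
    return $$$OK
}

/// Read the converted file back into a string.
/// Assumption: the whole text fits into one string.
ClassMethod readText(txt As %String, Output text As %String) As %Status
{
    set text = ""
    set stream = ##class(%Stream.FileCharacter).%New()
    set sc = stream.LinkToFile(txt)
    if $$$ISERR(sc) {
        return sc
    }
    // Read the file in chunks until the end of the stream
    while 'stream.AtEnd {
        set text = text _ stream.Read(32000)
    }
    return $$$OK
}
```

For large documents it would be safer to process the stream chunk by chunk instead of concatenating everything into one string.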

Thanks for your answer.

But it says you can't convert PDF to text (it can only export to PDF).

Right. Forgot about it.

You can use Ghostscript; here's how. In your case the command would probably look like this:

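/// Ghostscript command template: %1 - gs binary, %2 - output text file, %3 - input PDF
Parameter COMMAND = "%1 -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=%2 %3";

/// Extract text from the pdf file into the txt file
ClassMethod pdf2txt(pdf, txt) As %Status
{
    set cmd = $$$FormatText(..#COMMAND, ..getGS(), txt, pdf)
    return ..execute(cmd)
}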

/// Get gs binary
ClassMethod getGS()
{
    if $$$isWINDOWS {
        set gs = "gswin64c"
    } else {
        set gs = "gs"
    }
    return gs
}

Execute method code.
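In case the link is unavailable, here is a minimal sketch of what such an execute method might look like. It is an assumption, not the linked code: it simply runs the command with $zf(-1) and turns a non-zero exit code into an error status.

```objectscript
/// Run an OS command and report a non-zero exit code as an error.
/// Sketch only; the linked implementation may differ.
ClassMethod execute(cmd As %String) As %Status
{
    // $zf(-1) runs the command through the OS shell and waits for it to finish
    set code = $zf(-1, cmd)
    if code'=0 {
        return $$$ERROR($$$GeneralError, "Command `" _ cmd _ "` exited with status " _ code)
    }
    return $$$OK
}
```

With this in the same class, pdf2txt("C:\temp\in.pdf", "C:\temp\out.txt") would return an error status whenever Ghostscript fails. The paths here are only examples, and paths containing spaces would need quoting in the command template.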

Also note that a PDF can contain only images instead of text. In that case you'd need OCR.

Good idea too.

Apache PDFBox for PDF + Apache POI for Office files.

Or Apache Tika can be used to extract text from everything (it's a wrapper around PDFBox and POI).
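For example, the Tika application jar can be called from Caché the same way as Ghostscript above. This is only a sketch: the TIKA parameter is a placeholder path, tika2txt is a hypothetical method name, java is assumed to be on PATH, and the execute method is the one referenced in the Ghostscript answer.

```objectscript
/// Path to the Tika application jar (placeholder, adjust to your installation)
Parameter TIKA = "/path/to/tika-app.jar";

/// Extract plain text from any file Tika can parse and save it into the txt file
ClassMethod tika2txt(file As %String, txt As %String) As %Status
{
    // --text prints the extracted plain text to stdout; the shell redirects it into txt
    set cmd = "java -jar """ _ ..#TIKA _ """ --text """ _ file _ """ > """ _ txt _ """"
    return ..execute(cmd)
}
```

The advantage is that one command covers PDF and Office files alike, at the cost of starting a JVM for every call.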