Extract text from file

Question

oshra matrix · Apr 9, 2018

#Caché

In Caché, is there any way to extract text from PDF and Doc files?

Discussion (6)

Eduard Lebedyuk · Apr 9, 2018

Right. Forgot about it.

You can use Ghostscript; here's how. In your case the command would probably look like this:

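/// Ghostscript command template: %1 = gs binary, %2 = output text file, %3 = input PDF
Parameter COMMAND = "%1 -dBATCH -dNOPAUSE -sDEVICE=txtwrite -sOutputFile=%2 %3";

/// Extract the text of the PDF file pdf into the text file txt
ClassMethod pdf2txt(pdf, txt) As %Status
{
    set cmd = $$$FormatText(..#COMMAND, ..getGS(), txt, pdf)
    return ..execute(cmd)
}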

/// Get gs binary
ClassMethod getGS()
{
    if $$$isWINDOWS {
        set gs = "gswin64c"
    } else {
        set gs = "gs"
    }
    return gs
}

Execute method code.
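
The execute method itself is not reproduced in this thread. As a minimal sketch, assuming the classic `$ZF(-1)` call (which runs an OS command synchronously and returns its exit code) is acceptable on your version, it might look roughly like the code below. The error message, the exit-code check, and the `readText` helper are illustrative assumptions, not the linked implementation.

```objectscript
/// Hypothetical sketch, not the linked implementation:
/// run an OS command synchronously and turn a non-zero exit code into an error status.
ClassMethod execute(cmd As %String) As %Status
{
    // $ZF(-1) waits for the command to finish and returns its exit code
    // (-1 if the child process could not be started)
    set code = $zf(-1, cmd)
    if code '= 0 {
        return $$$ERROR($$$GeneralError, "Command failed with exit code " _ code _ ": " _ cmd)
    }
    return $$$OK
}

/// Hypothetical helper: read the text file produced by pdf2txt into a string.
/// Assumes the whole file fits into a single string (long strings enabled).
ClassMethod readText(txt As %String, Output text As %String) As %Status
{
    set text = ""
    set stream = ##class(%Stream.FileCharacter).%New()
    set sc = stream.LinkToFile(txt)
    if $$$ISERR(sc) {
        return sc
    }
    // Read the file in chunks until the end of the stream
    while 'stream.AtEnd {
        set text = text _ stream.Read(32000)
    }
    return $$$OK
}
```

With these in place, calling pdf2txt and then readText on the same txt path would give you the extracted text in a local variable. For very large files, processing the stream chunk by chunk is safer than concatenating everything into one string.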

Also note that a PDF can contain only images instead of text. In that case you'd need OCR.

Bernd Mueller · Apr 9, 2018

As an alternative, you can also use Apache PDFBox. See here for an example:

https://dzone.com/articles/apache-pdfbox-command-line-tools-no-java-coding-re
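
For illustration, the PDFBox command-line tool could be called from Caché in the same way as Ghostscript. This is only a sketch: it assumes a PDFBox 2.x pdfbox-app jar (whose ExtractText tool takes an input PDF and an output text file), that java is on the PATH, that the paths contain no spaces, and that an execute method like the one in the Ghostscript answer is available. The parameter and method names are hypothetical.

```objectscript
/// Assumed template: %1 = path to the pdfbox-app jar, %2 = input PDF, %3 = output text file
Parameter PDFBOXCOMMAND = "java -jar %1 ExtractText %2 %3";

/// Hypothetical wrapper: jar is the path to a downloaded pdfbox-app jar (PDFBox 2.x assumed)
ClassMethod pdfbox2txt(jar As %String, pdf As %String, txt As %String) As %Status
{
    set cmd = $$$FormatText(..#PDFBOXCOMMAND, jar, pdf, txt)
    // execute is the same OS-command runner used by pdf2txt
    return ..execute(cmd)
}
```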

score 0 · Answer 1 · 2018-04-09T04:11:53-04:00

Neither of those formats is supported out of the box, so it is better to use external tools for this. It is possible to implement it yourself, at least for docx files.

score 0 · Answer 2 · 2018-04-09T04:18:11-04:00

You can convert Doc/PDF into plain text using LibreOffice and read that from Caché. Here's an article on working with LibreOffice from Caché.
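
As a rough sketch of the Doc case, LibreOffice can be run headless with its `--convert-to` option. The code below assumes the soffice binary is on the PATH, that the paths contain no spaces, and that an execute method like the one in the Ghostscript answer is available. The parameter and method names are hypothetical.

```objectscript
/// Assumed template: %1 = output directory, %2 = input document
Parameter SOFFICECOMMAND = "soffice --headless --convert-to txt:Text --outdir %1 %2";

/// Hypothetical wrapper: converts doc to plain text inside outDir
ClassMethod doc2txt(doc As %String, outDir As %String) As %Status
{
    set cmd = $$$FormatText(..#SOFFICECOMMAND, outDir, doc)
    // execute is the same OS-command runner used by pdf2txt
    return ..execute(cmd)
}
```

LibreOffice writes the result into outDir using the input file's base name with a .txt extension, so report.doc becomes report.txt.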

score 0 · Answer 3 · 2018-04-09T04:47:57-04:00

oshra matrix · Apr 9, 2018

Thanks for your answer.

But it says you can't convert PDF to text (you can only export to PDF).

score 0 · Answer 4 · 2018-04-09T06:27:00-04:00

Good idea too.

Apache PDFBox for PDF + Apache POI for Office files.

Or Apache Tika can be used to extract text from nearly everything (it wraps PDFBox and POI, among other parsers).
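
A possible way to call Tika from Caché, sketched under some assumptions: the tika-app jar has been downloaded, java is on the PATH, the paths contain no spaces, and the command is run through a shell so that the `>` redirect works (tika-app prints the extracted text to standard output when given `--text`). The parameter and method names are hypothetical, and execute is the same method used in the Ghostscript answer.

```objectscript
/// Assumed template: %1 = path to the tika-app jar, %2 = input file, %3 = output text file
Parameter TIKACOMMAND = "java -jar %1 --text %2 > %3";

/// Hypothetical wrapper: works for any input format that Tika can parse
ClassMethod tika2txt(jar As %String, file As %String, txt As %String) As %Status
{
    set cmd = $$$FormatText(..#TIKACOMMAND, jar, file, txt)
    // The shell redirect sends Tika's standard output into the txt file
    return ..execute(cmd)
}
```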