Extract text and metadata from documents and images
Content analysis toolkit
tika
$ tika --text document.pdf
$ tika --metadata photo.jpg
$ tika -t file.docx > output.txt
$ tika --detect file.unknown
$ tika --text *.pdf
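
To keep the text of each PDF in its own file rather than one combined stream, a loop can reuse the redirect form shown earlier. This is a minimal sketch, assuming the `tika` wrapper writes extracted text to standard output as the redirect example suggests. The function name `extract_all` and the `.txt` naming scheme are hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: extract each PDF in a directory into a sibling .txt file.
# Assumes a `tika` command on PATH that prints extracted text to stdout.
extract_all() {
  dir="${1:-.}"                       # directory to scan; defaults to the current one
  for f in "$dir"/*.pdf; do
    [ -e "$f" ] || continue           # skip the literal pattern when nothing matches
    tika --text "$f" > "${f%.pdf}.txt"
  done
}
```

Calling `extract_all reports` would then write `reports/a.txt` next to `reports/a.pdf`, and so on for every PDF in that directory.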