Extract text and metadata from various document formats
Content analysis toolkit
tika
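# Print the plain-text content of a document
$ tika --text document.pdf
# Print only the document's metadata
$ tika --metadata document.docx
# Print the content as XHTML
$ tika --xml image.jpg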
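
The single-file commands above can be wrapped in a loop to process a whole directory. The following is a minimal sketch, not part of Tika itself: `extract_all` and the `TIKA_CMD` variable are hypothetical names introduced here, and the sketch assumes a `tika` wrapper script is available on PATH as in the examples above.

```shell
# Hypothetical helper (not shipped with Tika): extract plain text from
# every regular file in a directory, one .txt output per input file.
# Assumes a `tika` command on PATH; set TIKA_CMD to use a different launcher.
extract_all() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*; do
        [ -f "$f" ] || continue          # skip subdirectories and unmatched globs
        name=$(basename "$f")
        # Keep the original file name and append .txt to it
        "${TIKA_CMD:-tika}" --text "$f" > "$out_dir/$name.txt"
    done
}
```

Usage, assuming the input files live in ./docs:

$ extract_all ./docs ./text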