Indexing - This process extracts the text from documents, including emails, Microsoft Office documents, text files, and text-searchable PDFs. The tool then builds an index, just like the index of a book, that maps the words contained in documents back to those documents. Most documents are indexed by DWR upon posting to Review.
Documents that appear in blue in the Current Docs Grid Screen contained no indexable text.
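The book-index analogy above can be sketched in a few lines of Python. This is only an illustration of the general idea of a keyword index, not DWR's actual implementation; the function names `build_index` and `unindexed` and the sample document IDs are hypothetical.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index


def unindexed(documents, index):
    """Return the IDs of documents that contributed no words to the index."""
    indexed_ids = set().union(*index.values()) if index else set()
    return sorted(set(documents) - indexed_ids)


# Hypothetical sample data: two documents with text and one image-only scan.
documents = {
    "email_001": "Please review the attached contract.",
    "memo_002": "The contract was signed on Friday.",
    "scan_003": "",  # no extractable text, so nothing to index
}

index = build_index(documents)
print(sorted(index["contract"]))     # documents containing "contract"
print(unindexed(documents, index))   # documents with no indexable text
```

In this sketch, a keyword search for "contract" finds both text documents, while the scan with no text never appears in any search result. That is the situation the blue highlighting flags.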
When documents contain text but are not indexed, make the images searchable. This directs the search engine to optically recognize the letters and words and add them to the keyword-searchable index. The process generates text from scanned PDFs and from image formats such as JPG and TIFF via Optical Character Recognition (OCR). The generated text is then added to the DWR Index.
To extract text from image files, right-click the file, choose Extract Text... and select from the following options:
- Make production images searchable - creates text-searchable PDFs for export, either in a production or as individual documents, and adds this text to the index
- Extract and save text - for productions requiring a text file for each document
- Vectorize for Find Similar - for use with "find similar"; be sure the RSI column is activated in Settings - Edit Columns
- Overwrite existing text - for use when existing text should be replaced (all other options will skip documents for which text has already been extracted)
- Attempt forced OCR of PDFs - certain PDFs are delivered with an overlaid text file (usually a form or a produced PDF with a Bates number overlaid) in which only the text overlay is indexed. This option forces the tool to run Optical Character Recognition on the PDF image instead of extracting the overlaid text.
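The skip-versus-overwrite behavior described in the options above can be pictured with a short Python sketch. It is a hypothetical model of the rule as stated (documents with existing text are skipped unless overwrite is chosen), not DWR code; the function `select_for_extraction` and the sample data are assumptions made for illustration.

```python
def select_for_extraction(documents, overwrite=False):
    """Return the IDs of documents whose text would be (re)generated.

    `documents` maps a document ID to its previously extracted text
    (an empty string means no text has been extracted yet).
    """
    return [
        doc_id
        for doc_id, existing_text in documents.items()
        if overwrite or not existing_text
    ]


# Hypothetical sample data: one scan already has extracted text.
documents = {"scan_001": "", "scan_002": "INVOICE 1234", "scan_003": ""}

print(select_for_extraction(documents))                  # skips scan_002
print(select_for_extraction(documents, overwrite=True))  # processes all three
```

Running an operation twice without the overwrite option therefore leaves previously extracted text untouched.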
When data sets involve additional languages, the OCR engine can be optimized to recognize those languages.
To monitor progress, go to the Process tab and select the View Jobs interface.
Some PDF files are not OCR'able and not keyword searchable. It is common to find a small number of files (including EMAIL, PDF, DOC, PPT, etc.) that have errors, contain 0 bytes, are password protected, or have forms or other security that prevents them from being keyword indexed.
By default, DWR does not extract text from image files. Where pictures are expected to be screenshots or pictures of content with words, follow the steps above to extract text and make the documents keyword searchable.
Extracting Text is performed on a single file or on multiple files by highlighting the rows, right-clicking, and following the instructions above.
Extracting Text is also available on a Collection by right-clicking the Collection and choosing the appropriate operation: