Indexing - This process extracts the text from documents, including emails, Microsoft Office documents, text files, and text-searchable PDFs. The tool then builds an index, just like the index of a book, that maps the words contained in documents back to those documents. Most documents are indexed by DWR upon posting to Review.
A document appears in blue in the Current Docs Grid Screen to indicate that it contained no indexable text.
When a document contains text but was not indexed, generate OCR for the document to create a text file that can be added to the keyword-searchable index.
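The book-index analogy above can be illustrated with a small sketch. This is not DWR's implementation; it is a minimal Python example of the general idea of an index that maps words back to documents. The function names (build_index, search) and the sample file names are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        # Split the text into simple lowercase word tokens.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(name)
    return index


def search(index, word):
    """Return the sorted names of documents that contain the word."""
    return sorted(index.get(word.lower(), set()))


# Hypothetical sample documents, for illustration only.
documents = {
    "memo.txt": "Quarterly budget review",
    "email.msg": "Please review the attached contract",
    "scan.pdf": "",  # an image-only scan has no text to index until OCR is run
}

index = build_index(documents)
print(search(index, "review"))    # ['email.msg', 'memo.txt']
print(search(index, "contract"))  # ['email.msg']
```

In this sketch, scan.pdf never appears in any search result because it contributed no words to the index, which mirrors why a document with no indexable text is not keyword searchable until text has been generated for it and added to the index.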
OCR (Optical Character Recognition) – This process generates text from images such as scanned PDFs and image formats such as JPG and TIFF. This generated text can then be added to the DWR Index.
To OCR the document, right click on the row that represents the document and choose OCR:
Once OCR has completed, right click to add the created text file to the Index:
To monitor the progress of OCR and Indexing go to Process>View Jobs:
Some PDF files are not OCR'able and not keyword searchable. It is common to find a small number of files (including EMAIL, PDF, DOC, PPT, etc.) that have errors, contain 0 bytes, are password protected, or have forms or other security restrictions that prevent them from being keyword indexed.
By default, DWR does not generate OCR for pictures. Where pictures are expected to be screenshots or pictures of content with words, follow the steps above to OCR and Index them so that the documents become keyword searchable.
OCR and Indexing can be run on a single file or multiple files by highlighting the rows, right clicking, and following the instructions above.
OCR and Index are also available on a Collection by right clicking on the Collection and choosing the appropriate operation:
Case 2: Productions