Indexing - This process extracts the text from documents, including emails, Microsoft Office documents, text files, and text-searchable PDFs. The tool then builds an index, much like the index of a book, that maps the words contained in documents back to those documents. Most documents are indexed by DWR upon posting to Review.
Documents that contained no indexable text appear in blue in the Document List.
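The book-index analogy can be illustrated with a small inverted index. This is only a sketch of the general technique, not DWR's actual implementation: the function name `build_index`, the document IDs, and the sample text are all hypothetical, and the tokenization rule (lowercased runs of letters, digits, and apostrophes) is an assumption made for the example.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercased word to the set of document IDs that contain it.

    Documents that yield no words are returned separately, similar in spirit
    to the documents flagged as having no indexable text.
    """
    index = defaultdict(set)
    no_text = []
    for doc_id, text in documents.items():
        # Assumed tokenization: runs of letters, digits, and apostrophes.
        words = re.findall(r"[a-z0-9']+", text.lower())
        if not words:
            no_text.append(doc_id)
            continue
        for word in words:
            index[word].add(doc_id)
    return index, no_text


# Hypothetical sample data for illustration only.
documents = {
    "DOC-001": "Quarterly report attached for review",
    "DOC-002": "Please review the attached contract",
    "DOC-003": "",  # e.g. a scanned image with no extracted text yet
}

index, no_text = build_index(documents)
print(sorted(index["review"]))  # document IDs containing the word "review"
print(no_text)                  # document IDs with nothing to index
```

Looking up a word is then a single dictionary access, which is why a keyword search does not need to reread every document.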
After extracting text from all available files, the tool runs Optical Character Recognition (OCR) on any files that have no available text. This process generates text from image-based files such as scanned PDFs, JPGs, and TIFFs. The generated text is then added to the DWR Index.
To extract text from individual files, right-click the file and choose Extract Text.... The following options are available:
- Make images searchable - creates text-searchable PDFs for export, either in a production or as individual documents, and adds this text to the index (for documents that have been produced, the production images are read)
- Extract and save text - creates a separate .txt file for productions requiring a text file for each document
- Vectorize for Find Similar - for use with "find similar"; be sure the RSI column is activated in Settings - Edit Columns
- Overwrite existing text - for use when existing text should be replaced (all other options skip documents for which text has already been extracted). For produced documents, the text will always be the text that was produced.
- Attempt forced OCR of PDFs - certain PDFs are delivered with overlaid text (usually a form or a produced PDF with a Bates number overlaid), in which case only the text overlay is indexed. This option forces the tool to run Optical Character Recognition on the PDF image rather than extract the overlaid text.
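The skip-unless-overwrite behavior described in the options can be sketched as a simple filter. This is an illustration under stated assumptions, not the tool's own logic: the `Doc` class, the `has_text` flag, and `select_for_extraction` are hypothetical names, and the special rule for produced documents is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    has_text: bool  # True if text was already extracted for this document


def select_for_extraction(docs, overwrite_existing=False):
    """Return the names of documents an extraction run would process.

    Without the overwrite option, documents that already have text are
    skipped; with it, every document is processed again.
    """
    return [d.name for d in docs if overwrite_existing or not d.has_text]


# Hypothetical selection of three highlighted rows.
docs = [
    Doc("memo.docx", has_text=True),
    Doc("scan.tif", has_text=False),
    Doc("form.pdf", has_text=True),
]

default_run = select_for_extraction(docs)
overwrite_run = select_for_extraction(docs, overwrite_existing=True)
print(default_run)    # only the document without text
print(overwrite_run)  # all three documents
```

In practice this means a second extraction run over the same selection is cheap by default, because only documents still lacking text are touched.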
When data sets involve additional languages, the OCR engine can be optimized to recognize those languages.
Some PDF files cannot be OCR'd and are not keyword searchable. It is common to find a small number of files, including EMAIL, PDF, DOC, PPT, etc., that have errors, contain 0 bytes, are password protected, or have forms or other security that prevents them from being keyword indexed.
Extracting text is performed on a single file or on multiple files by highlighting the rows, right-clicking, and following the instructions above.
Extracting text is also available on a Collection by right-clicking the Collection and choosing the appropriate operation: