Indexing - This process extracts the text from documents, including emails, Microsoft Office documents, text files, and text-searchable PDFs. The tool then builds an index, just like the index of a book, that maps the words contained in documents back to those documents. Most documents are indexed by DWR upon posting to Review.
A document appears in blue in the Document List when it contained no indexable text.
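The book-index analogy can be illustrated with a minimal Python sketch. This is not DWR's implementation; the function name build_index, the sample file names, and the simple word-splitting rule are all assumptions made for illustration. It maps each word to the documents that contain it and lists the documents that yielded no indexable text.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercased word to the set of document names containing it.

    `documents` is a dict of {document name: extracted text}. Hypothetical
    helper for illustration only, not part of DWR.
    """
    index = defaultdict(set)
    no_text = []  # documents that contributed no words at all
    for name, text in documents.items():
        words = re.findall(r"[a-z0-9']+", text.lower())
        if not words:
            no_text.append(name)
        for word in words:
            index[word].add(name)
    return index, no_text

# Sample data (invented for this example)
documents = {
    "memo.docx": "Quarterly budget review",
    "email_001.msg": "Please review the attached budget",
    "scan.pdf": "",  # image-only file: nothing to index yet
}

index, no_text = build_index(documents)
print(sorted(index["budget"]))  # documents containing the word "budget"
print(no_text)                  # documents with no indexable text
```

Looking up a word in the index returns the matching documents directly, which is why a keyword search does not need to reread every file.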
After extracting text from all available files, the tool OCRs any files that do not have available text. This process uses Optical Character Recognition (OCR) to generate text from images, such as scanned PDFs and image formats such as JPG and TIFF. The generated text is then added to the DWR Index.
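The order of operations described here (use extracted text when it exists, fall back to OCR when it does not) can be sketched as follows. This is an illustrative sketch only, not DWR code: text_for_index and fake_ocr are hypothetical names, and fake_ocr is a stand-in for a real OCR engine.

```python
def text_for_index(extracted_text, image_bytes, ocr):
    """Return (text, source), where source is 'extracted', 'ocr', or 'none'.

    Extracted text wins when present; OCR is attempted only when the
    file has no available text. `ocr` is any callable taking image bytes.
    """
    if extracted_text and extracted_text.strip():
        return extracted_text, "extracted"
    if image_bytes:
        generated = ocr(image_bytes)
        if generated and generated.strip():
            return generated, "ocr"
    return "", "none"

def fake_ocr(image_bytes):
    # Stand-in for a real OCR engine (assumption for this example)
    return "INVOICE 1234" if image_bytes == b"scanned-invoice" else ""

print(text_for_index("Hello team", b"", fake_ocr))
print(text_for_index("", b"scanned-invoice", fake_ocr))
print(text_for_index("", b"blank-page", fake_ocr))
```

A file that ends in the 'none' state corresponds to a document with no indexable text even after OCR.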
To extract text from individual files, right-click the file and choose Extract Text... then select from the following options:
- Make images searchable - creates text-searchable PDFs for export, either in a production or as individual documents, and adds this text to the index (for documents that have been produced, the production images are read)
- Extract and save text - creates a separate .txt file for productions requiring a text file for each document
- Enable AI Searching - for use in AI Searching; be sure the option is activated in Settings - Edit Columns - Advanced
- Overwrite existing text - for use when existing text should be replaced (all other options will skip documents for which text has already been extracted). For produced documents, the text will always be the text that was produced.
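The skip behavior described for Overwrite existing text can be summarized in a short sketch. This is an assumption-based illustration, not DWR's logic: the Doc class and select_for_extraction are hypothetical names, and the special handling of produced documents is left out.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    name: str
    has_text: bool  # True if text has already been extracted

def select_for_extraction(docs, overwrite=False):
    """Return the names of documents that would be processed.

    Without overwrite, documents that already have text are skipped;
    with overwrite, every selected document is processed again.
    """
    return [d.name for d in docs if overwrite or not d.has_text]

# Sample selection (invented for this example)
docs = [Doc("a.pdf", True), Doc("b.tif", False), Doc("c.jpg", False)]

print(select_for_extraction(docs))                  # skips a.pdf
print(select_for_extraction(docs, overwrite=True))  # processes all three
```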
When data sets involve additional languages the OCR engine can be optimized to recognize those languages.
Some PDF files cannot be OCR'd and are not keyword searchable. It is common to find a small number of files (including EMAIL, PDF, DOC, and PPT files) that have errors, contain 0 bytes, are password protected, or have forms or other security settings that prevent them from being keyword indexed.
Extracting text is performed on a single file or on multiple files by highlighting the rows, right-clicking, and following the instructions above.
Extracting text is also available on a Collection by right-clicking the Collection and choosing the appropriate operation: