Indexing - This process extracts the text from documents, including emails, Microsoft Office documents, text files, and text-searchable PDFs. The tool then builds an index, much like the index of a book, that maps the words contained in documents back to those documents. Most documents are indexed by DWR upon posting to Review.
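The book-index idea can be illustrated with a minimal sketch. This is not DWR's actual implementation; the function names, document IDs, and the simple word-splitting rule are assumptions made only for illustration.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Hypothetical tokenizer: runs of letters, digits, and apostrophes.
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, word):
    """Return the sorted IDs of the documents that contain the word."""
    return sorted(index.get(word.lower(), set()))

# Hypothetical documents; scan_003 stands in for a file with no extractable text.
docs = {
    "email_001": "Please review the attached contract",
    "memo_002": "Contract terms were revised",
    "scan_003": "",
}
index = build_index(docs)
print(search(index, "contract"))
```

A document with no extractable text, such as scan_003 here, contributes nothing to the index, so no keyword search can ever return it until text is generated for it.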
Documents that appear in blue in the Current Docs Grid Screen contained no indexable text.
When a document contains text but has not been indexed, generate OCR of the document to create a text file that can be added to the keyword-searchable index.
OCR (Optical Character Recognition) – This process generates text from images such as scanned PDFs and image formats such as JPG and TIFF. This generated text can then be added to the DWR Index.
To OCR the document, right-click the row that represents the document and choose OCR:
Once OCR has completed, right-click again to add the created text file to the Index:
To monitor the progress of OCR and Indexing, go to Process>View Jobs:
Some PDF files are neither OCR'able nor keyword searchable. It is common to find a small number of files (including EMAIL, PDF, DOC, PPT, etc.) that have errors, contain 0 bytes, are password protected, or have forms or other security that prevent them from being keyword indexed.
By default, DWR does not generate OCR for pictures. Where pictures are expected to be screenshots or pictures of content with words, follow the steps above to OCR and Index them so that they become keyword searchable.
OCR and Indexing can be run on a single file or on multiple files by highlighting the rows, right-clicking, and following the instructions above.
OCR and Index are also available on a Collection by right-clicking the Collection and choosing the appropriate operation:
In 10.1, the “OCR” function really means “make searchable PDFs”. (There is still an option to generate .txt files for productions.) There are two main cases to consider:
Case 1: General case
- When you select files, collections, custodians, or filter criteria to OCR, DWR will now attempt to generate searchable PDFs of all PDF, TIF, PNG, JPG, and other image-format files it finds in your selection. This has the added benefit of “pre-imaging” these types of documents, since it generates the pre-production PDFs used in the production workflow.
- There is an optional checkbox to “generate .txt files”:
  - This option is turned off by default in the general case. There is no need to generate extracted text for documents you aren’t necessarily producing.
  - If NOT checked, all non-image/non-PDF documents will be ignored by the OCR process.
  - If checked, DWR will still attempt to extract text from the selected native files.
  - If you right-click a production or a file in a production, this option will be turned on by default.
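The general-case rules above can be summarized as a small decision sketch. It only restates the behavior described here and is not DWR code; the function name, the returned labels, and the exact extension list are assumptions.

```python
from pathlib import Path

# Assumed extension list: the text names PDF/TIF/PNG/JPG "and other image format files".
IMAGE_OR_PDF = {".pdf", ".tif", ".tiff", ".png", ".jpg", ".jpeg"}

def plan_ocr(filename, generate_txt=False):
    """Return what the OCR function would do with one selected file (general case)."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_OR_PDF:
        # PDFs and images are always candidates for a searchable PDF.
        return "make searchable PDF"
    if generate_txt:
        # Other native files are touched only when the checkbox is on.
        return "extract text from native"
    return "ignore"

for name in ("scan.TIF", "memo.docx"):
    print(name, "->", plan_ocr(name), "/", plan_ocr(name, generate_txt=True))
```

The sketch does not model whether a .txt file is also written for PDFs and images when the checkbox is on.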
Case 2: Productions
- DWR will attempt to make searchable all of the Bates-numbered/endorsed PDFs generated by the production workflow. This applies only to documents being produced as Image or Withheld; files produced as “Native” will be ignored. This has the benefit of always having a searchable PDF available when using “Export documents” to export images.
- For documents in Drafts that have not been endorsed, DWR will attempt to make the pre-production images searchable where they exist.
- If “generate .txt files” is turned on, the text will be extracted from the searchable PDFs (NOT the natives). For documents produced natively, the OCR function will extract text from the natives (as it has done in the past). In both cases, the “Replace existing .txt files” option will only ever replace existing .txt files and will not overwrite/replace existing searchable PDFs.
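The production rules can be sketched the same way. Again, this is only an illustration of the behavior described above; the function name and step labels are hypothetical, and the “Replace existing .txt files” option is not modeled.

```python
def plan_production_ocr(produced_as, generate_txt=False):
    """Return the OCR steps for one production document (illustrative only)."""
    steps = []
    if produced_as in ("Image", "Withheld"):
        # The endorsed (or pre-production) PDF is made searchable.
        steps.append("make produced PDF searchable")
        if generate_txt:
            # Text comes from the searchable PDF, not from the native.
            steps.append("extract .txt from searchable PDF")
    elif produced_as == "Native":
        # No searchable PDF is made; text comes from the native if requested.
        if generate_txt:
            steps.append("extract .txt from native")
    return steps

for kind in ("Image", "Withheld", "Native"):
    print(kind, "->", plan_production_ocr(kind, generate_txt=True))
```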