Around the world, transactions, agreements, applications, transcripts, photos and many other records are still documented manually. These documents are saved or stored for future reference, as proof, or to support decision making. Consider how many documents pile up in such places, and how hard it is to retrieve a particular one when needed. Every organization is therefore looking for ways to mitigate these issues. The common solution is to digitalize the documents by capturing the data in them, exactly as it appears, into IT systems, thereby avoiding such problems in the future. The data in their IT systems also enables analytics and better decision making.
Fine! Now come the concerns about how to get it done:
1. Are there software tools to read manual documents?
2. Can these tools read data from scanned documents?
3. Can the tools capture handwritten data?
4. How about multi-language support?
5. How do these tools read unstructured data?
The first step of the solution is to scan these manual documents and convert them into image-only files.
Optical character recognition or optical character reader (OCR) is one possible solution, and many software tools use this technology. It is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example, the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image.
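As a minimal sketch of what such a conversion looks like in practice, the snippet below uses the open-source pytesseract wrapper around the Tesseract OCR engine (one of many possible tools; the file name `invoice_scan.png` is purely illustrative). The import and the file are checked defensively so the sketch degrades gracefully when Tesseract is not installed.

```python
import os

# pytesseract and Pillow are third-party packages; guard the import so
# this sketch still runs where they are not installed.
try:
    import pytesseract
    from PIL import Image
    HAVE_OCR = True
except ImportError:
    HAVE_OCR = False

def clean_text(raw: str) -> str:
    """Collapse the ragged whitespace that raw OCR output typically contains."""
    return " ".join(raw.split())

if HAVE_OCR and os.path.exists("invoice_scan.png"):
    # Convert the scanned image to machine-encoded text.
    text = pytesseract.image_to_string(Image.open("invoice_scan.png"))
    print(clean_text(text))
```

Even in this simple form, a normalization pass like `clean_text` is usually needed, because OCR output preserves the visual line breaks and spacing of the page rather than logical sentences.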
The technology is mature enough to read and copy data from documents, but it has a few limitations, which are being addressed:
- Font Size. OCR may not convert characters with very large or very small font sizes. This can make the most important characters and words unavailable for text-based systems.
- Uni-Dimensional. With OCR, individual words have one dimension: they are either before or after other words. OCR does not catalog page-coordinate information for characters, even though page coordinates can be quite useful for classification and for extracting attributes.
- Sequential Editing. OCR errors typically have to be corrected sequentially, with the same errors being edited repeatedly. Global spell checking can introduce other errors.
- Case Sensitivity for Editing. The use of spell checking to correct OCR text will typically not permit the case of the letters to be considered, e.g., cat and CAT will be treated alike.
- Languages. Many languages have special characters, and unless the correct OCR software is loaded, those characters can be lost or incorrectly recognized.
- Non-Symmetrical DPI for Faxes. Faxes are often stored in files where the number of dots per inch horizontally is not the same as the DPI vertically, and OCR engines can have difficulty with this non-symmetrical DPI.
- Partial Text. Document authors often incorporate graphics that contain visible text. The OCR software may detect some text, assume that OCR is not needed, and skip processing the document, leaving the text in the images invisible to text-only searching or analysis. A similar phenomenon can happen when textual headers, footers, or legends are added to previously image-only PDFs: OCR systems may detect the presence of a text layer and not attempt to convert the image layer, even though it may hold the most important content.
- Non-Textual Glyphs. Important non-textual characters or glyphs, such as logos or map symbols, often do not get converted to characters by OCR, leaving them invisible to text analytics or text-based retrieval.
- Inferring the Obvious. Graphical elements often provide the most obvious clues as to how a file or document should be classified, e.g., the placement and size of logos or text blocks. Because those graphical elements may not be directly accessible, text-restricted systems are left to infer what is obvious to anyone simply looking at the files.
- Incorrect Document Boundaries. Image-only files often contain multiple documents per file and OCR does not provide a way to correct document boundaries. This causes downstream problems with systems which classify files based on comparing the words that are used within documents. Embedded documents can be missed and the ones that are classified can be misclassified. There can be similar issues for single-page TIFs where document boundaries are not obvious.
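The uni-dimensional limitation above can be partly worked around when the OCR tool does expose word bounding boxes (for example, pytesseract's `image_to_data` returns per-word positions). The sketch below clusters word boxes by vertical position to rebuild reading lines; the box format and the sample words are illustrative assumptions, not output from a real scan.

```python
from typing import List, Tuple

# (text, left, top, width, height) -- mirrors the per-word geometry
# that OCR tools with coordinate output typically provide.
Box = Tuple[str, int, int, int, int]

def group_into_lines(boxes: List[Box], y_tol: int = 10) -> List[str]:
    """Cluster word boxes whose top edges lie within y_tol pixels of the
    previous cluster, then sort each cluster left-to-right to rebuild lines."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[2]):  # sort by top edge
        if lines and abs(lines[-1][0][2] - box[2]) <= y_tol:
            lines[-1].append(box)  # same visual line as the cluster start
        else:
            lines.append([box])    # start a new line
    return [" ".join(w[0] for w in sorted(line, key=lambda b: b[1]))
            for line in lines]

# Made-up word boxes from a hypothetical invoice page.
words = [("Total", 40, 202, 60, 18), ("Qty", 300, 200, 30, 18),
         ("120.00", 420, 204, 70, 18), ("Invoice", 40, 60, 80, 20)]
print(group_into_lines(words))  # ['Invoice', 'Total Qty 120.00']
```

A tolerance-based clustering like this is a common first pass when reconstructing tables from scans, since words in one table row rarely share an exact pixel baseline.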
In one of our client projects, where we are automating the processing of scanned customer invoices using Blue Prism, we encountered a few of the above issues. The OCR tools were reading the data perfectly, but alignment issues led to errors in calculations and thus large discrepancies in the invoice amounts.
Traditional OCR might need manual intervention to clean up such issues, but we resolved a few of them using the Artificial Intelligence (AI) and Machine Learning (ML) capabilities embedded in the RPA tool.
How did we do it? What techniques did we apply? What skill set is required? Stay tuned for more information…