Big data is currently a hot topic in the IT market. It is about ingesting data from multiple sources and then making sense of it. These sources can include transaction systems, Web sites, machine logs, sensors (think GPS), and social media. They can also include archives of scanned documents.
It’s an old adage that 80% of any business’s data are unstructured. Historically, this has meant that the data were contained in documents rather than in ERP and other data-driven systems. And while email and other sources of unstructured content have certainly proliferated in recent years, a lot of business data, both new and legacy, are still contained in documents, including paper documents and images of them.
This is where the connection lies between document imaging and big data. Any organization that wants to apply analytics to a complete picture of its data can’t ignore document imaging.
Where Big Data Meets Big Documents
Take, for example, a bank that has years and years of mortgage files. Aren’t the data in these files worth analyzing? These mortgage files potentially contain a wealth of data related to interest rates, housing values, geographical buying trends, and the financial status of loan applicants. All of it could be mined to help the bank reduce risk on future mortgages, as well as anticipate market trends. It could also be useful for preventing fraud.
Insurance is another line of business with reams of historical paper and imaged files. A life insurance provider recently was dealing with a higher-than-anticipated number of death notices and wanted to see if there was any correlation with the policies that had been written for the recently deceased. Of course, the policies had been written years earlier, before the insurer was even entering its data electronically. To analyze the policy information, the insurer scanned its old paper policy forms and then had a data entry service enter data from the forms into its analytics system.
These are just two examples of potential paper document types that can be useful in big data applications. Contracts, geospatial documents, HR forms, financial transaction records, student records, patient records, and legal documents are just a few more types that could prove valuable.
Applying Automated Data Capture
The key to unlocking this information is likely to be some form of automated data-capture technology. That’s because, in many instances, document images are stored with only a bare minimum of indexing information, such as the date a file was scanned and an account number.
However, the real value of the images might lie in details like the price that was paid for a house, the diagnosis given to a patient, or the terms included in a contract. Automated data-capture technologies, like forms recognition and OCR, can be applied to expedite the extraction of these details. Forms recognition or “auto-classification” software can be used to identify document types. Once images are classified, auto-extraction can be used to extract specific data.
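As a rough illustration of the classify-then-extract flow described above, here is a minimal Python sketch. It assumes the page text has already been produced by an OCR engine, and it uses simple keyword matching as a stand-in for real forms recognition. The document types, keywords, field names, patterns, and sample text are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical keyword rules standing in for forms recognition
# ("auto-classification"). Real products use far richer methods.
CLASSIFICATION_KEYWORDS = {
    "mortgage": ["mortgage", "purchase price", "borrower"],
    "insurance_policy": ["policy number", "beneficiary", "insured"],
}

# Hypothetical extraction patterns, one set per document type.
FIELD_PATTERNS = {
    "mortgage": {
        "purchase_price": re.compile(
            r"purchase price:\s*\$?([\d,]+(?:\.\d{2})?)", re.IGNORECASE
        ),
        "interest_rate": re.compile(
            r"interest rate:\s*([\d.]+)\s*%", re.IGNORECASE
        ),
    },
    "insurance_policy": {
        "policy_number": re.compile(
            r"policy number:\s*([A-Z0-9-]+)", re.IGNORECASE
        ),
    },
}


def classify(ocr_text):
    """Return the document type whose keywords appear most often, or None."""
    text = ocr_text.lower()
    scores = {
        doc_type: sum(keyword in text for keyword in keywords)
        for doc_type, keywords in CLASSIFICATION_KEYWORDS.items()
    }
    best_type = max(scores, key=scores.get)
    return best_type if scores[best_type] > 0 else None


def extract_fields(ocr_text, doc_type):
    """Apply the patterns for doc_type; fields that are not found map to None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.get(doc_type, {}).items():
        match = pattern.search(ocr_text)
        fields[name] = match.group(1) if match else None
    return fields


# Made-up OCR output for a single scanned page.
sample = (
    "MORTGAGE NOTE\n"
    "Borrower: J. Doe\n"
    "Purchase Price: $250,000\n"
    "Interest Rate: 4.5 %"
)

doc_type = classify(sample)
print(doc_type, extract_fields(sample, doc_type))
```

Classifying first matters because it determines which extraction patterns are applied; running mortgage patterns against an insurance policy would simply return empty fields.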
Automated data capture is typically not 100% effective, but if it can automate extraction of 80% of the desired fields, with verification and exceptions keyed manually, it can provide a tremendous reduction in labor. There are often considerable professional services to be earned by a value-added reseller (VAR) for setting up these types of systems, which use various methods, such as rules and learn-by-example, to increase accuracy. Testing and honing a system with samples is key to optimizing it. Software vendors will typically offer VARs training on setting up their products.
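To make the 80% figure concrete, the following sketch shows one way the split between automated fields and manual exceptions might be modeled. It assumes each extracted field arrives with a confidence score between 0 and 1; the score, the 0.9 threshold, the field names, and the sample values are all hypothetical. Tuning a threshold like this against sample documents is one small example of the testing and honing mentioned above.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # assumed 0.0-1.0 score reported by the capture engine


def route_fields(fields, threshold=0.9):
    """Split fields into auto-accepted ones and exceptions for manual keying."""
    accepted, exceptions = [], []
    for field in fields:
        if field.confidence >= threshold:
            accepted.append(field)
        else:
            exceptions.append(field)
    return accepted, exceptions


def automation_rate(accepted, exceptions):
    """Fraction of fields that did not need manual keying."""
    total = len(accepted) + len(exceptions)
    return len(accepted) / total if total else 0.0


# Made-up extraction results for one mortgage file.
fields = [
    ExtractedField("purchase_price", "250,000", 0.98),
    ExtractedField("interest_rate", "4.5", 0.95),
    ExtractedField("borrower_name", "J. Doe", 0.97),
    ExtractedField("closing_date", "03/15/2001", 0.93),
    ExtractedField("property_address", "12 Elm St", 0.62),
]

accepted, exceptions = route_fields(fields)
rate = automation_rate(accepted, exceptions)
print(f"{rate:.0%} automated; key manually: {[f.name for f in exceptions]}")
```

In this made-up sample, four of the five fields clear the threshold, so only the address goes to a data entry operator. A production setup would likely also sample the accepted fields for verification rather than trusting them outright.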
A Bridge to the Future
Big data often brings to mind a slew of electronic inputs like Web sites, social media, and line-of-business systems. But let’s not forget that a lot of important business data have always been, and still are, included in paper transactional documents. Document imaging is the key to integrating these documents with emerging big data applications.