Human-in-the-loop (HITL) data extraction techniques have been around since Optical Character Recognition (OCR) software was first introduced. Originally, HITL processes were seen as a series of quality checks or validations to make sure the OCR tools collected data properly. Now that machine learning is becoming mainstream, the role of the human within the data extraction process is becoming far more complex and important.
In almost every industry there exists a significant amount of ‘critical’ information that needs to be accessed at a moment’s notice but may be buried deep in an email, PDF or Word document. The problem is that most of these documents have no real data structure or content organization, so getting at that ‘critical’ content when you need it is a tremendous challenge.
In supply chain management, important details written in documents such as legal contracts, invoices, pricing agreements, and almost everything in between are frequently locked away and inaccessible.
Clearly, there is a need to extract and store this valuable data in a structured format to be used in business analytics. The question is how to do it.
In a perfect world, you might have an analyst review each document, interpret the content and key the relevant elements into a database. In reality, the sheer size and scope of the content can make manual review infeasible. Instead, we need a more scalable approach that combines the insight of a human with the scalability of a computer application. This process is commonly called human-in-the-loop machine learning.
As an example, let’s start with a simple use case: suppose your company has several thousand suppliers and you want to aggregate their liability insurance information to track expiration dates, policy terms or other details.
To accomplish this task, you will first need to create a collection of related, semantic-based, real-world logic and knowledge. This collection of knowledge is referred to as an ontology. An ontology is a compilation of content libraries designed to translate and interpret the contents of your documents accurately. It holds not just definitions and knowledge but also the details of how it all connects semantically to your project. As an analogy: if the ontology were a part of the human body, it would not be the brain; rather, it would be all of the knowledge, experiences, and connections collected inside the brain.
From there, you will need a basic understanding of what data you would like to collect. Let’s say you want to extract several different terms, including coverage limits, effective dates or even jurisdictional information.
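One way to pin down those data needs is to write them as a small record type. The sketch below is only an illustration: the class name `InsuranceRecord`, its field names and the `missing_fields` helper are assumptions made for this example, not part of any particular product.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class InsuranceRecord:
    """Hypothetical target record for one supplier's liability policy."""
    supplier: str
    coverage_limit: Optional[float] = None
    effective_date: Optional[date] = None
    expiration_date: Optional[date] = None
    jurisdiction: Optional[str] = None

    def missing_fields(self) -> List[str]:
        # Names of fields that have not been filled in yet, in declaration order.
        return [name for name, value in vars(self).items() if value is None]
```

A record that is only partly filled in can then say exactly which terms are still outstanding, which becomes useful once the system has to report what it could not find.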
With those data needs established, you will need to create a set of training data. To create the first batch, you can have an analyst review a couple of sample documents and show the system what the relevant data is and how it looks. In essence, your analysts will be showing the ML system what patterns and information it needs to identify and collect. The application will then create a set of algorithms to replicate those steps and collect the related content.
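As a rough illustration of what an analyst's annotations might look like, the sketch below stores each one as a labeled character span in a sample document. The names `LabeledSpan` and `annotate`, and the sample sentence, are invented for this example. Real annotation tools have their own formats.

```python
from dataclasses import dataclass


@dataclass
class LabeledSpan:
    """Hypothetical annotation: a label attached to a character range."""
    label: str
    start: int
    end: int


def annotate(document: str, label: str, value: str) -> LabeledSpan:
    # Record where the analyst's highlighted value sits in the document.
    start = document.find(value)
    if start == -1:
        raise ValueError(f"{value!r} does not occur in the document")
    return LabeledSpan(label, start, start + len(value))


sample = "Coverage limit: $1,000,000. Effective date: 2024-01-01."
training_data = [
    annotate(sample, "coverage_limit", "$1,000,000"),
    annotate(sample, "effective_date", "2024-01-01"),
]
```

Each span tells the learning system both what the value is and where it appears, which is the "what the relevant data is and how it looks" part of the analyst's job.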
With the training data set in place, you can then run the bulk of your insurance certifications through the system and see whether the machine extracts the information you want. To be effective at scale, a machine learning system needs to be able to recognize and report when the target data is not successfully identified.
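The sketch below shows the "recognize and report" idea in miniature. It is not a machine learning model: simple regular expressions stand in for the learned extractor, and the `PATTERNS` table and `extract` function are assumptions made for this illustration. The point is that the function returns what it could not find alongside what it found.

```python
import re

# Stand-in for a trained model: one or more regexes per target field.
PATTERNS = {
    "coverage_limit": [r"coverage limit[:\s]+\$?([\d,]+)"],
    "effective_date": [r"effective date[:\s]+(\d{4}-\d{2}-\d{2})"],
}


def extract(text, patterns):
    """Return (found, missing): extracted values and the fields not identified."""
    found, missing = {}, []
    for field, regexes in patterns.items():
        for regex in regexes:
            match = re.search(regex, text, flags=re.IGNORECASE)
            if match:
                found[field] = match.group(1)
                break
        else:
            # No pattern matched, so report the field instead of guessing.
            missing.append(field)
    return found, missing


text = "Coverage Limit: $1,000,000\nPolicy begins 2024-01-01"
found, missing = extract(text, PATTERNS)
```

In this example the coverage limit is extracted, while the effective date is reported as missing because the document phrases it in a way the patterns have not seen.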
When the system fails to collect the desired information, it should alert its human counterpart, periodically and interactively. The human will then look at the places where the machine is stuck and demonstrate additional ways to interpret the data. Effectively, the human creates a new set of training data and collection techniques, which are then incorporated to make the system more robust. This process of testing content and then adding new data-capturing techniques in a perpetual cycle is called ‘human in the loop.’
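The cycle described above can be sketched as a loop. As a simplification, regular expressions again stand in for a learned model, and the human is represented by a callback. The names `extract_field`, `run_with_human_in_loop`, `ask_human` and `max_rounds` are hypothetical and exist only for this illustration.

```python
import re


def extract_field(text, regexes):
    """Return the first captured value, or None if nothing matches."""
    for regex in regexes:
        match = re.search(regex, text, flags=re.IGNORECASE)
        if match:
            return match.group(1)
    return None


def run_with_human_in_loop(documents, regexes, ask_human, max_rounds=3):
    """Extract one field per document, asking a human for help on failures."""
    regexes = list(regexes)      # copy so the caller's list is not modified
    results = {}
    pending = dict(documents)    # documents still waiting for a value
    for _ in range(max_rounds):
        failures = {}
        for doc_id, text in pending.items():
            value = extract_field(text, regexes)
            if value is None:
                failures[doc_id] = text
            else:
                results[doc_id] = value
        pending = failures
        if not pending:
            break
        # Alert the human with the failed documents and collect new patterns.
        new_regexes = ask_human(pending)
        if not new_regexes:
            break
        regexes.extend(new_regexes)
    return results, sorted(pending), regexes
```

In practice `ask_human` would open a review screen for an analyst. For a quick trial it can be a function that returns one extra pattern, for example a pattern for "expires on" when the system only knew "expiration date".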
Human-in-the-loop data extraction techniques can also allow machine learning systems to learn shapes, sounds, and orientations. These are especially useful where the data extraction needs involve handwriting or voicemails.