One of the biggest challenges in industrializing document processing at large companies is that most documents lack structure.
State-of-the-art technology (OCR combined with AI) can process structured documents such as invoices with a fairly high success rate. In our experience, a dataset of approximately 500-600 invoices covering 10-15 different layouts and formats is enough to achieve an accuracy above 90%.
But what about unstructured documents?
With one of our latest clients, in the energy sector, we achieved a very high level of automation in extracting information from unstructured documents. At first glance these documents appear structured: they contain technical information on certain chemical products, must include information required by law, and must follow a broadly similar structure. In practice they vary widely.
The use case was a classic one: process the document, extract the information (55 fields) and check that it is correct. Done manually, this process (reading, extraction, verification and entry into systems) takes approximately 180 minutes of work by a technical specialist with more than 8 years of experience, which is a highly paid profile.
The methodology we applied (similar to the one we use for mortgage documentation processing and many other cases) takes several factors into account to extract the relevant information:
the geometric position of the text
the semantic meaning of that text within the document
In our client's documents we found several peculiarities:
Lack of structure
Little consistency in the text itself
Different expressions to refer to the same concept
Similar (even identical) expressions whose meaning, and therefore relevance, changed depending on where they appeared in the text
Each supplier designed and used a different format and structure
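To make the position-dependent case concrete, here is a minimal Python sketch of one way to resolve an identical label to different fields depending on the section it appears in. It is an illustration only, not the platform's actual implementation: the `Line` type, the `FIELD_BY_CONTEXT` mapping and the field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    y: float          # vertical position on the page, increasing downwards
    is_heading: bool  # assumed to be known already (e.g. from font size)


# Hypothetical mapping: (section heading, label) -> target field name.
FIELD_BY_CONTEXT = {
    ("storage", "temperature"): "storage_temperature",
    ("transport", "temperature"): "transport_temperature",
}


def resolve_fields(lines):
    """Map 'label: value' lines to fields using the nearest heading above."""
    fields = {}
    section = None
    for line in sorted(lines, key=lambda l: l.y):
        if line.is_heading:
            section = line.text.strip().lower()
            continue
        label, sep, value = line.text.partition(":")
        if not sep:
            continue  # not a 'label: value' line
        key = (section, label.strip().lower())
        if key in FIELD_BY_CONTEXT:
            fields[FIELD_BY_CONTEXT[key]] = value.strip()
    return fields


lines = [
    Line("Storage", 10, True),
    Line("Temperature: 5 C", 20, False),
    Line("Transport", 40, True),
    Line("Temperature: 25 C", 50, False),
]
fields = resolve_fields(lines)
```

Here the same label, "Temperature", ends up in two different fields because the heading above it differs.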
The challenge was significant. To solve it, we configured and trained the extraction models taking the following parameters into account:
grouping text using a specific geometric rule based on font size, position, etc.
creating an external dictionary of relevant technical expressions
transforming text into vectors
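As a rough illustration of these three parameters, the following Python sketch groups text spans with a simple geometric rule, normalizes supplier wording with a small dictionary, and compares texts as vectors. Everything in it is an assumption made for the example: the `Span` fields, the `SYNONYMS` entries, the thresholds, and the use of plain bag-of-words vectors with cosine similarity in place of trained models.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x: float          # left edge of the text box
    y: float          # top edge, increasing down the page
    font_size: float


# Hypothetical external dictionary: supplier wording -> canonical expression.
SYNONYMS = {
    "flash pt": "flash point",
    "flashpoint": "flash point",
    "boiling pt": "boiling point",
}


def group_spans(spans, max_gap=1.5):
    """Start a new block when the font size changes or the vertical gap
    exceeds max_gap times the previous span's font size."""
    blocks, current = [], []
    for span in sorted(spans, key=lambda s: (s.y, s.x)):
        if current:
            prev = current[-1]
            same_font = abs(span.font_size - prev.font_size) < 0.5
            close = (span.y - prev.y) <= max_gap * prev.font_size
            if not (same_font and close):
                blocks.append(current)
                current = []
        current.append(span)
    if current:
        blocks.append(current)
    return blocks


def normalize(text):
    """Lowercase the text and replace known variants with canonical terms."""
    text = text.lower()
    for variant, canonical in SYNONYMS.items():
        text = text.replace(variant, canonical)
    return text


def vectorize(text):
    """Bag-of-words vector (token counts) of the normalized text."""
    return Counter(re.findall(r"[a-z0-9]+", normalize(text)))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


spans = [
    Span("PHYSICAL PROPERTIES", 50, 100, 14),
    Span("Flash pt: 23 C", 50, 120, 10),
    Span("Boiling pt: 78 C", 50, 132, 10),
    Span("TRANSPORT", 50, 180, 14),
    Span("UN number: 1170", 50, 200, 10),
]
blocks = group_spans(spans)
similarity = cosine(vectorize("Flash pt"), vectorize("flashpoint"))
```

In this toy input the two property lines fall into one block under their heading, and "Flash pt" and "flashpoint" become the same vector once the dictionary has been applied.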
With all this, we were able to give different documents, with different sections, a common structure, and therefore to identify the 55 relevant fields / concepts and subsequently check them via API.
The process has now been reduced from hours to seconds. The responsible technicians now spend their time managing suppliers whose documentation does not comply with the legally established process, instead of reading documents and entering data manually.
Write to us if you have similar processes, want to know the details of how the platform works, or need a demo.