Data Science & Machine Learning

Machine learning for efficient document classification

Our client, a multinational healthcare corporation, processes vast numbers of documents such as publications, training material, brochures, and presentations. Each document must be made available to different sales representatives, who then distribute them to clients and other stakeholders on a targeted basis. Locating specific documents or groups of documents in the old AWS system required manual tagging, a time-consuming and error-prone task.

The client asked FELD M for support in solving the issue.

Main objectives:

  • Development and evaluation of Natural Language Processing models and further machine learning (ML) models, both black-box and tailor-made, for document labelling 
  • Image extraction from PDFs and zero-shot classification of labels using the CLIP model
  • A production-ready pipeline in AWS, capable of processing thousands of documents on the fly

Our approach 

The FELD M team concluded that an automated data pipeline was the key to addressing the issue. The pipeline needed to handle the intricate demands of the existing tagging taxonomy efficiently. It was also essential that it adapt to user input and feedback, creating a continuous feedback loop that steadily improves tagging accuracy.

The overall goal of the project was to implement and deploy a pipeline that uses the developed models and integrates their prediction results into other services (for example, as features for a recommendation engine or to enrich dashboards).

Development of a data product from whiteboard to production  

Following Design Thinking principles, the Data Product team of FELD M started with a combination of desk research, stakeholder interviews, and workshops. This gave the team a deep understanding of the different business needs, the use cases for tagged content, and the taxonomy for document tagging. These findings formed the basis for a concept of how end users could work with potential solutions. Together with experts from data science and data engineering, the team then developed a tailored concept for a complete solution, comprising several machine learning models, image classification, a processing pipeline, and a user interface.

Our new model outperformed AWS Comprehend in several cases

FELD M used state-of-the-art deep learning-based computer vision techniques to extract labels from the pictures contained in the documents. We opted for CLIP as an open-source alternative to AWS Rekognition because it supports zero-shot classification. This allowed us to provide the model with a customized list of objects of interest to be detected. For each of these objects, the model returns a probability. The objects with the highest probabilities above a certain threshold were chosen as additional features for the documents.
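The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes the image-text similarity scores have already been computed by a model such as CLIP, and the function name, object names, threshold and `top_k` values are all hypothetical.

```python
import math


def select_image_labels(similarities, threshold=0.2, top_k=3):
    """Turn raw similarity scores into probabilities and keep the best labels.

    similarities: dict mapping a candidate object name to a raw
    image-text similarity score (assumed to come from a CLIP-style model).
    threshold and top_k are illustrative values, not taken from the project.
    """
    if not similarities:
        return []
    # Numerically stable softmax over the candidate objects.
    max_score = max(similarities.values())
    exp_scores = {name: math.exp(s - max_score) for name, s in similarities.items()}
    total = sum(exp_scores.values())
    probs = {name: value / total for name, value in exp_scores.items()}
    # Rank by probability, then keep only the top candidates above the threshold.
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return [(name, p) for name, p in ranked[:top_k] if p >= threshold]


# Hypothetical scores for one image extracted from a PDF.
example_scores = {"stethoscope": 4.0, "pill bottle": 3.5, "laptop": 1.0, "syringe": 0.5}
selected = select_image_labels(example_scores)
print([name for name, _ in selected])
```

The labels that survive the threshold would then be attached to the document as extra features for the downstream classifiers.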

The FELD M team built a production document classification pipeline, based on Python and using MLflow for model management, and integrated it into the existing AWS infrastructure. This included a benchmark classification model (XGBoost), built for each of the given tags using a set of curated features and image labels. The results were compared with those of AWS Comprehend, a black-box NLP service for text classification tasks, using several machine learning metrics (such as the F1-score). For several labels our new model outperformed AWS Comprehend. We therefore decided to use our new model for these labels and to stay with AWS Comprehend for the remaining ones to achieve the best overall results.
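The per-label comparison can be illustrated with a short sketch. It assumes binary ground truth and predictions per tag on a held-out set; the tag names, the model names `custom` and `comprehend`, and the data are invented for illustration, and F1 is the only metric considered here, whereas the project used several.

```python
def f1_score(y_true, y_pred):
    """F1-score for binary labels (1 = tag applies, 0 = tag does not apply)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def choose_model_per_label(y_true_by_label, preds_by_model):
    """Pick, for every label, the model with the highest F1-score.

    Ties go to the model listed first in preds_by_model.
    """
    choice = {}
    for label, y_true in y_true_by_label.items():
        scores = {
            model: f1_score(y_true, preds[label])
            for model, preds in preds_by_model.items()
        }
        choice[label] = max(scores, key=scores.get)
    return choice


# Hypothetical held-out data for two tags.
y_true_by_label = {"brochure": [1, 0, 1, 1, 0], "training": [0, 1, 1, 0, 0]}
preds_by_model = {
    "custom": {"brochure": [1, 0, 1, 1, 0], "training": [0, 1, 0, 0, 1]},
    "comprehend": {"brochure": [1, 1, 0, 1, 0], "training": [0, 1, 1, 0, 0]},
}
routing = choose_model_per_label(y_true_by_label, preds_by_model)
print(routing)
```

The resulting mapping acts as a routing table: at prediction time each tag is served by whichever model scored better for it.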

Fully automated document labelling pipeline in AWS 

FELD M successfully developed an automated document tagging pipeline within AWS. The solution has an integrated feedback loop: when new documents become available or when users provide feedback by suggesting or correcting document labels, the pipeline is triggered and the models are retrained, so that predictions become more accurate with every iteration.
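As a rough sketch of how such a feedback loop might fold user corrections back into the training data, consider the following. The data structures, the function name, and the idea of retraining only the affected tags are assumptions made for illustration; the source does not describe the actual trigger mechanism.

```python
def apply_feedback(labels, feedback):
    """Merge user feedback into the current document labels.

    labels: dict mapping a document id to a set of tags.
    feedback: list of (doc_id, tag, is_correct) tuples, where is_correct
    is True for a suggested tag and False for a tag flagged as wrong.
    Returns the updated labels and the set of tags whose training data changed.
    """
    updated = {doc_id: set(tags) for doc_id, tags in labels.items()}
    changed_tags = set()
    for doc_id, tag, is_correct in feedback:
        tags = updated.setdefault(doc_id, set())
        if is_correct and tag not in tags:
            tags.add(tag)
            changed_tags.add(tag)
        elif not is_correct and tag in tags:
            tags.discard(tag)
            changed_tags.add(tag)
    return updated, changed_tags


# Hypothetical current labels and a batch of user feedback.
current = {"doc-1": {"brochure"}, "doc-2": {"training"}}
feedback = [
    ("doc-1", "cardiology", True),   # user suggests a new tag
    ("doc-2", "training", False),    # user rejects a predicted tag
    ("doc-3", "brochure", True),     # feedback on a newly added document
]
updated, tags_to_retrain = apply_feedback(current, feedback)
print(sorted(tags_to_retrain))
```

In this sketch a non-empty `tags_to_retrain` would be the signal to start a retraining run for those tags.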

The solution fully meets the client’s needs, automatically and accurately classifying and organizing documents, making it easier to recommend specific documents to sales representatives.
