An overview of LayoutLM and its relation to IDP

The challenge with unstructured data is that much of it is locked inside images. To process such documents and extract meaningful insights from them, we need a model that can segment a document into its components, such as title, table, text, image, and so on. From these extractions and classifications, we can even deduce the type of document. So get ready to transform the enterprise using IDP solutions and experience a 4x increase in process capacity.

Generally speaking, AI-powered intelligent document processing (IDP) solutions automate the extraction of information from documents in order to save time, money, and human effort, as well as to reduce errors and increase access to data. Data extraction is one important task that artificial intelligence simplifies, and it has sped up workflows in organizations that implement automation. More specifically, the LayoutLM models extract layout-aware information from visually rich documents and help to overcome difficult challenges such as object detection and the extraction of data from tables.

Introduction – The LayoutLMs 

These days, a huge number of companies extract large amounts of data from business documents through manual effort, which is expensive, or through tools that require manual customization or configuration. LayoutLM addresses this: for the first time, text and layout are jointly learned in a single framework for document-level pre-training. At its release it achieved state-of-the-art results in several downstream tasks, including form understanding, receipt understanding, and document image classification.

For the purpose of later analysis, computers can now collect, identify, and label data from a wide variety of documents and feed it into a data analytics tool such as Excel, Power BI, or Tableau. IDP's focus is to reduce the need for human staff to handle repetitive documents. Automating these document-intensive processes allows businesses to obtain critical insights from data quickly, which is why IDP is used in banking, finance, healthcare, travel, and several other industries. Compared with other models used in IDP, LayoutLM is among the most effective pre-trained models for understanding forms and receipts and for document image classification. It was the first model to use text and layout information together, in the context of document images, to improve the interpretation of document images and text. LayoutLMv2 and LayoutLMv3 are successive improvements on the original, introducing novel methods for image feature extraction and layout understanding.

The LayoutLM models are leading document understanding models trained on large amounts of data. The data covers different document types such as forms, claims, invoices, receipts, and so on. Even though Optical Character Recognition (OCR) can extract the text from a document, there was a need for a model that also categorizes the content and labels it as text, image, heading, or table. This is the first thing that distinguishes LayoutLM models from OCR. LayoutLMv3, the latest version, is pre-trained on about 11 million document images.

Token Classification with LayoutLM models:

How does a machine know what key information is present in the provided document?

The key values differ based on the type of document. We need a well-performing AI system that can extract key information such as the invoice number, the name of the recipient, the account number, and so on. LayoutLM made this practical by extracting key information from the provided document, paving the way for successful document understanding. It is now easy to load and train different state-of-the-art transformer models without much PyTorch programming, similar to what Keras does for TensorFlow. What you need to do is create your training dataset and model configuration. There is no need to handle loading transformer tokenizers, defining optimizers, or generating the input text files yourself.
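
Token classification assigns one tag per word, so key fields still have to be assembled from runs of tagged words. Here is a minimal sketch of that grouping step, assuming IOB-style tags like the ones used later in this post. The function name group_entities and the sample words are hypothetical and not part of any library.

```python
def group_entities(words, tags):
    """Merge word-level IOB tags into (label, text) entities."""
    entities = []
    current_label, current_words = None, []

    def flush():
        # Store the entity collected so far, if there is one
        if current_words:
            entities.append((current_label, " ".join(current_words)))

    for word, tag in zip(words, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A "B-" tag (or an "I-" tag with a new label) starts a new entity
            flush()
            current_label, current_words = label, [word]
        elif prefix == "I":
            # An "I-" tag with the same label continues the current entity
            current_words.append(word)
        else:
            # "O" closes any open entity
            flush()
            current_label, current_words = None, []
    flush()
    return entities


# Invented sample data for illustration only
words = ["Invoice", "No:", "INV-001", "Name:", "Jane", "Doe"]
tags = ["B-QUESTION", "I-QUESTION", "B-ANSWER", "B-QUESTION", "B-ANSWER", "I-ANSWER"]
for label, text in group_entities(words, tags):
    print(label, "->", text)
```

Each QUESTION entity followed by an ANSWER entity can then be read as a key and its value.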

The LayoutLM model does what regular transformer models such as BERT or RoBERTa do, but it also takes the layout of the document into account. It is a suitable model for forms, tables, and any other document that mixes plain text and tables, where the alignment and position of words matter. Applications of LayoutLM include document processing tasks such as automating invoice processing, extracting table and form data, and processing contracts or resumes.

There are a few differences between version 2 and version 3 of the LayoutLM models, and both supersede the original version 1 model. So, let us compare LayoutLMv2 and LayoutLMv3.
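
To make the layout input concrete: alongside each word, LayoutLM models receive a bounding box whose coordinates are scaled to a 0 to 1000 range, independent of the page size. The sketch below shows that scaling. The function name normalize_box and the sample coordinates are illustrative assumptions.

```python
def normalize_box(box, width, height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# A word occupying pixels (100, 50) to (300, 80) on an 800 x 1000 page
normalized = normalize_box((100, 50, 300, 80), width=800, height=1000)
print(normalized)
```

Because the scale is fixed, the same word position produces the same model input regardless of scan resolution.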

Technical differences between LayoutLMv2 & LayoutLMv3

  • Unlike versions 1 & 2, LayoutLMv3 does not depend on other pre-trained models such as Faster R-CNN to extract image features from the provided documents.
  • It is pre-trained in a way that gives it better text-image alignment.
  • It does not require a Detectron model to run. This removes the memory and installation overhead that the Detectron backbone previously added.
  • Better form understanding, receipt understanding, and document image understanding.
  • Unlike LayoutLMv2, which uses WordPiece, LayoutLMv3 tokenizes text with byte-pair encoding (BPE).
  • LayoutLMv3 requires the image in RGB format, whereas LayoutLMv2 uses the image in BGR format.
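
The RGB/BGR difference in the last point is only a matter of channel order. The sketch below illustrates it on a tiny image stored as nested lists of pixels. The function name rgb_to_bgr is hypothetical. With a NumPy array of shape (height, width, 3), the same flip is usually written as array[..., ::-1].

```python
def rgb_to_bgr(pixels):
    """Reverse the channel order of every pixel (RGB <-> BGR)."""
    return [[tuple(reversed(pixel)) for pixel in row] for row in pixels]


# A 1 x 2 image: one pure-red pixel and one pure-blue pixel, in RGB order
rgb_image = [[(255, 0, 0), (0, 0, 255)]]
bgr_image = rgb_to_bgr(rgb_image)
print(bgr_image)
```

Applying the function twice returns the original image, so the same helper converts in both directions.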

How to run LayoutLMv3?

The main aim of this blog is to explain the differences between the two versions of the LayoutLM model and to show how to implement LayoutLMv3. Below is an implementation of token classification using LayoutLMv3, followed by an explanation.

Install the required libraries and import them:

!pip install -q git+https://github.com/huggingface/transformers.git
!pip install pytesseract datasets 
!apt install tesseract-ocr

import numpy as np
import pytesseract
import torch
from datasets import load_dataset
from datasets.features import ClassLabel
from PIL import ImageDraw, ImageFont, Image
from transformers import AutoProcessor
from transformers import LayoutLMv3ForTokenClassification

# If tesseract is not on your PATH, point pytesseract at the binary:
# pytesseract.pytesseract.tesseract_cmd = "<path to tesseract>"

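def get_label_list(labels):
    # Collect the unique labels across all examples and return them sorted
    unique_labels = set()
    for label in labels:
        unique_labels = unique_labels | set(label)
    label_list = list(unique_labels)
    label_list.sort()
    return label_list

def iob_to_label(label):
    # Strip the "B-"/"I-" prefix; a bare "O" tag becomes 'other'
    label = label[2:]
    if not label:
        return 'other'
    return label

def unnormalize_box(bbox, width, height):
    # Convert a box on the 0-1000 scale back to pixel coordinates
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]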

def doc_tagging(image):
  label2color = {'question':'blue', 'answer':'green', 'header':'orange', 'other':'violet'}
  datasets = load_dataset("nielsr/funsd-layoutlmv3")
  features = datasets["train"].features
  column_names = datasets["train"].column_names
  image_column_name = "image"
  text_column_name = "tokens"
  boxes_column_name = "bboxes"
  label_column_name = "ner_tags"
  if isinstance(features[label_column_name].feature, ClassLabel):
    # The dataset already stores the label names
    labels = features[label_column_name].feature.names
  else:
    # Otherwise derive the label names from the data
    labels = get_label_list(datasets["train"][label_column_name])
  id2label = {k: v for k,v in enumerate(labels)}
  label2id = {v: k for k,v in enumerate(labels)}
  image = image.convert("RGB")
  # apply_ocr=True makes the processor run Tesseract OCR on the image itself
  processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
  encoding = processor(image, return_offsets_mapping=True, return_tensors="pt", truncation=True)
  offset_mapping = encoding.pop('offset_mapping')
  device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
  for k,v in encoding.items():
      encoding[k] = v.to(device)
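  # Assumption: the arguments after the checkpoint name are missing, so the label mappings are passed here.
  # The base checkpoint has no trained classification head; use one fine-tuned on FUNSD for meaningful labels.
  model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base",
                                                           id2label=id2label, label2id=label2id).to(device)
  with torch.no_grad():
    outputs = model(**encoding)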
  predictions = outputs.logits.argmax(-1).squeeze().tolist()
  token_boxes = encoding.bbox.squeeze().tolist()
  width, height = image.size
  is_subword = np.array(offset_mapping.squeeze().tolist())[:,0] != 0
  true_predictions = [id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]]
  true_boxes = [unnormalize_box(box, width, height) for idx, box in enumerate(token_boxes) if not is_subword[idx]]
  draw = ImageDraw.Draw(image)
  font = ImageFont.load_default()
  for prediction, box in zip(true_predictions, true_boxes):
    predicted_label = iob_to_label(prediction).lower()
    draw.rectangle(box, outline=label2color[predicted_label])
    draw.text((box[0] + 10, box[1] - 10), text=predicted_label, fill=label2color[predicted_label], font=font)
  return image

Let’s break down the code step by step:

1. get_label_list(labels): This function takes a list of labels and returns a sorted list of unique labels. It is used to determine the possible labels for token classification.

2. iob_to_label(label): This function takes a label in the “IOB” (Inside, Outside, Beginning) format and converts it to a simpler label format. For example, “B-question” is converted to “question.” If the label is empty, it is mapped to “other.”

3. unnormalize_box(bbox, width, height): This function takes a bounding box represented as four values between 0 and 1000 and converts it to pixel coordinates based on the provided width and height.

4. doc_tagging(image): This is the main function for document tagging using LayoutLMv3.

Here’s what it does step by step:

  – label2color: A dictionary that maps label names to colors for visualization purposes.

   – Loads the “nielsr/funsd-layoutlmv3” dataset.

   – Retrieves various features, column names, and data related to images, tokens, bounding boxes, and NER (Named Entity Recognition) tags from the dataset.

   – Constructs dictionaries id2label and label2id to map label names to label IDs and vice versa.

   – Converts the input image to RGB format and initializes an OCR (Optical Character Recognition) processor using LayoutLMv3.

   – Encodes the image and extracts token-level information, such as token IDs, attention masks, and bounding boxes, while also mapping the offsets of tokens within the original text.

   – Moves the encoding data and the model to the appropriate device (CPU or GPU).

   – Passes the encoded image data through the LayoutLMv3 model to obtain token classification predictions. These predictions are based on the LayoutLMv3 model’s understanding of the document’s layout.

   – Processes the predictions to extract the most likely label for each token.

   – Converts the normalized bounding box coordinates to pixel coordinates based on the image’s width and height.

   – Draws rectangles around the recognized tokens on the image and adds text labels with corresponding colors.

   – Finally, it returns the annotated image, which the caller can then save, for example as “output.png.”
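
The post-processing steps above (dropping subword tokens, simplifying the IOB labels, and converting boxes back to pixels) can be tried without downloading a model. The sketch below uses small hand-written stand-ins for the model outputs. The predictions, offsets, boxes, and page size are invented for illustration, and the two helper functions are repeated so the example runs on its own.

```python
def iob_to_label(label):
    # Strip the "B-"/"I-" prefix; a bare "O" tag becomes "other"
    label = label[2:]
    return label if label else "other"


def unnormalize_box(bbox, width, height):
    # Convert a box on the 0-1000 scale back to pixel coordinates
    return [
        width * (bbox[0] / 1000),
        height * (bbox[1] / 1000),
        width * (bbox[2] / 1000),
        height * (bbox[3] / 1000),
    ]


# Invented stand-ins for model outputs: four tokens, where the first
# word was split into two subword tokens that share one bounding box.
id2label = {0: "O", 1: "B-QUESTION", 2: "B-ANSWER"}
predictions = [1, 1, 2, 0]
offset_mapping = [(0, 2), (2, 7), (0, 3), (0, 1)]
token_boxes = [
    [250, 125, 500, 250],
    [250, 125, 500, 250],
    [500, 125, 750, 250],
    [750, 125, 875, 250],
]
width, height = 800, 1000

# A token whose character offset does not start at 0 continues a word
is_subword = [start != 0 for start, _ in offset_mapping]

results = [
    (iob_to_label(id2label[pred]).lower(), unnormalize_box(box, width, height))
    for pred, box, sub in zip(predictions, token_boxes, is_subword)
    if not sub
]
for label, box in results:
    print(label, box)
```

Only the first token of each word survives the filter, so every word is drawn once with a single label.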

Conclusion & Future Work:

  • To extract visual features, LayoutLMv3 does not depend on a pre-trained CNN or Faster R-CNN backbone.
  • LayoutLMv3 uses unified text and image masking pre-training objectives: masked language modeling, masked image modeling, and word-patch alignment.
  • Future work: explore few-shot and zero-shot learning capabilities to support more real-world business scenarios and improve business efficiency in the Document AI industry.

Author: Paushigaa S, AI Research Intern
