JPMorgan’s AI team has released another paper on document analysis. DocGraphLM is a development in Visually Rich Document Understanding (VrDU) that significantly improves information extraction (IE) and question answering (QA) over documents with complex layouts.
DocGraphLM introduces a unique approach by combining pre-trained language models with graph semantics. The framework proposes a joint encoder architecture for document representation and a groundbreaking link prediction approach for reconstructing document graphs.
Notably, this link prediction method predicts both the direction and the distance between nodes, prioritising the restoration of nearby neighbours over the detection of distant nodes.
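The joint objective described above can be sketched as a two-part loss: a classification term for the direction between a pair of nodes and a regression term for their distance. The function below is an illustrative assumption, not the paper's exact formulation; the names (`joint_link_loss`, `alpha`), the eight-way direction classes, and the equal-weight combination are all hypothetical, though the log transform on distance reflects the stated emphasis on restoring nearby neighbours.

```python
import numpy as np

def joint_link_loss(dir_logits, dir_labels, dist_pred, dist_true, alpha=0.5):
    """Hypothetical joint loss for document-graph reconstruction:
    classify the direction between two nodes (one of 8 sectors) and
    regress their distance.

    dir_logits: (N, 8) unnormalised direction scores per node pair
    dir_labels: (N,) true sector index per pair
    dist_pred:  (N,) predicted log-distances
    dist_true:  (N,) true distances (log-transformed before comparison)
    """
    # Direction term: softmax cross-entropy over the 8 sectors,
    # computed in a numerically stable way.
    z = dir_logits - dir_logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(dir_labels)), dir_labels].mean()

    # Distance term: mean squared error on log-distance; the log
    # compresses large distances, so errors on nearby neighbours
    # dominate the gradient.
    mse = ((dist_pred - np.log1p(dist_true)) ** 2).mean()

    return alpha * ce + (1 - alpha) * mse
```

A perfect prediction (correct sector, exact log-distance) drives both terms toward zero, while misclassified directions or mis-scaled distances raise the loss.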
Experiments conducted on three benchmark datasets, FUNSD, CORD, and DocVQA, consistently demonstrate improved performance on IE and QA tasks when graph features are incorporated. The researchers report that adopting graph features not only enhances task-specific outcomes but also accelerates learning during training.
The researchers outline the main contributions of their work, including:
- Proposing a novel architecture that integrates a graph neural network with pre-trained language models to enhance document representation.
- Introducing a link prediction approach for document graph reconstruction, emphasising restoration on nearby neighbour nodes through a joint loss function.
- Demonstrating that the proposed graph neural features consistently improve performance and accelerate convergence in the learning process.
In the context of representing documents as graphs, DocGraphLM adopts an innovative heuristic known as Direction Line-of-sight (D-LoS) to generate edges between nodes. This approach divides the 360-degree horizon surrounding a source node into eight discrete 45-degree sectors, determining the nearest node within each sector.
This method avoids the issues associated with traditional approaches such as k-nearest neighbours (KNN) or the β-skeleton, resulting in a more relevant and efficient graph representation.
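The D-LoS heuristic described above can be sketched in a few lines: for each source node, bucket every other node into one of eight 45-degree sectors around it and keep only the nearest node per sector. This is a minimal illustration assuming nodes are represented by their centre points; the function name and input format are hypothetical, not from the paper.

```python
import math

def d_los_edges(points):
    """Build graph edges with a Direction Line-of-Sight heuristic:
    split the 360-degree horizon around each source node into eight
    45-degree sectors and link to the nearest node in each sector.

    points: list of (x, y) centre coordinates of document segments.
    Returns a list of (source, target, sector) edges.
    """
    edges = []
    for i, (xi, yi) in enumerate(points):
        nearest = {}  # sector index -> (distance, node index)
        for j, (xj, yj) in enumerate(points):
            if i == j:
                continue
            dx, dy = xj - xi, yj - yi
            # atan2 is quadrant-aware; fold into [0, 2*pi) then
            # divide by 45 degrees to get a sector index 0..7.
            angle = math.atan2(dy, dx) % (2 * math.pi)
            sector = int(angle // (math.pi / 4))
            dist = math.hypot(dx, dy)
            if sector not in nearest or dist < nearest[sector][0]:
                nearest[sector] = (dist, j)
        edges.extend((i, j, s) for s, (_, j) in nearest.items())
    return edges
```

Unlike KNN, which can link a node to several near-duplicates in the same direction, this keeps at most one neighbour per sector, so edges spread evenly around each node.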
Last month, JPMorgan also released DocLLM, a generative language model designed for multimodal document understanding. DocLLM stands out as a lightweight extension to LLMs for analysing enterprise documents, spanning forms, invoices, reports, and contracts that carry intricate semantics at the intersection of the textual and spatial modalities.
The post JPMorgan Releases DocGraphLM, For Visual Document Analysis appeared first on Analytics India Magazine.