LexiScan Auto: An Intelligent Automated Legal Entity Extractor Using Deep Learning NER for High-Volume Contract Processing in FinTech Legal Operations

Author(s):

Hitesh Suresh Sethiya, PES Modern College of Engineering, Pune; Prof. Yogeshchandra Puranik, PES Modern College of Engineering, Pune

Keywords:

Named Entity Recognition (NER), Bidirectional LSTM, Legal Document Processing, OCR Pipeline, TF-IDF, Conditional Random Field (CRF), Transformer Fine-Tuning, FinTech NLP, Information Extraction, spaCy, TensorFlow.

Abstract

The exponential growth of digital legal documentation in modern FinTech and enterprise law practice has created an acute need for automated, accurate, and scalable information extraction systems. This paper presents LexiScan Auto, a production-grade Intelligent Document Parsing system that employs a multi-stage pipeline combining Optical Character Recognition (OCR), classical Natural Language Processing (NLP) baselines, and a custom deep learning Named Entity Recognition (NER) architecture to automatically extract critical legal entities from unstructured PDF contracts at scale. The system targets four high-value entity classes: Party Names, Effective Dates, Total Monetary Values, and Jurisdiction. We architect an end-to-end pipeline: (1) a robust OCR sub-system powered by Tesseract and pdf2image to digitize scanned, image-based PDFs; (2) a TF-IDF vectorization and Regex-based preliminary classifier providing a strong interpretable baseline; (3) a custom Bidirectional Long Short-Term Memory (Bi-LSTM) model with a Conditional Random Field (CRF) decoding layer, trained on domain-specific legal corpora, achieving superior F1 scores over rule-based approaches; and (4) a post-processing validation engine enforcing logical heuristic constraints (e.g., Termination Date must follow Effective Date). Experimental results demonstrate that LexiScan Auto achieves a macro-averaged F1-score of 91.7% across all four entity categories on a held-out legal contract test set, representing a 38.4% improvement over pure Regex baselines and a 21.2% improvement over off-the-shelf spaCy NER models. The system is implemented in Python using TensorFlow/Keras and spaCy, and is designed for seamless integration into high-volume law firm document management workflows.
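
As a rough illustration of stages (2) and (4) above, the following minimal Python sketch shows a Regex baseline extractor and a date-order check. It is not the authors' implementation: the patterns `DATE_RE` and `MONEY_RE`, the function names, and the restriction to ISO-style dates (YYYY-MM-DD) and dollar amounts are all assumptions made for this example. The Bi-LSTM-CRF stage is not shown.

```python
import re
from datetime import date, datetime
from typing import List, Optional

# Hypothetical patterns for illustration only; the paper does not list its rules.
# Assumes dates are written as YYYY-MM-DD and amounts are prefixed with "$" or "USD".
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
MONEY_RE = re.compile(r"(?:USD|\$)\s?(\d[\d,]*(?:\.\d{2})?)")


def extract_dates(text: str) -> List[date]:
    """Return every valid YYYY-MM-DD date in the text, in order of appearance."""
    found = []
    for match in DATE_RE.finditer(text):
        try:
            found.append(datetime.strptime(match.group(0), "%Y-%m-%d").date())
        except ValueError:
            # Skip strings that look like dates but are not (e.g. 2024-13-45).
            continue
    return found


def extract_amounts(text: str) -> List[float]:
    """Return every monetary amount in the text as a float, commas removed."""
    return [float(m.group(1).replace(",", "")) for m in MONEY_RE.finditer(text)]


def validate_dates(effective: Optional[date], termination: Optional[date]) -> List[str]:
    """Apply the heuristic from the abstract: termination must follow effective date."""
    errors = []
    if effective is None:
        errors.append("missing effective date")
    if effective and termination and termination <= effective:
        errors.append("termination date must follow effective date")
    return errors


if __name__ == "__main__":
    sample = "Effective 2024-01-15, terminating 2025-01-14. Total fee: $1,250,000.00."
    dates = extract_dates(sample)
    print(dates, extract_amounts(sample))
    print(validate_dates(dates[0], dates[1]))
```

In a pipeline like the one described, a check such as `validate_dates` would run on the entities produced by the NER model, and any returned errors could be used to flag a contract for manual review.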

Other Details

Paper ID: IJSRDV14I40003
Published in: Volume: 14, Issue: 4
Publication Date: 01/07/2026
Page(s): 65-70
