Document(Invoice) Parser

  • Tech Stack: javascript, OpenCV, Tesseract, Python, Tensorflow
  • Github URL: Project Link

Developed an algorithm to classify invoices into specific vendor types using Convolutional Neural Networks (CNNs) to analyze the structural patterns of invoices. The algorithm marked regions of importance based on the vendor class and used OpenCV to detect text and word blocks within those regions. Tesseract was then applied to extract text from the identified blocks, while DBSCAN clustering grouped related text blocks together.

The process enabled the extraction of essential key-value information from invoices, such as invoice ID, seller company name, and the grand total. This approach streamlined the extraction of structured data from unstructured invoice formats, offering a robust solution for automating invoice processing and classification across various vendors.