Image Classification and Information Extraction in Scanned Documents
The machine learning team at LITIS is working on a project with a company that wants to automate its supply chain by introducing AI technologies able to link delivery notes with purchase orders. Using deep learning, LITIS will develop two main software components: 1) scanned document classification; 2) information extraction (detection and recognition of alphanumeric fields in document images).
Your mission is twofold: 1) develop a document image classification module, based on deep neural networks, that handles both image-based features and text-based features (obtained from OCR), and that can learn incrementally so that new document types can be added over time; 2) develop an extraction module that exploits OCR character hypotheses to detect the alphanumeric fields of interest, by designing specific language models and extraction templates (using Kaldi). The datasets consist of scanned documents taken from the company's production workflow, with random degradations due to scanning, conditions of use, etc.