Ph.D. Student: Multimodal models for Document Image Understanding


Context and main objectives

The digital transformation of libraries, which has been based on OCR (Optical Character Recognition) technology for more than 20 years, faces limitations both in quality, due to the diversity of the collections and the shortcomings of OCR technology, and in added value, due to a lack of structuring and high-level indexing. Named entity extraction is still little used because it relies on language processing technologies that were not very adaptable until recently. More generally, the semantic indexing of collections is underdeveloped and poorly integrated with metadata. We propose to develop multimodal models (text + image) for extracting information from collections of digitized documents in large libraries. The literature shows that work in this direction is still limited and mainly targets commercial documents (invoices, etc.).

The proposed project aims to disrupt the traditional sequential document processing workflow by combining vision models and Large Language Models (LLMs) to provide a more streamlined and efficient approach. The standard two-stage architectures based on OCR + NER (Optical Character Recognition, Named Entity Recognition) are now giving way to end-to-end multimodal approaches known as Document Understanding, which are more versatile and easily adaptable to new corpora, making it easier and more cost-effective to set up and run document processing projects. As a result, this accessible, user-friendly approach will democratize access to advanced AI technologies for a wider range of institutions, contributing to the evolution of the technology value chain in the Libraries, Archives and Museums (LAM) sector and opening up new opportunities for research and discovery.


The proposed work program, funded by the FINLAM project (Foundation INtegrated models for Libraries Archives and Museum, ANR 2023), relies on the expertise of LITIS. It will study the most relevant multimodal architectures for integrating the language knowledge conveyed by recently developed large language models, as well as the ways of specializing or adapting these models jointly with the training of a generic optical encoder, benefiting from the annotated collections available at the French national library (Bibliothèque nationale de France, BnF). User interaction will be considered through different scenarios of closed and open queries.

State of the art overview

In 2022, the first end-to-end models integrating OCR and named entity extraction were proposed for document understanding tasks. The DONUT (DOcumeNt Understanding Transformer) model [1], proposed by NaverLab and Google Asia, works in a single stage: it analyzes the layout to detect writing areas, recognizes their content using a lexicon of subwords, and finally detects the named entities using specific TAGs, with the support of a strong external language model (BART). DONUT is pre-trained on synthetic documents whose associated ground truth is a sequence of subwords and TAGs. No segmentation ground truth is used. Document Understanding is thus reduced to the task of learning a tagged language, provided that the system has the vision capabilities needed to build high-level visual representations. A similar approach was proposed by Adobe in autumn 2022 with DESSURT [2]. The Pix2struct [3] architecture proposed by Google USA also falls into this category of integrated systems for document understanding.
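To make the idea of a "tagged language" concrete, here is a minimal sketch of how a flat tagged transcription could be turned back into named entities and plain text. The tag names (`person`, `place`, `date`) and the XML-like tag syntax are assumptions made for illustration only; each model defines its own special tokens, and this is not the decoding code of DONUT or DAN.

```python
import re

# Hypothetical tag syntax: <tag>text</tag>. Real models use their own tokens.
TAG_PATTERN = re.compile(r"<(?P<tag>\w+)>(?P<text>.*?)</(?P=tag)>", re.DOTALL)


def extract_entities(tagged_sequence: str) -> list[tuple[str, str]]:
    """Return (tag, text) pairs found in a flat tagged transcription."""
    return [
        (match.group("tag"), match.group("text").strip())
        for match in TAG_PATTERN.finditer(tagged_sequence)
    ]


def strip_tags(tagged_sequence: str) -> str:
    """Recover the plain transcription by dropping every tag token."""
    plain = re.sub(r"</?\w+>", "", tagged_sequence)
    return " ".join(plain.split())


# Illustrative model output (invented for this example).
example = (
    "Letter from <person>Victor Hugo</person> sent from "
    "<place>Guernsey</place> in <date>1862</date>."
)

entities = extract_entities(example)
transcription = strip_tags(example)
```

With such a representation, the same predicted sequence yields both the transcription and the entities, which is why no separate segmentation ground truth is required during training.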


In 2022, the LITIS Machine Learning team proposed two models for digitized documents that integrate the layout analysis stage. The VAN (Vertical Attention Network) model is capable of learning to recognize paragraphs of handwritten text [4] and outperforms the state of the art. The DAN (Document Attention Network) model [5] can learn the layout and the handwriting of a handwritten document end-to-end. The DAN model is trained on synthetic printed documents before being specialized on handwritten documents without using any physical segmentation information. It outputs the recognized texts enriched with some layout TAGs. The DONUT and DAN models rely on the same visual attention mechanisms, implemented with a transformer-type network; both are pre-trained on synthetic documents and use only tagged transcriptions of texts during training. DAN is specialized in text recognition, while DONUT is specialized in named entity extraction.


Orientation of research

Multimodal architecture design

A first line of work will aim at integrating pre-trained, powerful and royalty-free language representations, such as BERT [6], BART, CamemBERT, BLOOM, etc., into the DAN architecture. Particular attention will be paid to the way these representations are integrated, given their dimension relative to the dimension of the internal representation of the DAN architecture.
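One simple way to reconcile two representation sizes is a learned linear projection between them. The sketch below only illustrates the shape bookkeeping involved; it is not the integration scheme of the project. The value 768 is the hidden size of BERT-base, while `D_DAN = 256` and the helper names `make_projection` and `project` are hypothetical choices made for this example, and the weights are random rather than trained.

```python
import random

D_LM = 768   # hidden size of BERT-base; other language models differ
D_DAN = 256  # hypothetical internal dimension, chosen for illustration only


def make_projection(d_in, d_out, seed=0):
    """Build a d_out x d_in matrix of small random weights (untrained)."""
    rng = random.Random(seed)
    scale = (1.0 / d_in) ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def project(weights, vector):
    """Map a language-model embedding into the target dimension."""
    if len(vector) != len(weights[0]):
        raise ValueError("input dimension does not match the projection")
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


W = make_projection(D_LM, D_DAN)
lm_embedding = [0.01] * D_LM          # stand-in for a real token embedding
dan_input = project(W, lm_embedding)  # vector of length D_DAN
```

In practice such a projection would be trained jointly with the rest of the network, and alternatives such as cross-attention over the language representations avoid forcing both spaces to the same size.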


Multimodal architecture training

The integration of language knowledge in the form of a pre-trained model will be considered under different training modalities. Model distillation approaches will be studied. In a more integrated way, we will also try to learn a language representation of the target domain by minimizing the distance between the target representation and the generic representation. From this perspective, optimal transport approaches could serve as inspiration. Following the DAN training approach, we aim to explore in more depth the use of synthetic documents with curriculum learning. Markovian generative processes, generative adversarial networks (GAN), or diffusion models could inspire an original solution.
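As an illustration of the distillation idea mentioned above, the following sketch computes a common form of distillation loss: the Kullback-Leibler divergence between temperature-softened teacher and student distributions over a vocabulary. The function names, the temperature value and the toy logits are assumptions for this example; the project may adopt a different formulation.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in scaled))
    return [z - log_norm for z in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return (temperature ** 2) * kl


# Toy logits over a 4-token vocabulary (invented values).
teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.0, 1.0, -0.5, 0.2]

loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution and positive otherwise. Multiplying by the squared temperature is a usual convention that keeps gradient magnitudes comparable across temperatures.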


Exploring visual question answering (VQA)

One of the most striking developments in recent years is certainly the capacity of large language models to generalize easily from a few examples [7], without further training, which allows them to be specialized through textual interactions with the user (chat). Even if it seems unrealistic to transpose this type of approach directly to document understanding, it is relevant to explore the capacity of the architectures we will propose to solve different fictitious or real visual question answering tasks. We will benefit from some available datasets [8, 9], but we will also explore new question-answering scenarios with the aim of making the system more adaptive to the user's needs.



1.    Geewook Kim, Teakgyu Hong, Moonbin Yim, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park, "Donut: OCR-free Document Understanding Transformer", ECCV, pp. 498–517, 2022.

2.    Brian Davis, Bryan Morse, et al., "End-to-end Document Recognition and Understanding with Dessurt", 2022.

3.    K. Lee, Mandar Joshi, et al., "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding", 2022.

4.    Denis Coquenet, Clément Chatelain, and Thierry Paquet, "End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 508-524, Jan. 2023.

5.    Denis Coquenet, Clément Chatelain, and Thierry Paquet, "DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

6.    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 2019.

7.    Alex Andonian, Quentin Anthony, Stella Biderman, Sid Black, Preetham Gali, Leo Gao, Eric Hallahan, Josh Levy-Kramer, Connor Leahy, Lucas Nestler, Kip Parker, Michael Pieler, Shivanshu Purohit, Tri Songz, Phil Wang, and Samuel Weinbach, "GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch", 2021.

8.    Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar, "DocVQA: A Dataset for VQA on Document Images", arXiv:2007.00398, WACV 2021.

9.    Minesh Mathew, Ruben Tito, Dimosthenis Karatzas, R. Manmatha, and C. V. Jawahar, "Document Visual Question Answering Challenge 2020", arXiv:2008.08899, DAS 2020.



Job description
How to apply?


Keywords: Deep Learning, Vision, OCR, Document Understanding, Natural Language Processing



Thierry Paquet

Pierrick Tranouez

Clément Chatelain


We are looking for a candidate with a background in Machine Learning and significant experience with Deep Learning technologies applied to vision or natural language processing (NLP).


Send your CV, a letter of application, and cover letters.