Research Engineer: Information Extraction, Text Recognition in Historical Document Collections

Job context



 EXO-POPP project: Optical Extraction of Handwritten Named Entities for Marriage Certificates for the Population of Paris (1880–1940) 

Thanks to a collaboration between specialists in machine learning and historians, the EXO-POPP project will develop a database of 300,000 marriage certificates from Paris and its suburbs between 1880 and 1940. These marriage certificates provide a wealth of information about the bride and groom, their parents, and their marriage witnesses, which will be analyzed from a host of new angles made possible by the new dataset. These studies of marriage, divorce, kinship, and social networks covering a span of 60 years will also intersect with transversal issues such as gender, class, and origin. The geolocation of data will provide a rare opportunity to work on places and relocations within the city, and linkage with two other databases will make it possible to follow people from birth to death.

Building such a database by hand would take at least 50,000 hours of work. But, thanks to the recent developments in deep learning and machine learning, it is now possible to build huge databases with automated reading systems including handwriting recognition and natural language understanding. Indeed, because of these recent advances, optical printed named entity recognition (OP-NER) is now performing very well. On the other hand, while handwriting recognition by machine has become a reality, also thanks to deep learning, optical handwritten named entity recognition (OH-NER) has not received much attention. OH-NER is expected to achieve promising results on handwritten marriage certificates dating from 1880 to 1923. This project’s research questions will focus on the best strategies for word disambiguation for handwritten named entity recognition. We will explore end-to-end deep learning architectures for OH-NER, writer adaptation of the recognition system, and named entity disambiguation by exploiting the French mortality database (INSEE) and the French POPP database. An additional benefit of this study is that a unique and very large dataset of handwritten material for named entity recognition will be built. 

Description


Missions 

The research engineer will be in charge of developing a processing pipeline dedicated to optical printed named entity recognition (OP-NER). They will collaborate closely with a Ph.D. student in charge of optical handwritten named entity recognition (OH-NER).

This work package will be devoted to optical printed named entity recognition (OP-NER) on the 1930 and 1940 marriage certificates. OP-NER is the easiest task of the project and will benefit from the latest results achieved by the LITIS team on similar problems with financial yearbooks. The scanned typescript marriage certificates will be transcribed using a professional OCR engine such as Omnipage (Nuance) or ABBYY FineReader (ABBYY). However, preliminary tests have shown the limits of off-the-shelf OCR on these typescript documents, which are more variable than modern printed documents. We will address these problems at the beginning of WP1 and, if needed, plan to spend some months specifying and tuning our own deep-learning-based OCR to achieve the best possible results, with the contribution of the recruited PhD student.

The transcriptions will then be processed for named entity extraction and recognition. Named entity recognition is a well-defined task in the natural language processing community. In the EXO-POPP context, however, we need to define each entity to be extracted. The possible relations between the entities will also be established and used in the definition of tags. For example, we will certainly distinguish between the different personal names occurring in the text, separating the wife's name from the husband's, and we will proceed similarly for the parents of the husband and of the wife, for the witnesses, for any children, and so on. S. Brée has estimated that around 30 categories will be needed. We will start WP1 by defining each category of interest and defining a tag set accordingly.

Manual tagging of the transcriptions will be made possible through the PIVAN web-based collaborative interface. This platform provides, in a single web interface, a document image viewer, viewing and editing of OCR results, and text tagging facilities for NER. PIVAN will ease the annotation efforts of the H&SS trainees and will make it possible to build the large annotated datasets that machine learning algorithms need to perform well. PIVAN will be adapted to the requirements of the project, but most of its components are already ready for the purposes of EXO-POPP, as it is currently running on other projects, including the POPP project. This will also strengthen the links between the two teams (CS and H&SS), as they will quickly be able to work together on the specification of the extraction task at hand, using the PIVAN platform.
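As an illustration of what role-specific tags could look like, here is a minimal sketch that converts annotated spans into BIO tags, a common encoding for NER training data. The category names, the `spans_to_bio` helper and the sample sentence are all hypothetical; the actual tag set of around 30 categories remains to be defined in WP1, and the posting does not state which encoding will be used.

```python
from typing import List, Tuple

# Hypothetical role-specific categories, for illustration only.
# The project's real tag set is to be defined in WP1.
CATEGORIES = {"HUSBAND_NAME", "WIFE_NAME", "HUSBAND_FATHER_NAME", "WITNESS_NAME"}

# (start token index, end token index exclusive, category)
Span = Tuple[int, int, str]


def spans_to_bio(n_tokens: int, spans: List[Span]) -> List[str]:
    """Convert annotated spans into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end, category in spans:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span out of range: ({start}, {end})")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: ({start}, {end})")
        tags[start] = "B-" + category          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + category          # remaining tokens of the entity
    return tags


# Made-up example sentence, not taken from the corpus.
tokens = ["Jean", "Dupont", "et", "Marie", "Durand"]
spans = [(0, 2, "HUSBAND_NAME"), (3, 5, "WIFE_NAME")]
tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Encoding the role in the tag itself (husband, wife, witness) rather than using a single generic person tag is one way to capture the distinctions described above; other designs, such as a generic tag plus a separate relation step, are equally possible.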

Laboratoire LITIS, EA 4108, Université de Rouen, 76 800 Saint-Etienne du Rouvray, FRANCE Phone: (33) 2 32 95 50 13 Fax: (33) 2 32 95 50 22 Email: Thierry.Paquet@univ-rouen.fr

The named entity recognition task will be based on a state-of-the-art machine learning approach. In Swaileh et al. (2020), the approach was based on pre-trained word embeddings used as contextual word features for training an NE extraction system dedicated to the task at hand. This extraction system was made up of a BLSTM and a CRF. We plan to develop and tune the EXO-POPP named entity recognition module using this same approach. We have previously explored different active learning strategies to find the best trade-off between performance gain and the tagging effort needed, and we plan to proceed similarly here. Note that PIVAN will be of primary importance here, giving the user the capacity to browse a large quantity of extraction results for assessment and correction purposes. PIVAN will make this annotation and correction work much more acceptable for the end user.
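To give a concrete idea of the active learning step, the sketch below ranks documents by how uncertain the current model is about them and picks the most uncertain ones for manual correction. The posting does not name the strategies explored, so least-confidence sampling is used here purely as an assumption; the function names, document identifiers and probabilities are hypothetical.

```python
from typing import Dict, List


def least_confidence(token_probs: List[float]) -> float:
    """Uncertainty of one document: 1 minus its lowest per-token
    top-label probability. An empty document is treated as certain."""
    if not token_probs:
        return 0.0
    return 1.0 - min(token_probs)


def select_for_annotation(predictions: Dict[str, List[float]], budget: int) -> List[str]:
    """Return the `budget` document ids the model is least confident about."""
    ranked = sorted(
        predictions,
        key=lambda doc_id: least_confidence(predictions[doc_id]),
        reverse=True,  # most uncertain first
    )
    return ranked[:budget]


# Made-up per-token probabilities of the predicted label for three certificates.
predictions = {
    "cert_001": [0.99, 0.98, 0.97],
    "cert_002": [0.95, 0.40, 0.90],
    "cert_003": [0.80, 0.85, 0.75],
}
selected = select_for_annotation(predictions, budget=2)
print(selected)  # ['cert_002', 'cert_003']
```

In such a loop, the selected certificates would be corrected by annotators, added to the training set, and the model retrained before the next selection round.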

Deliverables: 

D1.1. Transcription of the typescript corpus 

D1.2. EXO-POPP web-based annotation platform 

D1.3. Named entities extracted from the typescript corpus 

How to apply


Skills:

  • Computer Engineer, Python, Machine Learning, Computer vision, Natural Language Processing
  • Knowledge of web-based programming
  • Ability to work in a team, curious and rigorous spirit


Position to be filled:

Positions: 1 Research Engineer 

Time commitment: Full-time 

Duration of the contract: 1 September 2021 – 31 August 2022

Contact: Prof. Thierry Paquet, Thierry.Paquet@univ-rouen.fr 

Indicative salary: €24 000 annual net salary, plus French social security benefits 

Location: LITIS, Campus du Madrillet, Faculty of Science, Saint Etienne du Rouvray, France