Job offer

Information extraction, Text Recognition in Historical Document Collections

Overview

LITIS (Laboratoire d’Informatique, Traitement de l’information et des Systèmes) is a research laboratory affiliated with the Universities of Rouen Normandie and Le Havre Normandie, and with the engineering school INSA Rouen Normandie. Research at LITIS is organized around 7 research teams, which contribute to 3 main application domains: Access to Information, Biomedical Information Processing, and Ambient Intelligence. LITIS currently includes 90 faculty staff members, 50 PhD students, and 10 postdoctoral researchers and research engineers. The Machine Learning team of LITIS conducts research on modeling unstructured data (signals, images, text, etc.) with machine learning algorithms and statistical models.
For more than two decades, it has contributed to the development of reading systems and document image
analysis for various applications such as postal automation, business document exchange, and digital
libraries.

EURHISFIRM project

EURHISFIRM aims to develop a research infrastructure to connect, collect, collate, align, and share reliable long-run company-level data for Europe, enabling researchers, policymakers, and other stakeholders to analyze, develop, and evaluate effective strategies to promote investment and economic growth. To achieve this goal, EURHISFIRM develops innovative tools to spark a “Big data” revolution in the historical social sciences and to open access to cultural heritage. EURHISFIRM is a project funded by the European Commission within the Infrastructure Development Program of Horizon 2020.

The goal of the Program is to develop world-class research infrastructures lasting for decades (https://ec.europa.eu/research/infrastructures/index_en.cfm?pg=home).

Research infrastructures are facilities, resources, and services used by the science community to foster innovation and extend the frontiers of knowledge. The first phase of the Infrastructure Development Program lasts three years and aims to produce an in-depth design study of the Research Infrastructure. Development and Consolidation Phases follow this phase if further applications are successful. EURHISFIRM brings together eleven research institutions in economics, history, information technologies, and data science from seven European countries.

Mission

Within the project, you will be in charge of developing text recognition technologies (ICR)
for historical document images (mostly printed), and of extracting information from these data (such as
person names, company names, dates, positions, stock prices, etc.). The datasets consist of
financial yearbooks and price lists of European companies, in different European languages. Your mission
includes:
1- Development of a machine learning based reading system for text lines, composed of both
deep optical models and language models (statistical and grammar based). Layout analysis falls
outside the scope of the mission.
2- Data preparation for evaluation purposes.
3- Benchmarking against other technologies (commercial products).
4- Packaging of the system as a web service so that it can be integrated and deployed within a full
system.
5- Coordination with the project partners on dataset preparation and collation,
as well as software interoperability with other developments within the EURHISFIRM
consortium.

Organisation