Prototyping (and evaluating) a pipeline for Named Entity Recognition in "Mitteilungsblatt"
What is the project about?
Generate a pipeline for Named Entity Recognition in the Mitteilungsblatt.
Who is part of the team?
Daniil Skorinkin - Potsdam University
Henny Sluyter-Gaethje - Potsdam University
Harald Lordick - Steinheim Institute
Mirjam Rürup - Moses Mendelssohn Center
Ursula Wallmeier - Moses Mendelssohn Center
Benjamin Schnabel - FID Jüdische Studien, JudaicaLink
What is the practical use of your research tool for jewish studies? / What can researchers of jewish studies learn/achieve from your project?
We turn the name registries from the Mitteilungsblatt into structured machine readable form (which can then be used for NER improvements and/or by other researchers)
We prototype the pipeline for person mentions extraction from Mitteilungsblatt
Which technical methods and tools are used for developing your project and what tools and/or methods do you use for reaching your goal?
OCR(FineReader, tesseract)
Annotation interface(CATMA)
NER models (Spacy, Flair)
Regex and gazetteer matching
RDF Graph
How have you approached the project so far?
We created an annotation task force, produced manual markup for some samples of Mitteilungsblatt in German and in Hebrew, applied;
In other direction, we OCR-ed the manually created ‘person registries’ from the library
What can you learn from this project for your own personal research interests?
- Organizing collaborative CATMA annotation
- Running evaluation pipelines
- Cleaning noisy OCR data
What do you expect to achieve in the Hackathon?
Output the personal data from the OCR-ed person registries for Mitteilungsblatt in clear machine readable form
Test models that extract person mentions, evaluate their performance, try to improve the quality (possibly with help of the registry data from the previous point)