Metadata, Authority Control, and the AI-Assisted Entity Pipeline
Metadata and authority control
Each IWAC item is described using terms from the Dublin Core Metadata Initiative (DCMI). IWAC also maintains expanding authority files for persons, organisations, events, topics (Dublin Core "Subject"), and locations (Dublin Core "Spatial Coverage"). Authority control reduces ambiguity and consolidates variant names.
Limits of manual indexing and off-the-shelf NLP
Keywords help organise online collections, but manual indexing is costly and difficult to scale. Traditional natural language processing (NLP), particularly named-entity recognition (NER), assists by identifying people, places, organisations, and dates. NER underpins search, network analysis, and other digital humanities workflows. However, off-the-shelf models trained primarily on Western corpora (e.g., those distributed with spaCy) often miss African entities and degrade in the presence of optical character recognition (OCR) errors.
Why large language models?
Large language models (LLMs) offer a promising alternative. Rather than relying on brittle surface patterns, they use semantic context to recognise entities despite OCR noise or gaps in training data. Because they work at the level of meaning, LLMs can make such transformations faster and more cost-effective without sacrificing quality.
A practical naming problem
These advantages matter for IWAC. Newspapers spell names inconsistently and often reverse name order (e.g., Karim/Karimou, Aboubacar/Boikary, Aboubacar Fofana/Fofana Aboubacar). Conventions for given and family names are flexible. The imam of Lomé's Great Mosque in the late 1980s, Abdou-Salami Abdou Rahim, appears as "Abdul Salam Abdul Rahim", simply "Abdou Rahim", and at least eight other variants. Conventional models risk splitting one historical figure into many entries.
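The limits of simple rules can be illustrated with a minimal sketch. The helper below (`name_key`, a hypothetical name that does not come from the IWAC code) builds an order-insensitive, accent-insensitive key for a personal name. It catches reversed name order, but not spelling variants, which is the gap that context-aware extraction and human review are meant to fill.

```python
import unicodedata


def name_key(name: str) -> tuple[str, ...]:
    """Return an order-insensitive, accent-insensitive key for a name.

    Illustrative helper only; not part of the IWAC pipeline.
    """
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    # Treat hyphens as spaces, lowercase, and sort the tokens so that
    # "Given Family" and "Family Given" produce the same key.
    tokens = unaccented.replace("-", " ").lower().split()
    return tuple(sorted(tokens))


# Reversed order is handled by the key...
same_person = name_key("Aboubacar Fofana") == name_key("Fofana Aboubacar")

# ...but spelling variants of the same imam still produce different keys.
still_split = name_key("Abdou-Salami Abdou Rahim") != name_key(
    "Abdul Salam Abdul Rahim"
)
```

Both `same_person` and `still_split` evaluate to `True`: token sorting resolves ordering, while variant spellings remain separate entries.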
A hybrid, AI-assisted pipeline
To address these challenges, IWAC combines AI-driven extraction with human oversight.
Stage 1: Extraction and normalisation
An LLM, guided by a prompt containing explicit normalisation rules, identifies entities. Place names are reduced to an essential form (e.g., "Kingdom of Saudi Arabia" becomes "Saudi Arabia"); honorifics are removed from personal names (e.g., "Kassim Mensah" rather than "El Hadj Kassim Mensah"); organisational names are captured in full rather than as acronyms; and variable ordering in African names is recognised.
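In IWAC these rules are applied by the LLM through its prompt. Purely as an illustration, two of them can also be written as deterministic checks, for instance to sanity-check model output. The sketch below is an assumption, not the project's implementation: the function names and the short `HONORIFICS` and `PLACE_PREFIXES` lists are hypothetical and far from complete.

```python
import re

# Hypothetical, deliberately short lists for illustration only.
HONORIFICS = ("el hadj", "el hadji", "imam", "cheikh", "dr")
PLACE_PREFIXES = ("kingdom of", "republic of", "islamic republic of")


def strip_honorifics(name: str) -> str:
    """Remove leading honorifics such as "El Hadj" from a personal name."""
    cleaned = name.strip()
    changed = True
    while changed:  # repeat so stacked titles ("Dr. El Hadj ...") are removed
        changed = False
        for title in HONORIFICS:
            # The title must be followed by whitespace, so "Dramane"
            # is not mistaken for "Dr".
            pattern = rf"^{re.escape(title)}\.?\s+"
            shorter = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
            if shorter != cleaned:
                cleaned, changed = shorter, True
    return cleaned


def reduce_place(name: str) -> str:
    """Reduce a formal place name to its essential form."""
    cleaned = name.strip()
    for prefix in PLACE_PREFIXES:
        pattern = rf"^(the\s+)?{re.escape(prefix)}\s+"
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    return cleaned


person = strip_honorifics("El Hadj Kassim Mensah")
place = reduce_place("Kingdom of Saudi Arabia")
```

With the examples from the text, `person` is "Kassim Mensah" and `place` is "Saudi Arabia". Fixed lists like these break down quickly on real newspaper text, which is one reason to delegate the rules to a model that reads context.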
Stage 2: Reconciliation
AI-extracted entities are reconciled against IWAC's curated authority files. A Python script assigns unique identifiers, flags ambiguous cases for review and, where no match is found, proposes conservative, confidence-ranked candidates via fuzzy matching to inform human decision-making. Frequency-based prioritisation highlights the most significant entities and filters out one-off noise.
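The reconciliation logic can be sketched roughly as follows. This is a minimal illustration using Python's standard-library `difflib`, not the project's actual script: the authority entries, identifiers, the 0.8 threshold, and the function names are all assumptions made for the example.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical authority file: identifier -> preferred label.
AUTHORITY = {
    "person/1": "Abdou-Salami Abdou Rahim",
    "person/2": "Aboubacar Fofana",
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reconcile(mention, authority, threshold=0.8, limit=3):
    """Match a mention exactly, or propose ranked fuzzy candidates."""
    for ident, label in authority.items():
        if label.lower() == mention.lower():
            return {"status": "matched", "id": ident, "candidates": []}
    # No exact match: rank authority labels by similarity, best first.
    scored = sorted(
        ((similarity(mention, label), ident)
         for ident, label in authority.items()),
        reverse=True,
    )
    # Keep only candidates above the (conservative) threshold.
    candidates = [
        (ident, round(score, 2))
        for score, ident in scored
        if score >= threshold
    ][:limit]
    status = "review" if candidates else "unmatched"
    return {"status": status, "id": None, "candidates": candidates}


def prioritise(mentions, min_count=2):
    """Rank mentions by frequency and drop those seen fewer than min_count times."""
    counts = Counter(mentions)
    return [(m, n) for m, n in counts.most_common() if n >= min_count]


exact = reconcile("aboubacar fofana", AUTHORITY)
fuzzy = reconcile("Aboubacar Fofanah", AUTHORITY)
frequent = prioritise(["Togo", "Togo", "Lomé", "Togo", "Lomé", "Xyz"])
```

Here `exact` resolves directly to "person/2", `fuzzy` is flagged for review with "person/2" as its top candidate, and `frequent` keeps "Togo" and "Lomé" while discarding the one-off "Xyz". The final decision on any candidate stays with a human reviewer.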
Stage 3: Consolidation
OpenRefine is used to consolidate entities. Human expertise remains essential: only historical knowledge can adjudicate organisational evolutions, schisms, and renamings.
Stage 4: Enrichment and linking
A third Python script links documents to authority entries, updating IWAC records with direct, clickable links and transforming a static repository into an interconnected knowledge base.
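As a rough sketch of what such linking involves, the example below turns reconciled identifiers into per-document links and builds the reverse index that lets a reader move from an authority entry to every document mentioning it. The function name, identifiers, and `BASE_URL` are placeholders chosen for illustration; they do not reflect IWAC's real URL scheme or API.

```python
from collections import defaultdict

# Placeholder base URL; not IWAC's actual address scheme.
BASE_URL = "https://example.org/authority/"


def link_documents(doc_entities):
    """Build document -> links and authority entry -> documents mappings.

    doc_entities maps a document id to the list of authority ids
    reconciled for that document.
    """
    doc_links = {}
    backlinks = defaultdict(list)
    for doc_id, entity_ids in doc_entities.items():
        # Drop repeated ids while preserving their first-seen order.
        unique_ids = list(dict.fromkeys(entity_ids))
        doc_links[doc_id] = [BASE_URL + eid for eid in unique_ids]
        for eid in unique_ids:
            backlinks[eid].append(doc_id)
    return doc_links, dict(backlinks)


links, mentions = link_documents(
    {
        "doc-1": ["person-2", "place-7", "person-2"],
        "doc-2": ["person-2"],
    }
)
```

In this toy input, `links["doc-1"]` holds two links (the duplicate is dropped), and `mentions["person-2"]` lists both documents, which is the relationship that supports cross-document navigation.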
Early results and openness
Although the work is still at an early stage, the pipeline has already produced over 4,400 validated, distinct entities, a number that will grow as coverage extends across the corpus. The full code is openly available on GitHub: fmadore/iwac-ai-pipelines.
Why it matters
The pipeline keeps the historian in the loop while enabling entity discovery and normalisation in non-Western corpora, where standard NER often fails. Results may not always meet professional archival standards; the realistic alternative, however, is frequently no processing at all.
Reliable entities and reconciled identifiers also power IWAC's visualisation applications. They enable cross-document navigation, historically meaningful network analysis, and more precise search. The workflow is computationally efficient, and its results remain interpretively sound.
What's next
Development will focus on automatic enrichment of entities—especially West African individuals and organisations absent from major knowledge bases. Using the articles that mention each entity, LLMs will generate concise descriptive notes and, where available, biographical summaries. This extends the pipeline from recognition to contextualisation, strengthening authority records and enabling large-scale thematic exploration.
Note: The metadata index remains incomplete. Please also try searching the database using the full-text option.