Think of this as a very fast librarian who sorts thousands of newspaper articles into neat piles by theme, so you can browse the collection by ideas rather than titles.
What is topic modelling?
Topic modelling is a statistical method for grouping large sets of texts by the themes they discuss. Instead of pre-defining categories, an unsupervised algorithm looks for words that frequently occur together and clusters them into “topics”. Each topic is summarised with a short label drawn from its most representative terms (e.g., mosquée – inauguration – construction – vendredi).
How it works (three quick steps)
- Prepare the texts: Items are OCR’d and lightly cleaned—for example, common function words are removed and words are reduced to their base forms (lemmatisation).
- Group similar items: The system creates a compact numerical representation of each text, compares these representations, and groups items that use similar language into topics.
- Label and score: Each group receives a human-readable label from its top terms. Every item then gets:
  - a topic ID (the group it best matches),
  - a label (the topic’s top words), and
  - a fit score indicating how strongly it belongs. Some items do not fit any group confidently; these appear as unassigned/outliers.
What you can do here
- Browse by theme: Jump straight to topics such as pilgrimage, mosque building, religious education, student associations, or women’s organisations.
- Trace patterns: See which themes are most common, how they vary by country or source, and how attention to an issue changes over time.
- Open the evidence: Each topic links to the underlying articles, with dates and publication details.
How to read the results (good practice)
- Topics are guides, not verdicts. They reflect recurring word patterns rather than editorial categories. Always click through to the documents.
- Labels are shorthand. They summarise the top terms and are necessarily approximate.
- Data quality matters. OCR noise, spelling variants, and uneven coverage across countries or years can shape clusters. Consequently, very small or very broad topics may split or merge when the model is refreshed.
Methods note
IWAC uses an unsupervised, algorithmic workflow. For French-language items, each text is first converted into a compact numerical summary using CamemBERT (a French language model). The system then compares these summaries to group texts that use similar language and assigns each item a dominant topic with a strength score. No predefined categories are used. Metadata such as country or date supports filtering and visualising results—it does not create the topics.
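The shape of this workflow can be sketched as follows, with two loud caveats: real CamemBERT summaries come from a transformer model, whereas here small hand-made vectors stand in for them; and DBSCAN is used purely as an illustrative clustering choice, because it can leave poorly fitting items unassigned (label -1), mirroring the outlier behaviour described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in "compact numerical summaries": in IWAC these would come from
# CamemBERT; here they are small hand-made vectors for illustration.
embeddings = np.array([
    [1.0, 0.1],  # article A
    [0.9, 0.2],  # article B (language similar to A)
    [0.1, 1.0],  # article C
    [0.2, 0.9],  # article D (language similar to C)
    [5.0, 5.0],  # article E: resembles nothing else
])

# Group summaries that sit close together; -1 marks unassigned outliers.
labels = DBSCAN(eps=0.5, min_samples=2).fit(embeddings).labels_

# Strength score: cosine similarity of each item to its topic's centroid.
for i, topic in enumerate(labels):
    if topic == -1:
        print(f"article {i}: unassigned (outlier)")
        continue
    centroid = embeddings[labels == topic].mean(axis=0)
    score = embeddings[i] @ centroid / (
        np.linalg.norm(embeddings[i]) * np.linalg.norm(centroid))
    print(f"article {i}: topic {topic}, strength {score:.2f}")
```

Note that no predefined categories appear anywhere in the sketch: the topics emerge from the geometry of the summaries, and metadata would only be joined in afterwards for filtering and display.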