Topic modelling

Think of this as a very fast librarian who sorts thousands of newspaper articles into neat piles by theme, so you can browse the collection by ideas rather than titles.

What is topic modelling?

Topic modelling is a statistical method for grouping large sets of texts by the themes they discuss. Instead of pre-defining categories, an unsupervised algorithm looks for words that frequently occur together and clusters them into "topics". Each topic is summarised with a short label drawn from its most representative terms (e.g., mosquée – inauguration – construction – vendredi).

How it works (three quick steps)

  1. Prepare the texts: Items are OCR'd and cleaned—common function words and domain-specific noise (geographic names, generic verbs, etc.) are removed, and words are reduced to their base forms (lemmatisation). Frequent collocations are detected and joined into single tokens (e.g. "communauté" + "musulman" → "communauté_musulman", "ministre" + "intérieur" → "ministre_intérieur").
  2. Group similar items: Using Latent Dirichlet Allocation (LDA), the system converts each text into a bag-of-words representation, then discovers groups of words that tend to co-occur across the corpus. Each group becomes a "topic" (steps 2 and 3 are sketched in code after this list).
  3. Label and score: Each group receives a human-readable label from its top terms. Every item then gets:
    • topic ID (lda_topic_id) — the group it best matches,
    • label (lda_topic_label) — the topic's top words, and
    • probability score (lda_topic_prob) indicating how strongly it belongs. Items that do not match any group confidently appear as unassigned/outliers.
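To make steps 2 and 3 concrete, here is a minimal, illustrative sketch using gensim (the library named in the Methods note below). The toy documents, the choice of two topics, and the four-term label length are demonstration assumptions, not IWAC's actual configuration.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus: each document is a list of cleaned, lemmatised tokens (step 1).
texts = [
    ["mosquée", "inauguration", "construction", "vendredi"],
    ["ministre_intérieur", "communauté_musulman", "réunion"],
    ["mosquée", "construction", "financement", "vendredi"],
]

dictionary = Dictionary(texts)                        # vocabulary
corpus = [dictionary.doc2bow(doc) for doc in texts]   # bag-of-words (step 2)
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Step 3: dominant topic per document -> ID, probability, label from top terms.
for bow in corpus:
    topic_id, prob = max(lda.get_document_topics(bow), key=lambda t: t[1])
    label = " – ".join(word for word, _ in lda.show_topic(topic_id, topn=4))
    print(topic_id, round(float(prob), 2), label)
```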

What you can do in the dashboard

The IWAC dashboard offers two complementary views for exploring topics:

Topics overview

  • Statistics at a glance: total documents processed, number of unique topics, coverage rate, and outlier count (a sketch deriving such figures follows this list).
  • Top topics chart: a bar chart of the ten most frequent topics by document count.
  • Browse all topics: a searchable, sortable grid of every topic, each linking to a dedicated detail page.
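As a rough illustration of how these overview figures can be derived from the per-item fields described above, here is a short pandas sketch. The file name and the convention that outliers carry lda_topic_id == -1 are assumptions, not documented IWAC behaviour.

```python
import pandas as pd

# Hypothetical export of per-document assignments; column names follow
# the fields described above, the -1 outlier convention is an assumption.
df = pd.read_csv("assignments.csv")

total_docs = len(df)
outliers = int((df["lda_topic_id"] == -1).sum())
unique_topics = df.loc[df["lda_topic_id"] != -1, "lda_topic_id"].nunique()
coverage = 1 - outliers / total_docs

print(f"{total_docs} documents, {unique_topics} topics, "
      f"{coverage:.1%} coverage, {outliers} outliers")

# Top topics chart: the ten most frequent topics by document count.
top10 = (df.loc[df["lda_topic_id"] != -1, "lda_topic_label"]
           .value_counts().head(10))
```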

Topic detail page

For any individual topic, you can see:

  • Document count, average probability, and number of countries represented (an aggregation sketch follows this list).
  • Country distribution (pie chart) and temporal distribution (bar chart by year).
  • Full document table with title, country, date, probability score, and sentiment polarity — each row linking back to the original item on the IWAC.
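The per-topic figures on this page can be reproduced with a simple aggregation. A sketch under the same assumptions as above (a dataframe of assignments; the country column name is hypothetical):

```python
per_topic = (
    df[df["lda_topic_id"] != -1]
    .groupby(["lda_topic_id", "lda_topic_label"])
    .agg(documents=("lda_topic_prob", "size"),
         avg_probability=("lda_topic_prob", "mean"),
         countries=("country", "nunique"))   # hypothetical column name
    .reset_index()
)
```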

Topic network

  • An interactive force-directed graph showing topics (green nodes) and their associated articles (blue nodes).
  • Filter by topic to isolate a single cluster and its articles, or adjust the probability threshold slider to show only the strongest associations (the filtering logic is sketched after this list).
  • Focus mode: select any node and zoom into its ego network (immediate neighbours only).
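A sketch of the graph logic behind the threshold slider and focus mode, using networkx for illustration. The edge-list structure (article ID, topic ID, probability) is an assumption based on the fields described earlier, not the dashboard's actual internals.

```python
import networkx as nx

# Hypothetical article–topic associations: (article_id, topic_id, probability).
edges = [("art_001", 4, 0.82), ("art_002", 4, 0.31), ("art_003", 7, 0.65)]

threshold = 0.5  # value set by the dashboard's probability slider
G = nx.Graph()
for article, topic, prob in edges:
    if prob >= threshold:  # keep only the strongest associations
        G.add_edge(f"article:{article}", f"topic:{topic}", weight=prob)

# Focus mode: the ego network of one node (immediate neighbours only).
focus = nx.ego_graph(G, "topic:4", radius=1)
```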

How to read the results (good practice)

  • Topics are guides, not verdicts. They reflect recurring word patterns rather than editorial categories. Always click through to the documents.
  • Labels are shorthand. They summarise the top terms and are necessarily approximate.
  • Data quality matters. OCR noise, spelling variants, and uneven coverage across countries or years can shape clusters. Consequently, very small or very broad topics may split or merge when the model is refreshed.

Methods note

IWAC uses an unsupervised, algorithmic workflow based on LDA, implemented with gensim. Only French-language items are processed. Each text goes through the following pipeline (a condensed code sketch follows the list):

  1. Tokenisation and stopword removal: the pre-lemmatised text (lemma_nostop column) is tokenised, and domain-specific stopwords — including geographic names already captured by metadata, generic verbs, and newspaper boilerplate — are filtered out.
  2. Phrase detection: gensim Phrases identifies frequent bigrams and trigrams (e.g. "burkina_faso", "el_hadj") so that multi-word expressions are treated as single tokens.
  3. Dictionary filtering: rare tokens (appearing in fewer than 10 documents) and overly common ones (appearing in more than 40 % of documents) are removed to reduce noise.
  4. LDA training: the model is trained with auto-tuned hyperparameters (alpha="auto", eta="auto") over multiple passes. The optimal number of topics can be selected automatically via C_v coherence grid search across a range of values.
  5. Assignment: every item receives a dominant topic ID, a probability score, and a human-readable label built from deduplicated top terms.
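For readers who want to reproduce the workflow, here is a condensed gensim sketch of steps 2–4. Only the parameters stated above (no_below=10, no_above=0.4, alpha="auto", eta="auto", C_v coherence) come from this note; the phrase-detection thresholds, pass count, topic range, and random seed are illustrative assumptions.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel, Phrases

def train_lda(texts, topic_range=range(5, 31, 5)):
    """texts: one pre-lemmatised, stopword-filtered token list per document
    (i.e. the output of step 1 from the lemma_nostop column)."""
    # Step 2: merge frequent bigrams, then trigrams, into single tokens.
    bigrams = Phrases(texts, min_count=5)
    texts = [bigrams[doc] for doc in texts]
    trigrams = Phrases(texts, min_count=5)
    texts = [trigrams[doc] for doc in texts]

    # Step 3: drop rare (<10 docs) and overly common (>40% of docs) tokens.
    dictionary = Dictionary(texts)
    dictionary.filter_extremes(no_below=10, no_above=0.4)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    # Step 4: train one model per candidate topic count and keep the one
    # with the best C_v coherence score.
    best_model, best_score = None, float("-inf")
    for k in topic_range:
        model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                         alpha="auto", eta="auto", passes=10, random_state=42)
        score = CoherenceModel(model=model, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
        if score > best_score:
            best_model, best_score = model, score
    return best_model, dictionary
```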

No predefined categories are used. Metadata such as country or date supports filtering and visualising results in the dashboard — it does not create the topics.