Optical character recognition (OCR)

The IWAC is dedicated to providing high-quality texts from a variety of sources on Islam and Muslims in West Africa. We use Optical Character Recognition (OCR) technology to convert scanned documents into searchable and editable text. By combining manual review with advanced AI technologies, we aim to balance quality and efficiency, ensuring that our database remains a valuable resource for researchers and the public alike.

Newspaper articles

For newspaper articles, we manually review all extracted text. This meticulous process makes the OCR output highly reliable and minimises errors that could distort the information.

Islamic magazines and newspapers

We are using new developments in Large Language Models (LLMs) to refine text digitisation. For Islamic magazines and newspapers, we use an automated OCR process without manual review, enhanced by AI, specifically the GPT-4o model, to correct the OCR output. To limit hallucination, we break the text into smaller chunks before processing. Our method integrates advances in transliteration, transcription and OCR to improve the accuracy and accessibility of the text.
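
The chunking step described above can be illustrated with a minimal Python sketch. It is not the IWAC's actual pipeline: the function names, the 1500-character limit and the paragraph-based splitting are assumptions made for illustration. The model call is passed in as a plain function (the hypothetical `correct_chunk` argument), so the sketch does not depend on any particular API client.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 1500) -> List[str]:
    """Split OCR text into chunks of at most max_chars characters.

    Paragraphs (separated by blank lines) are kept together where possible.
    The max_chars default is an illustrative assumption, not a tuned value.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        # A paragraph longer than the limit is cut at fixed offsets.
        # This may split a word; a real pipeline would cut at sentence ends.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def correct_ocr(text: str,
                correct_chunk: Callable[[str], str],
                max_chars: int = 1500) -> str:
    """Correct OCR text chunk by chunk and rejoin the results.

    correct_chunk is a hypothetical stand-in for the call to the language
    model; it receives one chunk and returns the corrected chunk.
    """
    chunks = split_into_chunks(text, max_chars)
    return "\n\n".join(correct_chunk(chunk) for chunk in chunks)
```

Keeping each request short gives the model less room to drift from the source text, and it also makes it possible to compare input and output chunk by chunk afterwards.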

Advantages

  1. Efficiency: AI-powered OCR significantly reduces document processing time, enabling faster access to a wider range of materials.

  2. Scalability: AI allows us to process large collections of documents without the need for extensive human resources, making it possible to maintain and expand the database.

  3. Cost-effectiveness: Automated correction reduces operational costs compared to manual review, which requires significant time and labour.

  4. Adaptability: The AI models continue to improve and adapt to the different fonts and layouts commonly found in Islamic magazines and newspapers, raising overall accuracy over time.

Inconveniences and limitations

  1. Hallucinations: AI models, including GPT-4o, may occasionally generate text that appears plausible but is not actually present in the original document. These "hallucinations" can introduce inaccuracies.

  2. Contextual errors: While AI is adept at correcting many OCR errors, it may misinterpret contextual words or phrases, resulting in less accurate corrections in some cases.

  3. Lack of manual oversight: Unlike manually reviewed text, automated corrections lack a final human review, which can result in some persistent errors being missed.

  4. Consistency: Although AI models improve over time, their performance can be inconsistent, especially for complex documents or those with poor print quality.

Conclusion

While OCR technology greatly improves access to digitised documents, it is not infallible. It is therefore always advisable to refer to the original digitised document for verification when accuracy is paramount.

For researchers wishing to perform computational analysis on Islamic newspapers and periodicals from the Collection, it is important to be aware of these potential limitations of OCR. Careful verification of the OCR-corrected text against the original documents can help maintain the integrity and reliability of the analysis.
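
As a starting point for such verification, the following Python sketch flags passages where the AI-corrected text differs sharply from the raw OCR output, which may indicate a hallucinated or heavily rewritten passage. It is an illustration only: it assumes that both the raw and the corrected text are available as aligned pairs, the function name and the 0.8 threshold are hypothetical, and it compares two text versions rather than the scanned image, so flagged passages still need to be checked against the original document.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def flag_suspect_chunks(pairs: List[Tuple[str, str]],
                        min_ratio: float = 0.8) -> List[Tuple[int, float]]:
    """Return (index, similarity) for pairs whose similarity is below min_ratio.

    Each pair is (raw OCR text, corrected text). The min_ratio default is an
    illustrative assumption and should be calibrated on reviewed samples.
    """
    suspects: List[Tuple[int, float]] = []
    for index, (raw, corrected) in enumerate(pairs):
        # ratio() is 1.0 for identical strings and approaches 0.0
        # when the two strings share almost no characters in order.
        ratio = SequenceMatcher(None, raw, corrected).ratio()
        if ratio < min_ratio:
            suspects.append((index, ratio))
    return suspects


pairs = [
    ("Tbe mosque was bui1t in 1987.", "The mosque was built in 1987."),
    ("short text", "A completely different and much longer invented passage."),
]
for index, ratio in flag_suspect_chunks(pairs):
    print(f"Chunk {index} needs review (similarity {ratio:.2f})")
```

A low similarity score does not prove an error, since badly degraded OCR legitimately needs heavy correction, and a high score does not guarantee accuracy. The check only helps to prioritise which passages to compare with the original first.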