Cultural Preservation
OCR/NLP
Digital Archives

Case Study: Udbhodan Publication

Reviving Legacy Texts Through AI-Powered Digitization for the Ramakrishna Mission, India.

Executive Summary

Udbhodan, the Bengali-language publication wing of the Ramakrishna Mission, preserves over a century of India’s spiritual wisdom. Yet much of this rich archive remained locked in fragile print books and degraded PDFs: unsearchable, uncatalogued, and inaccessible to a digital-first generation. Shothik AI partnered with Udbhodan to build an end-to-end archival intelligence platform. By combining advanced OCR for Bengali and Sanskrit, semantic search, and AI-powered classification, we transformed Udbhodan’s archive into a dynamic digital knowledge repository. This project safeguards India’s spiritual legacy while giving researchers, scholars, and seekers conversational, real-time access to historical content.

The Problem
An Archive Out of Reach
  • Most texts existed only as scanned pages or aging print copies.
  • No way to search across decades of spiritual writing.
  • Scholars spent hours manually flipping through indexes.
  • High risk of degradation or loss for original manuscripts.

Without intervention, Udbhodan’s 100+ year archive risked becoming digitally invisible.

The Solution
AI-Powered Archival Digitization
  • Custom Bengali/Sanskrit OCR: Trained on historical fonts, degraded scans, and varied column layouts.
  • Zero-shot Classification: Categorized content into themes like letters, poems, discourses without labeled training data.
  • AI Semantic Search Engine: Allowed users to query ideas contextually (e.g., "Where does Vivekananda write about fearlessness?").
  • Conversational AI Access: Trained a spiritual Q&A assistant on Udbhodan's texts for guided exploration.
  • Preservation-first Workflow: Tagged damaged pages, created restoration flags, and ensured cloud-based backups.
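
To make the semantic search idea concrete, here is a minimal sketch of vector-based retrieval. It is an illustration under stated assumptions, not the production engine: a bag-of-words count vector stands in for a learned embedding, and the passages, document IDs, and function names are hypothetical placeholders rather than archive content.

```python
import math
from collections import Counter

# Hypothetical mini-corpus: placeholder passages, not quotations from the archive.
PASSAGES = {
    "doc-001": "fearlessness is strength and strength is life",
    "doc-002": "selfless action performed without attachment to results",
    "doc-003": "a letter describing travel plans and lecture dates",
}


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, passages, top_k=2):
    """Rank passages by similarity to the query and return the top_k (id, score) pairs."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(text)), doc_id)
              for doc_id, text in passages.items()]
    scored.sort(reverse=True)
    return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_k]]


results = search("writing about fearlessness and strength", PASSAGES)
print(results)
```

With these placeholder passages, `doc-001` ranks first because it shares the most terms with the query. A real concept-level engine would replace `embed` with a multilingual embedding model so that a query can match passages that share meaning but not vocabulary.
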

Technical Highlights
Feature | Description
Bengali OCR Engine | Custom-trained for 19th/20th-century Bengali typesetting and poor scan quality.
Semantic AI Search | Vector-based retrieval engine for concept-level queries.
Intelligent Binning | Groups by author, topic, period, genre.
Conversational Agent | Dialogue system for contextual learning from texts.
Preservation Workflow | Metadata tagging, restoration flagging, and archival backups.

OCR and classification models were deployed across mixed media formats (PDF, JPEG, EPUB), with fault tolerance for layout variance and aging artifacts.
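
The binning feature listed above (grouping by author, topic, period, and genre) can be sketched as a simple group-by over item metadata. This is a hedged illustration: the `ArchiveItem` fields, the decade-based notion of "period", and the sample records are assumptions made for the example, not the platform's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveItem:
    # Hypothetical metadata schema for one digitized item.
    item_id: str
    author: str
    topic: str
    year: int
    genre: str


def decade(year):
    """Map a year to a period label such as '1900s' (an assumed definition of 'period')."""
    return f"{year // 10 * 10}s"


def bin_items(items, key):
    """Group item IDs by the value that `key` extracts from each item."""
    bins = defaultdict(list)
    for item in items:
        bins[key(item)].append(item.item_id)
    return dict(bins)


# Placeholder records, not real catalogue entries.
ITEMS = [
    ArchiveItem("u-001", "Author A", "karma yoga", 1899, "discourse"),
    ArchiveItem("u-002", "Author A", "devotion", 1902, "letter"),
    ArchiveItem("u-003", "Author B", "karma yoga", 1905, "poem"),
]

by_topic = bin_items(ITEMS, key=lambda item: item.topic)
by_period = bin_items(ITEMS, key=lambda item: decade(item.year))
print(by_topic)
print(by_period)
```

Passing the grouping rule as a `key` function keeps one implementation for all four dimensions; in practice the metadata would come from the classification step rather than being entered by hand.
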

Implementation Strategy
  1. Digital Inventory: Scanned and indexed Udbhodan’s full archival catalogue.
  2. Corpus Preparation: Cleaned and normalized thousands of degraded images.
  3. Model Tuning: Fine-tuned language models on a spiritual and historical corpus.
  4. Interface Design: Built both internal archive tools and public-facing web access.
  5. Content QA: Verified OCR outputs and classifications with domain scholars.
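
One common way to support the Content QA step is to compare OCR output against a scholar-corrected transcription using character error rate (CER). The case study does not say which metric was used, so the sketch below is an assumption: the function names and the 5% review threshold are hypothetical, and the comparison counts Unicode code points, which is only an approximation for Bengali conjuncts.

```python
import unicodedata


def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        curr = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text, reference):
    """Edit distance divided by reference length, after NFC normalization."""
    ocr_norm = unicodedata.normalize("NFC", ocr_text)
    ref_norm = unicodedata.normalize("NFC", reference)
    if not ref_norm:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ocr_norm, ref_norm) / len(ref_norm)


def needs_review(ocr_text, reference, threshold=0.05):
    """Flag a page for scholar review when CER exceeds the (assumed) threshold."""
    return character_error_rate(ocr_text, reference) > threshold


print(character_error_rate("karma yoqa", "karma yoga"))
```

Pages flagged by `needs_review` could then be routed to domain scholars first, so that manual checking concentrates on the scans where the OCR struggled most.
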

A Scholar's Journey

Before this transformation, a researcher studying "karma yoga" across Swami Vivekananda's writings would need days of flipping through indexes and notes. Now, they can simply ask:

"Where does Vivekananda explain selfless action in daily life?"

The system returns contextual quotes, article links, and related discourse timelines in seconds. This is more than convenience: it changes how sacred texts are explored.
