Methodology

The Russian Indological Research Archive employs a structured digital humanities pipeline to transform historical conference programs into a clean, relational research database.

Data, Metadata, and Derived Fields

Our methodology separates raw primary records, curated metadata, and derived fields:

Primary Source Programs

The raw inputs are HTML or text transcriptions of the original printed or online conference programs. These are treated as immutable historical artifacts.

Presentation Records

Each presentation is modeled as a distinct event-associated record with a title, session placement, sequence order, date, and time interval.
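
A record of this kind can be sketched as a small immutable data class. This is a minimal illustration, not the archive's actual schema: the class name PresentationRecord, every field name, and the duration helper are assumptions introduced here.

```python
from dataclasses import dataclass
from datetime import date, time
from typing import Optional


@dataclass(frozen=True)
class PresentationRecord:
    """One talk, tied to the event it was given at (hypothetical schema)."""

    event_id: str                  # the conference this talk belongs to
    title: str                     # title as printed in the program
    session: str                   # session placement
    sequence: int                  # order of the talk within its session
    day: date                      # calendar date of the talk
    start: Optional[time] = None   # time interval, when the program gives one
    end: Optional[time] = None

    def duration_minutes(self) -> Optional[int]:
        """Length of the time interval, or None if the program omits it."""
        if self.start is None or self.end is None:
            return None
        start_min = self.start.hour * 60 + self.start.minute
        end_min = self.end.hour * 60 + self.end.minute
        return end_min - start_min
```

Freezing the data class mirrors the idea that a record describes a fixed historical event; corrections would then be made by creating a new record rather than mutating an existing one.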

Normalized Persons

Speaker names are extracted and resolved to canonical scholar entities using deterministic matching. Spelling variants, typos, and initials are mapped to a single entity, which prevents one scholar from being split across several identities and distinct scholars from being merged into one.
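
One way to make such matching deterministic is to reduce every name string to a normalized key and look that key up in a curated registry. The sketch below is an assumption about how this could work, not the archive's actual rules: it assumes names are written surname first, and the functions name_key and resolve, the registry, and the alias table are all hypothetical.

```python
import unicodedata
from typing import Dict, Optional


def name_key(raw: str) -> str:
    """Reduce a program name string to 'surname initial' (assumes surname first)."""
    # Strip diacritics. NFKD also folds Cyrillic й/ё to и/е, which is
    # acceptable for a matching key but not for display.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = text.replace(".", " ").replace(",", " ").lower().split()
    if not tokens:
        return ""
    surname, given = tokens[0], tokens[1:]
    initial = given[0][0] if given else ""
    return f"{surname} {initial}".strip()


def resolve(raw: str, registry: Dict[str, str], aliases: Dict[str, str]) -> Optional[str]:
    """Return the canonical scholar id, or None if the name needs manual review."""
    key = name_key(raw)
    key = aliases.get(key, key)   # curated table of known typos and variants
    return registry.get(key)


# Hypothetical curated data, for illustration only.
registry = {"ivanov a": "scholar-001"}
aliases = {"ivanof a": "ivanov a"}
```

With these tables, "Ivanov A.A.", "Ivanov, Anna", and the typo "Ivanof A." all resolve to scholar-001, while an unknown name returns None instead of being guessed. A key this coarse can merge two different scholars who share a surname and initial, so a real registry would need extra disambiguation for such cases.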

Normalized Affiliations

Program strings are retained as provenance. City-only labels are kept as geographic signals and are not promoted to institutions. An institutional affiliation is published only when the program states it explicitly or when a dated, verified trajectory supports it; tentative open-ended continuations of a trajectory into later gaps are marked (?).
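
The publication rule can be illustrated with a short sketch. It reflects one reading of the rule, under stated assumptions: a trajectory is modeled as a list of dated stints, and a year that falls after a stint's last verified year with no other stint covering it is treated as a tentative continuation. The names Stint and published_affiliation are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Stint:
    """A verified period at one institution (hypothetical model)."""

    institution: str
    first_year: int
    last_verified_year: int


def published_affiliation(
    stated: Optional[str], year: int, trajectory: Sequence[Stint]
) -> Optional[str]:
    """Pick the affiliation to publish for a talk given in `year`."""
    # 1. An institution stated explicitly in the program wins.
    if stated:
        return stated
    # 2. Otherwise use a stint whose verified dates cover the year.
    for stint in trajectory:
        if stint.first_year <= year <= stint.last_verified_year:
            return stint.institution
    # 3. Otherwise carry the most recent earlier stint forward, marked tentative.
    earlier = [s for s in trajectory if s.last_verified_year < year]
    if earlier:
        latest = max(earlier, key=lambda s: s.last_verified_year)
        return f"{latest.institution} (?)"
    # 4. Nothing supports an institution; only the city label remains.
    return None
```

For a trajectory with Institute A verified for 2001 to 2005 and Institute B for 2010 to 2012, a talk in 2003 yields "Institute A", a talk in 2007 yields "Institute A (?)", and a talk in 1999 yields no institutional affiliation.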

Broad Theme Labels

Each presentation is classified into one or more high-level research themes (e.g., Art, Linguistics, Philosophy) using a heuristic mapping applied to its title.
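
A title-based heuristic of this kind can be as simple as a keyword table. The following sketch assumes plain substring matching; the keyword lists and the names THEME_KEYWORDS and classify are invented for illustration and are not the archive's actual mapping.

```python
from typing import Dict, List, Tuple

# Hypothetical keyword table; a real mapping would be far larger.
THEME_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "Art": ("painting", "sculpture", "iconography"),
    "Linguistics": ("grammar", "phonology", "etymology"),
    "Philosophy": ("vedanta", "epistemology", "nyaya"),
}


def classify(title: str) -> List[str]:
    """Return every theme whose keywords occur in the title."""
    lowered = title.lower()
    return [
        theme
        for theme, keywords in THEME_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
```

A title such as "Epistemology in Paninian grammar" matches two themes, Linguistics and Philosophy. A title with no keyword match returns an empty list in this sketch, which would leave it for manual classification.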

Derived Analytics

Aggregate counts (total presentations, series overlap, geographic center clusters) are calculated from the relational graph and exported as open datasets.
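
The aggregates can be sketched with standard counters over flattened rows. This is an illustrative assumption about the computation, not the archive's export code: the Row layout and the aggregate function are hypothetical, series overlap is read as scholars who appear in more than one conference series, and geographic clusters are approximated by per-city counts.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Row(NamedTuple):
    """One presentation, flattened from the relational data (hypothetical)."""

    scholar_id: str
    series: str
    city: str


def aggregate(rows: Iterable[Row]) -> dict:
    """Compute simple derived counts from presentation rows."""
    rows = list(rows)
    per_scholar = Counter(r.scholar_id for r in rows)
    per_city = Counter(r.city for r in rows)

    # Collect the set of series each scholar has appeared in.
    series_seen: Dict[str, Set[str]] = defaultdict(set)
    for r in rows:
        series_seen[r.scholar_id].add(r.series)
    overlap = sorted(s for s, seen in series_seen.items() if len(seen) > 1)

    return {
        "total_presentations": len(rows),
        "presentations_per_scholar": dict(per_scholar),
        "presentations_per_city": dict(per_city),
        "series_overlap": overlap,
    }
```

Because these values are computed only from the curated records, they can be regenerated whenever a person or affiliation record is corrected.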

Note on thematic classification: Presentation themes are mapped directly from the individual talk titles in our corpus and do not represent a scholar's complete lifetime research output or scientific profile.