Methodology
The Russian Indological Research Archive employs a structured digital humanities pipeline to transform historical conference programs into a clean, relational research database.
Data, Metadata, and Derived Fields
Our methodology clearly distinguishes between raw primary records, curated metadata, and derived fields:
The raw inputs are HTML or text transcriptions of the original printed or online conference programs. These are treated as immutable historical artifacts.
Each presentation is modeled as a distinct event-associated record with a title, session placement, sequence order, date, and time interval.
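Speaker names are extracted and resolved to canonical scholar entities by deterministic matching. Spelling variants, typos, and initials are reconciled so that one scholar is not split into several records (identity splitting) and distinct scholars are not merged into one (identity collision).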
Program strings are retained as provenance. City-only labels remain geographic signals and are not promoted to affiliations. An institutional affiliation is published when it is explicitly stated in the program or supported by a dated, verified trajectory; where a trajectory is tentatively continued into a later gap in the record, the affiliation is marked (?).
Each presentation is classified into one or more high-level research themes (e.g., Art, Linguistics, Philosophy) using a heuristic mapping applied to its title.
Aggregate counts (total presentations, series overlap, geographic center clusters) are calculated from the relational graph and exported as open datasets.
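The record model, name resolution, and aggregation steps described above could be sketched roughly as follows. This is a minimal illustration, not the archive's actual code: the `Presentation` fields are a subset chosen for brevity, the `ALIASES` table, the scholar ID format, and all names and sample records are invented, and a plain alias lookup stands in for whatever deterministic matching rules the pipeline really uses.

```python
import re
import unicodedata
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Presentation:
    """One event-associated talk record (illustrative subset of fields)."""
    event: str
    session: str
    order: int
    title: str
    speaker_raw: str  # program string, kept verbatim as provenance


# Hypothetical curated alias table: normalized variant -> canonical scholar ID.
ALIASES = {
    "ivanova a": "scholar:ivanova-anna",
    "ivanova anna": "scholar:ivanova-anna",
    "ivanowa a": "scholar:ivanova-anna",  # assumed known spelling variant
}


def normalize(name: str) -> str:
    """Lowercase, strip combining marks and punctuation, collapse whitespace.

    Note: NFKD folding also changes some Cyrillic letters (for example,
    it reduces 'й' to 'и'), so a real pipeline would need to treat
    scripts with more care than this sketch does.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(cleaned.split())


def resolve(name: str) -> Optional[str]:
    """Deterministic lookup: the same input always gives the same result."""
    return ALIASES.get(normalize(name))


def presentations_per_scholar(records) -> Counter:
    """Count talks per canonical scholar; unresolved names are skipped."""
    counts = Counter()
    for rec in records:
        scholar = resolve(rec.speaker_raw)
        if scholar is not None:
            counts[scholar] += 1
    return counts


# Invented sample records, for illustration only.
records = [
    Presentation("Conference A", "Session 1", 1, "Talk one", "Ivanova, A."),
    Presentation("Conference A", "Session 2", 3, "Talk two", "Ivanowa A."),
    Presentation("Conference B", "Session 1", 2, "Talk three", "Ivanova Anna"),
    Presentation("Conference B", "Session 2", 1, "Talk four", "Petrov, B."),
]

print(presentations_per_scholar(records))
```

Because the lookup never guesses, a name missing from the alias table stays unresolved instead of being attached to the nearest match. In this sketch, that is what guards against identity collision, while the alias entries guard against identity splitting.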
Note on thematic classification: Presentation themes are mapped directly from the individual talk titles in our corpus and do not represent a scholar's complete lifetime research output or scientific profile.
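A title-based heuristic of this kind might look like the sketch below. The keyword lists and the `classify_title` function are assumptions made for illustration; the archive's actual mapping is not reproduced here. The example only shows the general shape: themes are assigned per talk title, and one title can receive several themes or none.

```python
import re
from typing import List

# Hypothetical keyword lists; the real mapping is not specified in this text.
THEME_KEYWORDS = {
    "Art": ("sculpture", "painting", "iconography"),
    "Linguistics": ("grammar", "phonology", "syntax"),
    "Philosophy": ("epistemology", "ontology", "vedanta"),
}


def classify_title(title: str) -> List[str]:
    """Return every theme whose keywords appear as whole words in the title."""
    words = set(re.findall(r"\w+", title.lower()))
    return [
        theme
        for theme, keywords in THEME_KEYWORDS.items()
        if words & set(keywords)
    ]


# Invented titles, for illustration only.
print(classify_title("Grammar and Ontology in Early Commentaries"))
print(classify_title("Trade Routes of the Deccan"))
```

Since the input is a single title, the result describes that talk alone. A scholar whose talks in the corpus all match one theme may well work on other subjects that never appeared in a conference program.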