Paper Digital Humanities Australasia 2018

TextBase and READ Workbench: A development methodology and implementation platform for ancient text corpora. (26)

Ian McCrabb 1
  1. University of Sydney, Glebe, NSW, Australia

READ, supported by a consortium of four universities involved in the study and publication of ancient Buddhist documents, has been developed as a comprehensive multi-user platform for editing, analysis and publishing of ancient Sanskrit, Gāndhārī, Pali and other Prakrit texts. It is open source, supports the TEI standard and is being extended to other language groups. The defining innovation of READ is the atomization of text into a semantically linked network of objects: a paradigm shift in data structure from strings of marked-up text to aggregates of linked objects.
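
To make the contrast between marked-up strings and linked objects concrete, the following is a minimal sketch only. The class names, the "syllable" and "token" labels and the sample word are assumptions chosen for illustration; they are not READ's actual entity model, which this abstract does not specify.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count(1)  # simple id generator for this sketch


@dataclass
class Entity:
    """One atom of the text: a hypothetical stand-in for a READ object."""
    kind: str                                   # illustrative label, e.g. "syllable", "token"
    value: str = ""                             # only leaf entities carry characters
    parts: list = field(default_factory=list)   # links to component entities
    id: int = field(default_factory=lambda: next(_next_id))


@dataclass
class Annotation:
    """A statement about an entity, linked by id rather than wrapped around text."""
    target_id: int
    note: str


def text_of(entity):
    """Recover the surface string by walking the links down to the leaves."""
    if not entity.parts:
        return entity.value
    return "".join(text_of(p) for p in entity.parts)


# Hypothetical example: a token assembled from two syllable objects.
syllables = [Entity("syllable", "dha"), Entity("syllable", "rma")]
token = Entity("token", parts=syllables)
gloss = Annotation(target_id=token.id, note="hypothetical gloss")

print(text_of(token))  # prints: dharma
```

In this sketch the string is derived from the objects rather than stored with tags embedded in it, so further annotations can be attached to any entity without altering the text itself.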

The READ entity model was presented at DH 2015 and ADH 2016. This presentation will outline the distinctive features of READ and summarize a range of current corpus projects, but will focus primarily on TextBase and READ Workbench (Workbench). TextBase is a corpus development methodology designed to address some of the ubiquitous constraints inherent in a conventional corpus architecture. Workbench is a server portal hosted at the University of Sydney which instantiates TextBase and ‘harnesses’ READ, the core philological platform.

Workbench is a comprehensive management framework to support the integration of people and processes in the collaborative development, maintenance and publishing of textual corpora. It is a software-as-a-service (SaaS) platform managing multiple READ installations, each with project- and language-specific configurations, supporting researcher collaborations across multiple institutions. Workbench delivers READ capability through a self-service portal that enables researchers to develop, maintain, manage and publish texts without requiring technical support, which is critical to the longer-term sustainability of corpus projects. Workbench’s three facets (configuration management, database management and corpus workflows) address issues of scalability, project management and sustainability in collaborative READ projects.

The adoption of a single text/single database (TextBase) as the fundamental object of development, collaboration and portability is a marked departure from conventional models, in which a centralized administrator manages a single monolithic corpus database. The TextBase architecture underpins corpus development, analysis and publishing with a methodology that provides significant flexibility in the iteration and synchronization of four fundamental workflows: text alignment, text cubing, analysis registration and text aggregation.

Text alignment integrates the physical, the textual and the contextual to generate a TextBase ‘substrate’ from distributed inputs. Text cubing facilitates successive cloning and editing, from an initial substrate through various editions, to encapsulate the scholarly history of a text at its most granular. Analysis registration associates ‘strata’ of analysis data with a substrate to support a wide array of philological and cross-disciplinary research projects. Text aggregation enables individual researchers to combine their TextBases into collections or corpora as sequenced, mapped or merged ‘aggregates’.
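
As a rough illustration of how three of these workflows relate to a single TextBase, the sketch below models cubing as cloning with a recorded parent, registration as attaching named strata, and aggregation as simple sequencing. Every name here (the TextBase class, cube, register, history, aggregate_sequenced) is hypothetical and is not drawn from the READ or Workbench code; text alignment and the mapped and merged forms of aggregation are not modelled.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class TextBase:
    """Hypothetical single-text database: a substrate plus registered strata."""
    title: str
    edition: str
    lines: list                                   # the substrate (simplified to lines)
    strata: dict = field(default_factory=dict)    # stratum name -> analysis data
    parent: "TextBase | None" = None              # the edition this one was cloned from

    def cube(self, edition):
        """Clone the substrate as the departure point for a new edition."""
        return TextBase(self.title, edition, copy.deepcopy(self.lines), parent=self)

    def register(self, name, data):
        """Associate a stratum of analysis data with this substrate."""
        self.strata[name] = data


def history(textbase):
    """List editions from the initial substrate to the given TextBase."""
    editions = []
    while textbase is not None:
        editions.append(textbase.edition)
        textbase = textbase.parent
    return editions[::-1]


def aggregate_sequenced(textbases):
    """Combine TextBases in order into one sequenced aggregate."""
    return [(tb.title, tb.edition, line) for tb in textbases for line in tb.lines]


# Hypothetical usage with placeholder content.
base = TextBase("Text A", "diplomatic", ["line one", "line two"])
revised = base.cube("revised")
revised.lines[0] = "line 1"                       # editing the clone leaves the parent intact
revised.register("glossary", {"line": "a placeholder gloss"})

print(history(revised))
print(len(aggregate_sequenced([base, revised])))
```

Because each edition keeps a link to the TextBase it was cloned from, the chain of editions can be recovered later, which is one simple way to picture the encapsulated scholarly history of a text.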

The strategy adopted with TextBase and Workbench was to reframe some of the ubiquitous issues in the conventional corpus development model: ownership, control, confidentiality, innovation, standardization, portability, resourcing and support. The critical innovation in maximizing development flexibility, and in balancing autonomy and collaboration across the range of individual, collection and corpus development projects, is the TextBase: the target of text alignment, the departure point for text cubing, the substrate for registration of analysis and the object aggregated.