Chu-Ren Huang, Institute of Linguistics, Academia Sinica

Taiwan's NDAP Language Archives Project: From bronze inscription texts to Austronesian field recording
Cui-Xia Weng, Ru-Yng Chang, Elizabeth Zeitoun, Chao-Jun Chen, Derming Juang, Chu-Ren Huang, and Chin-chuan Cheng

The Language Archives Project is part of Taiwan's National Digital Archives Program (NDAP). The project digitizes and archives a wide range of linguistic data, from heritage texts to endangered Formosan languages. The goal is two-fold: to preserve unique cultural heritage and to provide a comprehensive linguistic infrastructure that supports content interpretation of the archives. Given these two goals, the main challenges of the project are to provide a versatile yet uniform presentation of different text types, to account for language change, and to account for language variation.

We take two archives of contrasting characteristics to illustrate how these challenges are met. The Bronze Inscription archives deal with an archaic language preserved in a written form that differs significantly from modern Chinese writing. The Formosan (i.e. Taiwan Austronesian) archives deal with indigenous languages that are endangered and have no written conventions. We show how OLACMS lays the common ground for content documentation of these contrasting archives.

First, for the Bronze Inscription archives, the fundamental issue is how to represent the archaic inscribed written forms while also establishing direct correspondences with modern writing systems. We adopt the Intelligent Character Encoding Scheme to deal with this issue. Although glyph forms vary greatly, the composition of Chinese characters from basic glyphs remains regular. Hence an encoding scheme based on the composition of basic glyphs will not only help with diachronic Chinese archives but can also deal with cross-lingual variation (e.g. Korean and Japanese Kanji, new characters from Hong Kong, etc.).

Second, the Formosan languages are indigenous languages of Taiwan that are also thought to be close to the common ancestor of the Austronesian languages.
The first issue we face is that of establishing an orthography, which is solved by the common use of the International Phonetic Alphabet (IPA) among field linguists. The second issue involves establishing segmentation and tagging standards. The third involves the audio representation of field recordings. The last involves mapping the lexicon to a GIS (geographic information system) to represent language variation and contrasts.
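
The composition-based encoding idea described for the Bronze Inscription archives can be illustrated with a minimal sketch. The abstract does not specify the actual format of the Intelligent Character Encoding Scheme, so everything below is an assumption made for illustration: the `COMPOSITION` table, the function names, and the use of Unicode's Ideographic Description Characters (`⿰` for left-right, `⿱` for top-bottom) as stand-in composition operators. The point is only that a glyph recorded by its components can be matched to a modern character and searched by component, even when the inscribed form itself has no code point.

```python
from collections import defaultdict

# Hypothetical composition table (not the project's data format).
# Each character maps to an operator plus its component glyphs.
COMPOSITION = {
    "明": ("⿰", "日", "月"),
    "林": ("⿰", "木", "木"),
    "好": ("⿰", "女", "子"),
    "男": ("⿱", "田", "力"),
    "早": ("⿱", "日", "十"),
}


def composition_key(char):
    """Return the composition sequence for char, or char itself if it is basic."""
    entry = COMPOSITION.get(char)
    return "".join(entry) if entry else char


def resolve(key):
    """Map a composition sequence back to a modern character, or None."""
    for char in COMPOSITION:
        if composition_key(char) == key:
            return char
    return None


def build_component_index(table):
    """Index characters by each component glyph they contain."""
    index = defaultdict(set)
    for char, (_operator, *parts) in table.items():
        for part in parts:
            index[part].add(char)
    return index


# An inscribed glyph transcribed as "sun beside moon" resolves to 明.
inscribed = "⿰日月"
modern = resolve(inscribed)

# All characters in the table that contain the component 日.
component_index = build_component_index(COMPOSITION)
with_sun = component_index["日"]
```

Keying on composition rather than on surface glyph shape is what lets one lookup serve both diachronic variants and regional characters; a real scheme would also need to handle recursive decomposition and glyphs whose components are themselves archaic.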