Computational tools and spoken corpora design: an ongoing dialogue

Authors:

  1. Vázquez Rozas, Victoria (2)
  2. Barcala, Mario (1)

Affiliations:

  1. NLPgo Technologies S.L.
  2. Universidade de Santiago de Compostela

    Universidade de Santiago de Compostela

    Santiago de Compostela, España

    ROR https://ror.org/030eybx10

Journal:
Caplletra: revista internacional de filología

ISSN: 0214-8188

Year of publication: 2020

Issue: 69

Pages: 221-240

Type: Article

DOI: 10.7203/CAPLLETRA.69.17270 (open access)


Abstract

Designing an oral corpus, and recording, encoding and processing its materials to build a useful resource for linguistic analysis, calls for numerous theoretical and methodological decisions. This article focuses on the stages of corpus construction that are most clearly conditioned by the computational processing needed to make the corpus functional. To match initial expectations to the real possibilities of using the tool, each feature we intend to encode must be weighed against the workload and the means required to encode it. It is therefore essential to take into account the available options for processing and exploitation, as they have a crucial impact on decisions about the construction of the corpus. Drawing on the experience acquired in building the ESLORA corpus, the article examines some of the problems that arise when designing an oral corpus: the delicacy with which oral phenomena are represented, the segmentation of discourse, the coexistence of several simultaneous tagging systems, and the particularities of annotation in a bilingual or multilingual context.

Funding information

This study was financed by the Agencia Estatal de Investigación (AEI) ‘Spanish State Research Agency’ and by the Fondo Europeo de Desarrollo Regional (FEDER) (European Regional Development Fund) through the ESLORA+ project (FFI2017-86379-P). The authors are members of the research group Gramática del español ‘Spanish Grammar’ from the University of Santiago de Compostela, which has been awarded a grant for the Strengthening and Organisation of Research Groups with Potential for Growth by the Regional Government’s Education Department (ED431B 2017/39). The study has also benefited from the participation of the ESLORA project in the Red temática en estudios de Análisis del Discurso (FFI2017-90738-REDT).

Bibliographic References

  • Atkins, S., J. Clear & N. Ostler (1992) «Corpus design criteria», Literary and Linguistic Computing, 7 (1), p. 1-16. DOI: 10.1093/llc/7.1.1.
  • Biber, D. (1993) «Representativeness in corpus design», Literary and Linguistic Computing, 8 (4), p. 243-257. DOI: 10.1093/llc/8.4.243.
  • Biber, D., S. Johansson, G. Leech, S. Conrad & E. Finegan (1999) Longman Grammar of Spoken and Written English, London/New York, Longman.
  • Cohen, K. B., L. M. Fox, P. V. Ogren & L. Hunter (2005) «Corpus design for biomedical natural language processing», Proceedings of the ACL-ISMB Workshop on Linking Biological Literature, Ontologies and Databases: Mining Biological Semantics, Detroit, June 2005, p. 38-45. DOI: 10.3115/1641484.1641490.
  • Fernández Sanmartín, A. (2018) «La entrevista libre como método para evitar la paradoja del observador. Un estudio de corpus», CHIMERA. Romance Corpora and Linguistic Studies, 5 (2), p. 141-196. DOI: 10.15366/chimera2018.5.2.001.
  • Gries, S. Th. & J. Newman (2013) «Creating and using corpora», in R. J. Podesva & D. Sharma (eds.), Research Methods in Linguistics, Cambridge, Cambridge University Press. DOI: 10.1017/CBO9781139013734.015.
  • Gries, S. Th. (2011) «Methodological and interdisciplinary stance in Corpus Linguistics», in V. Viana, S. Zyngier & G. Barnbrook (eds.), Perspectives on Corpus Linguistics, Amsterdam-Philadelphia, John Benjamins, p. 81-98. DOI: 10.1075/scl.48.06gri.
  • Hunston, S. (2002) Corpora in applied linguistics, Cambridge, Cambridge University Press.
  • Jesse, E. (2019) «Corpus Design and Representativeness», in T. Berber Sardinha & M. Veirano Pinto (eds.), Multi-Dimensional Analysis: Research Methods and Current Issues, London, Bloomsbury, p. 27-42. DOI: 10.5040/9781350023857.0010
  • Kavanagh, K. (2019) «XML mark-up: an annotation tool for discourse analysis». [Online: <https://walesdtp.ac.uk/methodsblog/2019/05/21/xml-mark-up-anannotation-tool-for-discourse-analysis/#more-116>, accessed: 2019-07-30.]
  • Labov, W. (1972) «Some principles of linguistic methodology», Language in Society, 1 (1), p. 97-120. DOI: 10.1017/S0047404500006576.
  • Labov, W. (1984) «Field Methods of the Project on Linguistic Change and Variation», in J. Baugh & J. Sherzer (eds.), Language in Use: Readings in Sociolinguistics, Englewood Cliffs, NJ, Prentice Hall, p. 28-66.
  • McEnery, T., R. Xiao & Y. Tono, eds. (2006) Corpus-Based Language Studies: An advanced resource book, London / New York, Routledge.
  • Pietrandrea, P., S. Kahane, A. Lacheret-Dujour & F. Sabio (2014) «The notion of sentence and other discourse units in spoken corpus annotation», in H. Mello & T. Raso (eds.), Spoken corpora and Linguistic Studies, Amsterdam, John Benjamins, p. 331-364. DOI: 10.1075/scl.61.12pie
  • Pustejovsky, J. & A. Stubbs (2012) Natural Language Annotation for Machine Learning. A Guide to Corpus-Building for Applications, Sebastopol, California, O’Reilly Media.
  • Rojo, G. (2014) «Hispanic Corpus Linguistics», in M. Lacorte (ed.), The Routledge Handbook of Hispanic Applied Linguistics, New York, Routledge, p. 371-387.
  • Rojo, G. (2016) «Los corpus textuales del español», in J. Gutiérrez-Rexach (ed.), Enciclopedia lingüística hispánica, Oxon, Routledge, p. 285-296.
  • Sinclair, J. (1995) «Corpus typology: a framework for classification», in G. Melchers & B. Warren (eds.), Studies in Anglistics, Stockholm, Almqvist and Wiksell International, p. 17-34.
  • Sinclair, J. (2005) «Corpus and Text: Basic Principles», in M. Wynne (ed.), Developing Linguistic Corpora: a Guide to Good Practice, Oxford: Oxbow Books, p. 1-16. [Online: <http://ota.ox.ac.uk/documents/creating/dlc>, accessed: 2019-07-26.]
  • Stührenberg, M. (2012) «The TEI and Current Standards for Structuring Linguistic Data: An Overview», Journal of the text encoding initiative, 3. DOI: 10.4000/jtei.523. [Online: <https://journals.openedition.org/jtei/523>, accessed: 2019-07-30.]
  • Thompson, H. S. & D. McKelvie (1997) «Hyperlink semantics for standoff markup of read-only documents», Proceedings of SGML Europe 1997: The next decade. Pushing the Envelope, Barcelona, p. 227-229. [Online: <http://www.ltg.ed.ac.uk/~ht/sgmleu97.html>, accessed: 2019-07-27.]
  • Tognini-Bonelli, E. (2001) Corpus Linguistics at Work, Amsterdam, John Benjamins.
  • Torruella Casañas, J. (2017) Lingüística de corpus: génesis y bases metodológicas de los corpus (históricos) para la investigación lingüística, Frankfurt, Peter Lang.
  • Wallis, S. (2007) «Annotation, retrieval and experimentation. Or: you only get out what you put in», Studies in Variation, Contacts and Change in English (VARIENG) 1: Annotating Variation and Change. [Online: <http://www.helsinki.fi/varieng/series/volumes/01/wallis>, accessed: 2019-08-05.]
  • Wang, I., S. Kahane & I. Tellier (2014) «Macrosyntactic Segmenters of a French spoken corpus», Ninth Language Resources and Evaluation Conference (LREC’14), May 2014, Reykjavík, European Languages Resources Association (ELRA), p. 3891-3896. [Online: <http://www.lrec-conf.org/proceedings/lrec2014/pdf/889_Paper.pdf>, accessed: 2019-08-02.]
  • Weisser, M. (2016) Practical Corpus Linguistics: An Introduction to Corpus-Based Language Analysis, Malden, MA: Wiley-Blackwell.