An Empirical Study on the Number of Items in Human Evaluation of Automatically Generated Texts

  1. González-Corbelle, Javier
  2. Alonso-Moral, Jose M.
  3. Crujeiras, Rosa M.
  4. Bugarín-Diz, Alberto
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue title: Procesamiento del Lenguaje Natural, Revista nº 72, marzo de 2024

Issue number: 72

Pages: 45-55

Type: Article

Other publications in: Procesamiento del lenguaje natural

Abstract

Human evaluation of neural models in Natural Language Generation (NLG) requires careful experimental design of elements such as the number of evaluators, the number of items to evaluate, and the number of quality criteria, among others, in order to guarantee the reproducibility of experiments and to ensure that the conclusions drawn are meaningful. Although some generic recommendations exist on how to proceed, there is no agreed-upon, general, and widely accepted evaluation protocol. In this paper we focus on how the number of items to evaluate influences the human evaluation of NLG systems. We apply different resampling methods to simulate the evaluation of different sets of items by each evaluator. We then compare the results obtained by evaluating only a limited set of items with those obtained by evaluating all system outputs for the complete set of test cases. The conclusions drawn from the empirical study support the initial research hypothesis: the use of resampling techniques helps to obtain meaningful evaluation results even with a small number of items to be evaluated by each evaluator.
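The resampling idea described in the abstract can be illustrated with a minimal bootstrap sketch in Python. This is a hypothetical example, not the authors' code: the per-item Likert ratings, the subset size, and the use of a simple mean as the evaluation statistic are all assumptions made for illustration.

```python
import random
import statistics

def bootstrap_item_subsets(ratings, subset_size, n_resamples=1000, seed=42):
    """Simulate evaluators who each rate only `subset_size` items by
    resampling (with replacement) from the full list of per-item ratings."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.choices(ratings, k=subset_size))
        for _ in range(n_resamples)
    ]

# Hypothetical per-item quality ratings (e.g., 1-5 Likert) for one NLG system.
full_ratings = [4, 3, 5, 4, 2, 5, 4, 3, 4, 5, 3, 4, 2, 5, 4, 3, 4, 4, 5, 3]

full_mean = statistics.mean(full_ratings)
subset_means = sorted(bootstrap_item_subsets(full_ratings, subset_size=5))

# Compare the full-set mean with the spread of subset-based estimates
# (roughly the central 95% of the bootstrap distribution).
lo, hi = subset_means[24], subset_means[974]
print(f"full-set mean: {full_mean:.2f}")
print(f"~95% bootstrap interval with 5 items per evaluator: [{lo:.2f}, {hi:.2f}]")
```

Repeating this for several subset sizes shows how quickly the subset-based estimates approach the full-set result, which is the kind of comparison the study carries out; the actual resampling schemes and statistics used in the paper may differ.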
