Desenvolvimento e avaliação de um modelo NER no domínio da análise cultural e do turismo [Development and evaluation of an NER model in the domain of cultural analysis and tourism]

  1. Sotelo Docío, Susana 1
  2. Gamallo, Pablo 1
  3. Iriarte, Álvaro 2
  1. 1 Universidade de Santiago de Compostela
    Santiago de Compostela, España
    ROR https://ror.org/030eybx10

  2. 2 Universidade do Minho
    Braga, Portugal
    ROR https://ror.org/037wpkx04

Journal:
Linguamática

ISSN: 1647-0818

Year of publication: 2023

Volume: 15

Issue: 2

Pages: 3-18

Type: Article

DOI: 10.21814/LM.15.2.405 (open access)

Abstract

Named Entity Recognition (NER) is an essential task in information extraction in which the entities in a text are identified and classified. One of the primary challenges NER systems face is generalizing what they have learned to corpora that differ from the training data. The problem is magnified because most training corpora are journalistic, so models trained on them need to be adapted to other genres and domains. In this paper, we use a Spanish corpus of interviews with visitors to the city of Santiago de Compostela, annotated with named entities, to train and evaluate NER systems tailored to the domain of cultural analysis and tourism. We provide a comprehensive comparison of the approaches employed, ranging from classical machine learning algorithms to fine-tuned Transformer models. The results significantly outperform the baseline, represented here by the toolkits Stanza, spaCy and Flair. However, initial tests on entities unseen during training highlight the need for further evaluation of the models' generalization capability and for the use of adversarial splits of the corpus.
