Desenvolvimento e avaliação de um modelo NER no domínio da análise cultural e do turismo

Sotelo Docío, Susana; Gamallo, Pablo; Iriarte, Álvaro

doi:10.21814/LM.15.2.405

Sotelo Docío, Susana ¹
Gamallo, Pablo ¹
Iriarte, Álvaro ²

¹ Universidade de Santiago de Compostela, Santiago de Compostela, Spain. ROR: https://ror.org/030eybx10

² Universidade do Minho, Braga, Portugal. ROR: https://ror.org/037wpkx04

Journal: Linguamática

ISSN: 1647-0818

Year of publication: 2023

Volume: 15

Issue: 2

Pages: 3-18

Type: Article

DOI: 10.21814/LM.15.2.405 (open access via the publisher)

Abstract

Named Entity Recognition (NER) is an essential information extraction task in which the entities in a text are identified and classified. One of the main challenges NER systems face is generalizing what they have learned to corpora of a different kind from those used in training. The problem is compounded by the fact that most training corpora are journalistic in nature and therefore need to be adapted to other genres and domains. In this article, we use a Spanish corpus of interviews with visitors to the city of Santiago de Compostela, annotated with named entities, to train and evaluate NER systems adapted to the domain of culture and tourism. We compare the approaches applied, ranging from classical machine learning algorithms to the fine-tuning of several Transformer models. The results significantly outperform the baseline, represented here by the Stanza, spaCy and Flair toolkits, although preliminary tests with entities not seen during training suggest the need for further evaluation of the models' generalization ability and for the use of an adversarial splitting method on the corpus.

Data source: Dialnet
