Carvalho: English-Galician SMT system from EuroParl English-Portuguese parallel corpus

  1. Pichel Campos, José Ramón
  2. Malvar Fernández, Paulo
  3. Senra Gómez, Oscar
  4. Gamallo Otero, Pablo
  5. García, Alberto
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2009

Issue: 43

Pages: 379-381

Type: Article


Abstract

Building reliable Statistical Machine Translation (SMT) engines between two languages requires a large amount of parallel text. Since the available English-Galician parallel corpora are not yet sufficient, other strategies must be followed. Important Romanicists, such as Coseriu (1987) or Cunha & Cintra (2002), have theorized that Galician and Portuguese are two varieties of European Portuguese. From a practical Computational Linguistics standpoint, this assumption opens a new line of research that potentially supplies Galician with a huge amount of computational resources from both European and Brazilian Portuguese. Thus, drawing on the English-Portuguese Europarl parallel corpus, imaxin|software has built an English-Galician phrase-based Statistical Machine Translation prototype. To achieve that, the English-Portuguese parallel corpus was first converted into English-Galician using an Opentrad Portuguese-Galician Rule-based Machine Translation (RBMT) engine and a spelling converter. Secondly, using Moses (Koehn et al., 2007) and GIZA++ (Och & Ney, 2003), we built the English-Galician translation and language models of our prototype. The results obtained allow us to conclude that SMT tools for Galician can be derived from Portuguese resources, a task that would otherwise have been unthinkable given the lack of English-Galician parallel corpora. We can also conclude that this strategy can be applied to develop a great variety of computational tools for the Galician language.