Attention mechanism and skip-gram embedded phrases

Keywords

computational linguistics
legal terminology
legal translation
Neural Machine Translation
Self Attention Mechanism
short and long-distance dependency n-grams
skip-gram algorithm

How to Cite

Krimpas, P., & Valavani, C. (2023). Attention mechanism and skip-gram embedded phrases: short and long-distance dependency n-grams for legal corpora. Comparative Legilinguistics, 52, 318–350. https://doi.org/10.14746/cl.52.2022.14

Abstract

This article examines common errors that occur in the translation of legal texts. In particular, it focuses on how German texts containing legal terminology are rendered into Modern Greek by Google's machine translation system. Our case study is the Google-assisted translation of the original (German) version of the Constitution of the Federal Republic of Germany into Modern Greek. We propose a training method in which phrases are extracted on the basis of their occurrence frequency, passed through the Skip-gram algorithm, and then integrated into the Self Attention Mechanism proposed by Vaswani et al. (2017), in order to minimise human effort and contribute to the development of a robust machine translation system for multi-word legal terms and special phrases. This Neural Machine Translation approach aims at developing vectorised phrases from large corpora and processing them for translation. The research direction is to enlarge the in-domain training data set and to enrich the vector dimensions with more information on legal concepts (domain-specific features).
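
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that frequency-based phrase extraction uses the bigram score from Mikolov et al., (count(a b) - delta) / (count(a) * count(b)), and that attention is plain scaled dot-product self-attention with queries, keys and values all equal to the input vectors (no learned projections). The function names, the thresholds, the toy German sentence and the toy vectors are hypothetical, and the Skip-gram training step that would produce real phrase vectors is omitted.

```python
import math
from collections import Counter


def merge_frequent_bigrams(tokens, min_count=2, delta=1.0, threshold=0.1):
    """Join adjacent word pairs whose frequency-based score exceeds threshold.

    Assumed score: (count(a b) - delta) / (count(a) * count(b)).
    Pairs are merged greedily from left to right.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            first, second = tokens[i], tokens[i + 1]
            count = bigrams[(first, second)]
            score = (count - delta) / (unigrams[first] * unigrams[second])
            if count >= min_count and score > threshold:
                merged.append(first + "_" + second)  # treat the pair as one unit
                i += 2
                continue
        merged.append(tokens[i])
        i += 1
    return merged


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    dim = len(vectors[0])
    output = []
    for query in vectors:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        # Numerically stable softmax over the scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, vectors))
                       for j in range(dim)])
    return output


# Toy corpus: "juristische person" (legal person) occurs twice.
tokens = "eine juristische person haftet wenn die juristische person handelt".split()
phrases = merge_frequent_bigrams(tokens)

# Toy 2-dimensional vectors standing in for Skip-gram phrase embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0]]
attended = self_attention(toy_vectors)
```

In this sketch the recurring pair is merged into the single token `juristische_person`, so that a later embedding step would assign one vector to the multi-word term rather than one vector per word.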

https://doi.org/10.14746/cl.52.2022.14

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015. arXiv:1409.0473v7 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.1409.0473.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5: 135–46. https://aclanthology.org/Q17-1010.pdf (accessed December 28, 2022). DOI: https://doi.org/10.1162/tacl_a_00051

Bouma, Gerlof. 2009. Normalized (Pointwise) Mutual information in collocation extraction. In From Form to Meaning: Processing Texts Automatically: Proceedings of the Biennial GSCL Conference 2009, eds. Christian Chiarcos, Richard Eckart de Castilho and Manfred Stede, 31–40. Tübingen: Gunter Narr.

Camacho-Collados, José, and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research 63: 743–88. DOI: https://doi.org/10.1613/jair.1.11259

Diniz da Costa, Alexandre, Mateus Coutinho Marim, Ely Edison da Silva Matos, and Tiago Timponi Torrent. 2022. Domain Adaptation in Neural Machine Translation using a Qualia-Enriched FrameNet. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), eds. Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk and Stelios Piperidis, 1–12. Paris: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2022/LREC-2022.pdf (accessed December 28, 2022).

Duběda, Tomáš. 2021. Direction-asymmetric equivalence in legal translation. Comparative Legilinguistics 47: 57–72. DOI: https://doi.org/10.2478/cl-2021-0012

Giampieri, Patrizia. 2018. The web as corpus and online corpora for legal translations. Comparative Legilinguistics 33: 35–55. DOI: https://doi.org/10.14746/cl.2018.33.2

Goźdź-Roszkowski, Stanisław. 2021. Corpus linguistics in legal discourse. International Journal for the Semiotics of Law - Revue internationale de Sémiotique juridique 34: 1515–1540. DOI: https://doi.org/10.1007/s11196-021-09860-8

Jurafsky, Daniel, and James H. Martin. 2022. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (3rd edition draft). https://web.stanford.edu/~jurafsky/slp3/ed3book_jan122022.pdf (accessed December 28, 2022).

Kalchbrenner, Nal, and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, 18-21 October 2013 (EMNLP 2013), 1700–09. Stroudsburg: Association for Computational Linguistics. https://aclanthology.org/D13-1176.pdf (accessed December 28, 2022).

Klemen, Matej, Luka Krsnik, and Marko Robnik-Šikonja. 2022. Enhancing deep neural networks with morphological information. Natural Language Engineering 28(3): 1–26. DOI: https://doi.org/10.1017/S1351324922000080

Krimpas, Panagiotis G. 2017a. Terminological preciseness or translational and legal effectiveness? Terminology of commodatum in no > el language pair. In Konteksty súdneho prekladu a tlmočenia VI, ed. Zuzana Guldanová, 66–84. Bratislava: Univerzita Komenského v Bratislave. https://fphil.uniba.sk/fileadmin/fif/katedry_pracoviska/kgn/transius/na_stiahnutie/Kontexty_sudneho_prekladu_a_tlmocenia_VI_2017.pdf (accessed December 28, 2022).

Krimpas, Panagiōtīs G. 2017b. Eisagōgī stī theōria tīs metafrasīs [Introduction to Translation Theory]. Athīna: Grīgorī.

Krimpas, Panagiōtīs G. 2019. Pseudologioi typoi kai yperdiorthōsī stī Neoellīnikī Koinī me vasī ta epipeda glōssikīs analysīs [Pseudo-learned forms and hypercorrection in Standard Modern Greek on the basis of linguistic analysis levels]. In Apo ton oiko sto spiti kai tanapalin… To logio epipedo stī sygchronī nea ellīnikī: Theōria, Istoria, Efarmogī [From oikos to spiti and vice versa: The learned register in Standard Modern Greek: Theory, History, Practice], eds. Asimakīs Fliatouras and Anna Anastasiadī-Symeōnidī, 57–126. Athīna: Patakī.

Del, Maksym, Andre Tättar, and Mark Fishel. 2018. Phrase-based unsupervised machine translation with compositional phrase embeddings. In Proceedings of the Third Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, Belgium, Brussels, October 31 - November 1, 2018, 361–67. Stroudsburg: Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-6407

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2014. Distributed representations of words and phrases and their compositionality. In 27th Annual Conference on Neural Information Processing Systems 2013, December 5-10, 2013 Lake Tahoe, Nevada, USA, Volume 1 of 4, 3128–36. New York: Curran.

Prieto Ramos, Fernando. 2014. Parameters for problem-solving in legal translation: Implications for legal lexicography and institutional terminology management. In The Ashgate Handbook of Legal Translation, eds. Le Cheng, King Kui Sin, and Anne Wagner, 121–134. Abingdon: Routledge.

Shang, Jingbo, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering 30(10): 1825–1837. DOI: https://doi.org/10.1109/TKDE.2018.2812203

Tezcan, Arda, Véronique Hoste, and Lieve Macken. 2017. SCATE Taxonomy and Corpus of Machine Translation Errors. In Trends in e-Tools and Resources for Translators and Interpreters, eds. Gloria Corpas Pastor and Isabel Durán Muñoz, 219–248. Leiden: Brill. DOI: https://doi.org/10.1163/9789004351790_012

Tognini Bonelli, Elena. 2001. Corpus Linguistics at Work. Amsterdam and Philadelphia: John Benjamins. DOI: https://doi.org/10.1075/scl.6

Valeontīs, Kōnstantinos E., and Panagiōtīs G. Krimpas. 2014. Nomikī Glōssa, Nomikī Orologia: Theōria kai Praxī [Legal Language, Legal Terminology: Theory and Practice]. Athīna: Nomikī Vivliothīkī/Ellīnikī Etaireia Orologias.

van Brussel, Laura, Arda Tezcan, and Lieve Macken. 2018. A Fine-grained Error Analysis of NMT, PBMT and RBMT Output for English-to-Dutch. In Proceedings of the 11th Conference on Language Resources and Evaluation (LREC 2018), eds. Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis and Takenobu Tokunaga, 3799–3804. Paris: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2018/index.html (accessed December 28, 2022).

Van de Cruys, Tim. 2011. Two Multivariate Generalizations of Pointwise Mutual Information. In Proceedings of the Workshop on Distributional Semantics and Compositionality (DiSCo’2011), 16–20. Stroudsburg: Association for Computational Linguistics.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. DOI: https://doi.org/10.48550/arXiv.1706.03762.

Vaswani, Ashish, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: MT Researchers’ Track), eds. Colin Cherry and Graham Neubig, 193–99. Association for Machine Translation in the Americas. https://aclanthology.org/W18-1819.pdf (accessed December 28, 2022).

Wiesmann, Ewa. 2019. Machine translation in the field of law: A study of the translation of Italian legal texts into German. Comparative Legilinguistics 37: 117–153. DOI: https://doi.org/10.14746/cl.2019.37.4

Zhang, Jiajun, Shujie Liu, Mu Li, Ming Zhou, and Chengqing Zong. 2014. Bilingually-constrained Phrase Embeddings for Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Volume 1: Long Papers, ACL 2014, June 22–27, Baltimore, 111–21. Stroudsburg: Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/P14-1011