Attention mechanism and skip-gram embedded phrases: short and long-distance dependency n-grams for legal corpora

Panagiotis Krimpas; Christina Valavani

doi:10.14746/cl.52.2022.14

Vol. 52 (2022), Articles

Published 2023-01-09

Panagiotis Krimpas

https://orcid.org/0000-0001-7271-9653

Democritus University of Thrace

Greece

Christina Valavani

https://orcid.org/0000-0002-2944-0734

National and Kapodistrian University of Athens

Greece

Keywords

computational linguistics
legal terminology
legal translation
Neural Machine Translation
Self Attention Mechanism
short and long-distance dependency n-grams
skip-gram algorithm

How to cite

Krimpas, P., & Valavani, C. (2023). Attention mechanism and skip-gram embedded phrases: short and long-distance dependency n-grams for legal corpora. Comparative Legilinguistics, 52, 318–350. https://doi.org/10.14746/cl.52.2022.14

Abstract

This article examines common errors that occur in the translation of legal texts. In particular, it focuses on how German texts containing legal terminology are rendered into Modern Greek by Google's machine translation system. Our case study is the Google-assisted translation of the original (German) version of the Constitution of the Federal Republic of Germany into Modern Greek. We propose a training method in which phrases are extracted on the basis of their occurrence frequency, passed through the Skip-gram algorithm, and then integrated into the Self Attention Mechanism proposed by Vaswani et al. (2017), in order to minimise human effort and contribute to the development of a robust machine translation system for multi-word legal terms and special phrases. This Neural Machine Translation approach aims at deriving vectorised phrases from large corpora and processing them for translation. The direction for further research is to enlarge the in-domain training data set and to enrich the vector dimensions with more information on legal concepts (domain-specific features).
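
The pipeline named in the abstract (frequency-based phrase extraction, Skip-gram, self-attention) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the bigram score of Mikolov et al. (listed in the references), a toy English corpus invented for illustration, and attention without learned projections (queries, keys and values are the input vectors themselves). All function names, the threshold and the discount delta are hypothetical choices.

```python
import math
from collections import Counter


def phrase_scores(sentences, delta=1.0):
    """Score adjacent pairs as (count(a b) - delta) / (count(a) * count(b))."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {
        pair: (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }


def merge_phrases(tokens, scores, threshold):
    """Join adjacent pairs whose score exceeds the threshold into one token."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, 0.0) > threshold:
            merged.append("_".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def skipgram_pairs(tokens, window=2):
    """Yield (centre, context) training pairs within a symmetric window."""
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield centre, tokens[j]


def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = vectors."""
    d = len(vectors[0])
    output = []
    for q in vectors:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        peak = max(logits)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        output.append([sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(d)])
    return output


# Toy corpus (invented placeholder sentences, not taken from the article).
toy_corpus = [
    ["the", "legal", "person", "may", "sue"],
    ["a", "legal", "person", "has", "rights"],
    ["the", "court", "may", "decide"],
]
scores = phrase_scores(toy_corpus)
merged = merge_phrases(toy_corpus[0], scores, threshold=0.1)
print(merged)  # ['the', 'legal_person', 'may', 'sue']
print(list(skipgram_pairs(merged, window=1)))
```

In a full system the (centre, context) pairs would train phrase embeddings, and those embeddings would replace the toy vectors passed to self_attention. Treating a multi-word term as a single token is what allows the attention step to weight it as one unit.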

References

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015. arXiv:1409.0473v7 [cs.CL]. DOI: https://doi.org/10.48550/arXiv.1409.0473.

Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5: 135–46. https://aclanthology.org/Q17-1010.pdf (accessed December 28, 2022). DOI: https://doi.org/10.1162/tacl_a_00051

Bouma, Gerlof. 2009. Normalized (Pointwise) Mutual information in collocation extraction. In From Form to Meaning: Processing Texts Automatically: Proceedings of the Biennial GSCL Conference 2009, eds. Christian Chiarcos, Richard Eckart de Castilho and Manfred Stede, 31–40. Tübingen: Gunter Narr.

Camacho-Collados, José, and Mohammad Taher Pilehvar. 2018. From word to sense embeddings: A survey on vector representations of meaning. Journal of Artificial Intelligence Research 63: 743–88. DOI: https://doi.org/10.1613/jair.1.11259

Diniz da Costa, Alexandre, Mateus Coutinho Marim, Ely Edison da Silva Matos, and Tiago Timponi Torrent. 2022. Domain Adaptation in Neural Machine Translation using a Qualia-Enriched FrameNet. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), eds. Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk and Stelios Piperidis, 1–12. Paris: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2022/LREC-2022.pdf (accessed December 28, 2022).

Duběda, Tomáš. 2021. Direction-asymmetric equivalence in legal translation. Comparative Legilinguistics 47: 57–72. DOI: https://doi.org/10.2478/cl-2021-0012

Giampieri, Patrizia. 2018. The web as corpus and online corpora for legal translations. Comparative Legilinguistics 33: 35–55. DOI: https://doi.org/10.14746/cl.2018.33.2

Goźdź-Roszkowski, Stanisław. 2021. Corpus linguistics in legal discourse. International Journal for the Semiotics of Law - Revue internationale de Sémiotique juridique 34: 1515–1540. DOI: https://doi.org/10.1007/s11196-021-09860-8

Jurafsky, Daniel, and James H. Martin. 2022. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition (3rd edition draft). https://web.stanford.edu/~jurafsky/slp3/ed3book_jan122022.pdf (accessed December 28, 2022).

Kalchbrenner, Nal, and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, 18-21 October 2013 (EMNLP 2013), 1700–09. Stroudsburg: Association for Computational Linguistics. https://aclanthology.org/D13-1176.pdf (accessed December 28, 2022).

Klemen, Matej, Luka Krsnik, and Marko Robnik-Šikonja. 2022. Enhancing deep neural networks with morphological information. Natural Language Engineering 28(3): 1–26. DOI: https://doi.org/10.1017/S1351324922000080

Krimpas, Panagiotis G. 2017a. Terminological preciseness or translational and legal effectiveness? Terminology of commodatum in no > el language pair. In Konteksty súdneho prekladu a tlmočenia VI, ed. Zuzana Guldanová, 66–84. Bratislava: Univerzita Komenského v Bratislave. https://fphil.uniba.sk/fileadmin/fif/katedry_pracoviska/kgn/transius/na_stiahnutie/Kontexty_sudneho_prekladu_a_tlmocenia_VI_2017.pdf (accessed December 28, 2022).

Krimpas, Panagiōtīs G. 2017b. Eisagōgī stī theōria tīs metafrasīs [Introduction to Translation Theory]. Athīna: Grīgorī.

Krimpas, Panagiōtīs G. 2019. Pseudologioi typoi kai yperdiorthōsī stī Neoellīnikī Koinī me vasī ta epipeda glōssikīs analysīs [Pseudo-learned forms and hypercorrection in Standard Modern Greek on the basis of linguistic analysis levels]. In Apo ton oiko sto spiti kai tanapalin… To logio epipedo stī sygchronī nea ellīnikī: Theōria, Istoria, Efarmogī [From oikos to spiti and vice versa: The learned register in Standard Modern Greek: Theory, History, Practice], eds. Asimakīs Fliatouras and Anna Anastasiadī-Symeōnidī, 57–126. Athīna: Patakī.

Del, Maksym, Andre Tättar, and Mark Fishel. 2018. Phrase-based unsupervised machine translation with compositional phrase embeddings. In Proceedings of the Third Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, Belgium, Brussels, October 31 - November 1, 2018, 361–67. Stroudsburg: Association for Computational Linguistics. DOI: https://doi.org/10.18653/v1/W18-6407

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2014. Distributed representations of words and phrases and their compositionality. In 27th Annual Conference on Neural Information Processing Systems 2013, December 5-10, 2013 Lake Tahoe, Nevada, USA, Volume 1 of 4, 3128–36. New York: Curran.

Prieto Ramos, Fernando. 2014. Parameters for problem-solving in legal translation: Implications for legal lexicography and institutional terminology management. In The Ashgate Handbook of Legal Translation, eds. Le Cheng, King Kui Sin, and Anne Wagner, 121–134. Abingdon: Routledge.

Shang, Jingbo, Jialu Liu, Meng Jiang, Xiang Ren, Clare R. Voss, and Jiawei Han. 2018. Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge and Data Engineering 30(10): 1825–1837. DOI: https://doi.org/10.1109/TKDE.2018.2812203

Tezcan, Arda, Véronique Hoste, and Lieve Macken. 2017. SCATE Taxonomy and Corpus of Machine Translation Errors. In Trends in e-Tools and Resources for Translators and Interpreters, eds. Gloria Corpas Pastor and Isabel Durán Muñoz, 219–248. Leiden: Brill. DOI: https://doi.org/10.1163/9789004351790_012

Tognini Bonelli, Elena. 2001. Corpus Linguistics at Work. Amsterdam and Philadelphia: John Benjamins. DOI: https://doi.org/10.1075/scl.6

Valeontīs, Kōnstantinos E., and Panagiōtīs G. Krimpas. 2014. Nomikī Glōssa, Nomikī Orologia: Theōria kai Praxī [Legal Language, Legal Terminology: Theory and Practice]. Athīna: Nomikī Vivliothīkī/Ellīnikī Etaireia Orologias.

van Brussel, Laura, Arda Tezcan, and Lieve Macken. 2018. A Fine-grained Error Analysis of NMT, PBMT and RBMT Output for English-to-Dutch. In Proceedings of the 11th Conference on Language Resources and Evaluation (LREC 2018), eds. Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis and Takenobu Tokunaga, 3799–3804. Paris: European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2018/index.html (accessed December 28, 2022).

Van de Cruys, Tim. 2011. Two Multivariate Generalizations of Pointwise Mutual Information. In Proceedings of the Workshop on Distributional Semantics and Compositionality (DiSCo’2011), 16–20. Stroudsburg: Association for Computational Linguistics.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. DOI: https://doi.org/10.48550/arXiv.1706.03762.

Vaswani, Ashish, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. 2018. Tensor2Tensor for neural machine translation. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: MT Researchers’ Track), eds. Colin Cherry and Graham Neubig, 193–99. Association for Machine Translation in the Americas. https://aclanthology.org/W18-1819.pdf (accessed December 28, 2022).

Wiesmann, Ewa. 2019. Machine translation in the field of law: A study of the translation of Italian legal texts into German. Comparative Legilinguistics 37: 117–153. DOI: https://doi.org/10.14746/cl.2019.37.4

Zhang, Jiajun, Shujie Liu, Mu Li, Ming Zhou, and Chengqing Zong. 2014. Bilingually-constrained Phrase Embeddings for Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: Volume 1: Long Papers, ACL 2014, June 22–27, Baltimore, 111–21. Stroudsburg: Association for Computational Linguistics. DOI: https://doi.org/10.3115/v1/P14-1011

License

This work is licensed under a Creative Commons Attribution 4.0 International license.

By submitting a paper, the author agrees to the following publication agreement and to the processing of personal data.

PUBLICATION AGREEMENT, COPYRIGHT LICENSE, PERSONAL DATA PROCESSING CONSENT

This is a publication agreement and copyright license (“Agreement”) regarding a written manuscript currently submitted via Pressto Platform (“Article”) to be published in Comparative Legilinguistics International Journal for Legal Communication (“Journal”).

The parties to this Agreement are:

the Author or Authors of the submitted article (individually, or if more than one author, collectively, “Author”) and Comparative Legilinguistics International Journal for Legal Communication (“Publisher”), address al. Niepodległości 4, 61-874 Poznań, represented by its editor in chief Aleksandra Matulewska.

§1. LICENSE OF COPYRIGHT

a) The Author and the Publisher agree that the Author grants to the general public a Creative Commons Attribution 4.0 International License (CC BY 4.0) in the Article, which is incorporated herein by reference.

b) The Author grants to the Publisher a royalty-free, worldwide nonexclusive license to publish, reproduce, display, distribute, translate and use the Article in any form, either separately or as part of a collective work, including but not limited to a nonexclusive license to publish the Article in an issue of the Journal, copy and distribute individual reprints of the Article, authorize reproduction of the entire Article in another publication, and authorize reproduction and distribution of the Article or an abstract thereof by means of computerized retrieval systems (such as Westlaw, Lexis and SSRN). The Author retains ownership of all rights under copyright in the Article, and all rights not expressly granted in this Agreement.

c) The Author grants to the Publisher the power to assign, sublicense or otherwise transfer any and all licenses expressly granted to the Publisher under this Agreement.

d) Republication. The Author agrees to require that the Publisher be given credit as the original publisher in any republication of the Article authorized by the Author. If the Publisher authorizes any other party to republish the Article under the terms of paragraphs 1c and 1 of this Agreement, the Publisher shall require such party to ensure that the Author is credited as the Author.

§2. EDITING OF THE ARTICLE

a) The Author agrees that the Publisher may edit the Article as suitable for publication in the Journal. To the extent that the Publisher’s edits amount to copyrightable works of authorship, the Publisher hereby assigns all right, title, and interest in such edits to the Author.

§3. WARRANTIES

a) The Author represents and warrants that to the best of the Author’s knowledge the Article does not defame any person, does not invade the privacy of any person, and does not in any other manner infringe upon the rights of any person. The Author agrees to indemnify and hold harmless the Publisher against all such claims.

b) The Author represents and warrants that the Author has full power and authority to enter into this Agreement and to grant the licenses granted in this Agreement.

c) The Author represents and warrants that the Article furnished to the Publisher has not been published previously. For purposes of this paragraph, making a copy of the Article accessible over the Internet, including, but not limited to, posting the Article to a database accessible over the Internet, does not constitute prior publication so long as such copy indicates that the Article is not in final form, such as by designating such copy to be a “draft,” a “working paper,” or “work-in-progress”. The Author agrees to hold harmless the Publisher, its licensees and distributees, from any claim, action, or proceeding alleging facts that constitute a breach of any warranty enumerated in this paragraph.

§4. TERM

a) This Agreement is concluded for an indefinite term.

§5. PAYMENT

a) The Author agrees and acknowledges that the Author will receive no payment from the Publisher for use of the Article or the licenses granted in this Agreement.

b) The Publisher agrees and acknowledges that the Publisher will not receive any payment from the Author for publication by the Publisher.

§6. ENTIRE AGREEMENT

a) This Agreement supersedes any and all other agreements, either oral or in writing, between the Author and the Publisher with respect to the subject of this Agreement. This Agreement contains all of the warranties and agreements between the parties with respect to the Article, and each party acknowledges that no representations, inducements, promises, or agreements have been made by or on behalf of any party except those warranties and agreements embodied in this Agreement.

b) In all cases not regulated by this Agreement, legal provisions of Polish Copyright Act and Polish Civil Code shall apply.

c) Any disputes arising from the enforcement of obligations connected with this Agreement shall be resolved by a court competent for the headquarters of the Publisher.

d) Any amendments or additions to the Agreement must be made in writing and signed by authorised representatives of both parties; otherwise they are ineffective.

e) This Agreement is signed electronically and the submission of the article via the PRESSto platform is considered as the conclusion of the Agreement by the Author and the Publisher.

f) Clause for consent to the processing of personal data - general

g) The Author gives consent to the processing of his or her personal data in accordance with the Act of 10 May 2018 on the protection of personal data and Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), for the purpose of and in connection with making publications available on the PRESSto scientific journals platform and the DeGruyter platform, guaranteeing the security of the services rendered, and improving them.

I HAVE READ AND AGREE FULLY WITH THE TERMS OF THIS AGREEMENT.

The Author The Publisher
