Semi-automatic ontological alignment of digitized books parallel corpora

Algirdas Laukaitis; Neda Laukaitytė

doi:10.3846/mla.2021.15034

Abstract

In this paper, we present a method that integrates general ontology management with the alignment of a paraphrase corpus of digitized books, compiled from a bilingual parallel corpus. We show that adding semantic parsing and machine translation to the general knowledge management process can improve ontology development and consistency checking. Additionally, we argue that focusing on one's favorite books adds an element of gamification to the knowledge management process. A new formalism of semantic-parsing ontological alignments is introduced, and its use for ontology development and consistency checking is discussed. It is shown that existing general ontologies require many more axioms than are currently available in order to explain the unaligned content of books. A proactive learning approach is suggested as part of the solution for improving the development of ontology predicates and axioms. The WordNet, FrameNet and SUMO ontologies are used as the starting knowledge base for the semantic alignment of the paraphrase corpus.

Article in English.

Semi-automatic alignment of a parallel corpus of digitized books using ontologies

Summary

The paper presents a method for managing a general ontology using paraphrase corpora obtained from works of fiction. The presented method can improve further ontology extension and logical consistency checking. Its functionality rests on two essential technologies: semantic text analysis and automatic machine translation. An important aspect of the method is the use of game elements in managing general ontologies. This aspect is ensured by tying the ontology management process closely to works of fiction. The paper presents a new formalism for ontology alignment. The results show that existing general ontologies must be supplemented with far more axioms than they currently contain in order to explain the semantic equivalence of unaligned paraphrases. In addition, the paper proposes a proactive learning method that improves the ontology development process. The WordNet, FrameNet and SUMO ontologies are used as initial knowledge bases to improve the semantic alignment method.

Keywords: text alignment, ontology development and use, automatic machine translation, natural language processing algorithms.

Keywords: ontological alignment of corpora, alignment of digitized books, machine translation, natural language processing

How to Cite

Laukaitis, A., & Laukaitytė, N. (2021). Semi-automatic ontological alignment of digitized books parallel corpora. Mokslas – Lietuvos Ateitis / Science – Future of Lithuania, 13. https://doi.org/10.3846/mla.2021.15034

Published in Issue

Jul 2, 2021

This work is licensed under a Creative Commons Attribution 4.0 International License.
