The Analysis and Implementation of the Algorithm for Gheg Albanian to English Translation

Session

Computer Science and Communication Engineering

Description

Machine translation has achieved strong results for high-resource language pairs, offering greater speed and efficiency in multilingual communication than traditional translation methods. Low-resource languages, however, such as the Gheg Albanian dialect, remain underexplored. Gheg Albanian plays an important role in Albanian identity and culture, especially in the northern Albanian-speaking regions, yet it lacks parallel data and translation systems for other languages. This paper applies neural machine translation to Gheg-English translation in one of the first experiments on this language pair. A Gheg-English dataset of 1.2k sentence pairs was manually curated, and standard pre-processing was applied: punctuation and whitespace normalization, outlier and duplicate removal, normalization of clitics, and case-sensitive orthographic normalization to distinguish proper names from common nouns. Tokenization was performed with the MarianTokenizer, which uses SentencePiece sub-word tokenization. A pre-trained MarianMT model for Standard Albanian-English translation was evaluated in its base form and in six fine-tuned variants with varying numbers of epochs, learning rates, and training-data sizes. The models were evaluated with the BLEU and chrF metrics. The base model scored poorly on the dataset, and the fine-tuned models improved when trained on the full dataset, though all variants performed comparably and tended to struggle with more complex dialectal sentences. Performance also degraded when only 30% of the dataset was used, due to overfitting, underscoring the importance of sufficient training data. While the current models are not yet adequate for a practical translation pipeline, the improvements observed with more data are promising for future development of Gheg-English machine translation.
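As a minimal illustration of the pipeline described above, the sketch below tokenizes a Gheg sentence with the MarianTokenizer (SentencePiece sub-words), translates it with a pre-trained Standard Albanian-English MarianMT model, and scores the output with BLEU and chrF via sacreBLEU. The checkpoint name (Helsinki-NLP/opus-mt-sq-en) and the example sentence pair are assumptions made for illustration; they are not necessarily the exact resources or settings used in the paper.

import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

# Assumed Standard Albanian -> English MarianMT checkpoint; the paper does not
# name its exact checkpoint in the abstract.
checkpoint = "Helsinki-NLP/opus-mt-sq-en"
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

# Hypothetical Gheg source sentence and English reference (illustrative only).
sources = ["A po vjen me mu n'qytet?"]
references = ["Are you coming with me to the city?"]

# SentencePiece sub-word tokenization, then beam-search decoding.
batch = tokenizer(sources, return_tensors="pt", padding=True, truncation=True)
outputs = model.generate(**batch, max_length=128, num_beams=4)
hypotheses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Corpus-level BLEU and chrF, the metrics used to compare base and fine-tuned models.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")

Fine-tuning itself could follow the standard transformers sequence-to-sequence training workflow, sweeping the number of epochs, the learning rate, and the training-set size as described in the abstract.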

Keywords:

Neural machine translation, Gheg Albanian dialect, MarianMT, fine-tuning, low-resource language, transfer learning

Proceedings Editor

Edmond Hajrizi

ISBN

978-9951-982-41-2

Location

UBT Kampus, Lipjan

Start Date

25-10-2025 9:00 AM

End Date

26-10-2025 6:00 PM

DOI

10.33107/ubt-ic.2025.77
