The Analysis and Implementation of the Algorithm for Gheg Albanian to English Translation
Session
Computer Science and Communication Engineering
Description
Machine translation has demonstrated great success and efficiency in multilingual communication across high-resource language pairs compared to traditional translation methods. However, low-resource languages, such as the Gheg Albanian dialect, remain underexplored. Gheg Albanian plays an important role in Albanian identity and culture, especially in the northern Albanian-speaking regions, yet it lacks parallel data and translation systems with other languages. This paper applies neural machine translation to carry out one of the first experiments on this language pair. A Gheg-English dataset containing 1.2k sentence pairs was manually curated, and standard pre-processing techniques were applied, including punctuation and whitespace normalization, outlier and duplicate removal, clitic normalization, and case-sensitive orthographic normalization to distinguish proper names from common nouns. Tokenization was performed with the Marian tokenizer, which is based on SentencePiece sub-word tokenization. A pre-trained MarianMT model for Standard Albanian-English translation was evaluated in its base form and in six fine-tuned forms with varying numbers of epochs, learning rates, and dataset sizes. The models were evaluated with the BLEU and chrF metrics. The base model scored poorly on the dataset, and the fine-tuned models improved when trained on the full dataset; however, their performance remained comparable in that they still struggled with more complex dialectal sentences. Performance also degraded when training on a smaller subset (30% of the data) due to overfitting, underscoring the importance of having a substantial amount of data available. While the current models are not yet adequate for a practical translation pipeline, the modest improvements with increased data are promising for future development of Gheg-English machine translation.
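The following is a minimal evaluation sketch rather than the authors' exact pipeline: it assumes the Helsinki-NLP/opus-mt-sq-en checkpoint as the pre-trained Standard Albanian-English MarianMT model, the Hugging Face transformers MarianTokenizer (SentencePiece-based), and sacrebleu for the BLEU and chrF metrics named in the abstract; the example sentence pair is hypothetical, since the curated 1.2k-pair dataset is not reproduced here.

# Hedged sketch: score a MarianMT model on Gheg-English pairs with BLEU and chrF.
import sacrebleu
from transformers import MarianMTModel, MarianTokenizer

MODEL_NAME = "Helsinki-NLP/opus-mt-sq-en"  # assumed Standard Albanian -> English checkpoint

tokenizer = MarianTokenizer.from_pretrained(MODEL_NAME)  # SentencePiece sub-word tokenizer
model = MarianMTModel.from_pretrained(MODEL_NAME)        # swap in a fine-tuned checkpoint here

# Hypothetical evaluation pair; in practice this would be the held-out split of the curated dataset.
gheg_sentences = ["A po vjen nesër me ne?"]
references = ["Are you coming with us tomorrow?"]

# Translate the Gheg source sentences with the (base or fine-tuned) model.
batch = tokenizer(gheg_sentences, return_tensors="pt", padding=True, truncation=True)
generated = model.generate(**batch, max_length=128)
hypotheses = tokenizer.batch_decode(generated, skip_special_tokens=True)

# Corpus-level scores with the same metrics used in the paper.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
chrf = sacrebleu.corpus_chrf(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}  chrF: {chrf.score:.2f}")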
Keywords:
Neural machine translation, Gheg Albanian dialect, MarianMT, fine-tuning, low-resource language, transfer learning
Proceedings Editor
Edmond Hajrizi
ISBN
978-9951-982-41-2
Location
UBT Kampus, Lipjan
Start Date
25-10-2025 9:00 AM
End Date
26-10-2025 6:00 PM
DOI
10.33107/ubt-ic.2025.77
Recommended Citation
Hajrizi, Elita and Karahoda, Bertan, "The Analysis and Implementation of the Algorithm for Gheg Albanian to English Translation" (2025). UBT International Conference. 9.
https://knowledgecenter.ubt-uni.net/conference/2025UBTIC/CS/9
