Albanian corpus dataset analysis using Apache Hadoop

Session

Computer Science and Communication Engineering

Description

Nowadays, we are dealing with very large amounts of data generated in different fields such as medicine, economics, social media, etc. Data analysis is one of the most important branches today. Many companies offer services to store this voluminous data, such as Prolifics, Clairvoyant, IBM, HP Enterprise, Teradata, Oracle, SAP, EMC, Amazon, Microsoft, Google, VMware, Splunk, and Alteryx [1]. The growth of these data continues exponentially and has made it impossible for a traditional database system to handle them, since they exceed its capacity. When we talk about such large volumes of data, also known as Big Data, we are dealing with an increase from gigabytes to terabytes, petabytes, zettabytes, and so on. Processing the data can involve multiple operations depending on the use case, such as collecting, classifying, indexing, exploring, and gathering results. The main problem is that no single machine, or even a few machines, can process such a large amount of data in a reasonable period of time. This paper presents experimental work on big data problems using Apache Hadoop as a solution. The objective is to work with Hadoop, with a particular focus on the MapReduce algorithm, and to analyze a dataset (an Albanian text corpus) created specifically for this case. The results gathered in this paper and the accompanying analyses show positive outcomes of the above approach for addressing such big data problems.
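To make the MapReduce and WordCount terms concrete, the following is a minimal sketch of the canonical Hadoop WordCount job of the kind the abstract describes: the mapper emits (word, 1) pairs for each token in the corpus and the reducer sums the counts per word. It uses the standard org.apache.hadoop MapReduce API; the class name, job name, and the HDFS input/output paths are illustrative placeholders, not the authors' actual implementation or preprocessing of the Albanian corpus.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token in a line of the corpus.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "albanian corpus word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Placeholder paths: args[0] = corpus directory in HDFS, args[1] = output directory.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In a typical setup the corpus files would be copied into HDFS and the job submitted with the hadoop jar command, with the two path arguments pointing at the input and output directories.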

Keywords:

Big Data, Hadoop Technology, Hadoop Distributed File System (HDFS), MapReduce, WordCount.

Session Chair

Bertan Karahoda

Session Co-Chair

Besnik Qehaja

Proceedings Editor

Edmond Hajrizi

ISBN

978-9951-437-96-7

Location

Lipjan, Kosovo

Start Date

31-10-2020 10:45 AM

End Date

31-10-2020 12:30 PM

DOI

10.33107/ubt-ic.2020.526
