Albanian corpus dataset analysis using Apache Hadoop
Session
Computer Science and Communication Engineering
Description
Nowadays we are dealing with very large amounts of data generated in fields such as medicine, economics, and social media, and data analysis has become one of the most important disciplines today. Many companies offer services for storing this voluminous data, among them Prolifics, Clairvoyant, IBM, HP Enterprise, Teradata, Oracle, SAP, EMC, Amazon, Microsoft, Google, VMware, Splunk, and Alteryx [1]. The growth of this data continues exponentially, which has made it impossible to handle with a traditional database system, since it exceeds such a system's capacity. When we talk about large volumes of data, also known as Big Data, we are dealing with scales ranging from gigabytes to terabytes, petabytes, zettabytes, and beyond. Processing the data can involve multiple operations depending on the use case, such as collecting, classifying, indexing, exploring, and gathering results. The main problem is that no single machine, and not even a few machines, can process such a large amount of data within a finite period of time. This paper presents experimental work on big data problems using the Apache Hadoop approach as a solution. The objective is to work with Hadoop, with a particular focus on the MapReduce algorithm, and to analyse a dataset (an Albanian text corpus) created specifically for this case. The results gathered in this paper and the accompanying analyses show positive outcomes for the above approach to such big data problems.
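The paper's actual job is not reproduced on this page; purely as an illustration of the MapReduce word-counting approach named in the abstract and keywords, a minimal Java sketch against the standard org.apache.hadoop.mapreduce API might look as follows. The class names, job name, and input/output paths are illustrative assumptions, not taken from the paper.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch only: counts word frequencies over a text corpus stored in HDFS.
public class WordCount {

  // Mapper: emits (word, 1) for every whitespace-separated token in a line of the corpus
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "albanian corpus word count"); // job name is illustrative
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. HDFS path to the corpus
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Such a job would typically be packaged into a jar, submitted with the hadoop jar command, and pointed at an HDFS input directory holding the corpus and an HDFS output directory for the per-word counts.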
Keywords:
Big Data, Hadoop Technology, Hadoop Distributed File System (HDFS), MapReduce, WordCount.
Session Chair
Bertan Karahoda
Session Co-Chair
Besnik Qehaja
Proceedings Editor
Edmond Hajrizi
ISBN
978-9951-437-96-7
Location
Lipjan, Kosovo
Start Date
31-10-2020 10:45 AM
End Date
31-10-2020 12:30 PM
DOI
10.33107/ubt-ic.2020.526
Recommended Citation
Avdimetaj, Fëllanza, "Albanian corpus dataset analysis using Apache Hadoop" (2020). UBT International Conference. 331.
https://knowledgecenter.ubt-uni.net/conference/2020/all_events/331