Buduri_icbme2025-German Reddit Sentiment Analysis Using BERT and ML Techniques

Session

Computer Science and Communication Engineering

Description

This study presents a complete workflow for collecting, preprocessing, and analyzing German-language Reddit comments for sentiment analysis. First, a Python-based scraper built on the Reddit API was developed to extract topic-specific comments, which were then cleaned by removing special characters, links, and irrelevant tokens. The dataset underwent tokenization, stopword removal, and normalization through stemming and lemmatization to produce a structured corpus suitable for machine learning tasks.

Sentiment classification was performed with the German BERT model (oliverguhr/german-sentiment-bert), which labels each comment as positive, negative, or neutral. To further evaluate performance, Bag-of-Words and TF-IDF vectorization were applied, followed by machine learning classifiers including Logistic Regression, Random Forest, and Naive Bayes. Performance was assessed using confusion matrices, classification reports, and error analysis.

Additionally, visualizations were created to highlight the words that contribute most to positive and negative sentiment classification, along with graphical representations of prediction errors. This integrated approach demonstrates how structured preprocessing combined with advanced modeling can improve the accuracy of sentiment analysis on social media data. The methodology provides a foundation for monitoring public opinion and extracting insights from German-language user-generated content.
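The cleaning stage described above can be illustrated with a minimal sketch. It is not the authors' code: the function name clean_comment, the regular expressions, and the very short stopword list are assumptions made for illustration, and a real pipeline would likely use a fuller German stopword list and add stemming or lemmatization afterwards.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
# Negations such as "nicht" are kept because they can flip sentiment.
GERMAN_STOPWORDS = {"der", "die", "das", "und", "ist", "ich", "ein", "eine", "es", "zu", "auf"}

# Matches http(s) links and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Matches anything that is not a lowercase German letter or whitespace.
NON_LETTER_RE = re.compile(r"[^a-zäöüß\s]")


def clean_comment(text: str) -> list[str]:
    """Lowercase a comment, strip links and special characters,
    split on whitespace, and drop stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links first, before punctuation is stripped
    text = NON_LETTER_RE.sub(" ", text)  # remove digits, punctuation, emoji, and other symbols
    tokens = text.split()                # simple whitespace tokenization
    return [t for t in tokens if t not in GERMAN_STOPWORDS]


if __name__ == "__main__":
    print(clean_comment("Das ist SUPER!!! Schau: https://example.com/x?y=1 :)"))
```

The resulting token lists could then be joined back into strings and passed to a Bag-of-Words or TF-IDF vectorizer. A transformer model such as German BERT would normally receive the lightly cleaned text rather than the stopword-filtered tokens, since it relies on its own tokenizer.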

Keywords:

Sentiment Analysis, German BERT, Reddit

Proceedings Editor

Edmond Hajrizi

ISBN

978-9951-982-41-2

Location

UBT Kampus, Lipjan

Start Date

25-10-2025 9:00 AM

End Date

26-10-2025 6:00 PM

DOI

10.33107/ubt-ic.2025.73
