Buduri_icbme2025-German Reddit Sentiment Analysis Using BERT and ML Techniques
Session
Computer Science and Communication Engineering
Description
This study presents a comprehensive workflow for collecting, preprocessing, and analyzing German-language comments from Reddit for sentiment analysis. Initially, a Python-based scraper using the Reddit API was developed to extract topic-specific comments, which were then cleaned by removing special characters, links, and irrelevant tokens. The dataset underwent tokenization, stopword removal, and normalization using stemming and lemmatization to produce a structured corpus suitable for machine learning tasks. Sentiment classification was performed using the German BERT model (oliverguhr/german-sentiment-bert), categorizing comments as positive, negative, or neutral. To further evaluate performance, vectorization techniques such as Bag-of-Words and TF-IDF were applied, followed by machine learning classifiers including Logistic Regression, Random Forest, and Naive Bayes. Performance was assessed using confusion matrices, classification reports, and error analysis. Additionally, visualizations were created to highlight the most influential words contributing to positive and negative sentiment classification, as well as graphical representations of prediction errors. This integrated approach demonstrates how structured preprocessing combined with advanced modeling can enhance the accuracy of sentiment analysis on social media data. The methodology provides a solid foundation for monitoring public opinion and extracting insights from user-generated content in German.
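The cleaning and Bag-of-Words steps described above can be illustrated with a minimal sketch. It is not the author's implementation: the function names `clean_comment` and `bag_of_words`, the regular expressions, and the very short stopword list are all assumptions made for illustration, and stemming and lemmatization are omitted because they would require an external library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# a real pipeline would use a much fuller German stopword list.
GERMAN_STOPWORDS = {
    "der", "die", "das", "und", "ist", "ich",
    "nicht", "ein", "eine", "es", "zu", "mit",
}

# Matches http(s) links and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Matches anything that is not a lowercase German letter or whitespace.
NON_LETTER_RE = re.compile(r"[^a-zäöüß\s]")


def clean_comment(text):
    """Lowercase a comment, drop links and special characters,
    split on whitespace, and remove stopwords."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return [tok for tok in text.split() if tok not in GERMAN_STOPWORDS]


def bag_of_words(tokens):
    """Count how often each token occurs (a Bag-of-Words vector as a mapping)."""
    return Counter(tokens)
```

For example, `clean_comment("Das ist ein tolles Spiel!!! https://example.com/x?y=1")` returns `["tolles", "spiel"]`. Counts of this kind are the raw input that a TF-IDF weighting and a classifier such as Logistic Regression would then consume.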
Keywords:
Sentiment Analysis, German BERT, Reddit
Proceedings Editor
Edmond Hajrizi
ISBN
978-9951-982-41-2
Location
UBT Kampus, Lipjan
Start Date
25-10-2025 9:00 AM
End Date
26-10-2025 6:00 PM
DOI
10.33107/ubt-ic.2025.73
Recommended Citation
Buduri, Bledi, "Buduri_icbme2025-German Reddit Sentiment Analysis Using BERT and ML Techniques" (2025). UBT International Conference. 5.
https://knowledgecenter.ubt-uni.net/conference/2025UBTIC/CS/5
