Machine Learning Sentiment Analysis for Alpine Tour Reports

This repository contains Jupyter Notebooks and resources related to a machine learning project focused on sentiment analysis of alpine tour reports. The primary goal of this project is to classify critical sentences regarding route exposure (exposition) into three categories:

Neutral: Irrelevant for the assessment of exposure.
Positive: Light or well-protected exposure.
Negative: Strong exposure with little or no protection.

By leveraging modern NLP techniques and pre-trained models, this project aims to enhance the safety and decision-making process for alpine enthusiasts by analyzing subjective route descriptions.

Project Structure

1. Notebooks

01_DataEngineering.ipynb
- Demonstrates the process of data collection and preparation using web scraping and manual labeling.
- Includes exploratory data analysis (EDA) and preparation of a labeled dataset for sentiment analysis.
- Establishes a baseline accuracy of 46% using minimal data and no advanced techniques.
02_ML_Evaluation.ipynb
- Illustrates the consequences of skipping essential ML practices such as cross-validation, hyperparameter tuning, and adequate data preparation.
- Serves as an example of what not to do in a machine learning workflow.
- Shows how insufficient data and missing concepts degrade performance.
03_ML_SentimentAnalyse_Bergsport.ipynb
- Finalized implementation incorporating best practices:
  - Cross-validation
  - Hyperparameter optimization using Optuna
  - Use of multiple pre-trained transformer models for comparative analysis
- Achieves a model accuracy of 97.8% with a robust evaluation pipeline.

Problem Definition

In alpine sports, selecting the right route is critical for safety and success. Current classification systems like the SAC Trekking Scale (T1-T6) and High Alpine Scale (L-AS) often overgeneralize route exposure. This project aims to address the gap by leveraging machine learning to classify sentences describing route exposure.

Objective:

Achieve a classification accuracy >85% and an F1-score >80% on evaluation data.
Create a foundation for an end-to-end application that rates the exposure level of alpine routes.

Dataset

Data Sources

Tour reports were collected via web scraping and are used for educational and research purposes only.
Reports include subjective route descriptions varying in length, detail, and structure.

Data Pipeline

Web Scraping:
- Extracted route descriptions, titles, and classifications by region using Python.
Data Transformation:
- Processed text using spaCy for tokenization and lemmatization.
- Extracted sentences relevant to exposure using a custom keyword list.
Labeling:
- Sentences were manually labeled into three categories: neutral, positive, or negative.
- Final dataset contains around 1,000 labeled sentences.
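
The keyword-based extraction step can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: it swaps spaCy's tokenization and lemmatization for a naive regex sentence splitter and a case-insensitive substring match so that it runs with the standard library alone. The keyword list, the function names, and the sample report are hypothetical; the project's actual keyword list is not reproduced in this README.

```python
import re

# Hypothetical keywords for illustration; the project's real list is not shown here.
EXPOSURE_KEYWORDS = ("ausgesetzt", "exponiert", "abschüssig", "gesichert")


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_exposure_sentences(text, keywords=EXPOSURE_KEYWORDS):
    """Return the sentences that contain at least one keyword (case-insensitive)."""
    hits = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append(sentence)
    return hits


# Made-up sample report, used only to exercise the functions above.
report = (
    "Der Aufstieg beginnt beim Parkplatz. "
    "Der Grat ist stellenweise stark ausgesetzt! "
    "Die Schlüsselstelle ist mit Ketten gesichert. "
    "Oben gibt es eine schöne Aussicht."
)
exposure_sentences = extract_exposure_sentences(report)
for sentence in exposure_sentences:
    print(sentence)
```

Substring matching is cruder than matching on lemmas: it misses inflected forms that change the stem and can match unrelated words that happen to contain a keyword, which is one reason a lemmatizer is the better fit for the real pipeline.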

Machine Learning Workflow

1. Model Selection

A variety of transformer models were tested, with a focus on those optimized for German NLP tasks:

deepset/gbert-base (selected as the best model)
aari1995/German_Sentiment
oliverguhr/german-sentiment-bert
xlm-roberta-base
Other multilingual or distilled models

2. Baseline Evaluation

Initial tests in 02_ML_Evaluation.ipynb were conducted without advanced techniques, achieving a low accuracy of ~46%.
This highlights the impact of insufficient data and of skipping practices like cross-validation.

3. Advanced Techniques

Cross-Validation: Implemented K-Fold (5 folds) for robust evaluation.
Hyperparameter Tuning: Used Optuna to optimize learning rate, batch size, and other parameters.
Data Augmentation: Adjusted keyword lists to balance label distributions, focusing on increasing positive and negative samples.
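
To make the 5-fold setup concrete, here is a minimal sketch of how the roughly 1,000 labeled sentences could be split into training and validation folds. It uses plain Python to stay self-contained; this README does not state which library the notebook relies on, and the function name `k_fold_indices`, the seed, and the exact sample count of 1,000 are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_indices, val_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


# Assumed dataset size of 1,000 sentences, matching the approximate figure above.
splits = list(k_fold_indices(n_samples=1000, k=5))
for fold_no, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train_idx)} train / {len(val_idx)} validation")
```

Each sentence lands in the validation set exactly once, so the averaged score reflects the whole dataset rather than a single split. A stratified variant that keeps the label proportions equal across folds would be a natural refinement for an imbalanced three-class dataset.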

4. Final Results

Model: deepset/gbert-base
Accuracy: 97.8%
F1-Score: 0.892
Achieved through rigorous optimization and sufficient data preparation.
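
For readers unfamiliar with the two metrics, the sketch below computes accuracy and a macro-averaged F1-score on a tiny made-up example. The README does not say which F1 averaging the notebook uses, so macro averaging is an assumption here, and the labels and predictions are invented for illustration and are not project results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


LABELS = ["neutral", "positive", "negative"]

# Invented toy data: eight neutral sentences, one positive, one negative.
y_true = ["neutral"] * 8 + ["positive", "negative"]
# The single positive sentence is misclassified as neutral.
y_pred = ["neutral"] * 8 + ["neutral", "negative"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, LABELS)
print(f"accuracy={acc:.3f}, macro F1={f1:.3f}")
```

In this toy case one error on a rare class barely moves accuracy but pulls the macro F1 down sharply, which is why it helps to report both numbers when the label distribution is uneven.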

Key Insights

Lessons Learned

Data Quality Matters: Proper labeling and balanced datasets are essential for effective ML models.
Advanced Practices Pay Off: Techniques like cross-validation and hyperparameter tuning significantly improve performance.
Transformer Models Shine: Models like gbert-base excel in understanding German text nuances, thanks to features like Whole Word Masking (WWM).

Challenges

Manual labeling was time-consuming but crucial for training effective models.
Resource-intensive hyperparameter tuning required efficient use of Google Colab and cloud GPUs.
