Chocolate & Machine Learning

Above is a Computer Science final project presentation on the data analysis of a Kaggle dataset of various chocolate bars.

The dataset had many dimensions, with sub-categories nested inside them, which made preprocessing the data absolutely crucial. Others had approached it before, but each attempt had flaws in either its implementation or its results.

After preprocessing the data, we trained a variety of machine learning (ML) algorithms on it.

Bitter chocolate is so unjustly hated!

Development Tasks

  • Training Multiple Machine Learning Algorithms
  • Adjusting Criteria for Better Accuracy
  • Data Pre-Processing
  • Python Programming

Previous Approaches

An image from the slideshow, showing a summary of previous approaches to the data.

While some approaches had already been attempted, most were either unhelpful or inaccurate because the data had never been preprocessed.

The most important category, memorable characteristics, had sub-categories that most people working with the dataset ignored entirely.

Our Approach: Preprocessing

An image from the slideshow, showing the removal of some characteristics.

The memorable characteristics column contained more than 1,000 unique strings used to describe the chocolate bars.

By examining the elbow graph, we determined a frequency cutoff below which removing characteristics barely affected the data, while cutting the number of dimensions by more than a factor of five.
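
A minimal sketch of that pruning step in Python, assuming the reviews sit in a pandas DataFrame with a comma-separated memorable_characteristics column (the file name, column name, and cutoff of 20 occurrences are illustrative assumptions, not the project's exact values):

    from collections import Counter

    import pandas as pd

    # Hypothetical file and column names; the real dataset layout may differ.
    df = pd.read_csv("chocolate.csv")
    tag_lists = df["memorable_characteristics"].str.split(",").apply(
        lambda tags: [t.strip() for t in tags])

    # Count how often each individual characteristic appears across all bars.
    counts = Counter(tag for tags in tag_lists for tag in tags)

    # Keep only characteristics above an elbow-derived frequency cutoff
    # (the cutoff of 20 is an assumption for illustration).
    CUTOFF = 20
    frequent = sorted(tag for tag, n in counts.items() if n >= CUTOFF)

    # One-hot encode just the surviving characteristics, shrinking the
    # feature space while preserving the common descriptors.
    for tag in frequent:
        df["char_" + tag] = tag_lists.apply(lambda tags, t=tag: t in tags)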

The results of our preprocessing, showing the most frequently occurring characteristics.

Our Approach: ML Algorithms

An image from the slideshow, showing a heatmap for the Decision Tree approach.

Our initial ML algorithms were not very accurate: each of them was clearly just aiming for the largest bucket.

Since the vast majority of the ratings fell between 2.5 and 3.5, the algorithms simply guessed a rating in that range most of the time.
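
A majority-class baseline makes this failure mode easy to see; a minimal sketch with scikit-learn, reusing the hypothetical one-hot features from the earlier sketch and an assumed rating_bucket label column:

    from sklearn.dummy import DummyClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # X holds the one-hot characteristic columns; y holds the rating bucket
    # (both names are assumptions carried over from the sketch above).
    X = df[[c for c in df.columns if c.startswith("char_")]]
    y = df["rating_bucket"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Always predicting the most common bucket already scores deceptively well...
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    print("majority-guess accuracy:", baseline.score(X_test, y_test))

    # ...so a model that barely beats it has not actually learned much.
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print("decision tree accuracy:", tree.score(X_test, y_test))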

Our Approach: Changes

An image from the slideshow, showing changes made after our initial approaches.

We used K-means clustering to better categorize the data, grouping the bars by the similarity of their features.
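
A sketch of that step with scikit-learn's KMeans, reusing the hypothetical feature matrix X from the earlier sketches (the choice of five clusters and the rating column name are assumptions):

    from sklearn.cluster import KMeans

    # Group bars by the similarity of their one-hot characteristics
    # (the choice of five clusters is an illustrative assumption).
    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
    df["cluster"] = kmeans.fit_predict(X)

    # Inspect how the original ratings distribute across the discovered groups.
    print(df.groupby("cluster")["rating"].mean())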

Additionally, we reduced the number of buckets and weighted them so that the ‘guessing’ strategy would be less viable.
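
Both changes map onto standard scikit-learn tools; a sketch, with assumed bucket edges and class_weight="balanced" as one plausible way to realize the weighting:

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Collapse the fine-grained ratings into three coarser buckets
    # (these bucket edges are assumptions, not the project's exact values).
    df["rating_bucket"] = pd.cut(df["rating"], bins=[0, 2.75, 3.25, 5],
                                 labels=["low", "mid", "high"])

    # class_weight="balanced" penalizes mistakes on the rare buckets more
    # heavily, so always guessing the dominant middle bucket stops being
    # a winning strategy.
    X_train, X_test, y_train, y_test = train_test_split(
        X, df["rating_bucket"], random_state=0)
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    clf.fit(X_train, y_train)
    print("weighted accuracy:", clf.score(X_test, y_test))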

Our Approach: The Results

An image from the slideshow, showing a heatmap for the Logistic Regression approach.

The changes we made improved accuracy by about 14%.

Despite the data being difficult to work with, both because of the skewed distribution of the ratings and because of the sheer number of dimensions introduced by ‘memorable characteristics,’ we were able to greatly improve the accuracy of our ML algorithms.

Finally, after working with so many characteristics, we were able to find results that no other Kaggle user had found.

An image depicting the impact each characteristic had on the rating of chocolate bars.

It turns out that a chocolate bar being described as ‘very bitter’ lowers its rating by about 1.8 points on a 5-point scale. Fruity flavors, meanwhile, seem to be by far the most popular, raising the rating by 0.4 points on average.
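
One plausible way to derive such per-characteristic impacts (a sketch of the general technique, not necessarily the project's exact method) is to regress the rating on the one-hot characteristics and read each coefficient as that characteristic's average effect on the rating:

    from sklearn.linear_model import LinearRegression

    # Fit the raw rating directly from the one-hot characteristics; each
    # coefficient then estimates how much that characteristic shifts the
    # rating, holding the others fixed.
    reg = LinearRegression().fit(X, df["rating"])

    # Show the five most negative and five most positive characteristics.
    impacts = sorted(zip(X.columns, reg.coef_), key=lambda pair: pair[1])
    for name, coef in impacts[:5] + impacts[-5:]:
        print(f"{name}: {coef:+.2f}")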