This project focuses on developing an algorithm that automates the accurate labeling of product data. The aim is to address the challenge faced by the hospitality industry, where the data collected from venues like restaurants and pubs, such as menus and customer orders, lacks proper labeling and annotations. By leveraging the power of Machine Learning (ML), the project seeks to provide standardized annotations that can be used across the business for data analysis and insights. Two Naive Bases model were implemented to categorize products into different categories and sub-categories, enabling the creation of a recommender engine based on customers' historical purchase behavior. By accurately labeling products and linking them to customer orders, the models offers the potential for customer segmentation and the generation of insights about their preferences.
To ensure an unbiased classifier, the project seeks publicly available product names from a wide range of sources worldwide. The existing dataset is insufficient and unbalanced, prompting the use of web scraping techniques. The algorithm employs the BeautifulSoup package in Python to scrape product names for various categories from websites such as Wikipedia, BBC Good Food, and Taste Recipes. This approach enables the expansion of the dataset and provides a diverse range of labeled product names for training the model.
Before starting to build our classifier, we need to make sure that we have the right quality and amount of data to train our model. In this step, we are going to webscrape different type of Food and Drinks from several websites.
Next, we will use a heuristic approach to label part of our trainning data before feeding it to the model.
Once the data is ready and preprocessed, we will create a pipeline that vectorise our text data and then send it for training. We also propose a grid search approach to obtain sub-optimal hyperparameters. These models will then be evaluated and modified for higher accuracy.
Once the models have yielded to the desired accuracy, we will deploy the model to our Snowflake data warehouse so that every time new products are added to the database, they are directly updated with their corresponding predictions.
- Data Processing & Manipulation: Numpy, Pandas
- Static and Dynamic Webscrapping: BeautifulSoup, selenium
- ML model: sklearn
- Deploying to Snowflake: snowflake.connector
Both models yields to a >90% accuracy, reducing the cost of manually labelling text data (product names) while providing a model based on probabilities and historical data instead of heuristic approaches.