SQL Queries Optimization with Machine Learning

Presented by :

Anas ZENAGUI

Sohaib MANAH Ismail OUKHA

Supervised by :

Mohamed BEN AHMED

Software Engineering and Smart Intelligent (LSI) Faculty of Sciences and Technologies of Tangier (FSTT) 19 février 2024

We would like to express our sincere gratitude to all individuals who have contributed directly or indirectly to the completion of this project on Oracle SQL query optimization using machine learning.

First and foremost, we extend our heartfelt thanks to our team members, Anas Zena- gui, Ismail Oukha, and Sohaib Manah. Their dedication, collaboration, and commitment have been crucial throughout this project. Each has brought unique expertise, contributing significantly to the success of this ambitious undertaking.

We also express our appreciation to our supervising professor, Mohamed Ben Ahmed, for his enlightened guidance, valuable advice, and unwavering support. His profound knowledge has been a source of inspiration and has greatly enriched our academic experience.

Our thanks also go to all individuals involved in the process, including members of the teaching team, collaborators, or anyone who played a role in the realization of this project.

Finally, our families and friends deserve special acknowledgment for their unconditional support and understanding throughout this project.

Thank you all for your valuable contributions.

Acknowledgments	ii
table of figures	v
Table of content	vi
Introduction	1
1 Literature Review	2
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	2
1.2 Traditional Approaches to SQL Query Optimization . . . . . . . . . . . . . .	2
1.3 Challenges in Traditional Approaches . . . . . . . . . . . . . . . . . . . . . .	2
1.4 Introduction of Machine Learning in Query Optimization . . . . . . . . . . .	3
1.5 Natural Language Processing (NLP) in Query Optimization . . . . . . . . .	3
1.6 Existing Language Models for Query Optimization . . . . . . . . . . . . . .	3
1.7 Integration of Machine Learning and Query Optimization . . . . . . . . . . .	3
1.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	3
2 Methodology	4
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	4
2.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	4
2.2.1 Query Dataset Preprocessing . . . . . . . . . . . . . . . . . . . . . .	4
2.3 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	5
2.3.1 Embedding Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	5
2.3.2 Encoder-Decoder Architecture . . . . . . . . . . . . . . . . . . . . . .	5
2.4 Training Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	5
2.4.1 Hyperparameter Tuning . . . . . . . . . . . . . . . . . . . . . . . . .	5
2.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	5
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	6
3 Implementation	7
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	7
3.2 Setup and Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . .	7
3.3 Fine-Tuning the LLaMA Model . . . . . . . . . . . . . . . . . . . . . . . . .	9

TABLE DES MATIÈRES iv

3.4 Training Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3.5 Results and Model Deployment . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.6 Usage Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

Optimization Strategies - Dataset Examples 10
1. Example 1 : Employees Query 10
2. Example 2 : Orders Query 10

Results and Discussion 12

LSI2, 2024

3.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

User Authentication and Database Connection 12
User Authentication and Database Connection 13
Inserting and Validating SQL Query Syntax 13
Executing the SQL Query 14
Obtaining Execution Plan 14
Optimizing SQL Query with LLaMA Model 15

The ever-growing volume of data in today’s digital landscape has led to an increasing de- mand for efficient and optimized database queries. Recognizing the significance of enhancing the performance of SQL queries, this project delves into the realm of query optimization using Machine Learning.

The objective of this endeavor is to employ advanced techniques in Natural Language Processing (NLP) and machine learning to fine-tune SQL queries. Leveraging the power of a Language Model-based Machine Learning Approach (LLaMA), we aim to optimize SQL queries for improved execution and response times.

Throughout the project, SQL queries serve as the focal point, representing real-world scenarios where optimization can significantly impact overall system performance. The uti- lization of NextJS for the frontend interface and Flask for the backend ensures a seamless integration of the machine learning model into practical applications.

In the realm of machine learning, we leverage a variety of tools such as Jupyter Notebook for exploratory data analysis and model development. Docker containers facilitate a scalable and reproducible environment, and Airflow ensures streamlined workflow orchestration. The incorporation of Data Version Control (DVC) enhances collaboration and facilitates efficient versioning of datasets and models.

This report provides a comprehensive overview of the methodology employed, tools uti- lized, and the fine-tuning process applied to optimize SQL queries. It aims to showcase the synergy between machine learning, web development, and database optimization to address the challenges posed by complex and resource-intensive queries.

Keywords : SQL queries, Machine Learning, NLP, LLaMA, NextJS, Flask, Jupyter No- tebook, Docker, Airflow, DVC.

chapitre 1

Literature Review

Introduction

The optimization of SQL queries has been a focal point in the realm of database ma- nagement systems, aiming to enhance the efficiency and performance of query execution. This chapter delves into an extensive review of the existing literature on SQL query optimi- zation, shedding light on traditional methodologies and recent advancements, particularly the integration of Machine Learning (ML) techniques. The objective is to provide a robust foundation for our proposed Language Model-based Machine Learning Approach (LLaMA) to query optimization.

Traditional Approaches to SQL Query Optimization

The historical landscape of query optimization predominantly features rule-based and heuristic approaches. In relational database management systems (RDBMS), these methods leverage predefined rules, indices, and query plans to streamline the execution of SQL queries. Cost-based optimization, a cornerstone of traditional approaches, involves estimating the cost of various execution plans and selecting the one deemed most efficient. Algorithms like dynamic programming and greedy algorithms have been pivotal in navigating the expansive search space for optimal query plans.

Challenges in Traditional Approaches

While traditional approaches have proven effective in many scenarios, they grapple with challenges in handling dynamic and complex queries. Adapting to the evolving nature of modern databases and managing the escalating volume of data poses significant hurdles for rule-based systems. Moreover, reliance on handcrafted rules limits adaptability to diverse query patterns and evolving database schemas.

CHAPITRE 1. LITERATURE REVIEW 3

Introduction of Machine Learning in Query Optimi- zation

The integration of Machine Learning marks a paradigm shift in addressing the limitations of traditional query optimization techniques. Researchers have explored ML algorithms to automate the optimization process. Notably, ML models predict the most efficient query plan based on historical data, query statistics, and database schema information. Reinforcement learning and decision tree-based models have emerged as promising candidates for learning optimal execution plans.

Natural Language Processing (NLP) in Query Opti- mization

Recent advancements in Natural Language Processing (NLP) have further enriched the landscape of query optimization. Integrating NLP techniques introduces a more natural interaction with databases, allowing users to input queries in a conversational manner. Em- bedding language understanding capabilities enhances the system’s ability to interpret user intent, contributing to the optimization process.

Existing Language Models for Query Optimization

The exploration of language models within the context of SQL query optimization has witnessed substantial progress. Techniques like transfer learning from pre-trained language models, including BERT and GPT, demonstrate an understanding and generation of SQL queries. These models capture query semantics, enabling them to suggest optimized versions based on context and user requirements.

Integration of Machine Learning and Query Optimi- zation

The integration of ML techniques into query optimization signifies a substantial leap forward. By combining statistical analysis, historical query performance data, and semantic understanding, ML models can adapt to diverse query workloads. However, challenges per- sist, such as ensuring interpretability of ML-generated query plans and maintaining robust performance across various database scenarios.

Conclusion

This exhaustive literature review provides a comprehensive overview of the historical evo- lution of SQL query optimization and the recent infusion of Machine Learning techniques. The exploration of traditional approaches, challenges, and the emergence of ML-based mo- dels sets the stage for the subsequent development and evaluation of our proposed LLaMA approach in the upcoming chapters.

LSI2, 2024

chapitre 2

Methodology

Introduction

Chapter 3 details the methodology employed in the development and evaluation of the Language Model-based Machine Learning Approach (LLaMA) for SQL query optimization. This chapter provides a comprehensive overview of the data collection process, model archi- tecture, training strategy, and evaluation metrics.

Data Collection

The foundation of our LLaMA approach lies in a diverse and representative dataset of SQL queries. We collected queries from various sources, including open-source databases, real-world applications, and synthetic datasets. This multi-source approach ensures that the model generalizes well across different query patterns and complexities. The dataset includes both raw SQL queries and their corresponding optimized versions, forming the basis for supervised learning.

Query Dataset Preprocessing

Before feeding the data into the model, we preprocess the query dataset. The preproces- sing steps involve tokenization, where each query is converted into a sequence of tokens. The tokenized queries are then padded to a consistent length to facilitate uniform input for the model. Additionally, we use a tokenizer to convert tokens into numerical indices, allowing the model to process the data efficiently.

4

Model Architecture

The architecture of our LLaMA model is designed as a sequence-to-sequence (Seq2Seq) model, a popular choice for natural language processing tasks. The model consists of an encoder and a decoder, each equipped with Long Short-Term Memory (LSTM) units. The encoder processes the input query, while the decoder generates the optimized query. Embed- ding layers facilitate the transformation of tokens into dense vectors, capturing the semantic relationships between words.

Embedding Layer

The embedding layer is a crucial component of the model, responsible for transforming discrete tokens into continuous vectors. We experimented with different embedding dimen- sions to find the optimal representation for our dataset. The embedding layer aids in preser- ving the contextual information and semantic relationships within the queries.

Encoder-Decoder Architecture

The encoder processes the tokenized input query and extracts meaningful features, en- coding them into a fixed-size context vector. This context vector serves as the initial state for the decoder. The decoder, in turn, utilizes this context vector to generate the optimized query one token at a time. Attention mechanisms are incorporated to allow the model to focus on specific parts of the input during the decoding process.

Training Strategy

Training the LLaMA model involves optimizing its parameters to minimize the difference between predicted and actual optimized queries. We employ a supervised learning approach, where the model learns from the paired input-output query examples. The loss function used is the sparse categorical cross-entropy, suitable for sequence generation tasks.

Hyperparameter Tuning

Fine-tuning the model involves adjusting hyperparameters to enhance performance. We explore variations in learning rates, batch sizes, and embedding dimensions to find the op- timal configuration. The model is trained iteratively, and its performance is monitored on a validation set to prevent overfitting.

Evaluation Metrics

To assess the effectiveness of our LLaMA approach, we employ multiple evaluation me- trics. These include query accuracy, BLEU score, and execution time. Query accuracy mea- sures how closely the generated optimized queries match the ground truth. BLEU score assesses the similarity between the predicted and actual queries, considering n-gram overlap. Execution time evaluates the efficiency of the model in producing optimized queries.

Conclusion

This chapter outlines the comprehensive methodology adopted for the development and evaluation of our LLaMA approach. From data collection and preprocessing to model ar- chitecture, training strategy, and evaluation metrics, each aspect is meticulously detailed. The subsequent chapter will delve into the experimental results and discussions, providing insights into the performance and applicability of our proposed approach.

chapitre 3

Implementation

Introduction

Chapter 4 outlines the practical implementation of the Language Model-based Machine Learning Approach (LLaMA) for SQL query optimization. The implementation involves utilizing the fine-tuned LLaMA model to provide optimized versions of SQL queries. This chapter covers the codebase, tools, and techniques employed to integrate LLaMA into a practical solution.

Setup and Environment

The implementation utilizes several Python libraries and tools, including the Hugging Face Transformers library for natural language processing tasks and the Datasets library for managing and processing datasets. The code is developed and executed in a Jupyter Notebook environment. The required dependencies are installed using the pip package ma- nager. Additionally, specialized tools such as Accelerate, PEFT, BitsAndBytes, and TRL are incorporated for efficient execution.

“ ‘latex

7

Figure 3.1 – Model Architecture

Listing 3.1 – Setup and Environment

Fine-Tuning the LLaMA Model

The fine-tuning process involves loading the base LLaMA model, specifying quantization configurations using the BitsAndBytes library, and defining training parameters. The model is then fine-tuned on a custom dataset of SQL queries, ensuring that it learns to generate optimized versions.

Listing 3.2 – LLaMA Model Usage Example

Training Parameters

Various training parameters are configured to optimize the learning process. These include the number of training epochs, batch size, learning rate, and optimization algorithm. The training itself is executed using the SFTTrainer class from the TRL library. Essential metrics, such as loss and accuracy, are continuously monitored during the training phase.

Results and Model Deployment

Upon successful completion of the training, the trained LLaMA model is saved for fu- ture use. The associated tokenizer is also preserved for consistency. The implementation incorporates the TensorBoard utility to visualize training metrics and monitor the model’s performance. Furthermore, the chapter includes code snippets that demonstrate how to in- teract with the model using a prompt and SQL query.

Listing 3.3 – Model Tokenizer

Usage Example

The final section of this chapter demonstrates the usage of the implemented LLaMA model. A sample prompt and SQL query are provided, showcasing how the model generates optimized queries. The resulting text is formatted as code, maintaining the structure of the SQL query and the optimized version.

Listing 3.4 – LLaMA Model Usage Example

Optimization Strategies - Dataset Examples

To illustrate the impact of the LLaMA model, we present examples from the dataset along with optimization strategies and the resulting optimized queries.

Example 1 : Employees Query

— Original Query :

Optimization Strategies :
1. Use an index on the department column.
2. Use a more efficient comparison operator.
3. Utilize a subquery.
Optimized Query :

Example 2 : Orders Query

— Original Query :

Optimization Strategies :
1. Use an index on the order_status and customer_id columns.
2. Avoid using SELECT * if not all columns are needed.
3. Ensure proper indexing of order_status and customer_id columns in the orders table.
Optimized Query :

This dataset showcases the effectiveness of LLaMA in generating optimized queries based on specific optimization strategies.

This comprehensive chapter provides detailed insights into the practical implementation of the LLaMA model, covering various aspects such as setup, fine-tuning, training parame- ters, results, model deployment, and practical usage examples. The subsequent chapters will delve deeper into experimental results, analysis, and potential extensions of the implemented LLaMA model.

chapitre 4

Results

and

Discussion

This chapter presents the results obtained from the bettersql project, showcasing the user interactions and the overall performance of the application.

Figure 4.1 – User Authentication and Database Connection

User Authentication and Database Connection

The initial step involves user authentication and establishing a connection to the data- base. Users input their database credentials through a secure authentication process, ensuring a seamless connection.

12

Figure 4.2 – User Authentication and Database Connection

Inserting and Validating SQL Query Syntax

After successfully authenticating, users can insert SQL queries into the system. The application provides a syntax validation feature to ensure that the entered queries are correct and conform to the SQL syntax.

Figure 4.3 – Inserting and Validating SQL Query Syntax

Executing the SQL Query

Upon validation, users can choose to execute the SQL query against the connected data- base. The application sends the query to the database, and the corresponding operation is executed.

Figure 4.4 – Executing the SQL Query

Obtaining Execution Plan

For transparency and analysis purposes, users can retrieve the execution plan of the executed query. This step provides insights into how the database engine processes the query.

Figure 4.5 – Obtaining Execution Plan

Optimizing SQL Query with LLaMA Model

The final step involves optimizing a query using the Language Model-based Machine Learning Approach (LLaMA). Users can choose to optimize their queries, and the application generates an optimized version using the trained LLaMA model.

Figure 4.6 – Optimizing SQL Query with LLaMA Model

Discussion and Analysis

The discussion section analyzes the effectiveness of the LLaMA model in optimizing SQL queries. It considers factors such as query complexity, execution times, and user experience. The screenshots provide a visual representation of the user-friendly interface and the seamless integration of machine learning into the SQL optimization process.

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
.dvc		.dvc
Backend		Backend
frontend		frontend
media		media
.dvcignore		.dvcignore
README.md		README.md
package-lock.json		package-lock.json
package.json		package.json

itsmeismaill/Query-Optimization

Folders and files

Latest commit

History

Repository files navigation

SQL Queries Optimization with Machine Learning

Supervised by :

Literature Review

Introduction

Traditional Approaches to SQL Query Optimization

Challenges in Traditional Approaches

Introduction of Machine Learning in Query Optimi- zation

Natural Language Processing (NLP) in Query Opti- mization

Existing Language Models for Query Optimization

Integration of Machine Learning and Query Optimi- zation

Conclusion

Methodology

Introduction

Data Collection

Query Dataset Preprocessing

Model Architecture

Embedding Layer

Encoder-Decoder Architecture

Training Strategy

Hyperparameter Tuning

Evaluation Metrics

Conclusion

Implementation

Introduction

Setup and Environment

Fine-Tuning the LLaMA Model

Training Parameters

Results and Model Deployment

Usage Example

Optimization Strategies - Dataset Examples

Example 1 : Employees Query

Example 2 : Orders Query

Results

Discussion

User Authentication and Database Connection

Inserting and Validating SQL Query Syntax

Executing the SQL Query

Obtaining Execution Plan

Optimizing SQL Query with LLaMA Model

Discussion and Analysis

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages