
MSBD 5014 IP Project Repository


This repository contains the source code for the MSBD 5014 IP project, along with various materials I collected while learning Flink over the semester.

Table of Contents

  • Data Generation
  • Environment Setup
  • Project Structure
  • Usage
  • Algorithm
  • References
  • Contributors

Data Generation

The dataset is generated with the TPC-H DBGEN tool, which is included in the root directory of this repository. You can refer to the detailed documentation here.

You can also find a useful guide on GitHub: TPCH DBGEN Guide.

Important Parameters:

  • -s <scale>: Scale of the database population. Scale 1.0 represents ~1 GB of data.
  • -T <table>: Generate data for a particular table ONLY. Arguments:
    • p -- part/partsupp
    • c -- customer
    • s -- supplier
    • o -- orders/lineitem
    • n -- nation
    • r -- region
    • l -- code (same as n and r)
    • O -- orders
    • L -- lineitem
    • P -- part
    • S -- partsupp
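
For example, running ./dbgen -s 1 generates the full dataset at scale factor 1 (roughly 1 GB), while ./dbgen -s 1 -T L generates only the lineitem table (this assumes dbgen has already been built in the root directory, as described in the documentation linked above).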

Note for macOS Users:

If you encounter errors such as:

bm_utils.c:71:10: fatal error: 'malloc.h' file not found
varsub.c:44:10: fatal error: 'malloc.h' file not found

Solution: Open bm_utils.c and varsub.c, locate the include directive #include <malloc.h>, and change it to #include <sys/malloc.h>.

Environment Setup

To run this project, you need to have Maven and MySQL installed on your system.

Project Structure

  • tpc_query: Contains the source code for the TPC-H queries.
  • learning: Contains various materials and resources collected during the semester.

Usage

Running the Project

  1. Navigate to the tpc_query directory.
  2. Configure the build environment via pom.xml; Maven resolves the project dependencies declared there.
  3. Run Main.java to simulate TPC-H Query 7.

Query Execution

You can choose between querying directly in MySQL and testing the AJU algorithm implemented with Flink by selecting the corresponding sink:

// Select the output sink: "MySQL" writes the results to MySQL so they can be
// queried there directly; any other value keeps them in the in-memory sink
// used to test the Flink-based AJU implementation.
String choice = "Memory";
if (choice.equals("MySQL")) {
    MySQLSink mySQLSink = new MySQLSink();
    dataSource.addSink(mySQLSink);
} else {
    MemorySink memorySink = new MemorySink();
    dataSource.addSink(memorySink);
}

Currently, Q7 and Q5 are supported. To try other queries, you can add them in the /tpc_query/src/main/java/tpc_query/Query directory following the existing format.

Algorithm

This section provides an overview of the four main algorithms implemented in this project. Each algorithm has an associated image to illustrate the process.

Insert Algorithm

This algorithm handles the insertion of new tuples into the database while maintaining the acyclic foreign-key join structure.
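
As a rough, self-contained sketch of the bookkeeping involved (plain Java, not the project's actual classes; the single orders/lineitem foreign key and the counter scheme are simplifying assumptions made for brevity), an orders tuple can be treated as live, and emitted to the result, once at least one lineitem tuple referencing it has been inserted:

import java.util.HashMap;
import java.util.Map;

// A minimal illustration (not the project's actual code) of counter-based
// insert maintenance: an orders tuple is "live" once at least one lineitem
// tuple referencing it has arrived, and only live tuples are emitted.
public class InsertSketch {

    // orderKey -> number of lineitem tuples seen so far that reference it
    private static final Map<Long, Integer> lineitemCount = new HashMap<>();
    // orderKey -> payload of the orders tuple (a plain String here for brevity)
    private static final Map<Long, String> orders = new HashMap<>();

    static void insertOrder(long orderKey, String payload) {
        orders.put(orderKey, payload);
        // If matching lineitems already arrived, the new order is live immediately.
        if (lineitemCount.getOrDefault(orderKey, 0) > 0) {
            emit(orderKey);
        }
    }

    static void insertLineitem(long orderKey) {
        int count = lineitemCount.merge(orderKey, 1, Integer::sum);
        // Liveness of the referenced order changes only on the 0 -> 1 transition.
        if (count == 1 && orders.containsKey(orderKey)) {
            emit(orderKey);
        }
    }

    private static void emit(long orderKey) {
        System.out.println("order " + orderKey + " became live: " + orders.get(orderKey));
    }

    public static void main(String[] args) {
        insertLineitem(1L);           // child arrives first: counted, nothing emitted yet
        insertOrder(1L, "order #1");  // parent arrives with a match already present -> emitted
        insertOrder(2L, "order #2");  // parent arrives first: not live yet
        insertLineitem(2L);           // 0 -> 1 transition makes order #2 live -> emitted
    }
}

In the full algorithm (see the references below), this kind of bookkeeping is maintained along every foreign-key edge of the acyclic join tree rather than for a single pair of tables.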

Insert-Update Algorithm

This algorithm deals with the insertion and updating of tuples simultaneously. It ensures that the database remains consistent and acyclic.

Delete Algorithm

This algorithm manages the deletion of tuples from the database, ensuring that the foreign-key constraints and the acyclic nature of the joins are preserved.
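
Again as a rough illustration only (same simplifying assumptions as the insert sketch above, not the project's actual code), deleting the last lineitem tuple that references an order makes that order non-live, at which point a retraction is issued:

import java.util.HashMap;
import java.util.Map;

// A minimal illustration (not the project's actual code) of the corresponding
// bookkeeping on deletion: removing the last lineitem referencing an order
// makes that order non-live again, so a retraction is issued for it.
public class DeleteSketch {

    // orderKey -> number of lineitem tuples currently referencing it
    private static final Map<Long, Integer> lineitemCount = new HashMap<>();

    static void insertLineitem(long orderKey) {
        lineitemCount.merge(orderKey, 1, Integer::sum);
    }

    static void deleteLineitem(long orderKey) {
        int remaining = lineitemCount.merge(orderKey, -1, Integer::sum);
        // Liveness is lost only on the 1 -> 0 transition; issue a retraction then.
        if (remaining == 0) {
            lineitemCount.remove(orderKey);
            System.out.println("retract order " + orderKey + " from the join result");
        }
    }

    public static void main(String[] args) {
        insertLineitem(1L);
        insertLineitem(1L);
        deleteLineitem(1L); // one match still remains, order 1 stays live
        deleteLineitem(1L); // last match gone -> retraction for order 1
    }
}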

Delete-Update Algorithm

This algorithm handles the deletion and updating of tuples in tandem, maintaining the integrity and acyclic structure of the foreign-key joins.

References

  1. Maintaining Acyclic Foreign-Key Joins under Updates
  2. Cquirrel: Continuous Query Processing over Acyclic Relational Schemas
  3. Flink: Apache Flink is an open-source stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications.

Feel free to explore the repository and use the provided resources. If you have any questions or need further assistance, please contact me.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

  • Michael (💻 code)

This project follows the all-contributors specification. Contributions of any kind welcome!
