Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge develop in main #15

Merged
merged 18 commits into from
Nov 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions .github/workflows/maven.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
run-name: Build the Zoomer by @${{ github.actor }}
on: push

jobs:
build:
runs-on: ${{ matrix.os }}
strategy:
matrix:
os: [ubuntu-latest, macos-latest]
steps:
- uses: actions/checkout@v4
- name: Set up JDK 1.8
uses: actions/setup-java@v4
with:
java-version: '8'
distribution: 'zulu'
- name: Build with Maven
run: |
mvn clean install -DskipTests
133 changes: 73 additions & 60 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
[![Apache License, Version 2.0, January 2004](https://img.shields.io/github/license/apache/maven.svg?label=License)](https://www.apache.org/licenses/LICENSE-2.0)

![example workflow](https://github.com/dbs-leipzig/graph-stream-zoomer/actions/workflows/maven.yml/badge.svg)
[![Apache Flink](https://img.shields.io/badge/Apache%20Flink-E6526F?style=for-the-badge&logo=Apache%20Flink&logoColor=white)](https://flink.apache.org)
# Graph Stream Zoomer
![GraphStreamZoomer_Logo](logo/GraphStreamZoomerLogo_small.png)

Expand All @@ -18,64 +19,6 @@ aggregate functions and a window size. Through the closed operator concept, the
graph stream consisting of summarized vertices and edges. The system is based on Apache Flink® and its
Table API, SQL API and DataStream API, thus providing a distributed execution of the summarization.

### Windowing
*Graph Stream Zoomer* groups the graph using a windowing approach. The user can specify the size of the
window by a `WindowConfig`. Currently, there are just tumbling windows supported, but sliding windows are
planned for the near future.

Example window definition for a 10 seconds tumbling window:

`groupingBuilder.setWindowSize(10, WindowConfig.TimeUnit.SECONDS);`

### Grouping Keys
Vertices as well as edges will be grouped by common characteristics, which we call grouping keys. These
characteristics can be zero, one or multiple of the following:
* Label - groups all vertices/edges sharing the same type label
* Property value (by name) - groups all vertices/edges that contain a property with the specified name and
an equal value. All vertices/edges _without_ a property of this name are grouped as one group. The super
vertex or super edge (the resulting group representative) contains the property and (1) the respective
value or (2) null, for the group that do not have this property
* Time - groups all vertices/edges with a timestamp in the same window -> see _Windowing_ above
* User-defined grouping key - tbd.

### Aggregate functions
Vertices and edges are grouped according to the selected grouping keys. The content of all vertices/edges
that are grouped together can be used to calculate aggregates that will be part of the super vertex /
super edge (the resulting group representative).
* Count - just counts the number of elements that were merged to a group and stores them in a new property
with name `count`
* MinProperty - calculates the minimum value of a given property and stores it to a new property called
`min_{name}`. Just works with numerical property values.
* MaxProperty - calculates the maximum value of a given property and stores it to a new property called
`max_{name}`. Just works with numerical property values.
* AvgProperty - calculates the average value of a given property and stores it to a new property called
`avg_{name}`. Just works with numerical property values.
* User-defined aggregate function - tbd.

## Graph Stream Data Model
The graph stream data model of *Graph Stream Zoomer* is defined as follows.

`DataStream<StreamTriple>` -> the Flink representation of a graph stream

`StreamTriple`
* `StreamVertex`
* `StreamEdge`
* `StreamVertex`

`StreamVertex`
* `id (String)`
* `label (String)`
* `properties (Properties)`
* `event_time (Timestamp)`

`StreamEdge`
* `id (String)`
* `label (String)`
* `properties (Properties)`
* `source_id (String)`
* `target_id (String)`
* `event_time (Timestamp)`

## Usage

### As an own project
Expand Down Expand Up @@ -124,7 +67,77 @@ env.execute();
* LocalExample -> loads a tiny stream from a collection
* TwitterExample -> loads a live twitter message stream (credentials required)
* CitiBikeExample -> uses citibike rental data to create a graph stream
* SocketExample -> loads a graph stream from a socket connection
* RandomGeneratorExample -> loads a random generated graph stream with configurable frequency

## Execution on Apache Flink Cluster

The power of the zoomer relies on the distributed processing coming with Apache Flink. To execute the
zoomer on an Apache Flink cluster, you have to follow three easy steps:

1. set the Flink dependencies in the `pom.xml` to scope `provided` by changing the property from `<flink.scope>compile</flink.scope>` to `<flink.scope>provided</flink.scope>`
2. run `mvn clean package` to build the project and create the file `target/graph-stream-grouping-0.1-SNAPSHOT.jar`
3. on the running flink cluster, deploy your job (e.g. the `RandomGeneratorExample`) via
`bin/flink run -c edu.dbsleipzig.stream.grouping.application.RandomGeneratorExample target/graph-stream-grouping-0.1-SNAPSHOT.jar 10 1000` (10s tumbling window with 1000 elements/sec random input)

## Details

### Windowing
*Graph Stream Zoomer* groups the graph using a windowing approach. The user can specify the size of the
window by a `WindowConfig`. Currently, there are just tumbling windows supported, but sliding windows are
planned for the near future.

Example window definition for a 10 seconds tumbling window:

`groupingBuilder.setWindowSize(10, WindowConfig.TimeUnit.SECONDS);`

### Grouping Keys
Vertices as well as edges will be grouped by common characteristics, which we call grouping keys. These
characteristics can be zero, one or multiple of the following:
* Label - groups all vertices/edges sharing the same type label
* Property value (by name) - groups all vertices/edges that contain a property with the specified name and
an equal value. All vertices/edges _without_ a property of this name are grouped as one group. The super
vertex or super edge (the resulting group representative) contains the property and (1) the respective
value or (2) null, for the group that do not have this property
* Time - groups all vertices/edges with a timestamp in the same window -> see _Windowing_ above
* User-defined grouping key - tbd.

### Aggregate functions
Vertices and edges are grouped according to the selected grouping keys. The content of all vertices/edges
that are grouped together can be used to calculate aggregates that will be part of the super vertex /
super edge (the resulting group representative).
* Count - just counts the number of elements that were merged to a group and stores them in a new property
with name `count`
* MinProperty - calculates the minimum value of a given property and stores it to a new property called
`min_{name}`. Just works with numerical property values.
* MaxProperty - calculates the maximum value of a given property and stores it to a new property called
`max_{name}`. Just works with numerical property values.
* AvgProperty - calculates the average value of a given property and stores it to a new property called
`avg_{name}`. Just works with numerical property values.
* User-defined aggregate function - tbd.

## Graph Stream Data Model
The graph stream data model of *Graph Stream Zoomer* is defined as follows.

`DataStream<StreamTriple>` -> the Flink representation of a graph stream

`StreamTriple`
* `StreamVertex`
* `StreamEdge`
* `StreamVertex`

`StreamVertex`
* `id (String)`
* `label (String)`
* `properties (Properties)`
* `event_time (Timestamp)`

`StreamEdge`
* `id (String)`
* `label (String)`
* `properties (Properties)`
* `source_id (String)`
* `target_id (String)`
* `event_time (Timestamp)`

## Credits
This project has its base in two master thesis. It contains main ideas and code fragments from E. Saalmann
Expand Down
Binary file modified logo/GraphStreamZoomerLogo_small.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading