# Find and retrieve similar or identical images
This is a project I completed during the Insight Data Engineering program. Visit www.data-engineering.xyz to upload a picture and find out whether a similar picture exists in the database (the images are sourced from Image-net.org).
This project builds a pipeline that lets users generate a database of existing pictures and query whether a newly uploaded picture (or a similar one) can be found in that database.
Many companies own proprietary pictures (e.g., company logos) that might be used by other organizations without permission. This pipeline employs both a Vantage Point tree approach (pixel-based comparison) and a Deep Ranking model (CNN, https://arxiv.org/abs/1404.4661) to make the search faster and more accurate.
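For intuition, here is a minimal sketch of the two similarity signals, assuming the Pillow, imagehash, and numpy packages; the file names, thresholds, and embedding sizes are illustrative, not taken from this repository:

```python
# Illustrative comparison of the two similarity signals; file names,
# thresholds, and embedding sizes below are assumptions.
import imagehash
import numpy as np
from PIL import Image

img_a = Image.open("logo_original.png")  # hypothetical example files
img_b = Image.open("logo_suspect.png")

# Pixel-based signal (VP-tree side): perceptual hashes compared by Hamming
# distance. For a 64-bit pHash, a distance under ~10 suggests near-duplicates.
hamming = imagehash.phash(img_a) - imagehash.phash(img_b)

# Learned signal (Deep Ranking side): distance between CNN embeddings.
def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

emb_a = np.random.rand(128)  # stand-ins for model_w_weight.h5 outputs
emb_b = np.random.rand(128)
print(hamming, cosine_distance(emb_a, emb_b))
```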
The following two pictures show what it looks like when a picture is uploaded on www.data-engineering.xyz and a similar picture is found.
Batch Job: 100 million pictures (100 GB) from Image-net.org are ingested from an S3 bucket into Spark, which will:
(1) generate VP-trees (stored in pickle format) and a hash table (stored in txt format);
(2) use the trained model "model_w_weight.h5" to create feature-vector tables and save the results into the PostgreSQL database (see the sketch after this list).
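A condensed, single-machine sketch of these two steps (the real job distributes the work with Spark; the bucket name, table layout, and the choice of the `vptree` and `imagehash` packages are assumptions):

```python
# Single-machine sketch of the batch job; the real pipeline runs under Spark.
# Bucket name, table layout, and helper libraries are assumptions.
import io
import pickle

import boto3
import imagehash
import numpy as np
import psycopg2
import vptree  # pip install vptree
from PIL import Image
from tensorflow.keras.models import load_model

BUCKET = "my-image-bucket"  # replace with the bucket from config/s3config.ini
s3 = boto3.client("s3")

def fetch(key, size=(224, 224)):
    """Download one image from S3, normalized for hashing and embedding."""
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
    return Image.open(io.BytesIO(body)).convert("RGB").resize(size)

def hash_distance(a, b):
    """Hamming distance between (key, phash) pairs; kept as a named top-level
    function so the pickled VP-tree can be unpickled at query time."""
    return a[1] - b[1]

keys = [o["Key"] for o in s3.list_objects_v2(Bucket=BUCKET)["Contents"]]
images = {k: fetch(k) for k in keys}

# (1) Pixel-based index: VP-tree over perceptual hashes plus a txt hash table.
hashes = {k: imagehash.phash(img) for k, img in images.items()}
tree = vptree.VPTree(list(hashes.items()), hash_distance)
with open("vptree.pickle", "wb") as f:
    pickle.dump(tree, f)
with open("hash_table.txt", "w") as f:
    f.writelines(f"{k}\t{h}\n" for k, h in hashes.items())

# (2) CNN index: embed every image and store the vectors in PostgreSQL.
model = load_model("model_w_weight.h5")
batch = np.stack([np.asarray(im, dtype="float32") / 255.0 for im in images.values()])
vectors = model.predict(batch)
conn = psycopg2.connect(host="<postgres-node>", dbname="images", user="postgres")
with conn, conn.cursor() as cur:
    for key, vec in zip(images, vectors):
        cur.execute("INSERT INTO image_vectors (key, vec) VALUES (%s, %s)",
                    (key, vec.tolist()))
```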
Query: a user uploads an image, which will:
(1) compute the image's perceptual hash and search the VP-trees / hash table generated by the batch job;
(2) use the trained model "model_w_weight.h5" to compute the image's feature vector and query the PostgreSQL database for similar images (see the sketch after this list).
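A corresponding sketch of the query path (again, helper and file names are placeholders; the distance function must be importable wherever the pickled tree is loaded):

```python
# Sketch of the query path; file and helper names are assumptions.
import pickle

import imagehash
import numpy as np
from PIL import Image
from tensorflow.keras.models import load_model

def hash_distance(a, b):
    """Same named function the batch job pickled the VP-tree with."""
    return a[1] - b[1]

# (1) Fast pixel-based lookup: perceptual-hash the upload, search the VP-tree.
with open("vptree.pickle", "rb") as f:
    tree = pickle.load(f)
upload = Image.open("upload.jpg").convert("RGB").resize((224, 224))
dist, (match_key, _) = tree.get_nearest_neighbor(("query", imagehash.phash(upload)))
print(f"closest by pHash: {match_key} (Hamming distance {dist})")

# (2) Refinement: embed the upload with the trained model; in the pipeline the
#     resulting vector is compared against those stored in PostgreSQL.
model = load_model("model_w_weight.h5")
vec = model.predict(np.asarray(upload, dtype="float32")[None] / 255.0)[0]
```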
- Image-net.org: create an account and request access to the image database using a non-commercial email address.
Install and configure the AWS CLI and Pegasus on your local machine, and clone this repository using `git clone https://github.com/lixali/Image_IP_protection`.
AWS Tip: Add your local IP to your AWS VPC inbound rules.
Pegasus Tip: In `$PEGASUS_HOME/install/download_tech`, change the ZooKeeper version to 3.4.12, and follow the notes in `docs/pegasus_setup.odt` to configure Pegasus.
To reproduce my environment, 11 m4.large AWS EC2 instances are needed:
- (4 nodes) Spark Cluster - Batch
- Postgres Node
- Flask Node
To create the clusters, put the appropriate `master.yml` and `workers.yml` files in each `cluster_setup/<clustername>` folder (following the template in `cluster_setup/dummy.yml.template`), list all the necessary software in `cluster_setup/<clustername>/install.sh`, and run the `cluster_setup/create-clusters.sh` script.
The PostgreSQL database sits on the master node of spark-stream-cluster. Follow the instructions in `docs/postgres_install.txt` to download it and set up access.
Configuration settings for Kafka, PostgreSQL, and the AWS S3 bucket, as well as the schemas for the data, are stored in the respective files in the `config/` folder. Replace the settings in `config/s3config.ini` with the names and paths for your S3 bucket.
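For example, the S3 settings can be loaded with Python's standard `configparser` (the section and key names below are guesses at the layout of `s3config.ini`, not its actual contents):

```python
# Reading config/s3config.ini; section and key names here are assumptions.
import configparser

cfg = configparser.ConfigParser()
cfg.read("config/s3config.ini")
bucket = cfg["s3"]["bucket_name"]   # e.g. my-image-bucket
prefix = cfg["s3"]["image_prefix"]  # e.g. imagenet/
```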
go to "Image_IP_protection/" folder and run the following command "spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6 --master spark://:7077 image_IP_protect.py"
go to "Image_IP_protection/deep_ranking/" folder and run the following command "spark-submit --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6 --master spark://:7077 spark_deploy_model.py"
go to "Image_IP_protection" folder, and run sudo screen python3 app.py
to start the Flask server.
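For orientation, a stripped-down version of such an upload endpoint might look like the following (the route, helper logic, and port are assumptions, not the actual contents of app.py):

```python
# Minimal Flask upload-endpoint sketch; route and helper names are assumptions.
import io

import imagehash
from flask import Flask, jsonify, request
from PIL import Image

app = Flask(__name__)

@app.route("/upload", methods=["POST"])
def upload():
    img = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    query_hash = imagehash.phash(img)
    # A real handler would run the VP-tree / PostgreSQL lookups sketched above
    # and return the closest match.
    return jsonify(phash=str(query_hash))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)  # binding port 80 is likely why sudo is needed
```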