Products Crawler

Introduction

A web crawler built with Scrapy that crawls e-commerce and drop-shipping websites for products.

The initial motivation was to collect products and their images for building a "Visual Recommender Engine" for an e-commerce app. I now intend to add more spiders so the crawler covers more websites and provides more data resources.

About the crawler:

The crawler uses Splash to render pages and wait for AJAX requests to load their data.
A MongoDB pipeline stores the crawled products.
A custom image pipeline stores the crawled products' images locally, in directories based on their categories.
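
The three pieces above can be sketched roughly as follows. This is an illustrative outline only, not the repository's actual code: the names splash_render_url, image_path and MongoProductPipeline are hypothetical, the "url" item field and the upsert strategy are assumptions, and the real spiders may talk to Splash differently. The sketch relies only on Splash's render.html HTTP endpoint (with its url and wait arguments) and on a collection object that exposes pymongo's update_one().

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlencode, urlsplit

# Assumption: Splash is reachable on the port published by the Docker command.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0):
    """Build a Splash render.html URL that waits `wait` seconds so AJAX content can load."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{SPLASH_URL}/render.html?{query}"


def image_path(category, image_url, root="images"):
    """Return <root>/<category slug>/<file name> for a product image (hypothetical layout)."""
    slug = re.sub(r"[^a-z0-9]+", "-", category.lower()).strip("-") or "uncategorized"
    name = PurePosixPath(urlsplit(image_url).path).name
    return str(PurePosixPath(root) / slug / name)


class MongoProductPipeline:
    """Pipeline-shaped class; `collection` is any object with pymongo's update_one()."""

    def __init__(self, collection):
        self.collection = collection

    def process_item(self, item, spider):
        # Assumption: items carry a "url" field; upserting on it avoids
        # duplicate documents when the same product is crawled again.
        self.collection.update_one(
            {"url": item["url"]}, {"$set": dict(item)}, upsert=True
        )
        return item
```

For example, image_path("Home & Kitchen", "https://cdn.example.com/img/123.jpg?w=200") gives "images/home-kitchen/123.jpg". Passing the collection in, rather than opening a connection inside the class, keeps the sketch testable without a running database.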

Installation

First clone the repository and change into it:

git clone https://github.com/YazanShannak/Products-Crawler.git
cd Products-Crawler

Set up and activate a virtual environment with venv (optional; the activate command shown is for Linux/macOS):

python3 -m venv venv
source venv/bin/activate

Install the required packages:

pip install -r requirements.txt

Start Splash with Docker:

docker container run -d -p 8050:8050 --name splash scrapinghub/splash:latest

Start MongoDB (optionally with Docker):

docker container run -d -p 27017:27017 --name crawler_db mongo:latest

Change directory to 'products_crawler':

cd products_crawler

Run the quotes_test spider to test the setup:

scrapy crawl quotes_test

Run the Ubuy-JO spider:

scrapy crawl ubuy
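
If a spider fails right away, it can help to confirm that the Splash and MongoDB containers are accepting connections. The script below is a minimal sketch that is not part of the repository; port_open and check_services are hypothetical helper names, and it assumes both services are published on localhost with the ports used in the Docker commands above (8050 and 27017).

```python
import socket

# Assumption: ports match the -p mappings in the Docker commands.
SERVICES = {"splash": 8050, "mongodb": 27017}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host="localhost"):
    """Map each service name to whether its port is reachable."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'not reachable'}")
```

A reachable port only shows that something is listening there; it does not prove the service is fully healthy.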

Todo

Docker image and docker-compose setup for the crawler
Clean up pipelines
Custom image pipeline for S3 (custom naming)
Add more sites
